Compare each and every value from two files



Veteran / Moderator

Jun 10, 2014, 1:06 PM

Post #26 of 28 (5562 views)
Re: [Tejas] Compare each and every value from two files

If it is only 5 minutes, then don't change anything. You won't get a significantly better result using the other method I suggested (parallel reading), which is more complicated to implement. Actually, probably not even a better result at all.

If the hash method works properly (and a 5-minute run time strongly suggests that it does for 1.5 GB files, even though I don't know anything about your hardware), then I very strongly believe that no other method will be faster than the hash method; I have quite a bit of experience comparing files of this size. The other method (parallel reading) only makes sense when the hash method fails because the files are too large for the available memory. In your case, a run time of 5 minutes indicates that it is most probably working fine.
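For readers following along, here is a minimal sketch of what a hash-based comparison could look like. It is only an illustration, not the code discussed earlier in the thread: the helper `key_of`, the sub name `compare_with_hash`, and the assumption that the comparison key is the first tab-separated field are all hypothetical, and keys are assumed to be unique within each file.

```perl
use strict;
use warnings;

# Hypothetical helper: the comparison key is ASSUMED to be the first
# tab-separated field of a line. Adapt this to the real file layout.
sub key_of {
    my ($line) = @_;
    chomp $line;
    my ($key) = split /\t/, $line, 2;
    return $key // '';
}

# Load every line of input A into a hash keyed on its comparison key,
# then stream input B against that hash. All five arguments are open
# filehandles (two for reading, three for writing).
sub compare_with_hash {
    my ($in_a, $in_b, $out_common, $out_only_a, $out_only_b) = @_;
    my (%lines_a, %matched);

    while (my $line = <$in_a>) {
        $lines_a{ key_of($line) } = $line;   # this is the memory-hungry part
    }

    while (my $line = <$in_b>) {
        my $key = key_of($line);
        if (exists $lines_a{$key}) {
            print {$out_common} $lines_a{$key};   # key present in both inputs
            $matched{$key} = 1;
        }
        else {
            print {$out_only_b} $line;            # key present only in B
        }
    }

    # Whatever was never matched exists only in A.
    for my $key (sort keys %lines_a) {
        print {$out_only_a} $lines_a{$key} unless $matched{$key};
    }
    return;
}
```

The whole of the first file has to fit in `%lines_a`, which is why this approach stops working once the files outgrow the available memory.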

Using the other method I mentioned, just sorting the two files will most probably take more than 5 minutes (you can easily try that bit).

I think that you have to bite the bullet: 1.5 GB starts to be quite a lot of data, and comparing two files of that size just takes time.

I would revisit this opinion only if you told me that your ultimate goal is to compare 15 GB files, in which case the hash method is likely to fail. But not given what you have explained so far.


Jun 24, 2014, 3:15 AM

Post #27 of 28 (4520 views)
Re: [Laurent_R] Compare each and every value from two files

Just curious to learn: if the file size is 20 GB, then what would be the best approach?


Veteran / Moderator

Jun 24, 2014, 10:43 AM

Post #28 of 28 (4397 views)
Re: [Tejas] Compare each and every value from two files

I said it already. You first need to sort the files on their comparison key, say in ascending order (using the Unix sort utility, for example).
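As an illustration of the sorting step, the sketch below calls the Unix sort utility from Perl. The sub name `sort_on_key` and the assumption that the key is the first tab-separated field are hypothetical. Setting `LC_ALL=C` makes sort use plain byte order, which is the order Perl's `cmp` uses by default, so a later key comparison in Perl agrees with the sort order.

```perl
use strict;
use warnings;

# Sort file $in into file $out on its first tab-separated field, using the
# system sort utility. The key position is an ASSUMPTION for this example.
sub sort_on_key {
    my ($in, $out) = @_;
    local $ENV{LC_ALL} = 'C';   # byte order, consistent with Perl's cmp
    system('sort', '-t', "\t", '-k', '1,1', '-o', $out, $in) == 0
        or die "sort failed for $in (status $?)";
    return $out;
}
```

Each of the two files would be sorted this way into a new file before the comparison starts.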

Then you open both new sorted files and read the first line of each. If the keys compare equal, then you have a common record. Store the line in a file of common records (if you need one), move to the next line in both files, and repeat the key comparison.

If they don't compare equal, then the smaller of the two corresponds to an "orphan", i.e. a record that is in the file where you found it and not in the other. Write that out to an orphan file. Get the next line of the file where the orphan was found, keeping the line from the other file. And repeat the comparison.

And so on until the end of one file, at which point any remaining lines in the other file are also orphans.
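The steps above can be sketched roughly as follows. This is not the module mentioned in this thread, only a minimal illustration under stated assumptions: the helper `key_of` and the sub name `compare_sorted` are hypothetical, the key is assumed to be the first tab-separated field, both inputs are assumed to be sorted on that key in ascending byte order, and keys are assumed to be unique within each file.

```perl
use strict;
use warnings;

# Hypothetical helper: the comparison key is ASSUMED to be the first
# tab-separated field of a line. Adapt this to the real file layout.
sub key_of {
    my ($line) = @_;
    chomp $line;
    my ($key) = split /\t/, $line, 2;
    return $key // '';
}

# Walk two inputs that are already sorted on their key, holding only one
# line of each in memory. All five arguments are open filehandles.
sub compare_sorted {
    my ($in_a, $in_b, $out_common, $out_only_a, $out_only_b) = @_;
    my $line_a = <$in_a>;
    my $line_b = <$in_b>;

    while (defined $line_a and defined $line_b) {
        my $order = key_of($line_a) cmp key_of($line_b);
        if ($order == 0) {       # same key: common record, advance both
            print {$out_common} $line_a;
            $line_a = <$in_a>;
            $line_b = <$in_b>;
        }
        elsif ($order < 0) {     # A's key is smaller: orphan in A, advance A
            print {$out_only_a} $line_a;
            $line_a = <$in_a>;
        }
        else {                   # B's key is smaller: orphan in B, advance B
            print {$out_only_b} $line_b;
            $line_b = <$in_b>;
        }
    }

    # One input is exhausted: everything left in the other is an orphan.
    while (defined $line_a) { print {$out_only_a} $line_a; $line_a = <$in_a>; }
    while (defined $line_b) { print {$out_only_b} $line_b; $line_b = <$in_b>; }
    return;
}
```

Because only two lines are held at any time, memory use does not grow with file size. If the keys are numeric, `<=>` would have to replace `cmp`, and the files would need to be sorted numerically to match.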

I have written a generic module to do that (and a number of other things on large files), and I am using it regularly, but have not uploaded it to the CPAN so far, because uploading a module requires a few additional steps (preparing an install procedure, etc.) that I don't know (yet) how to do.

But if you are trying to do it and don't succeed (and show how you've tried), I would gladly post the core algorithm.

The file comparison is extremely fast, but the initial sorting of the files has an overhead, which is why I discouraged you from trying this approach, given that your hash approach is giving good results in view of the data size.
