Jun 24, 2014, 10:43 AM
Post #28 of 28
Re: [Tejas] Compare each and every value from two files
As I said already: you first need to sort both files on their comparison key, say in ascending order (using the Unix sort utility, for example).
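The sorting step can be sketched as follows. This is only a minimal illustration in Python, not the module mentioned in this post: the names `record_key` and `sort_file_by_key` are hypothetical, the key is assumed to be the whole line, and the file is assumed to fit in memory (for really large files an external sort such as the Unix sort utility is the better tool).

```python
def record_key(line):
    # Assumption: the comparison key is the whole line without its newline.
    # Replace this with whatever extracts the real key from a record.
    return line.rstrip("\n")


def sort_file_by_key(src_path, dst_path, key=record_key):
    """Write the lines of src_path to dst_path in ascending key order.

    The whole file is loaded into memory, so this only suits files
    that fit comfortably in RAM.
    """
    with open(src_path) as src:
        lines = src.readlines()
    lines.sort(key=key)
    with open(dst_path, "w") as dst:
        for line in lines:
            # Make sure every record ends with a newline, including the last.
            dst.write(line if line.endswith("\n") else line + "\n")
```

Whichever tool does the sorting, its ordering has to agree with the ordering used in the key comparison afterwards. The Unix sort utility orders according to the current locale, which may differ from a plain string comparison.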
Then you open both sorted files and read the first line of each. If the keys compare equal, you have a common record. Store the line in a file of common records (if you need one), move to the next line in both files, and repeat the key comparison.
If they don't compare equal, then the smaller of the two keys corresponds to an "orphan", i.e. a record that is in the file where you found it and not in the other. Write that out to an orphan file. Get the next line of the file where the orphan was found, keeping the current line from the other file, and repeat the comparison.
And so on until the end of one file, at which point any remaining lines in the other file are also orphans.
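The steps above can be sketched roughly like this. It is an illustration in Python under stated assumptions, not the module mentioned in this post: `record_key` and `compare_sorted` are hypothetical names, the key is assumed to be the whole line, both inputs are assumed to be already sorted in ascending key order, and each key is assumed to occur at most once per file.

```python
def record_key(line):
    # Assumption: the comparison key is the whole line without its newline.
    return line.rstrip("\n")


def compare_sorted(lines_a, lines_b, key=record_key):
    """Walk two key-sorted sequences of lines in parallel.

    Yields ("common", line), ("only_a", line) or ("only_b", line).
    lines_a and lines_b can be lists or open file handles.
    """
    it_a, it_b = iter(lines_a), iter(lines_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        key_a, key_b = key(a), key(b)
        if key_a == key_b:
            # Same key in both files: common record, advance both.
            yield "common", a
            a, b = next(it_a, None), next(it_b, None)
        elif key_a < key_b:
            # Smaller key is in A only: orphan, advance A and keep B's line.
            yield "only_a", a
            a = next(it_a, None)
        else:
            # Smaller key is in B only: orphan, advance B and keep A's line.
            yield "only_b", b
            b = next(it_b, None)
    # One input is exhausted: everything left in the other is an orphan.
    while a is not None:
        yield "only_a", a
        a = next(it_a, None)
    while b is not None:
        yield "only_b", b
        b = next(it_b, None)


if __name__ == "__main__":
    file_a = ["apple\n", "cherry\n", "kiwi\n"]
    file_b = ["banana\n", "cherry\n", "kiwi\n", "plum\n"]
    for tag, line in compare_sorted(file_a, file_b):
        print(tag, line, end="")
```

Because the function is a generator and only ever holds one line from each input, the caller can pass two open file handles and write each yielded line to the matching output file (common records, orphans of the first file, orphans of the second) without loading either file into memory.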
I have written a generic module to do that (and a number of other things on large files), and I use it regularly, but I have not uploaded it to the CPAN so far, because uploading a module requires a few additional steps (preparing an install procedure, etc.) that I don't know (yet) how to do.
But if you try it and don't succeed (and show what you've tried), I will gladly post the core algorithm.
The file comparison itself is extremely fast, but the initial sorting of the files has an overhead, which is why I discouraged you from trying this approach, given that your hash approach is giving good results in view of the data size.