

Sep 3, 2014, 10:55 AM

Re: [Laurent_R] Merging the data in two files using a hash

then you could read the files in parallel and remove the duplicates as you go

I can get the files sorted by the keys in Perl itself.

But do you want me to open both files at the same time and check for the keys?

Do you have a snippet for that?

The worst-case scenario would be that the two files have no matching keys at all, so ultimately all the data in both files would have to be stored in a separate file.

I said it already. You first need to sort the files on their comparison key, say in ascending order (using the Unix sort utility, for example).
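Since the original poster mentioned sorting in Perl itself, here is a minimal sketch of that step. It assumes the key is the first whitespace-delimited field of each line and that the file fits in memory; the name sort_lines_by_key is made up for illustration. For files too large for memory, an external sort is the better choice (with the Unix sort utility, something like LC_ALL=C sort -k1,1 should give the same plain byte order, but check it against your data).

```perl
use strict;
use warnings;

# Hypothetical helper: return the given lines sorted by their key.
# Assumption: the key is the first whitespace-delimited field.
# Keys are compared as strings (cmp), so alphanumeric keys are fine.
sub sort_lines_by_key {
    my @lines = @_;
    # Extract each key once, sort on it, then unwrap the original line.
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  {
               my ($key) = split ' ', $_;
               [ defined $key ? $key : '', $_ ];
           } @lines;
}
```

Whatever sorts the files, the later comparison has to use the same ordering, otherwise records will be reported as orphans by mistake.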

Then you open both new sorted files, read the first line of each. If the keys compare equal, then you have a common record. Store the line in a file of common records (if you need one) and move to the next line for both files. And repeat the key comparison.

If they don't compare equal, then the smaller of the two corresponds to an "orphan", i.e. a record that is in the file where you found it and not in the other. Write that out to an orphan file. Get the next line of the file where the orphan was found, keeping the line from the other file. And repeat the comparison.

And so on until the end of one file, at which point any remaining lines in the other file are also orphans.
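The steps described above could be sketched roughly as follows. This is not the module mentioned in this post, just an illustration under some assumptions: both inputs are already sorted in plain string order, the key is the first whitespace-delimited field, and each key occurs at most once per file. The names key_of and compare_sorted are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the comparison key from a line.
# Assumption: the key is the first whitespace-delimited field.
sub key_of {
    my ($line) = @_;
    my ($key) = split ' ', $line;
    return defined $key ? $key : '';
}

# Walk two key-sorted filehandles in parallel and call one of three
# callbacks per record: common, orphan in A, orphan in B.
sub compare_sorted {
    my ($fh_a, $fh_b, $on_common, $on_orphan_a, $on_orphan_b) = @_;
    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;

    while (defined $line_a and defined $line_b) {
        my $order = key_of($line_a) cmp key_of($line_b);
        if ($order == 0) {          # same key: common record, advance both
            $on_common->($line_a, $line_b);
            $line_a = <$fh_a>;
            $line_b = <$fh_b>;
        }
        elsif ($order < 0) {        # A's key is smaller: orphan in A
            $on_orphan_a->($line_a);
            $line_a = <$fh_a>;
        }
        else {                      # B's key is smaller: orphan in B
            $on_orphan_b->($line_b);
            $line_b = <$fh_b>;
        }
    }

    # One file is exhausted: whatever remains in the other is orphaned.
    while (defined $line_a) { $on_orphan_a->($line_a); $line_a = <$fh_a>; }
    while (defined $line_b) { $on_orphan_b->($line_b); $line_b = <$fh_b>; }
    return;
}
```

In practice the callbacks would print to the common and orphan output files. Only one line per input is held in memory at any time, so the file size does not matter once the sorting is done.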

I have written a generic module to do that (and a number of other things on large files), and I am using it regularly, but I have not uploaded it to the CPAN so far, because uploading a module requires a few additional steps (preparing an install procedure, etc.) that I don't know (yet) how to do.

But if you are trying to do it and don't succeed (and show how you've tried), I would gladly post the core algorithm.

The file comparison is extremely fast, but the initial sorting of the files has an overhead, which is why I discouraged you from trying this approach, given that your hash approach is giving good results in view of the data size.

Here's what you suggested when I had a similar problem last time.
And my keys are not just numbers; there are alphanumeric keys too.
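On the alphanumeric keys: a small sketch, with made-up sample keys, of what plain string comparison does to them. Perl's cmp operator compares strings character by character, so it copes with letters and digits alike, whereas the numeric <=> operator would warn on a key like A10 and treat it as 0. The catch is that string order is not numeric order.

```perl
use strict;
use warnings;

# Sample keys are invented for illustration.
my @keys = qw(9 10 A10 a1 B2);

# String comparison: digits sort before uppercase, uppercase before lowercase.
my @string_order = sort { $a cmp $b } @keys;

print "@string_order\n";    # prints "10 9 A10 B2 a1"
```

Note that 10 comes before 9 here. That is fine for the file comparison as long as the sort step and the comparison step both use this same string ordering.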


(This post was edited by Tejas on Sep 3, 2014, 11:08 AM)

Edit Log:
Post edited by Tejas (User) on Sep 3, 2014, 11:04 AM
Post edited by Tejas (User) on Sep 3, 2014, 11:08 AM
