Home: Perl Programming Help: Beginner:
Merging the data in two files using a hash

 



Tejas
User

Sep 2, 2014, 5:42 AM

Post #1 of 43 (41607 views)
Merging the data in two files using a hash

 

Can someone please comment on this code and tell me whether it is good or ugly?
And can the code be shortened?


Quote
File1
File1
28045071,1,56,DAD,418756991,0,-9.02,01-AUG-14,01-AUG-14,1
28045281,1,19,DAD,12701012015,0,-261.02,01-AUG-14,01-AUG-14,1
28045991,1,19,DAD,379031901,0,-22.42,01-AUG-14,01-AUG-14,1
2213506106,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2213506116,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2264530076,1,24,DAD,1377063511,0,-350,01-AUG-14,01-AUG-14,1
2613542516,1,24,DAD,501029031,0,-30,01-AUG-14,01-AUG-14,1
2634699316,1,24,DAD,512242996,0,-100,01-AUG-14,01-AUG-14,1
2639141256,1,24,DAD,13496038905,0,-25,01-AUG-14,01-AUG-14,1
2641900466,1,24,DAD,56276190,0,-50,01-AUG-14,01-AUG-14,1
28053391,1,19,DAD,766709012,0,-70,01-AUG-14,01-AUG-14,1



Quote
File2
28051341,1,56,DAD,199610116,0,-12.74,02-AUG-14,02-AUG-14,1
28051961,1,19,DAD,6735124615,0,-36.45,02-AUG-14,02-AUG-14,1
28052061,1,19,DAD,394104487,0,-48.61,02-AUG-14,02-AUG-14,1
28053391,1,19,DAD,766709012,0,-60,02-AUG-14,02-AUG-14,1
2399932016,1,24,DAD,567508320,0,-50,02-AUG-14,02-AUG-14,1
2451060666,1,24,DAD,499140250,0,-50,02-AUG-14,02-AUG-14,1
2495205736,1,24,DAD,774256411,0,-20,02-AUG-14,02-AUG-14,1
2604153876,1,24,DAD,7378719,0,-50,02-AUG-14,02-AUG-14,1
2638779256,1,24,DAD,240129917,0,-50,02-AUG-14,02-AUG-14,1
2646215356,1,24,DAD,1036846291,0,-40,02-AUG-14,02-AUG-14,1


Quote
OUTPUT
OUTPUT
28045071,1,56,DAD,418756991,0,-9.02,01-AUG-14,01-AUG-14,1
28045281,1,19,DAD,12701012015,0,-261.02,01-AUG-14,01-AUG-14,1
28045991,1,19,DAD,379031901,0,-22.42,01-AUG-14,01-AUG-14,1
2213506106,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2213506116,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2264530076,1,24,DAD,1377063511,0,-350,01-AUG-14,01-AUG-14,1
2613542516,1,24,DAD,501029031,0,-30,01-AUG-14,01-AUG-14,1
2634699316,1,24,DAD,512242996,0,-100,01-AUG-14,01-AUG-14,1
2639141256,1,24,DAD,13496038905,0,-25,01-AUG-14,01-AUG-14,1
2641900466,1,24,DAD,56276190,0,-50,01-AUG-14,01-AUG-14,1
28051341,1,56,DAD,199610116,0,-12.74,02-AUG-14,02-AUG-14,1
28051961,1,19,DAD,6735124615,0,-36.45,02-AUG-14,02-AUG-14,1
28052061,1,19,DAD,394104487,0,-48.61,02-AUG-14,02-AUG-14,1
28053391,1,19,DAD,766709012,0,-60,02-AUG-14,02-AUG-14,1 <- this txn is repeated; the latest (from the second file) has to be considered
2399932016,1,24,DAD,567508320,0,-50,02-AUG-14,02-AUG-14,1
2451060666,1,24,DAD,499140250,0,-50,02-AUG-14,02-AUG-14,1
2495205736,1,24,DAD,774256411,0,-20,02-AUG-14,02-AUG-14,1
2604153876,1,24,DAD,7378719,0,-50,02-AUG-14,02-AUG-14,1
2638779256,1,24,DAD,240129917,0,-50,02-AUG-14,02-AUG-14,1
2646215356,1,24,DAD,1036846291,0,-40,02-AUG-14,02-AUG-14,1


We can see that
28053391,1,19,DAD,766709012,0,-70,01-AUG-14,01-AUG-14,1
appears in both files, but the latest version should be considered and printed.
So in the output, the second file's data is printed.


The output has all of the first file's txns and all of the second file's txns, and if a txn repeats in the second file (the key is the first column),
the second file's data has to be considered.
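The last-wins rule just described can be sketched with a single hash keyed on the first column (a minimal, hedged illustration, not the poster's script; names are mine, and note that a hash does not preserve the input order):

```perl
use strict;
use warnings;

# Keep every txn from both files; when the key (first CSV column) repeats,
# the second file's line wins, because it is processed later and overwrites
# the earlier hash entry.
sub merge_last_wins {
    my ($first, $second) = @_;            # array refs of CSV lines
    my %by_key;
    for my $line (@$first, @$second) {
        my $key = (split /,/, $line)[0];
        $by_key{$key} = $line;            # later occurrence overwrites earlier
    }
    return \%by_key;
}
```

Printing `values %$merged` then gives one line per key, with duplicates resolved in favour of the second file.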

Code
#!/usr/bin/perl

my $pwd = `pwd`;
chomp($pwd);
my $clr_txns     = "$pwd/File1.txt";
my $temp_file    = "$pwd/File2.txt";
my $final_output = "$pwd/Final_List.txt";

open (FIRST,  "< $clr_txns")     or die "could not open $clr_txns $!";
open (SECOND, "< $temp_file")    or die "could not open $cto_txns $!";
open (MATCH,  "> $final_output") or die "could not open $final_output $!";

my %hash  = ();
my %hash1 = ();

while (my $line = <FIRST>) {
    my @elements = split ',', $line;
    my $key = $elements[0];
    print "$key\n";
    $hash{$key}  = 1;
    $hash2{$key} = $line;
}

#open SECOND, "< $secondFile" or die "could not open second file...\n";
while (my $line = <SECOND>) {
    my @elements = split ',', $line;
    my $key = $elements[0]; # Perl arrays are zero-indexed
    if ($hash{$key}) {
        #print "($hash{$key} \n";
        print MATCH "$line";
        $hash{$key} = 0;
    }
    else {
        print MATCH "$line"; # Also print unmatched, as we need all the txns from both the files
    }
}

while ( my ($key, $value) = each %hash2 ) {
    if ($hash{$key} != 0) {
        print MATCH "$value"; # Print the values of the other file, and eliminate the matched ones
    }
}

close (FIRST);
close (SECOND);


Thanks
Tejas


FishMonger
Veteran / Moderator

Sep 2, 2014, 6:55 AM

Post #2 of 43 (41601 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

Using your choice of words, I must say that it's ugly.

You should ALWAYS include the strict and warnings pragmas.

Use a lexical var for the filehandle.

Use the 3 arg form of open.

Use descriptive names for the vars. %hash and %hash1 are very poor var name choices.

You only need 1 hash, not 2.

Use proper vertical and horizontal whitespace (line spacing and indentation) and be consistent.

Don't create vars which are not needed/used such as your @elements array.

With 1 or 2 exceptions, the first arg for the split function should be a regex pattern, not a string.


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Cwd;

my $cwd = getcwd();
my $clr_txns = "$cwd/File1.txt";
my $temp_file = "$cwd/File2.txt";
my $final_output = "$cwd/Final_List.txt";

open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $out_fh, '>', $final_output or die "could not open $final_output <$!>";

# Since I don't know what your data represents, I can't come up with a better name
# so I kept your hash name
my %hash;

while (my $line = <$in_fh1>) {
    my $key = (split /,/, $line)[0];
    $hash{$key} = $line;
}
close $in_fh1;

while (my $line = <$in_fh2>) {
    my $key = (split /,/, $line)[0];
    $hash{$key} = $line;
}
close $in_fh2;

foreach my $value (values %hash) {
    print $out_fh $value;
}
close $out_fh;



(This post was edited by FishMonger on Sep 2, 2014, 7:17 AM)


Laurent_R
Veteran / Moderator

Sep 2, 2014, 10:02 AM

Post #3 of 43 (41589 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

Hi Tejas,

I have answered your other post, and I see now that you posted the same question twice and FishMonger has already given you a more detailed answer. Posting twice leads to duplication of work; please don't do it.


Tejas
User

Sep 2, 2014, 11:28 AM

Post #4 of 43 (41586 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Sorry for that.
I mistakenly did that in the other post.
I will delete it.


Thanks
Tejas


Tejas
User

Sep 2, 2014, 11:46 AM

Post #5 of 43 (41580 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]

Thanks.
I am actually in the test phase and did not really look into that.
Now I will change the script accordingly.

Will this script work if I have two files with 50 lakh (5 million) lines each?
And assume there are no duplicates, so the output will have 1 crore (10 million) lines.

I can't think of anything except a hash here, but I am afraid this won't work.
Even sorting both files and running this code will not help, as in the worst-case scenario there wouldn't be any duplicates.

Any new ways of implementing this?
Thanks
Tejas


FishMonger
Veteran / Moderator

Sep 2, 2014, 11:51 AM

Post #6 of 43 (41577 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

What are the file sizes in KB or MB, not the number of lines?

How much RAM do you have?


Tejas
User

Sep 2, 2014, 11:55 AM

Post #7 of 43 (41573 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]

I have a file of 1 GB.
Some files are around 200 to 300 MB.

1 GB of RAM.

It's not just that; last time, when I used a hash for a file with 50 lakh (5 million) keys,
my system hung.

Thanks
Tejas


(This post was edited by Tejas on Sep 2, 2014, 11:56 AM)


FishMonger
Veteran / Moderator

Sep 2, 2014, 12:33 PM

Post #8 of 43 (41563 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

1GB of RAM is considered very low these days, especially if you're using Windows.

The files in the 200-300MB range shouldn't be much of a problem, but the 1GB files will be a problem due to your limited RAM.

My first recommendation is to add more RAM. IMO, 4GB should be the minimum when doing this kind of work on Windows.

If you can't upgrade the RAM, then you could filter the data through a database rather than storing everything in memory via a hash. You could parse each line as you're currently doing, but instead of assigning a hash value, you store that data in the DB. You could even access your csv files with sql statements as if they were database tables. Once the 2 input files have been processed, you execute another query that dumps the data directly to a new csv file.

Going the DB route will involve slightly more complex coding, but it will also reduce the memory footprint and won't hang the system like your previous experience.


Tejas
User

Sep 2, 2014, 10:24 PM

Post #9 of 43 (41555 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]


Quote
access your csv files with sql statements as if they were database tables


Does that mean I don't need to have SQL at all, and the task can be performed as if they were tables?
Do you have a snippet for this?

Thanks
Tejas


FishMonger
Veteran / Moderator

Sep 3, 2014, 8:20 AM

Post #10 of 43 (41542 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]


Code
#!/usr/bin/perl 

use strict;
use warnings;
use DBI;
use DBD::CSV;
use Data::Dumper;

# connect to "csv database" using default parameters
my $dbh = DBI->connect("DBI:CSV:") or die $DBI::errstr;

# prepare and execute select statement to fetch 1 row
my $sth = $dbh->prepare("select * from file1.csv limit 1") or die $dbh->errstr();
$sth->execute;

# fetch the row
my @row = $sth->fetchrow_array;

# dump out the row (array)
print Dumper \@row;

# disconnect from the "csv database"
$dbh->disconnect;


Output using your first file.

c:\test>csv2sql_example.pl

Code
$VAR1 = [ 
'28045281',
'1',
'19',
'DAD',
'12701012015',
'0',
'-261.02',
'01-AUG-14',
'01-AUG-14',
'1'
];


http://search.cpan.org/~timb/DBI-1.631/DBI.pm
http://search.cpan.org/~jzucker/DBD-CSV-0.22/lib/DBD/CSV.pm


(This post was edited by FishMonger on Sep 3, 2014, 8:22 AM)


FishMonger
Veteran / Moderator

Sep 3, 2014, 8:50 AM

Post #11 of 43 (41537 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

I should mention that the most efficient database approach would be to use the "LOAD DATA INFILE" SQL statement to load the csv files into the database.

1) create the database and table structure (that's 2 separate sql statements).

2) insert file1.txt via a "LOAD DATA INFILE" statement.

3) insert file2.txt via a slightly adjusted "LOAD DATA INFILE" statement, i.e., add the REPLACE keyword so that when a duplicate ID (primary key) is seen, it will update/replace that row from file1 with the row from file2.

4) once both files are loaded, execute a "SELECT ... INTO OUTFILE" statement to dump the data to a new csv file.

5) delete the database if not needed.

http://dev.mysql.com/doc/refman/5.1/en/load-data.html
http://dev.mysql.com/doc/refman/5.1/en/select-into.html
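The five steps above might look roughly like this in SQL (a hedged sketch, assuming MySQL; the database name, table name, column names, and paths are all illustrative, and the column types would need to match the real data):

```sql
-- 1) create the database and table (the key is the first CSV column)
CREATE DATABASE txn_merge;
CREATE TABLE txn_merge.txns (
    txn_id BIGINT PRIMARY KEY,
    f2 INT, f3 INT, f4 VARCHAR(10), f5 BIGINT,
    f6 INT, amount DECIMAL(12,2), d1 CHAR(9), d2 CHAR(9), f10 INT
);

-- 2) load the first file
LOAD DATA INFILE '/path/to/File1.txt'
    INTO TABLE txn_merge.txns
    FIELDS TERMINATED BY ',';

-- 3) load the second file; REPLACE makes its rows win on duplicate keys
LOAD DATA INFILE '/path/to/File2.txt'
    REPLACE INTO TABLE txn_merge.txns
    FIELDS TERMINATED BY ',';

-- 4) dump the merged result to a new csv file
SELECT * FROM txn_merge.txns
    INTO OUTFILE '/path/to/Final_List.txt'
    FIELDS TERMINATED BY ',';

-- 5) clean up if the database is no longer needed
DROP DATABASE txn_merge;
```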


Laurent_R
Veteran / Moderator

Sep 3, 2014, 9:56 AM

Post #12 of 43 (41530 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

Hi Tejas, is your data sorted? It seems to be, but I don't understand fully how.

Update: the reason I am asking is that if the data is sorted one way or another in accordance with the comparison key, then you could read the files in parallel and remove the duplicates as you go. The good thing about this approach is that it will work for files of just about any size, irrespective of RAM size, and it will be much faster than a database approach. The downside is that it requires a bit of cleverness, or rather care and attention, to get the algorithm really right.


(This post was edited by Laurent_R on Sep 3, 2014, 10:21 AM)


Tejas
User

Sep 3, 2014, 10:55 AM

Post #13 of 43 (41518 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]


Quote
then you could read the files in parallel and remove the duplicates as you go


I can get the files sorted by the keys in Perl itself.

But do you want me to open both files at a time and check for the keys?

Do you have a snippet for that?

The worst-case scenario would be that both files have no matching keys, and ultimately all the data in both files has to be stored in a separate file.


Quote
I said it already. You first need to sort the files on their comparison key, say in ascending order (using the Unix sort utility, for example).

Then you open both new sorted files, read the first line of each. If the keys compare equal, then you have a common record. Store the line in a file of common records (if you need one) and move to the next line for both files. And repeat the key comparison.

If they don't compare equal, then the smallest of the two corresponds to an "orphan", i.e. a record that is in the file where you found it and not in the other. Write that out to an orphan file. Get the next line of the file where the orphan was found, keeping the line from the other file. And repeat the comparison.

And so on until the end of one file, at which point any remaining lines in the other file are also orphans.

I have written a generic module to do that (and a number of other things on large files), and I am using it regularly, but I have not uploaded it to CPAN so far, because uploading a module requires a few additional steps (preparing an install procedure, etc.) that I don't know (yet) how to do.

But if you are trying to do it and don't succeed (and show how you've tried), I would gladly post the core algorithm.

The file comparison is extremely fast, but the initial sorting of the files has an overhead, which is why I discouraged you from trying this approach, given that your hash approach was giving good results in view of the data size.

Here's what you suggested when I had a similar problem last time.
And my keys are not just numbers; there are alphanumeric keys too.

Thanks
Tejas


(This post was edited by Tejas on Sep 3, 2014, 11:08 AM)


Tejas
User

Sep 4, 2014, 5:40 AM

Post #14 of 43 (41471 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]

Hi
Here is the script with some minor changes.


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Cwd;

my $cwd = getcwd();
my $clr_txns = "$cwd/File1.txt";
my $temp_file = "$cwd/File2.txt";
my $final_output = "$cwd/Final_List.txt";

open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $out_fh, '>', $final_output or die "could not open $final_output <$!>";

my %Unbal_Hash;

while (my $line = <$in_fh1>) {
    my $key = (split /,/, $line)[0];
    $Unbal_Hash{$key} = $line;
}
close $in_fh1;

while (my $line = <$in_fh2>) {
    my ($key, $balance) = (split /\t/, $line)[0,7];
    if (exists $Unbal_Hash{$key} && $balance == 0) {
        print "$Unbal_Hash{$key}\n";
        delete $Unbal_Hash{$key};
    }
    else {
        $Unbal_Hash{$key} = $line;
    }
}
close $in_fh2;

foreach my $value (values %Unbal_Hash) {
    print $out_fh $value;
}
close $out_fh;


I am just eliminating the entries whose key matches and whose total amount is 0 (they are not needed).


Quote
File1
889546565,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889996975,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889998385,6,46,APY,0,-14067,05-DEC-13,05-DEC-13,1
890722795,6,46,APY,0,-9430,14-DEC-13,14-DEC-13,1
890857005,6,24,APY,0,-500,10-NOV-13,10-NOV-13,1
890925475,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936315,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936335,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936355,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936415,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3



Quote
File 2
1993532910 6 212-366-520 APY 451028365 900 -900 0 14-AUG-14 15-AUG-14 3
1993536110 6 366-520 APY 477045894 390 -390 0 14-AUG-14 15-AUG-14 2
1993536750 6 366-520 APY 917563294 300 -300 0 14-AUG-14 15-AUG-14 2
1993539310 6 366-520 APY 7512802845 432 -432 0 14-AUG-14 15-AUG-14 2
1993539950 6 366-520 APY 449362894 432 -432 0 15-AUG-14 15-AUG-14 2
1993541230 6 366-520 APY 6770624155 1234 -1234 0 15-AUG-14 15-AUG-14 2
1993542510 6 366-520 APY 628602625 100 -100 0 15-AUG-14 15-AUG-14 2
1993543790 6 366-520 APY 843380824 400 -400 0 15-AUG-14 15-AUG-14 2
1993544430 6 366-520 APY 531660774 99 -99 0 15-AUG-14 15-AUG-14 2
1993545070 6 212-366-520 JPY 444744025 432 -432 0 15-AUG-14 15-AUG-14 3
696795792443 6 14-308 APY 521806975 550 -550 0 24-JUL-14 15-AUG-14 4


I hope the way I am handling the txns with amount 0 is OK.

Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 4, 2014, 9:54 AM

Post #15 of 43 (41462 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]


In Reply To
I can get the files sorted by the keys in perl itself.

Probably not if your files are too large to fit in RAM.

If you are under Linux or Unix, you can use the OS's sort utility, which can sort files much larger than RAM by using temporary files on disk, but I do not know whether there is such a utility on Windows.
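For illustration, such an external pre-sort might be driven from Perl like this (a hedged sketch; it assumes a Unix sort on the PATH and comma-separated files keyed on the first field, and the file names are illustrative):

```perl
use strict;
use warnings;

# Sort a file alphanumerically on its first comma-separated field using the
# external Unix sort utility, which spills to temporary files on disk and
# can therefore handle files larger than RAM.
sub sort_on_first_field {
    my ($in, $out) = @_;
    system('sort', '-t', ',', '-k1,1', $in, '-o', $out) == 0
        or die "sort of $in failed: $?";
}
```

Calling `sort_on_first_field('File1.txt', 'File1.sorted')` for each input file would then prepare them for the parallel read.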


In Reply To

But do you want me to open both files at a time and check for the keys?


yes, that was the idea. It is detailed in the older post you quoted from me just above.


In Reply To

Do you have a snippet for that?


Yes, I could provide one, but please explain exactly what you are trying to do, as I am not entirely sure of the details. It seems to me that you are trying to remove from one file data items that also exist in the other file. Is this correct? Is there more to it?

But this snippet would only work for sorted data, so that it depends on whether you are really able to sort the files on their keys.


FishMonger
Veteran / Moderator

Sep 4, 2014, 10:24 AM

Post #16 of 43 (41460 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

A number of the GNU utilities have been ported to Windows, sort being one of them.
http://gnuwin32.sourceforge.net/packages/coreutils.htm
http://unxutils.sourceforge.net/


(This post was edited by FishMonger on Sep 4, 2014, 10:29 AM)


Tejas
User

Sep 4, 2014, 10:38 AM

Post #17 of 43 (41456 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]


Quote
Yes, I could provide one, but please explain exactly what you are trying to do, as I am not entirely sure of the details. It seems to me that you are trying to remove from one file data items that also exist in the other file. Is this correct? Is there more to it?


Yes, this is the task. But there would be more operations and changes to the mathematical operations.
The comparisons would definitely be there.
And I have also specified in my earlier post that the keys will not just be numbers; there will be alphanumeric keys too (e.g. aHXPVTTRER).
If your code can help me, I will definitely use it, as it works on sorted files and I assume the comparisons would be far fewer.

Finally, I didn't really get why you suggested the Windows sort utility; I never use Windows at all.
My work is totally on Linux, and I can use the command-line sort utility.
I will be glad to use your code snippet.

Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 4, 2014, 11:00 AM

Post #18 of 43 (41453 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

I just was not sure whether you were using Linux or Windows; that's why I asked. Then you can use the Linux sort utility.

I'll come back later today with the basic code to do it, not enough time right now.


Tejas
User

Sep 4, 2014, 11:08 AM

Post #19 of 43 (41451 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Thanks

Code
if (exists $Unbal_Hash{$key} && $balance == 0) {
    print "$Unbal_Hash{$key}\n";
    delete $Unbal_Hash{$key};
}
else {
    $Unbal_Hash{$key} = $line;
}

Also, please comment on this code too: is this the right approach?


(This post was edited by Tejas on Sep 4, 2014, 11:29 AM)


Laurent_R
Veteran / Moderator

Sep 4, 2014, 11:54 AM

Post #20 of 43 (41445 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

I must say that this part of your code surprised me a bit when I saw it (essentially: why do you delete from the hash?), but since I haven't understood in detail what you are trying to do, I do not know whether this is correct. That's the problem with this thread and the previous one on the same subject: you haven't defined precisely for us what you want to do, and I am not even sure you really know for sure yourself. When you want to write a program, you first need to clarify exactly what you want it to do (often by writing some specs or some business rules, or at the very least by having them very clear in your own mind). Unless I missed an important post, your description of what you want is far from precise enough.

Well, enough talking, I'll try to write up some code based on my best comprehension of what you need, you'll probably have to adapt it to fit your real needs. But at least you'll have a basic algorithm, hopefully well coded, to use, and hopefully you'll have only implementation details to change.


Laurent_R
Veteran / Moderator

Sep 4, 2014, 1:59 PM

Post #21 of 43 (41436 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Alright, a first very simple solution, which might work or might be too simple for your needs.

This assumes you just want to remove duplicates, or, in other words, retain only one line for every unique key. In this case, you can just use the sort utility to merge together and sort the data from both files and produce one file with unique values. Note that, from what you said previously, your sort should be alphanumerical, not numerical.


I first define a comparison function:


Code
sub compare {
    my ($curr, $prev) = @_;
    my $curr_key = (split /,/, $curr)[0];
    my $prev_key = (split /,/, $prev)[0];
    return 1 if $curr_key eq $prev_key;
    return 0;
}

This function receives two lines from the calling function, splits the lines to get the keys, and compares the keys. It returns 1 if the keys are equal (duplicates) and 0 otherwise. In one of my real programs, this function would be much shorter (probably 2 or 3 lines) and would most probably be stored in a coderef rather than a regular function, but I tried to make it as simple as possible to help you understand the principle.

My own version might be something like this:

Code
sub compare {
    my ($curr_key, $prev_key) = (split /,/, $_)[0] for @_;
    $curr_key eq $prev_key ? 1 : 0;
}

But don't worry about that, use the first version for the time being.

Please also note that it really makes sense to separate the functional rules (how to compare records, stored in this function) from the technical duplicate removing part (below). It means you can reuse the technical part and just change the functional part for another similar problem.
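That separation can be sketched by passing the comparison rule into a generic routine as a code reference (a hedged illustration; the names are mine, not from the module mentioned above):

```perl
use strict;
use warnings;

# Generic mechanics: walk sorted lines, classifying each as unique or
# duplicate according to a caller-supplied comparison coderef.
sub dedup_sorted {
    my ($lines, $is_dup) = @_;        # array ref of sorted lines, coderef
    my (@unique, @dups);
    my $prev;
    for my $line (@$lines) {
        if (defined $prev && $is_dup->($line, $prev)) {
            push @dups, $line;
        } else {
            push @unique, $line;
        }
        $prev = $line;
    }
    return (\@unique, \@dups);
}

# Functional rule: two records are duplicates if their first CSV field matches.
my $same_key = sub {
    my ($curr, $prev) = @_;
    return (split /,/, $curr)[0] eq (split /,/, $prev)[0] ? 1 : 0;
};
```

Reusing the mechanics for another file format then only means swapping in a different coderef.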

Now the duplicate removal. This assumes you have already opened three filehandles, $FH_IN for the input, $FH_DUPL for printing out the duplicates, and $FH_OUT for output of the unique lines.


Code
my $previous_line = "";
while (my $line = <$FH_IN>) {
    chomp $line;
    if (compare($line, $previous_line)) {
        # this line is a duplicate
        print $FH_DUPL $line, "\n";
    } else {
        print $FH_OUT $line, "\n";
    }
    $previous_line = $line;
}


As you can see, this is fairly short and simple code.

I haven't tested the above because I don't really have data to do it, but I believe this should work, because it is a simplified version of something that I have tested extensively. I might have goofed something when simplifying it, but if such is the case, it should be easy enough to fix it.

I'll post a bit later a more complex solution where the two files are read in parallel. But the one above might just be sufficient for your needs.


Laurent_R
Veteran / Moderator

Sep 4, 2014, 2:36 PM

Post #22 of 43 (41434 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

If you need a more detailed output than what I suggested above, you might try the following.

This assumes that 6 filehandles are open before we start:
- $IN1 and $IN2 for the two input files
- $ORPH1 and $ORPH2 for orphans (records in one file but not in the other)
- $OUT1 and $OUT2 for common lines (two files because the keys of the input files might be the same and the content not necessarily be exactly identical)
Of course, you can simplify all this if some files are not needed.

The comparison function needs to be slightly different than before, because it needs to return three possible values:



Code
sub compare2 {
    my ($curr_key, $prev_key) = (split /,/, $_)[0] for @_;
    return $curr_key cmp $prev_key;
}


Now the actual comparison code:

Code
my $ligne1 = <$IN1>;
my $ligne2 = <$IN2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);
while ( 1 ) {
    my $comparison = compare2($ligne1, $ligne2);
    if ($comparison > 0) {
        print $ORPH2 $ligne2, "\n";
        $ligne2 = <$IN2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } else {
        if ($comparison < 0) {
            print $ORPH1 $ligne1, "\n";
            $ligne1 = <$IN1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {
            print $OUT1 $ligne1, "\n";
            print $OUT2 $ligne2, "\n";
            $ligne1 = <$IN1>;
            $ligne2 = <$IN2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $ORPH2 $ligne2, "\n" if defined $ligne2;
print $ORPH2 $ligne2 while $ligne2 = <$IN2>;
print $ORPH1 $ligne1, "\n" if defined $ligne1;
print $ORPH1 $ligne1 while $ligne1 = <$IN1>;

Same comment as in my previous post: I haven't tested this on your data, because I don't have enough of it, so I might have goofed a detail here or there, but the module from which I took the code has been thoroughly tested in real-life applications and is believed to be bug free.


(This post was edited by Laurent_R on Sep 4, 2014, 2:42 PM)


Tejas
User

Sep 4, 2014, 6:44 PM

Post #23 of 43 (41425 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Hi,
I am sorry to say that our needs change day by day, and I am adding new stuff day by day.

The only reason I am deleting from the hash is that I do not need the entries whose total sum is 0.
Otherwise I will end up printing txns with amount 0 as well as non-zero ones,
and I am interested in the non-zero txns only.

Basically I have two files:
1. A file with today's transaction report, which has updates of historical txns and current txns.
2. A historical non-zero txns report.
Generally, we check the latest sum of amounts.

First I check whether the historical txns are available in today's report.
If no --> they are still non-zero (this should be printed).
If yes --> then there are 2 cases:
1. They can be zero.
2. They can be non-zero but with some modification (as they are available in today's report, there will definitely be a change in the amount).

That is the only reason why I am deleting the values with 0 from the hash,
and at the end I will just print those values which are non-zero.


Quote
File1
889546565,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889996975,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889998385,6,46,APY,0,-14067,05-DEC-13,05-DEC-13,1
890722795,6,46,APY,0,-9430,14-DEC-13,14-DEC-13,1
890857005,6,24,APY,0,-500,10-NOV-13,10-NOV-13,1
890925475,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936315,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936335,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936355,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936415,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3


File 2
1993532910 6 212-366-520 APY 451028365 900 -900 0 14-AUG-14 15-AUG-14 3
1993536110 6 366-520 APY 477045894 390 -390 0 14-AUG-14 15-AUG-14 2
1993536750 6 366-520 APY 917563294 300 -300 0 14-AUG-14 15-AUG-14 2
1993539310 6 366-520 APY 7512802845 432 -432 0 14-AUG-14 15-AUG-14 2
1993539950 6 366-520 APY 449362894 432 -432 0 15-AUG-14 15-AUG-14 2
1993541230 6 366-520 APY 6770624155 1234 -1234 0 15-AUG-14 15-AUG-14 2
1993542510 6 366-520 APY 628602625 100 -100 0 15-AUG-14 15-AUG-14 2
1993543790 6 366-520 APY 843380824 400 -400 0 15-AUG-14 15-AUG-14 2
1993544430 6 366-520 APY 531660774 99 -99 0 15-AUG-14 15-AUG-14 2
1993545070 6 212-366-520 APY 444744025 432 -432 0 15-AUG-14 15-AUG-14 3
696795792443 6 14-308 APY 521806975 550 -550 0 24-JUL-14 15-AUG-14 4


THE TXN WITH KEY 696795792443 HAS
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3 IN THE FIRST FILE
696795792443 6 14-308 APY 521806975 550 -550 0 24-JUL-14 15-AUG-14 4 IN THE SECOND FILE (today's data)

This means that a historical txn has an update today and the total amount is 0, so we do not need this to be printed, as it is happily balanced.

But the example below is an unbalanced case, where there is an update but the total is still non-zero. We have to print the latest data, as there is an update:
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3 IN THE FIRST FILE
696795792443 6 14-308 APY 521806975 550 -1550 1000 24-JUL-14 15-AUG-14 4 IN THE SECOND FILE (today's data)

Finally, all the unmatched txns should also be printed, as they are all non-zero.

The first file always has non-zero values.
The second file has zero and non-zero values. (There will be a lot of unmatched txns with 0, which also shouldn't be printed; I am finding a way to do that. Matched txns with 0 are anyhow being deleted, so only the above case has to be dealt with.)

All that I am doing is printing all the non-zero values from both files:
Compare whether a non-zero txn has an update in the latest file; if yes and zero, ignore.
If yes and non-zero, print.
If it is not available in the current file, it means it is still non-zero.

Hope you have understood the business behind this.
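If the rules above are read correctly, they can be sketched like this (a hedged illustration with simplified records; [key, balance, line] triplets stand in for the real CSV and whitespace-separated lines, and all names are mine):

```perl
use strict;
use warnings;

# Start from the historical non-zero txns, apply today's updates:
# matched and now zero -> drop; matched and still non-zero -> keep latest;
# today-only and non-zero -> keep; today-only and zero -> ignore.
sub still_unbalanced {
    my ($historical, $today) = @_;    # array refs of [key, balance, line]
    my %open = map { $_->[0] => $_->[2] } @$historical;
    for my $rec (@$today) {
        my ($key, $balance, $line) = @$rec;
        if (exists $open{$key}) {
            if ($balance == 0) { delete $open{$key} }    # balanced: drop
            else               { $open{$key} = $line }   # updated, keep latest
        }
        elsif ($balance != 0) {
            $open{$key} = $line;      # new txn that is still unbalanced
        }
        # today-only txns that already balance to zero are ignored
    }
    return \%open;                    # everything left is non-zero
}
```

Everything remaining in the returned hash would then be printed as the new non-zero report.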


(This post was edited by Tejas on Sep 4, 2014, 7:10 PM)


Laurent_R
Veteran / Moderator

Sep 5, 2014, 3:25 PM

Post #24 of 43 (41385 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

And did you try the two pieces of code I suggested?


Tejas
User

Sep 6, 2014, 8:05 AM

Post #25 of 43 (41218 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Hi

Yes, I tried it on test data.
The main data is still spooling; it takes at least 8 hours to spool the data from SQL.
Then I will compare it on the prod data.

I will post updates once the file is ready.



Thanks
Tejas


Tejas
User

Sep 8, 2014, 12:51 AM

Post #26 of 43 (96448 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Doubts about the code.
These are basically doubts related to the syntax, not the logic.


Code
sub compare2 {
    my ($curr_key, $prev_key) = (split /,/, $_)[0] for @_;
    return $curr_key cmp $prev_key;
}

Now the actual comparison code:
Code
my $ligne1 = <$IN1>; # Here we are reading the first lines; how are we going ahead with the next line?
my $ligne2 = <$IN2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);
while ( 1 ) { # Isn't this going to loop infinitely?
    my $comparison = compare($ligne1, $ligne2);
    if ($comparison > 0) { # Here the data is matching, isn't it?
        print $ORPH2 $ligne2, "\n"; # Then why are we printing it as an orphan?
        $ligne2 = <$IN2>;
        last unless defined $ligne2; # What is "last unless defined"?
        chomp $ligne2;
    } else {
        if ($comparison < 0) { # We only return 0 or 1 from compare; how can we get a value < 0?
            print $ORPH1 $ligne1, "\n";
            $ligne1 = <$IN1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else { # if the comparison passes
            print $OUT1 $ligne1, "\n";
            print $OUT2 $ligne2, "\n";
            $ligne1 = <$IN1>;
            $ligne2 = <$IN2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $ORPH2 $ligne2 if defined $ligne2; # Why are we printing here and below when we already printed above?
print $ORPH2 $ligne2 while $ligne2 = <$IN2>;
print $ORPH1 $ligne1 if defined $ligne1;
print $ORPH1 $ligne1 while $ligne1 = <$IN1>;


Can you walk me through it if you have time?
I am running this code on my data and it runs,
but I get this warning for every line:

Quote
Use of uninitialized value in string eq at ./Kompare.pl line 58, <$in_fh2> line 261502.
Use of uninitialized value in string eq at ./Kompare.pl line 58, <$in_fh2> line 261502.
Use of uninitialized value in string eq at ...


I am totally confused about how the code works,
but it gives me the desired output in 5 minutes.
I didn't really understand how.

Below is the changed code; I actually wanted all the matched and unmatched lines in one file.


Will update with the output.

Thanks
Tejas

Code
while ( 1 ) {
    my $comparison = compare($ligne1, $ligne2);
    if ($comparison > 0) {
        $ligne2 = <$in_fh2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } else {
        if ($comparison < 0) {
            $ligne1 = <$in_fh1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {
            print $OUT2 $ligne2, "\n";
            $ligne1 = <$in_fh1>;
            $ligne2 = <$in_fh2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $OUT2 $ligne2 if defined $ligne2;
print $OUT2 $ligne2 while $ligne2 = <$in_fh2>;
print $OUT2 $ligne1 if defined $ligne1;
print $OUT2 $ligne1 while $ligne1 = <$in_fh1>;

But I didn't really understand how it works.


(This post was edited by Tejas on Sep 8, 2014, 2:32 AM)


Laurent_R
Veteran / Moderator

Sep 8, 2014, 10:50 AM

Post #27 of 43 (96433 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

OK, let me try to explain.


Code
sub compare2 {
    my ($curr_key, $prev_key) = (split /,/, $_)[0] for @_;
    return $curr_key cmp $prev_key;
}


The split splits each of the lines received as arguments to get the keys of those lines. Then we compare the keys. The cmp operator returns 0 if the keys compare equal, -1 if key2 is larger than key1, and +1 if key1 is larger than key2.

Now the main comparison part. I first read the first line of each file. Then I compare the lines and, based on the result of the comparison, fetch more lines. If the two keys compare equal (the subroutine returns 0 and we go into the else part), then I know that the line is common to both files, so I print it to $OUT1 and $OUT2 and fetch a new line from both files:

Code
print $OUT1 $ligne1, "\n";
print $OUT2 $ligne2, "\n";
$ligne1 = <$IN1>;
$ligne2 = <$IN2>;

If either of the two lines is not defined, then we have reached the end of the corresponding file, and the last exits the while loop. Otherwise we chomp the two new lines to prepare the comparison in the next iteration of the loop.


Code
last unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);


If the keys don't compare equal (the subroutine returns -1 or +1), then it means that we have an orphan. If the sub returns +1, it means that the key of line 1 was larger than the key of line 2, so we have in file2 a line that does not exist in file1. We print it as an orphan to $ORPH2, keep line 1 from the previous fetch, and fetch a new line from file 2. If line 2 is not defined, it means that we are at the end of file2 and we are almost done; the "last" command exits the while loop. Otherwise we chomp the new line 2 and get back to the beginning of the loop for the comparison.

If the subroutine returned -1, it is just the opposite case: line 1 is an orphan, so we print it to $ORPH1 and fetch the next line from file1.

When we exit the loop, it means that at least one of the files has been exhausted. We still need to print as an orphan the last line we had in hand, and we still need to print as orphans all the lines remaining in the other file (if any). That's what the four print statements after the end of the while loop do.

Is this clearer now?

As for the "uninitialized value" warning, you would have to either post the full code or, at least, tell me what is on line 58 of your program.


Tejas
User

Sep 8, 2014, 11:46 AM

Post #28 of 43 (96431 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post

I got confused here: we are not incrementing anything, so how can it go to the next line?
That's where all the confusion started.

Code
print $OUT1 $ligne1, "\n";
print $OUT2 $ligne2, "\n";
$ligne1 = <$IN1>;
$ligne2 = <$IN2>;


Also, my requirement is to print the latest of both files, not both,
so I have changed that in my code pasted above.

It is a bit clearer now.

I always used to use a while loop with a file handle:
while (<IN>) {
}

And you have used the handle directly in the code; I don't see any line counter being incremented, so I was confused about how this reads the next line.


Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 8, 2014, 2:36 PM

Post #29 of 43 (96429 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post


In Reply To
i always use to use while loop with file handler
while (<IN>) {
}


That's also what I almost always do (except that I usually use a lexical filehandle rather than a bareword filehandle).

But here, since I need to read two files in parallel, sometimes with more lines from one source and sometimes more from the other, it is just not possible. At least one of the files needs to be read line by line, on demand, depending on the conditions met. It would in theory be possible to read one file with a:

Code
while (<$IN>) { # ...

construct, and the other one on demand as I do in my code. But this turns out to be a bit complicated, and I've found that finely controlling input on both sides makes things symmetrical and much simpler to code.

So, when I do something like:

Code
$ligne1 = <$IN1>;

I am just reading the next line from $IN1. The $IN1 filehandle is an iterator that "remembers" where it should read the next line. In most common cases you put it in a while loop to read one line after the other; here I am doing it "by hand", i.e. reading from whichever file I need input from at any particular time.
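This iterator behaviour can be seen in a tiny standalone sketch (the file name is made up for illustration):

```perl
use strict;
use warnings;

# Create a small demo file (hypothetical name, for illustration only).
my $demo = 'demo_iterator.txt';
open my $out, '>', $demo or die "cannot write $demo: $!";
print $out "first\nsecond\nthird\n";
close $out;

# Each read from the handle returns the next line; the handle itself
# keeps track of the position, so no explicit "increment" is needed.
open my $in, '<', $demo or die "cannot read $demo: $!";
my $line1 = <$in>;   # "first\n"
my $line2 = <$in>;   # "second\n"
my $line3 = <$in>;   # "third\n"
my $eof   = <$in>;   # undef: the file is exhausted
close $in;
unlink $demo;
```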

I am happy that this seems to solve your problem and does what you want (even if you had to change a couple of things), and that it does it quite fast. This I expected: I have used this method a number of times on file volumes of several gigabytes, and it is about as fast as it can get, even though there is an initial sorting phase that takes quite a bit of time. Also, available memory is simply not an issue: at any given time, we hold only two lines in memory. With files 10 or 100 times larger, processing would obviously take longer, but memory usage would remain almost nothing (basically one line of input from each file).

I am still a bit concerned about the "use of uninitialized value" warning that you get. It might be secondary or irrelevant (it seems to be, since you obtain the desired result), but I would personally never put into production a program displaying such warnings, because they indicate that something is not behaving as expected, even when the results are or seem to be good. Please post your full program so that I can investigate what's going on at line 58.

Cheers,
Laurent.


Tejas
User

Sep 9, 2014, 12:15 AM

Post #30 of 43 (96419 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post

Hi Laurent

I have a doubt about the scenario below.

File 1 has one row:

key 10999999

File 2 has 10999999 rows, from 1 to 10999999:
1
.
.
10999999

As per this code, how many comparisons will it take?
10999999 comparisons, isn't it?
Can we have a binary search in the code to overcome this?

Wouldn't this save time?

And the code doesn't work for the files below, which are totally unmatching.


Quote
==> File11 <==
00011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1
0001SR1WWT45PHK17V4B,6,351,JUD,3900,0,24-OCT-09,24-OCT-09,1
000215Q0CJJ24DY0TCW0,1,410,DUD,9.99,0,14-JAN-14,14-JAN-14,1
000255VW6WVJRX2S4ZH0,1,301-351-355,DUD,99,-198,09-MAY-14,01-JUN-14,3
0004TC0TQBC9439B4E40,1,375,DUD,0,-.99,04-SEP-12,04-SEP-12,1
00060Z2GRB2XVQ5AE5NK-2,1,375,DUD,0,-10.99,23-JUL-10,23-JUL-10,1
00076D3VC3V7R6ERSCX0,1,410,DUD,7.46,0,24-FEB-13,24-FEB-13,1
000BZPHEQFWXB1SMWT6A-1,1,375,DUD,0,-13.99,23-JUL-10,23-JUL-10,1
000D1NZZD7JH5FXF9GW1,1,351,DUD,99,0,11-JUL-14,11-JUL-14,1
000DB9W32NHNS9GVS8E1,1,410,DUD,5.45,0,01-JUL-14,01-JUL-14,1



Quote
==> File12 <==
1200011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1
120001SR1WWT45PHK17V4B,6,351,JUD,3900,0,24-OCT-09,24-OCT-09,1
12000215Q0CJJ24DY0TCW0,1,410,DUD,9.99,0,14-JAN-14,14-JAN-14,1
12000255VW6WVJRX2S4ZH0,1,301-351-355,DUD,99,-198,09-MAY-14,01-JUN-14,3
120004TC0TQBC9439B4E40,1,375,DUD,0,-.99,04-SEP-12,04-SEP-12,1
1200060Z2GRB2XVQ5AE5NK-2,1,375,DUD,0,-10.99,23-JUL-10,23-JUL-10,1
1200076D3VC3V7R6ERSCX0,1,410,DUD,7.46,0,24-FEB-13,24-FEB-13,1
12000BZPHEQFWXB1SMWT6A-1,1,375,DUD,0,-13.99,23-JUL-10,23-JUL-10,1
12000D1NZZD7JH5FXF9GW1,1,351,DUD,99,0,11-JUL-14,11-JUL-14,1
12000DB9W32NHNS9GVS8E1,1,410,DUD,5.45,0,01-JUL-14,01-JUL-14,1



Code
#!/usr/bin/perl

use strict;
use warnings;
use Cwd;
#JP_ZIP.dat Kompare.pl Sorted_Allup_Na
my $start_run = time();
my $cwd = getcwd();
my $clr_txns = "$cwd/File1";
my $temp_file = "$cwd/File2";
my $final_output2 = "$cwd/Final_List3.txt";
my $final_output3 = "$cwd/Final_List4.txt";
open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $OUT1, '>', $final_output2 or die "could not open $final_output <$!>";
open my $OUT2, '>', $final_output3 or die "could not open $final_output <$!>";

my $ligne1 = <$in_fh1>;
my $ligne2 = <$in_fh2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);
while ( 1 ) {
    my $comparison = compare($ligne1, $ligne2);
    if ($comparison > 0) {
        print $OUT1 $ligne2, "\n";
        $ligne2 = <$in_fh2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } else {
        if ($comparison < 0) {
            print $OUT1 $ligne1, "\n";
            $ligne1 = <$in_fh1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {
            print "$comparison \n";
            print $OUT2 $ligne2, "\n";
            print "Found $ligne2 \n";
            $ligne1 = <$in_fh1>;
            $ligne2 = <$in_fh2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $OUT2 $ligne2 if defined $ligne2;
print $OUT2 $ligne2 while $ligne2 = <$in_fh2>;
print $OUT2 $ligne1 if defined $ligne1;
print $OUT2 $ligne1 while $ligne1 = <$in_fh1>;

sub compare {
    my ($curr_key, $prev_key); # = (split /,/, $_)[0] for @_;
    $curr_key = (split /,/, $_)[0] for $_[0];
    $prev_key = (split /,/, $_)[0] for $_[1];
    # print "Second File $prev_key \n";
    $curr_key eq $prev_key ? 1 : 0;
}

my $end_run = time();
my $run_time = sprintf "%.2f", (($end_run - $start_run) / 60);
print "Elapsed: $run_time minutes\n";



I sorted both files on the first column; they seem to be matching as per the code (1200011Y7JP90APNPKPYB0 is matching 00011Y7JP90APNPKPYB0).


(This post was edited by Tejas on Sep 9, 2014, 12:54 AM)


Laurent_R
Veteran / Moderator

Sep 9, 2014, 10:02 AM

Post #31 of 43 (96366 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

Yes, if you are looking for only one value in a sorted file of 10999999 values, then binary search will be much faster. But you are talking about a problem very different from the previous one. Besides, binary search in a file is not necessarily easy to implement in the general case; it is a bit easier if you know certain things about the file (such as a constant line length or other features simplifying the problem).
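For what it's worth, under the simplifying assumption of fixed-length records, a binary search over a sorted file could be sketched like this (the record length, file layout and names are illustrative assumptions, not part of the code discussed above):

```perl
use strict;
use warnings;

# Sketch: binary search in a sorted file of FIXED-LENGTH records.
# $rec_len is the length of each line INCLUDING the newline; this only
# works when every line has exactly the same length, as noted above.
sub binary_search_file {
    my ($path, $rec_len, $wanted_key) = @_;
    my $size = -s $path;
    my $n    = int($size / $rec_len);      # number of records
    open my $fh, '<', $path or die "cannot open $path: $!";
    my ($lo, $hi) = (0, $n - 1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid * $rec_len, 0;      # jump straight to record $mid
        my $line = <$fh>;
        chomp $line;
        my $key = (split /,/, $line)[0];
        if    ($key lt $wanted_key) { $lo = $mid + 1 }
        elsif ($key gt $wanted_key) { $hi = $mid - 1 }
        else  { close $fh; return $line }  # found
    }
    close $fh;
    return undef;                          # not found
}
```

Each probe costs one seek and one read, so a single lookup takes O(log n) reads instead of a linear scan.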


Tejas
User

Sep 9, 2014, 9:31 PM

Post #32 of 43 (96341 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post

Yes, I get it.

Laurent, did you get to look at the code and the sample transactions I copied?
The IDs are different, but they seem to match as per the code.


Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 10, 2014, 10:41 AM

Post #33 of 43 (96285 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

If you're talking about files 11 and 12, the program cannot work because the keys don't match. If we look at the first record of each:


Quote
==> File11 <==
00011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1

==> File12 <==
1200011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1

The keys look very similar, except that you have an additional "12" at the beginning of each line in file 12.

You'll probably need either to preprocess file 12 to remove those "12" at the beginning of the lines, or to change the comparison procedure slightly so as not to use the first two characters of the lines of file 12. But do it on another copy of the program, to make sure the original program still works for your other files.
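The second option could be sketched as a variant of the comparison routine that drops the leading "12" from the file-12 key before comparing (a sketch; the fixed "12" prefix is an assumption based on the sample lines shown):

```perl
use strict;
use warnings;

# Variant comparison: the second line comes from file 12, whose keys
# carry an extra "12" prefix, so strip it before comparing.
# ASSUMPTION: the prefix is always the literal two characters "12".
sub compare_prefixed {
    my ($line1, $line2) = @_;
    my $key1 = (split /,/, $line1)[0];
    my $key2 = (split /,/, $line2)[0];
    $key2 =~ s/^12//;              # ignore the added "12"
    return $key1 cmp $key2;
}
```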


Tejas
User

Sep 10, 2014, 10:18 PM

Post #34 of 43 (96267 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post

 
That's exactly what I am saying.
They shouldn't match, as the 12 is added.
But as per the code the keys are matching.


Quote
==> File11 <==
00011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1

==> File12 <==

1200011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1


Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 11, 2014, 12:18 AM

Post #35 of 43 (96240 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

I find this hard to believe. Clearly, with my version of the code, all lines would go to the orphan files. Please show the code of the "compare" subroutine.


Tejas
User

Sep 11, 2014, 5:56 AM

Post #36 of 43 (96162 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Cwd;
#JP_ZIP.dat Kompare.pl Sorted_Allup_Na
my $start_run = time();
my $cwd = getcwd();
my $clr_txns = "$cwd/File1";
my $temp_file = "$cwd/File2";
my $final_output = "$cwd/Final_List1.txt";
my $final_output1 = "$cwd/Final_List2.txt";
my $final_output2 = "$cwd/Final_List3.txt";
my $final_output3 = "$cwd/Final_List4.txt";
open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $orph_fh1, '>', $final_output or die "could not open $final_output <$!>";
open my $orph_fh2, '>', $final_output1 or die "could not open $final_output <$!>";
open my $OUT1, '>', $final_output2 or die "could not open $final_output <$!>";
open my $OUT2, '>', $final_output3 or die "could not open $final_output <$!>";
my $ligne1 = <$in_fh1>;
my $ligne2 = <$in_fh2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);

while ( 1 ) {
    my $comparison = compare2($ligne1, $ligne2);
    if ($comparison > 0) {
        print $OUT1 $ligne2, "\n";
        $ligne2 = <$in_fh2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } else {
        if ($comparison < 0) {
            print $OUT1 $ligne1, "\n";
            $ligne1 = <$in_fh1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {
            print "$comparison \n";
            print $OUT2 $ligne2, "\n";
            $ligne1 = <$in_fh1>;
            $ligne2 = <$in_fh2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $OUT2 $ligne2 if defined $ligne2;
print $OUT2 $ligne2 while $ligne2 = <$in_fh2>;
print $OUT2 $ligne1 if defined $ligne1;
print $OUT2 $ligne1 while $ligne1 = <$in_fh1>;

sub compare2 {
    my ($curr_key, $prev_key); # = (split /,/, $_)[0] for @_;
    $curr_key = (split /,/, $_)[0] for $_[0];
    $prev_key = (split /,/, $_)[0] for $_[1];
    # print "Second File $prev_key \n";
    $curr_key eq $prev_key ? 1 : 0;
}

my $end_run = time();
my $run_time = sprintf "%.2f", (($end_run - $start_run) / 60);
print "Elapsed: $run_time minutes\n";



File1

Quote
00011Y7JP90APNPKPYB0,1,351,USD,99,0,04-JUN-14,04-JUN-14,1
0001SR1WWT45PHK17V4B,6,351,JPY,3900,0,24-OCT-09,24-OCT-09,1
000215Q0CJJ24DY0TCW0,1,410,USD,9.99,0,14-JAN-14,14-JAN-14,1
000255VW6WVJRX2S4ZH0,1,301-351-355,USD,99,-198,09-MAY-14,01-JUN-14,3
0004TC0TQBC9439B4E40,1,375,USD,0,-.99,04-SEP-12,04-SEP-12,1
00060Z2GRB2XVQ5AE5NK-2,1,375,USD,0,-10.99,23-JUL-10,23-JUL-10,1
00076D3VC3V7R6ERSCX0,1,410,USD,7.46,0,24-FEB-13,24-FEB-13,1
000BZPHEQFWXB1SMWT6A-1,1,375,USD,0,-13.99,23-JUL-10,23-JUL-10,1
000D1NZZD7JH5FXF9GW1,1,351,USD,99,0,11-JUL-14,11-JUL-14,1
000DB9W32NHNS9GVS8E1,1,410,USD,5.45,0,01-JUL-14,01-JUL-14,1


Quote
1200011Y7JP90APNPKPYB0,1,351,USD,99,0,04-JUN-14,04-JUN-14,1
120001SR1WWT45PHK17V4B,6,351,JPY,3900,0,24-OCT-09,24-OCT-09,1
12000215Q0CJJ24DY0TCW0,1,410,USD,9.99,0,14-JAN-14,14-JAN-14,1
12000255VW6WVJRX2S4ZH0,1,301-351-355,USD,99,-198,09-MAY-14,01-JUN-14,3
120004TC0TQBC9439B4E40,1,375,USD,0,-.99,04-SEP-12,04-SEP-12,1
1200060Z2GRB2XVQ5AE5NK-2,1,375,USD,0,-10.99,23-JUL-10,23-JUL-10,1
1200076D3VC3V7R6ERSCX0,1,410,USD,7.46,0,24-FEB-13,24-FEB-13,1
12000BZPHEQFWXB1SMWT6A-1,1,375,USD,0,-13.99,23-JUL-10,23-JUL-10,1
12000D1NZZD7JH5FXF9GW1,1,351,USD,99,0,11-JUL-14,11-JUL-14,1
12000DB9W32NHNS9GVS8E1,1,410,USD,5.45,0,01-JUL-14,01-JUL-14,1



Code and Files are above.
As per the code, all of them are matching, and ideally they shouldn't.


Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 11, 2014, 10:00 AM

Post #37 of 43 (96158 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

This line:

Code
$curr_key eq $prev_key ? 1 : 0;

in the compare2 subroutine is wrong. Try what I had:


Code
return $curr_key cmp $prev_key;


The main part of the program needs to receive 0 from the subroutine when the keys are equal (and you are doing just the opposite), and -1 or +1 when they are not equal (depending on which is higher). That's what the cmp operator does.
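A quick sanity check of what cmp returns:

```perl
use strict;
use warnings;

# cmp compares two strings: -1 when the left operand sorts first,
# 0 when both are equal, +1 when the left operand sorts last.
my $less    = 'abc' cmp 'abd';   # -1
my $equal   = 'abc' cmp 'abc';   #  0
my $greater = 'abd' cmp 'abc';   # +1
```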


Tejas
User

Mar 5, 2015, 12:47 AM

Post #38 of 43 (93317 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post

Hi Laurent

Below is the code that is being used for comparing the two files.

Code
sub compare2 {
    my ($curr_key, $prev_key) = (split /,/, $_)[0] for @_;
    return $curr_key cmp $prev_key;
}

my $line1 = <$IN1>;
my $line2 = <$IN2>;
die "One of the input files is empty\n" unless defined $line1 and defined $line2;
chomp ($line1, $line2);
while ( 1 ) {
    my $comparison = compare2($line1, $line2);
    if ($comparison > 0) {
        print $ORPH2 $line2, "\n";
        $line2 = <$IN2>;
        last unless defined $line2;
        chomp $line2;
    } else {
        if ($comparison < 0) {
            print $ORPH1 $line1, "\n";
            $line1 = <$IN1>;
            last unless defined $line1;
            chomp $line1;
        } else {
            print $OUT1 $line1, "\n";
            print $OUT2 $line2, "\n";
            $line1 = <$IN1>;
            $line2 = <$IN2>;
            last unless defined $line1 and defined $line2;
            chomp ($line1, $line2);
        }
    }
}
print $ORPH2 $line2 if defined $line2;
print $ORPH2 $line2 while $line2 = <$IN2>;
print $ORPH1 $line1 if defined $line1;
print $ORPH1 $line1 while $line1 = <$IN1>;



If file two has the exact same line twice, how can that be considered a match?

Quote
Ex :
File1 :

1 ABC 45

FILE2:
1 ABC 45
1 ABC 45


Ouput :
1 ABC 45
1 ABC 45

Current Output
1 ABC 45

The matched output file has just one line, but 2 lines from file2 are matching.
Since we move to the next line after a match, file 1 and file 2 end up on different data.
Is there any approach for that, or should we call compare again in the match branch?


Laurent_R
Veteran / Moderator

Mar 5, 2015, 9:54 AM

Post #39 of 43 (93304 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

Hi Tejas,

What I do when I need to compare two files is preprocess each file to remove duplicates. Only then do I read the two files in parallel to find "orphans". Sometimes I do both things in one go, but not when I use the module from which this code is taken. Note that this module has other functions, including one for finding and removing duplicates.

In your case, if a line comes twice in one file and only once in the other, it can be argued that the second occurrence is an orphan. But that is not a technical question; it really depends on your business rules.

It would not be too difficult to change the code to take such cases into account (though the code would no longer be generic), but one would need to know the exact requirement.
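Such a preprocessing pass can be sketched as a classic seen-hash filter (a sketch only; whether a duplicate should instead be treated as an orphan is, as noted above, a business decision):

```perl
use strict;
use warnings;

# Remove duplicate lines, keeping only the first occurrence of each.
# The %seen hash counts occurrences; grep keeps a line only the first
# time it is encountered (when its count is still zero).
sub dedup_lines {
    my @lines = @_;
    my %seen;
    return grep { !$seen{$_}++ } @lines;
}
```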


Tejas
User

Mar 5, 2015, 11:45 AM

Post #40 of 43 (93296 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post


Quote
what I am doing when I need to compare two files is to have a preprocessing of each file in order to remove duplicates. It is only then that I read the two files in parallel for finding "orphans". Sometimes I do both things in one go, but not when I use the module from which this code is taken. Note that this module has other functions, including one for finding and removing duplicates.

Can you tell me what other functions we have in the module?

Yeah, this isn't a generic requirement. My approach here is to read the required values from the first file (whose lines are unique) into a hash,
read the second file into an array, or just the required values into variables after splitting,
and compare them against the hash keys and values.
But the real problem is when the file is too huge, e.g. 3 GB; then a hash wouldn't work well.

Thanks
Tejas


Laurent_R
Veteran / Moderator

Mar 5, 2015, 11:18 PM

Post #41 of 43 (93287 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

No problem, Tejas, I'll provide you my full module, with its extensive documentation. I have made it free and open source anyway. I'll do it when I am in the office, I don't have it here on my mobile device.


Tejas
User

Mar 6, 2015, 2:25 AM

Post #42 of 43 (93283 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to] Can't Post

Thanks
But is my approach (using an array) feasible?

Thanks
Tejas


Laurent_R
Veteran / Moderator

Mar 6, 2015, 4:49 AM

Post #43 of 43 (93275 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to] Can't Post

Well, I developed this module (which I have sent to you, BTW) to read the two files in parallel precisely because the files I process are too large to fit into a hash in memory. It really depends on the size of your files and on your available resources.

Otherwise, you need only one hash for file 1; there is no need for an array for file 2. You can load one file into a hash and then read the other file line by line.
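That hash-plus-stream approach can be sketched as follows (the file names and the choice of the first CSV field as the key are assumptions based on the data shown earlier in the thread):

```perl
use strict;
use warnings;

# Load the keys of file 1 into a hash, then stream file 2 line by
# line and test membership; only one file's keys are ever in memory.
sub match_against_hash {
    my ($file1, $file2) = @_;
    my %keys;
    open my $fh1, '<', $file1 or die "cannot open $file1: $!";
    while (my $line = <$fh1>) {
        chomp $line;
        my $key = (split /,/, $line)[0];   # ASSUMPTION: key = field 1
        $keys{$key} = 1;
    }
    close $fh1;

    my @matched;
    open my $fh2, '<', $file2 or die "cannot open $file2: $!";
    while (my $line = <$fh2>) {
        chomp $line;
        my $key = (split /,/, $line)[0];
        push @matched, $line if $keys{$key};
    }
    close $fh2;
    return @matched;
}
```

Only the keys of file 1 are held in memory; file 2 is streamed, so memory use stays proportional to file 1's key set.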

 
 

