Merging the data in two files using a hash

 



Tejas
User

Sep 8, 2014, 12:51 AM

Post #26 of 37 (2781 views)
Re: [Laurent_R] Merging the data in two files using a hash

I have some doubts about the code.
They are mostly about the syntax, not the logic.


Code
sub compare2 {   
my ($curr_key, $prev_key) = map { (split /,/, $_)[0] } @_;   # first field of each line
return $curr_key cmp $prev_key;
}

Now the actual comparison code:
Code
my $ligne1 = <$IN1>;    # Here we read the first lines. How do we move on to the next line?
my $ligne2 = <$IN2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);
while ( 1 ) {    # Isn't this going to be an infinite loop?
    my $comparison = compare($ligne1, $ligne2);
    if ($comparison > 0) {    # Here the data is matching, isn't it?
        print $ORPH2 $ligne2, "\n";    # Then why are we printing it to the orphan file?
        $ligne2 = <$IN2>;
        last unless defined $ligne2;    # What does "last unless defined" do?
        chomp $ligne2;
    } else {
        if ($comparison < 0) {    # We only return 0 or 1 from compare, so how can we get a value < 0?
            print $ORPH1 $ligne1, "\n";
            $ligne1 = <$IN1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {    # If the comparison passes
            print $OUT1 $ligne1, "\n";
            print $OUT2 $ligne2, "\n";
            $ligne1 = <$IN1>;
            $ligne2 = <$IN2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
# Why are we printing here and below when we are already printing above?
print $ORPH2 $ligne2 if defined $ligne2;
print $ORPH2 $ligne2 while $ligne2 = <$IN2>;
print $ORPH1 $ligne1 if defined $ligne1;
print $ORPH1 $ligne1 while $ligne1 = <$IN1>;


Can you walk me through it if you have time?
I am running this code on my data and it runs,
but I get this warning for every line:

Quote
Use of uninitialized value in string eq at ./Kompare.pl line 58, <$in_fh2> line 261502.
Use of uninitialized value in string eq at ./Kompare.pl line 58, <$in_fh2> line 261502.
Use of uninitialized value in string eq at ...


I am totally confused about how the code works.
But it gives me the desired output in 5 minutes.
I didn't really understand how.

Below is the changed code. I actually wanted all the matched and unmatched lines in one file.


Will update with the output.

Thanks
Tejas

Code
while ( 1 ) {
    my $comparison = compare($ligne1, $ligne2);
    if ($comparison > 0) {
        $ligne2 = <$in_fh2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } else {
        if ($comparison < 0) {
            $ligne1 = <$in_fh1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {
            print $OUT2 $ligne2, "\n";
            $ligne1 = <$in_fh1>;
            $ligne2 = <$in_fh2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $OUT2 $ligne2 if defined $ligne2;
print $OUT2 $ligne2 while $ligne2 = <$in_fh2>;
print $OUT2 $ligne1 if defined $ligne1;
print $OUT2 $ligne1 while $ligne1 = <$in_fh1>;

But I didn't really understand how it works.


(This post was edited by Tejas on Sep 8, 2014, 2:32 AM)


Laurent_R
Veteran / Moderator

Sep 8, 2014, 10:50 AM

Post #27 of 37 (2766 views)
Re: [Tejas] Merging the data in two files using a hash

OK, let me try to explain.


Code
sub compare2 {    
my ($curr_key, $prev_key) = map { (split /,/, $_)[0] } @_;   # first field of each line
return $curr_key cmp $prev_key;
}


The split breaks up each of the lines received as arguments to extract their keys. Then we compare the keys. The cmp operator returns 0 if the keys compare equal, -1 if key 2 is larger than key 1, and +1 if key 1 is larger than key 2.
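
To make the three possible return values concrete, here is a minimal sketch. The helper names key_of and compare_keys and the sample lines are made up for illustration; they are not part of the program discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the first comma-separated field (the key).
sub key_of {
    my ($line) = @_;
    return (split /,/, $line)[0];
}

# Hypothetical helper: compare two lines by key.
# Returns -1, 0 or +1, exactly as the cmp operator does.
sub compare_keys {
    my ($line1, $line2) = @_;
    return key_of($line1) cmp key_of($line2);
}

print compare_keys("A001,x", "A001,y"), "\n";   # 0:  same key
print compare_keys("A001,x", "B002,y"), "\n";   # -1: first key sorts before the second
print compare_keys("B002,x", "A001,y"), "\n";   # 1:  first key sorts after the second
```

Note that cmp compares strings, so the files must be sorted as strings on the key for the merge to work.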

Now the main comparison part. I first read the first line of each file. Then I compare the lines, and based on the result of the comparison I fetch more lines. If the two keys compare equal (the subroutine returns 0 and we go into the else part), then I know that the lines are common to both files, so I print them to $OUT1 and $OUT2 and fetch a new line from both files:

Code
print $OUT1 $ligne1, "\n";
print $OUT2 $ligne2, "\n";
$ligne1 = <$IN1>;
$ligne2 = <$IN2>;

If either of the two lines is not defined, then we have reached the end of the corresponding file, and the last exits the while loop. Otherwise, we chomp the two new lines to prepare for the comparison in the next iteration of the loop.


Code
last unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);


If the keys don't compare equal (the subroutine returns -1 or +1), then we have an orphan. If the sub returns +1, the key of line 1 is larger than the key of line 2, so file 2 contains a line that does not exist in file 1. We print it as an orphan to $ORPH2. We keep line 1 from the previous fetch and fetch a new line from file 2. If line 2 is not defined, we have reached the end of file 2 and are almost done: the "last" command exits the while loop. Otherwise, we chomp the new line 2 and go back to the beginning of the loop for the next comparison.

If the subroutine returned -1, it is just the opposite case: line 1 is an orphan, so we print it to $ORPH1 and fetch the next line from file 1.

When we exit the loop, at least one of the files has been exhausted. We still need to print as an orphan the other line that we had, and we still need to print as orphans all the lines remaining in the other file (if any). That's what the four print statements after the end of the while loop do.
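
If it helps, the whole walk-through can be tried on a tiny example. This is only a sketch under some assumptions: the sub name merge_compare is made up, it reads from in-memory strings instead of real files, it collects the lines in arrays instead of printing to $OUT1/$ORPH1/$ORPH2, and (unlike the program above) it does not die when one input is empty.

```perl
use strict;
use warnings;

# Hypothetical helper: merge-compare two inputs sorted on their first
# comma-separated field. Returns (\@common, \@orphans1, \@orphans2).
sub merge_compare {
    my ($fh1, $fh2) = @_;
    my (@common, @orph1, @orph2);
    my $key = sub { (split /,/, $_[0])[0] };

    my $line1 = <$fh1>;
    my $line2 = <$fh2>;
    chomp $line1 if defined $line1;
    chomp $line2 if defined $line2;

    while (defined $line1 and defined $line2) {
        my $cmp = $key->($line1) cmp $key->($line2);
        if ($cmp > 0) {               # key exists only in file 2
            push @orph2, $line2;
            $line2 = <$fh2>;
            chomp $line2 if defined $line2;
        }
        elsif ($cmp < 0) {            # key exists only in file 1
            push @orph1, $line1;
            $line1 = <$fh1>;
            chomp $line1 if defined $line1;
        }
        else {                        # key exists in both files
            push @common, $line1;
            $line1 = <$fh1>;
            $line2 = <$fh2>;
            chomp $line1 if defined $line1;
            chomp $line2 if defined $line2;
        }
    }

    # One input is exhausted: everything left on the other side is an orphan.
    while (defined $line1) {
        push @orph1, $line1;
        $line1 = <$fh1>;
        chomp $line1 if defined $line1;
    }
    while (defined $line2) {
        push @orph2, $line2;
        $line2 = <$fh2>;
        chomp $line2 if defined $line2;
    }
    return (\@common, \@orph1, \@orph2);
}

# Two small sorted "files" held in memory (made-up sample data).
my $data1 = "a,1\nb,1\nd,1\n";
my $data2 = "b,2\nc,2\nd,2\ne,2\n";
open my $fh1, '<', \$data1 or die "could not open in-memory file 1: $!";
open my $fh2, '<', \$data2 or die "could not open in-memory file 2: $!";

my ($common, $o1, $o2) = merge_compare($fh1, $fh2);
print "common:   @$common\n";   # common:   b,1 d,1
print "orphans1: @$o1\n";       # orphans1: a,1
print "orphans2: @$o2\n";       # orphans2: c,2 e,2
```

The loop condition here replaces the "while (1)" plus "last unless defined" pattern; the behaviour is the same, it is only written differently.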

Is this clearer now?

As for the warning about "uninitialized value", you would have either to post the full code or, at least, to tell me what is on line 58 of your program.


Tejas
User

Sep 8, 2014, 11:46 AM

Post #28 of 37 (2764 views)
Re: [Laurent_R] Merging the data in two files using a hash

I got confused here. We are not incrementing anything, so how can it go to the next line?
That's where all the confusion started.

Code
print $OUT1 $ligne1, "\n";
print $OUT2 $ligne2, "\n";
$ligne1 = <$IN1>;
$ligne2 = <$IN2>;


Also, my requirement is to print the latest of both files, not both,
so I have changed it in my code pasted above.

Now it is a bit clearer.

I always use a while loop with a file handle:
while (<IN>) {
}

And you have used it directly in the code, and I don't see any line being incremented, so I was confused about how this reads the next line.

Code
print $OUT1 $ligne1, "\n";
print $OUT2 $ligne2, "\n";
$ligne1 = <$IN1>;
$ligne2 = <$IN2>;

Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 8, 2014, 2:36 PM

Post #29 of 37 (2762 views)
Re: [Tejas] Merging the data in two files using a hash


In Reply To
I always use a while loop with a file handle
while (<IN>) {
}


That's also what I almost always do (except that I usually use a lexical filehandle rather than a bareword filehandle).

But here, since I need to read two files in parallel, with sometimes more lines from one source, sometimes more from the other source, it is just not possible. At least one of the files needs to be read line by line, on demand, depending on the conditions met. It would in theory be possible to read one file with a:

Code
while (<$IN>) { # ...

construct, and the other one on demand as I do in my code. But this turns out to be a bit complicated, and I've found that finely controlling input on both sides makes things symmetrical and much simpler to code.

So, when I do something like:

Code
$ligne1 = <$IN1>;

I am just reading the next line from $IN1. The $IN1 filehandle is just an iterator that "remembers" where it should read the next line from. In most common cases, you put it in a while loop to read one line after the other; in this case, I am doing it "by hand", i.e. reading from the file(s) from which I need input at any particular time.
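
Here is a small sketch of that idea. It assumes an in-memory string in place of a real file, and the variable names are made up for the example. Each read in scalar context returns the next line, and the filehandle keeps track of the position by itself.

```perl
use strict;
use warnings;

my $data = "first\nsecond\nthird\n";
open my $in, '<', \$data or die "could not open in-memory file: $!";

my $line = <$in>;      # first read returns "first\n"
chomp $line;
print "$line\n";

$line = <$in>;         # the very same statement now returns "second\n"
chomp $line;
print "$line\n";

# A while loop simply repeats the same read until it returns undef at end of file.
while (my $rest = <$in>) {
    chomp $rest;
    print "$rest\n";   # prints "third"
}
```

So there is no counter to increment: every assignment from the filehandle moves on to the next line.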

I am happy that this seems to solve your problem and does what you want (even if you had to change a couple of things), and also that it does it quite fast. This I knew: I have used this method quite a number of times with files of usually several gigabytes, and I know that this is about as fast as it can get, even though there is an initial sorting phase that takes quite a bit of time. Also, available memory is simply not an issue: at any given time, we only have two lines in memory. With files 10 or 100 times larger, processing would obviously take longer, but memory usage would remain almost nothing (basically two lines of input, one for each file).

I am still a bit concerned about the "use of uninitialized value" warning that you get. It might be secondary or irrelevant (that seems to be the case if you obtain the desired result), but I would personally never put into production a program displaying such warnings, because they indicate that something is not really behaving as expected, even when the results are or seem to be good. Please post your full program so that I can investigate what's going on at line 58 of your program.

Cheers,
Laurent.


Tejas
User

Sep 9, 2014, 12:15 AM

Post #30 of 37 (2752 views)
Re: [Laurent_R] Merging the data in two files using a hash

Hi Laurent

I have a doubt about the scenario below.

File 1 has one row:

key 10999999

File 2 has 10999999 rows, with keys from 1 to 10999999:
1
.
.
10999999

With this code, how many comparisons will it take?
10999999 comparisons, isn't it?
Could we use a binary search in the code to avoid this?

Wouldn't that save time?

Also, the code doesn't work for the files below, which are totally unmatching:


Quote
==> File11 <==
00011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1
0001SR1WWT45PHK17V4B,6,351,JUD,3900,0,24-OCT-09,24-OCT-09,1
000215Q0CJJ24DY0TCW0,1,410,DUD,9.99,0,14-JAN-14,14-JAN-14,1
000255VW6WVJRX2S4ZH0,1,301-351-355,DUD,99,-198,09-MAY-14,01-JUN-14,3
0004TC0TQBC9439B4E40,1,375,DUD,0,-.99,04-SEP-12,04-SEP-12,1
00060Z2GRB2XVQ5AE5NK-2,1,375,DUD,0,-10.99,23-JUL-10,23-JUL-10,1
00076D3VC3V7R6ERSCX0,1,410,DUD,7.46,0,24-FEB-13,24-FEB-13,1
000BZPHEQFWXB1SMWT6A-1,1,375,DUD,0,-13.99,23-JUL-10,23-JUL-10,1
000D1NZZD7JH5FXF9GW1,1,351,DUD,99,0,11-JUL-14,11-JUL-14,1
000DB9W32NHNS9GVS8E1,1,410,DUD,5.45,0,01-JUL-14,01-JUL-14,1



Quote
==> File12 <==
1200011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1
120001SR1WWT45PHK17V4B,6,351,JUD,3900,0,24-OCT-09,24-OCT-09,1
12000215Q0CJJ24DY0TCW0,1,410,DUD,9.99,0,14-JAN-14,14-JAN-14,1
12000255VW6WVJRX2S4ZH0,1,301-351-355,DUD,99,-198,09-MAY-14,01-JUN-14,3
120004TC0TQBC9439B4E40,1,375,DUD,0,-.99,04-SEP-12,04-SEP-12,1
1200060Z2GRB2XVQ5AE5NK-2,1,375,DUD,0,-10.99,23-JUL-10,23-JUL-10,1
1200076D3VC3V7R6ERSCX0,1,410,DUD,7.46,0,24-FEB-13,24-FEB-13,1
12000BZPHEQFWXB1SMWT6A-1,1,375,DUD,0,-13.99,23-JUL-10,23-JUL-10,1
12000D1NZZD7JH5FXF9GW1,1,351,DUD,99,0,11-JUL-14,11-JUL-14,1
12000DB9W32NHNS9GVS8E1,1,410,DUD,5.45,0,01-JUL-14,01-JUL-14,1



Code
#!/usr/bin/perl

use strict;
use warnings;
use Cwd;
#JP_ZIP.dat Kompare.pl Sorted_Allup_Na
my $start_run = time();
my $cwd = getcwd();
my $clr_txns = "$cwd/File1";
my $temp_file = "$cwd/File2";
my $final_output2 = "$cwd/Final_List3.txt";
my $final_output3 = "$cwd/Final_List4.txt";
open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $OUT1, '>', $final_output2 or die "could not open $final_output2 <$!>";
open my $OUT2, '>', $final_output3 or die "could not open $final_output3 <$!>";

my $ligne1 = <$in_fh1>;
my $ligne2 = <$in_fh2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);
while ( 1 ) {
    my $comparison = compare($ligne1, $ligne2);
    if ($comparison > 0) {
        print $OUT1 $ligne2, "\n";
        $ligne2 = <$in_fh2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } else {
        if ($comparison < 0) {
            print $OUT1 $ligne1, "\n";
            $ligne1 = <$in_fh1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {
            print "$comparison \n";
            print $OUT2 $ligne2, "\n";
            print "Found $ligne2 \n";
            $ligne1 = <$in_fh1>;
            $ligne2 = <$in_fh2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $OUT2 $ligne2 if defined $ligne2;
print $OUT2 $ligne2 while $ligne2 = <$in_fh2>;
print $OUT2 $ligne1 if defined $ligne1;
print $OUT2 $ligne1 while $ligne1 = <$in_fh1>;




sub compare {
    my ($curr_key, $prev_key) ;# = (split /,/, $_)[0] for @_;
    $curr_key = (split /,/, $_)[0] for $_[0];
    $prev_key = (split /,/, $_)[0] for $_[1];
    #print "Second File $prev_key \n";
    $curr_key eq $prev_key ? 1 : 0;
}



my $end_run = time();
my $run_time = sprintf "%.2f", (($end_run - $start_run) / 60);
print "Elapsed: $run_time minutes\n";



I sorted both files on the first column. According to the code they seem to match (1200011Y7JP90APNPKPYB0 is matching 00011Y7JP90APNPKPYB0).


(This post was edited by Tejas on Sep 9, 2014, 12:54 AM)


Laurent_R
Veteran / Moderator

Sep 9, 2014, 10:02 AM

Post #31 of 37 (2699 views)
Re: [Tejas] Merging the data in two files using a hash

Yes, if you are looking for only one value in a sorted file containing 10999999 values, then binary search will be much faster. But that is a very different problem from the one before. Besides, binary search in a file is not necessarily easy to implement in the general case; it is a bit easier if you know certain things about the file (such as a constant line length or other features simplifying the problem).
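
For reference, here is what binary search looks like in the simplest setting. This sketch makes a big simplifying assumption: the sorted keys are already in an in-memory array, which sidesteps the difficulty of searching inside a file. The sub name binary_search and the zero-padded sample keys are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the index of $target in the sorted
# array @$keys, or -1 if it is not there.
sub binary_search {
    my ($keys, $target) = @_;
    my ($low, $high) = (0, $#{$keys});
    while ($low <= $high) {
        my $mid = int(($low + $high) / 2);
        my $cmp = $keys->[$mid] cmp $target;
        if    ($cmp < 0) { $low  = $mid + 1 }   # target is in the upper half
        elsif ($cmp > 0) { $high = $mid - 1 }   # target is in the lower half
        else             { return $mid }        # found
    }
    return -1;
}

# Zero-padded keys so that string order and numeric order agree.
my @keys = sort map { sprintf "%08d", $_ } 1 .. 1000;
print binary_search(\@keys, "00000999"), "\n";   # 998
print binary_search(\@keys, "00001001"), "\n";   # -1
```

Each step halves the search range, so a single lookup among about 11 million sorted keys needs at most around 24 comparisons instead of millions.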


Tejas
User

Sep 9, 2014, 9:31 PM

Post #32 of 37 (2674 views)
Re: [Laurent_R] Merging the data in two files using a hash

Yes, I get it.

Laurent, did you get a chance to look at the code and the sample transactions I copied?
The IDs are different, but they seem to match as per the code.


Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 10, 2014, 10:41 AM

Post #33 of 37 (2618 views)
Re: [Tejas] Merging the data in two files using a hash

If you're talking about files 11 and 12, the program cannot work because the keys don't match. If we look at the first record of each:


Quote
==> File11 <==
00011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1

==> File12 <==
1200011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1

The keys look very similar, except for the fact that you have an additional "12" at the beginning of each of the lines in file 12.

You'll probably need either to preprocess file 12 to remove the "12" at the beginning of the lines, or to change the comparison procedure slightly so that it ignores the first two characters of the lines in file 12. But do it in another version of the program, to make sure the original program still works for your other files.
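
The second option could look something like the sketch below. The sub name compare_ignoring_prefix is made up, and it assumes that the second argument always comes from file 12 and that the extra prefix is always the literal "12".

```perl
use strict;
use warnings;

# Hypothetical variant of the comparison sub: strip a leading "12"
# from the key of the second line before comparing.
sub compare_ignoring_prefix {
    my ($line1, $line2) = @_;
    my $key1 = (split /,/, $line1)[0];
    my $key2 = (split /,/, $line2)[0];
    $key2 =~ s/^12//;               # assumption: the prefix is always "12"
    return $key1 cmp $key2;
}

my $from_file11 = "00011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1";
my $from_file12 = "1200011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1";
print compare_ignoring_prefix($from_file11, $from_file12), "\n";   # 0: keys now match
```

If every line of file 12 carries the same prefix, removing it does not change the sort order of that file, so the merge loop itself can stay as it is.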


Tejas
User

Sep 10, 2014, 10:18 PM

Post #34 of 37 (2600 views)
Re: [Laurent_R] Merging the data in two files using a hash

 
That's what I am saying too.
They shouldn't match, as 12 is added.
But as per the code, the keys are matching.


Quote
==> File11 <==
00011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1

==> File12 <==

1200011Y7JP90APNPKPYB0,1,351,DUD,99,0,04-JUN-14,04-JUN-14,1


Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 11, 2014, 12:18 AM

Post #35 of 37 (2573 views)
Re: [Tejas] Merging the data in two files using a hash

I find this hard to believe. Clearly, with my version of the code, all lines would go to the orphan files. Please show the code of the "compare" subroutine.


Tejas
User

Sep 11, 2014, 5:56 AM

Post #36 of 37 (2495 views)
Re: [Laurent_R] Merging the data in two files using a hash


Code
#!/usr/bin/perl 

use strict;
use warnings;
use Cwd;
#JP_ZIP.dat Kompare.pl Sorted_Allup_Na
my $start_run = time();
my $cwd = getcwd();
my $clr_txns = "$cwd/File1";
my $temp_file = "$cwd/File2";
my $final_output = "$cwd/Final_List1.txt";
my $final_output1 = "$cwd/Final_List2.txt";
my $final_output2 = "$cwd/Final_List3.txt";
my $final_output3 = "$cwd/Final_List4.txt";
open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $orph_fh1, '>', $final_output or die "could not open $final_output <$!>";
open my $orph_fh2, '>', $final_output1 or die "could not open $final_output1 <$!>";
open my $OUT1, '>', $final_output2 or die "could not open $final_output2 <$!>";
open my $OUT2, '>', $final_output3 or die "could not open $final_output3 <$!>";
my $ligne1 = <$in_fh1>;
my $ligne2 = <$in_fh2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);

while ( 1 ) {
    my $comparison = compare2($ligne1, $ligne2);
    if ($comparison > 0) {
        print $OUT1 $ligne2, "\n";
        $ligne2 = <$in_fh2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } else {
        if ($comparison < 0) {
            print $OUT1 $ligne1, "\n";
            $ligne1 = <$in_fh1>;
            last unless defined $ligne1;
            chomp $ligne1;
        } else {
            print "$comparison \n";
            print $OUT2 $ligne2, "\n";
            $ligne1 = <$in_fh1>;
            $ligne2 = <$in_fh2>;
            last unless defined $ligne1 and defined $ligne2;
            chomp ($ligne1, $ligne2);
        }
    }
}
print $OUT2 $ligne2 if defined $ligne2;
print $OUT2 $ligne2 while $ligne2 = <$in_fh2>;
print $OUT2 $ligne1 if defined $ligne1;
print $OUT2 $ligne1 while $ligne1 = <$in_fh1>;




sub compare2 {
    my ($curr_key, $prev_key) ;# = (split /,/, $_)[0] for @_;
    $curr_key = (split /,/, $_)[0] for $_[0];
    $prev_key = (split /,/, $_)[0] for $_[1];
    #print "Second File $prev_key \n";
    $curr_key eq $prev_key ? 1 : 0;
}



my $end_run = time();
my $run_time = sprintf "%.2f", (($end_run - $start_run) / 60);
print "Elapsed: $run_time minutes\n";



File1

Quote
00011Y7JP90APNPKPYB0,1,351,USD,99,0,04-JUN-14,04-JUN-14,1
0001SR1WWT45PHK17V4B,6,351,JPY,3900,0,24-OCT-09,24-OCT-09,1
000215Q0CJJ24DY0TCW0,1,410,USD,9.99,0,14-JAN-14,14-JAN-14,1
000255VW6WVJRX2S4ZH0,1,301-351-355,USD,99,-198,09-MAY-14,01-JUN-14,3
0004TC0TQBC9439B4E40,1,375,USD,0,-.99,04-SEP-12,04-SEP-12,1
00060Z2GRB2XVQ5AE5NK-2,1,375,USD,0,-10.99,23-JUL-10,23-JUL-10,1
00076D3VC3V7R6ERSCX0,1,410,USD,7.46,0,24-FEB-13,24-FEB-13,1
000BZPHEQFWXB1SMWT6A-1,1,375,USD,0,-13.99,23-JUL-10,23-JUL-10,1
000D1NZZD7JH5FXF9GW1,1,351,USD,99,0,11-JUL-14,11-JUL-14,1
000DB9W32NHNS9GVS8E1,1,410,USD,5.45,0,01-JUL-14,01-JUL-14,1


File2

Quote
1200011Y7JP90APNPKPYB0,1,351,USD,99,0,04-JUN-14,04-JUN-14,1
120001SR1WWT45PHK17V4B,6,351,JPY,3900,0,24-OCT-09,24-OCT-09,1
12000215Q0CJJ24DY0TCW0,1,410,USD,9.99,0,14-JAN-14,14-JAN-14,1
12000255VW6WVJRX2S4ZH0,1,301-351-355,USD,99,-198,09-MAY-14,01-JUN-14,3
120004TC0TQBC9439B4E40,1,375,USD,0,-.99,04-SEP-12,04-SEP-12,1
1200060Z2GRB2XVQ5AE5NK-2,1,375,USD,0,-10.99,23-JUL-10,23-JUL-10,1
1200076D3VC3V7R6ERSCX0,1,410,USD,7.46,0,24-FEB-13,24-FEB-13,1
12000BZPHEQFWXB1SMWT6A-1,1,375,USD,0,-13.99,23-JUL-10,23-JUL-10,1
12000D1NZZD7JH5FXF9GW1,1,351,USD,99,0,11-JUL-14,11-JUL-14,1
12000DB9W32NHNS9GVS8E1,1,410,USD,5.45,0,01-JUL-14,01-JUL-14,1



The code and files are above.
As per the code, all of them are matching, and ideally they shouldn't.


Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 11, 2014, 10:00 AM

Post #37 of 37 (2491 views)
Re: [Tejas] Merging the data in two files using a hash

This line:

Code
$curr_key eq $prev_key ? 1 : 0;

in the compare2 subroutine is wrong. Try what I had:


Code
return $curr_key cmp $prev_key;


The main part of the program needs to receive 0 from the subroutine if the keys are equal (and you are just doing the opposite), -1 or +1 if they are not equal (depending on which is higher than the other). That's what the cmp operator does.
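
To see the difference, here is a small sketch that feeds both versions of the comparison into the same three-way decision as the main loop. The sub names compare_eq, compare_cmp and branch_for are made up for the demonstration; the two keys are the first keys of your two sample files.

```perl
use strict;
use warnings;

# The comparison as written in the posted program: 1 if equal, 0 otherwise.
sub compare_eq  { my ($k1, $k2) = @_; return $k1 eq $k2 ? 1 : 0 }

# The comparison the main loop expects: -1, 0 or +1.
sub compare_cmp { my ($k1, $k2) = @_; return $k1 cmp $k2 }

# Mirror of the if/else chain in the main loop, returning a label
# instead of printing to a file.
sub branch_for {
    my ($result) = @_;
    return $result > 0 ? "orphan in file 2"
         : $result < 0 ? "orphan in file 1"
         :               "match";
}

my ($k1, $k2) = ("00011Y7JP90APNPKPYB0", "1200011Y7JP90APNPKPYB0");
print "eq version:  ", branch_for(compare_eq($k1, $k2)),  "\n";   # match (wrong)
print "cmp version: ", branch_for(compare_cmp($k1, $k2)), "\n";   # orphan in file 1
```

With the eq version, different keys return 0 and land in the "match" branch, while identical keys return 1 and are treated as orphans of file 2, which is why everything appeared to match.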
