Compare each and every value from two files

 



Tejas
User

May 30, 2014, 4:31 AM

Post #1 of 28 (7677 views)
Compare each and every value from two files Can't Post


Quote
File1.txt
1879726181|01-MAY-14|351|HZN73J000T8NZ6F559K0|107.54|ULT|H1|1|300
569925790|01-MAY-14|351|Q9HBZTJKCZXTYH14ENF1|99|ULT|H1|1|300
1473807211|01-MAY-14|351|5NKHH50DHHC16VQ7Z671|99|ULT|H1|1|300
6499734605|01-MAY-14|351|WWENATJ49M5GRNF557W1|99|ULT|H1|1|300
1096350090|01-MAY-14|351|GD9TC74NT9VQG4C1F2P0|107.91|ULT|H1|1|300
644200090|01-MAY-14|351|X1RHWCK4QQGRFZZJ5VV0|99|ULT|H1|1|300
767741221|01-MAY-14|351|ZJ9XCWF89BE3CZF564P0|99|ULT|H1|1|300
1365594481|01-MAY-14|351|1M5SQ5VJFBR0SJ9QR130|99|ULT|H1|1|300
14826739705|01-MAY-14|351|C92MRXRP89DZT7DGNWV1|107.17|ULT|H1|1|300
804480960|01-MAY-14|351|Q9R62YZM8C80969HHY20|106.92|ULT|H1|1|300
14826624105|01-MAY-14|351|393ZET33XYQQNBHB4YV0|99|ULT|H1|1|300
2044233751|01-MAY-14|351|EGQDEADGJ88QMQCCTQN0|99|ULT|H1|1|300
1195029191|01-MAY-14|351|YGW22XJKBE7FXP829FT0|99|ULT|H1|1|300
568081611|01-MAY-14|351|FBJDA702EKJHV091GXC0|99|ULT|H1|1|300



Quote
File2.txt
1879726181|01-MAY-14|351|HZN73J000T8NZ6F559K0|107.54|ULT|H1|1|300
569925790|01-MAY-14|351|Q9HBZTJKCZXTYH14ENF1|99|ULT|H1|1|300
1473807211|01-MAY-14|351|5NKHH50DHHC16VQ7Z671|99|ULT|H1|1|300
6499734605|01-MAY-14|351|WWENATJ49M5GRNF557W1|99|ULT|H1|1|300
1096350090|01-MAY-14|351|GD9TC74NT9VQG4C1F2P0|107.91|ULT|H1|1|300
644200090|01-MAY-14|351|X1RHWCK4QQGRFZZJ5VV0|49|ULT|H1|1|300
32564221|01-MAY-14|351|ZJ9XCWF89BE3CZF564P0|49|ULT|H1|1|300
325644481|01-MAY-14|351|1M5SQ5VJFBR0SJ9QR130|39|ULT|H1|1|300
3256439705|01-MAY-14|351|C92MRXRP89DZT7DGNWV1|107.17|ULT|H1|1|300
32564960|01-MAY-14|351|Q9R62YZM8C80969HHY20|126.92|ULT|H1|1|300
14826624105|01-MAY-14|351|393ZET33XYQQNBHB4YV0|39|ULT|H1|1|300
2044233751|01-MAY-14|351|EGQDEADGJ88QMQCCTQN0|99|ULT|H1|1|300
1195029191|01-MAY-14|351|YGW22XJKBE7FXP829FT0|99|ULT|H1|1|300
568081611|01-MAY-14|351|FBJDA702EKJHV091GXC0|99|ULT|H1|1|300


As shown, row[3] holds the IDs, which are the same in both files.
I have to compare both files, check which values mismatch, and report the details.

Ex: the 7th entry has a row[0] mismatch and a row[4] mismatch.

The output should be

Mismatch.txt

767741221|01-MAY-14|351|ZJ9XCWF89BE3CZF564P0|99|ULT|H1|1|300 has row[0] mismatch and row[4] mismatch.

Is there any other understandable way to represent these mismatches?


Is there an easy and smart way to test this?

I have tried testing this with a hash approach,
but the result is really not understandable.


Code
# File2 is loaded into %hash_cuat and File1 into %hash_vppcl
foreach my $h_esid (keys(%hash_cuat))
{
    #print "$h_esid\n";
    foreach my $h_etid (keys %{$hash_cuat{$h_esid}})
    {
        #print "$h_etid\n";
        if (not exists ($hash_vppcl{$h_esid}{$h_etid}))
        {
            $res = `grep $h_esid $cuat_txns | grep $h_etid >> FOUND_UAT_MISSING_PROD.txt`;
            next;
        }
        foreach my $h_cust_id (keys %{$hash_cuat{$h_esid}{$h_etid}})
        {
            # for a variance in cust id, add it to the variance hash as "cust_id_var" = 'Y' and move to the next element in the loop
            if (not exists ($hash_vppcl{$h_esid}{$h_etid}{$h_cust_id}))
            {
                $variance_hash{$h_esid}{$h_etid}{'cust_id_var'} = 'Y'; next;
            }
            foreach my $h_amount (keys %{$hash_cuat{$h_esid}{$h_etid}{$h_cust_id}})
            {
                if (not exists ($hash_vppcl{$h_esid}{$h_etid}{$h_cust_id}{$h_amount}))
                {
                    $variance_hash{$h_esid}{$h_etid}{'prod_amount'} = $h_amount;
                    $variance_hash{$h_esid}{$h_etid}{'cuat_amount'} = (keys %{$hash_vppcl{$h_esid}{$h_etid}{$h_cust_id}})[0]; ####
                    next;
                }
                foreach my $h_date (keys %{$hash_cuat{$h_esid}{$h_etid}{$h_cust_id}{$h_amount}})
                {
                    if (exists ($hash_vppcl{$h_esid}{$h_etid}{$h_cust_id}{$h_amount}{$h_date}))
                    {
                        print PRESENT "$h_cust_id|$h_date|$h_etid|$h_esid|$h_amount\n";
                    }
                    else
                    {
                        $variance_hash{$h_esid}{$h_etid}{'entry_date_var'} = 'Y';
                    }
                }
            }
        }
    }
}





Tejas
User

May 30, 2014, 5:51 AM

Post #2 of 28 (7657 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

And this approach comes out of the loop after the first mismatch.
I would like to get all the mismatches of a particular line.


Laurent_R
Veteran / Moderator

May 30, 2014, 6:04 AM

Post #3 of 28 (7653 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

This is the approach I might take. Read one of the files and store it in a simple hash where the key is the first field and the value the entire line.

Then read the other file, get the key (the first field), use the key to look up into the hash and compare the entire lines. If they match, you are done with these lines. If they don't match, then split the 2 lines into two arrays and compare the elements of the arrays to figure out where the mismatch is. This should lead you to much less code and much faster execution.

This is a very quick and incomplete code for the important parts (untested):

Code
my %hash;
while (my $line = <$FILE1>) {
    chomp $line;
    my $key = (split /\|/, $line, 2)[0];
    $hash{$key} = $line;
}

And, later in the program:

Code
while (my $line2 = <$FILE2>) {
    chomp $line2;
    my $key = (split /\|/, $line2, 2)[0];
    # ($line below is a slip for $line2, as confirmed in post #7)
    print "missing line: $line \n" and next unless defined $hash{$key};
    next if $line2 eq $hash{$key};
    # if we get here, the lines are different; let's compare the fields
    my @array1 = split /\|/, $hash{$key};
    my @array2 = split /\|/, $line2;
    for my $i (0..$#array1) {
        print "something" if $array1[$i] ne $array2[$i]
    }
}

There are a few more things to be done (such as deleting from the hash the lines that have been compared, so that at the end, the hash contains the lines missing in the second file), but you have a basic skeleton to work on.
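
As a rough illustration of the skeleton above, here is a minimal self-contained sketch that also deletes compared entries from the hash, so that whatever remains at the end is missing from the second file. The sub name compare_by_key, the returned structure, and the three-field sample data are assumptions made for this example and do not come from the original posts; it also assumes the first field is a unique key.

```perl
use strict;
use warnings;

# Compare two pipe-delimited inputs keyed on the first field.
# Returns three array refs: differing records, lines only in file2, lines only in file1.
sub compare_by_key {
    my ($fh1, $fh2) = @_;
    my %hash;
    while (my $line = <$fh1>) {
        chomp $line;
        my $key = (split /\|/, $line, 2)[0];
        $hash{$key} = $line;
    }
    my (@diffs, @only_in_2);
    while (my $line2 = <$fh2>) {
        chomp $line2;
        my $key = (split /\|/, $line2, 2)[0];
        # delete returns the stored line (or undef) and removes the entry
        my $line1 = delete $hash{$key};
        if (not defined $line1) {
            push @only_in_2, $line2;
            next;
        }
        next if $line1 eq $line2;
        # limit -1 keeps trailing empty fields
        my @array1 = split /\|/, $line1, -1;
        my @array2 = split /\|/, $line2, -1;
        my $last   = $#array1 > $#array2 ? $#array1 : $#array2;
        my @fields = grep { ($array1[$_] // '') ne ($array2[$_] // '') } 0 .. $last;
        push @diffs, { key => $key, fields => \@fields };
    }
    # whatever is left in the hash was never seen in file2
    my @only_in_1 = map { $hash{$_} } sort keys %hash;
    return (\@diffs, \@only_in_2, \@only_in_1);
}

# Small in-memory demo (the sample data is made up for illustration)
my $file1 = "1|a|99\n2|b|99\n3|c|99\n";
my $file2 = "1|a|99\n2|x|49\n4|d|99\n";
open my $fh1, '<', \$file1 or die $!;
open my $fh2, '<', \$file2 or die $!;
my ($diffs, $only2, $only1) = compare_by_key($fh1, $fh2);
print "key $_->{key}: fields @{ $_->{fields} } differ\n" for @$diffs;
print "only in file2: $_\n" for @$only2;
print "only in file1: $_\n" for @$only1;
```

Because entries are deleted once compared, a key that appears twice in the second file is reported as missing the second time, so this only holds up while keys are unique.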


Tejas
User

May 30, 2014, 6:15 AM

Post #4 of 28 (7649 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

But here, is there a way to know which field or fields are not matching

This is actually to provide stats: if one or more values don't match, all of them have to be reported, along with the original value.


Tejas
User

May 30, 2014, 6:25 AM

Post #5 of 28 (7644 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

Hi

Also, can you tell me
what the ", 2" and the "[0]" do in the code below?

I mean, how does
(my $key = (split /\|/, $line)[0] ; )
differ from
(my $key = (split /\|/, $line, 2)[0];)

And

should this be
print "missing line: $line \n" and next unless defined $hash{$key}; (in your snippet)
or
print "missing line: $line2 \n" and next unless defined $hash{$key};
Thanks
Tejas


(This post was edited by Tejas on May 30, 2014, 6:50 AM)


Laurent_R
Veteran / Moderator

May 30, 2014, 7:41 AM

Post #6 of 28 (7612 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post


Quote
But here, is there a way to know which field or fields are not matching


Sure, I made it simple. But if you change:


Code
for my $i (0..$#array1) {
    print "something" if $array1[$i] ne $array2[$i]
}


to something like this:


Code
my @errors;
for my $i (0..$#array1) {
    push @errors, $i if $array1[$i] ne $array2[$i];
}
print "There are some differences on fields # @errors \n";

You basically get your desired output.


Laurent_R
Veteran / Moderator

May 30, 2014, 7:56 AM

Post #7 of 28 (7608 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post


In Reply To
what is the importance of ",2 "and" [0]" in the below code

I meant how does it differ for
(my $key = (split /\|/, $line)[0] ; )
and
(my $key = (split /\|/, $line, 2)[0];)

In this case, there is really no functional difference between:

Code
my $key = (split /\|/, $line)[0];

and:

Code
my $key = (split /\|/, $line, 2)[0];

It is just that the first version splits the whole line and then takes only the first field, whereas the second version only splits the line into two components, the key and the rest; since it does not have to do the work of splitting the rest of the line, the second version is likely to be slightly faster.
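
A small sketch of that difference, using one of the sample lines from the first post; the variable names @all and @two are only for illustration.

```perl
use strict;
use warnings;

my $line = '569925790|01-MAY-14|351|Q9HBZTJKCZXTYH14ENF1|99|ULT|H1|1|300';

my @all = split /\|/, $line;       # every field is split out
my @two = split /\|/, $line, 2;    # the key, then the rest left unsplit

print scalar(@all), " fields without a limit\n";
print scalar(@two), " fields with a limit of 2\n";
print "key:  $two[0]\n";
print "rest: $two[1]\n";
```

Either way, element [0] is the same key; the limit only changes how much of the remainder gets split.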


In Reply To
Is this
print "missing line: $line \n" and next unless defined $hash{$key}; (In ur snippet)
or
print "missing line: $line2 \n" and next unless defined $hash{$key};

You are right, this is a mistake, it should be $line2.


Tejas
User

May 30, 2014, 8:02 AM

Post #8 of 28 (7606 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

Yes, I've tested it and it's working great.

All that I need now is to fine-tune the output file,

just like below:
ID FILE1_INDEX1 FILE2_INDEX1 FILE1_INDEX2 FILE2_INDEX2
ABC 123 1234 99.00 49.00



And the output I am getting is

There are some differences on fields # JPJX611DP2TCEWFTSK51|01-MAY-14345|01-MAY-14
There are some differences on fields # B0V913S3XXM1RVNXA7G0|01-MAY-14345|01-MAY-14 B0V913S3XXM1RVNXA7G0|99|109

The transaction above has a date and an amount mismatch, and it has to show on one line.

The column names can be built from

my @value = ('parent_id','Date','tin','primary_key','AMOUNT','CURRENCY','COMPANY_CODE','vacl_id','source_ID');
and we are getting the exact index in the last push,
but the same key is pushed twice due to the two mismatches.

I need just a single line with the values at the corresponding indexes.
Thanks
Tejas


Laurent_R
Veteran / Moderator

May 30, 2014, 8:08 AM

Post #9 of 28 (7604 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

Please show the code you have now. I think that if you use the code I posted in post #6 above, you should have it right.


Tejas
User

May 30, 2014, 8:10 AM

Post #10 of 28 (7601 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

    my @errors;
    for my $i (0..$#array1) {
        # print "$line2 has mismatch in $value[$i] $array1[$i] && $array2[$i] \t" if $array1[$i] ne $array2[$i]
        push @errors, "$key|$array1[$i]|$array2[$i]" if $array1[$i] ne $array2[$i];
    }
    print "There are some differences on fields # @errors \n";

}

Here it is


Laurent_R
Veteran / Moderator

May 30, 2014, 9:30 AM

Post #11 of 28 (7580 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

My understanding is that you only needed the field numbers where there was a difference (plus of course some way of identifying the line). This could be done as follows:


Code
while (my $line2 = <$FILE2>) {
    chomp $line2;
    my $key = (split /\|/, $line2, 2)[0];
    print "missing line: $line2 \n" and next unless defined $hash{$key};
    next if $line2 eq $hash{$key};
    # if we get here, the lines are different; let's compare the fields
    my @array1 = split /\|/, $hash{$key};
    my @array2 = split /\|/, $line2;
    my @errors;
    for my $i (0..$#array1) {
        push @errors, $i if $array1[$i] ne $array2[$i];
    }
    print "There are differences on fields # @errors in line keyed $key\n";
}

which should print something like this:

Code
There are differences on fields # 2 3 5 in line keyed 12345678


Alternatively, if you want to output the full line, you could replace the last code line above with this one:

Code
print "There are differences on fields # @errors in line $line2\n";

Or, if you want to display the fields where the mismatch occurs:

Code
print "There are differences in line $line2 on fields @array2[@errors] \n"; # could also be @array1[@errors] if you prefer

If what you want is none of the above 3 solutions, please specify clearly what you want.
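
For the single-line report asked for in post #8, one possible sketch is to name each mismatching field and show both values next to it. The sub name mismatch_report and the name=value1/value2 layout are assumptions chosen for this example; the field names are the ones listed in post #8, the date in the second sample line was altered by hand to produce two mismatches, and both lines are assumed to have the same number of fields.

```perl
use strict;
use warnings;

# Field names as listed in post #8 (index 0 .. 8)
my @value = ('parent_id','Date','tin','primary_key','AMOUNT','CURRENCY','COMPANY_CODE','vacl_id','source_ID');

# Build one report line: the key, then name=file1_value/file2_value
# for every field that differs between the two lines.
sub mismatch_report {
    my ($key, $line1, $line2) = @_;
    my @array1 = split /\|/, $line1, -1;
    my @array2 = split /\|/, $line2, -1;
    my @errors = grep { $array1[$_] ne $array2[$_] } 0 .. $#array1;
    return join '|', $key, map { "$value[$_]=$array1[$_]/$array2[$_]" } @errors;
}

my $line1  = '644200090|01-MAY-14|351|X1RHWCK4QQGRFZZJ5VV0|99|ULT|H1|1|300';
my $line2  = '644200090|02-MAY-14|351|X1RHWCK4QQGRFZZJ5VV0|49|ULT|H1|1|300';
my $report = mismatch_report('644200090', $line1, $line2);
print "$report\n";
```

With these two sample lines, the report is 644200090|Date=01-MAY-14/02-MAY-14|AMOUNT=99/49, so all mismatches for a key land on one line.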


Tejas
User

May 30, 2014, 9:49 AM

Post #12 of 28 (7572 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

This is good:

for my $i (0..$#array1) {
    push @errors,"$array1[$i]|$array2[$i]|" if $array1[$i] ne $array2[$i];
}
print "$key|@errors\n";

Actually, I wanted an output file with headings

and the mismatched values under them:

ID   AMOUNT(Array1)  AMOUNT(Array2)  DATE(Array1)  DATE(Array2)

XYZ  99              109             13-MAY14      13-MAY-1234
PQR  44              44.90           01-MAY-14     13-MAY-14


Thanks
Tejas


Tejas
User

Jun 1, 2014, 12:39 AM

Post #13 of 28 (7548 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post


Code
print "missing line: $line2 " and next unless defined $hash{$key};

Does this check for the whole line or just the key in the hash?
I have changed it to

Code
print "missing line: $line2 " and next unless exists $hash{$key};

because I check the ID first; if it exists, the line is either the same or different, which we handle in the next two conditions.

This checks the whole line, doesn't it?

Code
next if $line2 eq $hash{$key};



Laurent_R
Veteran / Moderator

Jun 1, 2014, 2:26 AM

Post #14 of 28 (7517 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

On the first question: the whole approach is based on the idea that the first field is the key for comparing the files, which means that if I don't have a hash entry for the first field, then the whole line is considered to be missing.

On the second point: using defined or exists does not really change anything in the code as suggested, exists is just fine if you prefer. It might change things if you take the decision to remove hash entries in the process as I suggested earlier, depending on the way you decide to do it.
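
To make the difference concrete, here is a small sketch: exists asks whether the key is in the hash at all, while defined asks whether its value is something other than undef. The keys and values are invented for the example.

```perl
use strict;
use warnings;

# 'present' has a value, 'empty' is stored with undef, 'absent' is not a key at all
my %hash = (present => 'some line', empty => undef);

for my $key (qw(present empty absent)) {
    printf "%-8s exists: %-3s defined: %s\n",
        $key,
        (exists  $hash{$key} ? 'yes' : 'no'),
        (defined $hash{$key} ? 'yes' : 'no');
}
```

So the two tests only disagree for a key whose value is undef, which is what you would get if compared entries were set to undef instead of being removed with delete.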

On the third point:

Code
next if $line2 eq $hash{$key};


Yes, it compares the whole lines. If they are the same, there is no need to split them into fields and check field by field. This is in theory unnecessary code (you could compare all lines field by field), but this is likely to run significantly faster (especially if many lines are identical and only relatively few have differences).


Tejas
User

Jun 2, 2014, 4:07 AM

Post #15 of 28 (7315 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

There are two edge cases which are not getting handled.

Quote
File1.txt
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700

File2.txt
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700


As the lines are identical, they match.
But File2 has two entries of this type and File1 has just one.
So here the second line is missing from the first file.

But there is no trace of that second entry, as the hash holds only one entry for the key, and the output says 'everything is fine'.
But that's not the case.
So I have tried deleting whatever matches from the hash.

And then this problem occurred.


Quote
File1.txt
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700

File2.txt
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700


If I delete the hash entry after the first match, then although the second line exists, it is reported as missing because the hash entry has been deleted.

And finally,
the third case:

Quote
File1.txt
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700

File2.txt
1488811211|01-MAY-14|351|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700
1488811211|01-MAY-14|301|MM6948YQ3K8X95MP4451|198|dvs|2p|1|700


Here, as I am taking the value at index 3 as the key, the code checks just this key and concludes that
a) the first entry matches
b) the second entry varies
But the second entry is actually missing from the first file.
Only the first entry should match, and the second entry should be reported as missing.
So probably $hash{$row[3]}{$row[2]} would give the right results,
but it again does not handle the above two cases.

I am confused about how to handle these at this juncture.
Your inputs are of great value!

Supposed solutions:
a. Deleting only those entries which are matched. (But if there are two identical entries, how does a hash store them, or does it bother storing them at all?)


If there are 10 identical lines, how many hash entries are created?
As per the code, I see just one :(

Thanks
Tejas

(This post was edited by Tejas on Jun 2, 2014, 4:38 AM)


Laurent_R
Veteran / Moderator

Jun 2, 2014, 10:30 AM

Post #16 of 28 (7244 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

I thought that the keys were unique in your files. If you have duplicates, then it is slightly more complicated.

This might require a little bit more thinking, but what I am thinking of off the top of my head would be to transform the simple hash into a hash of arrays (or quite possibly a hash of hashes). Thus, if the same key occurs several times, you store as many array elements. And, when you read file2, you remove one line from the inner array. I'll try to write out some quick code and come back.


Laurent_R
Veteran / Moderator

Jun 2, 2014, 11:09 AM

Post #17 of 28 (7227 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

Please read very carefully what I am saying at the end of this post (after the code).

This is a very quick and incomplete code for the important parts (untested):


Code
my %hash;
while (my $line = <$FILE1>) {
    chomp $line;
    my $key = (split /\|/, $line, 2)[0];
    push @{$hash{$key}}, $line;
}


And, later in the program:


Code
WHILE_LOOP: while (my $line2 = <$FILE2>) {
    chomp $line2;
    my $key = (split /\|/, $line2, 2)[0];
    print "missing line: $line2 \n" and next unless defined $hash{$key};
    # (the element access in this loop is corrected in post #20)
    for my $i (0..$#{$hash{$key}}) {
        if (${$hash{$key}->[$i]} eq $line2) {
            delete ${$hash{$key}->[$i]};
            next WHILE_LOOP;
        }
    }
    # if we get here, no line was found to be identical but we do not know which one to compare field by field.
    # Can we just pick any?
    my $line1 = shift @{$hash{$key}};
    my @array1 = split /\|/, $line1;
    my @array2 = split /\|/, $line2;
    for my $i (0..$#array1) {
        print "something" if $array1[$i] ne $array2[$i]
    }
}


This is untested and, since it is a bit more complicated than my previous code, there may very well be some mistakes.

As said in the comments in the middle of the code, we have a serious problem: if we have several lines with the same key in file1 and we find an identical one, it is fine. But if we don't find an identical one, then we do not know which one to pick for the field-by-field comparison, and this is basically unsolvable unless you can give more rules. Here I have just decided that, in that case, I arbitrarily pick the first one.

I work on similar file comparisons very often, and we usually have a two-step process: first remove all duplicates from both files, and only then compare the individual lines, which we know to be unique in each file.
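
The first step of such a two-step process could look like the sketch below. It is only one reading of "remove all duplicates": it drops exact duplicate lines and keeps the first occurrence, so lines that share a key but differ elsewhere are still left in. The sub name unique_lines and the sample data are assumptions for this example.

```perl
use strict;
use warnings;

# Return the lines of an input with exact duplicates dropped,
# keeping the order in which lines were first seen.
sub unique_lines {
    my ($fh) = @_;
    my (%seen, @unique);
    while (my $line = <$fh>) {
        chomp $line;
        # $seen{$line}++ is 0 (false) only the first time a line appears
        push @unique, $line unless $seen{$line}++;
    }
    return @unique;
}

# In-memory demo with made-up data: the first line appears twice
my $data = "1|a|99\n1|a|99\n2|b|49\n";
open my $fh, '<', \$data or die $!;
my @lines = unique_lines($fh);
print "$_\n" for @lines;
```

This keeps every distinct line in memory, so for very large files an external tool such as the Unix sort with its -u option may be the more practical way to do the same thing.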


(This post was edited by Laurent_R on Jun 2, 2014, 11:40 AM)


Tejas
User

Jun 2, 2014, 11:32 AM

Post #18 of 28 (7215 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

I am actually picking up the lines from a database,
so I have grouped by two keys when fetching the data.

And I have changed $hash{$key} = $line to

$hash{$key1}{$key2} = $line;
This seems to work fine as of now.
But your implementation is really good.

My doubts:

Code
push @{$hash{$key}}, $line;

What is this doing? I mean, does the array include all the duplicates?
I just want to know what this line is doing.


Code
for my $i (0..$#{$hash{$key}}) {
    if (${$hash{$key}->[$i]} eq $line2) {   # What are we searching for?
        delete ${$hash{$key}->[$i]};
        next WHILE_LOOP;
    }
}



Thanks for the inputs.
I am just trying to understand Perl more deeply through your inputs.

thanks


Laurent_R
Veteran / Moderator

Jun 2, 2014, 11:57 AM

Post #19 of 28 (7205 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post


Code
push @{$hash{$key}}, $line;


Now, the hash, instead of containing individual lines, contains an array (really a reference to an array) storing all the lines having the same key.

An individual hash entry such as $hash{$key} can now contain something like this: [line_a, line_c, line_s].

You should probably use the Data::Dumper module to visualize the data structure. At the top of your script, add the following line:

Code
use Data::Dumper;


And after the file1 has been read (before starting to read file 2), insert the following line:

Code
print Dumper \%hash;

(Do it with a relatively small input file1.)
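
Put together, a tiny stand-alone version of this looks as follows. The three sample lines are invented, and setting $Data::Dumper::Sortkeys is an optional extra that simply makes the dump appear in a stable key order.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # dump hash keys in sorted order

my %hash;
for my $line ('100|a|99', '200|b|99', '100|c|49') {
    my $key = (split /\|/, $line, 2)[0];
    # the inner array is created automatically the first time a key is seen
    push @{ $hash{$key} }, $line;
}

print Dumper \%hash;
```

The dump shows key 100 pointing to an array with two lines and key 200 pointing to an array with one.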



Code
for my $i (0..$#{$hash{$key}}) {
    if (${$hash{$key}->[$i]} eq $line2) {   # What are we searching for?
        delete ${$hash{$key}->[$i]};
        next WHILE_LOOP;
    }
}

This code (if working correctly) looks at all the lines (with the same key) that are stored in the inner array of the hash entry for this key, and compares them one by one with the current line of file 2. If an identical line is found, then we have a match and can delete that line from the inner array and move on to the next line of file2.

As I said, we have a serious problem if no match was found and the array still contains more than one line: there is just no way to know with which line of the array to compare the current line of file2. I just chose to pick the first one of the array for lack of any better solution, but this can obviously be buggy in some cases.

Now, if using a key composed of two fields can make your lines unique (no duplicate when using this key), then this is obviously much much better, but only you can tell if this is a correct assumption. I do not know your data.


Laurent_R
Veteran / Moderator

Jun 2, 2014, 1:41 PM

Post #20 of 28 (7168 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

Except that I made a mistake and did not use the same data structure for both parts of the code (that's the problem when you can't test). The second part of the code can be simplified as follows:


Code
WHILE_LOOP: while (my $line2 = <$FILE2>) {
    chomp $line2;
    my $key = (split /\|/, $line2, 2)[0];
    # no line (or no remaining line) with this key in file1
    print "missing line: $line2 \n" and next unless $hash{$key} && @{$hash{$key}};
    for my $i (0..$#{$hash{$key}}) {
        if ($hash{$key}[$i] eq $line2) {
            # splice removes the matched line without leaving an undef hole, unlike delete
            splice @{$hash{$key}}, $i, 1;
            next WHILE_LOOP;
        }
    }
    # if we get here, no line was found to be identical but we do not know which one to compare field by field.
    # Can we just pick any?
    my $line1 = shift @{$hash{$key}};
    my @array1 = split /\|/, $line1;
    my @array2 = split /\|/, $line2;
    for my $i (0..$#array1) {
        print "something" if $array1[$i] ne $array2[$i]
    }
}

(Two changes in the way the hash content is accessed in the for loop. I hope it is more or less correct now, but I still cannot test.)


Tejas
User

Jun 9, 2014, 11:10 PM

Post #21 of 28 (5734 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

This works great.
But it's taking quite a lot of time if the file size is > 1 GB.

Do we need to change the approach?

Thanks
Tejas


Laurent_R
Veteran / Moderator

Jun 10, 2014, 10:19 AM

Post #22 of 28 (5573 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

Well, 1 GB starts to be a lot of data and is bound to take some time. Can you say more precisely what you mean by "quite a lot of time"? If it is in the order of, say, 20 to 30 minutes, then there is probably not very much that can be done to significantly improve the run time. If the run time is much larger than that, then, yes, we may have to think of another solution.

One point, though, is this: is one file significantly larger than the other? If so, it is the smaller one that you want to process first and store in your hash, for a number of reasons. The main one is that you may exhaust the memory available on your system, at which point, depending on the OS configuration, your program might just abort, or it may start to write part of the memory onto disk, in which case the program will become very slow.

The best alternate route that I have found so far for comparing very large files (frequently 3 to 5 GB each) is to sort both files according to the comparison key (using the Unix system sort), and then read the files in parallel. Algorithmically, this is a bit more complicated than the hash solution (there are several possible edge cases when reading two files in parallel). For files of about 3.5 GB each, the sorting takes about 10 minutes for each file and the comparison less than 10 minutes, so that the total run time is a bit less than half an hour.

And don't listen to people who tell you to use a database, this is orders of magnitude slower.


Tejas
User

Jun 10, 2014, 10:30 AM

Post #23 of 28 (5562 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post


Quote
And don't listen to people who tell you to use a database, this is orders of magnitude slower.


Even I feel the same.
But can you elaborate on what "orders of magnitude" means?


Quote
The best alternate route that I have found so far for comparing very large files (frequently 3 to 5 GB each) is to sort both files in accordance to the comparison key (using the Unix system sort), and then read the files in parallel. Algorithmically, this is a bit more complicated than the hash solution (there are several possible edge cases when reading two files in parallel). For files of about 3.5 GB each, the sorting takes about 10 minutes for each file and the comparison less than 10 minutes, so that the total run time is a bit less than half an hour


Can you please add a code snippet showing how to read files in parallel?
Are you pointing at the process and threading concept?
I would be more than pleased to understand the concept.

Also, sir,
did you have a look at

http://perlguru.com/gforum.cgi?post=78880;sb=post_latest_reply;so=ASC;forum_view=forum_view_collapsed;;page=unread#unread
Thanks
Tejas


Laurent_R
Veteran / Moderator

Jun 10, 2014, 11:01 AM

Post #24 of 28 (5539 views)
Re: [Tejas] Compare each and every value from two files [In reply to] Can't Post

One order of magnitude is a factor of 10. Two orders of magnitude is a factor of 100, and so on.



Quote
Are you pointing at Process and Threading Concept?


Not at all. I am speaking of a single process that reads the first line of each of the two files (sorted in ascending order) and compares their keys. If the keys are equal, then we are on matching records and we can compare the rest of the data.

Let's call A the line of file A and B the line of file B. If the key of line A is greater than the key of line B, then it means that line B is a "B-orphan" (a line that is in file B but not in file A), which I will store in a B-orphan file. Then I read the next line of file B and do the comparison again, and so on.

Reciprocally, if the key of line B is greater than the key of line A, we have an "A-orphan" (a line that is in file A but not in file B). I store line A in the A-orphan file. Then I read the next line of file A and do the comparison again, and so forth.

And you do that until you reach the end of both files (there are some slightly tricky edge cases when reaching the end of one file and not being at the end of the other one).

The only problem with this approach is the one I already mentioned earlier with the hash approach (and with any approach, for that matter): you can't do anything with duplicates, the keys have to be unique within each file, because otherwise you don't know what you are really comparing. So either you can find a key that will be unique, or you need some form of preprocessing to remove duplicates.
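
The steps described above can be sketched roughly as follows. This is not the module mentioned in this post, only a minimal illustration under stated assumptions: the key is the first pipe-delimited field, keys are unique within each file, and both inputs are sorted in the same byte order that Perl's cmp uses (for example with LC_ALL=C sort -t'|' -k1,1). The sub names next_record and merge_compare, the returned lists, and the sample data are all invented for the example; a real program would write the orphans to files rather than collect them in memory.

```perl
use strict;
use warnings;

# Read the next line from a handle and return (key, line), or an empty list at EOF.
sub next_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    my ($key) = split /\|/, $line, 2;
    return ($key // '', $line);
}

# Walk two inputs sorted on the first field, with keys unique in each.
sub merge_compare {
    my ($fh_a, $fh_b) = @_;
    my (@a_orphans, @b_orphans, @differ);
    my ($key_a, $line_a) = next_record($fh_a);
    my ($key_b, $line_b) = next_record($fh_b);
    while (defined $key_a and defined $key_b) {
        my $cmp = $key_a cmp $key_b;
        if ($cmp == 0) {        # same key: compare the records, advance both
            push @differ, $key_a if $line_a ne $line_b;
            ($key_a, $line_a) = next_record($fh_a);
            ($key_b, $line_b) = next_record($fh_b);
        }
        elsif ($cmp > 0) {      # key A is greater: line B has no partner in A
            push @b_orphans, $line_b;
            ($key_b, $line_b) = next_record($fh_b);
        }
        else {                  # key B is greater: line A has no partner in B
            push @a_orphans, $line_a;
            ($key_a, $line_a) = next_record($fh_a);
        }
    }
    # one file is exhausted: everything left in the other one is an orphan
    while (defined $key_a) {
        push @a_orphans, $line_a;
        ($key_a, $line_a) = next_record($fh_a);
    }
    while (defined $key_b) {
        push @b_orphans, $line_b;
        ($key_b, $line_b) = next_record($fh_b);
    }
    return (\@a_orphans, \@b_orphans, \@differ);
}

# In-memory demo with made-up, already sorted data
my $file_a = "100|x\n200|y\n300|z\n500|q\n";
my $file_b = "100|x\n300|w\n400|v\n";
open my $fh_a, '<', \$file_a or die $!;
open my $fh_b, '<', \$file_b or die $!;
my ($a_orphans, $b_orphans, $differ) = merge_compare($fh_a, $fh_b);
print "A orphans: @$a_orphans\n";
print "B orphans: @$b_orphans\n";
print "differing keys: @$differ\n";
```

Only one line per file is held in memory at a time, which is why this scales to files that would not fit in a hash. The two trailing loops handle the end-of-file edge case mentioned above.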

I have written a module to do this parallel reading of files, I can give you some code, but this will have to be later, I have no time right now.

You did not answer my question about how long it really takes with 1 GB files, this is important to me to evaluate the best possible options.


Tejas
User

Jun 10, 2014, 11:12 AM

Post #25 of 28 (5529 views)
Re: [Laurent_R] Compare each and every value from two files [In reply to] Can't Post

Sorry for that.
It's around 5 minutes for comparing two files of 1.5 GB.
Actually, I compared the same file against a duplicate copy of itself,
just to analyze the max time.
It's around 5 minutes.

Laurent, can you help me out in the other
post?
