CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
Comparing 2 Huge Csv Files and Find the Mismatched Rows

 



Tejas
User

Sep 26, 2013, 12:35 AM

Post #1 of 4 (1147 views)
Comparing 2 Huge Csv Files and Find the Mismatched Rows Can't Post

Hii

I have two csv files
1) Main Data.csv
2) PrbablyMissed.csv

Data In the Files are in Below Format
MainData.csv
ID1 ,AMT, ID2

CDZNYQ9R8108QR3E3EJ0,3900.00,351
V0Y9WC7YYJ6V8T3DDJM0,3900.00,351
BZ6FD9Q3964VX16EMKY0,3900.00,351
3S5VCXSSS0PPV9V875Q1,3900.00,351
687802764243,399.00,362

Simialrly in PrabbalyMissing_Mismatch.csv

BZ6FD9Q3964VX16EMKY0,900.00,351
3S5VCXSSS0PPV9V875Q1,900.00,351
V0Y9WC7YYJ6V8T3DDJM0,900.00,351
CDZNYQ9R8108QR3E3EJ0,900.00,351

As you can see the output of above files,
The Amounts of first 4 records mismatch (3900 and 900 )in second file and they are not in sequnce too.
Tht is, The First Row of MainData.csv can be in Thousandth row of PrabbalyMissing_Mismatch.csv and these can match or mismatch or might also be missed in the second file.


** ID1 , ID2 will be same in both the files ,only amount mistaches or If missed the that row will not exist at all

I have to compare the First File with Second File and see what records match or mismatch or missed in Second File
So I have to check whether amounts match or mismatch or missed using a perl script .

I have tried varous methods to code but in vain

my $f1 = ("/home/tejaswi/JP_PROD_UST_MISMATCH/MISSING_PROD_UAT.csv");

open FILE1, "$f1" or die "Could not open file MISSING_JP_UAT \n";
my $f2= ("/home/tejaswi/JP_PROD_UST_MISMATCH/MISSING_JP_UAT.csv");

open FILE2, "$f2" or die "Could not open file chm_dirx_bud_29.csv \n";
my $Matched_File=("/home/tejaswi/JP_PROD_UST_MISMATCH/MATCHED_FILE.csv");
#open Matched_File,">$Matched_File" or die "Could not open file chm_dirx_bud_29.csv \n";

open (Matched_File, ">$Matched_File") or die "Cannot open $Matched_File for writing \n";
my $outfile = '/home/tejaswi/JP_PROD_UST_MISMATCH/Output_JP_Mismatch.csv';

my @outlines;
my $y ;
my $outer_text ;
my $inner_text;

foreach (<FILE1>) {
$y = 0;
$outer_text = $_;
# print " $outer_text \n ";
seek(FILE2,0,0);

foreach (<FILE2>) {
$inner_text = $_;
#print " $inner_text \n ";
if($outer_text eq $inner_text) {
$y = 1;
print "Match Found \n";
print Matched_File "ORG: $outer_text \n";
print Matched_File "FNG: $inner_text \n";
last;
}
}

if($y != 1) {
# print "No Match Found \n";
push(@outlines, $outer_text);
}
}

open (OUTFILE, ">$outfile") or die "Cannot open $outfile for writing \n";
print OUTFILE @outlines;
close OUTFILE;

close FILE1;
close FILE2;
close Matched_File;


It works closely, but problem with Missing Records.
i NEED 3 OUTPUT FILES
1.Matched
2.UnMatched
3.Missing

Advance Thanks fo rthe Help :)


2teez
Novice

Sep 26, 2013, 5:55 AM

Post #2 of 4 (1142 views)
Re: [Tejas] Comparing 2 Huge Csv Files and Find the Mismatched Rows [In reply to] Can't Post

Hi Tejas,
Since you are working with a CSV files the rule of thump is to ask you to check how to use a CSV module like Text::CSV_XS http://search.cpan.org/~hmbrand/Text-CSV_XS-1.01/CSV_XS.pm. But one can simply use split function in perl to work this dataset presented like so:


Code
use warnings; 
use strict;
use Inline::Files;

my %id1;

while (<DATA2>) {
chomp;
my @lines = split /,/, $_;
push @{ $id1{ $lines[0] } }, @lines[ 1 .. $#lines ];
}

<DATA1>; # take out the header if not needed

while (<DATA1>) {
chomp;
next if /^$/; # next on blanck line
my @lines2 = split /,/, $_;
if ( exists $id1{ $lines2[0] }) {
if ( $id1{ $lines2[0] }->[0] eq $lines2[1] ) {
print $_,"Matched", $/;
}
else{
print $_,"Mis-matched comparing: ",
$id1{ $lines2[0] }->[0],' with ',$lines2[1], $/;
}
}else{
print $_," doesn't exist in second file",$/;
}
}

__DATA1__
ID1 ,AMT, ID2

CDZNYQ9R8108QR3E3EJ0,3900.00,351
V0Y9WC7YYJ6V8T3DDJM0,3900.00,351
BZ6FD9Q3964VX16EMKY0,3900.00,351
3S5VCXSSS0PPV9V875Q1,3900.00,351
687802764243,399.00,362


__DATA2__
BZ6FD9Q3964VX16EMKY0,900.00,351
3S5VCXSSS0PPV9V875Q1,900.00,351
V0Y9WC7YYJ6V8T3DDJM0,900.00,351
CDZNYQ9R8108QR3E3EJ0,900.00,351


Please, note that for the above codes to work you must have the module Inline::Files installed. If not you might have to open the two files using two separate open functions.

A closer look at the codes above show the use of two similar while loops which some would consider a repetition, so in that case one might do the following:

Code
use warnings; 
use strict;

die "Usage: perlscript.pl file1 file2 " unless @ARGV == 2;
my ( $file1, $file2 ) = @ARGV;

my %id1;

my %file_operation = (
$file1 => sub {
return if $_[0] =~ /^$|^\bID1\b/; # next on blanck line or header
my @lines2 = split /,/, $_[0];
if ( exists $id1{ $lines2[0] } ) {
if ( $id1{ $lines2[0] }->[0] eq $lines2[1] ) {
print $_[0], "Matched", $/;
}
else {
print $_[0], "Mis-matched comparing: ",
$id1{ $lines2[0] }->[0], ' with ', $lines2[1], $/;
}
}
else {
print $_[0], " doesn't exist in second file", $/;
}
},
$file2 => sub {
my @lines = split /,/, $_[0];
push @{ $id1{ $lines[0] } }, @lines[ 1 .. $#lines ];
},
);

open_file( $_, $file_operation{$_} ) for ( $file2, $file1 );

sub open_file {
my ( $filename, $code_ref ) = @_;
open my $fh, '<', $filename or die $!;
while (<$fh>) {
chomp;
$code_ref->($_);
}
}

The basic logic here is load the file to compare into an hash, then stepwise check the hash key with each line of the main file then if you like you can have different arrays or files opened to "put in" those line that matched or otherwise or just print it out like I did in the codes above.
Hope this helps.


(This post was edited by 2teez on Sep 26, 2013, 6:38 AM)


Tejas
User

Oct 2, 2013, 11:47 PM

Post #3 of 4 (1098 views)
Re: [2teez] Comparing 2 Huge Csv Files and Find the Mismatched Rows [In reply to] Can't Post

Wow..
Thts a wonder.
This code dirrectly worked with some minor changes..
Thanks Bro

The only issue i have with the code after seeing the output is
explained below

_DATA1_

ID1 ,AMT, ID2

CDZNYQ9R8108QR3E3EJ0,3900.00,351
CDZNYQ9R8108QR3E3EJ0,-3900.00,351

_DATA2_

CDZNYQ9R8108QR3E3EJ0,3900.00,351

We can see that we have 3900 in both and as per the code it will match obviouslly..
But the issue is the line 2 , as per the code it is mismatched(-3999 <> 3999).
But , the requirement is once a line is matched, that should not be considered.

Codnitions are

1. If a line is matched , that shoudnt be considred as it is matched with some other line.
2 .The line is mismatched,only if the amount differs(Excluding condition 1 ,as it is matched and done)
3.If the whole line doesnt appear(The full line,not just amount) in second file, the it is mismatchde.


Above condtions are getting safisfied, code just needs to handle the first condition
I tried to handle it but couldnt really succeed
Need your help
Thank You


Tejas
User

Oct 11, 2013, 6:21 AM

Post #4 of 4 (1044 views)
Re: [Tejas] Comparing 2 Huge Csv Files and Find the Mismatched Rows [In reply to] Can't Post

can some one with the coditiions i have above

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives