


jb60606
New User

Apr 6, 2013, 7:31 PM


Comparing two CSV files

To give you a little background on the task at hand: I have two files of comma-separated market data, generated by a vendor's software. File #1 was generated by the old version of the software, and File #2 by the new version. Each line of each file contains 10 to 15 comma-separated "fields" describing one quote, one trade, or one custom message (see below). The two versions may have been started one after the other rather than simultaneously, so the top-most lines will almost never match, but the files will eventually sync up. After syncing, the data in each field should be identical to its counterpart in the other file.

Additionally, and to further complicate things :)), there could be gaps in the data of either file, taking the two briefly out of sync again.
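One way to keep the start-up offset from flooding the output is to report a record as missing only when its sequence number falls inside the range that both files cover. The sketch below is only an illustration: the helper name `overlap_window` and the toy sequence numbers are made up, and it assumes sequence numbers increase over time within one message type, so in practice the window would be computed separately for quotes and for trades.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: given the sequence numbers seen in each file,
# return the (low, high) range covered by both, or an empty list if none.
sub overlap_window {
    my ($seqs_a, $seqs_b) = @_;
    return unless @$seqs_a && @$seqs_b;
    my $low  = max(min(@$seqs_a), min(@$seqs_b));
    my $high = min(max(@$seqs_a), max(@$seqs_b));
    return $low <= $high ? ($low, $high) : ();
}

# Toy sequence numbers (not real data): file 1 started earlier,
# and file 2 has a gap at 104.
my @old = (100, 101, 102, 103, 104, 105);
my @new = (102, 103, 105, 106);

my ($low, $high) = overlap_window(\@old, \@new);
my %in_new = map { $_ => 1 } @new;

# Only records inside the shared window count as genuinely missing.
my @missing = grep { $_ >= $low && $_ <= $high && !$in_new{$_} } @old;
print "window: $low-$high, missing from new file: @missing\n";
```

With these toy numbers, 100 and 101 are ignored because file 2 had not started yet, while 104 is still reported as a real gap.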


Code
MSFT,13745219,Q,14:32:31.610000,Q,NORMAL,28.640000,192,28.650000,204,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q 
MSFT,977623,T,14:32:31.707000,UNKNOWN,28.644500,186,0,25218929,D,U
MSFT,977627,T,14:32:31.770000,UNKNOWN,28.640000,100,0,25219029,D,U
MSFT,13745382,Q,14:32:32.176000,Q,NORMAL,28.640000,190,28.650000,204,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,13745839,Q,14:32:33.266000,Q,NORMAL,28.640000,190,28.650000,203,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,13746391,Q,14:32:34.267000,Q,NORMAL,28.640000,188,28.650000,203,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,977695,T,14:32:35.167000,UNKNOWN,28.645000,100,0,25219129,D,U
MSFT,13746656,Q,14:32:35.268000,Q,NORMAL,28.640000,188,28.650000,204,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,977698,T,14:32:35.388000,UNKNOWN,28.650000,100,0,25219229,D,U
MSFT,977701,T,14:32:35.695000,UNKNOWN,28.647300,100,0,25219329,D,U


Note: I purposely left out the "Custom" type of message in the above data sample. It's in a different format, and would only complicate the problem at this time.
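For reference, here is a minimal sketch of how lines like the ones above could be stored in a hash keyed by sequence number, with each entry holding the line number and an array of fields. The `"type:seq"` key is an assumption on my part: in the sample, quotes and trades appear to use separate sequence counters, so the message type is included to keep the two from colliding. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Two lines copied from the sample above.
my @lines = (
    'MSFT,977623,T,14:32:31.707000,UNKNOWN,28.644500,186,0,25218929,D,U',
    'MSFT,13745382,Q,14:32:32.176000,Q,NORMAL,28.640000,190,28.650000,204,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q',
);

my %index;          # "type:seq" => { line => N, fields => [ ... ] }
my $line_no = 0;
for my $line (@lines) {
    $line_no++;
    $line =~ s/\s+\z//;                 # drop trailing whitespace and newline
    my @fields = split /,/, $line, -1;  # -1 keeps trailing empty fields
    my ($seq, $type) = @fields[1, 2];   # second and third fields
    $index{"$type:$seq"} = { line => $line_no, fields => \@fields };
}

# Reaching into the structure: hash key, then hash key, then array index.
my $rec = $index{'T:977623'};
print "line $rec->{line}, field 5 is $rec->{fields}[5]\n";
print "field count: ", scalar @{ $rec->{fields} }, "\n";
```

Storing a fresh `\@fields` reference per line is what keeps every line's data separate instead of overwriting a single set of values.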

I want to compare the two files to confirm the following:

1.) Using the "sequence number" (the second field on each line) as a key, check whether each quote or trade in File #1 is in File #2.
2.) If it's found, continue to compare the remaining fields of that line (the bid/ask/volume/etc). If they're identical, move on to the next quote/trade. If a field is NOT identical to its counterpart, print out the line number, field and data that differs (this must be in a simple format, like CSV).
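The two steps above can be sketched roughly as follows. This is not a finished solution: the function names `index_lines` and `compare_indexes`, the `"type:seq"` key, and the report layout (old line, new line, key, field index, old value, new value) are all assumptions for illustration, and the second data set has one quantity changed by hand to force a difference. It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Build "type:seq" => { line => N, fields => [ ... ] } from a list of lines.
sub index_lines {
    my @lines = @_;
    my (%index, $n);
    for my $line (@lines) {
        $n++;
        (my $clean = $line) =~ s/\s+\z//;   # strip trailing whitespace
        my @f = split /,/, $clean, -1;
        $index{"$f[2]:$f[1]"} = { line => $n, fields => \@f };
    }
    return \%index;
}

# Return CSV rows: old_line,new_line,key,field_index,old_value,new_value
sub compare_indexes {
    my ($old, $new) = @_;
    my @report;
    # Walk the old file's records in their original line order.
    for my $key (sort { $old->{$a}{line} <=> $old->{$b}{line} } keys %$old) {
        my $o = $old->{$key};
        my $n = $new->{$key};
        if (!$n) {                          # step 1: key not in the new file
            push @report, join ',', $o->{line}, '', $key, 'MISSING', '', '';
            next;
        }
        # Step 2: compare field by field, up to the longer of the two lines.
        my $last = $#{ $o->{fields} } > $#{ $n->{fields} }
                 ? $#{ $o->{fields} } : $#{ $n->{fields} };
        for my $i (0 .. $last) {
            my $ov = $o->{fields}[$i] // '';
            my $nv = $n->{fields}[$i] // '';
            next if $ov eq $nv;             # string comparison
            push @report, join ',', $o->{line}, $n->{line}, $key, $i, $ov, $nv;
        }
    }
    return @report;
}

my $old = index_lines(
    'MSFT,977623,T,14:32:31.707000,UNKNOWN,28.644500,186,0,25218929,D,U',
    'MSFT,977627,T,14:32:31.770000,UNKNOWN,28.640000,100,0,25219029,D,U',
);
my $new = index_lines(   # quantity altered by hand for the example
    'MSFT,977627,T,14:32:31.770000,UNKNOWN,28.640000,150,0,25219029,D,U',
);
print "$_\n" for compare_indexes($old, $new);
```

Fields are compared as strings, so `28.640000` and `28.64` would be reported as different; if the two software versions format numbers differently, a numeric comparison for the price fields may be a better fit.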

I'm very new to Perl, and have only used it for such tasks as sifting through a single CSV file to extract certain information. I've been pulling my hair out trying to figure out how to compare two files.

I've been able to push each file into its own hash, extract the keys, and check whether each key exists in the other hash, but I have no idea how to compare the hashes line by line and then field by field. The syntax for working with hashes is completely foreign to me. Can anyone help?

e.g.

Code
use warnings;
use strict;

my $inFile01 = "CME.ESM3.MKD01.out";
my $inFile02 = "CME.ESM3.MKD11.out";

open(my $fh01, '<', $inFile01)
    or die("Can't open input file \"$inFile01\": $!\n");

my %hash01;
my $count01 = 0; # line counter for file 1
while (my $line = <$fh01>) {
    $line =~ s/\s*\z//; # strip the newline and any trailing whitespace
    my @tokens = split /,/, $line;
    my $symbol = shift @tokens;
    my $qsymbol = "$symbol-$count01"; # key: symbol plus line counter
    # These named keys are overwritten on every line, so after the loop
    # they only hold the values from the last line read.
    $hash01{seqNum} = $tokens[0];
    $hash01{type} = $tokens[1];
    $hash01{timeStamp} = $tokens[2];
    $hash01{status} = $tokens[4];
    $hash01{bid} = $tokens[5];
    $hash01{bidQty} = $tokens[6];
    $hash01{ask} = $tokens[7];
    $hash01{askQty} = $tokens[8];
    $hash01{$qsymbol} = \@tokens; # all fields of this line, minus the symbol
    $count01++;
}

close $fh01;


open(my $fh02, '<', $inFile02)
    or die("Can't open input file \"$inFile02\": $!\n");

my %hash02;
my $count02 = 0; # line counter for file 2
while (my $line02 = <$fh02>) {
    $line02 =~ s/\s*\z//;
    my @tokens02 = split /,/, $line02;
    my $symbol02 = shift @tokens02;
    my $qsymbol02 = "$symbol02-$count02";
    $hash02{seqNum} = $tokens02[0];
    $hash02{type} = $tokens02[1];
    $hash02{timeStamp} = $tokens02[2];
    $hash02{status} = $tokens02[4];
    $hash02{bid} = $tokens02[5];
    $hash02{bidQty} = $tokens02[6];
    $hash02{ask} = $tokens02[7];
    $hash02{askQty} = $tokens02[8];
    $hash02{$qsymbol02} = \@tokens02;
    $count02++;
}

close $fh02;

for (keys %hash01) {
    unless (exists $hash02{$_}) {
        print "$_: not found in second hash\n";
        next;
    }
    # This is where I'm stuck: comparing the fields of $hash01{$_}
    # against $hash02{$_}.
}



P.S. I should mention that I also tried putting the second file in a hash, and the first file in an array, with the intention of looping through the array to check if the sequence number exists in the hash. This was no problem, but comparing the additional fields was again way over my head.


(This post was edited by jb60606 on Apr 6, 2013, 7:37 PM)

