CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Intermediate:
Comparing two CSV files

 



jb60606
New User

Apr 6, 2013, 7:31 PM

Post #1 of 8 (1384 views)
Comparing two CSV files Can't Post

To give you a little background on the task at hand, I have two files containing comma separated market data, generated by a vendor's software. File #1 was generated by the vendor's "old version" of their software. File #2 was generated by the vendor's new version of their software. Each line of each file will contain 10 to 15 comma separated "fields" describing the specifications of one quote, or one trade, or one custom message (see below). The software may have been started in succession rather than simultaneously, so that top-most lines will almost never match, but will eventually sync up. After syncing, the data in each field should be identical to its counter in the other file.

Additionally, and to further complicate things :)), there could be gaps in the data of either file, taking the two briefly out of sync again.


Code
MSFT,13745219,Q,14:32:31.610000,Q,NORMAL,28.640000,192,28.650000,204,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q 
MSFT,977623,T,14:32:31.707000,UNKNOWN,28.644500,186,0,25218929,D,U
MSFT,977627,T,14:32:31.770000,UNKNOWN,28.640000,100,0,25219029,D,U
MSFT,13745382,Q,14:32:32.176000,Q,NORMAL,28.640000,190,28.650000,204,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,13745839,Q,14:32:33.266000,Q,NORMAL,28.640000,190,28.650000,203,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,13746391,Q,14:32:34.267000,Q,NORMAL,28.640000,188,28.650000,203,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,977695,T,14:32:35.167000,UNKNOWN,28.645000,100,0,25219129,D,U
MSFT,13746656,Q,14:32:35.268000,Q,NORMAL,28.640000,188,28.650000,204,0.000000,0.000000,0.000000,-1,-1,Y,Q,Q
MSFT,977698,T,14:32:35.388000,UNKNOWN,28.650000,100,0,25219229,D,U
MSFT,977701,T,14:32:35.695000,UNKNOWN,28.647300,100,0,25219329,D,U


Note: I purposely left out the "Custom" type of message in the above data sample. It's in a different format, and would only convolute the problem at this time.

My desire is to compare each file to confirm the following:

1.) using the "sequence number" (the second field on each line) as a key, check if each quote or trade in File #1 is in File #2.
2.) If it's found, continue to compare the remaining fields of that line (the bid/ask/volume/etc). If they're identical, move onto the next quote/trade. If a field is NOT identical to its counterpart, print out the line number, field and data that differs (this must be in a simple format, like CSV).

I'm very new to Perl, and have only used it for such tasks as sifting through a single CSV file to extract certain information. I've been pulling my hair out trying to figure out how to compare two files.

I've been able to push each file into their own unique hash and how to extract keys and check if each key exists in the other hash, but i'm completely oblivious on how to compare each hash line by line then field by field. The syntax of working with hashes is absolutely foreign to me. Can anyone help?

e.g.

Code
use warnings; 
use strict;

my $inFile01 = "CME.ESM3.MKD01.out";
my $inFile02 = "CME.ESM3.MKD11.out";

open(DATA, '<', $inFile01)
or die("Can't open input file \"$inFile01\": $!\n");

my %hash01;
my $count01 = 0; # start the incrementer
while (my $line = <DATA01>) {
$line =~ s/\s*\z//;
my @tokens = split /,/, $line;
my $symbol = shift @tokens;
my $qsymbol = "$symbol-$count01";
$hash01{seqNum} = $tokens[0];
$hash01{type} = $tokens[1];
$hash01{timeStamp} = $tokens[2];
$hash01{status} = $tokens[4];
$hash01{bid} = $tokens[5];
$hash01{bidQty} = $tokens[6];
$hash01{ask} = $tokens[7];
$hash01{askQty} = $tokens[8];
$hash01{$qsymbol} = \@tokens;
$count01 ++;

close DATA01;


open(DATA02, '<', $inFile02) or die("Can't open input file \"$inFile02\": $!\n");

my %hash02;
my $count02 = 0;
while (my $line02 = <DATA02>) {
$line02 =~ s/\s*\z//;
my @tokens02 = split /,/, $line02;
my $symbol02 = shift @tokens02;
my $qsymbol02 = "$symbol02-$count02";
$hash02{seqNum} = $tokens02[0];
$hash02{type} = $tokens02[1];
$hash02{timeStamp} = $tokens02[2];
$hash02{status} = $tokens02[4];
$hash02{bid} = $tokens02[5];
$hash02{bidVol} = $tokens02[6];
$hash02{ask} = $tokens02[7];
$hash02{askVol} = $tokens02[8];
$hash02{$qsymbol02} = \@tokens02;

close DATA02;

for ( keys %hash01){
unless ( exists $hash02{$_}) {
print "$_: not found in second hash\n";
next;
}



P.S. I should mention that I also tried putting the second file in a hash, and the first file in an array, with the intention of looping through the array to check if the sequence number exists in the hash. This was no problem, but comparing the additional fields was again way over my head.


(This post was edited by jb60606 on Apr 6, 2013, 7:37 PM)


Kenosis
User

Apr 6, 2013, 8:47 PM

Post #2 of 8 (1367 views)
Re: [jb60606] Comparing two CSV files [In reply to] Can't Post

Perhaps the following will be helpful:


Code
use strict; 
use warnings;

my $inFile02 = pop;
my %hash;

while (<>) {
chomp;
my @fields = split /,/;
$hash{ $fields[1] } = \@fields if @fields;
}

push @ARGV, $inFile02;

while (<>) {
chomp;
my @fields = split /,/;
next unless $hash{ $fields[1] };

my @results;
for my $i ( 0 .. @fields - 1 ) {
push @results, "$i|$hash{ $fields[1] }[$i]|$fields[$i]"
if $hash{ $fields[1] }[$i] ne $fields[$i];
}

print "$.," . ( join ',', @results ) . "\n" if @results;
}


Usage: perl script.pl inFile01 inFile02 [>outFile]

The last, optional parameter directs output to a file.

Using your dataset for two files, I changes just a couple of fields. Here's the output:


Code
4,6|28.640000|28.740000 
6,6|28.640000|30.640000,9|203|205
10,3|14:32:35.695000|15:32:35.695000


The first column is the line number of the second file where a field mismatch occurred. The subsequent fields have three elements:


Code
fieldNum|inFile01Val|inFile02Val


The script first pops the second file's name off @ARGV (for later use) and reads through the first file's lines. It splits each line, uses the seqnum for the key, and a reference to the array of fields as the value.

Next, the second file's name is pushed back onto @ARGV and its lines are read and split, like the first file. If the seqnum isn't found in the hash created from the first file, the next line is requested.

A for loop is used to iterate through the elements of both arrays (records from both files with matching seqnums). If the elements don't match, the field number (0 - n-1) and the mismatched values are pushed onto a temp array (@results), separated by a "|". If that temp array has elements, the file's line number (in Perl's $.) and the array are printed with commas separating the values.

Hope this helps!


(This post was edited by Kenosis on Apr 6, 2013, 9:18 PM)


Laurent_R
Veteran / Moderator

Apr 7, 2013, 1:47 AM

Post #3 of 8 (1358 views)
Re: [Kenosis] Comparing two CSV files [In reply to] Can't Post

Yes, using a hash to store the data of the first file (with the sequence being tyhe key and the content of the line the value) and then reading the second file, as shown in Kenosis's script, is the right to solve this type of problem, provided, however, that the sequence number is unique in the first file and that the file is not too large to fit in memory.


jb60606
New User

Apr 7, 2013, 4:11 PM

Post #4 of 8 (1348 views)
Re: [Kenosis] Comparing two CSV files [In reply to] Can't Post

This script is ingenious and performs the task perfectly. I can't thank you enough. And equal thanks for documenting each step.

There will be situations with some market data where there will be multiple identical sequence numbers. What I did in the past was concatenate multiple hash values to the SeqNum or symbol to form a unique identifier and suspect that should work with this script as well.

Thanks again.


(This post was edited by jb60606 on Apr 7, 2013, 4:14 PM)


Kenosis
User

Apr 7, 2013, 6:04 PM

Post #5 of 8 (1342 views)
Re: [jb60606] Comparing two CSV files [In reply to] Can't Post

You're most welcome, jb60606!


Laurent_R
Veteran / Moderator

Apr 8, 2013, 4:20 AM

Post #6 of 8 (1335 views)
Re: [jb60606] Comparing two CSV files [In reply to] Can't Post


In Reply To
There will be situations with some market data where there will be multiple identical sequence numbers. What I did in the past was concatenate multiple hash values to the SeqNum or symbol to form a unique identifier and suspect that should work with this script as well.


Yes, this is one way of doing it. You can also store in your hash a reference to an array (you get a hash of arrays).


jb60606
New User

Apr 10, 2013, 2:58 PM

Post #7 of 8 (1294 views)
Re: [Kenosis] Comparing two CSV files [In reply to] Can't Post

I had previously mentioned in my last post that there will be data feeds that will have no unique identifier (e.g. a sequence number unique to the symbol, etc). In the past, my way around this was to concatenate several elements of each record to the symbol (this was attempted with $uname below), though, I can't get that to work with this code. I've uploaded a new example of data with identical timestamps, sequence#s, etc. Does anyone have any recommendations on a way around this?


Code
while (<>) { 
chomp;
my @fields = split /,/;
my $uname = $fields[0].$fields[1].$fields[8];
$hash{ $uname } = \@fields if @fields;
}

### Second file's name is pushed back onto @ARGV.
push @ARGV, $inFile02;

### Lines are read and split, like the first file.
while (<>) {
chomp;
my @fields = split /,/;
my $uname = $fields[0].$fields[1].$fields[2];
my @missing;
### If the SeqNum isn't found in the hash created from the first file
### the script proceeds to the next line
unless ($hash{ $uname }){
print "The Sequence Number $fields[1] was not found in the new feed\n";
next;
}

my @results;
for my $i ( 0 .. @fields - 1 ) {
push @results, "$i|$hash{ $uname }[$i]|$fields[$i]"
if $hash{ $uname }[$i] ne $fields[$i];
}

print "$.," . ( join ',', @results ) . "\n" if @results;



Code
ESM3,2285969,Q,13:59:02.919000,Q,WIDE,1549.250000,1504,1549.500000,356,0.000000,0.000000,0.000000,122,59,N,CME,CME 
ESM3,2285970,Q,13:59:02.919000,Q,WIDE,1549.250000,1504,1549.500000,355,0.000000,0.000000,0.000000,122,58,N,CME,CME
ESM3,2285972,T,13:59:02.919000,BUY,1549.500000,1,0,1738795,,U
ESM3,2285972,T,13:59:02.919000,BUY,1549.500000,1,0,1738796,,U
ESM3,2285972,T,13:59:02.919000,BUY,1549.500000,18,0,1738814,,U
ESM3,2285972,T,13:59:02.919000,BUY,1549.500000,1,0,1738815,,U
ESM3,2285972,T,13:59:02.919000,BUY,1549.500000,2,0,1738817,,U
ESM3,2285972,T,13:59:02.919000,BUY,1549.500000,2,0,1738819,,U
ESM3,2285972,T,13:59:02.919000,BUY,1549.500000,15,0,1738834,,U
ESM3,2285972,Q,13:59:02.919000,Q,WIDE,1549.250000,1504,1549.500000,314,0.000000,0.000000,0.000000,122,51,N,CME,CME
ESM3,2285973,Q,13:59:02.920000,Q,WIDE,1549.250000,1509,1549.500000,314,0.000000,0.000000,0.000000,123,51,N,CME,CME



Kenosis
User

Apr 10, 2013, 4:44 PM

Post #8 of 8 (1281 views)
Re: [jb60606] Comparing two CSV files [In reply to] Can't Post

Hi jb60606!

Try the following:


Code
use strict; 
use warnings;

my $inFile02 = pop;
my %hash;

while (<>) {
chomp;
my @fields = split /,/;
my $key = $fields[1] =~ /^\d{6,}$/ ? $fields[1] : $fields[0] . $fields[1] . $fields[8];
$hash{$key} = \@fields if @fields;
}

push @ARGV, $inFile02;

while (<>) {
chomp;
my @fields = split /,/;
my $key = $fields[1] =~ /^\d{6,}$/ ? $fields[1] : $fields[0] . $fields[1] . $fields[8];
next unless $hash{ $key };

my @results;
for my $i ( 0 .. @fields - 1 ) {
push @results, "$i|$hash{ $key }[$i]|$fields[$i]"
if $hash{ $key }[$i] ne $fields[$i];
}

print "$.," . ( join ',', @results ) . "\n" if @results;
}


You'll note the addition of the following:


Code
my $key = $fields[1] =~ /^\d{6,}$/ ? $fields[1] : $fields[0] . $fields[1] . $fields[8];


This ternary operator evaluates the first field (SeqNum) for 6+ digits, which it seems your datasets have. If there's a match, the first field is used as the key, else the concatenation of fields 0, 1, and 8 is used as the unique key identifier. (Of course, you're going to have to choose fields to concatenate that you think will not be different between the two records whose fields you'll be examining for differences.)

Tried this on a data set with the SeqNum removed in one record and it worked, although your mileage may vary so some tweaking may be in order.

Hope this helps!


(This post was edited by Kenosis on Apr 10, 2013, 4:59 PM)

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives