Home: Perl Programming Help: Beginner:
Merging the data in two files using a hash

 



Tejas
User

Sep 2, 2014, 5:42 AM

Post #1 of 37 (3457 views)
Merging the data in two files using a hash

 

Can someone please comment on this and tell me whether it is good or ugly?
And can the code be shortened?


Quote
File1
28045071,1,56,DAD,418756991,0,-9.02,01-AUG-14,01-AUG-14,1
28045281,1,19,DAD,12701012015,0,-261.02,01-AUG-14,01-AUG-14,1
28045991,1,19,DAD,379031901,0,-22.42,01-AUG-14,01-AUG-14,1
2213506106,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2213506116,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2264530076,1,24,DAD,1377063511,0,-350,01-AUG-14,01-AUG-14,1
2613542516,1,24,DAD,501029031,0,-30,01-AUG-14,01-AUG-14,1
2634699316,1,24,DAD,512242996,0,-100,01-AUG-14,01-AUG-14,1
2639141256,1,24,DAD,13496038905,0,-25,01-AUG-14,01-AUG-14,1
2641900466,1,24,DAD,56276190,0,-50,01-AUG-14,01-AUG-14,1
28053391,1,19,DAD,766709012,0,-70,01-AUG-14,01-AUG-14,1



Quote
File2
28051341,1,56,DAD,199610116,0,-12.74,02-AUG-14,02-AUG-14,1
28051961,1,19,DAD,6735124615,0,-36.45,02-AUG-14,02-AUG-14,1
28052061,1,19,DAD,394104487,0,-48.61,02-AUG-14,02-AUG-14,1
28053391,1,19,DAD,766709012,0,-60,02-AUG-14,02-AUG-14,1
2399932016,1,24,DAD,567508320,0,-50,02-AUG-14,02-AUG-14,1
2451060666,1,24,DAD,499140250,0,-50,02-AUG-14,02-AUG-14,1
2495205736,1,24,DAD,774256411,0,-20,02-AUG-14,02-AUG-14,1
2604153876,1,24,DAD,7378719,0,-50,02-AUG-14,02-AUG-14,1
2638779256,1,24,DAD,240129917,0,-50,02-AUG-14,02-AUG-14,1
2646215356,1,24,DAD,1036846291,0,-40,02-AUG-14,02-AUG-14,1


Quote
OUTPUT
28045071,1,56,DAD,418756991,0,-9.02,01-AUG-14,01-AUG-14,1
28045281,1,19,DAD,12701012015,0,-261.02,01-AUG-14,01-AUG-14,1
28045991,1,19,DAD,379031901,0,-22.42,01-AUG-14,01-AUG-14,1
2213506106,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2213506116,1,24,DAD,1374249100,0,-20,01-AUG-14,01-AUG-14,1
2264530076,1,24,DAD,1377063511,0,-350,01-AUG-14,01-AUG-14,1
2613542516,1,24,DAD,501029031,0,-30,01-AUG-14,01-AUG-14,1
2634699316,1,24,DAD,512242996,0,-100,01-AUG-14,01-AUG-14,1
2639141256,1,24,DAD,13496038905,0,-25,01-AUG-14,01-AUG-14,1
2641900466,1,24,DAD,56276190,0,-50,01-AUG-14,01-AUG-14,1
28051341,1,56,DAD,199610116,0,-12.74,02-AUG-14,02-AUG-14,1
28051961,1,19,DAD,6735124615,0,-36.45,02-AUG-14,02-AUG-14,1
28052061,1,19,DAD,394104487,0,-48.61,02-AUG-14,02-AUG-14,1
28053391,1,19,DAD,766709012,0,-60,02-AUG-14,02-AUG-14,1  <-- this txn is repeated; the latest version (the second file's) has to be considered
2399932016,1,24,DAD,567508320,0,-50,02-AUG-14,02-AUG-14,1
2451060666,1,24,DAD,499140250,0,-50,02-AUG-14,02-AUG-14,1
2495205736,1,24,DAD,774256411,0,-20,02-AUG-14,02-AUG-14,1
2604153876,1,24,DAD,7378719,0,-50,02-AUG-14,02-AUG-14,1
2638779256,1,24,DAD,240129917,0,-50,02-AUG-14,02-AUG-14,1
2646215356,1,24,DAD,1036846291,0,-40,02-AUG-14,02-AUG-14,1


We can see that
28053391,1,19,DAD,766709012,0,-70,01-AUG-14,01-AUG-14,1
is repeated in both files, but the latest version should be considered and printed.
So in the output, the second file's data is printed.

The output has all of the first file's txns and all of the second file's txns; if a txn repeats in the second file (the key is the first column), the second file's data has to be considered.

Code
#!/usr/bin/perl

my $pwd = `pwd`;
chomp($pwd);
my $clr_txns     = "$pwd/File1.txt";
my $temp_file    = "$pwd/File2.txt";
my $final_output = "$pwd/Final_List.txt";
open (FIRST,  "< $clr_txns")     or die "could not open $clr_txns $!";
open (SECOND, "< $temp_file")    or die "could not open $temp_file $!";
open (MATCH,  "> $final_output") or die "could not open $final_output $!";

my %hash  = ();
my %hash1 = ();
while (my $line = <FIRST>) {
    my @elements = split ',', $line;
    my $key = $elements[0];
    print "$key\n";
    $hash{$key}  = 1;
    $hash1{$key} = $line;
}

while (my $line = <SECOND>) {
    my @elements = split ',', $line;
    my $key = $elements[0]; # Perl arrays are zero-indexed
    if ($hash{$key}) {
        print MATCH "$line";
        $hash{$key} = 0;
    }
    else {
        print MATCH "$line"; # Also print unmatched, as we need all the txns from both files
    }
}

while (my ($key, $value) = each %hash1) {
    if ($hash{$key} != 0) {
        print MATCH "$value"; # Print the values of the other file, eliminating the matched ones
    }
}

close (FIRST);
close (SECOND);
close (MATCH);


Thanks
Tejas


FishMonger
Veteran / Moderator

Sep 2, 2014, 6:55 AM

Post #2 of 37 (3451 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

Using your choice of words, I must say that it's ugly.

You should ALWAYS include the strict and warnings pragmas.

Use a lexical var for the filehandle.

Use the 3 arg form of open.

Use descriptive names for the vars. %hash and %hash1 are very poor var name choices.

You only need 1 hash, not 2.

Use proper vertical and horizontal whitespace (line spacing and indentation) and be consistent.

Don't create vars which are not needed/used such as your @elements array.

With 1 or 2 exceptions, the first arg of the split function should be a regex pattern, not a string.


Code
#!/usr/bin/perl

use strict;
use warnings;
use Cwd;

my $cwd = getcwd();
my $clr_txns     = "$cwd/File1.txt";
my $temp_file    = "$cwd/File2.txt";
my $final_output = "$cwd/Final_List.txt";

open my $in_fh1, '<', $clr_txns     or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file    or die "could not open $temp_file <$!>";
open my $out_fh, '>', $final_output or die "could not open $final_output <$!>";

# Since I don't know what your data represents, I can't come up with a better name,
# so I kept your hash name.
my %hash;

while (my $line = <$in_fh1>) {
    my $key = (split /,/, $line)[0];
    $hash{$key} = $line;
}
close $in_fh1;

while (my $line = <$in_fh2>) {
    my $key = (split /,/, $line)[0];
    $hash{$key} = $line;
}
close $in_fh2;

foreach my $value (values %hash) {
    print $out_fh $value;
}
close $out_fh;



(This post was edited by FishMonger on Sep 2, 2014, 7:17 AM)


Laurent_R
Veteran / Moderator

Sep 2, 2014, 10:02 AM

Post #3 of 37 (3439 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

Hi Tejas,

I answered your other post and see now that you posted the same question twice; FishMonger has already given you a more detailed answer here. Posting twice leads to duplication of work, so please don't do it.


Tejas
User

Sep 2, 2014, 11:28 AM

Post #4 of 37 (3436 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Sorry for that.
I mistakenly did that in the other post.
I will delete it.


Thanks
Tejas


Tejas
User

Sep 2, 2014, 11:46 AM

Post #5 of 37 (3430 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]

Thanks.
I am actually in the test phase and did not really look into that.
Now I will change the script accordingly.

Will this script work if I have two files with 50 lakh (5 million) lines each?
Assume there are no duplicates, so the output will have 1 crore (10 million) lines.

I can't think of anything except a hash here, but I am afraid this won't work.
Even sorting both files and running this code will not help, as in the worst-case scenario there would not be any duplicates.

Any new ways of implementing this?
Thanks
Tejas


FishMonger
Veteran / Moderator

Sep 2, 2014, 11:51 AM

Post #6 of 37 (3427 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

What are the file sizes in KB or MB, not the number of lines?

How much RAM do you have?


Tejas
User

Sep 2, 2014, 11:55 AM

Post #7 of 37 (3423 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]

I have a file of 1 GB.
Some files are around 200 to 300 MB.

1 GB RAM.

It's not just that; last time, when I used a hash for a file with 50 lakh (5 million) keys, my system hung.

Thanks
Tejas


(This post was edited by Tejas on Sep 2, 2014, 11:56 AM)


FishMonger
Veteran / Moderator

Sep 2, 2014, 12:33 PM

Post #8 of 37 (3413 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

1GB of RAM is considered very low these days, especially if you're using Windows.

The files in the 200-300MB range shouldn't be much of a problem, but the 1GB files will be, due to your limited RAM.

My first recommendation is to add more RAM. IMO, 4GB should be the minimum when doing this kind of work on Windows.

If you can't upgrade the RAM, then you could filter the data through a database rather than storing everything in memory via a hash. You could parse each line as you're currently doing, but instead of assigning a hash value, you store that data in the DB. You could even access your csv files with sql statements as if they were database tables. Once the 2 input files have been processed, you execute another query that dumps the data directly to a new csv file.

Going the DB route will be a little more complex coding-wise, but it will also reduce the memory footprint and won't hang the system like your previous experience.


Tejas
User

Sep 2, 2014, 10:24 PM

Post #9 of 37 (3405 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]


Quote
access your csv files with sql statements as if they were database tables


Does that mean I don't need to have SQL at all, and the task can be performed as if the files were tables?
Do you have a snippet for this?

Thanks
Tejas


FishMonger
Veteran / Moderator

Sep 3, 2014, 8:20 AM

Post #10 of 37 (3392 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]


Code
#!/usr/bin/perl 

use strict;
use warnings;
use DBI;
use DBD::CSV;
use Data::Dumper;

# connect to "csv database" using default parameters
my $dbh = DBI->connect("DBI:CSV:") or die $DBI::errstr;

# prepare and execute select statement to fetch 1 row
my $sth = $dbh->prepare("select * from file1.csv limit 1") or die $dbh->errstr();
$sth->execute;

# fetch the row
my @row = $sth->fetchrow_array;

# dump out the row (array)
print Dumper \@row;

# disconnect from the "csv database"
$dbh->disconnect;


Output using your first file.

c:\test>csv2sql_example.pl

Code
$VAR1 = [
          '28045281',
          '1',
          '19',
          'DAD',
          '12701012015',
          '0',
          '-261.02',
          '01-AUG-14',
          '01-AUG-14',
          '1'
        ];


http://search.cpan.org/~timb/DBI-1.631/DBI.pm
http://search.cpan.org/~jzucker/DBD-CSV-0.22/lib/DBD/CSV.pm


(This post was edited by FishMonger on Sep 3, 2014, 8:22 AM)


FishMonger
Veteran / Moderator

Sep 3, 2014, 8:50 AM

Post #11 of 37 (3387 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

I should mention that the most efficient database approach would be to use the LOAD DATA INFILE sql statement to load the csv files into the database.

1) Create the database and table structure (that's 2 separate sql statements).

2) Insert file1.txt via a LOAD DATA INFILE statement.

3) Insert file2.txt via a slightly adjusted LOAD DATA INFILE statement, i.e., add the REPLACE keyword so that when a duplicate ID (primary key) is seen, it will update/replace that row from file1 with the row from file2.

4) Once both files are loaded, execute a SELECT ... INTO OUTFILE statement to dump the data to a new csv file.

5) Delete the database if not needed.

http://dev.mysql.com/doc/refman/5.1/en/load-data.html
http://dev.mysql.com/doc/refman/5.1/en/select-into.html
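A rough sketch of those statements (untested; the database, table, and column names here are made up for illustration, and the column types are guesses based on your sample rows):

```sql
-- 1) create the database and the table; the primary key is the first CSV column
CREATE DATABASE txn_merge;
CREATE TABLE txn_merge.txns (
    txn_id VARCHAR(20) PRIMARY KEY,
    f2 INT, f3 INT, f4 VARCHAR(10), f5 VARCHAR(20),
    f6 INT, f7 DECIMAL(12,2), f8 CHAR(9), f9 CHAR(9), f10 INT
);

-- 2) load file1
LOAD DATA INFILE '/path/to/File1.txt'
    INTO TABLE txn_merge.txns FIELDS TERMINATED BY ',';

-- 3) load file2; REPLACE makes file2's row win on a duplicate primary key
LOAD DATA INFILE '/path/to/File2.txt'
    REPLACE INTO TABLE txn_merge.txns FIELDS TERMINATED BY ',';

-- 4) dump the merged rows to a new csv file
SELECT * FROM txn_merge.txns
    INTO OUTFILE '/path/to/Final_List.txt' FIELDS TERMINATED BY ',';

-- 5) drop the database if no longer needed
DROP DATABASE txn_merge;
```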


Laurent_R
Veteran / Moderator

Sep 3, 2014, 9:56 AM

Post #12 of 37 (3380 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

Hi Tejas, is your data sorted? It seems to be, but I don't fully understand how.

Update: the reason I am asking is that if the data is sorted one way or another in accordance with the comparison key, then you could read the files in parallel and remove the duplicates as you go. The good thing about this approach is that it will work for files of just about any size, irrespective of RAM size, and it will be much faster than a database approach. The downside is that it requires a bit of cleverness, or rather care and attention, to get the algorithm really right.


(This post was edited by Laurent_R on Sep 3, 2014, 10:21 AM)


Tejas
User

Sep 3, 2014, 10:55 AM

Post #13 of 37 (3368 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]


Quote
then you could read the files in parallel and remove the duplicates as you go


I can get the files sorted by their keys in Perl itself.

But do you want me to open both files at the same time and check the keys?

Do you have a snippet for that?

The worst-case scenario would be that the two files have no matching keys at all, and ultimately all the data in both files has to be stored in a separate file.


Quote
I said it already. You first need to sort the files on their comparison key, say in ascending order (using the Unix sort utility, for example).

Then you open both new sorted files, read the first line of each. If the keys compare equal, then you have a common record. Store the line in a file of common records (if you need one) and move to the next line for both files. And repeat the key comparison.

If they don't compare equal, then the smallest of the two corresponds to an "orphan", i.e. a record that is in the file where you found it and not in the other. Write that out to an orphan file. Get the next line of the file where the orphan was found, keeping the line from the other file. And repeat the comparison.

And so on until the end of one file, at which point any remaining lines in the other file are also orphans.

I have written a generic module to do that (and a number of other things on large files), and I am using it regularly, but have not uploaded it to CPAN so far, because uploading a module requires a few additional steps (preparing an install procedure, etc.) that I don't know (yet) how to do.

But if you are trying to do it and don't succeed (and show how you've tried), I would gladly post the core algorithm.

The file comparison is extremely fast, but the initial sorting of the files has an overhead, which is why I discouraged you from trying this approach, given that your hash approach was giving good results in view of the data size.

Here's what you suggested when I had a similar problem last time.
And my keys are not just numbers; there are alphanumeric keys too.

Thanks
Tejas


(This post was edited by Tejas on Sep 3, 2014, 11:08 AM)


Tejas
User

Sep 4, 2014, 5:40 AM

Post #14 of 37 (3321 views)
Re: [FishMonger] Merging the data in two files using a hash [In reply to]

Hi,
Here is the script with some minute changes:


Code
#!/usr/bin/perl

use strict;
use warnings;
use Cwd;

my $cwd = getcwd();
my $clr_txns     = "$cwd/File1.txt";
my $temp_file    = "$cwd/File2.txt";
my $final_output = "$cwd/Final_List.txt";

open my $in_fh1, '<', $clr_txns     or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file    or die "could not open $temp_file <$!>";
open my $out_fh, '>', $final_output or die "could not open $final_output <$!>";

my %Unbal_Hash;

while (my $line = <$in_fh1>) {
    my $key = (split /,/, $line)[0];
    $Unbal_Hash{$key} = $line;
}
close $in_fh1;

while (my $line = <$in_fh2>) {
    my ($key, $balance) = (split /\t/, $line)[0, 7];
    if (exists $Unbal_Hash{$key} && $balance == 0) {
        print "$Unbal_Hash{$key}\n";
        delete $Unbal_Hash{$key};
    }
    else {
        $Unbal_Hash{$key} = $line;
    }
}
close $in_fh2;

foreach my $value (values %Unbal_Hash) {
    print $out_fh $value;
}
close $out_fh;


I am just eliminating the entries whose key matches and whose total_amount is 0 (they are not needed).


Quote
File1
889546565,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889996975,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889998385,6,46,APY,0,-14067,05-DEC-13,05-DEC-13,1
890722795,6,46,APY,0,-9430,14-DEC-13,14-DEC-13,1
890857005,6,24,APY,0,-500,10-NOV-13,10-NOV-13,1
890925475,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936315,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936335,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936355,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936415,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3



Quote
File 2
1993532910 6 212-366-520 APY 451028365 900 -900 0 14-AUG-14 15-AUG-14 3
1993536110 6 366-520 APY 477045894 390 -390 0 14-AUG-14 15-AUG-14 2
1993536750 6 366-520 APY 917563294 300 -300 0 14-AUG-14 15-AUG-14 2
1993539310 6 366-520 APY 7512802845 432 -432 0 14-AUG-14 15-AUG-14 2
1993539950 6 366-520 APY 449362894 432 -432 0 15-AUG-14 15-AUG-14 2
1993541230 6 366-520 APY 6770624155 1234 -1234 0 15-AUG-14 15-AUG-14 2
1993542510 6 366-520 APY 628602625 100 -100 0 15-AUG-14 15-AUG-14 2
1993543790 6 366-520 APY 843380824 400 -400 0 15-AUG-14 15-AUG-14 2
1993544430 6 366-520 APY 531660774 99 -99 0 15-AUG-14 15-AUG-14 2
1993545070 6 212-366-520 JPY 444744025 432 -432 0 15-AUG-14 15-AUG-14 3
696795792443 6 14-308 APY 521806975 550 -550 0 24-JUL-14 15-AUG-14 4


I hope the way I am handling the txns with amount 0 is OK.

Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 4, 2014, 9:54 AM

Post #15 of 37 (3312 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]


In Reply To
I can get the files sorted by the keys in perl itself.

Probably not if your files are too large to fit in RAM.

If you are under Linux or Unix, you can use the OS's sort utility, which can sort files much larger than RAM by using temporary files on disk, but I do not know whether there is such a utility on Windows.
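For example, something like this (a toy sketch you can run as-is; -t sets the comma as field separator and -k1,1 restricts the sort to the first field, in alphanumeric order):

```shell
# two toy comma-separated files, key in field 1
printf '28053391,1,19\n28045071,1,56\n' > File1.txt
printf '28051341,1,56\n28045071,2,99\n' > File2.txt

# sort each file on its first comma-separated field
sort -t, -k1,1 File1.txt -o File1.sorted.txt
sort -t, -k1,1 File2.txt -o File2.sorted.txt

head -n1 File1.sorted.txt   # -> 28045071,1,56
```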


In Reply To

But do you want me to open both files at a time and check for the keys?


Yes, that was the idea. It is detailed in the older post of mine that you quoted just above.


In Reply To

Do you have a snippet for that?


Yes, I could provide one, but please explain exactly what you are trying to do, as I am not entirely sure of the details. It seems to me that you are trying to remove from one file data items that also exist in the other file. Is this correct? Is there more to it?

But this snippet would only work for sorted data, so that it depends on whether you are really able to sort the files on their keys.


FishMonger
Veteran / Moderator

Sep 4, 2014, 10:24 AM

Post #16 of 37 (3310 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

A number of the GNU utilities have been ported to Windows, sort being one of them.
http://gnuwin32.sourceforge.net/packages/coreutils.htm
http://unxutils.sourceforge.net/


(This post was edited by FishMonger on Sep 4, 2014, 10:29 AM)


Tejas
User

Sep 4, 2014, 10:38 AM

Post #17 of 37 (3306 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]


Quote
Yes, I could provide one, but please explain exactly what you are trying to do, as I am not entirely sure of the details. It seems to me that you are trying to remove from one file data items that also exist in the other file. Is this correct? Is there more to it?


Yes, this is the task. But there will be more operations, and changes to the mathematical operations.
The comparisons, though, would definitely be there.
And I have also specified in my earlier post that the keys will not just be numbers; there will be alphanumerics and alphabets too (e.g. aHXPVTTRER).
If your code can help me, I will definitely use it, as it works on sorted files and I assume the comparisons would be far fewer comparatively.

Finally, I didn't really get why you suggested the Windows sort utility; I never use Windows at all.
My work is totally on Linux and I can use the command-line sort utility.
I will be glad to use your code snippet.

Thanks
Tejas


Laurent_R
Veteran / Moderator

Sep 4, 2014, 11:00 AM

Post #18 of 37 (3303 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

I just was not sure whether you were using Linux or Windows; that's why I asked. Then you can use the Linux sort utility.

I'll come back later today with the basic code to do it, not enough time right now.


Tejas
User

Sep 4, 2014, 11:08 AM

Post #19 of 37 (3301 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Thanks

Code
if (exists $Unbal_Hash{$key} && $balance == 0) {
    print "$Unbal_Hash{$key}\n";
    delete $Unbal_Hash{$key};
}
else {
    $Unbal_Hash{$key} = $line;
}

Also, please comment on this code: is this the right approach?


(This post was edited by Tejas on Sep 4, 2014, 11:29 AM)


Laurent_R
Veteran / Moderator

Sep 4, 2014, 11:54 AM

Post #20 of 37 (3295 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

I must say that this part of your code surprised me a bit when I saw it (essentially: why do you delete from the hash?), but since I haven't understood in detail what you are trying to do, I do not know whether it is correct. That's the problem with this thread and the previous one on the same subject: you haven't defined precisely what you want to do, and I am not even sure you really know for sure yourself. When you want to write a program, you first need to clarify exactly what you want it to do (often by writing some specs or business rules, or at the very least by having them very clear in your own mind). Unless I missed an important post, your description of what you want is far from precise enough.

Well, enough talking. I'll try to write up some code based on my best comprehension of what you need; you'll probably have to adapt it to fit your real needs. But at least you'll have a basic algorithm, hopefully well coded, to use, and hopefully you'll have only implementation details to change.


Laurent_R
Veteran / Moderator

Sep 4, 2014, 1:59 PM

Post #21 of 37 (3286 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Alright, here is a first, very simple solution, which might work or might be too simple for your needs.

This assumes you just want to remove duplicates, or, in other words, retain only one line for every unique key. In this case, you can just use the sort utility to merge together and sort the data from both files and produce one file with unique values. Note that, from what you said previously, your sort should be alphanumerical, not numerical.
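For instance, the merge step could be done like this (a sketch with toy files, untested on your real data; note that a plain sort keeps duplicate keys on adjacent lines rather than removing them, which is exactly what the code below relies on):

```shell
# toy stand-ins for the two real files
printf 'b,2\nc,3\n' > File1.txt
printf 'a,1\nb,9\n' > File2.txt

# merge and sort both files on the first comma-separated field;
# lines sharing a key end up adjacent in the output
sort -t, -k1,1 File1.txt File2.txt > merged_sorted.txt

head -n1 merged_sorted.txt   # -> a,1
```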


I first define a comparison function:


Code
sub compare {
    my ($curr, $prev) = @_;
    my $curr_key = (split /,/, $curr)[0];
    my $prev_key = (split /,/, $prev)[0];
    return 1 if $curr_key eq $prev_key;
    return 0;
}

This function receives two lines from the calling function, splits the lines to get the keys, and compares the keys. It returns 1 if the keys are equal (duplicates) and 0 otherwise. In one of my real programs, this function would be much shorter (probably 2 or 3 lines) and would most probably be stored in a coderef rather than a regular function, but I tried to make it as simple as possible to help you understand the principle.

My own version might be something like this:

Code
sub compare {
    my ($curr_key, $prev_key) = map { (split /,/, $_)[0] } @_;
    return $curr_key eq $prev_key ? 1 : 0;
}

But don't worry about that, use the first version for the time being.

Please also note that it really makes sense to separate the functional rules (how to compare records, stored in this function) from the technical duplicate removing part (below). It means you can reuse the technical part and just change the functional part for another similar problem.

Now the duplicate removal. This assumes you have already opened three filehandles: $FH_IN for the input, $FH_DUPL for printing out the duplicates, and $FH_OUT for the output of the unique lines.


Code
my $previous_line = "";
while (my $line = <$FH_IN>) {
    chomp $line;
    if (compare($line, $previous_line)) {
        # this line is a duplicate
        print $FH_DUPL $line, "\n";
    } else {
        print $FH_OUT $line, "\n";
    }
    $previous_line = $line;
}


As you can see, this is fairly short and simple code.

I haven't tested the above because I don't really have data to do it, but I believe this should work, because it is a simplified version of something that I have tested extensively. I might have goofed something when simplifying it, but if such is the case, it should be easy enough to fix it.

I'll post a bit later a more complex solution where the two files are read in parallel. But the one above might just be sufficient for your needs.


Laurent_R
Veteran / Moderator

Sep 4, 2014, 2:36 PM

Post #22 of 37 (3284 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

If you need a more detailed output than what I suggested above, you might try the following.

This assumes that 6 filehandles are open before we start:
- $IN1 and $IN2 for the two input files
- $ORPH1 and $ORPH2 for orphans (records in one file but not in the other)
- $OUT1 and $OUT2 for common lines (two files, because the keys of the input files might be the same while the content is not necessarily identical)
Of course, you can simplify all this if some files are not needed.

The comparison function needs to be slightly different than before, because it needs to return three possible values:



Code
sub compare2 {
    my ($curr_key, $prev_key) = map { (split /,/, $_)[0] } @_;
    return $curr_key cmp $prev_key;
}


Now the actual comparison code:

Code
my $ligne1 = <$IN1>;
my $ligne2 = <$IN2>;
die "One of the input files is empty\n" unless defined $ligne1 and defined $ligne2;
chomp ($ligne1, $ligne2);
while ( 1 ) {
    my $comparison = compare2($ligne1, $ligne2);
    if ($comparison > 0) {
        print $ORPH2 $ligne2, "\n";
        $ligne2 = <$IN2>;
        last unless defined $ligne2;
        chomp $ligne2;
    } elsif ($comparison < 0) {
        print $ORPH1 $ligne1, "\n";
        $ligne1 = <$IN1>;
        last unless defined $ligne1;
        chomp $ligne1;
    } else {
        print $OUT1 $ligne1, "\n";
        print $OUT2 $ligne2, "\n";
        $ligne1 = <$IN1>;
        $ligne2 = <$IN2>;
        last unless defined $ligne1 and defined $ligne2;
        chomp ($ligne1, $ligne2);
    }
}
# flush whatever is left of either file as orphans
# (chomp is harmless on an already chomped line)
if (defined $ligne2) { chomp $ligne2; print $ORPH2 $ligne2, "\n"; }
print $ORPH2 $ligne2 while $ligne2 = <$IN2>;
if (defined $ligne1) { chomp $ligne1; print $ORPH1 $ligne1, "\n"; }
print $ORPH1 $ligne1 while $ligne1 = <$IN1>;

Same comment as in my previous post: I haven't tested this on your data, because I don't have enough of it, so I might have goofed a detail here or there, but the module from which I took the code has been thoroughly tested in real-life applications and is believed to be bug-free.


(This post was edited by Laurent_R on Sep 4, 2014, 2:42 PM)


Tejas
User

Sep 4, 2014, 6:44 PM

Post #23 of 37 (3275 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Hi,
I am sorry to say that our needs change day by day and I am adding new stuff daily.

The only reason I am deleting from the hash is that I do not need the entries whose total sum is 0.
Otherwise I will end up printing txns with amount 0 alongside the non-zero ones,
and I am interested in the non-zero txns only.

Basically I have two files:
1. A file with today's transaction report, which has an update of historical txns plus the current txns.
2. A historical non-zero txns report.
Generally, we check the latest sum of amounts.

First I check whether the historical txns are available in today's report.
If no --> they are still non-zero (this should be printed).
If yes --> there are 2 cases:
1. They can be zero.
2. They can be non-zero but with some modification (as they are available in today's report, there will definitely be a change in the amount).

That is the only reason why I am deleting the values with 0 from the hash,
and at the end I will just print those values which are non-zero.


Quote
File1
889546565,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889996975,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1
889998385,6,46,APY,0,-14067,05-DEC-13,05-DEC-13,1
890722795,6,46,APY,0,-9430,14-DEC-13,14-DEC-13,1
890857005,6,24,APY,0,-500,10-NOV-13,10-NOV-13,1
890925475,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936315,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936335,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936355,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
890936415,6,24,APY,0,-1000,29-OCT-13,29-OCT-13,1
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3


File 2
1993532910 6 212-366-520 APY 451028365 900 -900 0 14-AUG-14 15-AUG-14 3
1993536110 6 366-520 APY 477045894 390 -390 0 14-AUG-14 15-AUG-14 2
1993536750 6 366-520 APY 917563294 300 -300 0 14-AUG-14 15-AUG-14 2
1993539310 6 366-520 APY 7512802845 432 -432 0 14-AUG-14 15-AUG-14 2
1993539950 6 366-520 APY 449362894 432 -432 0 15-AUG-14 15-AUG-14 2
1993541230 6 366-520 APY 6770624155 1234 -1234 0 15-AUG-14 15-AUG-14 2
1993542510 6 366-520 APY 628602625 100 -100 0 15-AUG-14 15-AUG-14 2
1993543790 6 366-520 APY 843380824 400 -400 0 15-AUG-14 15-AUG-14 2
1993544430 6 366-520 APY 531660774 99 -99 0 15-AUG-14 15-AUG-14 2
1993545070 6 212-366-520 APY 444744025 432 -432 0 15-AUG-14 15-AUG-14 3
696795792443 6 14-308 APY 521806975 550 -550 0 24-JUL-14 15-AUG-14 4


The txn with key 696795792443 has
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3 in the first file
696795792443 6 14-308 APY 521806975 550 -550 0 24-JUL-14 15-AUG-14 4 in the second file (today's data)

This means that a historical txn has an update today and the total amount is 0, so we don't need it printed, as it is happily balanced.

But the example below is an unbalanced case, where there is an update but the total is still non-zero; we have to print the latest data, as there is an update:
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3 in the first file
696795792443 6 14-308 APY 521806975 550 -1550 1000 24-JUL-14 15-AUG-14 4 in the second file (today's data)

Finally, all the unmatched txns should also be printed, as they are all non-zero.

The first file always has non-zero values.
The second file has zero and non-zero values (there will be a lot of unmatched txns with 0, which also should not be printed; I am still finding a way to do that. And matched txns with 0 are anyhow being deleted, so only the above case has to be dealt with).

All I am doing is printing all the non-zero values from both files:
1. If a non-zero txn has an update in the latest file and is now zero, ignore it.
2. If it has an update and is non-zero, print the latest line.
3. If it is not available in the current file, it is still non-zero, so print it.

Hope you have understood the business behind this.
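Edit: here is an untested sketch of how the file-2 loop could also skip the unmatched zero-balance txns (toy in-memory data stands in for the real files here, so it runs on its own; field positions are as in my script above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %Unbal_Hash already holds file 1's lines keyed on column 1 (as in my script);
# here it is seeded with toy historical data so the sketch runs standalone.
my %Unbal_Hash = (
    '111' => "111,6,46,APY,0,-9980,14-DEC-13,14-DEC-13,1\n",
    '222' => "222,6,46,APY,0,-9430,14-DEC-13,14-DEC-13,1\n",
);

# toy stand-in for file 2 (tab-separated, balance in the 8th field)
my $file2 = "111\t6\t46\tAPY\t1\t900\t-900\t0\t14-AUG-14\n"
          . "222\t6\t46\tAPY\t1\t550\t-1550\t1000\t15-AUG-14\n"
          . "333\t6\t46\tAPY\t1\t100\t-100\t0\t15-AUG-14\n";
open my $in_fh2, '<', \$file2 or die "could not open in-memory file <$!>";

while (my $line = <$in_fh2>) {
    my ($key, $balance) = (split /\t/, $line)[0, 7];
    if ($balance == 0) {
        # zero balance today: drop it whether it was matched or not
        # (deleting a key that is absent is harmless)
        delete $Unbal_Hash{$key};
        next;
    }
    # non-zero balance today: the latest line wins over the historical one
    $Unbal_Hash{$key} = $line;
}
close $in_fh2;

print for sort values %Unbal_Hash;   # only 222's updated line remains
```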


(This post was edited by Tejas on Sep 4, 2014, 7:10 PM)


Laurent_R
Veteran / Moderator

Sep 5, 2014, 3:25 PM

Post #24 of 37 (3235 views)
Re: [Tejas] Merging the data in two files using a hash [In reply to]

And did you try the two pieces of code I suggested?


Tejas
User

Sep 6, 2014, 8:05 AM

Post #25 of 37 (3068 views)
Re: [Laurent_R] Merging the data in two files using a hash [In reply to]

Hi,

Yes, I tried it on test data.
The main data is still spooling; it takes at least 8 hours to spool the data from SQL.
Then I will compare it on the prod data.

I will post updates once the file is ready.



Thanks
Tejas



Powered by Gossamer Forum v.1.2.0