Mindbending re-tweak challenge to compare files in reverse order & flag output

 



stuckinarut
Novice

Feb 27, 2014, 11:32 AM

Post #1 of 15 (1767 views)
Mindbending re-tweak challenge to compare files in reverse order & flag output

I need to re-tweak a script I've used to merge two files and flag each entry with a single trailing space and a letter (or letters) showing which list or lists it appears in. I am using Windows XP [version 5.1.2600] (I think that's right).

perl compare.pl listG.txt listL.txt >combolist.txt

Example:

listG.txt
AS-298
AS-375
AS-402
(etc., etc.)

and

listL.txt
AS-402
AS-590
(etc., etc.)

and yields:

AS-298 G
AS-375 G
AS-402 G L
AS-590 L

It works great!


Code
 
#!/usr/bin/perl

use strict;

use warnings;

# RUN AS: listcompare.pl listG.txt listL.txt

my @letters = qw(G L);
my $letter = shift @letters;
my %lines;
while (<>) {
    chomp;
    if (my $l = $lines{$_}) {
        my $ll = $l->{letters};
        next if ($ll->[-1] eq $letter);
        push @$ll, $letter;
    }
    else {
        $lines{$_} = {
            f => scalar(@letters),
            letters => [$letter],
        }
    }
    if (eof) {
        $letter = shift @letters;
    }
}
print $_, ' ', join('', @{$lines{$_}{letters}}), "\n" for sort keys %lines;

# END OF SCRIPT


The need is to re-tweak the script for a pressing task of comparing two different lists but NOT merging them this time. Instead, the output can ONLY be the listL.txt lines, and also flagging any/all of those lines which have been matched to listM entries with a trailing {space} and 'YES' ... also sorted in Alphabetical sequence A-Z based upon the first column. If there is no match, only the original line intact will be output (without a 'YES'). Then the manual labor of reviewing each of the lines in the output ;-(

Each of the two new lists will have 3 columns (separated by a space). Both lists will have about 2,500 lines.

listL.txt (3 sample lines)

D9PM 40 L7MRQ
D9PM 40 A3WX
D9PM 80 Q5BAL
(etc., etc.)

listM.txt (3 sample lines)

A3WX 40 R5QRC
A3WX 40 D9PM
A3WX 80 L2AFT
(etc., etc.)

The mindbending challenge is that the 3 columns (space separated fields) in listL.txt must be matched to the 3 columns (space separated fields) in listM.txt, BUT IN REVERSE ORDER {SIGH}.

In other words, the only match which would occur in the above examples would be for:

D9PM 40 A3WX

And the flagged output line from listL.txt would be:

D9PM 40 A3WX YES

It would really help if there are any 'Duplicate' matches to add maybe another space and the word 'DUPE' after 'YES' but I can try and identify these manually {MAJOR SIGH}.

Can this even be done? Hopefully I have explained things correctly. My eyeballs are rolling in all directions ;-(

Any assistance would be greatly appreciated.

Thanks very much!

-stuckinarut


Kenosis
User

Feb 27, 2014, 1:04 PM

Post #2 of 15 (1753 views)
Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

Here's one option:

Code
use strict; 
use warnings;

my ( %hash, $reverse );

while (<>) {
    my @fields = $reverse ? reverse split : split;
    $hash{"@fields"}{$ARGV}++;
    $reverse++ if eof;
}

keys %{ $hash{$_} } == 2 and print "$_ YES\n" for keys %hash;

Command-line usage:

Code
perl script.pl listL.txt listM.txt

Output:

Code
D9PM 40 A3WX YES

The script creates a hash of hashes (HoH), where the keys are the lines and the values are references to hashes whose keys are the file names being processed. Dumping the hash using your datasets shows the following:

Code
$VAR1 = { 
'R5QRC 40 A3WX' => {
'listM.txt' => 1
},
'D9PM 80 Q5BAL' => {
'listL.txt' => 1
},
'L2AFT 80 A3WX' => {
'listM.txt' => 1
},
'D9PM 40 L7MRQ' => {
'listL.txt' => 1
},
'D9PM 40 A3WX' => {
'listM.txt' => 1,
'listL.txt' => 1
}
};

The script splits each line's elements into an array, which is interpolated and used as a hash key. When it reaches the end of the first file, it sets a 'reverse' flag, so that the elements split from the second file are reversed.

If there are two keys after dereferencing (keys %{ $hash{$_} } == 2), then it prints the 'top' key with a "YES" after it.

Hope this helps to at least provide some direction...
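The dump shown above can be reproduced with the core Data::Dumper module. What follows is a minimal sketch, assuming the sample lines are held in memory rather than read from files; the `build_index` helper and its argument layout are illustrative and not part of the script above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: builds the same hash of hashes as the script above,
# but from in-memory lists instead of files named on the command line.
sub build_index {
    my (%files) = @_;    # file name => { reverse => 0|1, lines => [...] }
    my %hash;
    for my $name ( keys %files ) {
        for my $line ( @{ $files{$name}{lines} } ) {
            my @fields = split ' ', $line;
            @fields = reverse @fields if $files{$name}{reverse};
            $hash{"@fields"}{$name}++;
        }
    }
    return \%hash;
}

my $index = build_index(
    'listL.txt' => { reverse => 0, lines => [ 'D9PM 40 L7MRQ', 'D9PM 40 A3WX', 'D9PM 80 Q5BAL' ] },
    'listM.txt' => { reverse => 1, lines => [ 'A3WX 40 R5QRC', 'A3WX 40 D9PM', 'A3WX 80 L2AFT' ] },
);

$Data::Dumper::Sortkeys = 1;    # stable key order makes dumps easier to compare
print Dumper($index);
```

Only the key 'D9PM 40 A3WX' ends up with two inner keys, which is what the `keys %{ $hash{$_} } == 2` test relies on.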


(This post was edited by Kenosis on Feb 27, 2014, 3:59 PM)


stuckinarut
Novice

Feb 27, 2014, 4:18 PM

Post #3 of 15 (1729 views)
Re: [Kenosis] Mindbending re-tweak challenge to compare files in reverse order & flag output

Thank you, Kenosis.

Your code works great with my simple data example!

I also ran a test with a bit more data, but what happens is the output list also contains any of the ListM.txt matches as duplicate lines. Can those be eliminated?

Any non-matches from the original ListL.txt file still need to print out even without 'YES'. I tried including...


Code
values %{ $hash{$_} } ne 2 and print "$_\n" for keys %hash;



... but that didn't work.

Also trying to figure out how to integrate an alphabetical sort for the ListL.txt output (sorted on the first column/field) based on this format:


Code
@sorted = sort { $hash{$a} cmp $hash{$b} } keys %hash;


I think I might be better at 'Basket Weaving' :^)

-stuckinarut


Kenosis
User

Feb 27, 2014, 4:24 PM

Post #4 of 15 (1726 views)
Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

You're most welcome, stuckinarut!

Could you share the datasets (the "more data") that produce the duplicate lines? I'm not sure I'm properly envisioning the issue you're experiencing. We'll handle the sort-on-first-column then...


(This post was edited by Kenosis on Feb 27, 2014, 4:42 PM)


stuckinarut
Novice

Feb 27, 2014, 4:58 PM

Post #5 of 15 (1720 views)
Re: [Kenosis] Mindbending re-tweak challenge to compare files in reverse order & flag output

Ohhh, my bad, Kenosis.

I just realized that I had used a bit larger 'consolidated' ListL.txt file that also included some of the (reversed) transactions contained in the ListM.txt file. These would then appropriately be included based upon the first column/field value. My apologies for the oversight until just now ;-(

Actually, this may still work out. I think what I can do is to take the final master consolidated output.txt file when run, import it into Excel, and do my sorting there on the first column/field. An extra step, but that will work.

My only remaining issue is still to include any of the non-matches in the final output (i.e., anything in ListL.txt without a match to a (reversed) ListM.txt entry), and also without a trailing space and 'YES' in the final output.

I made several 'flawed' attempts with an IF ELSE ( ? : ) approach using 'ne' that still has me stuck-in-a-rut.

Thanks for your patience.

-stuckinarut


Chris Charley
User

Feb 27, 2014, 6:47 PM

Post #6 of 15 (1708 views)
Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

I think that this code will

  • not print a duplicate line in the 'L' file

  • print unmatched items in the 'L' file

  • for matching data in the second file, report duplicates in the second file

  • print sorted on the first field


I added a duplicate line in your sample data and got this output.

    Code
    D9PM 40 A3WX YES DUP 
    D9PM 40 L7MRQ
    D9PM 80 Q5BAL


    Code
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data;

    2 == @ARGV or die "Supply fileL and fileM to compare.";

    # Store each listL line, keyed by the line itself
    open my $L, "<", shift or die $!;
    while (<$L>) {
        chomp;
        $data{$_} = $_;
    }
    close $L or die $!;

    # Reverse the fields of each listM line and look it up in the listL data
    my %seen;
    open my $M, "<", shift or die $!;
    while (<$M>) {
        my $key = join " ", reverse split;
        if ($data{$key}) {
            if (! $seen{$key}++) {
                $data{$key} .= ' YES';
            }
            else {
                $data{$key} .= ' DUP' if $seen{$key} == 2;
            }
        }
    }
    close $M or die $!;

    for my $v (sort values %data) {
        print $v, "\n";
    }

    __DATA__
    listL.txt

    D9PM 40 L7MRQ
    D9PM 40 A3WX
    D9PM 80 Q5BAL


    listM.txt (4 sample lines - includes a duplicate)

    A3WX 40 R5QRC
    A3WX 40 D9PM
    A3WX 80 L2AFT
    A3WX 40 D9PM



    (This post was edited by Chris Charley on Feb 27, 2014, 6:57 PM)


    stuckinarut
    Novice

    Feb 27, 2014, 7:26 PM

    Post #7 of 15 (1701 views)
    Re: [Chris Charley] Mindbending re-tweak challenge to compare files in reverse order & flag output

    Hi, Chris Charley:

    OMG-OMG... I almost soiled myself when I saw the output from your code:


    Quote

    I added a duplicate line in your sample data and got this output.
    Code

    D9PM 40 A3WX YES DUP
    D9PM 40 L7MRQ
    D9PM 80 Q5BAL


    Excitedly, I rushed to the other Confuzzzer to try it out, but got this instead:

    D9PM 40 A3WX
    D9PM 40 L7MRQ
    D9PM 80 Q5BAL

    Tried it 3 times using both data files in your post with always the same result via:

    perl yourscript.pl listL.txt listM.txt >myoutput.txt

    Really appreciate your assistance to help. Must be a loose screw somewhere here on my end?

    Regards,

    -stuckinarut
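The thread never pins down why the flags went missing in this run. One possible cause (an assumption, not something confirmed in the posts) is stray whitespace: a listL.txt line with a trailing space is stored as a key that includes the space, while the key built from listM.txt with `join " ", reverse split` has none, so the lookup fails silently. A minimal sketch of normalizing both sides the same way follows; `normalize` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of whitespace and trim both ends,
# optionally reversing the field order (for listM.txt lines).
sub normalize {
    my ( $line, $reverse ) = @_;
    my @fields = split ' ', $line;
    @fields = reverse @fields if $reverse;
    return join ' ', @fields;
}

my $l_line = "D9PM 40 A3WX \r\n";    # trailing space and CRLF, as copy-paste can leave
my $m_line = "A3WX 40 D9PM\n";

# The raw listL line does not equal the reversed listM key...
print "raw keys match\n" if $l_line eq join( ' ', reverse split ' ', $m_line );

# ...but the normalized forms do.
print "normalized keys match\n" if normalize($l_line) eq normalize( $m_line, 1 );
```

Using the normalized string as the hash key for both files keeps the comparison independent of how the input was typed or pasted.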


    Kenosis
    User

    Feb 27, 2014, 8:29 PM

    Post #8 of 15 (1692 views)
    Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

    Given listL.txt:

    Code
    D9PM 40 L7MRQ 
    D9PM 40 A3WX
    D9PM 80 Q5BAL
    D9PM 80 Q5BAL
    D9PM 40 A3WX

    And listM.txt:

    Code
    A3WX 40 R5QRC 
    A3WX 40 D9PM
    A3WX 80 L2AFT
    A3WX 40 D9PM
    L7MRQ 40 D9PM

    And the script, with minor modifications in the printing routine (and it sorts by the first column):

    Code
    use strict; 
    use warnings;

    my ( $listL, $listM, %hash, $reverse ) = @ARGV;

    while (<>) {
        my @fields = $reverse ? reverse split : split;
        $hash{"@fields"}{$ARGV}++;
        $reverse++ if eof;
    }

    for my $key ( sort { ( split ' ', $a )[0] cmp ( split ' ', $b )[0] } keys %hash ) {
        # Skip keys seen only in listM; otherwise each prints as an empty line
        next unless exists $hash{$key}{$listL};
        print $key;
        print ' YES' if keys %{ $hash{$key} } == 2;
        print ' DUP' if exists $hash{$key}{$listM} and $hash{$key}{$listM} > 1;
        print "\n";
    }

    Output:

    Code
    D9PM 80 Q5BAL 
    D9PM 40 L7MRQ YES
    D9PM 40 A3WX YES DUP

    Hopefully, this is at least incrementally closer to what you're after...


    (This post was edited by Kenosis on Feb 27, 2014, 8:31 PM)


    stuckinarut
    Novice

    Feb 27, 2014, 9:04 PM

    Post #9 of 15 (1684 views)
    Re: [Kenosis] Mindbending re-tweak challenge to compare files in reverse order & flag output

    Hello again, Kenosis:

    Gosh, THANKS for your extra efforts.

    The output was exactly as in your post!

    I then tried the (not yet fully compiled) larger data lists so far, and the first line and a few others were blank. Strange.

    I then imported the (output) .txt file into Excel to do the final sorting (1st Column A, then Column B, then Column C - all A-Z - smallest to largest). In some cases, the last character of the first column got put in the second column of the output, and the 2nd digit (zero) in the second column ended up in the 3rd column of the output. Very strange. Apparently Excel does not like the file even though it appears fine viewing in Notepad ;-( Maybe all 3 columns could be sorted in the Perl script?

    I'm still adding to my 'L' and 'M' data lists via a LOT of 'manual' manipulation in Excel, because the 3 columns of data needed from about 40 different source files are NOT in the same order in the source file lines, and am deleting all the unnecessary columns/fields first. I'm about worn out, and still only half done.

    This is a 'Hobby' group (Non-Commercial) data summary that I need to finish tomorrow, so it's gonna be a lonnng, lonnnng night. What you have kindly provided should help if I can figure out what's going whacko in Excel.

    Thanks so much again (and to Chris for your efforts too).

    Regards,

    -stuckinarut
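On the question of sorting all three columns in the Perl script itself: a minimal sketch, assuming the second column is always numeric (as with 40 and 80 in the samples), the other two columns are compared as text, and blank lines have already been filtered out. The name `by_three_columns` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical comparator: column 1 as text, then column 2 as a number,
# then column 3 as text. Any trailing flags (YES, DUP) are ignored.
sub by_three_columns {
    my @x = split ' ', $a;
    my @y = split ' ', $b;
    return $x[0] cmp $y[0] || $x[1] <=> $y[1] || $x[2] cmp $y[2];
}

my @lines = (
    'D9PM 80 Q5BAL',
    'D9PM 40 L7MRQ YES',
    'A3WX 40 D9PM',
    'D9PM 40 A3WX YES DUP',
);

print "$_\n" for sort by_three_columns @lines;
```

With a few thousand lines, re-splitting inside the comparator is unlikely to matter; for much larger inputs the fields could be split once up front.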


    stuckinarut
    Novice

    Feb 27, 2014, 9:12 PM

    Post #10 of 15 (1682 views)
    Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

    Kenosis:

    Ohhh... I just discovered that the 'DUP' apparently is tied only to the 'L' list 1st column and 'M' list 3rd column? It must actually be a combination of all 3 columns (in other words, exactly the same data - except that in list 'M' the order is reversed).

    The 1st Column data in list 'L' would not be a 'DUP' matching to the 3rd Column in list 'M', unless the 2nd column is also the same. In other words, a separate entry with '40' and separate '80' would simply be 'YES' flags without 'DUP' added.

    Not sure I'm explaining this correctly.

    -stuckinarut


    stuckinarut
    Novice

    Feb 27, 2014, 9:29 PM

    Post #11 of 15 (1681 views)
    Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

    UPDATE:

    I did another import of the Perl output file and tweaked with the delimiter settings and the column data split problem in Excel went away {SIGH OF RELIEF HERE}.

    One 'DUP' entry ended up in the 2nd Column on an otherwise blank line for some reason.

    I'm thinking perhaps the blank line problem *might* be if I did an extra carriage return to a blank line in either the 'L' or 'M' data files (or both). Will check that out.

    FYI,

    -stuckinarut
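A blank line in either input file would be consistent with these symptoms: `split` returns no fields for it, so the line is stored under an empty-string key, and that key later prints as an empty line (with ' DUP' appended if listM.txt holds more than one blank line). A minimal sketch of skipping such lines while reading, assuming the input arrives as a list of strings; `read_fields` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: return one array ref of fields per non-blank line.
sub read_fields {
    my @rows;
    for my $line (@_) {
        next unless $line =~ /\S/;    # skip empty and whitespace-only lines
        push @rows, [ split ' ', $line ];
    }
    return @rows;
}

my @rows = read_fields( "D9PM 40 L7MRQ\n", "\n", "   \n", "D9PM 40 A3WX\n" );
print scalar(@rows), " usable lines\n";    # prints "2 usable lines"
```

In a `while (<>)` loop that also checks `eof` at the bottom, a bare `next` on blank lines would skip that check whenever a file ends with a blank line, so wrapping the per-line work in `if (/\S/) { ... }` is the safer form there.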


    stuckinarut
    Novice

    Feb 28, 2014, 10:15 AM

    Post #12 of 15 (1651 views)
    Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

    Hmmm... if it's not possible to use all 3 columns/fields in a hash to compare for full matches also involving the middle column/field, I think I just figured out a workaround but dunno how to implement it.

    Before going into the hash, 'IF' the 40 or 80 from the 2nd (middle) column/field were combined/merged onto the tailend of each of the 1st and 3rd column's data from both the 'L' and 'M' list, then the desired output could be realized - I think.

    Like each of the following...

    D9PM 40 L7MRQ
    D9PM 40 A3WX
    D9PM 80 Q5BAL
    D9PM 80 Q5BAL
    D9PM 40 A3WX

    ...would first become:

    D9PM40 L7MRQ40
    D9PM40 A3WX40
    D9PM80 Q5BAL80
    D9PM80 Q5BAL80
    D9PM40 A3WX40

    ...and...

    A3WX 40 R5QRC
    A3WX 40 D9PM
    A3WX 80 L2AFT
    A3WX 40 D9PM
    L7MRQ 40 D9PM

    would become:

    A3WX40 R5QRC40
    A3WX40 D9PM40
    A3WX80 L2AFT80
    A3WX40 D9PM40
    L7MRQ40 D9PM40

    The output would look a bit funky, but I think properly yield any desired 'YES' and/or 'YES DUP' flags.

    Trying to think 'outside the bun'. Does this make any sense?

    Thanks.

    -stuckinarut


    stuckinarut
    Novice

    Feb 28, 2014, 10:50 AM

    Post #13 of 15 (1648 views)
    Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

    FYI, I decided to run the new test 'combo' lists, and got uninitialized messages along with the output as:


    Code
    (blank line) 
    D9PM40 L7MRQ40 YES
    D9PM40 A3WX40 YES DUP
    D9PM80 Q5BAL80


    This is getting closer as I race toward the deadline later today.

    Actually, including any 'DUP' lines from list 'L' that have no match in list 'M', and flagging them as 'DUP' too, would be beneficial for the manual analysis of the original source file entries (which contain a lot of other column data). So fine-tuning the final output schema would yield:


    Code
    D9PM40 L7MRQ40 YES 
    D9PM40 A3WX40 YES DUP
    D9PM40 A3WX40 YES DUP
    D9PM80 Q5BAL80 DUP
    D9PM80 Q5BAL80 DUP


    Mind-bending ;-(

    -stuckinarut
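The output schema described in this post can be sketched as follows. This is one possible reading of the requirement, not code from the thread: every listL.txt line is printed (duplicates included), 'YES' is added when the reversed line occurs in listM.txt, and 'DUP' is added when the line occurs more than once in listL.txt or its reversed form occurs more than once in listM.txt. The `flag_lines` helper and the in-memory sample data are assumptions, and the original three-column format is used rather than the concatenated one.

```perl
use strict;
use warnings;

# Hypothetical helper: takes array refs of listL and listM lines and
# returns the flagged listL lines, sorted as plain strings.
sub flag_lines {
    my ( $l_lines, $m_lines ) = @_;
    my ( %l_count, %m_count );

    # Count each listL line as-is and each listM line with its fields reversed.
    $l_count{ join ' ', split ' ', $_ }++         for grep { /\S/ } @$l_lines;
    $m_count{ join ' ', reverse split ' ', $_ }++ for grep { /\S/ } @$m_lines;

    my @out;
    for my $key ( sort keys %l_count ) {
        my $flags = '';
        $flags .= ' YES' if $m_count{$key};
        $flags .= ' DUP' if $l_count{$key} > 1 or ( $m_count{$key} || 0 ) > 1;
        push @out, ("$key$flags") x $l_count{$key};    # repeat duplicated lines
    }
    return @out;
}

my @l = ( 'D9PM 40 L7MRQ', 'D9PM 40 A3WX', 'D9PM 80 Q5BAL', 'D9PM 80 Q5BAL', 'D9PM 40 A3WX' );
my @m = ( 'A3WX 40 R5QRC', 'A3WX 40 D9PM', 'A3WX 80 L2AFT', 'A3WX 40 D9PM', 'L7MRQ 40 D9PM' );

print "$_\n" for flag_lines( \@l, \@m );
```

Because the whole normalized line is the hash key, all three columns take part in the comparison, so entries that differ only in the middle column are counted separately.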


    Kenosis
    User

    Feb 28, 2014, 11:06 AM

    Post #14 of 15 (1644 views)
    Re: [stuckinarut] Mindbending re-tweak challenge to compare files in reverse order & flag output

    ...if it's not possible to use all 3 columns/fields in a hash to compare for full matches also involving the middle column/field...

    When the array of split elements is interpolated and used as a hash key above, a space separates the elements, effectively producing a string. For example:

    Code
    use strict; 
    use warnings;

    my @arr1 = qw/D9PM 40 A3WX/;
    my @arr2 = reverse qw/A3WX 40 D9PM/;

    print 'Match!' if "@arr1" eq "@arr2"; # prints "Match!"


    Thus, there is no need to concatenate any of the fields to test for equivalence.


    stuckinarut
    Novice

    Feb 28, 2014, 11:30 AM

    Post #15 of 15 (1640 views)
    Re: [Kenosis] Mindbending re-tweak challenge to compare files in reverse order & flag output


    In Reply To
    ...if it's not possible to use all 3 columns/fields in a hash to compare for full matches also involving the middle column/field...

    When the array of split elements is interpolated and used as a hash key above, a space separates the elements, effectively producing a string. For example:

    Code
    use strict; 
    use warnings;

    my @arr1 = qw/D9PM 40 A3WX/;
    my @arr2 = reverse qw/A3WX 40 D9PM/;

    print 'Match!' if "@arr1" eq "@arr2"; # prints "Match!"


    Thus, there is no need to concatenate any of the fields to test for equivalence.


    Thank you for the detailed illustration, Kenosis. So apparently two list 'L' lines of D9PM 40 A3WX and D9PM 80 A3WX would NOT be treated as 'DUP' entries. Hmmm. I'll have to re-run the larger list(s) I'm still adding to, but must dash into town briefly.

    Again, I really appreciate your help!

    -stuckinarut

     
     

