Extracting duplicates and ordering by occurrences

 



stuckinarut
Novice

Jul 27, 2014, 2:06 PM

Post #1 of 15 (1226 views)
Extracting duplicates and ordering by occurrences

I'm hoping someone can point out the error(s) in this simple script that I can't figure out how to correct.

The OBJECTIVES of the script:

1. EXTRACT DUPLICATES FROM A SINGLE FIELD LIST

2. SORT AND LIST DUPLICATES IN ORDER BASED ON:

A. (FIRST) THE NUMBER OF OCCURRENCES

B. (THEN) ALPHABETICALLY BY ENTRY ID/NAME
IN ASCENDING ORDER WITHIN EACH GROUP

3. OUTPUT THE RESULTS

INPUT LIST (Sample)

B2B
F44Q
D8L
M93RX
Q3T
D8L
A1ABC
S6YM
Q3T
B2B
A1ABC
Q3T

OUTPUT LIST (Desired)

3 Q3T
2 A1ABC
2 B2B
2 D8L

My Code:


Code
#!/usr/bin/perl 

use strict;
use warnings;

# ================================================================#
# dupers.pl
# Extract Dupes and list in Descending Alphabetic Order With Count
# ================================================================

# Read IN the raw data file list

my %C_list;
open my $C_list, '<', 'Clist.txt' or die "Cannot open Clist.txt: $!";
while (my $line = <$C_list>) {
$line =~ s/\r//g;
$line =~ s/\s+$//;
chomp $line;
$C_list{$line} = 1
}


# Loop, Sort in REVERSE order and Output Dupes with MORE than 1 entry Alphabetically within each Descending ordered
group by count

my $value;

foreach $value (sort {$C_list{$b} cmp $C_list{$a} } keys %C_list) {

if value ( %{ $C_list{$_} } >1 ) {

print "$value $C_list{$value} ";
print "$C_list{$_} \n";

} else {

}
}

# End Of Script


Thanks for any help!

-stuckinarut



Laurent_R
Veteran / Moderator

Jul 27, 2014, 2:46 PM

Post #2 of 15 (1220 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

You forgot to say what is wrong with your program: why it does not fit your purpose, or how its output differs from what you expect. How are we supposed to guess? You left out the most important piece of information we need to help you solve your problem.

Well, I nonetheless have a guess. If I understand correctly what you are trying to do, I would think that your error is probably in this line of code:

Code
$C_list{$line} = 1


I would suggest that you change it to this:

Code
$C_list{$line}++;

so that you actually count the number of times $line occurs in your input, rather than setting the counter to 1 each time.
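
A minimal sketch of that counting idiom, using the sample list from the first post in place of Clist.txt. The helper name count_entries is hypothetical and is not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): tally how often
# each entry occurs in the list it is given.
sub count_entries {
    my @entries = @_;
    my %count;
    # On first sight the slot is undef, and ++ turns it into 1;
    # every later sight of the same entry adds 1 more.
    $count{$_}++ for @entries;
    return %count;
}

my %C_list = count_entries(
    qw(B2B F44Q D8L M93RX Q3T D8L A1ABC S6YM Q3T B2B A1ABC Q3T)
);

print "Q3T: $C_list{Q3T}\n";    # Q3T: 3
print "F44Q: $C_list{F44Q}\n";  # F44Q: 1
```

With "= 1" instead of "++", every key would end up with the value 1 no matter how often it appears, so no entry could ever be recognised as a duplicate.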

There may be some other problem in your program, but I will not investigate further so long as you don't provide us with information about what is wrong in your program. Try to make the change I proposed, and, if this still does not work, please tell us in which respect you are not getting the result you are looking for.

As a side note, you don't need to do this:

Code
     } else {  

}


An else clause is never required in an if conditional; just omit it if you don't need it:

Code
     if value ( %{ $C_list{$_} } >1 ) {  
print "$value $C_list{$value} ";
print "$C_list{$_} \n";
}



stuckinarut
Novice

Jul 27, 2014, 3:08 PM

Post #3 of 15 (1217 views)
Re: [Laurent_R] Extracting duplicates and ordering by occurrences

Hello, Laurent:

Sorry I did not provide more info.

I am getting the following errors:

1. "Global symbol "$value" requires explicit package name". Apparently I am not understanding why this didn't do the job:

my $value;

2. A "Compilation Error" near line 30 that I cannot figure out ;-(

3. YES... I realized after my posting that the } else { ... statement was unnecessary and removed it.

I changed the line you suggested but that did not fix things. I was trying to decide about a counter, but thought perhaps it was maybe needed down in the foreach loop section.

What I want is for the Output to contain ONLY "Duplicates", meaning entries that occur more than once (>1) in the original list. I think maybe I am misunderstanding the keys % (etc) part of things as well. It should be pretty simple but I'm just plain stuck.

Thanks for your reply and efforts.

-stuckinarut




Laurent_R
Veteran / Moderator

Jul 27, 2014, 3:09 PM

Post #4 of 15 (1216 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

Just picked up another problem I did not see at first glance:


Code
     if value ( %{ $C_list{$_} } >1 ) {

I would be surprised if that even compiles.

What you want is probably something like this:

Code
foreach my $value (sort {$C_list{$b} <=> $C_list{$a} } keys %C_list) {
if ($C_list{$value} > 1) { # test the count stored in the hash, not the key itself
print "$C_list{$value} $value\n";
}
}



(This post was edited by Laurent_R on Jul 27, 2014, 3:10 PM)


stuckinarut
Novice

Jul 27, 2014, 3:18 PM

Post #5 of 15 (1214 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

Perhaps this added info will help.

I thought when the (Clist.txt) is read into the Hash, that each line gets a separate value. Yes? No?

So what I am trying to do is obtain only those entries which are "duplicates" and forget about the rest... only to Output the Dupes like this.

Let's say the highest number of duplicate entries is 3. The final output will first be the number 3, followed by a space, then followed by the entry ID/Name from the original list.

If there is more than one duplicate with 3 entries, those entries will then be sorted by ID/Name.

Next would follow any duplicates with 2 entries in the same manner as the above.

The ">1" in the code should then eliminate any single entries, but maybe I am using the wrong values in the line???

I hope this helps.

-stuckinarut



stuckinarut
Novice

Jul 27, 2014, 3:21 PM

Post #6 of 15 (1212 views)
Re: [Laurent_R] Extracting duplicates and ordering by occurrences


Ohhh... I didn't see your new reply before making some other comments.

Just tried your code snippet and that eliminated the previous compilation error I could not figure out; however, I am still getting the "Global symbol "$value" requires explicit package name" error. That has me stumped.

Thank you for your help!

-stuckinarut


stuckinarut
Novice

Jul 27, 2014, 3:49 PM

Post #7 of 15 (1207 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

I commented out:

# my $value;

This got rid of the previous global/package error.

Now down to one "syntax error" near Line 29 that I will keep plodding away on.

-stuckinarut



stuckinarut
Novice

Jul 27, 2014, 6:16 PM

Post #8 of 15 (1196 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

I tried numerous additional code variations with no success, so decided to take a different approach to try and understand what's going on.

So I created a Test Hash with 3 entries, 2 of them being duplicates (but assigned a duplicate count value). The results are what I want EXCEPT are NOT sorted alphabetically within the same group number of duplicates - in this case 2. I used a single (NON-Duplicate) value of 1 for the test entry "NONDUPE" - like the majority of the original list entries will be and to NOT be included in the output:


Code
#!/usr/bin/perl  

#use strict; (REMOVED FOR IT TO WORK!!!)
#use warnings; (REMOVED FOR IT TO WORK!!!)

# DEFINE A TEST HASH (BUT BASED ON NUMBER OF ENTRIES & ANY DUPLICATES)
%C_list = ( "B2B" , 2,
"Q3T" , 3,
"D8L", 2,
"NONDUPE", 1);
# FOREACH LOOP
foreach $value (sort {$C_list{$b} <=> $C_list{$a} } keys %C_list) {

if ( $C_list{$value} != 1 ) {

print "$C_list{$value} $value \n";

}

}


This was my Output:


Code
3 Q3T  
2 D8L
2 B2B


Almost what I'm looking for; however, B2B must come before D8L alphabetically. I have no clue how to make it work WITHOUT removing the 'strict' and 'warnings' lines.

Using the first part of my original code to read in the full Test List, with the counter++ line added, I wanted to see what the Hash looked like:


Code
#!/usr/bin/perl  

use strict;
use warnings;

# ================================================================#
# dupers.pl
# Extract Dupes and list in Descending Alphabetic Order With Count
# ================================================================

# Read IN the raw data file list

my %C_list;
open my $C_list, '<', 'Clist.txt' or die "Cannot open Clist.txt: $!";
while (my $line = <$C_list>) {
$line =~ s/\r//g;
$line =~ s/\s+$//;
chomp $line;
$C_list{$line}++;
}


# PRINT THE HASH
print %C_list;

# End Of Script


Here was the result:


Code
S6YM1A1ABC2Q3T3F44Q1M93RX1D8L2B2B2


Parsed out for better reading:



Code
S6YM 1 
A1ABC 2
Q3T 3
F44Q 1
M93RX 1
D8L 2
B2B 2


So with the counts (and duplicates) being accurate (but numbers trailing compared to desired output), it actually blows my mind how this part actually works :^)
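
On the run-together output above: print %C_list flattens the hash into a plain key, value, key, value list with no separators, and Perl hashes keep their entries in no particular order. A minimal sketch of a more readable dump follows; the helper name dump_counts is hypothetical, and the hash is filled in by hand here rather than read from Clist.txt:

```perl
use strict;
use warnings;

# Counts as reported above, entered by hand for this sketch
my %C_list = (
    S6YM => 1, A1ABC => 2, Q3T => 3, F44Q  => 1,
    M93RX => 1, D8L  => 2, B2B => 2,
);

# Hypothetical helper: build one "key count" line per entry,
# with the keys in alphabetical order so the listing is stable.
sub dump_counts {
    my (%count) = @_;
    my @lines;
    push @lines, "$_ $count{$_}\n" for sort keys %count;
    return @lines;
}

print dump_counts(%C_list);
```

Sorting the keys is only for readability; the hash itself stays unordered.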

HOWEVER... in spite of trying everything I can think of so far, I just can not get the last half of the full script to work without any errors.

ARRRRRRGGGHHHHH ;-(

-stuckinarut


Laurent_R
Veteran / Moderator

Jul 27, 2014, 11:27 PM

Post #9 of 15 (1183 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

Don't remove:

Code
use strict;     
use warnings;


but fix the errors that they show.


Code
my %C_list = ( "B2B" , 2,  
"Q3T" , 3,
"D8L", 2,
"NONDUPE", 1);
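
As a sketch of what that fix looks like for the whole test script: under use strict, every variable has to be declared (typically with my) before it is used, and that includes the loop variable of a foreach. The @dupes array below is an addition for illustration only and is not in the original script:

```perl
use strict;
use warnings;

# Declared with 'my', so 'use strict' accepts the hash
my %C_list = ( "B2B", 2,
               "Q3T", 3,
               "D8L", 2,
               "NONDUPE", 1 );

my @dupes;   # illustration only: collects the lines before printing

# 'my' in the foreach header declares the loop variable in place
foreach my $value (sort { $C_list{$b} <=> $C_list{$a} } keys %C_list) {
    if ( $C_list{$value} != 1 ) {
        push @dupes, "$C_list{$value} $value";
    }
}

print "$_\n" for @dupes;
```

This runs cleanly with strict and warnings enabled; the order of the two entries with a count of 2 is still not guaranteed at this point.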


No time now for the sorting, but it's quite simple.


stuckinarut
Novice

Jul 27, 2014, 11:35 PM

Post #10 of 15 (1183 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

Hours later, and not sure exactly how, but I finally got the entire script working EXCEPT that the ID/Names still don't sort alphabetically within each count group.


Code
#!/usr/bin/perl  

use strict;
use warnings;

# ================================================================#
# dupers.pl
# Extract Dupes and list in Descending Alphabetic Order With Count
# ================================================================

# Read IN the raw data file list

my %C_list;
open my $C_list, '<', 'Clist.txt' or die "Cannot open Clist.txt: $!";
while (my $line = <$C_list>) {
$line =~ s/\r//g;
$line =~ s/\s+$//;
chomp $line;
$C_list{$line}++;
}

# FOREACH LOOP
foreach my $value (sort {$C_list{$b} <=> $C_list{$a} } keys %C_list) {

if ( $C_list{$value} != 1 ) {

print "$C_list{$value} $value \n";

}

}

# End Of Script


The INPUT and OUTPUT Lists:


Code
 
INPUT (Clist.txt)
B2B
F44Q
D8L
M93RX
Q3T
D8L
A1ABC
S6YM
Q3T
B2B
A1ABC
Q3T

OUTPUT
3 Q3T
2 A1ABC
2 D8L
2 B2B

BUT the OUTPUT needs to be:

3 Q3T
2 A1ABC
2 B2B
2 D8L


I'm wondering if it might take TWO Loops to get the desired results ???

-stuckinarut


stuckinarut
Novice

Jul 27, 2014, 11:38 PM

Post #11 of 15 (1182 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

Hi, Laurent:

I was typing my last post while you made a post so didn't see your new info until I pushed the POST button :^)

Yes, the full sorting part is the unknown.

-stuckinarut


stuckinarut
Novice

Jul 27, 2014, 11:50 PM

Post #12 of 15 (1180 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

HA! I just discovered I can use essentially the same code to do another needed task - to produce a list of "Uniques" by changing only this line to:


Code
if ( $C_list{$value} == 1 ) {


Which outputs:


Code
1 S6YM
1 F44Q
1 M93RX


But still not sorted alphabetically ;-(

FYI, each of the separate INPUT lists I will be running will have anywhere from about 3,000 to 6,000 entries, with a typical "DUPE" rate of about 30%.

-stuckinarut

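
For the "Uniques" list, every entry has the same count of 1, so ordering the keys alphabetically is enough. A minimal sketch, assuming the counts have already been collected; the helper name uniques is hypothetical and the hash is filled in by hand:

```perl
use strict;
use warnings;

# Counts for the sample list, entered by hand for this sketch
my %C_list = (
    S6YM => 1, A1ABC => 2, Q3T => 3, F44Q  => 1,
    M93RX => 1, D8L  => 2, B2B => 2,
);

# Hypothetical helper: keys seen exactly once, in alphabetical order
sub uniques {
    my (%count) = @_;
    my @once = grep { $count{$_} == 1 } keys %count;
    my @sorted = sort { $a cmp $b } @once;
    return @sorted;
}

print "1 $_\n" for uniques(%C_list);
# 1 F44Q
# 1 M93RX
# 1 S6YM
```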


stuckinarut
Novice

Jul 28, 2014, 12:18 AM

Post #13 of 15 (1171 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

Hmmm... now I'm wondering if some type of Array is needed to make it all work correctly ???

To clarify the Sorting order:

#1 = By number of Dupe occurrences in DESCENDING Order

#2 = (Then) by ID/NAME within each Dupe Quantity Group in ASCENDING Order

It's after Midnight here so time to sleep on this.

-stuckinarut


Laurent_R
Veteran / Moderator

Jul 28, 2014, 12:29 AM

Post #14 of 15 (1169 views)
Re: [stuckinarut] Extracting duplicates and ordering by occurrences

OK, sorry I could not finish earlier. I was on the train to work, typing on a mobile device, and had to stop when the train reached its final stop.

For the sorting, simply try this:


Code
foreach my $value (sort {$C_list{$b} <=> $C_list{$a} || $a cmp $b } keys %C_list) { # ...

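The way it works is as follows: it first compares the values, i.e. the numbers of occurrences, numerically (the "$C_list{$b} <=> $C_list{$a}" part, with $b before $a to get descending order). If the values are equal, <=> returns 0, which is false, so the || operator falls through to "$a cmp $b" and compares the keys alphabetically.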
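
Putting the pieces together, here is a minimal end-to-end sketch of the two-level sort. It uses the sample list from this thread in place of Clist.txt, and the helper name ranked_dupes is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: "count name" lines for entries seen more than once,
# highest count first, names in ascending order within equal counts.
sub ranked_dupes {
    my (%count) = @_;
    my @lines;
    foreach my $name (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
        next if $count{$name} < 2;           # skip entries seen only once
        push @lines, "$count{$name} $name";
    }
    return @lines;
}

# Count the sample entries (stands in for reading Clist.txt)
my %C_list;
$C_list{$_}++ for qw(B2B F44Q D8L M93RX Q3T D8L A1ABC S6YM Q3T B2B A1ABC Q3T);

print "$_\n" for ranked_dupes(%C_list);
# 3 Q3T
# 2 A1ABC
# 2 B2B
# 2 D8L
```

A single sort with a chained comparison is enough here; no second loop or extra array is needed.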


stuckinarut
Novice

Jul 28, 2014, 4:53 AM

Post #15 of 15 (1163 views)
Re: [Laurent_R] Extracting duplicates and ordering by occurrences


Ohhhhhhh, Laurent, THANK YOU VERY MUCH for another very Educational Experience !!! I have learned several new things here.

I'm back up on a few hours of sleep and think part of a previous error was that I had not fully commented OUT a line of code that didn't work. At 70 years old, my eyesight is not the best ;-(

When I finally decided to copy the bottom section of the code that partially worked in my basic ground-up little Hash Test and join it to the full front-end $C_list section, I learned more about Hashes, and that the fewer lines of code the better, without a bunch of # lines. Also, to look especially for any "curly braces" issues trying to hide from me.

Your "explanation" of the final code snippet solution was extremely helpful to understand HOW it works !!!

So you travel to work on a train! As a youth, I was fascinated by trains, and had an elaborate model train set. Then later, as a teenager, I traveled ~ 2,000 miles round trip by train several Summers to visit relatives. But there were no computers, no cell phones, no tablets and no Internet back then. Times have changed !!!

Thank you again. Now to analyze a lot of Clist.txt file duplicates.

Regards,

-stuckinarut

 
 

