Counting unique lines input from data file

 



Pegleg Pete
Novice

Mar 1, 2002, 11:01 PM

Post #1 of 19 (2767 views)
Counting unique lines input from data file

Yes, I am very new (only a few days). I fell into Perl (and I can't get up...) and am really impressed by how such a small amount of code can do so much. I want to read in a text file and print out only the unique lines; I also want a count of the "repeats". I put something together from examples in a couple of Perl books, and it seems to be working, except that I have not yet addressed the counting issue. My background in programming (many moons ago) makes me want to create a two-dimensional array, but what I have read about Perl over the last couple of days suggests I am overthinking this and there is probably a two-word solution.

open(MYFILE, "logfile.txt") || die "opening testfile: $!";
@Unique=();
while(<MYFILE>) { unless($i{$_}++) { push(@Unique, $_) } };
print @Unique;
close(MYFILE);



I appreciate your response. I know it is not a very challenging question.


Paul
Enthusiast

Mar 2, 2002, 2:31 AM

Post #2 of 19 (2762 views)
Re: [Pegleg Pete] Counting unique lines input from data file

That looks ok. I guess you could do:

while (<MYFILE>) { grep { /\Q$_/ } @Unique ? next : push @Unique, $_ }

...but it is long :)


mhx
Enthusiast / Moderator

Mar 2, 2002, 5:20 AM

Post #3 of 19 (2760 views)
Re: [RedRum] Counting unique lines input from data file


In Reply To
That looks ok. I guess you could do:

while (<MYFILE>) { grep { /\Q$_/ } @Unique ? next : push @Unique, $_ }


Nope, that wouldn't work because the precedence is actually:

[perl]
while ( <MYFILE> ) { grep { /\Q$_/ } (@Unique ? next : push @Unique, $_) }
[/perl]

So as soon as @Unique is not empty, it'll just move on to the next iteration. Thus, you always end up with the first line from MYFILE in the @Unique array.
And even if the precedence was:

[perl]
while ( <MYFILE> ) { (grep { /\Q$_/ } @Unique) ? next : push @Unique, $_ }
[/perl]

that wouldn't work because in a grep $_ is aliased with each element of the list. So that would basically be checking if $_ matches the string stored in $_. The grep would always return a nonempty list as soon as @Unique isn't empty and we would again end up with the first line of MYFILE only.
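The aliasing is easy to see in isolation. Here is a minimal sketch; the two helper names are made up for illustration, and a small in-memory list stands in for @Unique:

```perl
use strict;
use warnings;

my @seen = ("alpha\n", "beta\n");

# Hypothetical helper: mimics the loop body where the current line lives in $_.
# Inside grep, $_ is re-aliased to each element of @seen, so /\Q$_/ ends up
# comparing every element with itself.
sub matches_with_default_var {
    local $_ = shift;
    return scalar grep { /\Q$_/ } @seen;
}

# Hypothetical helper: copying the line into a named variable first keeps it
# visible inside the grep block.
sub matches_with_named_var {
    my $line = shift;
    return scalar grep { /\Q$line/ } @seen;
}

print matches_with_default_var("gamma\n"), "\n";   # prints 2: every element matches itself
print matches_with_named_var("gamma\n"), "\n";     # prints 0: "gamma" is not in the list
```

With $_, every stored element "matches", whatever the current line is; with a named variable the test actually looks at the line that was read.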

A corrected version would be:

[perl]
while ( $line = <MYFILE> ) { (grep /\Q$line/, @unique) ? next : push @unique, $line }
[/perl]

However, this would still be awfully slow. I've set up a benchmark test with 5 repetitions and a 2 MB logfile (with almost no repeating lines). While Pete's original version was finished in less than 2 seconds, the corrected version of your code is still running. I started the benchmark test before I started to write this reply and I'm gonna wait until it's finished now. So, well, it took almost 2 hours. But at least, it yielded the same output.
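To give a rough feel for where the time goes, here is a sketch that counts the work instead of timing it. It is a simplified stand-in, not the benchmark I ran: it compares with eq rather than a regex, and it uses 2000 generated lines (an assumption) instead of the 2 MB logfile.

```perl
use strict;
use warnings;

# Stand-in data: 2000 distinct "lines", i.e. almost no repeats.
my @input = map { "line $_\n" } 1 .. 2000;

# Linear scan: each new line is compared against everything kept so far.
my @by_scan;
my $comparisons = 0;
for my $line (@input) {
    my $found = 0;
    for my $kept (@by_scan) {
        $comparisons++;
        if ($kept eq $line) { $found = 1; last }
    }
    push @by_scan, $line unless $found;
}

# Hash lookup: one probe per line, no matter how many lines are already kept.
my (%seen, @by_hash);
for my $line (@input) {
    push @by_hash, $line unless $seen{$line}++;
}

printf "scan: %d comparisons, hash: %d lookups\n", $comparisons, scalar @input;
```

With no repeats the scan does 0 + 1 + ... + 1999 = 1,999,000 comparisons against 2,000 hash lookups, and that gap keeps widening as the file grows.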

-- mhx

At last with an effort he spoke, and wondered to hear his own words, as if some other will was using his small voice. "I will take the Ring," he said, "though I do not know the way."

-- Frodo



Paul
Enthusiast

Mar 2, 2002, 7:29 AM

Post #4 of 19 (2753 views)
Re: [mhx] Counting unique lines input from data file

2 hours for a 2MB file?

Are you using a Commodore 64? :D

In any case it was just something off the top of my head. I hadn't even tested it :)


(This post was edited by RedRum on Mar 2, 2002, 7:31 AM)


mhx
Enthusiast / Moderator

Mar 2, 2002, 7:45 AM

Post #5 of 19 (2747 views)
Re: [RedRum] Counting unique lines input from data file


In Reply To
2 hours for a 2MB file?


It's not that bad. As I said, I used 5 repetitions for benchmarking, so a single pass over the 2 MB file would actually take only 24 minutes or so. ;)


In Reply To
Are you using a Commodore 64? :D


Hell no! I've been running that test under Linux with perl 5.6.1 (gcc, -O3, no -DDEBUGGING), on a P-III notebook @ 1 GHz, bus clock @ 133 MHz, 512 MB of RAM and it continuously used up at least 90% of my CPU...


In Reply To
In any case it was just something off the top of my head. I hadn't even tested it :)


Never mind. 8-)

-- mhx




Pegleg Pete
Novice

Mar 4, 2002, 8:16 AM

Post #6 of 19 (2720 views)
Re: [mhx] **COUNTING** unique lines input

Thanks for the response-- however, I am still hoping to get *a count* of the lines when they are not unique (my code, such as it is, already prints out the lines). I want to display ALL lines; I just want to print out how many times a sentence appears when it is not unique-- e.g.,

This line is unique. (unique)
How now brown cow? (3 times)
Now is the time for all good men... (5 times)
Looking forward to hearing from you (unique)
.
.
.
My favorite album is "DarkSide of the Moon" (unique)


etc.

The largest input file will be around 50K, tops. Speed not important (to me). There are perhaps 50 characters per line followed by [end line], there may be say 250 lines in the file.


Thanks for any effort.


mhx
Enthusiast / Moderator

Mar 4, 2002, 8:38 AM

Post #7 of 19 (2719 views)
Re: [Pegleg Pete] **COUNTING** unique lines input

Ok, try this:

[perl]
#!/usr/bin/perl -w
use strict;

my $file = 'logfile.txt';
my(@lines, %count);

open MYFILE, $file or die "cannot open $file: $!\n";
while( <MYFILE> ) {
    chomp;
    push @lines, $_;    # keep every line, in input order
    $count{$_}++;       # tally how often each line occurs
}
close MYFILE;

foreach( @lines ) {
    printf "%s (%s)\n", $_, $count{$_} > 1 ? "$count{$_} times" : 'unique';
}
[/perl]

Hope that's what you were looking for.

-- mhx




Pegleg Pete
Novice

Mar 4, 2002, 5:31 PM

Post #8 of 19 (2708 views)
Re: [mhx] **COUNTING** unique lines input

Well-- your code does a bang up job of counting the lines, and my code does real good at printing out each line just once without counting. The problem is, I want the repeated lines to be printed just the one time, with the total appended, e.g.

This line is unique

This line isn't unique


Your code counts correctly but the last line is printed six times.

This line is unique (unique)

This line isn't unique (6 times)

This line isn't unique (6 times)

This line isn't unique (6 times)

This line isn't unique (6 times)

This line isn't unique (6 times)

This line isn't unique (6 times)

Am I reading your code correctly when I see each line pushed onto the array every time? And my code pushes it only when it is unique? And it isn't a simple matter of putting the two together :P

Is the solution a two-dimensional array?


Pegleg Pete
Novice

Mar 4, 2002, 6:01 PM

Post #9 of 19 (2705 views)
Re: [Pegleg Pete] **COUNTING** unique lines input

This is a unique line
This line is not unique and there are two of them.
This line is not unique either and there are three of them
This line is another unique line.
This line is not unique either and there are three of them
This line is not unique either and there are three of them
Let's pretend there are 53 of these.
This line is not unique and there are two of them.

should come out (something like):

This is a unique line. (unique)
This line is not unique and there are two of them. (2 times)
This line is not unique either and there are three of them. (3 times)
Let's pretend there are 53 of these. (53 times)
This line is another unique line. (unique)


I am feeding in a logfile and the unique lines get buried in the repeated lines.
The repeated lines are probably not important, but it would be helpful to know
how many times something occurred rather than just knowing that the event occurred.



I hope I am not beating this into the ground-- I wasn't certain if I was clear that I needed both.


mhx
Enthusiast / Moderator

Mar 4, 2002, 10:43 PM

Post #10 of 19 (2695 views)
Re: [Pegleg Pete] **COUNTING** unique lines input

This will not print repeating lines:

[perl]
#!/usr/bin/perl -w
use strict;

my $file = 'logfile.txt';
my(@lines, %count);

open MYFILE, $file or die "cannot open $file: $!\n";
while( <MYFILE> ) {
    chomp;
    # $count{$_}++ returns the old count, so only the first sighting is pushed
    push @lines, $_ unless $count{$_}++;
}
close MYFILE;

foreach( @lines ) {
    printf "%s (%s)\n", $_, $count{$_} > 1 ? "$count{$_} times" : 'unique';
}
[/perl]

If you don't even need to preserve the order of the lines:

[perl]
#!/usr/bin/perl -w
use strict;

my $file = 'logfile.txt';
my %count;

open MYFILE, $file or die "cannot open $file: $!\n";
while( <MYFILE> ) {
    chomp;
    $count{$_}++;
}
close MYFILE;

foreach( keys %count ) {
    printf "%s (%s)\n", $_, $count{$_} > 1 ? "$count{$_} times" : 'unique';
}
[/perl]
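Since keys %count comes back in no particular order, one option for a logfile is to sort by count so the most frequent lines float to the top. This is only a sketch: report_by_frequency is a made-up helper name, and it takes already chomped lines as arguments instead of reading logfile.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of chomped lines and returns report
# strings, most frequent first, with ties broken alphabetically.
sub report_by_frequency {
    my %count;
    $count{$_}++ for @_;
    return map {
        sprintf "%s (%s)", $_, $count{$_} > 1 ? "$count{$_} times" : 'unique'
    } sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
}

print "$_\n" for report_by_frequency(
    'How now brown cow?', 'This line is unique.', 'How now brown cow?',
);
```

For the three sample lines this prints "How now brown cow? (2 times)" first and "This line is unique. (unique)" second.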

-- mhx




Pegleg Pete
Novice

Mar 5, 2002, 7:36 AM

Post #11 of 19 (2688 views)
Re: [mhx] **COUNTING** unique lines input

Works great--you're a peach. I'll play with this and see if I can do a couple more things.


Paul
Enthusiast

Mar 5, 2002, 8:05 AM

Post #12 of 19 (2684 views)
Re: [mhx] **COUNTING** unique lines input

Hmm you are obsessed with printf :)


mhx
Enthusiast / Moderator

Mar 5, 2002, 8:28 AM

Post #13 of 19 (2681 views)
Re: [RedRum] **COUNTING** unique lines input


In Reply To
Hmm you are obsessed with printf :)


Not really. ;) I attempt to use it where appropriate. In the above case I used it because it seemed more readable to me.
And hey, I'm writing C all day, it's more surprising that I'm frequently using print... 8-)

-- mhx




yapp
User

Mar 6, 2002, 6:26 AM

Post #14 of 19 (2666 views)
Re: [mhx] **COUNTING** unique lines input

Well then, is printf better or faster than using simple scalars in Perl strings?

Yet Another Perl Programmer

_________________________________
~~> www.codingdomain.com (http://www.codingdomain.com) <~~
More than 3500 X-Forum Downloads (http://www.codingdomain.com/cgi-perl/downloads/x-forum)! 8-)


Paul
Enthusiast

Mar 6, 2002, 6:52 AM

Post #15 of 19 (2663 views)
Re: [yapp] **COUNTING** unique lines input

Not sure but the perl peeps say print is much more efficient for printing than printf.


(This post was edited by RedRum on Mar 6, 2002, 6:53 AM)


mhx
Enthusiast / Moderator

Mar 6, 2002, 8:26 AM

Post #16 of 19 (2659 views)
Re: [RedRum] **COUNTING** unique lines input


In Reply To
Not sure but the perl peeps say print is much more efficient for printing than printf.


That's correct. So in applications where speed is the primary goal, forget Perl's printf.

[perl]
#!/usr/bin/perl -w
use strict;
use Benchmark;

my $bigstring = <<'EOS';
meis novis da XLIV tum X. kcahere sic face cis rere sic da loco huic
his decapitamentum. meo novo. dum hoc fac sic praecidementum huic da
meo nexto. cum novum tum nextum serementum novo da cis novum redde
EOS

my $number = 0xDEADBEEF;
my $float = 3.1415926535;

timethese( 100000, {
    a => sub {
        print STDERR q(Here's a string: ) . $bigstring .
                     q(A number: ) . $number . "\n" .
                     q(And a float: ) . $float . "\n";
    },
    b => sub {
        print STDERR
            qq(Here's a string: ${bigstring}A number: $number\nAnd a float: $float\n);
    },
    c => sub {
        printf STDERR q(Here's a string: %sA number: %u%cAnd a float: %.10f%c),
                      $bigstring, $number, 10, $float, 10;
    },
    d => sub {
        printf STDERR qq(Here's a string: %sA number: %u\nAnd a float: %.10f\n),
                      $bigstring, $number, $float;
    },
});
[/perl]

Code
Benchmark: timing 100000 iterations of a, b, c, d... 
a: 2 wallclock secs ( 1.87 usr + 0.47 sys = 2.34 CPU) @ 42735.04/s (n=100000)
b: 1 wallclock secs ( 1.71 usr + 0.48 sys = 2.19 CPU) @ 45662.10/s (n=100000)
c: 8 wallclock secs ( 6.16 usr + 0.63 sys = 6.79 CPU) @ 14727.54/s (n=100000)
d: 7 wallclock secs ( 6.04 usr + 0.63 sys = 6.67 CPU) @ 14992.50/s (n=100000)


That makes print about 3 times as fast as printf in this test, going by CPU time (roughly 43,000 to 46,000 calls per second versus about 15,000).
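For anyone who wants to repeat the comparison without flooding the terminal, here is a smaller sketch. It is an assumption-laden variation, not the test above: it writes to an in-memory filehandle rather than STDERR, drops the big string, and uses cmpthese from the same Benchmark module, so the absolute numbers will differ.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my ($number, $float) = (0xDEADBEEF, 3.1415926535);

# Both variants produce the same text; the handle is passed in so each one
# can be pointed at an in-memory buffer.
my %variant = (
    print  => sub { my $fh = shift; print  {$fh} "A number: $number\nAnd a float: $float\n" },
    printf => sub { my $fh = shift; printf {$fh} "A number: %u\nAnd a float: %.10f\n", $number, $float },
);

# Sanity check first: capture one call of each and confirm the text matches.
my %out;
for my $name (keys %variant) {
    $out{$name} = '';
    open my $fh, '>', \$out{$name} or die "cannot open in-memory handle: $!";
    $variant{$name}->($fh);
    close $fh;
}
print $out{print} eq $out{printf} ? "same output\n" : "outputs differ\n";

# Then compare speed, writing into a scratch in-memory buffer.
open my $sink, '>', \my $scratch or die "cannot open in-memory handle: $!";
cmpthese( 200_000, {
    map { my $code = $variant{$_}; ( $_ => sub { $code->($sink) } ) } keys %variant
} );
```

cmpthese prints a small table with the rate of each variant and the relative difference between them, which saves doing the division by hand.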

-- mhx




yapp
User

Mar 6, 2002, 11:10 PM

Post #17 of 19 (2651 views)
Re: [mhx] **COUNTING** unique lines input

So putting a scalar in a string is even faster than using concatenation!!
Perl is a lot smarter than I thought before.



Pegleg Pete
Novice

Mar 7, 2002, 10:54 AM

Post #18 of 19 (2643 views)
Re: [yapp] **COUNTING** unique lines input

Thank you very much, all. I've learned much over the last couple of days from your posts.

If you don't mind, I'd like to extend it a bit-- by looking at or asking about another structure (that, just coincidentally, could be used in this program)...

So far it's great: you display the unique lines, and display the repeated lines with the number of times they occur. But suppose that, after looking at this display, I want to ignore a number of lines (not push them onto the array)... say there were ten lines I see all the time and want to ignore.

And-- (maybe this isn't a solution, but I was thinking...) I know I could use a search routine to look at the front part of the string (the first ten words would be enough..) and I could couple it with a loop structure (if not "this line" or if not "that line" or if not "etcetera"... say I have ten lines). Would it be better to keep all the lines I want to ignore in a separate file which other users could modify? (Or that I could more easily modify, not that it would be hard to change the original program.)

I am thinking it would just be a cleaner solution, not that I know anything about what a clean program is. A few seconds in speed doesn't matter to me.

Does my request make sense?


Pegleg Pete
Novice

Mar 8, 2002, 12:24 PM

Post #19 of 19 (2633 views)
Re: [Pegleg Pete] **COUNTING** unique lines input

e.g.:

# push lines onto @lines, but ignore these:
if (($TheLine ne "")              # blank lines
and ($TheLine !~ /^"/ )           # beginning with quote marks (insults)
and ($TheLine !~ /^.Pause./)      # beginning with [Pause] (pauses)
and ($TheLine !~ /^--/))          # beginning with hyphens (date marks)

# ## etcetera, say 10 different lines I see and want to ignore

{push @lines, $_ unless $count{$_}++; }
}


I have looked at other examples (FAQs, et al.) and it looks like other people aren't shy about using extended if structures; I just didn't want to get a ticket from the Perl Protocol Police.
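For the separate-file idea, one possible shape is sketched below. Everything in it is an assumption made for illustration: the helper names, the idea of one literal prefix per line in a file (say ignore.txt), and the sample log lines. The prefixes are passed in as a plain list here so the example runs without any files.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex from a list of literal line prefixes.
# In practice the prefixes might be read from a file such as ignore.txt
# (an assumed name), one prefix per line.
sub build_ignore_regex {
    my @prefixes = grep { length } @_;             # skip empty entries
    return undef unless @prefixes;
    my $alternatives = join '|', map { quotemeta($_) } @prefixes;
    return qr/^(?:$alternatives)/;
}

# Hypothetical helper: count lines in first-seen order, skipping blank lines
# and anything that starts with an ignored prefix.
sub count_lines {
    my ($ignore, @input) = @_;
    my (@lines, %count);
    for my $line (@input) {
        next if $line eq '';
        next if defined $ignore and $line =~ $ignore;
        push @lines, $line unless $count{$line}++;
    }
    return map { [ $_, $count{$_} ] } @lines;
}

my $ignore = build_ignore_regex('"', '[Pause]', '--');
for my $entry (count_lines($ignore, '--Mar 8--', 'disk full', '[Pause]', 'disk full', 'login ok')) {
    my ($line, $n) = @$entry;
    printf "%s (%s)\n", $line, $n > 1 ? "$n times" : 'unique';
}
```

With the sample input this prints "disk full (2 times)" and "login ok (unique)". Because the prefixes go through quotemeta, whoever edits the list can type [Pause] literally without knowing any regex syntax.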
