Loading words from one .txt file, searching another for these words and printing the frequency

 



TJC
Novice

Jan 15, 2014, 3:39 AM

Post #1 of 7 (1898 views)

Hi,

I am attempting to take a .txt file with the following format:

>Title1
Word1
>Title2
Word2
>Title3
Word3

And so on. I want to load the titles (any line that begins with '>'; the other lines, such as Word1, are ignored) as a word list, then search another .txt file for those titles and print the frequency of each one. To clarify:

If my initial file contains:

>Apple
Word1
>Grape
Word2
>Orange
Word3

And the .txt file I'm searching within is:

Apple Grape Avocado Strawberry
Orange Apple
Grape Raspberry Apple

The printed output is:

Apple occurs: 3
Grape occurs: 2
Orange occurs: 1

I've seen various snippets of code around which take wordlists and output word counts/frequencies such as:


Code
sub by_count { 
$count{$b} <=> $count{$a};
}

open(INPUT, "<Input.txt");
open(OUTPUT, ">WordFreqs.txt");
$bucket='red|blue|green';

while(<INPUT>){
@words = split(/\s+/);
foreach $word (@words){
if($word=~/($bucket)/io){
$count{$1}++;}
}
}
foreach $word (sort by_count keys %count) {
print OUTPUT "$word occurs $count{$word} times\n";
}

close INPUT;
close OUTPUT;


But I am unsure, first, how to populate the word list from a .txt file rather than specifying the words within the code and, second, how to add a word to the list only if it begins with '>'.

If anybody has any suggestions/tips or any snippets of code I could work on modifying - that would be great - thank you!


(This post was edited by TJC on Jan 15, 2014, 5:07 AM)


zapzap
User

Jan 15, 2014, 5:00 AM

Post #2 of 7 (1876 views)
Re: [TJC] Loading words from one .txt file, searching another for these words and printing the frequency [In reply to]

Your approach looks pretty good.


Code
#!/usr/bin/perl 
use strict;
use warnings;

open my $in, '<','input.txt' or die "Unable to open input file: $!\n";
open my $out, '>', 'frequency.txt' or die "Unable to create file for writing: $!\n";
open my $titles, '<', 'titles.txt' or die "Unable to open titles file: $!\n";

my @Titles;
# Search through titles file for titles and store in titles array
# We know every title starts with '>' so just search for that with
# a regex, perl is awesome for these situations.
while ( <$titles> ) {
chomp;
push(@Titles,$1) if /^>([[:alpha:]]+)/;
}

# So now we have our titles array created
# Now we can proceed to search through the input file and look
# for either one of the titles. But first we need to create
# a regular expression of the titles to search for in the input file
# I'm going to use your approach to create the string
# title1|title2|title3 for the regex using the join function with
# the '|' as the delimiter
my $str = join('|',@Titles);
my $re = qr/^(?:$str)$/; # Precompiled and anchored so only whole words match, not substrings

my %count;

# Loop through input file
while ( <$in> ) {
chomp;
my @words = split(/ /,$_);
for my $word (@words) {
$count{$word}++ if $word =~ $re;
}
}

for my $key ( sort { $a cmp $b } keys %count ) {
print $out "$key occurs $count{$key} times\n";
}

close $in;
close $out;
close $titles;


Personally, I wouldn't do anything like that.
I would try something like this. I'm sure there are others that can do better and I wouldn't mind witnessing a different approach.



Code
#!/usr/bin/perl 
use strict;
use warnings;

open my $in, '<','input.txt' or die "Unable to open input file: $!\n";
open my $out, '>', 'frequency.txt' or die "Unable to create file for writing: $!\n";
open my $titles, '<', 'titles.txt' or die "Unable to open titles file: $!\n";

my %count;
my $str = join('|',map{ /^>([[:alpha:]]+)/ } <$titles>);
while(<$in>) {
chomp;
for(split(/ /)) {
$count{$_}++ if /^(?:$str)$/; # anchored so only whole words match
}
}

print $out map { "$_ occurs $count{$_} times\n" } sort { $a cmp $b } keys %count;
close $in;
close $out;
close $titles;


I hope this helps, if it doesn't, I apologize.
zap

P.S. One small note, I had difficulty understanding your goal because the counts for the fruits are off. There are 2 Grapes.


(This post was edited by zapzap on Jan 15, 2014, 5:03 AM)
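
A note on the alternation regex used in both scripts: when a word list is joined with '|', anchoring decides whether partial matches count. The sketch below is only an illustration under assumptions: the title list and sample words are made up, and count_matches is a hypothetical helper, not something from the posts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical title list, standing in for the words read from titles.txt
my @titles = qw(Apple Grape Orange);
my $alternation = join '|', map { quotemeta } @titles;

my $loose    = qr/$alternation/;         # matches anywhere inside a word
my $anchored = qr/^(?:$alternation)$/;   # matches whole words only

# Hypothetical helper: how many of the given words match the regex
sub count_matches {
    my ($re, @words) = @_;
    return scalar grep { $_ =~ $re } @words;
}

# "Apples" and "Grapefruit" contain a title but are not titles themselves
my @words = qw(Apple Apples Grapefruit Orange);
printf "loose: %d, anchored: %d\n",
    count_matches($loose, @words), count_matches($anchored, @words);
```

The quotemeta call is there in case a title ever contains regex metacharacters; with purely alphabetic titles it changes nothing.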


TJC
Novice

Jan 15, 2014, 5:55 AM

Post #3 of 7 (1867 views)
Re: [zapzap] Loading words from one .txt file, searching another for these words and printing the frequency [In reply to]

Thank you very much for explaining each line/block!

Your second approach is interesting - thank you for that.

Out of interest: is it possible to specify the word separator as a tab rather than a space? For example, if the word I'm searching for is on a line such as:


Code
Word1    Word2    SearchWord    Word4


With tab delimitation, can Perl return only the SearchWord rather than the entire line, as your script above currently does?


(This post was edited by TJC on Jan 15, 2014, 6:33 AM)
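
The tab question is not answered directly in the thread, so here is a minimal sketch of one way it could be done: split each line on /\t/ and keep only the fields that are in the search list. The %wanted hash and the matching_fields sub are hypothetical names used for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical search list; in the thread it would be built from titles.txt
my %wanted = map { $_ => 1 } qw(SearchWord);

# Return only the tab-separated fields of a line that are in the search list
sub matching_fields {
    my ($line) = @_;
    chomp $line;
    return grep { exists $wanted{$_} } split /\t/, $line;
}

# Prints just the matching field, not the whole line
print "$_\n" for matching_fields("Word1\tWord2\tSearchWord\tWord4\n");
```

Calling split with no arguments, or with ' ' as the pattern, splits on any run of whitespace and so handles tabs and spaces alike; /\t/ is only needed when a field may itself contain spaces.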


FishMonger
Veteran / Moderator

Jan 15, 2014, 6:51 AM

Post #4 of 7 (1848 views)
Re: [zapzap] Loading words from one .txt file, searching another for these words and printing the frequency [In reply to]

If the list of "Titles" is small, then the map/regex approach is fine. However, as the list grows it becomes inefficient, i.e., it does not scale well.

I would prefer to load the Titles into a hash and do a simple (exists) hash lookup. This approach does not lose efficiency as the list grows, so it scales well.

I also try to avoid unnecessary use of map, because its extra overhead can add inefficiency.


Code
#!/usr/bin/perl 

use strict;
use warnings;

open my $input, '<', 'input.txt' or die "Unable to open input file: $!\n";
open my $output, '>', 'frequency.txt' or die "Unable to create file for writing: $!\n";
open my $titles, '<', 'titles.txt' or die "Unable to open titles file: $!\n";

my %titles;
while (<$titles>) {
chomp;
next unless /^>(.+)/;
$titles{lc $1} = undef;
}

my %count;
while (<$input>) {
chomp;
foreach my $word (split) {
$count{$word}++ if exists $titles{lc $word};
}
}
close $input;
close $titles;

print $output "$_ occurs $count{$_} times\n" for sort { $a cmp $b } keys %count;
close $output;


I used 2 hashes, but it could have been done with 1.
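
A rough sketch of what the one-hash variant might look like, under a few assumptions: the count_titles sub is a hypothetical wrapper, in-memory filehandles stand in for titles.txt and input.txt, and the sample data is taken from the first post. Unlike the two-hash script, it keys the counts by the lowercased title and also reports titles that never occur.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count title occurrences with a single hash.
# Takes two readable filehandles: the titles file and the file to search.
sub count_titles {
    my ($titles_fh, $input_fh) = @_;

    my %count;
    while (my $line = <$titles_fh>) {
        chomp $line;
        next unless $line =~ /^>(.+)/;
        $count{lc $1} = 0;                 # every title starts at zero
    }

    while (my $line = <$input_fh>) {
        for my $word (split ' ', $line) {
            # only words already present as keys are counted
            $count{lc $word}++ if exists $count{lc $word};
        }
    }
    return \%count;
}

# In-memory filehandles standing in for titles.txt and input.txt
my $titles_text = ">Apple\nWord1\n>Grape\nWord2\n>Orange\nWord3\n";
my $input_text  = "Apple Grape Avocado Strawberry\nOrange Apple\nGrape Raspberry Apple\n";
open my $titles_fh, '<', \$titles_text or die "Cannot open in-memory titles: $!\n";
open my $input_fh,  '<', \$input_text  or die "Cannot open in-memory input: $!\n";

my $count = count_titles($titles_fh, $input_fh);
print "$_ occurs $count->{$_} times\n" for sort keys %$count;
```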


(This post was edited by FishMonger on Jan 15, 2014, 6:54 AM)


TJC
Novice

Jan 15, 2014, 7:40 AM

Post #5 of 7 (1840 views)
Re: [FishMonger] Loading words from one .txt file, searching another for these words and printing the frequency [In reply to]

Thank you very much! That solves my issues and provides the output I was after.


zapzap
User

Jan 15, 2014, 10:22 AM

Post #6 of 7 (1827 views)
Re: [FishMonger] Loading words from one .txt file, searching another for these words and printing the frequency [In reply to]

Thank you, FishMonger, for your suggestions. I find it difficult to get away from using map because of its convenience. And thank you for the information about the inefficiency of map for larger files. Your approach was cleaner than the one I presented, and I especially enjoyed this line:

Code
print $output "$_ occurs $count{$_} times\n"  for sort { $a cmp $b } keys %count;

Thank you
zap


Laurent_R
Veteran / Moderator

Jan 15, 2014, 12:09 PM

Post #7 of 7 (1821 views)
Re: [zapzap] Loading words from one .txt file, searching another for these words and printing the frequency [In reply to]

The map function is very useful and practical, so don't refrain from using it. The main point in FishMonger's post is that a hash is much better than an array for this type of problem, because the time needed to access an element of a hash does not depend on the size of the hash (well, almost not), whereas checking all the elements of an array can become very time consuming as the array grows.

In computer science terms, accessing an element of a hash is said to have a complexity of O(1) (i.e. it is independent of the size of the hash), while walking through an array has a complexity of O(n), n being the number of elements of the array. As soon as n becomes much larger than 1 (say 4 or 5), the hash will win hands down.
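
To make the difference concrete, here is a small sketch of the two kinds of lookup. The word list is made up, and in_array and in_hash are hypothetical helper names; both give the same answer, but the first may have to walk the whole list while the second is a single lookup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word list of 10,000 entries: word0 .. word9999
my @list = map { "word$_" } 0 .. 9_999;
my %set  = map { $_ => 1 } @list;

# O(n): compares against the elements one by one until a match is found
sub in_array {
    my ($word) = @_;
    for my $item (@list) {
        return 1 if $item eq $word;
    }
    return 0;
}

# O(1) on average: a single hash lookup, whatever the size of the hash
sub in_hash {
    my ($word) = @_;
    return exists $set{$word} ? 1 : 0;
}

for my $word (qw(word9999 missing)) {
    printf "%-8s array=%d hash=%d\n", $word, in_array($word), in_hash($word);
}
```

To measure the gap on your own data, the core Benchmark module (for example its cmpthese function) can time the two subs against each other.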

 
 

