Code Question. Counting the frequency of specific words in document

 



moviesigh
New User

Jan 15, 2014, 11:28 PM

Post #1 of 6 (1592 views)

Hello,
I would like to count how often certain keywords occur in a text file, sample.txt.
For example, I define the main words as "Steve Jobs" and "Executive," and I would like to count how often "stock option" and "package" occur within 10 words of "Steve Jobs" or "Executive" in the sample text below. The result I expect is 4.

Sample text)
Stock option is the most popular compensation policy in the world these days. Steve Jobs also received huge amount of stock options, and the stock option was exercised before the fiscal year.
Different from his compensation package, the other executives received less amount of stock options.

To get the result, I used the code below and ran it with this command: perl code.pl sample.txt "Steve Jobs, Executive" 10 "stock option, package"

However, the results are all "0." Could you please give me some advice on how to get the result I want? I am attaching the sample text and the code that I used. The sample file contains three copies of the same article, separated by "Document ". So I expect one result per article, and the result should be 4 for each of them.

I am looking forward to your responses. I hope you all have a great weekend! Thank you in advance.

Sean

Command) perl code.pl sample.txt "Steve Jobs, Executive" 10 "stock option, package"

Code
  
use strict;
use warnings;
use Data::Dumper;

my ($filename, $mainword_str, $distance, $search_str) = @ARGV;

my @mainword = split /\s*,\s*/, $mainword_str;
my @search = split /\s*,\s*/, $search_str;

my $content;
open my $fh, '<', $filename or die $!;
local $/ = undef;
$content = <$fh>;
close $fh;

my @docs = split 'Document ', $content;
foreach my $doc ( @docs ) {

    my $count = 0;

    my $mainword = '(' . (join '|', map { "\Q$_\E" } @mainword) . ')';
    my $search = '(' . (join '|', map { "\Q$_\E" } @search) . ')';

    for (my $dist = 0; $dist <= $distance; $dist++) {
        while ( $doc =~ /
                (?:^|\W)
                $search*
                (?=
                    (?:\W++\w++){$dist}
                    \W++\Q$mainword\E
                )
            /ixsg
        )
        {
            print " found [$1] at ", $-[1], "\n";

            $count++;
        }

        while ( $doc =~ /
                (?:^|\W)
                \Q$mainword\E
                (?=
                    (?:\W++\w++){$dist}
                    \W++$search
                )
            /ixsg
        )
        {
            print "-found [$1] at ", $-[1], "\n";
            $count++;
        }
    }

    print "match: $count\n";
}



(This post was edited by FishMonger on Jan 16, 2014, 6:08 AM)
Attachments: code.pl (1.32 KB)
  sample.txt (0.95 KB)


FishMonger
Veteran / Moderator

Jan 16, 2014, 7:14 AM

Post #2 of 6 (1580 views)
Re: [moviesigh] Code Question. Counting the frequency of specific words in document

Your problem description isn't very clear, and because of that I don't see why your expected result should be 4.

My first suggestion would be to read the file in paragraph/record mode instead of slurping the entire file into a scalar. Then split each paragraph/record into an array of words.

Next, use the functions in the List::MoreUtils module to find the indexes of each of your search words and calculate the index offset between the desired groups of words.
http://search.cpan.org/~adamk/List-MoreUtils-0.33/lib/List/MoreUtils.pm
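
A minimal sketch of that suggestion, under some assumptions: the text is made up and read from an in-memory filehandle instead of sample.txt, only single-word targets ('jobs' and 'package') are looked up, and a core-Perl grep over the index range stands in for the indexes function of List::MoreUtils so that nothing needs to be installed.

```perl
use strict;
use warnings;

# Stand-in for sample.txt so the sketch runs on its own (made-up text).
my $data = "Steve Jobs received a package.\n\nOther executives got no package.\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @offsets_by_para;
{
    local $/ = '';    # paragraph mode: each read returns one blank-line-separated record
    while (my $para = <$fh>) {
        my @words = map { lc } $para =~ /\w+/g;

        # Core-Perl equivalent of List::MoreUtils' indexes { ... } @words
        my @main_idx   = grep { $words[$_] eq 'jobs' }    0 .. $#words;
        my @search_idx = grep { $words[$_] eq 'package' } 0 .. $#words;

        # Distance in words between every main/search pair in this paragraph
        my @offsets;
        for my $m (@main_idx) {
            push @offsets, map { abs($_ - $m) } @search_idx;
        }
        push @offsets_by_para, \@offsets;
    }
}
close $fh;

for my $i (0 .. $#offsets_by_para) {
    print "paragraph ", $i + 1, ": offsets = (@{ $offsets_by_para[$i] })\n";
}
```

Multi-word phrases such as "stock option" would need an extra step that checks consecutive array elements; that is left out here to keep the sketch short.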


BillKSmith
Veteran

Jan 16, 2014, 8:17 AM

Post #3 of 6 (1573 views)
Re: [moviesigh] Code Question. Counting the frequency of specific words in document

I find that the intent of your code is very clear. Your regular expressions do not work properly. I have not yet been able to fix them, but I have found a few problems.

You split your data into four "documents". You probably meant only three.
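
A small illustration of how a fourth "document" can appear, assuming the file starts with the "Document " separator (the attachment is not shown here, so the content below is made up). split keeps the empty leading field that precedes the first separator:

```perl
use strict;
use warnings;

# Made-up content that begins with the separator, as sample.txt presumably does.
my $content = "Document 1\nfirst\nDocument 2\nsecond\nDocument 3\nthird\n";

my @docs     = split /Document /, $content;   # leading empty string is kept
my @nonempty = grep { /\S/ } @docs;           # drop pieces with no real content

print scalar(@docs), " pieces, ", scalar(@nonempty), " non-empty\n";
```

Filtering with grep, as in @nonempty, is one way to loop over only the real articles.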

I suspect that all your other problems involve backslashes. I have already found that \Q...\E in your regex is escaping the "|" character ( inside $search and $mainword ) which you intend as the regex metacharacter for 'or'.
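
A short demonstration of that effect, using a made-up test string. The alternation is built the same way as in the original post, then matched once wrapped in \Q...\E and once interpolated as a pattern:

```perl
use strict;
use warnings;

my @search = ('stock option', 'package');

# Same construction as in the original post: each term escaped, then joined.
my $alt = '(' . join('|', map { quotemeta } @search) . ')';

my $text = 'a compensation package';

# \Q...\E escapes the parentheses and the | as well, so the regex looks for
# the literal characters of $alt and cannot match ordinary text.
my $quoted_whole = ($text =~ /\Q$alt\E/i) ? 1 : 0;

# Interpolated as a pattern, | keeps its meaning as alternation.
my $as_pattern = ($text =~ /$alt/i) ? 1 : 0;

print "quoted whole: $quoted_whole, as pattern: $as_pattern\n";
```

Since each term is already escaped when $alt is built, the second \Q...\E inside the regex is not needed.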

I will get back to your problem later.
Good Luck,
Bill


FishMonger
Veteran / Moderator

Jan 16, 2014, 8:42 AM

Post #4 of 6 (1571 views)
Re: [moviesigh] Code Question. Counting the frequency of specific words in document

Instead of doing:

Code
my @mainword = split /\s*,\s*/, $mainword_str;   
my @search = split /\s*,\s*/, $search_str;

my $mainword = '(' . (join '|', map { "\Q$_\E" } @mainword) . ')';
my $search = '(' . (join '|', map { "\Q$_\E" } @search) . ')';


It would IMO be cleaner and more efficient to do it like this:

Code
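$mainword_str =~ s/\s*,\s*/|/g;   # /g so every comma is replaced, not only the first
$search_str   =~ s/\s*,\s*/|/g;

my $mainword = qr!($mainword_str)!i;   # the i modifier goes after the closing delimiter
my $search   = qr!($search_str)!i;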
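
One thing to keep in mind with the shorter form is that the terms are no longer escaped, so a search term containing regex metacharacters would be treated as a pattern. A possible middle ground, sketched here with a hypothetical helper named alternation, keeps quotemeta from the original post and still precompiles with qr:

```perl
use strict;
use warnings;

# Hypothetical helper: turn "a, b, c" into a compiled, case-insensitive
# alternation with every term matched literally.
sub alternation {
    my ($list_str) = @_;
    my @terms = grep { length } split /\s*,\s*/, $list_str;
    my $alt   = join '|', map { quotemeta } @terms;
    return qr/($alt)/i;
}

my $mainword = alternation('Steve Jobs, Executive');
my @found    = 'STEVE JOBS met an executive' =~ /$mainword/g;
print "found: $_\n" for @found;

# Metacharacters in a term stay literal.
my $literal_ok = ('I use C++' =~ alternation('C++')) ? 1 : 0;
```

Because the compiled qr object carries its own flags, it can be interpolated into a larger /x regex without the space in "Steve Jobs" being ignored.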



BillKSmith
Veteran

Jan 18, 2014, 3:54 AM

Post #5 of 6 (1528 views)
Re: [BillKSmith] Code Question. Counting the frequency of specific words in document

This problem is much more difficult than I previously recognized. The difficulty has almost nothing to do with Perl; it lies in finding a suitable algorithm.

Even if your regular expressions did exactly what you intended, your program would count a 'mainword' twice if there were 'searchwords' both before and after it.
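
A small sketch of the double counting, with made-up text and no distance limit (both simplifications are assumptions for illustration). Two independent passes each report the same "Steve Jobs"; keying a hash on the match offset is one way to count it only once:

```perl
use strict;
use warnings;

my $text   = 'package for Steve Jobs and his stock option';
my $main   = qr/Steve\ Jobs/i;
my $search = qr/stock\ option|package/i;

my %seen;        # offset of a main word => 1
my $naive = 0;   # what two independent counters add up to

# Pass 1: main word with a search word somewhere after it.
while ($text =~ /($main)(?=.*?$search)/g) {
    $seen{ $-[1] } = 1;
    $naive++;
}

# Pass 2: main word with a search word somewhere before it.
while ($text =~ /$search.*?($main)/g) {
    $seen{ $-[1] } = 1;
    $naive++;
}

my $unique = keys %seen;
print "naive: $naive, unique: $unique\n";
```

Whether the thing to count is the main word or the search word is a separate decision that the problem statement leaves open.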

I succeeded in fixing your first case (search after main). The second proved much more difficult. While attempting to debug it, I saw the problem described above, but could not think of a solution.

I ask the other gurus to try to suggest a correct algorithm.
Good Luck,
Bill


Laurent_R
Veteran / Moderator

Jan 18, 2014, 3:33 PM

Post #6 of 6 (1500 views)
Re: [BillKSmith] Code Question. Counting the frequency of specific words in document

It probably can be solved with Perl's extended regular expressions, but it is indeed likely to be very difficult. I would think that it is far easier and clearer to solve the problem with a simple and basic parsing algorithm: reading the words of the input one by one and maintaining a list of state variables that will tell you whether the condition you are looking for is matched or not.
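
A rough sketch of such a word-by-word approach, not a definitive solution. The function names (tokenize, phrase_positions, count_near) are hypothetical, and several choices are assumptions: distance is measured between the first words of two phrases, a trailing "s" is tolerated as a crude plural, and each occurrence of a search phrase is counted once if any main phrase lies within the distance.

```perl
use strict;
use warnings;

# Split text into lowercase word tokens.
sub tokenize {
    my ($text) = @_;
    return map { lc } $text =~ /\w+/g;
}

# Start index of every place where the phrase's words appear consecutively.
# A trailing "s" on a word is accepted (crude plural handling, an assumption).
sub phrase_positions {
    my ($words, $phrase) = @_;
    my @want = tokenize($phrase);
    my @starts;
    START: for my $i (0 .. @$words - @want) {
        for my $j (0 .. $#want) {
            my $w = $want[$j];
            next START unless $words->[ $i + $j ] =~ /^\Q$w\Es?$/;
        }
        push @starts, $i;
    }
    return @starts;
}

# Count search-phrase occurrences that have a main phrase within $distance words.
sub count_near {
    my ($text, $main, $search, $distance) = @_;
    my @words      = tokenize($text);
    my @main_pos   = map { phrase_positions(\@words, $_) } @$main;
    my @search_pos = map { phrase_positions(\@words, $_) } @$search;

    my $count = 0;
    for my $s (@search_pos) {
        # Counted once, however many main phrases are nearby.
        $count++ if grep { abs($_ - $s) <= $distance } @main_pos;
    }
    return $count;
}

my $text = 'Steve Jobs received a large stock option package.';
print count_near($text, ['Steve Jobs', 'Executive'], ['stock option', 'package'], 10), "\n";
```

Working on word positions makes the "within N words" rule a plain subtraction, and counting per search-phrase occurrence avoids the double counting that two separate regex passes can produce.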

I have a pretty clear idea of how to do it, but I probably won't have time to work on it this weekend. My son (who is completing a master's in computer science this year) has been away since the day after Xmas and will leave again for his university abroad for another four weeks or so, so this coming Sunday is my only chance to see him in between.

 
 

