
moviesigh
New User
Jan 15, 2014, 11:28 PM
Post #1 of 6
Code Question. Counting the frequency of specific words in document
Hello,

I would like to count the frequency of certain keywords in a text file, sample.txt. For example, I define the main words as "Steve Jobs" and "Executive", and I would like to count how often "stock option" and "package" occur within 10 words of "Steve Jobs" or "Executive" in the sample text below. The result I expect is 4.

Sample text)
Stock option is the most popular compensation policy in the world these days. Steve Jobs also received huge amount of stock options, and the stock option was exercised before the fiscal year. Different from his compensation package, the other executives received less amount of stock options.

To get the result, I used the code below and ran it with this command:

perl code.pl sample.txt "Steve Jobs, Executive" 10 "stock option, package"

However, the results are all 0. Could you please give me some advice on how to get the result I want?

I am attaching the sample text and the code that I used. The sample text contains the same article three times, and the copies are separated by "Document ". So I expect one result for each of the three articles, and the result should be 4 for each article.

I look forward to your responses. I hope you all have a great weekend! Thank you in advance.

Sean
use strict;
use warnings;
use Data::Dumper;

my ($filename, $mainword_str, $distance, $search_str) = @ARGV;
my @mainword = split /\s*,\s*/, $mainword_str;
my @search = split /\s*,\s*/, $search_str;

my $content;
open my $fh, '<', $filename or die $!;
local $/ = undef;
$content = <$fh>;
close $fh;

my @docs = split 'Document ', $content;
foreach my $doc ( @docs ) {
    my $count = 0;
    my $mainword = '(' . (join '|', map { "\Q$_\E" } @mainword) . ')';
    my $search = '(' . (join '|', map { "\Q$_\E" } @search) . ')';
    for (my $dist = 0; $dist <= $distance; $dist++) {
        while ( $doc =~ / (?:^|\W) $search* (?= (?:\W++\w++){$dist} \W++\Q$mainword\E ) /ixsg ) {
            print " found [$1] at ", $-[1], "\n";
            $count++;
        }
        while ( $doc =~ / (?:^|\W) \Q$mainword\E (?= (?:\W++\w++){$dist} \W++$search ) /ixsg ) {
            print "-found [$1] at ", $-[1], "\n";
            $count++;
        }
    }
    print "match: $count\n";
}
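For reference, the counting rule described in the question can also be expressed on word tokens rather than in a single regular expression. The following is only a minimal sketch, not a corrected version of the posted script. The helper names phrase_spans and count_near are hypothetical, and it assumes that "within 10 words" means at most 10 words lie between the two phrases, that matching is case-insensitive, and that a trailing plural "s" on the last word of a phrase still counts. One likely reason the posted script finds nothing is that $mainword is already quoted when it is built, so wrapping it in \Q...\E again inside the regex turns the parentheses and | into literal characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] token indices for every
# occurrence of $phrase in the (already lowercased) token list.
# Assumption: the last word of the phrase may carry a plural "s".
sub phrase_spans {
    my ($tokens, $phrase) = @_;
    my @words = split ' ', lc $phrase;
    return unless @words;
    my @spans;
    START: for my $i (0 .. @$tokens - @words) {
        for my $j (0 .. $#words) {
            my $tok = $tokens->[$i + $j];
            next if $tok eq $words[$j];
            next if $j == $#words && $tok eq $words[$j] . 's';
            next START;    # mismatch: try the next start position
        }
        push @spans, [$i, $i + $#words];
    }
    return @spans;
}

# Hypothetical helper: count occurrences of the search phrases that
# have at most $distance words between them and any main phrase.
sub count_near {
    my ($text, $main, $search, $distance) = @_;
    my @tokens     = lc($text) =~ /\w+/g;
    my @main_spans = map { phrase_spans(\@tokens, $_) } @$main;
    my $count      = 0;
    for my $phrase (@$search) {
        for my $hit (phrase_spans(\@tokens, $phrase)) {
            my $near = grep {
                # Number of words strictly between the two phrases.
                my $gap = $hit->[0] > $_->[1]
                        ? $hit->[0] - $_->[1] - 1
                        : $_->[0] - $hit->[1] - 1;
                $gap <= $distance;
            } @main_spans;
            $count++ if $near;    # each occurrence is counted once
        }
    }
    return $count;
}

my $sample = 'Stock option is the most popular compensation policy in '
           . 'the world these days. Steve Jobs also received huge amount '
           . 'of stock options, and the stock option was exercised before '
           . 'the fiscal year. Different from his compensation package, '
           . 'the other executives received less amount of stock options.';

print count_near($sample, ['Steve Jobs', 'Executive'],
                 ['stock option', 'package'], 10), "\n";    # prints 4
```

Under these assumptions, the first "Stock option" is not counted because 11 words separate it from "Steve Jobs". A stricter or looser reading of "within 10 words" would change the comparison in count_near.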
(This post was edited by FishMonger on Jan 16, 2014, 6:08 AM)