Help/advice with a piece of code

 



Warren Bell
Deleted

Sep 2, 2000, 2:26 PM

Post #1 of 16 (3905 views)
Help/advice with a piece of code

I have a part of a news script I'm working on that opens up a file of words then checks the input from a textarea against each word. It's working but I'm not sure it's correct syntax, or can be simplified. It's easier to see if I post the commented code:

# get the word list that the $text will be checked against
open (FILE,"$ratingfile") || &error("Unable to open $ratingfile");
&lock(FILE);
@ratings = <FILE>;
close(FILE);

# split the text so it can be checked line-by-line
@searchtext = split(/\s+/,$text);

# for some reason I had to substitute out the ( and ) characters or it would give me the error:
# / Smile\sf/: unmatched () in regexp at /usr/local/httpd/html/news/news.cgi line 425.
foreach $line (@searchtext) {
$line =~ s/\)//;
$line =~ s/\(//;
}

# look through each line of the form input
for ($a = 0; $a < @searchtext; $a++) {

$_ = $searchtext[$a];

# while looking through each line of
# the form input compare each line
# of the word list to it so each word
# of the form text gets compared to
# each word in the list
for ($r = 0; $r < @ratings; $r++) {

$_ = $ratings[$r];

# the \sf at the end is a marker
# in the text file
if (/$searchtext[$a]\sf/i) {
$fun++;

}
elsif (/$searchtext[$a]\si/i) {
$int++;

}
elsif (/$searchtext[$a]\so/i) {
$off++;

}

}

}


The text file would look like:

word f
word2 f
word3 f

word4 i
etc

Any ideas how I can do this easier or better and why I'm getting errors if I don't substitute out the ( and )?

TIA


Kanji
User

Sep 2, 2000, 10:25 PM

Post #2 of 16 (3905 views)
Re: Help/advice with a piece of code

By default, regex metacharacters in a variable you interpolate into a search pattern are honoured as is, so if you did ...

$phrase = "an example of metacharacters";
$match = "example of (.*)?char";
if ( $phrase =~ /$match/ ) {
print $1;
}

The print statement will output "meta" as we saved it explicitly with the (.*) in the search pattern.

So if you only have one half of the parens, perl barfs because it can't find the other half (hence "unmatched ()").

You can disable this behaviour by placing the search text inside the \Q...\E escape (i.e., /\Q$searchtext[$i]\E/). See perldoc perlre for more on \Q and \E.
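A minimal sketch of the \Q...\E point, assuming a made-up input word and rating line (both `$word` and `$line` are hypothetical sample values, not taken from the real script):

```perl
use strict;
use warnings;

# Hypothetical input word containing an unbalanced parenthesis.
my $word = '(Smile';
my $line = '(smile f';

# Interpolated raw, the "(" is treated as a metacharacter and the
# pattern fails to compile, so the eval block dies and yields 0.
my $compiled = eval { my $re = qr/$word\sf/i; 1 } ? 1 : 0;

# Inside \Q...\E every metacharacter in $word is escaped first.
my $matched = $line =~ /\Q$word\E\sf/i ? 1 : 0;

# quotemeta() is the function form of the same escaping.
my $escaped = quotemeta $word;

print "compiled: $compiled, matched: $matched, escaped: $escaped\n";
```

With the escaping in place there should be no need to strip the ( and ) out of the text first.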

As for an example of what dws suggested ...
001 | %word = map { $_, 1 } split /\s+/, $text;
002 | open FILE, ...;
003 | while ( <FILE> ) {
004 | chomp;
005 | my ( $matched, $type ) = split;
006 | $score{ $type }++ if $word{ $matched };
007 | }
008 | close FILE;

Line 1 builds a hash of all the words in $text. map is very underused by people new to perl, but the code basically does the same as ...

@text = split /\s+/, $text;
foreach $t ( @text ) {
$word{ $t } = 1;
}

We then open and iterate over each line of the word file (2,3), and for each line remove the trailing newline (4) and split the line up into the word we want to try to match and the category it would fit in if it does match (5).

Finally, we check to see if the word was one of those searched for by checking the hash of search words (6), and if it is, we increment the counter for its category.

That has the added benefit of neatly sidestepping the problem with metacharacters in regular expressions, as you're comparing two strings and not searching for one string in another. :-)

You can then see the numbers as $score{'i'}, $score{'f'}, and $score{'o'}, with the hash itself tying in nicely to what I showed you in your other thread.

[This message has been edited by Kanji (edited 09-03-2000).]


dws
Deleted

Sep 2, 2000, 11:26 PM

Post #3 of 16 (3905 views)
Re: Help/advice with a piece of code

Your approach to matching rating words with regular expressions can yield false hits (e.g., a rating term of "sex" matches "Essex").
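To make the false-hit case concrete, here is a small sketch. The variable names are made up for illustration; only the "sex"/"Essex" pair comes from the post.

```perl
use strict;
use warnings;

my $term = 'sex';
my $word = 'Essex';

# Unanchored regex: finds the term anywhere inside the word.
my $substring_hit = $word =~ /\Q$term\E/i ? 1 : 0;

# Anchored at both ends: only a whole-word match counts.
my $whole_hit = $word =~ /\A\Q$term\E\z/i ? 1 : 0;

# Plain string comparison after lowercasing both sides.
my $equal_hit = lc($word) eq lc($term) ? 1 : 0;

print "$substring_hit $whole_hit $equal_hit\n";
```

Only the unanchored match reports a hit, which is the false positive described above.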

Here's pseudocode for a quicker approach:

build a hash %wordcount of the "words" in $text
for each ( $ratingterm, $termtype )
$score{$termtype} += $wordcount{$ratingterm};
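The pseudocode above could be turned into runnable Perl roughly like this. It is only a sketch: `$text` and `@ratings` hold hypothetical sample data in place of the form input and the rating file, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for $text and the rating file.
my $text    = 'this joke is funny funny and rude';
my @ratings = ('funny f', 'joke f', 'rude o', 'clever i');

# Build a hash %wordcount of the "words" in $text.
my %wordcount;
$wordcount{$_}++ for split ' ', $text;

# For each ( $ratingterm, $termtype ), add the term's count to its type.
my %score;
for my $entry (@ratings) {
    my ($ratingterm, $termtype) = split ' ', $entry;
    $score{$termtype} += $wordcount{$ratingterm} // 0;    # 0 if the term never appears
}

print "$_ => $score{$_}\n" for sort keys %score;
```

Each rating term costs one hash lookup, so no regular expression is run against the text at all.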



BigRich
Novice

Sep 3, 2000, 4:39 AM

Post #4 of 16 (3905 views)
Re: Help/advice with a piece of code

As long as the words you are using are simple, such as the ones you show (word f, word2 f, etc.) and don't contain any non-word characters such as hyphens, this will work. Otherwise you will have to adjust the expression in grep.


open (FILE,"$ratingfile") or &error("Unable to open $ratingfile");
&lock(FILE);
@ratings = <FILE>;
close(FILE);

@searchtext = split(/\s+/,$text);

foreach (@ratings) {
chomp;
$li = $_;
$li =~ s/\s\w$//;
$results = grep /\b$li\b/i, @searchtext;
if (/\sf$/) { $fun += $results }
if (/\si$/) { $int += $results }
if (/\so$/) { $off += $results }
}

# Test it

print "Occurrences of \$fun = $fun <p>";
print "Occurrences of \$int = $int <p>";
print "Occurrences of \$off = $off <p>";
exit;

[This message has been edited by BigRich (edited 09-03-2000).]



Warren Bell
Deleted

Sep 3, 2000, 9:11 AM

Post #5 of 16 (3905 views)
Re: Help/advice with a piece of code

Thanks, I'm going to try both examples.

dws:
I'm not too familiar with hashes and not sure how that would replace my code. Could you give me an example of how it would add a count for each separate hit and what variable they would be in for me to use to compare which one's the highest later?

Thanks


Warren Bell
Deleted

Sep 3, 2000, 11:55 AM

Post #6 of 16 (3905 views)
Re: Help/advice with a piece of code

Ok, I tried that but it's not working. It's giving me the same result every time no matter what words I enter into the text area. It's also returning a huge number as the number of words it matched. It seems close, just something small wrong? Here's what I'm using from both threads:

%word = map { $_, 1 } split /\s+/, $text;

open (FILE,"$ratingfile") || &error("Unable to open $ratingfile");
&lock(FILE);

while (<FILE> ) {
chomp;
my ( $matched, $type ) = split;
$score{ $type }++ if $word{ $matched };
}

close (FILE);


%vars = (

# I guess I had to change these because the code above
#assigned the count variable as the marker in the text file after the matched word.
"Funny" => $f, # $vars{'f'}
"Interesting" => $i, # $vars{'i'}
"Off-topic" => $o, # $vars{'o'}
);

($name) = sort { $vars{$b} <=> $vars{$a} }
keys %vars;

# the part in the comment used to return
# the amount of hits that the highest scoring
# variable had.
print FILE "<!--$vars{$name}-->$name";

Thanks I appreciate your help with this.


Warren Bell
Deleted

Sep 3, 2000, 1:11 PM

Post #7 of 16 (3905 views)
Re: Help/advice with a piece of code

I think I got it. Let me know if anything's wrong and thanks again. Here's the code:

%word = map { $_, 1 } split /\s+/, $text;

open (FILE,"$ratingfile") || &error("Unable to open $ratingfile");
&lock(FILE);

while (<FILE> ) {
chomp;
my ( $matched, $type ) = split;
$score{ $type }++ if $word{ $matched };
}

close (FILE);


%vars = (
"Funny" => $score{'f'}, # $vars{'f'}
"Interesting" => $score{'i'}, # $vars{'i'}
"Off-topic" => $score{'o'}, # $vars{'o'}
);

($score) = sort { $vars{$b} <=> $vars{$a} }
keys %vars;

if ($vars{$score} >= 3) {

print FILE "<!--$vars{$score}-->$score";

}
else {

print FILE "None";

}
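For reference, the "pick the highest category" step can be run in isolation as below. This is a sketch under assumptions: the `%score` values are invented, the `// 0` defaults (Perl 5.10+) cover a category that never matched, and the alphabetical tie-break is an addition that is not in the code above.

```perl
use strict;
use warnings;

# Hypothetical scores; the 'o' key is deliberately missing.
my %score = ( f => 4, i => 1 );

my %vars = (
    'Funny'       => $score{f} // 0,
    'Interesting' => $score{i} // 0,
    'Off-topic'   => $score{o} // 0,
);

# Highest score first; ties broken alphabetically so the result is stable.
my ($top) = sort { $vars{$b} <=> $vars{$a} or $a cmp $b } keys %vars;

# Same threshold of 3 as in the post above.
my $rating = $vars{$top} >= 3
    ? sprintf('<!--%d-->%s', $vars{$top}, $top)
    : 'None';

print "$rating\n";
```

Without the defaults, a category with no hits would be undef and the numeric sort would warn under `use warnings`.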


dws
Deleted

Sep 3, 2000, 9:02 PM

Post #8 of 16 (3905 views)
Re: Help/advice with a piece of code

That's pretty much what I had in mind. One nuance is that the map Kanji suggested you use doesn't keep a count of the number of times a word appears in $text. This will throw your scoring off (at least the scoring will be different than in your original example). If that matters, try:
for ( split /\s+/, $text ) {
$word{$_}++;
}
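A quick side-by-side of the two ways of building the hash, using a made-up `$text` (the names `%seen` and `%count` are only for illustration):

```perl
use strict;
use warnings;

my $text = 'funny funny funny joke';

# Presence only: every key maps to 1, however often the word occurs.
my %seen = map { $_, 1 } split /\s+/, $text;

# Occurrence count: each repeat increments the word's entry.
my %count;
$count{$_}++ for split /\s+/, $text;

print "seen: $seen{funny}, count: $count{funny}\n";
```

With the counting version, `$score{$type} += $word{$matched}` credits every occurrence rather than one per distinct word.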




BigRich
Novice

Sep 4, 2000, 2:19 AM

Post #9 of 16 (3905 views)
Re: Help/advice with a piece of code

Warren, why did you butcher your code?

There is no need to do all of those unnecessary hash conversions.

I gave you a simple foreach loop that returned the values you were looking for, all you had to do was decide how to deal with those values. You could have even used the if/else statements from your other thread.

There are a couple of major problems I can see with that code you are working with now.

The first is that it doesn't work, at least not in the manner I believe you want it to work.

Second, it will only return exact matches for your $ratingfile entries. Using that code, if you are searching for the number of instances of the word "funny", and the text to be searched contains "Funny" or "(this is funny)" or "an example of funny;" or "it's funny," or "funny-bone", that code would return 0 instances. And the way it is now, there could be 100 instances of the word "funny" and it would only show 1 instance. I'm not sure, but I don't believe that is what you want.

Just use your original code with my foreach loop and you'll get the results I believe you are looking for. If you have to tweak anything, it may be the if else statements, but there's no need to trash them just for the sake of using hashes.

open (FILE,"$ratingfile") || &error("Unable to open $ratingfile");
&lock(FILE);
@ratings = <FILE>;
close(FILE);
@searchtext = split(/\s+/,$text);
foreach (@ratings) {
chomp;
$li = $_;
$li =~ s/\s\w$//;
$results = grep(/\b\U$li\E\b/i, @searchtext);
if (/\sf$/) { $fun += $results }
if (/\si$/) { $int += $results }
if (/\so$/) { $off += $results }
}
if (($int >= $fun) && ($int >= $off)) { print "text"; }
elsif (($fun >= $int) && ($fun >= $off)) { print "text2"; }
elsif (($off >= $fun) && ($off >= $int)) { print "text3"; }


I made a test script using this exact code and ran two large news articles through it and it worked fine.

In the first test I just printed the values of $int, $fun, and $off and it counted every instance of the words in my sample $ratingfile.

I used your if/else statements in the second test and they worked fine.

Hashes are very powerful tools but they aren't needed here.

Let me know if you need any more help tweaking the script.

Good luck,

Rich


Warren Bell
Deleted

Sep 5, 2000, 4:59 PM

Post #10 of 16 (3905 views)
Re: Help/advice with a piece of code

First, I appreciate everyone's help.

Well what I posted last is what I'm using now and it seems to work fine. I used the hash because I thought hashes were a cleaner way to go and used less resources because they're looking through a while loop instead of a foreach.

I don't want the script to detect 'funny-bone' if one of my key words in the rating file is 'funny'. So only if the word appears just as it appears in the ratings file (case insensitive) should it count it.


dws
Deleted

Sep 5, 2000, 10:27 PM

Post #11 of 16 (3905 views)
Re: Help/advice with a piece of code

Hashes make some things very easy. If the goal was to score each occurrence of a rating term,

$score{$type} += $word{$matched};

is an elegant way to go (though I would have picked different names).

Also, consider the possibility that you might one day want to add a new rating category. With the hash approach, that's trivial. With the "simple" approach, you need to change a bunch of scattered code.


Warren Bell
Deleted

Sep 5, 2000, 10:59 PM

Post #12 of 16 (3905 views)
Re: Help/advice with a piece of code

Well, for now I'll use this because it's working fine and, like dws said, adding a new category later is easy. But I would like to make it case insensitive. dws, is there a way I could do this? Maybe adding in something after it splits the $matched and $type?

%word = map { $_, 1 } split /\s+/, $text;

open (FILE,"$ratingfile") || &error("Unable to open $ratingfile");
&lock(FILE);

while (<FILE> ) {
chomp;
my ( $matched, $type ) = split;
# maybe adding something here to make $matched case insensitive?
$score{ $type }++ if $word{ $matched };
}

close (FILE);


BigRich
Novice

Sep 6, 2000, 6:33 AM

Post #13 of 16 (3905 views)
Re: Help/advice with a piece of code

I'm glad you've got it to work but I just wanted to clear up a couple of things as I don't think you fully understand the differences in each of the proposed methods.

Hashes were definitely not the cleaner way to go in this instance. You took an extremely simple script and complicated it with those hashes.

That's ok in a small script with a few lines, but when you start writing scripts with 4 or 5 thousand lines or more you'll wish you had kept it as simple as possible.

As far as using less resources, the script you are using now uses more resources than what I proposed.

You are taking the incoming text and creating a hash using each word in that input. You then compare each word in your ratings file against each word in the hash.
Again, this doesn't matter in such a small script.

In this instance the while loop and foreach loop are doing the same thing. They are simply going through your ratings file and doing something with each word in the file.

My method uses grep in scalar context to return the number of times that your ratings file word appears in the text to be searched. Much cleaner than converting the text to be searched into a hash, then comparing each word in the hash to the word from your ratings file.
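For anyone unfamiliar with grep in scalar context, a small sketch follows. The sample text is invented, and the \Q...\E escaping and the anchored variant are additions for illustration, not part of the loop posted earlier.

```perl
use strict;
use warnings;

my @searchtext = split ' ', 'Funny is funny, but a funny-bone is not';
my $term       = 'funny';

# In scalar context grep returns how many elements matched, not the elements.
my $loose = grep { /\b\Q$term\E\b/i } @searchtext;

# Anchoring the whole element excludes "funny," and "funny-bone".
my $exact = grep { /\A\Q$term\E\z/i } @searchtext;

print "loose: $loose, exact: $exact\n";
```

The \b version counts "Funny", "funny," and "funny-bone"; the anchored version counts only "Funny".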

If you don't want the script to detect funny-bone when searching for funny, a simple adjustment to the expression in "grep" would have fixed that. But as I stated earlier, the script you are using will not perform a case-insensitive search, as it is not a search but a comparison. "funny" won't match "Funny" or "(funny" or "funny," or "funny!" or "funny."; it will only match "funny" if it is a word with a space before and after, with no punctuation or capitalisation. You can fix this by putting every possible occurrence of each word into your ratings file, but that may make your ratings file a little large.
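One possible alternative to listing every variant in the ratings file is to normalise each word of the text before it goes into the hash. The sketch below is an assumption about how that could look, not something either script in this thread does; the sample `$text` is made up.

```perl
use strict;
use warnings;

my $text = "Funny? (this is funny) it's funny, not a funny-bone";

my %word;
for my $raw (split ' ', $text) {
    my $token = lc $raw;
    $token =~ s/^\W+|\W+$//g;    # trim leading/trailing punctuation only
    $word{$token}++ if length $token;
}

print "funny: $word{funny}, funny-bone: $word{'funny-bone'}\n";
```

"Funny?", "funny)" and "funny," all end up under the key "funny", while "funny-bone" keeps its own key because the hyphen is inside the word.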

Don't get me wrong, I'm all for bloated, kludgy, undebuggable code because I make a lot of money fixing that code and writing scripts that actually work for clients who are tired of fooling with that expertly written code. But there are enough expert programmers, and code that they've written, to keep me in business for quite a while, so please don't join their ranks.

Anyway, good luck Warren.

I think I will leave these boards to the resident gurus. Keep up the good work guys! I may be able to retire early after all!

Rich




Kanji
User

Sep 6, 2000, 11:19 AM

Post #14 of 16 (3905 views)
Re: Help/advice with a piece of code

Use lc() or uc() in the appropriate spots ...

# Make words to match lowercase
%word = map { $_, 1 } split /\s+/, lc($text);
# [..]
# Make words to match against lowercase
$score{ $type }++ if $word{ lc($matched) };

Both your search words and words to search will then be in the same case regardless of however they were originally entered.

( Edit: see below ... he's quite right. ;)

[This message has been edited by Kanji (edited 09-06-2000).]


dws
Deleted

Sep 6, 2000, 12:15 PM

Post #15 of 16 (3905 views)
Re: Help/advice with a piece of code

lc(), though I'd apply it to $text before splitting. (One call vs. many.)
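Put together, the case-insensitive version might look like the sketch below. `$text` and `@ratings` are hypothetical stand-ins for the form input and the rating file, and the `// 0` default assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $text    = 'Funny JOKE funny';
my @ratings = ('Funny f', 'joke f');

# Lowercase the whole text once, then split and count.
my %word;
$word{$_}++ for split ' ', lc $text;

my %score;
for my $entry (@ratings) {
    my ($matched, $type) = split ' ', $entry;
    # Rating terms are lowercased per entry so both sides agree on case.
    $score{$type} += $word{ lc $matched } // 0;
}

print "f => $score{f}\n";
```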


Warren Bell
Deleted

Sep 6, 2000, 4:43 PM

Post #16 of 16 (3905 views)
Re: Help/advice with a piece of code

Got it, thanks.

 
 

