Text::TFIDF

 



Dora
New User

Mar 27, 2016, 6:00 AM

Post #1 of 7 (5032 views)
Text::TFIDF

Hello everyone!
I have an array, @tous_mots, which contains all the words of two documents, and I want to find the TF-IDF score for each word and print it.
Do you know how I could do it?
The following code doesn't work.
If you have any suggestions or ideas, they would be really helpful; I am new to modules.
my $i;
foreach $i(@tous_mots){
my $a=new Text::TFIDF(file=>["a.sans_outils.txt","b.sans_outils.txt"]);

my $b=$a->TFIDF("a.sans_outils.txt",$mots1[$i]);
print "$mots1[$i] $b\n"
}
Thank you in advance.


BillKSmith
Veteran

Mar 27, 2016, 4:22 PM

Post #2 of 7 (5021 views)
Re: [Dora] Text::TFIDF

Your main problem is that you do not understand Perl's foreach loop. Please read that section of perlsyn. To learn how to use Perl's documentation tool, type:

Code
perldoc perldoc


Although it would probably work, it is a bad idea to use the variable names $a and $b. They are reserved for the sort routine.
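To illustrate the point about $a and $b, here is a minimal sketch (the word list and the variable names are made up for this example). sort assigns the two items being compared to the package variables $a and $b, which is also why use strict does not complain about them when they are left undeclared.

```perl
use strict;
use warnings;

# Hypothetical word list, only for illustration.
my @words = ('pomme', 'chat', 'arbre');

# sort places the two items being compared in the package variables $a and $b.
my @sorted = sort { $a cmp $b } @words;

# For your own data, pick descriptive names instead of $a and $b.
my $first_word = $sorted[0];
print "$first_word\n";    # prints "arbre"
```

Keeping $a and $b for sort blocks only avoids confusing clashes between your own variables and the ones sort relies on.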

Always place these two commands near the start of your script. They help to identify many errors.

Code
use strict; 
use warnings;


Your use of the module appears to be correct.

Please make these changes. Post again if you still need help. (You probably should be posting in the beginner's forum.)
Good Luck,
Bill


Dora
New User

Apr 3, 2016, 11:24 AM

Post #3 of 7 (4960 views)
Re: [BillKSmith] Text::TFIDF

Thank you very much Bill!!
I changed the code and it seems much better. I have one last question. I am using the module Text::TFIDF for a French text and for the function TFIDF I get a lot of "Use of Uninitialized value in multiplication <*> at TFIDF". Do you have any idea why this is happening? My professor at the university didn't know :(


Laurent_R
Veteran / Moderator

Apr 4, 2016, 7:28 AM

Post #4 of 7 (4949 views)
Re: [Dora] Text::TFIDF

Please show your current code. It is very difficult to say why a program doesn't work without seeing it.


(This post was edited by Laurent_R on Apr 4, 2016, 7:29 AM)


BillKSmith
Veteran

Apr 4, 2016, 8:53 AM

Post #5 of 7 (4942 views)
Re: [Dora] Text::TFIDF

Before you give up and post all your code and data, there are several things you should do. Try to limit the scope of your question.

Does the code appear to "work" despite the error?

Do you get the error for every word?

How are the words that cause the error special?

Are they all long words?
Do they all contain special characters?
Do they all appear in one document?
Both documents?
Neither document?

Does a short list of words always work?
Does the error occur with English documents?
Do the words in your list contain whitespace or other separator characters at their start or end?
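Several of these questions can be checked mechanically. The following is only a sketch: the sample list and the helper name word_problems are invented for the example, so substitute a handful of words from your own array.

```perl
use strict;
use warnings;

# Hypothetical sample list; replace it with a few words from your own array.
my @sample = ('chat', ' chien', 'Maison', "arbre\n");

# Return a list of reasons why a word might not match the module's word list.
sub word_problems {
    my ($word) = @_;
    my @problems;
    push @problems, 'leading or trailing whitespace' if $word =~ /^\s|\s$/;
    push @problems, 'uppercase letter'               if $word ne lc($word);
    push @problems, 'non-ASCII byte or character'    if $word =~ /[^\x00-\x7F]/;
    return @problems;
}

for my $word (@sample) {
    my @problems = word_problems($word);
    # The brackets make stray whitespace visible in the output.
    printf "[%s] %s\n", $word, @problems ? join(', ', @problems) : 'ok';
}
```

If only the flagged words trigger the warning, that narrows the question considerably.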

Now create a very short program (one that we can execute) which demonstrates the error. The list should be no more than about five words (probably hard coded in the script). The program does not have to print correct answers, only demo the error. Post this program (and related documents) as attachments.
Good Luck,
Bill


Dora
New User

Apr 4, 2016, 10:51 AM

Post #6 of 7 (4937 views)
Re: [BillKSmith] Text::TFIDF

Yes, I see, you are right!! I will try to do so.


Chris Charley
User

Apr 7, 2016, 12:20 PM

Post #7 of 7 (4888 views)
Re: [Dora] Text::TFIDF

Hi Dora,

You have some errors in the code you posted. The loop should be foreach my $i (0 .. $#tous_mots) so that $i is an index into the word array. Also, though not an error, you shouldn't use the variables $a and $b because they are special variables for the sort routine (and some other functions).
You might want to use something like: my $tf = new Text::TFIDF(file => ["a.sans_outils.txt","b.sans_outils.txt"]); (Also, that should be declared before the loop begins, not inside it.)

my $b=$a->TFIDF("a.sans_outils.txt",$mots1[$i]); should avoid the $b variable and is better written as my $wgt = $tf->TFIDF("a.sans_outils.txt",lc($tous_mots[$i]));

The print statement uses an array (@mots1) that does not appear earlier in your code; you probably want print "$tous_mots[$i] $wgt\n";
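The difference between the two loop forms can be seen in this small sketch (the word list is invented for the example). When foreach iterates over the array itself, the loop variable holds each word, not its position; iterating over 0 .. $#tous_mots yields the indices.

```perl
use strict;
use warnings;

# Hypothetical word list standing in for @tous_mots.
my @tous_mots = ('le', 'chat', 'noir');

# Form 1: the loop variable is the element itself.
my @by_value;
foreach my $mot (@tous_mots) {
    push @by_value, $mot;
}

# Form 2: the loop variable is the index; use it to subscript the array.
my @by_index;
foreach my $i (0 .. $#tous_mots) {
    push @by_index, "$i=$tous_mots[$i]";
}

print "@by_value\n";    # le chat noir
print "@by_index\n";    # 0=le 1=chat 2=noir
```

In the originally posted loop, $i held a word, so using it as an array subscript treated a non-numeric string as index 0.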

Note that this module lowercases the words internally, so you should lowercase the word you give it (see the lc($tous_mots[$i]) call above).

The code for Text::TFIDF can be examined, and it shows the lowercasing operation. You'll find it on line 93 (my $line = lc($_);).

You stated in your post: I am using the module Text::TFIDF for a French text and for the function TFIDF I get a lot of "Use of Uninitialized value in multiplication <*> at TFIDF". Do you have any idea why this is happening?

The most likely reason is that any word you look up that contains uppercase letters will not be found by the module because, internally, it lowercases all the words in the document.
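Here is a rough sketch of how a failed lookup turns into that warning. It assumes the counts live in a hash keyed by the lowercased word; that is an assumption made for illustration, not the module's actual data structure, and the hash %count is made up.

```perl
use strict;
use warnings;

# Hypothetical counts keyed by lowercased word (illustrative assumption only).
my %count = (chat => 3, maison => 1);

# Collect warnings in an array so they can be inspected afterwards.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $missing = $count{'Maison'} * 2;        # no such key: undef * 2, which warns
my $found   = $count{ lc('Maison') } * 2;  # key exists after lc(): 1 * 2

print "warnings caught: ", scalar(@warnings), "\n";    # warnings caught: 1
print $warnings[0];
print "found = $found\n";                              # found = 2
```

The capitalized key is simply absent from the hash, so the lookup returns undef and the multiplication warns; lowercasing the word first avoids it.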

Hopefully, this will get you on the way to a solution.

The reason you are getting negative results is that the calculation for word frequency involves log base 10, and if the calculation yields a number less than 1, the log will be negative. (See the IDF function in Text::TFIDF.)
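The sign behaviour is easy to confirm on its own. This sketch does not reproduce the module's formula; the ratios are arbitrary values chosen to show the sign, and the log10 helper is defined here from Perl's built-in natural logarithm.

```perl
use strict;
use warnings;

# Base-10 logarithm built from Perl's natural-log function.
sub log10 {
    my ($x) = @_;
    return log($x) / log(10);
}

# Arbitrary ratios, chosen only to show the sign of the logarithm.
for my $ratio (0.5, 1, 2) {
    printf "log10(%.1f) = %+.4f\n", $ratio, log10($ratio);
}
# log10(0.5) = -0.3010
# log10(1.0) = +0.0000
# log10(2.0) = +0.3010
```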

Here is a small program I wrote using the Text::TFIDF module.


Code
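#!/usr/bin/perl
use strict;
use warnings;
use 5.014;    # required for the non-destructive s///r modifier
use Text::TFIDF;

# Slurp the whole file named on the command line (here: anna.book).
my $anna = do {local $/; <>};

# Distinct words: split on whitespace, strip punctuation, lowercase.
my %words = map{ $_ => 1} map {lc} map{ s/[:;"',.?!]+//gr } split /\s+/, $anna;

# say scalar keys %words;

my $tf = Text::TFIDF->new(file => ['anna.book']);

# Look up the weight of each word (plus one extra test word) and store it.
for my $word (lc('Garçon'), keys %words) {
    my $wgt = $tf->TFIDF('anna.book',$word);
    $words{$word} = $wgt;
}

# Print the words sorted by descending weight.
for my $word (sort {$words{$b} <=> $words{$a}} keys %words) {
    printf "%-15s%.6f\n", $word, $words{$word};
}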



(This post was edited by Chris Charley on Apr 10, 2016, 8:44 AM)
