
matching sentences

 



NuclearClam
Novice

Aug 6, 2002, 8:27 PM

Post #1 of 5 (4073 views)
matching sentences

i haven't written perl in months, and i'm just starting to pick it up again, so if i make blatant mistakes, it's probably because i have quite a few holes in my knowledge :)

i'm making a little prog that tells you the number of words, sentences, and paragraphs in a document. so far i have the words & paragraphs down (as well as the most commonly used words :)), but i can't get the sentence regex to work. what i have so far is this (with an explanation afterwards):

[perl]
$count[1]++ if (/.*?(?:\.|[!?]+)$/);
$count[1]++ if (/.*?(?:\.|[!?]+)\s/g);
[/perl]

i wasn't sure if i could merge the two patterns into one ending in (\s|$). my original doesn't work anyway, so i couldn't test whether that would. the .*? is supposed to match everything up to the first period, exclamation point, or question mark, and the (?:\.|[!?]+) then catches the ending period or the run of ! and ?. sometimes there are multiple periods in a sentence that don't necessarily denote its end, which is why i separated the punctuation marks (to allow multiple ?s and !s but only a single .). finally, the \s is there to ensure that it doesn't catch a decimal number or an acronym by mistake.
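
For reference, the two patterns can probably be merged by turning the trailing \s and $ into a single lookahead. The following is only a minimal sketch under that assumption: the sample text and the variable names are made up for illustration and are not part of the original program, and the leading .*? is dropped because it isn't needed just to count terminators.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the original program.
my $text = 'Pi is 3.14 or so. Really?! Yes. The end.';

# A terminator is one period or a run of ! and ? characters. It only
# counts when whitespace or the end of the string follows, so the
# period inside "3.14" is skipped.
my $count = 0;
$count++ while $text =~ /(?:\.|[!?]+)(?=\s|$)/g;

print "$count\n";    # prints 4
```

Because the lookahead does not consume the whitespace, the next match can start right after the terminator.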

i know this is the most boring regex ever created, but it's certainly racked my brain :)

any help is greatly appreciated.




(This post was edited by NuclearClam on Aug 6, 2002, 10:46 PM)


fashimpaur
User

Aug 7, 2002, 7:10 AM

Post #2 of 5 (4060 views)
Re: [NuclearClam] matching sentences

Nuke,

Try this:


Code
   
my $paragraph =
qq~ Now is the time for all good men to come to the
aid of their country. The quick brown fox jumped
over the lazy dog. A bird in the hand is worth two
in the bush. A stitch in time saves nine.~;


my $count = 0;

$count = $paragraph =~ s/\w+(?=\.)/$&/gm;

print "$count\n\n$paragraph";

Notice $count is 4. $paragraph remains unchanged.
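
A possible alternative that avoids the substitution altogether is to count the matches directly. This is just a sketch with a shortened, made-up paragraph; it relies on the common Perl idiom of assigning a list-context /g match to an empty list and then to a scalar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, hypothetical paragraph for illustration.
my $paragraph = 'Now is the time. The quick brown fox jumped. A stitch in time saves nine.';

# In list context a /g match returns every match; assigning that list
# to () and then to a scalar yields the number of matches.
my $count = () = $paragraph =~ /\w+(?=\.)/g;

print "$count\n";    # prints 3
```

Since nothing is substituted, $paragraph is never written to and $& is not needed.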

Hope that helps,
Dennis

$a="c323745335d3221214b364d545".
"a362532582521254c3640504c3729".
"2f493759214b3635554c3040606a0",
print unpack"u*",pack "h*",$a,"\n\n";


NuclearClam
Novice

Aug 7, 2002, 12:47 PM

Post #3 of 5 (4053 views)
Re: [fashimpaur] matching sentences

how would i do this with a file? i'd like to get by without having to put the file into a string. this is the part of my program that counts the words, paragraphs, etc. the stuff i've commented out is what needs fixin :\


Code
while (<FILE>) {

    while (/(\w['\w-]*)/g) {
        $freq{lc $1}++;
        $count[0]++;
    }

    $count[2]++ if (/\n$/);

}

## $count[1] = (<FILE>) =~ s/\w+(?=\.)/$&/gm;
$count[2]++; # algorithm is off by one

i've also tried scalar(), which produces an error later in the program.

but then there's also the chance of a sentence ending with a closing parenthesis, bracket, or quote (for example "Take off one for every 'Zig'.") which \w+(?=\.) would ignore. it also miscounts acronyms.
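
One way this might be handled line by line, without slurping the file into a string, is to count terminators inside the read loop and allow optional closing quotes or brackets after them. The sketch below is an assumption-laden illustration: the sample text, the in-memory filehandle, and the variable names are all hypothetical, and acronyms followed by a space would still be miscounted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle so
# the sketch runs without a real file.
my $text = <<'END';
"Take off one for every 'Zig'." It works (mostly).
He paid 3.50 for it! Did he?
END

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $sentences = 0;
while (my $line = <$fh>) {
    # A terminator, then any closing quotes or brackets, then
    # whitespace or the end of the line.
    $sentences++ while $line =~ /(?:\.|[!?]+)['")\]]*(?=\s|$)/g;
}
close $fh;

print "$sentences\n";    # prints 4
```

With a real file, the same loop body could sit inside the existing while (<FILE>) block next to the word counting.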

thanks for giving me an idea though :) i never think to use substitute, shift, tr, etc. in ways other than what they were designed for :\




(This post was edited by NuclearClam on Aug 7, 2002, 1:00 PM)


NuclearClam
Novice

Aug 7, 2002, 4:44 PM

Post #4 of 5 (4049 views)
Re: [NuclearClam] matching sentences

i found a semi-solution... it matches all of the criteria that i'm looking for, but if there are multiple sentences like "La... No... Yes. What? Whee..." it counts all of them except *one* of the sentences ending in "...". it does this every time: as soon as there are two sentences like that, one gets left out. i don't understand it. anyway, here's my terrible, terrible net of regexes that i was at least competent enough to separate:

Code
  my $paragraph = q~ Who. Nine... What? How now?! "She." (She). No... Seven... Eight... ~;   

my $count = 0;

$ac = '(?:\w\.){2,}'; #acronyms
$endpunc = '[!?]+(?:[\'\"]{0,2})';
$other = '[^\w\s]';
$wword = '(?:(?:(?:$other+\w+)|(?:\w+$other+))?)'; expand($wword);
$words = '(?:\w+|$wword)?(\w+)?|(?:$ac)'; expand($words);
$sent = '(?:^|\W+)(?:(?:$words)(?:\s+)?)+(?:$endpunc)'; expand($sent);
sub expand ($) { $_[0] =~ s/\$(\w+)/${$1}/g; }
$count = $paragraph =~ s/$sent/$&/g;

print $count;


it sucks to almost have it done and get stuck on one, insignificant bug :\

btw, the expanded regex looks like this:

Code
  /\W+(?:(?:(?:\w+|(?:[^\w\s]+\w+(?:[^\w\s]+)?))?(\w+)?|(?:(?:\w\.){2,}))(?:\s+)?)+(?:[!?]+[\'\"]{0,2})/
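
For comparison, the pieces could perhaps be composed with qr// objects instead of expanding variable names inside strings. This is a rough sketch, not a fix for the regex above: the names $acronym, $endpunc, and $sentence, the sample paragraph, and the choice to treat "." and "..." as terminators are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample paragraph.
my $paragraph = 'The U.S.A. is big. Who... "She." (She). How now?!';

# Each piece is compiled separately and interpolated into the next.
my $acronym = qr/(?:\w\.){2,}/;                  # e.g. U.S.A.
my $endpunc = qr/(?:\.{3}|\.|[!?]+)['")\]]*/;    # terminator plus closers
my $sentence = qr/
    (?: $acronym | [^.!?] )+?    # acronyms or any non-terminator character
    $endpunc                     # the sentence ending
    (?= \s | \z )                # followed by whitespace or end of text
/x;

my $count = () = $paragraph =~ /$sentence/g;

print "$count\n";    # prints 5
```

Composing with qr// keeps each piece readable and avoids the symbolic references that the expand() helper depends on.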



(This post was edited by NuclearClam on Aug 8, 2002, 8:02 PM)


thebitch
User

Aug 9, 2002, 7:02 AM

Post #5 of 5 (4036 views)
Re: [NuclearClam] matching sentences

That is one hairy regex.
Why do people enjoy pain?
Why don't people turn to CPAN first, then think later?
I suggest you abandon your current methodology and use a module.
http://search.cpan.org/search?mode=module&query=Sentence
http://search.cpan.org/search?dist=Lingua-EN-Sentence
Lingua::EN::Sentence Module for splitting text into sentences.
http://search.cpan.org/search?mode=module&query=LINGUA

On the other hand, if you want to continue with your approach you need to abandon the substitution operator, embrace \G and/or split, and embrace extended regular expressions.
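
As a rough idea of what a \G-based approach could look like, here is a small scanner that walks the text one sentence at a time. It is only a sketch under stated assumptions: the sample text is borrowed from the earlier post, the variable names are made up, and the terminator rules are deliberately simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'La... No... Yes. What? Whee...';

my @sentences;
pos($text) = 0;
while (pos($text) < length $text) {
    if ($text =~ m{
            \G \s*                            # resume where the last match ended
            ( .+? (?: \.{3} | \. | [!?]+ ) )  # shortest run up to a terminator
            (?= \s | \z )                     # followed by whitespace or the end
        }gcxs) {
        push @sentences, $1;
    }
    else {
        last;    # leftover text without a terminator
    }
}

print scalar(@sentences), "\n";    # prints 5
print "[$_]\n" for @sentences;
```

The /c flag keeps pos() intact when a match fails, and /x allows the pattern to be spread out and commented.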

http://perldoc.com/cgi-bin/htsearch?words=perlre

Here's a "simple" non-\G example ;) Enjoy



Code
#!/usr/bin/perl -w 
use strict;

my $WORD = q{
I have to laugh when I think
of the first cigar, because
it was probably just a bunch
of rolled-up tobacco leaves.

If you work on a lobster boat,
sneaking up behind people and
pinching them is probably a
joke that gets old real fast.

-- Jack Handey
};


my @MATCHES = split m{
    ( [\.\?\!]+ )     # sentence ending
    # (?! \.\.\.)     # except ...
    # don't ask why, just a demo
}xs # x means XXX ;) (/g is meaningless in split, so it is dropped)
, $WORD;

$. = 0;

for (@MATCHES) {
    ++$.;
    if ( $. % 2 ) {
        print "\n[sentence]\n";
        print $_;

        my @WORDS = split m{
            \s+   # space one or more times
            |     # or
            \b    # word boundary
        }sx, $_;

        ## use Data::Dumper; die Dumper \@WORDS;
        ## eliminate empties, uncomment prev line to see
        @WORDS = grep $_, @WORDS;

        print "\n\t[word]\t$_\t[/word]\n" for @WORDS;

        print "\n[/sentence]\n"
    } else {
        print "\n[mark]\n$_\n[/mark]\n";
    }
}

__END__
# which outputs


[sentence]

I have to laugh when I think
of the first cigar, because
it was probably just a bunch
of rolled-up tobacco leaves
[word] I [/word]

[word] have [/word]

[word] to [/word]

[word] laugh [/word]

[word] when [/word]

[word] I [/word]

[word] think [/word]

[word] of [/word]

[word] the [/word]

[word] first [/word]

[word] cigar [/word]

[word] , [/word]

[word] because [/word]

[word] it [/word]

[word] was [/word]

[word] probably [/word]

[word] just [/word]

[word] a [/word]

[word] bunch [/word]

[word] of [/word]

[word] rolled [/word]

[word] - [/word]

[word] up [/word]

[word] tobacco [/word]

[word] leaves [/word]

[/sentence]

[mark]
.
[/mark]

[sentence]


If you work on a lobster boat,
sneaking up behind people and
pinching them is probably a
joke that gets old real fast
[word] If [/word]

[word] you [/word]

[word] work [/word]

[word] on [/word]

[word] a [/word]

[word] lobster [/word]

[word] boat [/word]

[word] , [/word]

[word] sneaking [/word]

[word] up [/word]

[word] behind [/word]

[word] people [/word]

[word] and [/word]

[word] pinching [/word]

[word] them [/word]

[word] is [/word]

[word] probably [/word]

[word] a [/word]

[word] joke [/word]

[word] that [/word]

[word] gets [/word]

[word] old [/word]

[word] real [/word]

[word] fast [/word]

[/sentence]

[mark]
.
[/mark]

[sentence]


-- Jack Handey

[word] -- [/word]

[word] Jack [/word]

[word] Handey [/word]

[/sentence]


 
 

