Sentence identification and separation

 



ardibehest
New User

Jul 6, 2014, 6:20 PM

Post #1 of 5 (1692 views)

I am revisiting the problem of sentence splitting. I have a Perl script which splits a paragraph into sentences, but acronyms and short forms cause problems.
/code
#!/usr/bin/perl

use feature qw/say/;
use strict;
use warnings;

my $s;
my @arr;

while (<>) {
    chomp $_;
    $s .= $_ . " ";
}
@arr = $s =~ m/[A-Z].+?[;](?=[^.;][A-Z]|\s*$)/g;
foreach (@arr) {
    say;
}
/code
From a large corpus I have compiled a list of abbreviations (not necessarily exhaustive) that I would like to integrate into the script, but since I am still learning Perl I have not managed to do so. The list is given below; it is incomplete and can be extended.
/code
Abbr["Co."];
Abbr["Corp."];
Abbr["vs."];
Abbr["e.g."];
Abbr["etc."];
Abbr["ex."];
Abbr["cf."];
Abbr["eg."];
Abbr["Jan."];
Abbr["Feb."];
Abbr["Mar."];
Abbr["Apr."];
Abbr["Jun."];
Abbr["Jul."];
Abbr["Aug."];
Abbr["Sep."];
Abbr["Sept."];
Abbr["Oct."];
Abbr["Nov."];
Abbr["Dec."];
Abbr["jan."];
Abbr["feb."];
Abbr["mar."];
Abbr["apr."];
Abbr["jun."];
Abbr["jul."];
Abbr["aug."];
Abbr["sep."];
Abbr["sept."];
Abbr["oct."];
Abbr["nov."];
Abbr["dec."];
Abbr["ed."];
Abbr["eds."];
Abbr["repr."];
Abbr["trans."];
Abbr["vol."];
Abbr["vols."];
Abbr["rev."];
Abbr["est."];
Abbr["b."];
Abbr["m."];
Abbr["bur."];
Abbr["d."];
Abbr["r."];
Abbr["M."];
Abbr["Dept."];
Abbr["Mr."];
Abbr["Jr."];
Abbr["Ms."];
Abbr["Mrs."];
Abbr["Dr."];
/code
How do I integrate these and ensure that when the script encounters the above exceptions, it does not treat the full stop as a sentence delimiter, as in the examples below?
/code
Mr. Smith said today is Jun. 15th.
Jones Inc. filed for bankruptcy.
/code
A couple of examples for such integration would suffice and I will integrate the rest.
Many thanks




BillKSmith
Veteran

Jul 6, 2014, 8:58 PM

Post #2 of 5 (1689 views)
Re: [ardibehest] Sentence identification and separation

You can ignore any abbreviation that ends in a period and does not contain any other periods (the only exception in your list is 'e.g.') with a three-step process.

1) Change the period at the end of every abbreviation to a constant arbitrary string. (I used '#$%^'.)

2) Split the result into sentences as before.

3) Replace every instance of the arbitrary string with a period.

The regex in your code does not match your example: it accepts only ';' as a sentence terminator, while your sample sentences end in '.'. My replacement is intended for this example. It is not a full solution.

My list of abbreviations is clearly incomplete. Again it is intended only for this example.


Code
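use feature qw/say/;
use strict;
use warnings;
my $s;
my @arr;

# Join all input lines into one string.
while (<DATA>) {
    chomp $_;
    $s .= $_;
}

# Placeholder that temporarily stands in for the period after an abbreviation.
my $TERMINATOR = '#$%^';
my $ABBREVIATIONS = qr/Co|Corp|vs|etc|Mr|Inc|Jun/;
# Step 1: hide the periods that follow abbreviations.
$s =~ s/($ABBREVIATIONS)\./$1$TERMINATOR/g;
# Step 2: split into sentences that end in ';' or '.'.
@arr = $s =~ m/([A-Z].+?[;.])/g;
foreach my $sentence (@arr) {
    # Step 3: restore the hidden periods.
    $sentence =~ s/\Q$TERMINATOR\E/./g;
    print $sentence, "\n";
}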
__DATA__
Mr. Smith said today is Jun. 15th.
Jones Inc. filed for bankruptcy.
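
To generalize the abbreviation list, one option is to build the pattern from an array instead of typing the alternation by hand. The following is only a sketch under assumptions, not production code: the names abbreviation_regex, mask_periods and split_sentences are made up for this illustration, the placeholder '#$%^' is assumed never to occur in the input, and every period inside a listed abbreviation is masked so that 'e.g.' is covered as well.

```perl
use strict;
use warnings;

# Assumption: this string never occurs in the text being split.
my $PLACEHOLDER = '#$%^';

# Hypothetical helper: build one regex from a list of abbreviations,
# each written without its trailing period (e.g. 'Mr', 'e.g').
sub abbreviation_regex {
    my @abbr = @_;
    # Longest first, so 'Corp' is tried before 'Co'.
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @abbr;
    # The lookbehind keeps 'etc' from matching inside a longer word.
    return qr/(?<![A-Za-z])(?:$alt)\./;
}

# Hypothetical helper: replace every period in one abbreviation.
sub mask_periods {
    my ($abbr) = @_;
    $abbr =~ s/\./$PLACEHOLDER/g;
    return $abbr;
}

# The same three steps as above, wrapped in a function.
sub split_sentences {
    my ($text, @abbr) = @_;
    my $re = abbreviation_regex(@abbr);
    $text =~ s{($re)}{ mask_periods($1) }ge;       # step 1: hide periods
    my @sentences = $text =~ m/([A-Z].+?[;.])/g;   # step 2: split
    s/\Q$PLACEHOLDER\E/./g for @sentences;         # step 3: restore
    return @sentences;
}

my @abbr = qw(Co Corp vs e.g etc Mr Inc Jun);
my $text = 'Mr. Smith said today is Jun. 15th. Jones Inc. filed for bankruptcy.';
print "$_\n" for split_sentences($text, @abbr);
```

For the sample data this prints the two sentences on separate lines. Note the limitation of the approach itself: a listed abbreviation that really does end a sentence will not be treated as a sentence boundary.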

Good Luck,
Bill


ardibehest
New User

Jul 7, 2014, 2:40 AM

Post #3 of 5 (1683 views)
Re: [BillKSmith] Sentence identification and separation

Many thanks. I tested the script on the data file. I called the Perl script sen2.pl and added the usual header. I got the following output:

Code
Name "main::DATA" used only once: possible typo at sen2.pl line 8. 
readline() on unopened filehandle DATA at sen2.pl line 8.
Use of uninitialized value $s in substitution (s///) at sen2.pl line 15.
Use of uninitialized value $s in pattern match (m//) at sen2.pl line 16.

I am enclosing the perl script:

Code
#!/usr/bin/perl
use feature qw/say/;
use strict;
use warnings;
my $s;
my @arr;

while (<DATA>) {
    chomp $_;
    $s .= $_;
}

my $TERMINATOR = '#$%^';
my $ABBREVIATIONS = qr/Co|Corp|vs|etc|Mr|Inc|Jun/;
$s =~ s/($ABBREVIATIONS)\./$1$TERMINATOR/g;
@arr = $s =~ m/([A-Z].+?[;.])/g;
foreach my $sentence (@arr) {
    $sentence =~ s/\Q$TERMINATOR\E/./g;
    print $sentence, "\n";
}

I understood the logic, but I am a newbie in Perl and cannot work out what went wrong. Did I goof up somewhere?
I forgot to mention that I work in a Windows environment.
Many thanks for all your help.


Zhris
Enthusiast

Jul 7, 2014, 2:44 AM

Post #4 of 5 (1681 views)
Re: [ardibehest] Sentence identification and separation

Hi,

Bill provided a standalone script that reads its input from a __DATA__ section at the end of the file. Your copy leaves that section out, so the DATA handle is never opened. You'll want to revert to the diamond operator instead of the DATA handle:


Code
# Change this line:
#   while (<DATA>) {
# to this:
while (<>) {
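
One more thing to watch when you go back to real files: Bill's loop appends each line with no separator ($s .= $_), whereas your original added a space. If a sentence wraps across two lines, the last word of one line and the first word of the next get glued together. Here is a rough sketch of a line-joining helper; the name join_lines is made up for this illustration, and the in-memory filehandle only stands in for your data file.

```perl
use strict;
use warnings;

# Hypothetical helper: read all lines from a filehandle and join them
# with single spaces so words on adjacent lines stay separated.
sub join_lines {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;    # drop an LF or CRLF line ending
        push @lines, $line;
    }
    return join ' ', @lines;
}

# Demo input held in memory; a sentence wraps across two lines here.
my $sample = "Mr. Smith said today\nis Jun. 15th.\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print join_lines($fh), "\n";
```

The result can then be passed through the same three substitution steps as before.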


Chris


BillKSmith
Veteran

Jul 7, 2014, 4:41 AM

Post #5 of 5 (1675 views)
Re: [ardibehest] Sentence identification and separation

Chris correctly explained your error messages, but the resulting script is unlikely to do what you want. I provided a complete working EXAMPLE that correctly processes the sample of data you provided. In the text of my original post, I indicated the areas that you had to generalize. Ask for more help if you need it, but do not expect this forum to provide production code.
Good Luck,
Bill

 
 

