Sentence identification and separation



ardibehest
New User

Jul 6, 2014, 6:20 PM


Sentence identification and separation

I am revisiting the problem of sentence splitting. I have a Perl script that splits a paragraph into sentences, but acronyms and abbreviations cause problems:
/code
#!/usr/bin/perl

use feature qw/say/;
use strict;
use warnings;

my $s;
my @arr;

while(<>) {
chomp $_;
$s .= $_ . " ";
}
@arr = $s =~ m/[A-Z].+?[;](?=[^.;][A-Z]|\s*$)/g;
foreach (@arr) {
say;
}
/code
I have compiled a list of abbreviations from a large corpus and would like to integrate it into the script, but since I am still learning Perl I have not managed to do so. The list is given below; it is not exhaustive and can be extended.
/code
Abbr["Co."];
Abbr["Corp."];
Abbr["vs."];
Abbr["e.g."];
Abbr["etc."];
Abbr["ex."];
Abbr["cf."];
Abbr["eg."];
Abbr["Jan."];
Abbr["Feb."];
Abbr["Mar."];
Abbr["Apr."];
Abbr["Jun."];
Abbr["Jul."];
Abbr["Aug."];
Abbr["Sep."];
Abbr["Sept."];
Abbr["Oct."];
Abbr["Nov."];
Abbr["Dec."];
Abbr["jan."];
Abbr["feb."];
Abbr["mar."];
Abbr["apr."];
Abbr["jun."];
Abbr["jul."];
Abbr["aug."];
Abbr["sep."];
Abbr["sept."];
Abbr["oct."];
Abbr["nov."];
Abbr["dec."];
Abbr["ed."];
Abbr["eds."];
Abbr["repr."];
Abbr["trans."];
Abbr["vol."];
Abbr["vols."];
Abbr["rev."];
Abbr["est."];
Abbr["b."];
Abbr["m."];
Abbr["bur."];
Abbr["d."];
Abbr["r."];
Abbr["M."];
Abbr["Dept."];
Abbr["Mr."];
Abbr["Jr."];
Abbr["Ms."];
Abbr["Mrs."];
Abbr["Dr."];
/code
How do I integrate these so that, when the script encounters one of the above exceptions, it does not treat the full stop as a sentence delimiter, as in the examples below?
/code
Mr. Smith said today is Jun. 15th.
Jones Inc. filed for bankruptcy.
/code
A couple of examples of such integration would suffice; I will integrate the rest myself.
Many thanks




BillKSmith
Veteran

Jul 6, 2014, 8:58 PM


Re: [ardibehest] Sentence identification and separation

You can make the splitter skip over any abbreviation that ends in a period and contains no other periods (the only exception in your list is 'e.g.') with a three-step process.

1) Change the period at the end of every abbreviation to a constant, arbitrary string that does not occur in the text. (I used '#$%^'.)

2) Split the result into sentences as before.

3) Replace every instance of the arbitrary string with a period.

The regex in your code does not match your example: as written, it only accepts ';' as a sentence terminator. My replacement is intended for this example only. It is not a full solution.

My list of abbreviations is clearly incomplete. Again it is intended only for this example.


Code
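use feature qw/say/;
use strict;
use warnings;
my $s = '';
my @arr;

# Join all input lines into one string, separated by spaces.
while (<DATA>) {
    chomp;
    $s .= $_ . ' ';
}

# Placeholder that temporarily stands in for an abbreviation's period.
my $TERMINATOR = '#$%^';
# Sample list only; extend it for real data.
my $ABBREVIATIONS = qr/Co|Corp|vs|etc|Mr|Inc|Jun/;

# Step 1: hide the period that ends each abbreviation (whole words only).
$s =~ s/\b($ABBREVIATIONS)\./$1$TERMINATOR/g;
# Step 2: split at every remaining period or semicolon.
@arr = $s =~ m/([A-Z].+?[;.])/g;
# Step 3: restore the hidden periods and print one sentence per line.
foreach my $sentence (@arr) {
    $sentence =~ s/\Q$TERMINATOR\E/./g;
    print $sentence, "\n";
}
__DATA__
Mr. Smith said today is Jun. 15th.
Jones Inc. filed for bankruptcy.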
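
One possible way to generalize the example above, offered as a sketch rather than a finished solution: build the pattern from a list of abbreviations instead of hard-coding the alternation, and mask every period inside an abbreviation so that 'e.g.' is covered too. The names `@ABBREVIATIONS`, `$PLACEHOLDER`, `mask_periods` and `split_sentences` are made up for illustration, the list is only an excerpt, and the placeholder is assumed never to occur in the input. The splitting regex is the same one used in the example.

```perl
use strict;
use warnings;

# Hypothetical excerpt of the abbreviation list; extend as needed.
my @ABBREVIATIONS = qw(Co. Corp. vs. e.g. etc. Jun. Mr. Mrs. Dr. Inc.);

# Assumption: this control character never occurs in the input text.
my $PLACEHOLDER = "\x{1}";

# Escape each abbreviation and try longer ones first, so that a longer
# abbreviation wins over a shorter one that is its prefix.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @ABBREVIATIONS;
my $ABBR_RE = qr/\b($alternation)/;

# Replace every period in one abbreviation with the placeholder.
sub mask_periods {
    my ($abbr) = @_;
    $abbr =~ s/\./$PLACEHOLDER/g;
    return $abbr;
}

sub split_sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;                            # flatten line breaks
    $text =~ s/$ABBR_RE/mask_periods($1)/ge;       # step 1: mask
    my @sentences = $text =~ m/([A-Z].+?[;.])/g;   # step 2: split
    s/\Q$PLACEHOLDER\E/./g for @sentences;         # step 3: unmask
    return @sentences;
}

# Prints one sentence per line.
print "$_\n" for split_sentences(
    "Mr. Smith said today is Jun. 15th.\nJones Inc. filed for bankruptcy.\n"
    . "We sell fruit, e.g. apples. They are cheap.\n"
);
```

A known limitation of this sketch: an abbreviation that happens to end a sentence (for example a sentence closing with 'etc.') is masked as well, so that sentence is merged with the one that follows it.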

Good Luck,
Bill


ardibehest
New User

Jul 7, 2014, 2:40 AM


Re: [BillKSmith] Sentence identification and separation

Many thanks. I tested the script on the data file. I named the Perl script sen2.pl and added the usual header. I got the following output:

Code
Name "main::DATA" used only once: possible typo at sen2.pl line 8. 
readline() on unopened filehandle DATA at sen2.pl line 8.
Use of uninitialized value $s in substitution (s///) at sen2.pl line 15.
Use of uninitialized value $s in pattern match (m//) at sen2.pl line 16.

I am enclosing the perl script:

Code
#!/usr/bin/perl  
use feature qw/say/;
use strict;
use warnings;
my $s;
my @arr;

while (<DATA>) {
chomp $_;
$s .= $_;
}

my $TERMINATOR = '#$%^';
my $ABBREVIATIONS = qr/Co|Corp|vs|etc|Mr|Inc|Jun/;
$s =~ s/($ABBREVIATIONS)\./$1$TERMINATOR/g;
@arr = $s =~ m/([A-Z].+?[;.])/g;
foreach my $sentence (@arr) {
$sentence =~ s/\Q$TERMINATOR\E/./g;
print $sentence, "\n";
}

I understand the logic, but I am a newbie in Perl and cannot work out what went wrong. Did I goof up somewhere?
I forgot to mention that I work in a Windows environment.
Many thanks for all your help.


Zhris
Enthusiast

Jul 7, 2014, 2:44 AM


Re: [ardibehest] Sentence identification and separation

Hi,

Bill provided a standalone script that reads its input from a __DATA__ section at the end of the file. Your copy has no such section, so the DATA handle is never opened. You'll want to revert to using the diamond operator instead of the DATA handle:


Code
while (<DATA>) {   # replace this line...
while (<>) {       # ...with this one
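
If you want the same code to work with either input source, one option is to move the reading loop into a sub that takes a filehandle. This is only a sketch: `read_paragraph` is a made-up name, and the in-memory filehandle is there purely so the snippet runs on its own. In the real script you would presumably pass `\*ARGV` (the handle that `<>` reads from) or `\*DATA`. The lines are joined with a space so that words at a line break do not run together.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line from the given filehandle and
# join the lines into one string, separated by single spaces.
sub read_paragraph {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;   # drop the line ending (LF or CRLF)
        push @lines, $line;
    }
    return join ' ', @lines;
}

# Demonstration with an in-memory filehandle standing in for real input.
my $sample = "Mr. Smith said today is Jun. 15th.\nJones Inc. filed for bankruptcy.\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
print read_paragraph($fh), "\n";
close $fh;
```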


Chris


BillKSmith
Veteran

Jul 7, 2014, 4:41 AM


Re: [ardibehest] Sentence identification and separation

Chris correctly explained your error messages, but the resulting script is unlikely to do what you want. I provided a complete working EXAMPLE that correctly processes the sample of data you provided. In the text of my original post, I indicated the areas that you had to generalize. Ask for more help if you need it, but do not expect this forum to provide production code.
Good Luck,
Bill