Syntax for string variances in Perl

 



cybex
Novice

Nov 28, 2013, 1:55 AM

Post #1 of 10 (29665 views)
Syntax for string variances in Perl

I am a Perl beginner with a working script that needs a small modification. I am having a problem getting a working regular expression. I have the following that seems to work somewhat but not well enough. Basically it is crap and I need help...

I have tried the following:
"m/^[\(\s]*S[UBJECT]\s*1/i"
"m/^[\[(*\s*]S[UBJECT][*\)*\s*\d*\s*\)*\s*]:*\s/"
"m/\(*\s*S[UBJECT]*\)*\s*\d*\s*\)*\s*:*\s/ "
"m/\(*\s*S[UBJECT]*\)*\s*\d*\s*\)*\s*:*\s/"

As you can see I know almost nothing about this so any help is greatly appreciated.

I am trying to match any of the following variants of the identifier "SUBJECT" at the beginning of a paragraph. (This is not every possible variant, but you get the idea.)
"S "
"S: "
S#:
SUBJECT:
SUBJECT #:
(S)
(S#)
(SUBJECT)
(SUBJECT):
(SUBJECT #):
"SUBJECT "

"S" is always capitalized and if only the "S" it is followed by either a space or a colon and a space.
If spelled out in full, the string is always capitalized: "SUBJECT". The "#" stands for any number.

I do not want to hit on these occurrences if they appear later in the paragraph:
"subject"
"Subject"

I believe the "^" will prevent false positives from occurring.

Example:

SUBJECT:  Improve ashamed married subject expense bed her comfort pursuit mrs. Four time took ye your as fail lady. Up greatest am exertion or marianne. Subject occasional terminated insensible and inhabiting. So know do fond to half on. Provided so as doubtful on striking required.

S:  Improve ashamed married subject expense bed her comfort pursuit mrs. Four time took ye your as fail lady. Up greatest am exertion or marianne. Subject occasional terminated insensible and inhabiting. So know do fond to half on. Provided so as doubtful on striking required.


Laurent_R
Veteran / Moderator

Nov 28, 2013, 11:04 AM

Post #2 of 10 (29662 views)
Re: [cybex] Syntax for string variances in Perl

First, yes: using ^ at the start of the regex pattern guarantees that the rest of the pattern will match only at the start of a line (assuming you are reading the file line by line).
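To illustrate the point about ^, here is a minimal sketch. The sample strings are made up for the demonstration; only core Perl is used. It shows ^ applied line by line, and how the /m modifier changes things when the whole text sits in one string.

```perl
use strict;
use warnings;

# Hypothetical sample text: two lines, only the first starts with the marker.
my $text = "SUBJECT: first line\nmore about the subject, S: not at line start\n";

# Reading line by line: ^ is tested against each line separately.
my @hits = grep { /^S/ } split /\n/, $text;

# Whole text in one string: plain ^ matches only at the very start,
# while /m lets ^ match after every embedded newline as well.
my $count_plain = () = "intro\nS: late line\n" =~ /^S/g;
my $count_multi = () = "intro\nS: late line\n" =~ /^S/mg;

print scalar(@hits), " $count_plain $count_multi\n";
```

So if the paragraphs come in one at a time, a plain ^ is enough; /m only matters when several lines are matched as a single string.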

There are many ways of doing this. Here is one possibility:


Code
 m/^\(?"?S(\s|: |\d:|\)?)|UBJECT|\d?\)?\s?"?/


I haven't thoroughly checked (no time now), but I think it should more or less match all your cases.


cybex
Novice

Nov 30, 2013, 3:55 AM

Post #3 of 10 (29592 views)
Re: [Laurent_R] Syntax for string variances in Perl

@Laurent_R: Thank you for the help, but that expression did not work for all of the requirements.

Can someone explain why this is working and suggest improvements? RegexBuddy's testing interface shows it working, but it does not work inside Perl.

m/^|\s*\(?SUBJECT\)*\s*\d{0,2}\)*\s*:*\s*|\s{2}\(?S\)*\d{0,2}\)*\s*:*\s*|/

S This is a subject. Subject's here should not be caught.
S: This is a subject. Subject's here should not be caught.
S2: This is a subject. Subject's here should not be caught.
SUBJECT: This is a subject. Subject's here should not be caught.
SUBJECT 3: This is a subject. Subject's here should not be caught.
(S) This is a subject. Subject's here should not be caught.
(S7) This is a subject. Subject's here should not be caught.
(SUBJECT) This is a subject. Subject's here should not be caught.
(SUBJECT): This is a subject. Subject's here should not be caught.
(SUBJECT 2): This is a subject. Subject's here should not be caught.
SUBJECT This is a subject. Subject's here should not be caught.

Why will it only work with leading and trailing "|"?
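A likely explanation, based on how Perl alternation works rather than on anything confirmed in this thread: the leading "^|" makes a bare ^ a complete alternative of its own, and the trailing "|" adds an empty alternative. Either one succeeds on any string, so the pattern "works" only in the sense that it can never fail. A small sketch with made-up input lines:

```perl
use strict;
use warnings;

# The pattern from the post above, unchanged.
my $re = qr/^|\s*\(?SUBJECT\)*\s*\d{0,2}\)*\s*:*\s*|\s{2}\(?S\)*\d{0,2}\)*\s*:*\s*|/;

# Hypothetical inputs: one wanted line and two that should be rejected.
for my $line ('SUBJECT: wanted', 'foobar', '') {
    my $matched = $line =~ $re ? 'match' : 'no match';
    # $& holds the text actually consumed; here it is always empty,
    # because the first alternative (a bare ^) succeeds at position 0.
    printf "%-18s %-8s consumed='%s'\n", "'$line'", $matched, $&;
}
```

Since alternatives are tried left to right, the bare ^ wins before the SUBJECT branch is even attempted.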


FishMonger
Veteran / Moderator

Nov 30, 2013, 6:19 AM

Post #4 of 10 (29586 views)
Re: [cybex] Syntax for string variances in Perl


Code
#!/usr/bin/perl 

use strict;
use warnings;
use YAPE::Regex::Explain;

my $regex = 'm/^|\s*\(?SUBJECT\)*\s*\d{0,2}\)*\s*:*\s*|\s{2}\(?S\)*\d{0,2}\)*\s*:*\s*|/';

print YAPE::Regex::Explain->new($regex)->explain();



Code
c:\test>explain-regex.pl 
The regular expression:

(?-imsx:m/^|\s*\(?SUBJECT\)*\s*\d{0,2}\)*\s*:*\s*|\s{2}\(?S\)*\d{0,2}\)*\s*:*\s*|/)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
m/                       'm/'
----------------------------------------------------------------------
^                        the beginning of the string
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
\(?                      '(' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
SUBJECT                  'SUBJECT'
----------------------------------------------------------------------
\)*                      ')' (0 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
\d{0,2}                  digits (0-9) (between 0 and 2 times
                         (matching the most amount possible))
----------------------------------------------------------------------
\)*                      ')' (0 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
:*                       ':' (0 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
\s{2}                    whitespace (\n, \r, \t, \f, and " ") (2
                         times)
----------------------------------------------------------------------
\(?                      '(' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
S                        'S'
----------------------------------------------------------------------
\)*                      ')' (0 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
\d{0,2}                  digits (0-9) (between 0 and 2 times
                         (matching the most amount possible))
----------------------------------------------------------------------
\)*                      ')' (0 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
:*                       ':' (0 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
/                        '/'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------



(This post was edited by FishMonger on Nov 30, 2013, 6:20 AM)


Laurent_R
Veteran / Moderator

Nov 30, 2013, 10:45 AM

Post #5 of 10 (29567 views)
Re: [cybex] Syntax for string variances in Perl


In Reply To
@Laurent_R: Thank you for the help, but that expression did not work for all of the requirements.


Well, you could make the effort to say which requirement it does not match.

This is the test I've just made under the Perl debugger:


Code
  DB<3> x @c 
0 '"S "'
1 'S: '
2 'S2:'
3 'SUBJECT:'
4 'SUBJECT 3:'
5 '(S)'
6 '(S4)'
7 '(SUBJECT)'
8 '(SUBJECT 5):'
9 '"SUBJECT "'
DB<4> @d = grep {m/^\(?"?S(\s|: |\d:|\)?)|UBJECT|\d?\)?\s?"?/} @c

DB<5> x @d
0 '"S "'
1 'S: '
2 'S2:'
3 'SUBJECT:'
4 'SUBJECT 3:'
5 '(S)'
6 '(S4)'
7 '(SUBJECT)'
8 '(SUBJECT 5):'
9 '"SUBJECT "'


So the regular expression pattern I gave you seems to match every single one of the ten items in my @c array, which presumably reflects exactly the requirement you posted. Now, it could be that you added a space somewhere that I cannot see, since you did not use code tags, or that you changed something. Or maybe it is matching too much. But then please explain exactly what you think does not work.
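On the "matching too much" possibility, a quick check with a few negative cases is telling. The strings below are made up and are not part of the @c array above; the pattern is the one from post #2, unchanged.

```perl
use strict;
use warnings;

# The pattern from post #2, unchanged. At the top level it has three
# alternatives:  ^\(?"?S(...)   |   UBJECT   |   \d?\)?\s?"?
my $re = qr/^\(?"?S(\s|: |\d:|\)?)|UBJECT|\d?\)?\s?"?/;

# Hypothetical negative cases that are not in the original @c array.
my @negatives = ('foobar', 'Simson', 'subject later in the text');

# Every item of the third alternative is optional, so it can match the
# empty string anywhere, and grep lets all the negatives through.
my @leaked = grep { $_ =~ $re } @negatives;
print scalar(@leaked), " of ", scalar(@negatives), " negatives matched\n";
```

A grep that returns every input is therefore not evidence that the pattern is right; a test list needs strings that are supposed to fail as well.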


(This post was edited by Laurent_R on Nov 30, 2013, 10:52 AM)


Kenosis
User

Nov 30, 2013, 2:02 PM

Post #6 of 10 (29553 views)
Re: [Laurent_R] Syntax for string variances in Perl

Laurent_R,

Excellent regex(!), but my guess is that the double-quotes are not literally part of what needs to be matched, but are only used to show trailing spaces.


(This post was edited by Kenosis on Nov 30, 2013, 2:03 PM)


cybex
Novice

Nov 30, 2013, 11:52 PM

Post #7 of 10 (29529 views)
Re: [Kenosis] Syntax for string variances in Perl

That is absolutely correct, I just noticed that the regex was looking for the quotes. I put them in to show the whitespaces. Great catch and spot on.

I am testing the rest and will report back with what your expression yielded. I wasn't trying to be vague, I simply didn't know how to determine what it was hitting on because it was returning everything in the source.

I thank each of you for your help and for tolerating my ignorance with this.


Laurent_R
Veteran / Moderator

Dec 1, 2013, 2:46 AM

Post #8 of 10 (29522 views)
Re: [cybex] Syntax for string variances in Perl

OK, two corrections: I removed the quotes and fixed an error that made the regex match too much.

Try this:


Code
m/^\(?S(\s|: |\d:?|\)|UBJECT)/


Run under the debugger:

Code
  DB<11>  @c = ('S ', 'S: ', 'S2:', 'SUBJECT:', 'SUBJECT 3:', '(S)', '(S4)', '(SUBJECT)', '(SUBJECT 5):', 'SUBJECT ', 'foobar', "Simson") 

DB<14> @d = grep {m/^\(?S(\s|: |\d:?|\)|UBJECT)/} @c

DB<15> x @d
0 'S '
1 'S: '
2 'S2:'
3 'SUBJECT:'
4 'SUBJECT 3:'
5 '(S)'
6 '(S4)'
7 '(SUBJECT)'
8 '(SUBJECT 5):'
9 'SUBJECT '


As you can see, it matches the first ten strings, representing the input you described, but does not match foobar and Simson.
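The same check can be kept as a small script rather than a debugger session. This is only a sketch: is_subject_line() is a hypothetical helper name, and the two extra negative strings ('subject', 'Subject') are taken from the requirement in post #1.

```perl
use strict;
use warnings;

# The corrected pattern from this post, wrapped in a helper.
# is_subject_line() is a hypothetical name used only for this sketch.
sub is_subject_line {
    my ($line) = @_;
    return $line =~ m/^\(?S(\s|: |\d:?|\)|UBJECT)/ ? 1 : 0;
}

my @expect_match = ('S ', 'S: ', 'S2:', 'SUBJECT:', 'SUBJECT 3:',
                    '(S)', '(S4)', '(SUBJECT)', '(SUBJECT 5):', 'SUBJECT ');
my @expect_miss  = ('foobar', 'Simson', 'subject', 'Subject');

# Collect anything that behaves differently from what is expected.
my @missed = grep { !is_subject_line($_) } @expect_match;
my @leaked = grep {  is_subject_line($_) } @expect_miss;

print "missed: @missed\n" if @missed;
print "leaked: @leaked\n" if @leaked;
print "all cases behave as expected\n" unless @missed || @leaked;
```

Keeping both lists makes it easy to add a new variant later and see at once whether the pattern still rejects what it should.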


cybex
Novice

Dec 1, 2013, 3:25 PM

Post #9 of 10 (29469 views)
Re: [Laurent_R] Syntax for string variances in Perl

Source:

Code
  S  This is a subject. Subject's here should not be caught. 1 

S: This is a subject. Subject's here should not be caught. 2

S2: This is a subject. Subject's here should not be caught. 3

SUBJECT: This is a subject. Subject's here should not be caught. 4

SUBJECT 3: This is a subject. Subject's here should not be caught. 5

(S) This is a subject. Subject's here should not be caught. 6

(S7) This is a subject. Subject's here should not be caught. 7

(SUBJECT) This is a subject. Subject's here should not be caught. 8

(SUBJECT): This is a subject. Subject's here should not be caught. 9

(SUBJECT 2): This is a subject. Subject's here should not be caught. 10

SUBJECT This is a subject. Subject's here should not be caught. 11



Script code:

Code
foreach my $text ( $tree->findvalues('//p') ) {
    # Keep only paragraphs that start with an S/SUBJECT marker.
    if ( $text =~ m/^\(?S(\s|: |\d:?|\)|UBJECT)/ ) {
        # Append the paragraph to <directory><filename>.txt.
        open my $fh, ">>", $directory . $filename . ".txt" or die $!;
        #$text =~ s/^[\w+*\s*]://;
        #$text =~ s/^\s+//;
        print {$fh} $text, "\n\n";
        close $fh;
    }
}



Return:

Code
S2: This is a subject. Subject's here should not be caught. 3 

SUBJECT: This is a subject. Subject's here should not be caught. 4

SUBJECT 3: This is a subject. Subject's here should not be caught. 5

(S7)  This is a subject. Subject's here should not be caught. 7

(SUBJECT) This is a subject. Subject's here should not be caught. 8

(SUBJECT): This is a subject. Subject's here should not be caught. 9

(SUBJECT 2): This is a subject. Subject's here should not be caught. 10


As you can see 1, 2, 6, and 11 are not matching on my system.

System Info: (Perl was installed via Strawberry Perl 5.18.1.1-32bit)

Code
C:\Documents and Settings\cybex>ver 

Microsoft Windows XP [Version 5.1.2600]

C:\Documents and Settings\cybex>perl -v

This is perl 5, version 18, subversion 1 (v5.18.1) built for MSWin32-x86-multi-thread-64int
Copyright 1987-2013, Larry Wall



I am not sure whether the Windows version differs from the Linux version.


Laurent_R
Veteran / Moderator

Dec 2, 2013, 1:10 AM

Post #10 of 10 (29408 views)
Re: [cybex] Syntax for string variances in Perl

Just a quick answer (no time to test anything right now): 1, 6 and 11 do not match because they start with whitespace rather than with 'S' or '(S'.
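If leading whitespace is indeed the cause, one possible adjustment is to allow it explicitly with \s* right after the ^. The variant below is a suggestion that nobody in this thread has tested against the real data; is_subject_para() is a hypothetical helper name and the paragraphs are modelled on the source shown in post #9.

```perl
use strict;
use warnings;

# Same pattern as in post #8, with \s* added after ^ so that leading
# blanks are skipped. This variant is a suggestion, not tested by the
# thread participants; is_subject_para() is a hypothetical helper name.
sub is_subject_para {
    my ($text) = @_;
    return $text =~ m/^\s*\(?S(\s|: |\d:?|\)|UBJECT)/ ? 1 : 0;
}

# Hypothetical paragraphs modelled on the source data in post #9.
my @paragraphs = (
    "  S  This is a subject. 1",
    " (S) This is a subject. 6",
    " SUBJECT This is a subject. 11",
    "Subject's here should not be caught.",
);

for my $p (@paragraphs) {
    printf "%-40s %s\n", "'$p'", is_subject_para($p) ? 'match' : 'no match';
}
```

Another option would be to strip the leading whitespace before testing, as the commented-out s/^\s+// line in the script already hints at, and keep the pattern from post #8 as it is.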

 
 

