CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: General Discussions: General Questions:
postive and negative lookahead assertions

 



iThunder
Novice

Nov 23, 2015, 8:22 AM

Post #1 of 14 (16945 views)
postive and negative lookahead assertions Can't Post

Hello,

I have following text stored in a text file

line1 brown fox
line2 black owl
line3 red dear

and i am running following perl script

perl -ne 'print if /(line.*)(?=.*fox)/' text.txt

The output is

line1 brown fox

But if i do negative lookahead assertion i.e.

perl -ne 'print if /(line.*)(?!.*fox)/' text.txt

it prints all three lines of the text file (but i expect it to print only last two lines).

Can anyone help with why is it matching all three lines in negative lookahead assertion?


Laurent_R
Veteran / Moderator

Nov 23, 2015, 10:46 AM

Post #2 of 14 (16938 views)
Re: [iThunder] postive and negative lookahead assertions [In reply to] Can't Post

The problem is that negative assertions have to match exactly. Or, said in another way, the regex engine is able to find a match where the negative assertion is not matched. Compare these:

Code
$ echo 'line1 brown fox 
> line2 black owl
> line3 red dear
> ' | perl -ne 'print if /line\d+\s+\w+\s+(?!.*?fox)/'
line2 black owl
line3 red dear

HP ~
$ echo 'line1 brown fox
> line2 black owl
> line3 red dear
> ' | perl -ne 'print if /line\d+\s+.+(?!.*?fox)/'
line1 brown fox
line2 black owl
line3 red dear

In the first case, "fox" is really the next thing to match, the negative assertion must match it and have the overall match to fail.

In the second case, the regex engine is able to find a match where the negative assertion fails and therefore the overall match succeeds.

Consider also this example, to see what the regex engine really matched:

Code
$ echo 'line1 brown fox 
> line2 black owl
> line3 red dear
> ' | perl -ne 'print "$1\n" if /(line\d+\s+.+?)(?!.*?fox)/'
line1 brown f
line2 b
line3 r

I have the feeling I am not being entirely clear, but I hope you still get the point.


iThunder
Novice

Nov 23, 2015, 9:32 PM

Post #3 of 14 (16922 views)
Re: [Laurent_R] postive and negative lookahead assertions [In reply to] Can't Post

Thanks. i got the idea for non-greedy match (as explained in your example)
How will it work when using greedy match(.*)?
so if i have line

line1 brown fox

and i use the perl -ne 'print if /(line.*)(?=.*fox)/' then the first dot-star (line.*) operator will match the entire string (line1 brown fox) and should do backtracking (when checking lookhead assertion). But the lookahead statement also has dot-star operator in it. Will that cause a backtracking too and how?


Chris Charley
User

Nov 24, 2015, 12:15 PM

Post #4 of 14 (16909 views)
Re: [iThunder] postive and negative lookahead assertions [In reply to] Can't Post

I cannot comment on why your 2 regexs worked or, in the second case, failed to work. I am not well versed in regexs.

However if written as below, (in the posted program), both cases, (fox or not fox), function as desired.


Code
 
#!/usr/bin/perl
use strict;
use warnings;

my $re1 = '(?=.+fox)^line.+$';
my $re2 = '(?!.+fox)^line.+$';

for my $re ($re1, $re2) {
print "\tUsing regular expression $re\n";
while (<DATA>) {
print if /$re/;
}
seek DATA, 0, 0;
print "\n\n";
}

__DATA__
line1 brown fox
line2 black owl
line3 red dear



Output is:

Code
C:\Old_Data\perlp>perl test2.pl 
Using regular expression (?=.+fox)^line.+$
line1 brown fox


Using regular expression (?!.+fox)^line.+$
line2 black owl
line3 red dear



(This post was edited by Chris Charley on Nov 25, 2015, 9:20 AM)


Laurent_R
Veteran / Moderator

Nov 24, 2015, 11:30 PM

Post #5 of 14 (16891 views)
Re: [iThunder] postive and negative lookahead assertions [In reply to] Can't Post

As far as I understand it, it is not really a question of greedy versus non greedy match, nor a question of backtracking. My first two examples used greedy match and I used non greedy match just to show how the regex engine proceeded to get the match.

The real important point was the difference between the first and the second example: can the regex engine find a way to make the negative assertion to fail (and this the overall regex to succeed)?

This is an issue of semantics of negative assertions. The regex engine is trying very very hard to get a match, and the negative assertion will make it fail only if it is bound to fail. If the regex engine is able to find a match path in which it can circumvent the negative assertion, it will use it. That's the essence of the difference between my first and second example.


Chris Charley
User

Nov 25, 2015, 5:41 PM

Post #6 of 14 (16864 views)
Re: [iThunder] postive and negative lookahead assertions [In reply to] Can't Post

Using the perl regex debugger, I hope to show why your second regex, /(line.*)(?!.*fox)/, succeeded to match the first line, line1 brown fox, when you really meant for it to fail instead. The output from the debugger is kind of dense, but I'll try to describe it to make some sense. I got some explanation of what the numbers in the output meant, (bytecode numbers for the bytecode tree the regex prepares during the compilation phase for the regex engine). See a good explanation of them in the last post in this web page here. Below is the output from the debugger for your regex - I bolded the start of the compilation phase and then the start of the execution phase.

Code
Compiling REx "(line.*)(?!.*fox)" 
Final program:
1: OPEN1 (3)
3: EXACT <line> (5)
5: STAR (7)
6: REG_ANY (0)
7: CLOSE1 (9)
9: UNLESSM[0] (17)
11: STAR (13)
12: REG_ANY (0)
13: EXACT <fox> (15)
15: SUCCEED (0)
16: TAIL (17)
17: END (0)
anchored "line" at 0 (checking anchored) minlen 4
Guessing start of match in sv for REx "(line.*)(?!.*fox)" against "line1 brown fox"
Found anchored substr "line" at offset 0...
Guessed: match at offset 0
Matching REx "(line.*)(?!.*fox)" against "line1 brown fox"
0 <> <line1 brow> | 1:OPEN1(3)
0 <> <line1 brow> | 3:EXACT <line>(5)
4 <line> <1 brown fo> | 5:STAR(7)
REG_ANY can match 11 times out of 2147483647...
15 <e1 brown fox> <> | 7: CLOSE1(9)
15 <e1 brown fox> <> | 9: UNLESSM[0](17)
15 <e1 brown fox> <> | 11: STAR(13)
REG_ANY can match 0 times out of 2147483647...
failed...
15 <e1 brown fox> <> | 17: END(0)
Match successful!
Freeing REx: "(line.*)(?!.*fox)"

In the compilation phase, the number for the bytecode tree nodes are listed with the description to the right of them.

OPEN! - opening parenthesis
----EXACT (line) - the exact text to be matched
----STAR - the asterisk
--------REG_ANY - indicates the '.' token
CLOSE! - closing parenthesis
UNLESSM[-3] - unless match
----STAR (9) - the asterrisk
--------REG_ANY - indicates the '.' token
----EXACT (fox) - the exact text to be matched
----SUCCEED (0) - not sure about this one
TAIL - closing parenthesis
END

Now the execution phase:

0 <> <line1 brow> | 1:OPEN1(3)
beginning with the opening parenthesis

0 <> <line1 brow> | 3:EXACT <line>(5)
made the exact match (<line> in the left bracket below.
Note the unmatched text is in the bracket to the right, <1 brown fo>


4 <line> <1 brown fo> | 5:STAR(7)
REG_ANY can match 11 times out of 2147483647...
Now the STAR * modifier will capture the complete text, (next line, <e1 brown fox>). NOTE: all the text has been captured and there is no text left for the lookahead to match. Thus, the entire match will succeed

15 <e1 brown fox> <> | 7: CLOSE1(9)
Closing the capturing parenthesis

15 <e1 brown fox> <> | 9: UNLESSM[0](17)
15 <e1 brown fox> <> | 11: STAR(13)
REG_ANY can match 0 times out of 2147483647...
failed...
Failed to match 'fox' in the look ahead and so 'succeeded'

15 <e1 brown fox> <> | 17: END(0)
Match successful!
Freeing REx: "(line.*)(?!.*fox)"

------------------------------------------------------------------------------

This was a learning experience for me and I know how to read somewhat the debugger output now. :-)
(at least for this somewhat simple case).

Hope you make sense of my explanation. There was no backtracking and the regex succeeded, (where you wanted it to fail), because the first '.*' ate the entire line leaving no 'fox' for the lookahead to detect and cause it to fail.

The right regex would have been ^(line.*$)(?<!fox).

The debug program is below.


Code
#!/usr/bin/perl 
use strict;
use warnings;
use re qw/debug/;

$_ = 'line1 brown fox';
/(line.*)(?!.*fox)/;


A program with a good regex is below:

Code
#!/usr/bin/perl 
use strict;
use warnings;

my $re1 = '^(line.*fox)$'; # no look-around assertions needed
my $re2 = '^(line.*$)(?<!fox)';

for my $re ($re1, $re2) {
print "\tUsing regular expression $re\n";
while (<DATA>) {
print "$1\n" if /$re/;
}
seek DATA, 0, 0;
print "\n";
}

__DATA__
line1 brown fox
line2 black owl
line3 red dear



(This post was edited by Chris Charley on Nov 25, 2015, 5:44 PM)


iThunder
Novice

Nov 27, 2015, 5:21 PM

Post #7 of 14 (16782 views)
Re: [Chris Charley] postive and negative lookahead assertions [In reply to] Can't Post

Thanks for detailed explanation and providing that debugger code. All of it is really really helpful :).
I did test two more codes...this time removing the (.*) from my second group to make it more simpler i.e.
1)
perl reg.pl
Compiling REx "(line.*)(?!fox)"
Final program:
1: OPEN1 (3)
3: EXACT <line> (5)
5: STAR (7)
6: REG_ANY (0)
7: CLOSE1 (9)
9: UNLESSM[0] (15)
11: EXACT <fox> (13)
13: SUCCEED (0)
14: TAIL (15)
15: END (0)
anchored "line" at 0 (checking anchored) minlen 4
Matching REx "(line.*)(?!fox)" against "line1 brown fox"
Intuit: trying to determine minimum start position...
Found anchored substr "line" at offset 0...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <line1 brow> | 1:OPEN1(3)
0 <> <line1 brow> | 3:EXACT <line>(5)
4 <line> <1 brown fo> | 5:STAR(7)
REG_ANY can match 11 times out of 2147483647...
15 <e1 brown fox> <> | 7: CLOSE1(9)
15 <e1 brown fox> <> | 9: UNLESSM[0](15)
15 <e1 brown fox> <> | 11: EXACT <fox>(13)
failed...
15 <e1 brown fox> <> | 15: END(0)
Match successful!
Freeing REx: "(line.*)(?!fox)"



2).
perl reg.pl
Compiling REx "(line.*)(?=fox)"
Final program:
1: OPEN1 (3)
3: EXACT <line> (5)
5: STAR (7)
6: REG_ANY (0)
7: CLOSE1 (9)
9: IFMATCH[0] (15)
11: EXACT <fox> (13)
13: SUCCEED (0)
14: TAIL (15)
15: END (0)
anchored "line" at 0 (checking anchored) minlen 4
Matching REx "(line.*)(?=fox)" against "line1 brown fox"
Intuit: trying to determine minimum start position...
Found anchored substr "line" at offset 0...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <line1 brow> | 1:OPEN1(3)
0 <> <line1 brow> | 3:EXACT <line>(5)
4 <line> <1 brown fo> | 5:STAR(7)
REG_ANY can match 11 times out of 2147483647...
12 <e1 brown > <fox> | 7: CLOSE1(9)
12 <e1 brown > <fox> | 9: IFMATCH[0](15)
12 <e1 brown > <fox> | 11: EXACT <fox>(13)
15 <e1 brown fox> <> | 13: SUCCEED(0)
subpattern success...
12 <e1 brown > <fox> | 15: END(0)
Match successful!
Freeing REx: "(line.*)(?=fox)"

In the second code, only ?! was replaced with ?= in second capture group (?=fox).
Now with this code, my first capture group (line.*) is matching upto 12 characters (line1 brown ) instead of all 15 characters (code 1). Shouldn't it match all the way to end of line? Why changes in second capture group are effecting the regex match of first group?

thanks...really appreciate your response!


Chris Charley
User

Nov 27, 2015, 7:57 PM

Post #8 of 14 (16772 views)
Re: [iThunder] postive and negative lookahead assertions [In reply to] Can't Post

This is my first rodeo :-), so at this point I may not be able to explain how the regex engine steps through the nodes for your 2 examples. Its late and I'll try looking at it tomorrow.

However I found this How can test I regular expressions using multiple RE engines? and in Pat's post, he lists some analyzers and The Regex Coach looks like a possible help especially the section The parse tree. Might give it a try here.


Laurent_R
Veteran / Moderator

Nov 28, 2015, 2:33 AM

Post #9 of 14 (16759 views)
Re: [iThunder] postive and negative lookahead assertions [In reply to] Can't Post

Let my try again to explain the semantics of negative lookaround assertions, at least as I understand it.

You've got to understand that a negative lookaround assertion will prevent the regex engine from matching only if it can't do otherwise. Consider these tests under the Perl debugger:

Code
  DB<1> $_ = "tip top"; 

DB<2> print "true\n" if /tip/;
true

DB<3> print "true\n" if /tip(?!top)/
true

DB<4> print "true\n" if /tip.(?!top)/

DB<5> print "true\n" if /tip.+(?!top)/
true

DB<6> print $1 if /(tip.+)(?!top)/;
tip top
DB<7> print $1 if /(tip.+?)(?!top)/;
tip t

In line DB<4>, the regex engine CAN'T avoid matching the negative assertion so that the overall regex fails. In line DB<5>, the regex engine CAN find a way of matching the string and not matching negative assertion.

Line DB<6> shows how it can do it: the first subexpression matches the whole string "tip top" (because of greediness) and, therefore, the negative assertion can no longer match and fails, so that the overall regex succeeds. Line DB<7> shows the same with a non-greedy quantifier: the regex engines actually matches just enough with the first subexpression in order to prevent the negative assertion from succeeding.

My previous explanations were perhaps not clear enough, I hope this makes now this clearer.


BillKSmith
Veteran

Nov 28, 2015, 1:46 PM

Post #10 of 14 (16740 views)
Re: [iThunder] postive and negative lookahead assertions [In reply to] Can't Post

I think that Chris's first reply was correct. The /.*/ should be part of the assertion, not the match. This can be written more simply:

Code
use strict; 
use warnings;
my @tests = (
"line1 brown fox\n",
"line2 black owl\n",
"line3 red dear\n",
);
print "Positive lookahead\n";
foreach (@tests) {
print if /line(?=.*fox)/;
}
print "Negative lookahead\n";
foreach (@tests) {
print if /line(?!.*fox)/;
}

Good Luck,
Bill


Laurent_R
Veteran / Moderator

Nov 29, 2015, 1:10 AM

Post #11 of 14 (16722 views)
Re: [BillKSmith] postive and negative lookahead assertions [In reply to] Can't Post


In Reply To
I think that Chris's first reply was correct. The /.*/ should be part of the assertion, not the match.


Yes, sure, but why?

As I said, the whole point is whether the regex engine is able to find a matching path where the negative assertion fails. If the /.*/ is part of the match, then the regex engine can find a matching path where the negative assertion fails, and the regex engine will not backtrack to try to have the negative assertion to succeed (and thus the overall regex to fail).

Just another couple of examples:

Code
  DB<1> $_ = "line1 brown fox\n"; 

DB<2> print if /line\d+\s+\w+\s+(?!fox)/;

DB<3> print if /line\d+\s+\w+.+(?!fox)/;
line1 brown fox

DB<4> print if /line\d+\s+\w+\s+(?!fix)/;
line1 brown fox

If the match subexpression is able to match part of the characters of the negative lookahead subexpression, the assertion will fail and the overall match will be able to succeed. I have shown in my previous post what the regex engine is able to match in such a case, both with a greedy and and non greedy quantifier in the match subexpression.


BillKSmith
Veteran

Nov 30, 2015, 9:06 PM

Post #12 of 14 (16667 views)
Re: [Laurent_R] postive and negative lookahead assertions [In reply to] Can't Post

The original question was "why does one of my regex fail to get the expected result in one of three test cases?". You have shown that none of the six test cases were doing exactly what was intended. The correct results were something like the broken clock that is right twice a day.

To form a correct regex, we need an reasonably accurate statement of the problem. (How else could we possibly tell if we are right?) I propose: "The regex should match the string if the string contains 'line' and does (or does not in the negative case) contain 'fox' in the remainder of the string." Note that when we translate this sentence into regex symbols, the assertion must apply to the entire remainder, not some substring of it.

This is what I have done. The issue of greediness never arises because there are no greedy (or non-greedy) operators in the match. The following test demonstrates that this pair of regex pass all six test cases.


Code
use strict; 
use warnings;
use Test::Simple tests => 6;

my $REGEX_POS = qr/ line (?=.*fox) /x;
my $REGEX_NEG = qr/ line (?!.*fox) /x;

$_ = "line1 brown fox\n";
ok( /$REGEX_POS/, 'Positive lookahead Match');
ok( !/$REGEX_NEG/, 'Negative lookahead No-Match');

$_ = "line2 black own\n";
ok( !/$REGEX_POS/, 'Positive lookahead No-Match');
ok( /$REGEX_NEG/, 'Negative lookahead Match');

$_ = "line3 red dear\n";
ok( !/$REGEX_POS/, 'Positive lookahead No-Match');
ok( /$REGEX_NEG/, 'Negative lookahead Match');


OUTPUT:

Code
1..6 
ok 1 - Positive lookahead Match
ok 2 - Negative lookahead No-Match
ok 3 - Positive lookahead No-Match
ok 4 - Negative lookahead Match
ok 5 - Positive lookahead No-Match
ok 6 - Negative lookahead Match

Good Luck,
Bill


Laurent_R
Veteran / Moderator

Dec 1, 2015, 2:38 PM

Post #13 of 14 (16645 views)
Re: [BillKSmith] postive and negative lookahead assertions [In reply to] Can't Post

Hi Bill,

I think there must be a misunderstanding here. Although I was answering to your post, it did not mean in any way that I disagreed with it. I only begged to explain the reason why your code succeeded and the OP's code did not.

Quote
The issue of greediness never arises because there are no greedy (or non-greedy) operators in the match.

Agreed, the issue of greediness never arises in your code, since you have no quantifier at all in your matching part, and that's why it works easily (and, BTW, I fully agree that's the way it should be).

But my whole point since the beginning was that the OP's code:

Code
perl -ne 'print if /(line.*)(?!.*fox)/' text.txt

did not work as expected because of the ".*" within the match subexpression, which gave the regex engine an opportunity to find a match not impaired by the negative assertion (and, BTW, a non greedy quantifier, ".*?", would have exactly the same result in this specific case).

And my whole point in my various posts was not so much to provide a solution (although I think I actually did), but rather to answer the OP's deeper question and to explain the OP why his or her code did not work, namely that the ".*" within the match subexpression prevented the negative lookahead assertion from matching anything, thereby allowing the whole regex to match despite the negative assertion, which was obviously not the desired result. And all of my examples were aimed at showing this. The two examples in my very first post already showed how the presence of ".+" within the match changed completely the result (made the negative assertion to fail), whereas, without that ".+", the negative assertion succeeded (and the overall regex thus failed).

These things are not very easy to explain and quite possibly I was not clear enough (maybe it is also due to English not being my mother tongue -- sorry if my English is too poor), but I think I showed quite relatively clearly several times and already in my very first post and its code examples that the failure of the negative lookahead assertion was due to the catch-all-the-rest-of-the-line ".*" (or, in this case, equivalent ".+") within the match subexpression preceding the negative assertion.


BillKSmith
Veteran

Dec 1, 2015, 8:11 PM

Post #14 of 14 (16632 views)
Re: [Laurent_R] postive and negative lookahead assertions [In reply to] Can't Post

Laurent,

I think that I was the one who was not clear. My first paragraph was intended to convey my agreement with you. (You answered the original question very well.) The remainder of my post was to provide information on how to go about writing the correct regex and verifying that it is correct. I suspect that the OP's original mistake had more to do with his failing to understand his own requirement than it did with his understanding of the regex engine.

Never doubt your ability with English. I wish that I could write as well as you. Until you told us otherwise, it never occurred to me that English was not your native language.
Good Luck,
Bill

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives