Matching repetitions of an exact length

 



pdaftpguru
New User

May 18, 2011, 7:26 PM

Post #1 of 7 (5254 views)
Matching repetitions of an exact length

Hi

If I need to find character repetitions of an exact length, what's the best way to do it?

Here, I want to find the 'c' and 'f' substrings, since there are 3 characters in each :

my $letters = "aabbbbbcccdeeeeeefffgggg";

My attempts fail since they match within repetitions of more than 3 characters.
I can't/don't know how to use lookbehind to anchor the group - if it's feasible.

The only way I can get this to work seems clumsy :

while ($letters =~ m/((.)\2\2+)/g)
{
    length($1) == 3 && print "3-character sequence found [$1].\n";
}

Is there a better way? Advice greatly appreciated.

Thanks.


miller
User

May 18, 2011, 9:15 PM

Post #2 of 7 (5253 views)
Re: [pdaftpguru] Matching repetitions of an exact length

You're doing it the best way possible. Taking advantage of greedy matching is totally the way to go. The only thing I'd change is to match runs of two or more characters, which makes the regex a little simpler:


Code
use strict; 
use warnings;

my $letters = "aabbbbbcccdeeeeeefffgggg";

while ($letters =~ m/((.)\2+)/g) {
    length($1) == 3 && print "3-character sequence found [$1].\n";
}
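
The same greedy-match-then-filter idea can be wrapped up for any run length. This is only a minimal sketch: the helper name runs_of_length and its parameters are made up for illustration and are not from the thread. It uses \2* rather than \2+ so that runs of length 1 are also seen as matches.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every maximal run of
# one repeated character whose total length is exactly $n.
sub runs_of_length {
    my ($string, $n) = @_;
    my @runs;
    # (.)\2* greedily consumes a whole run, so each match is a maximal run
    # and the next match starts at the first character of the next run.
    while ($string =~ m/((.)\2*)/gs) {
        push @runs, $1 if length($1) == $n;
    }
    return @runs;
}

my $letters = "aabbbbbcccdeeeeeefffgggg";
print join(',', runs_of_length($letters, 3)), "\n";   # prints: ccc,fff
```

Because the length test happens outside the regex, changing the target length only means changing the second argument.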


- Miller


rovf
Veteran

May 24, 2011, 6:47 AM

Post #3 of 7 (5040 views)
Re: [pdaftpguru] Matching repetitions of an exact length

This would be a *nearly* correct solution, which picks up only the "runs" of length 3:


Code
my @runs = $letters =~ m(
    (.)         # \1 some character
    (?!\1)      # negative look-ahead: next character must be different
    ((.)\3\3)   # Run of 3 identical characters
    (?!\3)      # negative look-ahead: next character must be different
)gx;


However, this solution has a drawback: It does not retrieve a 3-character-run at the *beginning* of the string.

A workaround would be to prepend to the string a character which we know won't occur within the string, for example:


Code
my @runs=(chr(0).$letters) =~ m(...)gx;


This works, but is a hack.

Maybe someone can improve my solution by making it work without this hack?


miller
User

May 24, 2011, 12:54 PM

Post #4 of 7 (5018 views)
Re: [rovf] Matching repetitions of an exact length

Capturing using the list mode won't work as it will return all the subgroups, not just the 3 character string.

However, you can enhance your regex by simply adding the option for the beginning of the string.


Code
use strict; 
use warnings;

my $letters = "aaabbbbbcccdeeeeeefffgggg";

while ($letters =~ m{
    (?:
        (.)         # \1 some character
        (?!\1)      # negative look-ahead: next character must be different
    |
        ^
    )
    ((.)\3\3)       # Run of 3 identical characters
    (?!\3)          # negative look-ahead: next character must be different
}gx) {
    print $2, "\n";
}


As I said before though, this is extremely messy, and it's much cleaner to just rely on greedy matching and do length filtering after the fact. Yes, the above works, but it requires decoding to understand.
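
For what it's worth, here is a hedged sketch of a look-behind variant of the regex above. It is not something posted in this thread, and the input string is made up for illustration. The idea is to inspect the preceding character with a fixed-width look-behind instead of consuming it. With the consuming version, a 3-run that directly follows another matched 3-run (the "bbb" below) appears to be skipped, because the boundary character has already been eaten by the previous match.

```perl
use strict;
use warnings;

my $letters = "aaabbbcdddd";   # illustrative input, not from the thread

my @runs;
while ($letters =~ m{
    (?:
        ^               # start of string
    |
        (?<=(.))        # \1 previous character, inspected but not consumed
        (?!\1)          # next character must differ from it
    )
    ((.)\3\3)           # \2 run of 3 identical characters
    (?!\3)              # the run must not continue
}gxs) {
    push @runs, $2;
}

print join(',', @runs), "\n";   # prints: aaa,bbb
```

It still needs decoding to read, so the length-filter approach remains the simpler option.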

- Miller


rovf
Veteran

May 24, 2011, 2:47 PM

Post #5 of 7 (5014 views)
Re: [miller] Matching repetitions of an exact length


Quote
Capturing using the list mode won't work as it will return all the subgroups, not just the 3 character string.


Hmmm... I tried my example, and for the data supplied, it *only* captured 3-character-subgroups, as required. This is not related to list mode, but to negative look-ahead. I just don't like it because of the "chr(0)"-hack.


miller
User

May 24, 2011, 3:48 PM

Post #6 of 7 (5010 views)
Re: [rovf] Matching repetitions of an exact length

Yes, you can get around the hack by using the enhanced regex that I built from yours.

However, list mode will still return all the capturing subgroups, not just $2.


Code
 
my $letters = "aabbbbbcccdeeeeeefffgggg";

my @runs = $letters =~ m(
    (.)         # \1 some character
    (?!\1)      # negative look-ahead: next character must be different
    ((.)\3\3)   # Run of 3 identical characters
    (?!\3)      # negative look-ahead: next character must be different
)gx;

print join(',', @runs), "\n";

=prints
b,ccc,c,e,fff,f
=cut


It's easily solved though, just put it in a while loop so that $2 can be pulled explicitly.
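
A minimal sketch of that loop, using the same regex and data as above and collecting only $2 into an array:

```perl
use strict;
use warnings;

my $letters = "aabbbbbcccdeeeeeefffgggg";

my @runs;
while ($letters =~ m{
    (.)         # \1 some character
    (?!\1)      # next character must be different
    ((.)\3\3)   # \2 run of 3 identical characters
    (?!\3)      # next character must be different
}gx) {
    push @runs, $2;   # keep only the run itself, not \1 or \3
}

print join(',', @runs), "\n";   # prints: ccc,fff
```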

- M


rovf
Veteran

May 24, 2011, 10:44 PM

Post #7 of 7 (4994 views)
Re: [miller] Matching repetitions of an exact length


Quote
However, list mode will still return all the capturing subgroups, not just $2.


Ah, of course, now I see it. Thank you for pointing this out. So instead of list mode, we really need an explicit loop here.

I should have seen it myself. When testing it, I had used a loop, and I rewrote it to list-mode assigning when posting, because I thought this would be more clever. In fact, this was silly....

 
 

