May 25, 2005, 5:33 PM
Post #1 of 3
It's always the seemingly simple problems that get you. I'm still a newbie so this is probably peanuts for the gurus here.
First, I am using ActivePerl 5.8 on a WinXP machine. I have been coding in Perl for about a week now so I'm still very green. I am trying to extract a pattern from a long string.
I have a long string with repeated patterns of
junk [foo] fred 6,000 [/bar] junk [foo] wilma betty [/bar]
When the characters between [foo] and [/bar] are homogeneous, I have no problem extracting that pattern. But when they include any numbers, the regex doesn't put just that block in the $1 backreference; it gives me more than I want. Here's an example:
SCENARIO 1 CODE:
###Version 1 - homogeneous characters between [foo][/bar]
# this string has homogeneous characters in-between [foo][/bar]
$string = "xxx[foo]aaaaaaaa[/bar]xxx[foo]bbbbbbb[/bar]xxx";
# match the string that has alpha characters
while($string =~ m/(\[foo]\w+\[\/bar])/sig) {
    print $1, "\n";
}
SCENARIO 1 RESULTS:
# everything prints as expected
# perl extracted my match pattern
But now here's my conundrum. I want my pattern to recognize instances when the meat (characters) inside the [foo][/bar] delimiters is a mixture of numbers with commas and letters but NOT just letters. I want to be able to recognize and accept only things like:
[foo] - fred 6,000 blah barney 69 >= [/bar]
and reject things like:
[foo] blah betty blah wilma [/bar]
My problem is that whenever I put characters between [foo][/bar] that aren't of a single homogeneous type, like strictly alphabetical (abc) or strictly numeric (123), perl extracts more than it should.
Let me show you what I mean.
SCENARIO 2 CODE:
###Version 2 - extracting comma'ed number surrounded by weird characters
# this string has a comma'ed number surrounded by all sorts of crazy characters
$string = "xxx[foo]a>a1,300a=}[/bar]xxx[foo]bbbbbbb[/bar]xxx";
# trying to match the string that has a comma and anything around it, as long as it's within the [foo][/bar] delimiters
# here I use the . pattern character because we have crazy characters surrounding the ,
while($string =~ m/(\[foo].*\,+.*\[\/bar])/sig) {
    print $1, "\n";
}
SCENARIO 2 RESULTS:
# perl seems to have extracted more than my match pattern
# Why does it extract more than it should?! This appended string doesn't even have a , in it!
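To make the Scenario 2 behavior easy to reproduce, here is a minimal self-contained sketch. The sub name scenario2_matches is made up for this post and is not part of my original script; the pattern is the same one as above, only with the closing brackets escaped for readability.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for illustration): returns every
# capture that the Scenario 2 pattern finds in the given string.
sub scenario2_matches {
    my ($string) = @_;
    my @found;
    while ($string =~ m/(\[foo\].*,+.*\[\/bar\])/sig) {
        push @found, $1;
    }
    return @found;
}

my $string = "xxx[foo]a>a1,300a=}[/bar]xxx[foo]bbbbbbb[/bar]xxx";
print "$_\n" for scenario2_matches($string);
# prints one line only:
# [foo]a>a1,300a=}[/bar]xxx[foo]bbbbbbb[/bar]
```

The single capture runs from the first [foo] all the way to the last [/bar], which is the "more than I want" result described above.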
And then when I try to extract based on a number character it gives me the same results.
SCENARIO 3 CODE:
$string = "xxx[foo]aa1,300aaa[/bar]xxx[foo]bbbbbbb[/bar]xxx";
while($string =~ m/(\[foo].*\d+.*\[\/bar])/sig) {
    print $1, "\n";
}
SCENARIO 3 RESULTS:
# the same as Scenario 2 results
So basically, whenever I try to match anything within the [foo][/bar] string I get more than I want. What complicates matters is that the meat in-between the [foo][/bar] wrappers can be any character, but I only want to extract the meat which contains numbers with commas and some words, NOT the ones with words only.
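To pin down the behavior I'm after, here is a rough sketch of the intended result rather than a claim that this is the right way to do it. It rests on two assumptions of mine: that a non-greedy .*? keeps each capture inside a single [foo]...[/bar] pair, and that "contains numbers" can be reduced to "contains at least one digit". The sub name blocks_with_numbers is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for illustration): collect each
# [foo]...[/bar] block on its own, then keep only the blocks whose
# contents include at least one digit.
sub blocks_with_numbers {
    my ($string) = @_;
    my @keep;
    # .*? is non-greedy, so each capture stops at the nearest [/bar]
    while ($string =~ m/(\[foo\](.*?)\[\/bar\])/sig) {
        my ($block, $meat) = ($1, $2);
        push @keep, $block if $meat =~ /\d/;
    }
    return @keep;
}

my $string = "junk [foo] - fred 6,000 blah barney 69 >= [/bar] "
           . "junk [foo] blah betty blah wilma [/bar]";
print "$_\n" for blocks_with_numbers($string);
# prints:
# [foo] - fred 6,000 blah barney 69 >= [/bar]
```

Splitting the job into "find each block" and "test its contents" avoids cramming both conditions into one pattern, though a single-regex version may well exist.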
I know this is a very long post but this problem has overwhelmed my psyche. Ahhh! I think writing it out helped me really understand the problem. I'd appreciate any suggestions.
Thanks for reading.