
pebreo
Novice
May 25, 2005, 5:33 PM
Post #1 of 3
matching conundrum
Hello all,

It seems that the seemingly simple problems are the ones that get you. I'm still a newbie, so this is probably peanuts for the gurus here.

Background: I am using ActivePerl 5.8 on a WinXP machine. I have been coding in Perl for about a week now, so I'm still very green.

I am trying to extract a pattern from a long string that contains repeated blocks like this:

    junk [foo] fred 6,000 [/bar] junk [foo] wilma betty [/bar]

When the characters inside [foo][/bar] are homogeneous, I have no problem extracting the pattern. But when they include any numbers, the regex doesn't put just that block in the $1 backreference; it gives me more than I want. Here's an example:

SCENARIO 1 CODE:
    ### Version 1 - homogeneous characters between [foo][/bar]
    # this string has homogeneous characters in between [foo][/bar]
    $string = "xxx[foo]aaaaaaaa[/bar]xxx[foo]bbbbbbb[/bar]xxx";

    # match the string that has alpha characters
    while ($string =~ m/(\[foo]\w+\[\/bar])/sig) {
        print $1, "\n";
    }

SCENARIO 1 RESULTS:
    # everything prints as expected
    # perl extracted my match pattern
    [foo]aaaaaaaa[/bar]
    [foo]bbbbbbb[/bar]

The Problem:

But now here's my conundrum. I want my pattern to recognize instances where the meat (the characters) inside the [foo][/bar] delimiters is a mixture of letters and numbers with commas, but NOT just letters. I want to recognize and accept only things like:
    [foo] - fred 6,000 blah barney 69 >= [/bar]

But NOT:

    [foo] blah betty blah wilma [/bar]

My problem is that whenever the characters between [foo][/bar] aren't of a homogeneous type, such as strictly alphabetical (abc) or strictly numeric (123), perl extracts more than it should. Let me show you what I mean.

SCENARIO 2 CODE:
    ### Version 2 - extracting comma'ed number surrounded by weird characters
    # this string has a comma'ed number surrounded by all sorts of crazy characters
    $string = "xxx[foo]a>a1,300a=}[/bar]xxx[foo]bbbbbbb[/bar]xxx";

    # trying to match the string that has a comma and anything around it,
    # as long as it's within the [foo][/bar] delimiters
    # here I use the . pattern character because we have crazy characters surrounding the ,
    while ($string =~ m/(\[foo].*\,+.*\[\/bar])/sig) {
        print $1, "\n";
    }

SCENARIO 2 RESULTS:
    # perl seems to have extracted more than my match pattern
    # Why does it extract more than it should?! This appended string doesn't even have a , in it!
    [foo]a>a1,300a=}[/bar]xxx[foo]bbbbbbb[/bar]

And when I try to extract based on a digit character instead, I get the same kind of result.

SCENARIO 3 CODE:
    $string = "xxx[foo]aa1,300aaa[/bar]xxx[foo]bbbbbbb[/bar]xxx";

    while ($string =~ m/(\[foo].*\d+.*\[\/bar])/sig) {
        print $1, "\n";
    }

SCENARIO 3 RESULTS:
    # the same as Scenario 2 results
    [foo]aa1,300aaa[/bar]xxx[foo]bbbbbbb[/bar]

Summary:

So basically, whenever I try to match anything within the [foo][/bar] string, I get more than I want. What complicates matters is that the meat between the [foo][/bar] wrappers can be any character, but I only want to extract the meat that contains numbers with commas and some words, NOT the meat with words only.

I know this is a very long post, but this problem has overwhelmed my psyche. Ahhh! I think writing it out helped me really understand the problem. I'd appreciate any suggestions. Thanks for reading.

Best,
Paul