
pebreo
Novice
May 27, 2005, 12:44 PM
Post #1 of 7
(2305 views)
quantified atoms
Interesting subject, huh? I don't know of any other scripting language that is so eclectic with its nomenclature. Well, anyway, here's my question. I have an HTML file containing tables, some with dozens of rows and others with only a few. I want to extract only those tables that have a minimum number of rows. Here's a sample of the HTML source:
# TABLE WITH 3 ROWS
<table>
<tr> 1 row </tr>
<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>

# TABLE WITH 1 ROW
<table>
<tr> 1 row </tr>
</table>

I am able to extract the tables with no problem using a pattern that matches all tables. But I am unable to tell the regex that I want it to match only when <tr>.+</tr> is encountered three or more times.
# this works:
# this one extracts tables with rows no matter how many <tr> </tr> tags are in it
$string =~ m|(<table[^>]*>.+?<tr[^>]*>.+?</tr[^>]*>.+?</table[^>]*>)|sig;

# this doesn't work:
# why doesn't this one work? can't i use (?:<tr>.+? </tr>){3,}
# i even tried (?=<tr>.+? </tr>){3,}
$string =~ m|(<table[^>]*>.+?(?:<tr[^>]*>.+?</tr[^>]*>){3,}.+?</table[^>]*>)|sig;

Why doesn't the regex recognize that quantified atom of (<tr> </tr>)? I'm cringing at the thought that I might have to write some recursive regex algorithm. I'd appreciate any ideas. Thanks for reading.

pebreo
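For anyone who wants to try this, here is a minimal, self-contained sketch that runs both patterns against the sample and counts the matches. It is only an illustration: the variable names ($any_rows, $three_rows, @all, @three) are made up for this example, and the line breaks in the sample are an assumption, since the original source layout is not shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input from the post. The exact line breaks are an assumption;
# what matters is that whitespace separates consecutive rows.
my $string = <<'HTML';
# TABLE WITH 3 ROWS
<table>
<tr> 1 row </tr>
<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>
# TABLE WITH 1 ROW
<table>
<tr> 1 row </tr>
</table>
HTML

# Pattern 1: any table that contains at least one row.
my $any_rows = qr|(<table[^>]*>.+?<tr[^>]*>.+?</tr[^>]*>.+?</table[^>]*>)|si;

# Pattern 2: the attempt at "three or more rows", unchanged from the post.
my $three_rows = qr|(<table[^>]*>.+?(?:<tr[^>]*>.+?</tr[^>]*>){3,}.+?</table[^>]*>)|si;

# In list context, /g returns every captured table.
my @all   = $string =~ /$any_rows/g;
my @three = $string =~ /$three_rows/g;

printf "pattern 1 matched %d table(s)\n", scalar @all;
printf "pattern 2 matched %d table(s)\n", scalar @three;
```

On this sample the first pattern captures both tables and the second captures none. One thing the sketch makes easy to check is whether the whitespace between consecutive rows matters, since the quantified group as written expects each <tr> to begin immediately after the previous </tr>.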