quantified atoms

 



pebreo
Novice

May 27, 2005, 12:44 PM

Post #1 of 7 (5043 views)
quantified atoms

 
Interesting subject, huh? I don't know of any other scripting language that is so eclectic with its nomenclature.
Well, anyway, here's my question.

I have an html file with tables, some having dozens of rows and others only a few. I want to extract only those tables with at least a minimum number of rows. Here's a sample of the html source:


Code
# TABLE WITH 3 ROWS 
<table>
<tr> 1 row </tr>
<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>

# TABLE WITH 1 ROW
<table>
<tr> 1 row </tr>
</table>



I am able to extract the tables with no problem using a pattern that matches all tables. But I am unable to tell the regex that I want it to match only when <tr>.+</tr> is encountered three or more times.


Code
# this works:
# it extracts every table that has rows, no matter how many <tr> </tr> pairs are in it
$string =~ m|(<table[^>]*>.+?<tr[^>]*>.+?</tr[^>]*>.+?</table[^>]*>)|sig;

# this doesn't work:
# why doesn't this one work? can't I use (?:<tr>.+?</tr>){3,}
# I even tried (?=<tr>.+?</tr>){3,}
$string =~ m|(<table[^>]*>.+?(?:<tr[^>]*>.+?</tr[^>]*>){3,}.+?</table[^>]*>)|sig;


Why doesn't the regex recognize the quantified atom (?:<tr>.+?</tr>){3,}?
I'm cringing at the thought that I might have to do some recursive regex algorithm. I'd appreciate any ideas.


Thanks for reading.
pebreo
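
A likely reason the {3,} version fails on the sample above: the quantified group can only repeat when each </tr> is followed immediately by the next <tr>, and in the sample a newline sits between the rows. A minimal sketch of one way around that, assuming rows are separated only by whitespace and tables are not nested (the names $row and $table are illustrative, not from the post):

```perl
use strict;
use warnings;

my $html = <<'HTML';
<table>
<tr> 1 row </tr>
<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>

<table>
<tr> 1 row </tr>
</table>
HTML

# One row: an opening <tr>, then any characters that do not start another
# row or table tag, then the closing </tr>.
my $row = qr{<tr\b[^>]*>(?:(?!</?tr\b|</?table\b).)*</tr\s*>}si;

# One table holding three or more rows. The \s* after $row lets the
# quantified group repeat across the newlines between rows.
my $table = qr{<table\b[^>]*>\s*(?:$row\s*){3,}</table\s*>}si;

my @matches = $html =~ /($table)/g;
print scalar(@matches), " table(s) matched\n";
```

This is still pattern matching on markup, so it would miss tables that wrap their rows in other tags such as <tbody>.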


KevinR
Veteran


May 27, 2005, 2:04 PM

Post #2 of 7 (5041 views)
Re: [pebreo] quantified atoms

sounds like a job better suited to an html-aware module like HTML::TokeParser. Are there no data cells in the table rows?

<td>content</td>

if I were to try this without a parser I might try something like this (assuming the html is written as consistently as what you posted, which is rarely the case with html):


Code
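#!perl
use strict;
use warnings;

my @AofA;        # one array ref of row contents per table
my $i    = -1;   # index of the current table
my $flag = 0;    # true while inside <table> ... </table>
while (<DATA>) {
    chomp;
    ($i++, $flag = 1) if index($_, '<table') > -1;
    if ($flag == 1) {
        push @{ $AofA[$i] }, $1 while /<tr>(.*?)<\/tr>/ig;
    }
    $flag = 0 if index($_, '</table>') > -1;
}
for (@AofA) {
    # print only the tables with three or more rows
    print "@{$_}\n" if scalar(@{$_}) >= 3;
}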

__DATA__
<table>
<tr> 1 row </tr><tr> 2 rows </tr><tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>
<table>
<tr> 1 row </tr>
<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>


like many perl coders are fond of repeating: TIMTOW (short for TIMTOWTDI, "There's More Than One Way To Do It")
-------------------------------------------------


(This post was edited by KevinR on May 27, 2005, 6:21 PM)


pebreo
Novice

May 28, 2005, 10:51 AM

Post #3 of 7 (5028 views)
Re: [KevinR] quantified atoms

 
"TIMTOW" I think that's true. I like that!
Thanks for the idea. I ended up taking each matched string and doing another match on it, like this:



Code
foreach ($string =~ m|(<table[^>]*>.+?(?=<tr[^>]*>.+?</tr[^>]*>).+?</table[^>]*>)|sig)
{
    $nstring = $_;
    # print $nstring;
    print "------- new string ----\n";
    if ($nstring =~ m|.*(.*<tr[^>]*>.*?</tr[^>]*>.*){5,}?.*|sig)
    {
        print $nstring;
    }

    print "---- end string -----\n";
}


Thanks for your reply.


KevinR
Veteran


May 28, 2005, 11:23 AM

Post #4 of 7 (5026 views)
Re: [pebreo] quantified atoms

that's an unusual way to use "foreach"; maybe you meant to use "if". "foreach" is generally for looping through a list, and "if" is for a single string.
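
On the "foreach" point, a short sketch of how context changes what m//g returns. The sample string and variable names are made up for illustration. In list context the match returns every capture at once, so "foreach" sees one element per match, while "if" only tests for a single match:

```perl
use strict;
use warnings;

my $string = "<table>a</table><table>b</table>";

# List context: m//g returns all captures, so foreach runs once per match.
my @seen;
foreach my $table ($string =~ m|(<table>.*?</table>)|sig) {
    push @seen, $table;
}
print scalar(@seen), " matches via foreach\n";

# Boolean context without /g: if only asks whether there is a match.
if ($string =~ m|(<table>.*?</table>)|si) {
    print "first match via if: $1\n";
}

# Scalar context with /g: while steps through the matches one at a time.
my $count = 0;
$count++ while $string =~ m|<table>|ig;
print "$count matches via while\n";
```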


How are you feeding data into that code block?
-------------------------------------------------


pebreo
Novice

May 28, 2005, 11:34 AM

Post #5 of 7 (5025 views)
Re: [KevinR] quantified atoms

 
I put the file into one long string:

Code
open(INPUT, "<", $filename) or die "Cannot open $filename: $!";
undef $/;              # slurp mode: read the whole file in one go
my $string = <INPUT>;
$/ = "\n";             # restore for normal behaviour later in script
close INPUT;
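
A common alternative to resetting $/ by hand is to localize it, so the previous value comes back automatically when the enclosing scope ends. A sketch, assuming a hypothetical helper named slurp and a temporary file created only so the example runs on its own:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns the whole file as one string.
sub slurp {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    local $/;                 # undefined only until this sub returns
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Demo file so the sketch is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "<table>\n<tr> 1 row </tr>\n</table>\n";
close $tmp_fh;

my $string = slurp($tmp_name);
print length($string), " characters read\n";
print "line mode restored\n" if defined $/ && $/ eq "\n";
```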



KevinR
Veteran


May 28, 2005, 11:59 AM

Post #6 of 7 (5021 views)
Re: [pebreo] quantified atoms

hmm... doesn't work for me:


Code
#!perl   
use strict;
use warnings;
use CGI qw/:standard/;
print header;
print '<plaintext>';
undef $/;
my $string = <DATA>;
print "$string\n\n\n";
my $nstring;
foreach ($string =~ m|(<table[^>]*>.+?(?=<tr[^>]*>.+?</tr[^>]*>).+?</table[^>]*>)|sig)
{
    $nstring = $_;
    # print $nstring;
    print "------- new string ----\n";
    if ($nstring =~ m|.*(.*<tr[^>]*>.*?</tr[^>]*>.*){5,}?.*|sig)
    {
        print $nstring;
    }

    print "---- end string -----\n";
}

__DATA__
<table>
<tr> 1 row </tr><tr> 2 rows </tr><tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>
<table>
<tr> 1 row </tr>
<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>


prints:

Code
<table>   
<tr> 1 row </tr><tr> 2 rows </tr><tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>
<table>
<tr> 1 row </tr>

<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>



------- new string ----
---- end string -----
------- new string ----
---- end string -----
------- new string ----
---- end string -----
------- new string ----
---- end string -----

-------------------------------------------------


pebreo
Novice

May 30, 2005, 8:21 AM

Post #7 of 7 (4993 views)
Re: [KevinR] quantified atoms

Right, because the quantifier is {5,}. I had to mess around with the quantifier to get it to work the way I wanted.
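
To see the effect of the minimum on the four-table sample from post #6, here is a sketch that splits out each table and counts its rows instead of quantifying a group. It assumes the tables are not nested, and tables_with_min_rows is a hypothetical helper name. With a minimum of 5 nothing qualifies, since no sample table has more than three rows:

```perl
use strict;
use warnings;

my $html = <<'HTML';
<table>
<tr> 1 row </tr><tr> 2 rows </tr><tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>
<table>
<tr> 1 row </tr>
<tr> 2 rows </tr>
<tr> 3 rows </tr>
</table>
<table>
<tr> 1 row </tr>
</table>
HTML

# Hypothetical helper: returns the tables that have at least $min rows.
sub tables_with_min_rows {
    my ($text, $min) = @_;
    my @keep;
    while ($text =~ m{(<table\b[^>]*>.*?</table\s*>)}sig) {
        my $table = $1;
        # "() = ..." forces list context; the outer assignment yields the count.
        my $rows = () = $table =~ m{<tr\b[^>]*>}ig;
        push @keep, $table if $rows >= $min;
    }
    return @keep;
}

for my $min (5, 3) {
    my @found = tables_with_min_rows($html, $min);
    print "minimum $min: ", scalar(@found), " table(s)\n";
}
```

Counting the opening <tr> tags sidesteps the question of what may sit between one row and the next.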

 
 

