
langqinren
New User
Jun 27, 2009, 1:16 PM
Post #1 of 2
(1212 views)
Perl String Match Problem
I am writing a parser to extract information from web pages, for example http://forums.sun.com/thread.jspa?messageID=10247372#10247372 (or see the attachment). I am trying to extract the post content of this page, so I read the page source into $page_src and use a regex match to pull out the corresponding portion. I identified the start tag, <tr class="white">, and the end tag, <td><div class="pad5x10"> </div></td> </tr>, of the post content. (.|\n)*? matches any characters, including newlines. My code works for other pages but fails on the page linked above (or attached). Can anyone point out the problem in my code? I would really appreciate it! Here is my code to do the match:
while ($page_src =~ /<tr class=\"white\">((.|\n)*?)<td><div class=\"pad5x10\"> <\/div><\/td>\s+<\/tr>/g) {
    my $match_str = $1;
    print $match_str . "\n";
}
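For comparison, here is a minimal self-contained sketch of the same extraction that can be run without the real page. The sample HTML and the helper name extract_posts are made up for illustration. It swaps the (.|\n)*? group for .*? with the /s modifier, which also lets "." match newlines, and it relaxes the whitespace between the closing tags to \s*. Whether either change explains the failure on the linked page is an assumption, not something confirmed here; a quantified group like (.|\n)*? may run into regex engine limits on long inputs in some Perl versions, so this is one thing to try.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for $page_src; the real page is not reproduced here.
my $page_src = <<'HTML';
<table>
<tr class="white">
<td>First post,
spanning two lines</td>
<td><div class="pad5x10"> </div></td>
</tr>
<tr class="white">
<td>Second post</td>
<td><div class="pad5x10"> </div></td>
</tr>
</table>
HTML

# extract_posts is a hypothetical helper name, not part of the original code.
sub extract_posts {
    my ($src) = @_;
    my @posts;
    # /s lets "." match newlines, replacing the (.|\n)*? group.
    # \s* tolerates variable whitespace between the tags
    # (an assumption about the markup).
    while ($src =~ m{<tr class="white">(.*?)<td><div class="pad5x10">\s*</div></td>\s*</tr>}gs) {
        push @posts, $1;
    }
    return @posts;
}

my @posts = extract_posts($page_src);
print "$_\n" for @posts;
```

Using m{...} as the delimiter avoids having to escape the slashes in the closing tags. If the real page has something other than plain whitespace inside the pad5x10 div, the pattern would need adjusting accordingly.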
(This post was edited by langqinren on Jun 27, 2009, 1:17 PM)