having trouble matching from html

 



dante
New User

Jul 4, 2009, 4:10 PM

Post #1 of 4 (3196 views)
having trouble matching from html

I'm new to Perl. I was able to use regular expressions to match and extract some of the information I wanted, but for other parts the match just doesn't seem to work, and for the life of me I cannot find the problem.

Code
<h2 class="title"><a target="_self" class="usg-AFQjCNGt6xjO2z3eqMAvpRbEgFn6NFqeKA sig2-of_mxHbBLzr0HLDwOeNcuA" href="http://www.google.com">title of the article</a></h2>

The article titles are marked by the h2 tags, and I also want the URL of each article, which I replaced with google.com for this example.

The code I'm trying to use right now is:

Code
$content =~ m/<h2 class="title"><a target="_self" class=".*" href="http:\/\/www\.(.*)">(.*)<\/a>/;


The goal is to capture both the title and the URL of the article. I have a feeling it is probably just a stupid mistake that I can't find, because I have an equally complex match that works just fine.


dante
New User

Jul 4, 2009, 6:39 PM

Post #2 of 4 (3177 views)
Re: [dante] having trouble matching from html

From trial and error, it seems that the underscore in target="_self" is what causes the problem. I don't understand why, since an underscore doesn't need to be escaped and is treated as an ordinary character.
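
One way to narrow down this kind of failure is to grow the pattern piece by piece and see where it stops matching. The sketch below is only an illustration: the helper name `first_failing_piece` and the way the pattern is cut into pieces are made up here, and `$content` is the sample tag from the first post rather than real downloaded HTML.

```perl
use strict;
use warnings;

# Return the index of the first piece at which the cumulative pattern
# stops matching $content, or -1 if the whole pattern matches.
sub first_failing_piece {
    my ($content, @pieces) = @_;
    my $so_far = '';
    for my $i (0 .. $#pieces) {
        $so_far .= $pieces[$i];
        return $i unless $content =~ /$so_far/;
    }
    return -1;
}

# Sample tag copied from the first post (not the HTML the script downloads).
my $content = '<h2 class="title"><a target="_self" class="usg-AFQjCNGt6xjO2z3eqMAvpRbEgFn6NFqeKA sig2-of_mxHbBLzr0HLDwOeNcuA" href="http://www.google.com">title of the article</a></h2>';

# The pattern from the first post, cut into pieces (the cut points are arbitrary).
my @pieces = (
    '<h2 class="title">',
    '<a target="_self"',
    ' class=".*"',
    ' href="http://www\.(.*)">',
    '(.*)</a>',
);

my $failed = first_failing_piece($content, @pieces);
print $failed < 0
    ? "whole pattern matches\n"
    : "stops matching at piece $failed: $pieces[$failed]\n";
```

Against the sample tag the whole pattern matches, underscore included. If the same check reports a failing piece when `$content` holds the text the script really fetched, that piece is the place to compare against the fetched text character by character.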


dante
New User

Jul 4, 2009, 6:43 PM

Post #3 of 4 (3176 views)
Re: [dante] having trouble matching from html

I think I might have figured it out. It looks like I was checking my pattern against the page source rather than against what my script actually grabs, which has <a target="_blank" instead, I think.

I'll let you know if I solve this so the thread can be closed.
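
If the fetched HTML carries a different target value than the page source, one option is to not hard-code that attribute at all. The following is a minimal sketch, not a definitive solution: the sub name `extract_articles` and the shortened `class` value in the sample are made up, and it assumes every title link sits directly inside `<h2 class="title">` and that no attribute value contains `>`. Unlike the original pattern, it captures the full URL including `http://www.`.

```perl
use strict;
use warnings;

# Return a list of [url, title] pairs found in $html.
sub extract_articles {
    my ($html) = @_;
    my @found;
    # [^>]* skips any other attributes, so target and class can hold anything;
    # [^"]* and [^<]* stop at the end of the URL and of the link text.
    while ($html =~ m{<h2 class="title"><a\b[^>]*\bhref="([^"]*)"[^>]*>([^<]*)</a>}g) {
        push @found, [$1, $2];
    }
    return @found;
}

# Hypothetical fetched content with target="_blank" and a placeholder class.
my $content = '<h2 class="title"><a target="_blank" class="x" href="http://www.google.com">title of the article</a></h2>';

for my $article (extract_articles($content)) {
    my ($url, $title) = @$article;
    print "$url | $title\n";
}
```

The negated character classes replace the greedy `.*`, which would otherwise run to the last `">` or `</a>` in the page when the content contains several articles. For HTML that is less regular than this, a real HTML parser is likely to be more robust than any regular expression.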


ichi
User

Jul 4, 2009, 10:19 PM

Post #4 of 4 (3170 views)
Re: [dante] having trouble matching from html

Here's a simpler way: split the string into fields.

Code
use strict;
use warnings;

my $string = '<h2 class="title"><a target="_self" class="usg-AFQjCNGt6xjO2z3eqMAvpRbEgFn6NFqeKA sig2-of_mxHbBLzr0HLDwOeNcuA" href="http://www.google.com">title of the article</a></h2>';
my @s = split /href="/, $string;     # split on href="; the last field starts with the URL
my @fields = split /">/, $s[-1];     # split that field on ">; the first piece is the URL
print "$fields[0]\n";

output

Code
# ./test.pl 
http://www.google.com
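
The snippet above prints only the URL, while the original question also asked for the title. A sketch of the same field-splitting idea extended to return both might look like this. It assumes one link per string, and the sub name `url_and_title` and the shortened `class` value in the sample are made up for illustration.

```perl
use strict;
use warnings;

# Split-based extraction of the URL and link text from a single anchor tag.
# Returns an empty list when no href=" or no closing "> is found.
sub url_and_title {
    my ($string) = @_;
    my @around_href = split /href="/, $string, 2;        # text before / after the first href="
    return unless @around_href == 2;
    my ($url, $rest) = split /">/, $around_href[1], 2;   # URL, then everything after ">
    return unless defined $rest;
    my ($title) = split /<\/a>/, $rest, 2;               # link text up to the closing </a>
    return ($url, $title);
}

my $string = '<h2 class="title"><a target="_self" class="x" href="http://www.google.com">title of the article</a></h2>';
my ($url, $title) = url_and_title($string);
print "$url\n$title\n";
```

The limit of 2 on each split keeps the remainder of the string in one piece, so later occurrences of `">` in the text do not change the result.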


 
 

