Why is this code not working

 



perl-doctor
Novice

Oct 23, 2011, 9:40 AM

Post #1 of 10 (685 views)
Why is this code not working


Code
#!/usr/bin/perl  
use strict;


open(FH, "<index.html") or die "cannot open file $!";
while(<FH>)
{
if (/\<a\shref=\"*\"/ =~ <FH>)
{
print \1;
}
}
close(FH);

This is supposed to open the file, which is on disk.
I just can't figure out why it is not working.


FishMonger
Veteran / Moderator

Oct 23, 2011, 1:39 PM

Post #2 of 10 (670 views)
Re: [perl-doctor] Why is this code not working

"is not working" is a poor problem statement.

In what way is it not working?

Are you receiving the "cannot open file" error message? If so, what is the output of $!?

I see several problems with the while loop.
1) The regex should be bound to the $_ var, not <FH>.
2) \1 is used inside a regex, not afterwards. You probably meant $1.
3) Since you didn't use any capturing parens, $1 won't contain the value you're wanting to capture.
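
A minimal sketch of those three points, assuming a hypothetical in-memory line instead of the file from the original post (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical sample line; the original script reads its lines from index.html.
my $line = '<a href="page2.html">Next</a>';

# 1) Bind the regex to the variable that holds the text. Here that is $line;
#    inside while (<FH>) it is $_, and the binding is implicit.
# 3) Parentheses mark the part of the match to capture.
my $href;
if ( $line =~ /<a\shref="([^"]*)"/ ) {
    # 2) After a successful match, the first capture is available as $1.
    $href = $1;
    print "$href\n";
}

# \1 belongs inside a pattern, as a backreference to the first capture.
# Here it matches the doubled "t" in "letter".
my $has_double = 'letter' =~ /(\w)\1/ ? 1 : 0;
print "doubled letter found\n" if $has_double;
```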

What are you wanting to accomplish?


perl-doctor
Novice

Oct 23, 2011, 1:49 PM

Post #3 of 10 (668 views)
Re: [FishMonger] Why is this code not working

Okay, the code itself runs fine. My disappointment is that it runs, and probably does what I ask it to, but it does not return any HTML <a href=...> tags, which is what I want. I thought it would be very simple, but I cannot seem to wrap my head around the regex thingy.


FishMonger
Veteran / Moderator

Oct 23, 2011, 2:07 PM

Post #4 of 10 (664 views)
Re: [perl-doctor] Why is this code not working

Don't use a regex for parsing HTML. Use one of the HTML parsers on CPAN.

HTML::LinkExtor - Extract links from an HTML document
http://search.cpan.org/~gaas/HTML-Parser-3.69/lib/HTML/LinkExtor.pm

HTML::LinkExtractor - Extract links from an HTML document
http://search.cpan.org/~podmaster/HTML-LinkExtractor-0.13/LinkExtractor.pm
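
As a rough sketch of the first option, the following assumes HTML::LinkExtor is installed from CPAN (it is not part of core Perl) and uses a hypothetical in-memory document in place of index.html:

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # part of the HTML-Parser distribution on CPAN

# Hypothetical sample markup standing in for the contents of index.html.
my $html = <<'HTML';
<html><body>
<a href="http://example.com/one">One</a>
<A HREF='two.html'>Two</A> <img src="logo.png">
</body></html>
HTML

# With no callback, the parser collects the links internally.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each entry is an array ref: [ tag, attribute => url, ... ].
my @hrefs;
for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;
    next unless $tag eq 'a' && defined $attr{href};
    push @hrefs, $attr{href};
}
print "$_\n" for @hrefs;
```

To read from disk instead, $parser->parse_file('index.html') can replace the parse and eof calls. Unlike the regex, the parser copes with uppercase tags, single-quoted attributes, and links that span several lines.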


perl-doctor
Novice

Oct 23, 2011, 2:12 PM

Post #5 of 10 (662 views)
Re: [FishMonger] Why is this code not working

Well, I will probably do that. I just feel irritated that I can't figure it out. It is probably going to be completely impractical to use regex for HTML anyway.
However, I would still want to know how to do it with a regex, even though I am not going to use it.


FishMonger
Veteran / Moderator

Oct 23, 2011, 3:04 PM

Post #6 of 10 (658 views)
Re: [perl-doctor] Why is this code not working


Code
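# Three-argument open with a lexical filehandle
open my $fh, '<', 'index.html' or die "cannot open index.html: $!";
while (<$fh>) {
    # Capture the href value; this finds at most one link per line
    if ( /<a href="(.*?)"/i ) {
        print "$1\n";
    }
}
close $fh;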
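
The loop above reports at most one link per line, because a plain match stops at the first hit. One possible extension, sketched here with a hypothetical in-memory filehandle in place of index.html, adds the /g modifier and an inner while loop so that every href on a line is visited:

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for index.html.
my $html = qq{<p><a href="a.html">A</a> and <A HREF="b.html">B</A></p>\n}
         . qq{<a  href="c.html">C</a>\n};
open my $fh, '<', \$html or die "cannot open in-memory file: $!";

my @found;
while ( my $line = <$fh> ) {
    # With /g in scalar context, each pass resumes where the previous match
    # ended, so all links on the line are seen, not just the first one.
    while ( $line =~ /<a\s+href="(.*?)"/gi ) {
        push @found, $1;
    }
}
close $fh;

print "$_\n" for @found;
```

This is still a regex sketch, so it shares the limits of the original: it misses single-quoted attributes, extra attributes before href, and tags split across lines.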



perl-doctor
Novice

Oct 23, 2011, 3:22 PM

Post #7 of 10 (657 views)
Re: [FishMonger] Why is this code not working

Okay, thanks. I can understand that by myself; there is a BIG difference between understanding what you read and creating it yourself :D
Thanks.


perl-doctor
Novice

Oct 24, 2011, 10:41 AM

Post #8 of 10 (632 views)
Re: [perl-doctor] Why is this code not working


Code
#!/usr/bin/perl  
use strict;
use 5.010;

open my $fh, '<', 'index.html' or die "cannot open index.html $!";
while(<$fh>) {
if ( /<a href="http:\/\/(.*?[^"])">(.*?)<\/a>/i ) {
print $1, " - ", $2;
print "\n";
while()
{
if ( /<a href="http:\/\/(.*?[^"])">(.*?)<\/a>/i =~ $1) {#SHOULDN'T THIS MATCH $3 AND $4
print $3, " - ", $4;
print "\n";
}
}
}
}
close $fh;


That is the code I have now. The problem is that I want my script to continue down to the second level, but the first link on the page is a link back to the page itself, so it goes into an infinite loop. How do I break out of that?

Greetings perl-doctor


FishMonger
Veteran / Moderator

Oct 24, 2011, 11:15 AM

Post #9 of 10 (629 views)
Re: [perl-doctor] Why is this code not working


Quote
if ( /<a href="http:\/\/(.*?[^"])">(.*?)<\/a>/i =~ $1) {#SHOULDN'T THIS MATCH $3 AND $4

Why would you think that? You only have 2 sets of capturing parens, so only $1 and $2 will be defined.
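
A short sketch of how the numbering behaves, assuming a hypothetical line with two links and a slightly simplified version of the quoted pattern:

```perl
use strict;
use warnings;

# Hypothetical line containing two links.
my $line = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>';
my $re   = qr{<a href="http://([^"]+)">(.*?)</a>}i;

# A single successful match fills only $1 and $2, because the pattern has
# two capture groups. $3 stays undefined however many links the line holds.
my $third_defined = 0;
if ( $line =~ $re ) {
    $third_defined = defined $3 ? 1 : 0;
    print "$1 - $2\n";
}

# In list context, /g returns the captures of every match, flattened:
# (host1, text1, host2, text2, ...).
my @captures = $line =~ /$re/g;
for ( my $i = 0; $i < @captures; $i += 2 ) {
    print "$captures[$i] - $captures[$i + 1]\n";
}
```

The second link is therefore reached by matching again, not by asking for higher-numbered variables.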

Using regexes to accomplish what you want is a really poor approach.

Use WWW::Mechanize - Handy web browsing in a Perl object.
http://search.cpan.org/~jesse/WWW-Mechanize-1.70/lib/WWW/Mechanize.pm
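
Whichever tool fetches the pages, a link that points back to an already visited page will loop forever unless the visited pages are remembered. A minimal sketch of that bookkeeping follows. It does not use WWW::Mechanize or the network; the %links_on hash is a hypothetical stand-in for "the links found on each page":

```perl
use strict;
use warnings;

# Hypothetical link graph: page => links found on that page.
# index.html links back to itself, as described in the question.
my %links_on = (
    'index.html' => [ 'index.html', 'a.html', 'b.html' ],
    'a.html'     => [ 'index.html', 'c.html' ],
    'b.html'     => [ 'a.html' ],
    'c.html'     => [],
);

my %seen  = ( 'index.html' => 1 );   # pages already queued or visited
my @queue = ('index.html');          # pages waiting to be processed
my @order;                           # order in which pages were visited

while ( defined( my $page = shift @queue ) ) {
    push @order, $page;
    for my $link ( @{ $links_on{$page} || [] } ) {
        next if $seen{$link}++;      # skip anything met before, self-links included
        push @queue, $link;
    }
}

print join( ' ', @order ), "\n";
```

In a real spider the lookup in %links_on would be replaced by fetching the page and extracting its links, and the URLs would need normalizing so that two spellings of the same address count as one page.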


perl-doctor
Novice

Oct 24, 2011, 11:40 AM

Post #10 of 10 (625 views)
Re: [FishMonger] Why is this code not working

I know regexes are a bad idea; I just want to do it, even if it means a lot of late nights figuring out a lot of weird details first.
I am trying to make a spider of sorts, but for a smaller set of sites chosen by interest, so that it does not have to cover thousands of websites.
Perhaps I will try WWW::Mechanize, but that is something I have to install through CPAN, right?

 
 

