Get html from other page

 



Gregorio
User

Sep 30, 2000, 7:42 AM

Post #1 of 13 (1060 views)

Is there a way to get() the HTML from another page on another site without having to match the information tag for tag? I only want to get one <TABLE></TABLE> from another site.


dws
Deleted

Sep 30, 2000, 9:01 AM

Post #2 of 13 (1060 views)

Since this is an Intermediate forum, here's an intermediate answer.

Break the problem in two. The first part is how to get a remote page. The easiest way is to use LWP::Simple (which you can get from CPAN, or via ActiveState's PPM if you're on Win32). The documentation in that module gives an example. It's only a few lines of code. And, not surprisingly, the function from LWP that you'll use is get().
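
A minimal sketch of that first part, assuming LWP::Simple is installed. The URL is a placeholder and fetch_html() is a hypothetical helper name, not something from the module; get() itself returns the page body as a string, or undef if the request fails.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_html() is a hypothetical wrapper around LWP::Simple's get().
# get() returns undef on failure, so check before using the result.
sub fetch_html {
    my ($url) = @_;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Only touch the network when a URL is given on the command line,
# e.g.  perl fetch.pl http://www.example.com/
if (@ARGV) {
    print fetch_html($ARGV[0]);
}
```

Checking for undef matters here: a failed fetch would otherwise look like an empty page, and every later match would quietly fail.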

The second part of the problem is extracting the table. There are a couple of approaches to this, depending on the shape of the table, and what you want to do with the table contents. You might be able to get by with a single regular expression, or you might need something a bit more complicated.

Solve the first part. Then, if you're unable to solve the second, post another message.

Good luck.


Gregorio
User

Sep 30, 2000, 9:06 AM

Post #3 of 13 (1060 views)

Actually, I've been using LWP::Simple. The problem is that I don't know how to get the <TABLE></TABLE> without matching every part. Is there a way to do this?


dws
Deleted

Sep 30, 2000, 11:06 AM

Post #4 of 13 (1060 views)

If it's the only table on the page,

code:
    $html =~ m#(<table>.*</table>)#

will extract it. Otherwise, you're going to have to figure out how to chip away at the page (divide and conquer) until only what you want is left.
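
One way to picture "chipping away", as a rough sketch only: cut the page down with index() and substr() before trying to match anything. The sample page and the "<!-- results -->" landmark are made up for illustration; a real page would need a landmark of its own.

```perl
use strict;
use warnings;

# Made-up sample page; the comment marker is a hypothetical landmark.
my $html = "<table><tr><td>menu</td></tr></table>\n"
         . "<!-- results -->\n"
         . "<table><tr><td>wanted</td></tr></table>\n"
         . "<table><tr><td>footer</td></tr></table>\n";

# Step 1: throw away everything before the landmark.
my $start = index($html, '<!-- results -->');
die "landmark not found\n" if $start < 0;
my $rest = substr($html, $start);

# Step 2: keep the text from the next <table through the first </table> after it.
my $open  = index($rest, '<table');
my $close = index($rest, '</table>', $open);
die "table not found\n" if $open < 0 || $close < 0;
my $table = substr($rest, $open, $close + length('</table>') - $open);

print "$table\n";
```

This only works when the wanted table has no tables nested inside it, since it stops at the first closing tag.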


Gregorio
User

Sep 30, 2000, 7:06 PM

Post #5 of 13 (1060 views)

I'm still not getting it. I set up this:

code:
    if ($html =~ m#(<table>.*</table>)#) {
        ($tablecode) = ($1);
    }

Is that right?

[This message has been edited by Gregorio (edited 09-30-2000).]


dws
Deleted

Sep 30, 2000, 8:30 PM

Post #6 of 13 (1060 views)

That depends. Is there only one table in the html? Are the table tags in lower case?



Gregorio
User

Oct 1, 2000, 1:28 PM

Post #7 of 13 (1060 views)

There is more than one table. The one I want starts with <table width=600 cellpadding=0 cellspacing=0 border=0>


dws
Deleted

Oct 2, 2000, 12:08 PM

Post #8 of 13 (1060 views)

Given that, how would you tailor a regular expression to extract just the table you want?

Take a shot at it, and we'll work from there.


Gregorio
User

Oct 2, 2000, 12:30 PM

Post #9 of 13 (1060 views)

I tried the code below:

code:
    if ($html =~ m#<table width=600 cellpadding=0 cellspacing=0 border=0>.*</table>#) {
        ($tablecode) = ($1);
    }

and I got nothing, so I tried an easier one, such as the title:

code:
    if ($html =~ m#<TITLE>.*</TITLE>#) {
        ($tablecode) = ($1);
    }

and still got nothing. I must be missing something here.


dws
Deleted

Oct 2, 2000, 1:18 PM

Post #10 of 13 (1060 views)

I assume that you've verified that $html does indeed contain html.

Part of your matching problem may be case dependence. Add an 'i' modifier, to make the match case independent, then try the <title> match again. If you can't make that one work, there's a deeper issue somewhere.
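
A minimal sketch of that <title> test, using a made-up $html string in place of a fetched page. Note that $1 is only set when the pattern contains capturing parentheses, so the sketch wraps the part to keep in ( ).

```perl
use strict;
use warnings;

# Made-up sample input, standing in for a fetched page.
my $html = '<HTML><HEAD><TITLE>Sample Page</TITLE></HEAD></HTML>';

my $title;
# The parentheses capture the text between the tags into $1.
# The i modifier lets <title>, <TITLE>, and <Title> all match.
if ($html =~ m#<title>(.*)</title>#i) {
    $title = $1;
}

print defined $title ? "Title: $title\n" : "No title found\n";
```

Without the i modifier this particular sample would not match, because the pattern is written in lower case and the tags are in upper case.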




Gregorio
User

Oct 5, 2000, 2:28 PM

Post #11 of 13 (1060 views)

Well, I tried the i modifier and unfortunately it didn't help, and I'm sure $html gets the page because it displays it. Are you sure that .* grabs HTML too? Do you know of an example script that does this?


dws
Deleted

Oct 6, 2000, 10:16 AM

Post #12 of 13 (1060 views)

Let's leave HTML out of this. It's incidental. The problem is that you have a chunk of text in a variable, and you're trying to use a regular expression to extract part of it. What you're extracting happens to be HTML, but so what? Regular expressions don't treat HTML any differently from any other text.

It's time to re-read perlre (available on-line via the command perldoc perlre). Since the chunk of text you're dealing with probably has embedded newlines, pay particular attention to the m and s modifiers, and how they influence the meaning of the '.' metacharacter. You will likely have to use one of these modifiers.
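
To make the effect of the s modifier concrete, here is a hedged sketch against a made-up three-table page (not the real site). It assumes the wanted table opens with exactly the tag quoted earlier in the thread and has no tables nested inside it.

```perl
use strict;
use warnings;

# Made-up sample page with three tables and embedded newlines.
my $html = <<'END_HTML';
<html><body>
<table border=1>
<tr><td>navigation</td></tr>
</table>
<TABLE width=600 cellpadding=0 cellspacing=0 border=0>
<tr><td>the data we want</td></tr>
</TABLE>
<table>
<tr><td>footer</td></tr>
</table>
</body></html>
END_HTML

my $tablecode;
# s lets '.' match newlines, i ignores tag case, and the
# non-greedy .*? stops at the first closing </table> tag.
# The outer parentheses are what put the match into $1.
if ($html =~ m#(<table width=600 cellpadding=0 cellspacing=0 border=0>.*?</table>)#si) {
    $tablecode = $1;
}

print defined $tablecode ? $tablecode : "no match", "\n";
```

Drop the s and the match fails on this input, because '.' stops at the newline after the opening tag. Use a greedy .* instead of .*? and the capture runs on through the footer table to the last closing tag on the page.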



Gregorio
User

Oct 6, 2000, 12:45 PM

Post #13 of 13 (1060 views)

Thanks for the help. I found a script that splits up pages and can extract chunks of HTML.

 
 

