need help: program retrieving html

 



limner
Novice

Feb 18, 2014, 11:04 AM

Post #1 of 19 (3020 views)
need help: program retrieving html

Hi,

I'm writing a small Perl program to fetch the HTML source of pages so that I can check them.

I wrote this:

#!/usr/local/bin/perl

use HTTP::Cookies;
use LWP::UserAgent;

$ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new(file => "cookies.txt", autosave => 1));
#$ua->agent("Mozilla 5.0");


$url=" site address";
$v1="2014-03-02";
$v2="2014-03-03";
$req = HTTP::Request->new(GET => $url, $checkin => $v1, $checkout => $v2);
$req->header('Accept' => 'text/html');


$res = $ua->request($req);


if ($res->is_success)
{
$filename="dati_html_test.txt";
open MYFILE, ">:utf8", $filename;
print MYFILE $res->decoded_content; # or whatever
close (MYFILE);
}
else
{
print "Error: " . $res->status_line . "\n";
}


These are the problems I have:
1) The website seems to think that I'm a bot: I'm unable to set a proper user agent.

2) The URL parameters don't seem to be understood by the web server, because it always answers with an HTML page as if I hadn't submitted any parameters.


Any help?


Zhris
Enthusiast

Feb 18, 2014, 2:02 PM

Post #2 of 19 (3006 views)
Re: [limner] need help: program retrieving html

Hi,

It's really difficult to say definitively what could be causing your issues without knowing the URL you are trying to hit.

1) You are a bot. Perhaps try setting the agent to your own legitimate user agent: http://whatsmyuseragent.com/.

2) It looks as though you are trying to provide URL parameters while instantiating an HTTP::Request object. According to the documentation, you can't provide URL parameters this way (the constructor accepts method, uri, header, content). Append the parameters to the URL directly, or use the URI module's query_form method.
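
For illustration, the query_form approach could look roughly like the sketch below. The address is a placeholder (an assumption, since the real URL was not given at this point), and the parameter names checkin and checkout are taken from the variable names in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Placeholder address (an assumption); substitute the page you are fetching.
my $uri = URI->new('http://www.example.com/hotel.html');

# query_form escapes the values and appends them as ?key=value&key=value.
# Passing a list keeps the parameters in the order given.
$uri->query_form(
    checkin  => '2014-03-02',
    checkout => '2014-03-03',
);

print $uri->as_string, "\n";
```

The resulting URI object (or its string form) can then be handed to HTTP::Request->new( GET => $uri ) in place of the bare address.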

I advise testing your code on one of your own sites first to ensure it is working as expected, before attempting to adjust for another site, which may have restrictions in place.

Chris


(This post was edited by Zhris on Feb 18, 2014, 2:06 PM)


Kenosis
User

Feb 18, 2014, 2:22 PM

Post #3 of 19 (3000 views)
Re: [limner] need help: program retrieving html

Cross-posted at PerlMonks.


limner
Novice

Feb 18, 2014, 2:26 PM

Post #4 of 19 (2996 views)
Re: [Kenosis] need help: program retrieving html

Yes, I also asked on PerlMonks. Is that forbidden?

Anyway, I already tried the suggestion of putting the parameters directly in the URL, like this:

#!/usr/local/bin/perl


use HTTP::Cookies;
use LWP::UserAgent;

$ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new(file => "Booking_cookies.txt", autosave => 1));
#$ua->agent("Baidu spider-ads");


$url="site url?parameters";
$req = HTTP::Request->new(GET => $url);
$req->header('Accept' => 'text/html');


$res = $ua->request($req);


if ($res->is_success)
{
$filename="dati_html_test.txt";
open MYFILE, ">:utf8", $filename;
print MYFILE $res->decoded_content; # or whatever
close (MYFILE);
}
else
{
print "Error: " . $res->status_line . "\n";
}


But I always get the same result: a page without the parameters.

How can I set the correct user agent?
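
On the user agent question: in both scripts above the $ua->agent(...) line is commented out, so LWP sends its default "libwww-perl/..." identifier. A minimal sketch of setting it follows; the browser string is only an example (an assumption), so copy whatever your own browser actually sends.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Example browser string (an assumption); copy the one your own browser sends.
my $browser_agent =
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0';

# Set the User-Agent header at construction time ...
my $ua = LWP::UserAgent->new( agent => $browser_agent );

# ... or change it later on an existing object.
$ua->agent($browser_agent);

print "Requests will be sent as: ", $ua->agent, "\n";
```

Changing the header only affects how the request identifies itself. It does not make LWP behave like a browser in any other respect.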


(This post was edited by limner on Feb 20, 2014, 7:56 AM)


davido
New User

Feb 18, 2014, 2:44 PM

Post #5 of 19 (2990 views)
Re: [limner] need help: program retrieving html

Where, in your script, do you populate the variables used in this line of code?

$req = HTTP::Request->new(GET => $url, $checkin => $v1, $checkout => $v2);


(Hint: You haven't initialized $checkin and $checkout.)
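
As a side note, this kind of mistake is what "use strict" is meant to catch. The sketch below is not from the original script: it compiles a cut-down version of the offending expression inside a string eval, purely so that the compile error can be printed instead of stopping the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The offending expression, kept as a string so that this script can compile
# it separately and report the error instead of dying itself.
my $snippet = <<'END_CODE';
use strict;
my $v1 = "2014-03-02";
my @args = ( $checkin => $v1 );   # $checkin was never declared
1;
END_CODE

# eval returns undef and sets $@ when the snippet fails to compile.
my $ok = eval $snippet;
print "strict refused to compile it:\n$@" unless $ok;
```

With "use strict" at the top of the real script, the same message would appear as soon as the script is started, pointing at the line that uses $checkin.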


Zhris
Enthusiast

Feb 18, 2014, 2:58 PM

Post #6 of 19 (2986 views)
Re: [limner] need help: program retrieving html

UPDATE: scrap what I mentioned below, it doesn't return the correct content. Please see later posts.

Hi,

When I execute this simplified version, it appears to return the correct content?

plain text: http://test.massweb.co.uk/bookingsrequest.pl
html: http://test.massweb.co.uk/bookingsrequest.pl?html=1


Code
#!/usr/bin/perl 
use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://www.booking.com/hotel/it/sleeping-beauty.it.html?checkin=2014-03-02&checkout=2014-03-03';

my $ua = LWP::UserAgent->new;

my $req = HTTP::Request->new( GET => $url );

my $res = $ua->request( $req );

if ( $res->is_success )
{
print $res->decoded_content;
}
else
{
print "Error: " . $res->status_line . "\n";
}


Chris


(This post was edited by Zhris on Feb 18, 2014, 3:45 PM)


limner
Novice

Feb 18, 2014, 3:14 PM

Post #7 of 19 (2976 views)
Re: [Zhris] need help: program retrieving html

Hi Chris,

First of all, thank you for your help.

I saw your result: you got the same thing I did. Your HTML (or plain text) is what the request returns without the URL parameters (checkin and checkout).

This is what I obtained (like you): http://test.massweb.co.uk/bookingsrequest.pl?html=1

This is what I would like to get: http://www.booking.com/hotel/it/sleeping-beauty.it.html?checkin=2014-03-02&checkout=2014-03-03

If you look at the center of the page, where the rooms are shown, your example has only the list of rooms, while the real page shows the available rooms with prices and dates.

Sorry for my English, I hope I'm clear enough... and thanks again for the help!

(This post was edited by limner on Feb 18, 2014, 3:16 PM)


Zhris
Enthusiast

Feb 18, 2014, 3:15 PM

Post #8 of 19 (2975 views)
Re: [limner] need help: program retrieving html

It looks as though parts of the page are loaded with JavaScript / AJAX. LWP::UserAgent does not execute JavaScript, so that content never appears in the response it receives. One option might be to try WWW::Mechanize::Firefox. You could also look deeper into the source code and find the child URLs used to fetch the content you require.


(This post was edited by Zhris on Feb 18, 2014, 4:56 PM)


limner
Novice

Feb 18, 2014, 3:19 PM

Post #9 of 19 (2969 views)
Re: [Zhris] need help: program retrieving html

WWW::Mechanize::Firefox?

I don't know it... is this package available for ActivePerl?


Zhris
Enthusiast

Feb 18, 2014, 3:28 PM

Post #10 of 19 (2960 views)
Re: [limner] need help: program retrieving html

Is there a particular part of the webpage you need? For example, when I look in Firebug under the Net section, I can see that a client-side request is made to the following page http://www.booking.com/get_hotel_pricecalendar?hotel_id=404478&days=30&start_date=2014-03-02&currency=GBP&nights=1&lang=it&_=1392765818031 which returns some useful JSON data. There are tonnes of client-side requests made, and you could find ones that are useful to you.
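
To see which parts of such a child URL can be varied, the query string can be taken apart with the URI module. The sketch below only inspects and rebuilds the address quoted above; the replacement start date is an arbitrary value chosen for illustration, and whether the server still answers at this address is not something the sketch checks.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# The child URL seen in Firebug, copied from the post above.
my $uri = URI->new(
    'http://www.booking.com/get_hotel_pricecalendar?hotel_id=404478&days=30'
  . '&start_date=2014-03-02&currency=GBP&nights=1&lang=it&_=1392765818031'
);

# In list context query_form returns the key/value pairs of the query string.
my %params = $uri->query_form;
print "$_ = $params{$_}\n" for sort keys %params;

# Ask for a different start date (an arbitrary value chosen for illustration).
$params{start_date} = '2014-03-09';

# Work on a copy so the original URL stays untouched.
my $next = $uri->clone;
$next->query_form(%params);
print $next->as_string, "\n";
```

The rebuilt URL could then be fetched with LWP::UserAgent as in the earlier scripts, and the response body decoded with a JSON module such as JSON::PP.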


(This post was edited by Zhris on Feb 18, 2014, 3:30 PM)


limner
Novice

Feb 18, 2014, 3:36 PM

Post #11 of 19 (2952 views)
Re: [Zhris] need help: program retrieving html

Interesting. I'm looking for room name, price and date... maybe there is something else in Firebug that I could find. I'll try that together with WWW::Mechanize::Firefox; I will try them both!


Zhris
Enthusiast

Feb 18, 2014, 3:42 PM

Post #12 of 19 (2951 views)
Re: [limner] need help: program retrieving html

Don't bother with WWW::Mechanize::Firefox if you are going to fetch the data directly from the child URLs you discover that return static content; stick to LWP::UserAgent. Let us know how you get on. Good luck.


Kenosis
User

Feb 18, 2014, 5:12 PM

Post #13 of 19 (2923 views)
Re: [limner] need help: program retrieving html

In Reply To
Yes, I also asked on PerlMonks. Is that forbidden?

No, not at all, limner. However, it's considered polite to share that you've cross-posted, otherwise responders may invest time on crafting a solution for a problem that's already been solved elsewhere.


Laurent_R
Veteran / Moderator

Feb 19, 2014, 12:00 AM

Post #14 of 19 (2893 views)
Re: [limner] need help: program retrieving html


In Reply To
Yes, I also asked on PerlMonks. Is that forbidden?


No, of course not, but it is good to state it upfront to avoid duplicate work. If your post has already received a satisfactory answer somewhere else, no need to have people here spend time on a solved issue.


limner
Novice

Feb 19, 2014, 2:51 PM

Post #15 of 19 (2843 views)
Re: [Laurent_R] need help: program retrieving html

Hi all :-)

Sorry, I didn't mention that I had also asked on PerlMonks...
Anyway, I still haven't been able to solve my problem.

I tried your suggestions, but I was not able to make an HTTP call the way a browser does in order to get the same result.

It seems that LWP::UserAgent does not behave like a normal browser. Can you suggest another package that can get the HTML source of a page the way a browser does?

Thanks in advance to all
Limner

P.S.
Before I forget: I'm using Strawberry Perl.


(This post was edited by limner on Feb 19, 2014, 2:52 PM)


Zhris
Enthusiast

Feb 19, 2014, 10:09 PM

Post #16 of 19 (2811 views)
Re: [limner] need help: program retrieving html

Use WWW::Mechanize::Firefox which acts like Firefox.
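
A minimal sketch of what that could look like is below. It assumes the module is installed and that Firefox is running with the MozRepl add-on the module talks to. The helper name fetch_rendered_html is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the page source after Firefox has loaded it.
sub fetch_rendered_html {
    my ($url) = @_;

    # Loaded at run time so the script still compiles where the module is absent.
    require WWW::Mechanize::Firefox;

    # Connects to a running Firefox through the MozRepl add-on.
    my $mech = WWW::Mechanize::Firefox->new();
    $mech->get($url);

    # content() returns the HTML of the page as currently held by the browser.
    return $mech->content;
}

# Example call (needs Firefox with MozRepl running):
# print fetch_rendered_html('http://www.booking.com/hotel/it/sleeping-beauty.it.html?checkin=2014-03-02&checkout=2014-03-03');
```

Because the page is loaded in a real browser, scripts on it get a chance to run. Content that arrives some time after the page load may still need an extra wait before content() is called.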


limner
Novice

Feb 20, 2014, 2:47 AM

Post #17 of 19 (2794 views)
Re: [Zhris] need help: program retrieving html

Wow,
this is the first time I've had difficulty installing a Perl module... but after a couple of hours, I was able to install WWW::Mechanize::Firefox.

Now I will study this module and try to find the right way to use it for my purpose...

Thanks, I will let you (forum members) know how it goes.

Thanks in advance to all


P.S.
Using Strawberry Perl on Windows 7 x64.


(This post was edited by limner on Feb 20, 2014, 3:14 AM)


limner
Novice

Feb 20, 2014, 7:56 AM

Post #18 of 19 (2763 views)
Re: [limner] need help: program retrieving html

Hi all!

After less than 10 minutes I was able to achieve my goal ;-)
using WWW::Mechanize::Firefox.


Thanks again to the whole community, and especially to Zhris: thank you for your time!

Limner


Zhris
Enthusiast

Feb 22, 2014, 7:35 AM

Post #19 of 19 (2602 views)
Re: [limner] need help: program retrieving html

No problem, glad to have helped.

Chris

 
 

