LWP Browser->Get Challenge



rkellerjr
Novice

Aug 27, 2013, 7:33 AM


LWP Browser->Get Challenge

I convert data for a living and have not dealt with browsers or the internet directly in Perl. Recently, however, a client asked us to download their data directly from their secure website. This was something new (and exciting!) for me, so I did some research and wrote a program that has worked fairly well. The last time I ran it I received a certificate error, which I had never seen before. I researched that and added code to bypass it. The challenge I have not been able to resolve is this: the program no longer downloads the entire page of data. It gets maybe 90% - 95% of the page, then stops and moves on to the next page. The only difference I can think of is that I upgraded from ActiveState 5.10 to 5.16; I wouldn't think that would matter, but it might. If I use the URL directly in my browser (for any page of data), the entire page downloads just fine. I'm not sure what you guys might need to help out, but I need to be conscious of proprietary information.

Here is the major piece of code doing the work, with names changed to protect the innocent. :)

while ($more) {
    $page++;
    $url = "https://[server name is here]/[path information here]/$element/HAY/?page=$page";
    $filepage = "0" x (3 - length($page)) . $page;
    $response = $browser->get($url,':content_file' => $tempxml,);
    $file = "$output\\$element" . "_" . $filepage . ".xml";
    $response = $browser->get($url,':content_file' => $file,);
    die "Couldn't get $url\n" unless defined $response;
    $more = &check_tmp;
    unlink ("temp.xml");
    print "Completed $element page \($page\) file \($filepage\) \($more\) ...\n";
}

Because there is more than one page of data and I do not know which page is the last, I download each page to a temp.xml file and check whether it has data. If it does, I copy it to another location, delete temp.xml, and grab the next page, looping until no more page data is available.

To get past the certificate issue I added code...

$browser = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0,
SSL_verify_mode => SSL_VERIFY_NONE});

I also have browser credentials, etc. that work fine. So, any clue as to why I am no longer getting the entire page of XML data?

And thanks for your time folks!


FishMonger
Veteran / Moderator

Aug 27, 2013, 8:13 AM


Re: [rkellerjr] LWP Browser->Get Challenge

Maybe it's timing out or the filesize is greater than the configured max_size.

Have you checked what headers were assigned in $response?


FishMonger
Veteran / Moderator

Aug 27, 2013, 8:19 AM


Re: [rkellerjr] LWP Browser->Get Challenge

BTW, it appears that you're not using the strict pragma, which is a mistake. All Perl scripts you write should begin by loading these 2 pragmas.

Code
use strict; 
use warnings;

The strict pragma will require you to declare your vars, which is done with the 'my' or 'our' keywords. 99.5 percent of the time you'll want to use the 'my' keyword.
e.g.,

Code
my $url = "https://[server name is here]/[path information here]/$element/HAY/?page=$page";


Please use the code tags like I've done when posting your code.


(This post was edited by FishMonger on Aug 27, 2013, 8:20 AM)


rkellerjr
Novice

Aug 27, 2013, 9:50 AM


Re: [FishMonger] LWP Browser->Get Challenge

How do I print $response and get that information?

While I appreciate the education on Perl practices, I gave you just a snippet of a much larger program and the education isn't needed in that area.

For example, that snippet of code is within a subroutine and it starts...

sub get_page_data {
my ($element, $output, $pricecode) = @_;
my ($response, $url, $page, $more);


(This post was edited by rkellerjr on Aug 27, 2013, 9:51 AM)


FishMonger
Veteran / Moderator

Aug 27, 2013, 10:16 AM


Re: [rkellerjr] LWP Browser->Get Challenge

The get method you're using returns an HTTP::Response object and the next step would normally be to check the status of the response, which is done by doing something like this:

Code
if ($response->is_success) {
    # do something
}
else {
    # header() needs a field name; headers_as_string dumps all of them
    print $response->headers_as_string;
    die $response->status_line;
}


You may want to print the header within each of those blocks.

Specific info on which headers to look for can be found in the documentation of the modules used in these calls.

LWP http://search.cpan.org/~gaas/libwww-perl-6.05/lib/LWP.pm

LWP::UserAgent http://search.cpan.org/~gaas/libwww-perl-6.05/lib/LWP/UserAgent.pm

HTTP::Response http://search.cpan.org/~gaas/HTTP-Message-6.06/lib/HTTP/Response.pm

Additional info can be found in the links provided in the "SEE ALSO" section of the modules.


(This post was edited by FishMonger on Aug 27, 2013, 10:19 AM)


Laurent_R
Veteran / Moderator

Aug 27, 2013, 10:22 AM


Re: [rkellerjr] LWP Browser->Get Challenge

The reason you should use the strict and warnings pragmas is that they will help you find bugs and faulty constructs.

BTW, another point:


Code
&check_tmp;


That syntax for calling a subroutine is outdated and discouraged. It was superseded about 19 years ago, when Perl 5 came out, by this:


Code
check_tmp();



FishMonger
Veteran / Moderator

Aug 27, 2013, 10:33 AM


Re: [rkellerjr] LWP Browser->Get Challenge


Code
$filepage = "0" x (3 - length($page)) . $page;

Ouch!

It would be better to use the sprintf function.

Code
$filepage = sprintf("%03d", $page);


And

Code
$file = "$output\\$element" . "_" . $filepage . ".xml";

Would be better written as:

Code
$file = "$output/${element}_$filepage.xml";
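
Putting the two suggestions together, a minimal sketch might look like the following. The values assigned to $output, $element and $page are made up for illustration, and File::Spec->catfile (a core module that is not mentioned in the original post) is an assumption swapped in for the hand-written separator, so the same line should work on both Windows and Unix.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical values standing in for the script's real variables.
my ($output, $element, $page) = ('out', 'models', 7);

# Zero-pad the page number to three digits.
my $filepage = sprintf '%03d', $page;

# Let File::Spec pick the directory separator for the current platform.
my $file = File::Spec->catfile($output, "${element}_$filepage.xml");

print "$file\n";
```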


To add onto Laurent's suggestion, using & when calling a sub has certain side effects which you normally don't want. So, it's best not to use the & unless you really do need those side effects.

Additionally, vars should be declared in the smallest scope that they require. It appears that several of your vars are only needed within the while loop, so that's where they should be declared.
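
To illustrate the side effect of & mentioned above: calling a sub as &name; with no parentheses hands it the caller's current @_ instead of an empty argument list. The sub names below are hypothetical and only exist to make that visible.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how many arguments it received.
sub count_args {
    return scalar @_;
}

sub caller_sub {
    # Inside here @_ holds whatever caller_sub was called with.
    my $with_amp    = &count_args;     # no parens: reuses caller_sub's @_
    my $with_parens = count_args();    # explicit empty argument list
    return ($with_amp, $with_parens);
}

my ($amp, $parens) = caller_sub(1, 2, 3);
print "&count_args saw $amp argument(s); count_args() saw $parens\n";
```

So a sub like check_tmp, called as &check_tmp; from inside get_page_data, would silently receive $element, $output and $pricecode whether it wants them or not.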


rkellerjr
Novice

Aug 28, 2013, 7:38 AM


Re: [FishMonger] LWP Browser->Get Challenge

Here is one of the header responses. The only difference between them is the Content-Length of each file. I've checked all the downloaded files and none of them has the complete XML content. The last 5% or so is cut off, as I stated above.

200
Cache-Control: private
Date: Wed, 28 Aug 2013 14:28:35 GMT
Server: Microsoft-IIS/7.0
Content-Length: 102241
Content-Type: application/xml; charset=utf-8
Client-Aborted: die
Client-Date: Wed, 28 Aug 2013 14:28:34 GMT
Client-Peer: 195.10.226.140:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /CN=Emerald
Client-SSL-Cert-Subject: /CN=Emerald
Client-SSL-Cipher: AES256-SHA
Client-SSL-Socket-Class: IO::Socket::SSL
Client-SSL-Warning: Peer hostname match with certificate not verified
Set-Cookie: ASP.NET_SessionId=ljlcwpn5vqz05dr35qwbztq2; path=/; HttpOnly
X-AspNet-Version: 4.0.30319
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.
X-Powered-By: ASP.NET
Completed models page (12) file (012) (yes) ...

You also mentioned a file size "cap" of some sort, here are the file sizes...

40674
Completed models page (1) file (001) (yes) ...
38724
Completed models page (2) file (002) (yes) ...
38259
Completed models page (3) file (003) (yes) ...
48501
Completed models page (4) file (004) (yes) ...
50421
Completed models page (5) file (005) (yes) ...
54426
Completed models page (6) file (006) (yes) ...
43147
Completed models page (7) file (007) (yes) ...
71234
Completed models page (8) file (008) (yes) ...
62076
Completed models page (9) file (009) (yes) ...
48237
Completed models page (10) file (010) (yes) ...
94105
Completed models page (11) file (011) (yes) ...
102241
Completed models page (12) file (012) (yes) ...


FishMonger
Veteran / Moderator

Aug 28, 2013, 7:57 AM


Re: [rkellerjr] LWP Browser->Get Challenge

This part caught my attention.

Quote
Client-Aborted: die


Here's a related portion of the module documentation

Quote
$ua->max_size( $bytes )

Get/set the size limit for response content. The default is undef, which means that there is no limit. If the returned response content is only partial, because the size limit was exceeded, then a "Client-Aborted" header will be added to the response. The content might end up longer than max_size as we abort once appending a chunk of data makes the length exceed the limit. The "Content-Length" header, if present, will indicate the length of the full content and will normally not be the same as length($res->content).
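
Note that the response above is a 200, so is_success alone will not flag a truncated download. As a rough sketch of how a script could detect the situation itself, the helper below looks for the Client-Aborted and X-Died headers seen in the posted output and compares the size on disk with Content-Length. The sub name truncation_problems is made up for this example; it only assumes that $response offers the header method of HTTP::Response, and that the server's Content-Length describes the bytes written to the file.

```perl
use strict;
use warnings;

# Returns a list of reasons to distrust a download (empty list if none).
#   $response: the object returned by $browser->get(...)
#   $file:     the path that was passed as ':content_file'
sub truncation_problems {
    my ($response, $file) = @_;
    my @problems;

    # Client-side headers that show up when a transfer is cut short.
    for my $name ('Client-Aborted', 'X-Died') {
        my $value = $response->header($name);
        push @problems, "$name: $value" if defined $value;
    }

    # Compare the bytes on disk with what the server announced, if known.
    my $expected = $response->header('Content-Length');
    my $actual   = (stat $file)[7];    # undef if the file does not exist
    if (defined $expected && defined $actual && $actual != $expected) {
        push @problems, "size mismatch: got $actual of $expected bytes";
    }

    return @problems;
}

# Possible use inside the download loop (not executed here):
#   my @problems = truncation_problems($response, $file);
#   warn "Page $page looks incomplete: @problems\n" if @problems;
```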



FishMonger
Veteran / Moderator

Aug 28, 2013, 8:07 AM


Re: [rkellerjr] LWP Browser->Get Challenge

Can you post your check_tmp() sub? It may have some type of logic error that is prematurely returning a false value which could account for the loss of data.

You may want to manually download each file and compare the sizes with what you get via the script.


(This post was edited by FishMonger on Aug 28, 2013, 8:09 AM)


rkellerjr
Novice

Aug 28, 2013, 8:12 AM


Re: [FishMonger] LWP Browser->Get Challenge

Since I do not know the content_length in advance, setting an arbitrary number may or may not work: my examples are only a dozen of about 50 files I will be downloading, with varying content and sizes. I'm doing two "get" calls, one to a temp file to check whether I've gone past the last page, and one to actually download the file to the appropriate location and file name. Is it possible, after the temp call, to retrieve the content_length and then apply that...

$response = $browser->get($url,':content_file' => $tempxml,);

and then do the second call?

$mycontentlength = $response->content_length;
$response->max_size($mycontentlength);
$file = "$output\\$element" . "_" . $filepage . ".xml";
$response = $browser->get($url,':content_file' => $file,);


FishMonger
Veteran / Moderator

Aug 28, 2013, 8:13 AM


Re: [rkellerjr] LWP Browser->Get Challenge

I don't know why I didn't catch this before, but why are you using 2 get requests for the same url in the loop?


rkellerjr
Novice

Aug 28, 2013, 8:19 AM


Re: [rkellerjr] LWP Browser->Get Challenge

That didn't work, and I believe I should use $browser rather than $response, since that is my LWP::UserAgent variable, on which I turned off certificate checks and set my credentials for the webpage.


FishMonger
Veteran / Moderator

Aug 28, 2013, 8:20 AM


Re: [rkellerjr] LWP Browser->Get Challenge


Quote

Code
$mycontentlength = $response->content_length; 
$response->max_size($mycontentlength);


I don't think you should do that. I have not done any testing, but I suspect that you may need extra room for header overhead. If you set max_size to undef (which is supposed to be the default), then it should accept any size of download (short of any browser or server timeout).


rkellerjr
Novice

Aug 28, 2013, 8:23 AM


Re: [FishMonger] LWP Browser->Get Challenge

Because I do not know when I've downloaded the last page of data. Unfortunately the client has XX amount of pages to download. So I download a page, check for content and, if there is content, download it again to the appropriate place. I could just "copy" the downloaded temp file; there are definitely different ways to do it. Regardless, the temp.xml file from the first call doesn't contain all the data either, so both are problematic.
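
For what it's worth, the "check before keeping" logic does not require a second request. Here is a minimal sketch of the loop with one download per page, where the temp file is moved into place once it passes the check. Everything named here is hypothetical: $fetch stands in for a wrapper around $browser->get($url, ':content_file' => $tmp), $has_data for a check_tmp-style test, and $name_for for whatever builds the final file name.

```perl
use strict;
use warnings;
use File::Copy qw(move);

# Downloads pages 1, 2, 3, ... until a page without data is seen.
#   $fetch->($page, $tmp)  downloads one page into $tmp, true on success
#   $has_data->($tmp)      true if the downloaded file contains real data
#   $name_for->($page)     returns the final file name for that page
# Returns the number of pages that were kept.
sub download_all_pages {
    my ($fetch, $has_data, $tmp, $name_for) = @_;
    my $page = 0;

    while (1) {
        $page++;
        $fetch->($page, $tmp) or die "Download of page $page failed\n";

        # An empty page means we are past the last real one.
        last unless $has_data->($tmp);

        move($tmp, $name_for->($page)) or die "Could not move $tmp: $!\n";
    }

    unlink $tmp;          # remove the leftover empty page
    return $page - 1;
}
```

Passing the fetch and check steps in as code references is only a way to keep the sketch independent of any server; in a real script they could just as well be inlined.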


rkellerjr
Novice

Aug 28, 2013, 8:27 AM


Re: [rkellerjr] LWP Browser->Get Challenge

That didn't work either. I added....

$browser = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0,
SSL_verify_mode => SSL_VERIFY_NONE});
$browser->credentials (
'[some server data',
'[some server data]',
'[login:password'
);

$browser->max_size(undef);


rkellerjr
Novice

Aug 28, 2013, 8:30 AM


Re: [rkellerjr] LWP Browser->Get Challenge

As far as server timeouts, I haven't checked with the client but, each file only takes a few seconds to download so I wouldn't think that would be the cause.


FishMonger
Veteran / Moderator

Aug 28, 2013, 8:36 AM


Re: [rkellerjr] LWP Browser->Get Challenge

Making 2 get requests doesn't make any sense. Drop one of them and use the error checking methods provided by the module to check for success/failure. Then post your updated code and its results.

PLEASE use the code tags if you want additional help.


FishMonger
Veteran / Moderator

Aug 28, 2013, 9:23 AM


Re: [rkellerjr] LWP Browser->Get Challenge

Here's another important clue in your header output.

Quote
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.



rkellerjr
Novice

Aug 28, 2013, 9:28 AM


Re: [FishMonger] LWP Browser->Get Challenge

OK, doing it your way it never stops. It never gives me a failure, so it just continues and continues and continues. This is why I changed the code so I could read the content and verify whether the file had data. I have 12 pages that actually contain data, and I had to kill the program at page 21.

When called, their server serves up an XML file regardless of whether it contains actual data. So when I make the call to download a page of data, it always gives an XML page with header information but blank below that if it doesn't have data. This gives a legitimate XML file which has no data. So I created my routine to read the file after downloading, and if it doesn't contain actual data (I search for a certain data element) then I'm done and move on to the next batch.

I had forgotten why I did that so please, moving forward, don't assume I'm writing bad code. Let's tackle my problem of not getting all the data within a downloaded XML file, so I'm not wasting my company's time re-writing code I really don't need to re-write. Mucho appreciated :)

The code below is what I changed per your request. Output was the same as it always has been: partially downloaded XML files.


Code
  
$response = $browser->get($url,':content_file' => $file,);
if ($response->is_success) {
    print "Completed $element page \($page\) file \($filepage\) \($more\) ..\n";
    &get_page_data ($element, $output, $pricecode, $page);
} else {
    #$test = $response->code;
    $test2 = $response->headers_as_string;
    #$test2 = $response->content_length;
    die "$test2\n";
}


Now, having said all that I have included the above logic where it makes sense within the confines of the logic I need to achieve my goals. Here is the snippet of code after removing the above and adding the success check so that the code is "more correct" and in line with what you'd like to see.


Code
sub get_page_data {
    my ($element, $output, $pricecode) = @_;
    my ($response, $url, $page, $more);
    print "Downloading $element Info ...\n";
    `mkdir $output` unless (-d "$output");
    $more = "yes";
    $page = 0;
    while ($more) {
        $page++;

        $url = "https://[Server and path]/$element/HAY/?page=$page";

        $filepage = "0" x (3 - length($page)) . $page;
        $response = $browser->get($url,':content_file' => $tempxml,);
        if ($response->is_success) {
            if ($more = &check_xml) {
                $file = "$output\\$element" . "_" . $filepage . ".xml";
                $response = $browser->get($url,':content_file' => $file,); # Or I could change this to a copy statement which I might later.
            }
        } else {
            #$test = $response->code;
            $test2 = $response->headers_as_string;
            #$test2 = $response->content_length;
            die "$test2\n";
        }

        unlink ("temp.xml");
        print "Completed $element page \($page\) file \($filepage\) \($more\) ...\n";
    }
    print "\n";
}



rkellerjr
Novice

Aug 28, 2013, 9:36 AM


Re: [FishMonger] LWP Browser->Get Challenge


In Reply To
Here's another important clue in your header output.

Quote
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.



OK, but I do not know what that means nor how to fix it.

Also, I changed the second $response call to a copy statement so I replaced...

Code
$response = $browser->get($url,':content_file' => $file,);


with this....


Code
copy ("$tempxml", "$file");



rkellerjr
Novice

Aug 28, 2013, 9:42 AM


Re: [rkellerjr] LWP Browser->Get Challenge

Something I just noticed, all the files downloaded are 16k. Every single one of them and no matter what I change max_size to they remain 16k in size.


FishMonger
Veteran / Moderator

Aug 28, 2013, 9:57 AM


Re: [rkellerjr] LWP Browser->Get Challenge

What happens when you manually go to the url and download the file?


rkellerjr
Novice

Aug 28, 2013, 10:04 AM


Re: [FishMonger] LWP Browser->Get Challenge

When I copy/paste the URL into my browser it shows the entire file in the browser. Also, I changed the max_size to 100000 and it didn't make a difference.


(This post was edited by rkellerjr on Aug 28, 2013, 10:05 AM)


FishMonger
Veteran / Moderator

Aug 28, 2013, 11:41 AM


Re: [rkellerjr] LWP Browser->Get Challenge

16k is considerably less than the expected size as well as considerably less than what you previously stated which was 90 to 95% of the expected size. So, was your original estimate completely wrong or are there other details that you've left out?

If all of the files are the exact same 16k size, then that leads to the next obvious question. Do they all have the same contents?


Quote
moving forward, don't assume I'm writing bad code

I never said that you were writing bad code, however, you do have lots of questionable code. For example, this statement.

Code
if ($more = &check_xml)


1) It's already been pointed out that you shouldn't use & when executing the sub.

2) The conditional is not comparing the 2 values to see if they're the same. It's assigning the return value of the sub to $more and then evaluating that var in boolean context. Since you previously assigned $more the string 'yes', I can only assume that the sub returns either 'yes' or 'no'. In boolean context any string other than the empty string and '0' evaluates as true, so 'no' would be true as well, which is probably not what you intended.
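
A quick way to see the boolean-context point is to test a few strings directly. The sketch below does that and then shows one possible shape for a checking sub that returns a real boolean. The name has_data and the '<model' marker are invented for the example; the thread never shows what check_xml actually looks for or returns.

```perl
use strict;
use warnings;

# Record which of these values count as true in boolean context.
my %truth;
for my $value ('yes', 'no', '0', '', '0.0', '00') {
    $truth{$value} = $value ? 'true' : 'false';
}
printf "%-5s => %s\n", "'$_'", $truth{$_} for sort keys %truth;

# Hypothetical replacement for a sub that returns 'yes' / 'no':
# return 1 or 0 so that "while ($more)" behaves as expected.
sub has_data {
    my ($xml, $marker) = @_;
    return index($xml, $marker) >= 0 ? 1 : 0;
}

print "has data\n" if has_data('<models><model id="1"/></models>', '<model ');
```

Only the empty string and '0' come out false here, so a loop driven by a sub that returns 'no' would never end on its own.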

There are at least a dozen other examples in your code, some of which we've already pointed out. We point these out so that you can correct them, which will make your code more readable, more maintainable, easier to troubleshoot and less prone to bugs.


(This post was edited by FishMonger on Aug 28, 2013, 11:42 AM)


rkellerjr
Novice

Aug 29, 2013, 5:14 AM


Re: [FishMonger] LWP Browser->Get Challenge

Each file has different content so they are not the same.


FishMonger
Veteran / Moderator

Aug 29, 2013, 6:34 AM


Re: [rkellerjr] LWP Browser->Get Challenge

Are the contents of each file what you expect up to that point in the file, or do they contain data that you didn't expect?

Can you give me the link so that I can run some tests?

This response header sent up a red flag for me.

Quote
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.


First, it's telling you that there was an I/O error and that error is most likely the reason your data was truncated.

Second, it's coming from http.pm and since you're accessing an https page, I'd expect the error to come from https.pm.

Try adding protocols_allowed => [ 'https' ] to the constructor to see if that makes any difference.


rkellerjr
Novice

Aug 29, 2013, 6:46 AM


Re: [FishMonger] LWP Browser->Get Challenge

I wish I could give you the link Ron but our agreement with the customer doesn't allow me to.

This morning I decided to modify the program and download all of the XML files to see what would happen. There are five data groups I download and we've only been discussing one. What I found was that not everything is 16k. I get some at 27k, 18k, 19k, etc. and some are complete files while most are not.

It does look like they are cutting off my download stream. I'll research how to add the protocol and let you know the results.

By the way, a big thanks for sticking with me on this.


rkellerjr
Novice

Aug 29, 2013, 6:59 AM


Re: [rkellerjr] LWP Browser->Get Challenge

OK, that didn't change anything. I do have the https.pm in the LWP\protocol folder. Got the same "X-Died: read failed" in http.pm. The line of code I added was...

Code
$browser->protocols_allowed(['https']);



FishMonger
Veteran / Moderator

Aug 29, 2013, 7:02 AM


Re: [rkellerjr] LWP Browser->Get Challenge

You may want to add one or more callback handlers that monitor different parts of the transmission.

http://search.cpan.org/~gaas/libwww-perl-6.05/lib/LWP/UserAgent.pm#Handlers


rkellerjr
Novice

Aug 29, 2013, 7:32 AM


Re: [FishMonger] LWP Browser->Get Challenge

Oh boy, something else to learn :)


rkellerjr
Novice

Aug 29, 2013, 8:21 AM


Re: [rkellerjr] LWP Browser->Get Challenge

Here's something interesting. I turned on the show_progress and notice the first two lines of the first call to download the first page. Ring any bells? I have an e-mail to the client to verify our login/password information but, if I do not have that right, why would it download partial pages? The client indicates they have changed nothing on their end.

** GET https://dataserve.epitomy.com/users/ari/models/HAY/?page=1 ==> 401 Unauthorized (3s)
** GET https://dataserve.epitomy.com/users/ari/models/HAY/?page=1 ==> 200 OK (1s)
Cache-Control: private
Date: Thu, 29 Aug 2013 15:06:39 GMT
Server: Microsoft-IIS/7.0
Content-Length: 40674
Content-Type: application/xml; charset=utf-8
Client-Aborted: die
Client-Date: Thu, 29 Aug 2013 15:06:39 GMT
Client-Peer: 195.10.226.140:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /CN=Emerald
Client-SSL-Cert-Subject: /CN=Emerald
Client-SSL-Cipher: AES256-SHA
Client-SSL-Socket-Class: IO::Socket::SSL
Client-SSL-Warning: Peer hostname match with certificate not verified
Set-Cookie: ASP.NET_SessionId=csegdurfxyeolbvxfcbxwodf; path=/; HttpOnly
X-AspNet-Version: 4.0.30319
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.
X-Powered-By: ASP.NET


FishMonger
Veteran / Moderator

Aug 29, 2013, 9:11 AM


Re: [rkellerjr] LWP Browser->Get Challenge

It appears that it's having an initial problem with the authentication and does a redo and the second attempt succeeds.

Then while reading the data it encounters an I/O error and dies with the "read failed" error.

The relevant code is:

Code
my $complete;
$response = $self->collect($arg, $response, sub {
    my $buf = ""; #prevent use of uninitialized value in SSLeay.xs
    my $n;
  READ:
    {
        $n = $socket->read_entity_body($buf, $size);
        unless (defined $n) {
            redo READ if $!{EINTR} || $!{EAGAIN};
            die "read failed: $!"; # <-- line 414
        }
        redo READ if $n == -1;
    }
    $complete++ if !$n;
    return \$buf;


Since I can't do any direct testing, I'm not sure what code changes to suggest.

My best suggestion is to post a question on perlmonks. A good number of them are well known module authors and maintainers.

The only other suggestion would be to use a higher level module that subclasses these LWP modules. Specifically, I'm referring to WWW::Mechanize which is the most commonly used module for navigating web pages.

http://search.cpan.org/~ether/WWW-Mechanize-1.73/lib/WWW/Mechanize.pm


rkellerjr
Novice

Aug 29, 2013, 9:25 AM


Re: [FishMonger] LWP Browser->Get Challenge

Thanks Ron. It was also suggested that I attempt to use cURL instead of LWP and I have gotten it to work with that so I'll use cURL for now.

Thanks for your patience and your help.