Home: Perl Programming Help: Intermediate:
LWP Browser->Get Challenge

 



rkellerjr
Novice

Aug 27, 2013, 7:33 AM

Post #1 of 34 (2357 views)
LWP Browser->Get Challenge

I convert data for a living and haven't dealt with browsers or the internet directly from Perl. Recently, though, a client asked us to download their data directly from their secure website. This was something new (and exciting!) for me, so I went off, did some research, and wrote a program that has worked fairly well. On a recent run I got a certificate error, which I had never seen before. OK, so I researched that and added code, and now it bypasses the check. However, there is another problem I haven't been able to resolve: the program no longer downloads the entire page of data. It gets maybe 90%-95% of the page, then stops and moves on to the next page of data. The only difference I can think of is that I upgraded from ActiveState Perl 5.10 to 5.16; I wouldn't think that would make a difference, but it might. If I paste the URL directly into my browser (for any page of data), the entire page downloads just fine. I'm not sure what you guys might need to help out, and I need to be conscious of proprietary information.

Here is the major piece of code doing the work, with names changed to protect the innocent. :)

while ($more) {
    $page++;
    $url = "https://[server name is here]/[path information here]/$element/HAY/?page=$page";
    $filepage = "0" x (3 - length($page)) . $page;
    $response = $browser->get($url, ':content_file' => $tempxml);
    $file = "$output\\$element" . "_" . $filepage . ".xml";
    $response = $browser->get($url, ':content_file' => $file);
    die "Couldn't get $url\n" unless defined $response;
    $more = &check_tmp;
    unlink("temp.xml");
    print "Completed $element page ($page) file ($filepage) ($more) ...\n";
}

Because there is more than one page of data and I don't know which page is the last, I download each page into a temp.xml file and then check whether the file has data. If it does, I copy it to another location, delete temp.xml, grab the next page of data, and loop until no more page data is available.

To get past the certificate issue I added code...

$browser = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0,
SSL_verify_mode => SSL_VERIFY_NONE});

I also have browser credentials, etc. that work fine. So, any clue as to why I'm no longer getting the entire page of XML data?

And thanks for your time folks!


FishMonger
Veteran / Moderator

Aug 27, 2013, 8:13 AM

Post #2 of 34 (2355 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

Maybe it's timing out, or the file size is greater than the configured max_size.

Have you checked what headers were assigned in $response?
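A sketch of one way to inspect them (the URL here is a placeholder for illustration; substitute the real one from the script):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser  = LWP::UserAgent->new(timeout => 30);
# Placeholder URL for illustration only.
my $response = $browser->get('https://www.example.com/');

print $response->status_line, "\n";   # e.g. "200 OK"
# Dumps every header LWP recorded, including the synthetic
# Client-* headers it adds when something goes wrong mid-transfer.
print $response->headers_as_string;
```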


FishMonger
Veteran / Moderator

Aug 27, 2013, 8:19 AM

Post #3 of 34 (2354 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

BTW, it appears that you're not using the strict pragma, which is a mistake. Every Perl script you write should begin by loading these two pragmas.

Code
use strict; 
use warnings;

The strict pragma will require you to declare your vars, which is done with the 'my' or 'our' keywords. 99.5 percent of the time you'll want to use the 'my' keyword.
e.g.,

Code
my $url = "https://[server name is here]/[path information here]/$element/HAY/?page=$page";


Please use the code tags like I've done when posting your code.


(This post was edited by FishMonger on Aug 27, 2013, 8:20 AM)


rkellerjr
Novice

Aug 27, 2013, 9:50 AM

Post #4 of 34 (2349 views)
Re: [FishMonger] LWP Browser->Get Challenge [In reply to]

How do I print $response and get that information?

While I appreciate the education on Perl practices, I gave you just a snippet of a much larger program, and the education isn't needed in that area.

For example, that snippet of code is within a subroutine and it starts...

sub get_page_data {
my ($element, $output, $pricecode) = @_;
my ($response, $url, $page, $more);


(This post was edited by rkellerjr on Aug 27, 2013, 9:51 AM)


FishMonger
Veteran / Moderator

Aug 27, 2013, 10:16 AM

Post #5 of 34 (2343 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

The get method you're using returns an HTTP::Response object and the next step would normally be to check the status of the response, which is done by doing something like this:

Code
if ($response->is_success) {
    # do something
}
else {
    print $response->headers_as_string;
    die $response->status_line;
}


You may want to print the header within each of those blocks.

Specific info on which headers you should be looking for can be found in the related modules that are being used in these calls.

LWP http://search.cpan.org/~gaas/libwww-perl-6.05/lib/LWP.pm

LWP::UserAgent http://search.cpan.org/~gaas/libwww-perl-6.05/lib/LWP/UserAgent.pm

HTTP::Response http://search.cpan.org/~gaas/HTTP-Message-6.06/lib/HTTP/Response.pm

Additional info can be found in the links provided in the "SEE ALSO" section of the modules.


(This post was edited by FishMonger on Aug 27, 2013, 10:19 AM)


Laurent_R
Veteran / Moderator

Aug 27, 2013, 10:22 AM

Post #6 of 34 (2341 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

The reason you should use the strict and warnings pragmas is that they will help you find bugs and faulty constructs.

BTW, another point:


Code
&check_tmp;


That syntax for calling a subroutine is outdated and deprecated. It was replaced about 19 years ago by this:


Code
check_tmp();



FishMonger
Veteran / Moderator

Aug 27, 2013, 10:33 AM

Post #7 of 34 (2339 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]


Code
$filepage = "0" x (3 - length($page)) . $page;

Ouch!

It would be better to use the sprintf function.

Code
$filepage = sprintf("%03d", $page);


And

Code
$file = "$output\\$element" . "_" . $filepage . ".xml";

Would be better written as:

Code
$file = "$output/${element}_$filepage.xml";


To add to Laurent's suggestion: using & when calling a sub has certain side effects (such as implicitly passing the caller's @_ when you omit the parentheses) which you normally don't want. So it's best not to use the & unless you really do need those side effects.

Additionally, vars should be declared in the smallest scope that requires them. It appears that several of your vars are only needed within the while loop, so that's where they should be declared.
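For anyone curious, the main side effect is that calling a sub as &name with no parentheses passes the caller's @_ along implicitly. A small, self-contained demonstration (the sub names here are made up for the example):

```perl
use strict;
use warnings;

sub show_args { return scalar @_ }     # reports how many args it received

sub caller_demo {
    my @got;
    push @got, show_args('a', 'b');    # normal call: receives 2 args
    push @got, &show_args;             # &-call, no parens: caller's @_ leaks in!
    return @got;
}

my @res = caller_demo('x', 'y', 'z');
print "@res\n";   # prints "2 3" -- &show_args silently received the 3 args
```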


rkellerjr
Novice

Aug 28, 2013, 7:38 AM

Post #8 of 34 (2328 views)
Re: [FishMonger] LWP Browser->Get Challenge [In reply to]

Here is one of the header responses. The only difference between them is the Content-Length of each file. I've checked all of the downloaded files and none of them has the complete XML content; the last 5% or so is cut off, as I stated above.

200
Cache-Control: private
Date: Wed, 28 Aug 2013 14:28:35 GMT
Server: Microsoft-IIS/7.0
Content-Length: 102241
Content-Type: application/xml; charset=utf-8
Client-Aborted: die
Client-Date: Wed, 28 Aug 2013 14:28:34 GMT
Client-Peer: 195.10.226.140:443
Client-Response-Num: 1
Client-SSL-Cert-Issuer: /CN=Emerald
Client-SSL-Cert-Subject: /CN=Emerald
Client-SSL-Cipher: AES256-SHA
Client-SSL-Socket-Class: IO::Socket::SSL
Client-SSL-Warning: Peer hostname match with certificate not verified
Set-Cookie: ASP.NET_SessionId=ljlcwpn5vqz05dr35qwbztq2; path=/; HttpOnly
X-AspNet-Version: 4.0.30319
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.
X-Powered-By: ASP.NET
Completed models page (12) file (012) (yes) ...

You also mentioned a file size "cap" of some sort, here are the file sizes...

40674
Completed models page (1) file (001) (yes) ...
38724
Completed models page (2) file (002) (yes) ...
38259
Completed models page (3) file (003) (yes) ...
48501
Completed models page (4) file (004) (yes) ...
50421
Completed models page (5) file (005) (yes) ...
54426
Completed models page (6) file (006) (yes) ...
43147
Completed models page (7) file (007) (yes) ...
71234
Completed models page (8) file (008) (yes) ...
62076
Completed models page (9) file (009) (yes) ...
48237
Completed models page (10) file (010) (yes) ...
94105
Completed models page (11) file (011) (yes) ...
102241
Completed models page (12) file (012) (yes) ...


FishMonger
Veteran / Moderator

Aug 28, 2013, 7:57 AM

Post #9 of 34 (2325 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

This part caught my attention.

Quote
Client-Aborted: die


Here's a related portion of the module documentation

Quote
$ua->max_size( $bytes )

Get/set the size limit for response content. The default is undef, which means that there is no limit. If the returned response content is only partial, because the size limit was exceeded, then a "Client-Aborted" header will be added to the response. The content might end up longer than max_size as we abort once appending a chunk of data makes the length exceed the limit. The "Content-Length" header, if present, will indicate the length of the full content and will normally not be the same as length($res->content).
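Based on that excerpt, a truncated transfer can be detected by looking for the Client-Aborted header on the response. A minimal sketch, using a hand-built HTTP::Response to stand in for the real one returned by $browser->get (the header values mirror the dump posted above):

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for a real response; normally this comes from $browser->get($url).
my $response = HTTP::Response->new(200);
$response->header('Content-Length' => 102241);
$response->header('Client-Aborted' => 'die');
$response->content('<partial xml>');

if (defined $response->header('Client-Aborted')) {
    my $expected = $response->header('Content-Length');
    my $got      = length $response->content;
    warn "Truncated download: got $got of $expected bytes\n";
}
```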



FishMonger
Veteran / Moderator

Aug 28, 2013, 8:07 AM

Post #10 of 34 (2322 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

Can you post your check_tmp() sub? It may have some type of logic error that is prematurely returning a false value which could account for the loss of data.

You may want to manually download each file and compare the sizes with what you get via the script.


(This post was edited by FishMonger on Aug 28, 2013, 8:09 AM)


rkellerjr
Novice

Aug 28, 2013, 8:12 AM

Post #11 of 34 (2319 views)
Re: [FishMonger] LWP Browser->Get Challenge [In reply to]

Since I don't know the content length in advance, setting an arbitrary number may or may not get my results; my examples are only a dozen of the roughly 50 files I'll be downloading, all with varying content and sizes. Since I'm doing two "get" calls (one to a temp file to make sure I haven't passed the last page, and one to actually download the file to the appropriate location and file name), is it possible that after the temp call I retrieve the content_length, then apply that...

$response = $browser->get($url,':content_file' => $tempxml,);

and then do the second call?

$mycontentlength = $response->content_length;
$response->max_size($mycontentlength);
$file = "$output\\$element" . "_" . $filepage . ".xml";
$response = $browser->get($url,':content_file' => $file,);


FishMonger
Veteran / Moderator

Aug 28, 2013, 8:13 AM

Post #12 of 34 (2318 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

I don't know why I didn't catch this before, but why are you using 2 get requests for the same url in the loop?


rkellerjr
Novice

Aug 28, 2013, 8:19 AM

Post #13 of 34 (2316 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

That didn't work. And I wouldn't use $response; I'd use $browser, I believe, since that is my LWP::UserAgent object, where I turned certificate checking off and set my credentials for the webpage.


FishMonger
Veteran / Moderator

Aug 28, 2013, 8:20 AM

Post #14 of 34 (2315 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]


Quote

Code
$mycontentlength = $response->content_length; 
$response->max_size($mycontentlength);


I don't think you should do that. I haven't done any testing, but I suspect you may need extra room for header overhead. If you set max_size to undef (which is supposed to be the default), it should accept a download of any size (short of a browser or server timeout).


rkellerjr
Novice

Aug 28, 2013, 8:23 AM

Post #15 of 34 (2313 views)
Re: [FishMonger] LWP Browser->Get Challenge [In reply to]

Because I don't know when I've downloaded the last page of data; unfortunately the client has XX amount of pages to download. So I download a page, check it for content, and if there is content, download it again to the appropriate place. I could just "copy" the downloaded temp file; there are definitely different ways to do it. But regardless, the temp.xml file from the first call doesn't contain all the data either, so both calls are problematic.


rkellerjr
Novice

Aug 28, 2013, 8:27 AM

Post #16 of 34 (2310 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

That didn't work either. I added....

$browser = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0,
SSL_verify_mode => SSL_VERIFY_NONE});
$browser->credentials(
    '[some server data]',
    '[some server data]',
    '[login:password]'
);

$browser->max_size(undef);


rkellerjr
Novice

Aug 28, 2013, 8:30 AM

Post #17 of 34 (2306 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

As far as server timeouts go, I haven't checked with the client, but each file only takes a few seconds to download, so I wouldn't think that's the cause.
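For what it's worth, LWP::UserAgent defaults to a 180-second timeout, and it can be set explicitly to rule timeouts in or out. A sketch (the ssl_opts value mirrors the constructor call already shown in this thread):

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Raise the timeout explicitly; the default is 180 seconds.
my $browser = LWP::UserAgent->new(
    timeout  => 300,
    ssl_opts => { verify_hostname => 0 },
);
print $browser->timeout, "\n";   # prints "300"
```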


FishMonger
Veteran / Moderator

Aug 28, 2013, 8:36 AM

Post #18 of 34 (2304 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

Making 2 get requests doesn't make any sense. Drop one of them and use the error checking methods provided by the module to check for success/failure. Then post your updated code and its results.

PLEASE use the code tags if you want additional help.


FishMonger
Veteran / Moderator

Aug 28, 2013, 9:23 AM

Post #19 of 34 (2299 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

Here's another important clue in your header output.

Quote
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.



rkellerjr
Novice

Aug 28, 2013, 9:28 AM

Post #20 of 34 (2297 views)
Re: [FishMonger] LWP Browser->Get Challenge [In reply to]

OK, doing it your way it never stops; it never gives me a failure, so it just continues and continues and continues. This is why I changed the code to read the content and verify whether the file had data. I have 12 pages that actually contain data; I had to kill the program at page 21.

When called, their server serves up an XML file regardless of whether it contains actual data. So when I make the call to download a page of data, it always returns an XML page with header information, but blank below that if there is no data. That is a legitimate XML file which has no data. So I created my routine to read the file after downloading, and if it doesn't contain actual data (I search for a certain data element), then I'm done and move on to the next batch.

I had forgotten why I did that, so please, moving forward, don't assume I'm writing bad code. Let's tackle my problem of not getting all the data within a downloaded XML file, so I'm not wasting my company's time re-writing code I really don't need to re-write. Mucho appreciated :)

The code below is what I changed per your request. The output was the same as it always has been: partially downloaded XML files.


Code
  
$response = $browser->get($url, ':content_file' => $file);
if ($response->is_success) {
    print "Completed $element page ($page) file ($filepage) ($more) ..\n";
    get_page_data($element, $output, $pricecode, $page);
}
else {
    #$test = $response->code;
    $test2 = $response->headers_as_string;
    #$test2 = $response->content_length;
    die "$test2\n";
}


Now, having said all that I have included the above logic where it makes sense within the confines of the logic I need to achieve my goals. Here is the snippet of code after removing the above and adding the success check so that the code is "more correct" and in line with what you'd like to see.


Code
sub get_page_data {
    my ($element, $output, $pricecode) = @_;
    my ($response, $url, $page, $more);
    print "Downloading $element Info ...\n";
    `mkdir $output` unless (-d "$output");
    $more = "yes";
    $page = 0;
    while ($more) {
        $page++;

        $url = "https://[Server and path]/$element/HAY/?page=$page";

        $filepage = "0" x (3 - length($page)) . $page;
        $response = $browser->get($url, ':content_file' => $tempxml);
        if ($response->is_success) {
            if ($more = &check_xml) {
                $file = "$output\\$element" . "_" . $filepage . ".xml";
                $response = $browser->get($url, ':content_file' => $file); # Or I could change this to a copy statement, which I might later.
            }
        }
        else {
            #$test = $response->code;
            $test2 = $response->headers_as_string;
            #$test2 = $response->content_length;
            die "$test2\n";
        }

        unlink("temp.xml");
        print "Completed $element page ($page) file ($filepage) ($more) ...\n";
    }
    print "\n";
}



rkellerjr
Novice

Aug 28, 2013, 9:36 AM

Post #21 of 34 (2293 views)
Re: [FishMonger] LWP Browser->Get Challenge [In reply to]


In Reply To
Here's another important clue in your header output.

Quote
X-Died: read failed: Inappropriate I/O control operation at C:/perl5_16/site/lib/LWP/Protocol/http.pm line 414.



OK, but I do not know what that means nor how to fix it.

Also, I changed the second $response call to a copy statement so I replaced...

Code
$response = $browser->get($url,':content_file' => $file,);


with this....


Code
use File::Copy;   # copy() comes from this module
copy("$tempxml", "$file") or warn "copy failed: $!";



rkellerjr
Novice

Aug 28, 2013, 9:42 AM

Post #22 of 34 (2291 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

Something I just noticed: all the files downloaded are 16k. Every single one of them, and no matter what I change max_size to, they remain 16k in size.


FishMonger
Veteran / Moderator

Aug 28, 2013, 9:57 AM

Post #23 of 34 (2282 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

What happens when you manually go to the url and download the file?


rkellerjr
Novice

Aug 28, 2013, 10:04 AM

Post #24 of 34 (2278 views)
Re: [FishMonger] LWP Browser->Get Challenge [In reply to]

When I copy/paste the URL into my browser it shows the entire file in the browser. Also, I changed the max_size to 100000 and it didn't make a difference.


(This post was edited by rkellerjr on Aug 28, 2013, 10:05 AM)


FishMonger
Veteran / Moderator

Aug 28, 2013, 11:41 AM

Post #25 of 34 (2263 views)
Re: [rkellerjr] LWP Browser->Get Challenge [In reply to]

16k is considerably less than the expected size, and considerably less than what you previously stated (90 to 95% of the expected size). So, was your original estimate completely wrong, or are there other details you've left out?

If all of the files are the exact same 16k size, then that leads to the next obvious question. Do they all have the same contents?


Quote
moving forward, don't assume I'm writing bad code

I never said that you were writing bad code; however, you do have lots of questionable code. For example, this statement:

Code
if ($more = &check_xml)


1) It's already been pointed out that you shouldn't use & when executing the sub.

2) The conditional is not comparing 2 values to see if they're the same. It's assigning the return value of the sub to $more and then evaluating that var in boolean context. Since you previously assigned $more the string 'yes', I can only assume the sub returns either 'yes' or 'no'. In boolean context, any non-empty string other than '0' evaluates as true (including 'no'), which is probably not what you intended.

There are at least a dozen other examples in your code, some of which we've already pointed out. We point these out so that you can correct them, which will make your code more readable, maintainable, and easier to troubleshoot, with fewer bugs.
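To make point 2 concrete, here is a tiny demonstration; check_xml_empty is a hypothetical stand-in for check_xml, assuming it returns the strings 'yes' or 'no':

```perl
use strict;
use warnings;

# Hypothetical stand-in: pretend the page had no data.
sub check_xml_empty { return 'no' }

my $more = check_xml_empty();
if ($more) {
    print "'$more' is true in boolean context -- the loop keeps going!\n";
}
# Only '', '0', 0, and undef are false in Perl; the string 'no' is true.
```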


(This post was edited by FishMonger on Aug 28, 2013, 11:42 AM)

Powered by Gossamer Forum v.1.2.0