


dilbert
User

Jan 31, 2018, 10:44 AM


Re: [Zhris] a little script that makes use of LWP::Simple

Hello dear Chris, hello dear all! ;)



Note: first of all, dear Chris, this is just a quick posting (at the moment I am not at home), but I wanted to share some ideas with you and say many thanks for the continued help!!

To begin at the beginning: the scraper runs very well, I have tried it out. This is just overwhelming.

Besides the results that this script brings, it is a great chance to see how (!!) Perl works, and a great chance to learn and to dig deeper into Perl.

BTW: it works not only for the above-mentioned page of volunteering organizations (altogether more than 6000 records), but also for the volunteering projects (search results: 506 projects found); see https://europa.eu/youth/volunteering_en and especially: https://europa.eu/youth/volunteering/project_en?field_eyp_country_value=All&country=&type_1=All&topic=&date_start=&date_end=

The data structure is (almost) the same, so the scraper can be applied here too: wonderful. As mentioned above, I want to learn more and more. Perl is pretty difficult, and for me it seems harder to dive into Perl than to learn PHP or Python.

But anyway, Perl is very, very powerful, and as we see, we can do a lot of things with it. I will try to gain more XPath knowledge and to apply the solution to more than one target; perhaps there is a kind of Swiss-army-knife solution, a robust scraper/parser that works for such (pretty difficult) sites with cells like the ones in this European target. A first XPath sketch is below.
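Just to make the XPath idea concrete for myself, here is a minimal sketch with HTML::TreeBuilder::XPath. The URL and the //td selector are only placeholders; the real expression depends on the page's markup:

Code
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder::XPath;

# Hypothetical URL and XPath - adjust both to the real target page.
my $url  = 'https://europa.eu/youth/volunteering/project_en';
my $html = get($url) or die "Could not fetch $url\n";

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Print the text of every matching cell (placeholder selector).
print $_->as_text, "\n" for $tree->findnodes('//td');

$tree->delete;   # free the parse tree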

Conclusion: after parsing the data, I will try to store all of it in a MySQL DB. This is the next step in the process of this little project; a rough sketch of the idea follows below.
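Only as a rough idea for myself, a minimal DBI sketch. The database name, the projects table with its title and url columns, and the credentials are all assumptions; the real schema will depend on the parsed fields:

Code
use strict;
use warnings;
use DBI;

# Hypothetical database, table and credentials - adjust to your setup.
my $dbh = DBI->connect(
    'DBI:mysql:database=volunteering;host=localhost',
    'user', 'password',
    { RaiseError => 1, AutoCommit => 1 },
);

my $sth = $dbh->prepare('INSERT INTO projects (title, url) VALUES (?, ?)');

# @records would come from the scraper - here just a placeholder row.
my @records = ( [ 'Example project', 'https://example.com/p/1' ] );
$sth->execute(@$_) for @records;

$dbh->disconnect;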

As for the next step, looping over the different pages: I have done a quick search to find some solutions that may give hints on how to do this "pagination thing":

First of all, I like the idea of incrementing the page; as you say, "behind the scenes it automatically increments the page when there are no nodes left until there are no pages left." Chris, this is a great idea. A sketch of how I picture it follows below.
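A minimal sketch of how I understand that idea, assuming a hypothetical page query parameter; parse_records() is only a crude placeholder standing in for the real XPath extraction:

Code
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical base URL - the real site may use a different parameter.
my $base = 'https://europa.eu/youth/volunteering/project_en?page=';
my $page = 0;

while (1) {
    my $html = get( $base . $page );
    last unless defined $html;            # request failed -> stop

    my @records = parse_records($html);
    last unless @records;                 # no nodes left -> no pages left

    # ... store / print the records here ...
    $page++;                              # increment the page behind the scenes
}

# Placeholder parser - replace with the real extraction logic.
sub parse_records {
    my ($html) = @_;
    return $html =~ m{<td[^>]*>(.*?)</td>}gs;
}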

Here are the solutions I found that could be applied, more or less:


A Bash solution: https://stackoverflow.com/questions/35423019/iterate-through-urls

If I wanted to [..] download a bunch of .ts files from a website, and the URL format is http://example.com/video-1080Pxxxxx.ts

Question: where xxxxx is a number from 00000 to 99999 (zero padding required), how would I iterate through that in Bash so that it tries every integer starting at 00000, 00001, 00002, etc.?

Loop over the integer values from 0 to 99999, and use printf to pad to 5 digits.


Code
for x in {0..99999}; do
    zx=$(printf '%05d' "$x")                      # zero-pad to 5 digits
    url="http://example.com/video-1080P${zx}.ts"
    # ... do something with "$url" here ...
done
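
For my own practice, the same zero-padding loop in Perl: a minimal sketch using sprintf, with the URL pattern taken from the question above:

Code
use strict;
use warnings;

for my $x (0 .. 99999) {
    my $zx  = sprintf '%05d', $x;                      # zero-pad to 5 digits
    my $url = "http://example.com/video-1080P${zx}.ts";
    # ... do something with $url here, e.g.:
    print "$url\n";
}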


A Perl solution:

https://stackoverflow.com/questions/14983580/for-loop-in-perl-how-to-do-something-with-each-url


Code
use LWP::Simple qw(get);

my @arr = ('something', 'something1');
for my $i (0 .. $#arr) {
    my $url     = "http://www.$arr[$i].com";
    my $content = get($url);   # fetch the page
    # ... do something with $url / $content ...
}



And Andy Lester says: don't use index variables like that to loop over your array. Iterate over the array directly, like this:


Code
my @domains = ( ... );
for my $domain ( @domains ) {
    my $url = "http://www.$domain.com";
    ...
}


And he adds the following note: as others have said, always use strict; use warnings;, especially as a beginner.



Some thoughts that I derived from a Perl Maven article:

https://perlmaven.com/simple-way-to-fetch-many-web-pages
The article finally arrives at an example of downloading many pages using HTTP::Tiny:


Code
use strict;
use warnings;
use 5.010;

use HTTP::Tiny;

my @urls = qw(
    https://perlmaven.com/
    https://cn.perlmaven.com/
    https://br.perlmaven.com/
);

my $ht = HTTP::Tiny->new;

foreach my $url (@urls) {
    say "Start $url";
    my $response = $ht->get($url);
    if ($response->{success}) {
        say 'Length: ', length $response->{content};
    } else {
        say "Failed: $response->{status} $response->{reason}";
    }
}


The code is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then, in a for-loop, we go over each URL and fetch it. In order to save space in the article, only the size of each page is printed.


Well Chris, very nice examples. Here is a Python example using BeautifulSoup:

BeautifulSoup looping through URLs: https://stackoverflow.com/questions/27752860/beautifulsoup-looping-through-urls
The idea there: follow the pagination by making an endless loop and follow the "Next" link until it is not found. The same idea translated to Perl could look like the sketch below.
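Since we are in Perl land, here is how I imagine that "follow the Next link" idea with WWW::Mechanize; the start URL and the link-text pattern are only assumptions:

Code
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical start URL - adjust to the real listing page.
my $url  = 'https://europa.eu/youth/volunteering/project_en';
my $mech = WWW::Mechanize->new( autocheck => 1 );

while ($url) {
    $mech->get($url);
    # ... extract the records from $mech->content here ...

    # Follow the "Next" link until there is none left.
    my $next = $mech->find_link( text_regex => qr/next/i );
    $url = $next ? $next->url_abs : undef;
}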


Dear Chris, dear all: this was just a quick posting, but I wanted to share these ideas with you and say many thanks again for the continued help!!


In the next few days I will try to figure out how to apply a solution for iterating over all the pages and, last but not least, to load the whole dataset into a MySQL DB.

But that can wait for the moment...


I keep coming back to this wonderful thread on a regular basis, and yes: I will keep you informed about how it goes here.

regards
dilbert

I love this place!!! ;)
Keep up this great place for idea exchange and knowledge transfer; it is a great place for discussing ideas and, yes, for learning!!! ;)

