


Zhris
Enthusiast

Feb 18, 2018, 1:11 PM


Re: [dilbert] since php-parser attempts failed i need to get a perl-approach

Hi Dilbert,


Quote
After parsing each page, check for the existence of the next link at the bottom


That is an excellent idea, and likely your best option. In a rough script I tested, I fetched the total number of results using //span[@class="ey_badge"], then worked out the max page using my $page_max = $results / 21; $page_max = int( $page_max ) == $page_max ? $page_max - 1 : int( $page_max ) ; (21 results per page, with pages counted from 0). But stick to your plan.
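As a rough sketch of that max page calculation, assuming 21 results per page and that the first page is page 0 (the sub name max_page_index is just something I made up for illustration):

```perl
use strict;
use warnings;

# Highest zero-based page index for a given result count.
# Assumes page numbering starts at 0; returns -1 when there is nothing to fetch.
sub max_page_index
{
    my ( $results, $per_page ) = @_;

    return -1 if $results <= 0;

    # Same as ceil( $results / $per_page ) - 1, without floating point.
    return int( ( $results - 1 ) / $per_page );
}

printf "%d results -> last page index %d\n", $_, max_page_index( $_, 21 ) for 21, 42, 43;
```

An exact multiple of 21 ends one page earlier than the plain division suggests, which is the case the ternary above is meant to handle.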


Quote
if we have an array from which we load the urls that need to be visited - we would come across all the pages.


Yes, this would be fine. Preferably, use a URI object to update the URL's page param on each iteration of a loop until the max page is reached.
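Something along these lines, as a minimal sketch. The sub name page_urls is hypothetical, and I am assuming the page param starts at 0 and is the only query param needed:

```perl
use strict;
use warnings;
use URI;

# Build one url per page by updating the page param on a single URI object.
sub page_urls
{
    my ( $base, $page_max ) = @_;

    my $uri = URI->new( $base );
    my @urls;

    for my $page ( 0 .. $page_max )
    {
        # Note: query_form replaces the whole query string, not just "page".
        $uri->query_form( page => $page );

        push @urls, $uri->as_string;
    }

    return @urls;
}

print "$_\n" for page_urls( q#https://europa.eu/youth/volunteering/evs-organisation_en#, 2 );
```

If the url ever needs other query params as well, pass them all to query_form together, since each call overwrites the existing query.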


Quote
i would be more than glad to have some insights into this project - it sounds very promising


It's a work in progress, a first draft. I need to modularise more components, as the interface from a user's perspective isn't very clean or intuitive. It also needs to cover various other scenarios to ensure it's capable of every possible scrape. The code won't make much sense on its own, but I plugged in Europa and here is a snippet:


Code
our $iterator_organizations = sub
{
    my ( $browser, $parent ) = @_;

    my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;

    my $nodes = $browser->nodes( url => $url );

    my $iterator = sub
    {
        return shift @$nodes;
    };

    return ( $iterator, 1 );
};

our $iterator_organizations_b = sub
{
    my ( $browser, $parent ) = @_;

    my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;
    my $uri = URI->new( $url );
    my $xpath = q#//div[@class="vp ey_block block-is-flex"]#;
    my $nodes = [ ];
    my $page = 0;

    # 21 results per page, pages counted from 0, so an exact multiple ends one page earlier.
    my $results = $parent->{results};
    my $page_max = $results / 21;
    $page_max = int( $page_max ) == $page_max ? $page_max - 1 : int( $page_max ) ;

    my $iterator_uri = sub
    {
        $uri->query_form( page => $page++ );

        return $page > 2 ? undef : $uri ; # capped at two pages while testing, $page_max is the intended bound
    };

    my $iterator_node = sub
    {
        # Only fetch the next page once the current page's nodes are used up.
        unless ( @$nodes )
        {
            my $uri = $iterator_uri->( ) // return undef;

            # The first page was already fetched by the parent, so reuse its tree.
            my $options = $page == 1 ? { tree => $parent->{_node} } : { url => $uri->as_string };

            $nodes = $browser->nodes( %$options, xpath => $xpath );
        }

        return shift @$nodes;
    };

    return ( $iterator_node, 0 );
};

our $iterator_organization = sub
{
    my ( $browser, $parent ) = @_;

    my $url = $parent->{internal_url};

    my $nodes = $browser->nodes( url => $url );

    my $iterator = sub
    {
        return shift @$nodes;
    };

    return ( $iterator, 1 );
};

#########################

sub organizations
{
    my ( $self, $options ) = ( shift, { @_ } );

    my $map =
    [
        $Massweb::Browser::Europa::iterator_organizations,
        results => q#.//span[@class="ey_badge"]#,
        organizations =>
        [
            $Massweb::Browser::Europa::iterator_organizations_b,
            internal_url => [ q#.//a/@href#, $Massweb::Browser::Europa::handler_url ],
            external_url => [ q#.//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, $Massweb::Browser::handler_trim ],
            title => q#.//h4#,
            topics => [ q#.//div[@class="org_cord"]#, $Massweb::Browser::handler_val, $Massweb::Browser::handler_list_colon ],
            location => [ q#.//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, $Massweb::Browser::handler_trim ],
            hand => [ q#.//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, $Massweb::Browser::handler_trim, $Massweb::Browser::handler_list_comma ],
            pic_number => [ q#.//p[contains(.,'PIC no')]#, $Massweb::Browser::handler_val ],
            recruiting => [ q#boolean(.//i[@class="fa fa-user-times fa-lg"])#, $Massweb::Browser::handler_bool_rev ],
            _ => \&organization,
        ],
    ];

    my $organizations = $self->browse( map => $map );

    return $organizations;
}

sub organization
{
    my ( $self, $options ) = ( shift, { @_ } );

    my $map =
    [
        sub { $Massweb::Browser::Europa::iterator_organization->( $_[0], $options ) },
        #title => q#.//h1#,
        description => q#.//div[@class="ey_vp_detail_page"]/p#,
    ];

    my $organization = $self->browse( map => $map );

    return $organization;
}


In general, the map represents the resultant data structure. The iterator's purpose is quite straightforward: it should return a node each time it is called, or undef to finish. The organizations paging iterator shifts each node off an array; once the array is empty, it calls a URL iterator which increments the page, until there are no pages left.
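Stripped of everything specific to my module, the paging iterator idea looks roughly like this. It is a standalone sketch, not the actual Massweb code: make_paging_iterator and the $fetch_page callback are hypothetical names, and plain strings stand in for parsed nodes:

```perl
use strict;
use warnings;

# $fetch_page is a hypothetical callback: given a page number, it returns
# an array ref of nodes for that page (or undef / an empty array ref).
sub make_paging_iterator
{
    my ( $fetch_page, $page_max ) = @_;

    my @buffer;
    my $page = 0;

    return sub
    {
        # Refill from the next page(s) until we hold a node or run out of pages.
        while ( !@buffer )
        {
            return undef if $page > $page_max;

            @buffer = @{ $fetch_page->( $page++ ) // [ ] };
        }

        return shift @buffer;
    };
}

# In-memory pages stand in for fetched and parsed html; page 1 is deliberately empty.
my %pages = ( 0 => [ qw( a b ) ], 1 => [ ], 2 => [ qw( c ) ] );
my $next  = make_paging_iterator( sub { $pages{ $_[0] } }, 2 );

while ( defined( my $node = $next->( ) ) )
{
    print "$node\n";
}
```

The caller never needs to know about pages at all; it just keeps calling the iterator until it gets undef.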

It's too complex to go into much detail right now, but hopefully I can in the near future.


Quote
i want to have a little database that runs locally - with those data of my favorite-plugins.


Keep working at it. If you bump into a specific issue, or there's an aspect you don't understand, feel absolutely free to share it with us and we will do our best to help you move forward. Try to produce a working script: start by looping over each plugin from a hardcoded array, fetching the relevant page's content and putting it through an XPath module of your choice.
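To give you a starting point, here is a minimal sketch of that loop using HTTP::Tiny and HTML::TreeBuilder::XPath (one choice among several). The plugin names, the example.com url template and the //h1 xpath are placeholders I made up, so swap in the real ones for your site:

```perl
use strict;
use warnings;
use HTTP::Tiny;
use HTML::TreeBuilder::XPath;

# Default fetcher: returns the page content, or undef on failure.
my $default_fetch = sub
{
    my ( $url ) = @_;

    my $response = HTTP::Tiny->new( timeout => 20 )->get( $url );

    return $response->{success} ? $response->{content} : undef;
};

# Loop over the plugins, fetch each page and pull one value out via xpath.
sub scrape_titles
{
    my ( $plugins, $url_for, $xpath, $fetch ) = @_;

    $fetch //= $default_fetch;

    my %title_for;

    for my $plugin ( @$plugins )
    {
        my $html = $fetch->( $url_for->( $plugin ) );

        unless ( defined $html )
        {
            warn "could not fetch $plugin\n";
            next;
        }

        my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );

        $title_for{$plugin} = $tree->findvalue( $xpath );

        $tree->delete;
    }

    return \%title_for;
}

# Offline demonstration: a stub fetcher stands in for the real http request.
my @plugins = qw( plugin-one plugin-two );

my $titles = scrape_titles(
    \@plugins,
    sub { "https://example.com/plugins/$_[0]/" },   # hypothetical url template
    q#//h1#,                                         # hypothetical xpath
    sub { "<html><body><h1>Page for $_[0]</h1></body></html>" },
);

print "$_: $titles->{$_}\n" for sort keys %$titles;
```

Drop the last argument to use the real fetcher. Once this works for a couple of plugins, storing the hash in your local database is a separate and much easier step.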

Best regards,

Chris


(This post was edited by Zhris on Feb 18, 2018, 1:12 PM)

