



dilbert
User

May 7, 2018, 10:50 AM


Re: [Zhris] since php-parser attempts failed i need to get a perl-approach

Hello dear Zhris, hello dear Bill,

first of all, many thanks to you both for the quick replies.
The first steps in creating an approach can be found here:

http://perlguru.com/gforum.cgi?post=84635;sb=post_latest_reply;so=ASC;forum_view=forum_view_collapsed;guest=52395127


We have a little script that extracts the data out of each block and cleans it up a little.

This is a great step: at this point the browse function is generic. It takes an input ref that contains the URL and the XPaths of the parent and its children, and uses them to construct the output ref.
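To make the "cleans it up a little" part concrete, here is a minimal sketch of how a list of handlers gets applied to one raw value, in order. The helper name apply_handlers and the sample string are made up for illustration (they are not in the script); the two handlers mirror the "strip the label" and "split on '; '" steps used for the topics field.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script): run a value through
# each handler in turn, feeding the result of one into the next.
sub apply_handlers {
    my ( $val, @handlers ) = @_;
    $val = $_->( $val ) for @handlers;
    return $val;
}

# Same idea as the handlers in the script (s///r needs Perl 5.14+).
my $strip_label = sub { $_[0] =~ s/^[^:]+:\s*//r };    # drop a leading "Label: "
my $split_semi  = sub { [ split /; /, $_[0] ] };       # string -> array ref

# ASSUMPTION: a made-up raw value, only shaped like the scraped text.
my $raw    = 'Topics: Access for disadvantaged; Youth work';
my $topics = apply_handlers( $raw, $strip_label, $split_semi );

print join( ' | ', @$topics ), "\n";
```

The order matters: the split handler returns an array ref, so any string handler has to come before it in the list.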

As I am not so advanced in Perl, I think that extending this script towards a "kind of Mechanize" goes a bit over my head. I need some smaller steps to arrange the parts of the job: collecting the results of 6000 pages.

But yes, it gives me a great idea of an approach we might take. It does not yet navigate across each page, but we have a great starting point to use as a basis.
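One small step towards walking the pages could look like the sketch below. It is only a sketch under assumptions: I assume the listing pages are addressed as "?page=N" counting from 0 (this needs checking against the real site), and page_url and collect_pages are hypothetical names. The per-page work is passed in as a code ref, so the loop can be tried offline with a stand-in before any real request is made.

```perl
use strict;
use warnings;

# ASSUMPTION: page 0 is the plain URL, later pages are "<base>?page=N".
sub page_url {
    my ( $base, $n ) = @_;
    return $n == 0 ? $base : "$base?page=$n";
}

# Hypothetical helper: call $fetch->( $url ) for every page and pool
# the records it returns (each call must return an array ref).
sub collect_pages {
    my ( $base, $last_page, $fetch ) = @_;
    my @all;
    for my $n ( 0 .. $last_page ) {
        my $records = $fetch->( page_url( $base, $n ) );
        last unless @$records;    # stop at the first empty page
        push @all, @$records;
    }
    return \@all;
}

# Stand-in for the real scrape so the sketch runs without network access.
my $fake_fetch = sub { [ { source => $_[0] } ] };

my $base = 'https://europa.eu/youth/volunteering/evs-organisation_en';
my $all  = collect_pages( $base, 2, $fake_fetch );

print scalar( @$all ), " records\n";
```

With the script from this post, the code ref could be something like sub { browse( { %$conf, url => $_[0] } ) }, ideally with a short sleep between requests so 6000 pages do not hammer the server.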


So this is a great step; from here I have to go ahead in a stepwise progression.



Code
use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;

# Handlers clean up one raw value each: string in, cleaned value out.
# Note: the s###r (non-destructive) flag needs Perl 5.14 or later.
my $handler_relurl      = sub { q#https://europa.eu# . $_[0] };        # relative URL -> absolute
my $handler_trim        = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r };       # strip surrounding whitespace
my $handler_val         = sub { $_[0] =~ s#^[^:]+:\s*##r };            # drop a leading "Label: "
my $handler_split       = sub { [ split $_[0], $_[1] ] };              # string -> array ref
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };   # despite the name, splits on "; "
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) };   # splits on ", "

# Each child entry is: [ XPath relative to the parent block, [ handlers applied in order ] ]
my $conf =
{
    url      => q#https://europa.eu/youth/volunteering/evs-organisation_en#,
    parent   => q#//div[@class="vp ey_block block-is-flex"]#,
    children =>
    {
        internal_url => [ q#//a/@href#, [ $handler_relurl ] ],
        external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ],
        title        => [ q#//h4# ],
        topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ],
        location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ],
        hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ],
        pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ],
    }
};

print Dumper browse( $conf );

# Fetch $conf->{url} and return an array ref with one hash per parent block.
sub browse
{
    my $conf = shift;

    my $ref = [ ];

    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
    my $response = $lwp_useragent->get( $conf->{url} );
    die $response->status_line unless $response->is_success;
    my $content = $response->decoded_content;

    my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
    my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );
    for my $node ( @nodes )
    {
        push @$ref, { };

        while ( my ( $key, $val ) = each %{$conf->{children}} )
        {
            my $xpath = $val->[0];
            my $handlers = $val->[1] // [ ];

            # The leading "." makes the XPath relative to this block; skip the key if nothing matches.
            $val = ($node->findvalues( qq#.$xpath# ))[0] // next;
            $val = $_->( $val ) for @$handlers;
            $ref->[-1]->{$key} = $val;
        }
    }

    # Free the parsed tree; this matters once many pages are processed in one run.
    $html_treebuilder_xpath->delete;

    return $ref;
}


Output of the first block:

Code

{
    'internal_url' => 'https://europa.eu/youth/volunteering/organisation/948417016_en',
    'external_url' => 'http://www.apd.ge',
    'location' => 'Tbilisi, Georgia',
    'title' => '"Academy for Peace and Development" Union',
    'topics' => [
        'Access for disadvantaged',
        'Youth (Participation, Youth Work, Youth Policy)',
        'Intercultural/intergenerational education and (lifelong)learning'
    ],
    'pic_number' => '948417016',
    'hand' => [
        'Receiving',
        'Sending'
    ]
}




Dear all, this gives me a great idea of an approach we might take, and a great starting point to use as a basis.

From here I have to go ahead in a stepwise progression, in small bits.


(This post was edited by dilbert on May 7, 2018, 11:04 AM)



