
Home: Perl Programming Help: Beginner:
since php-parser attempts failed i need to get a perl-approach

 



dilbert
User

Feb 16, 2018, 3:32 AM

Post #1 of 14 (6700 views)
since php-parser attempts failed i need to get a perl-approach

hello dear Perl-Gurus

I tried to retrieve the contents of a div from an external site with PHP and XPath. Below is the story, and, after the PHP explanations, my very first steps toward a Perl approach to this problem.

What happened: since this can be a bit tricky, I made several attempts using various approaches in PHP. Now I want to try Perl.



goal: I need the following data:

Version:
Last updated:
Active installations:
Tested up to:

See for example - view-source:https://wordpress.or...wp-job-manager/

Version: 1.29.3
Last updated: 5 days ago
Active installations: 100,000+

I want to have a little database that runs locally, with this data for my favorite plugins, and fetch the data automatically with a cron job.
Well, after the PHP trials I need to know how to do this in Perl instead.
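Once a working script exists, it can simply be registered with cron for the automatic refresh. A minimal sketch of such a crontab entry (the script and log paths here are hypothetical placeholders):

```shell
# crontab -e
# run the plugin scraper once a day at 06:00 (paths are placeholders)
0 6 * * * /usr/bin/perl /home/martin/dev/perl/plugin_scraper.pl >> /home/martin/dev/perl/plugin_scraper.log 2>&1
```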



btw: this is my XPath: //*[@id="post-15991"]/div[4]/div[1]
this is the URL: https://wordpress.org/plugins/wp-job-manager/


I tried to retrieve the contents of a div from the external site with PHP and XPath. This is an excerpt from the page, showing the relevant code. Note: I also tried adding @ on the class and /a at the end of my query; after that I use saveHTML() to get the result. See my test:


see the subsequent code:


Code
<?php 

$remote = "https://wordpress.org/plugins/participants-database/";
$doc = new DOMDocument();
@$doc->loadHTMLFile($remote);
$xpath = new DOMXpath($doc);
$node = $xpath->query('//*[@id="post-519"]/div[4]/div[1]/ul/li[2]');
echo $node->item(0)->nodeValue;

?>


But the output looks like this:


Code
see the results:  martin@linux-3645:~/dev/php> php p20.php 
PHP Notice: Trying to get property of non-object in /home/martin/dev/php/p20.php on line 8
martin@linux-3645:~/dev/php> php p20.php



background: this is how I got the XPath, using Google Chrome, from the page I want to pull the data off.

e.g. these HTML lines:

Code
<li>Requires WordPress Version: <strong>4.3.1</strong></li>
<li>Tested up to: <strong>4.9.2</strong></li>



background: I need the data of all my favorite plugins and want to have it in a DB or a calc sheet, so there are approx. 70 pages to scrape.

Here is the full XPath for the example:


Code
//*[@id="post-15991"]/div[4]/div[1]



and job-board-manager:


Code
//*[@id="post-519"]/div[4]/div[1]/ul/li[1] 
//*[@id="post-519"]/div[4]/div[1]/ul/li[2]
//*[@id="post-519"]/div[4]/div[1]/ul/li[3]
//*[@id="post-519"]/div[4]/div[1]/ul/li[7]


I used this method to get the XPath in Google Chrome:

Quote
Right-click "Inspect" on the item you are trying to find the XPath for.
Right-click on the highlighted area in the console.
Go to Copy > Copy XPath.


see the subsequent code:


Code
 
<?php

include('simple_html_dom');
$url = 'https://wordpress.org/plugins/wp-job-manager/';
$html = file_get_html($url);
$text = array();
foreach($html->find('DIV[class="widget plugin-meta"]') as $text) {
$text[] = $text->plaintext;
}
print_r($headlines);

?>








Code
 
martin@linux-3645:~/dev/php> php p100.php

PHP Warning: include(simple_html_dom): failed to open stream: No such file or directory in /home/martin/dev/php/p100.php on line 4
PHP Warning: include(): Failed opening 'simple_html_dom' for inclusion (include_path='.:/usr/share/php5:/usr/share/php5/PEAR') in /home/martin/dev/php/p100.php on line 4
PHP Fatal error: Call to undefined function file_get_html() in /home/martin/dev/php/p100.php on line 6
martin@linux-3645:~/dev/php>






the idea:

I try to parse the site using Perl (under perlbrew) and XML::LibXML.



Code
 
my $parser = XML::LibXML->new();

my $doc = $parser->load_html( location => "http://www.example.com/", recover => 2 );
foreach my $x ( $doc->findnodes('*xPath*') ) {
    ...
}



Well, I think this code should give me a first approach to a working model.
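A slightly fuller, self-contained version of that first approach, parsing an HTML string instead of a live URL (the div class and content below are made up just for the demonstration; swap in location => $url and a real XPath for the actual page):

```perl
use strict;
use warnings;
use XML::LibXML;

# Parse an HTML snippet; for the real page use location => $url instead.
my $html = '<html><body><div class="plugin-meta"><p>Version: 1.29.3</p></div></body></html>';
my $doc  = XML::LibXML->load_html( string => $html, recover => 2 );

# findnodes returns all matching nodes; textContent gives their text
for my $node ( $doc->findnodes('//div[@class="plugin-meta"]/p') ) {
    print $node->textContent, "\n";
}
```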


(This post was edited by dilbert on Feb 16, 2018, 6:48 AM)


Zhris
Enthusiast

Feb 18, 2018, 12:24 AM

Post #2 of 14 (6676 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

Hi Dilbert,

You didn't keep us updated on the progress you made on your Europa task; you were last attempting to implement paging. The code I provided, with a change to the conf, would apply here too. I have since written a comprehensive scraper module to supersede Web::Scraper, with iteration capabilities, though it needs a bit of a rework.

I haven't written PHP in years now, so I couldn't tell you what's wrong with your code off the top of my head, but PHP is perfectly capable of this task.

I have never used XML::LibXML directly to parse HTML; I typically recommend HTML::TreeBuilder::XPath. It simplifies executing XPath queries, and it inherits from HTML::TreeBuilder, which is a featuresome HTML parser, which in turn inherits from HTML::Element, a featuresome HTML extractor/modifier. Together they create a very powerful HTML processing package that covers all you'd need and more.

If you are having trouble getting to grips with the basics of web scraping in Perl, I'd be happy to go through it with you. Every scrape is different, if you don't understand the process behind one, you will have difficulty writing another.

Regards,

Chris


(This post was edited by Zhris on Feb 18, 2018, 12:26 AM)


dilbert
User

Feb 18, 2018, 8:07 AM

Post #3 of 14 (6668 views)
Re: [Zhris] since php-parser attempts failed i need to get a perl-approach [In reply to]

hi dear Chris,

great to hear from you - your new projects sound very interesting.

I'll keep you posted with all the updates on the Europa task.

After parsing each page, check for the existence of the next link at the bottom. When you have arrived on page 292, there are no more pages, so you are done and can exit the loop with e.g. last.

At first glance, the issue of scraping from page to page can be solved via different approaches. We have the pagination at the bottom of the page; see for example:



Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5


and



Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6


and



Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7


and so forth. Well, we can set these URLs as a base:

if we have an array from which we load the URLs that need to be visited, we would come across all the pages. Note: we have more than 6000 results, and each page holds 21 little entries that each represent one record, so we have approx. 305 pages to visit.
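That page count can be computed in the script rather than by hand (21 entries per page, as described above; the exact result count is whatever the site reports - 6405 here is just an illustrative figure that yields the ~305 pages mentioned):

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# total pages needed when each result page shows a fixed number of entries
sub page_count {
    my ( $results, $per_page ) = @_;
    return ceil( $results / $per_page );
}

print page_count( 6405, 21 ), "\n";    # e.g. 6405 results -> 305 pages
```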

Regarding the loop-process:

Well Chris, after parsing each page, we have to check for the existence of the next link at the bottom of the page.



When we have arrived on page 292, there are no more pages left, so we are done counting upward and can exit the loop with e.g. last.


Dear Chris, I try to achieve more steps in the Europa task. It would be great if I can go ahead with the great parser approach that you revealed. Many thanks - I am very glad.

I will keep you updated on the progress I make on the Europa paging implementation. The code you provided, with a change to the conf, would apply here too. Of course!

Dear Chris, the following sounds very good:

In Reply To
I have since written a comprehensive scraper module to supersede Web::Scraper with iteration capabilities, though it needs a bit of a rework.


I would be more than glad to have some insights into this project - it sounds very promising. You know that I am learning, and with day-to-day projects like parsers and scrapers I can learn quite a lot.



In Reply To
I typically recommend HTML::TreeBuilder::XPath. This is because it simplifys executing xpath queries, it inherits from HTML::TreeBuilder which is a featuresome html parser, which in turn inherits from HTML::Element which is a featuresome html extractor/modifier. Together they create a very powerful html processing package that cover all you'd need and more.


Well, it is a great idea to parse with HTML::TreeBuilder::XPath; I thought I'd do some preliminary tests first. And yes, I am pretty sure this is a great chance to learn.


My way to get the XPath is via Google Chrome, on the page I want the data off: https://wordpress.org/plugins/wp-job-manager/ - the goal is still the values of the Version / Last updated / Active installations / Tested up to lines.


Well, I think I have to use the findvalue function:

The findvalue function in HTML::TreeBuilder::XPath returns a concatenation of any values found by the xpath query. Why does it do this, and how could a concatenation of the values be useful to anyone?

Why does it do this?

When we call findvalue, we're requesting a single scalar value. If there are multiple matches, they have to be combined into a single value somehow.

From the documentation for HTML::TreeBuilder::XPath:

findvalue ($path)

...If the path returns a NodeSet, $nodeset->xpath_to_literal is called automatically for us (and thus a Tree::XPathEngine::Literal is returned).

And from the documentation for Tree::XPathEngine::NodeSet:

xpath_to_literal()


Returns the concatenation of all the string-values of all the nodes in the list.

An alternative would be to return the Tree::XPathEngine::NodeSet object so the user could iterate through the results himself, but the findvalues method already returns a list.

How could a concatenation of the values be useful to anyone? For example:


Code
   
use strict;
use warnings 'all';
use 5.010;

use HTML::TreeBuilder::XPath;

my $content = do { local $/; <DATA> };
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

say $tree->findvalue('//p');





__DATA__
<p>HTML is just text.</p>
<p>It can still make sense without the markup.</p>

Output:

HTML is just text.It can still make sense without the markup.

Usually, though, it makes more sense to get a list of matches and iterate through them instead of concatenating them, so we can use findvalues (plural) when there could be multiple matches.
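To illustrate the difference, here is the same two-paragraph document queried with findvalues (plural), built from an inline string so the snippet is self-contained:

```perl
use strict;
use warnings 'all';
use 5.010;

use HTML::TreeBuilder::XPath;

my $content = '<p>HTML is just text.</p><p>It can still make sense without the markup.</p>';
my $tree    = HTML::TreeBuilder::XPath->new_from_content($content);

# one string per matching node, instead of one concatenated string
my @values = $tree->findvalues('//p');
say for @values;
```

Each match should come out on its own line here, where findvalue produced one concatenated string.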


The goal, again: I need that data (Version / Last updated / Active installations / Tested up to) for each of these plugin pages:

Code
https://wordpress.org/plugins/wp-job-manager/
https://wordpress.org/plugins/participants-database/
https://wordpress.org/plugins/amazon-link/
https://wordpress.org/plugins/simple-membership/
https://wordpress.org/plugins/scrapeazon/


I want to have a little database that runs locally, with this data for my favorite plugins.
Finally, I want to keep this data chart updated by fetching the data automatically with a cron job.


In Reply To
If you are having trouble getting to grips with the basics of web scraping in Perl, I'd be happy to go through it with you. Every scrape is different, if you don't understand the process behind one, you will have difficulty writing another.


Well, dear Chris, I understand: I can learn with each step and each new scraping/parsing task. This is a great new chance to dive into Perl.

Greetings
dilbert


Zhris
Enthusiast

Feb 18, 2018, 1:11 PM

Post #4 of 14 (6662 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

Hi Dilbert,


Quote
After parsing each page, check for the existence of the next link at the bottom


That is an excellent idea, and likely your best option. In a rough script I tested, I fetched the total results using //span[@class="ey_badge"], then the max page using my $page_max = $results / 21; $page_max = int( $page_max ) == $page_max ? $page_max - 1 : int( $page_max );. But stick to your plan.


Quote
if we have an array from which we load the urls that need to be visited - we would come across all the pages.


Yes, this would be fine. Preferably, use a URI object to update the URL's page param per iteration of a loop until the max page is reached.
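A minimal sketch of that suggestion - one URI object whose page param is updated per iteration (the base URL is the one from the thread; $page_max is hardcoded to 2 here just for the demonstration):

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new('http://europa.eu/youth/volunteering/evs-organisation_en');

my $page_max = 2;    # placeholder; derive this from the result count in practice
for my $page ( 0 .. $page_max ) {
    $uri->query_form( page => $page );
    print $uri->as_string, "\n";
    # ...fetch and parse $uri here...
}
```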


Quote
i would be more than glad to have some insights into this project - it sounds very promising


It's a work in progress, a first draft. I need to modularise more components, as the interface from a user's perspective isn't too clean nor intuitive. It also needs to cover various other scenarios to ensure it's capable of every possible scrape. The code won't make much sense alone, but I plugged in Europa and here is a snippet:


Code
our $iterator_organizations = sub
{
    my ( $browser, $parent ) = @_;

    my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;

    my $nodes = $browser->nodes( url => $url );

    my $iterator = sub
    {
        return shift @$nodes;
    };

    return ( $iterator, 1 );
};

our $iterator_organizations_b = sub
{
    my ( $browser, $parent ) = @_;

    my $url   = q#https://europa.eu/youth/volunteering/evs-organisation_en#;
    my $uri   = URI->new( $url );
    my $xpath = q#//div[@class="vp ey_block block-is-flex"]#;
    my $nodes = [ ];
    my $page  = 0;

    my $results  = $parent->{results};
    my $page_max = $results / 21;
    $page_max = int( $page_max ) == $page_max ? $page_max - 1 : int( $page_max );

    my $iterator_uri = sub
    {
        $uri->query_form( page => $page++ );

        return $page > 2 ? undef : $uri; # $page_max;
    };

    my $iterator_node = sub
    {
        unless ( @$nodes )
        {
            my $uri = $iterator_uri->( ) // return undef;

            my $options = $page == 1 ? { tree => $parent->{_node} } : { url => $uri->as_string };

            $nodes = $browser->nodes( %$options, xpath => $xpath );
        }

        return shift @$nodes;
    };

    return ( $iterator_node, 0 );
};

our $iterator_organization = sub
{
    my ( $browser, $parent ) = @_;

    my $url = $parent->{internal_url};

    my $nodes = $browser->nodes( url => $url );

    my $iterator = sub
    {
        return shift @$nodes;
    };

    return ( $iterator, 1 );
};

#########################

sub organizations
{
    my ( $self, $options ) = ( shift, { @_ } );

    my $map =
    [
        $Massweb::Browser::Europa::iterator_organizations,
        results => q#.//span[@class="ey_badge"]#,
        organizations =>
        [
            $Massweb::Browser::Europa::iterator_organizations_b,
            internal_url => [ q#.//a/@href#, $Massweb::Browser::Europa::handler_url ],
            external_url => [ q#.//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, $Massweb::Browser::handler_trim ],
            title => q#.//h4#,
            topics => [ q#.//div[@class="org_cord"]#, $Massweb::Browser::handler_val, $Massweb::Browser::handler_list_colon ],
            location => [ q#.//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, $Massweb::Browser::handler_trim ],
            hand => [ q#.//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, $Massweb::Browser::handler_trim, $Massweb::Browser::handler_list_comma ],
            pic_number => [ q#.//p[contains(.,'PIC no')]#, $Massweb::Browser::handler_val ],
            recruiting => [ q#boolean(.//i[@class="fa fa-user-times fa-lg"])#, $Massweb::Browser::handler_bool_rev ],
            _ => \&organization,
        ],
    ];

    my $organizations = $self->browse( map => $map );

    return $organizations;
}

sub organization
{
    my ( $self, $options ) = ( shift, { @_ } );

    my $map =
    [
        sub { $Massweb::Browser::Europa::iterator_organization->( $_[0], $options ) },
        #title => q#.//h1#,
        description => q#.//div[@class="ey_vp_detail_page"]/p#,
    ];

    my $organization = $self->browse( map => $map );

    return $organization;
}


In general, the map represents the resultant data structure. The iterator's purpose is quite straightforward: it should return a node each time it is called, or undef to finish. The organizations paging iterator shifts each node off an array; once the array is empty it calls a URL iterator which increments the page, until there are no pages left.
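That iterator contract can be shown in isolation - a closure that shifts items off an array and returns undef when exhausted (plain Perl, no scraping involved):

```perl
use strict;
use warnings;

# returns a closure that yields one item per call, undef when done
sub make_iterator {
    my @items = @_;
    return sub { shift @items };
}

my $it = make_iterator(qw( node_a node_b node_c ));
while ( defined( my $item = $it->() ) ) {
    print "$item\n";
}
```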

It's too complex to go into much detail right now, but hopefully in the near future.


Quote
i want to have a little database that runs locally - with those data of my favorite-plugins.


Keep working at it; if you bump into a specific issue, or there's an aspect you don't understand, feel absolutely free to share with us and we will do our best to help you move forward. Try to produce a working script: start by looping over each plugin from a hardcoded array, fetching the relevant page's content and putting it through an XPath module of your choice.
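As a starting point in that direction, here is a sketch that runs the extraction step against a saved snippet of a plugin page's meta widget. The markup below is a simplified assumption based on the thread's excerpts, not the live page; in the real script $content would come from an LWP::UserAgent fetch of each hardcoded plugin URL.

```perl
use strict;
use warnings 'all';
use 5.010;

use HTML::TreeBuilder::XPath;

# Simplified stand-in for the "widget plugin-meta" block; verify the real
# markup via view-source and adjust the xpath accordingly.
my $content = <<'HTML';
<div class="widget plugin-meta">
  <ul>
    <li>Version: <strong>1.29.3</strong></li>
    <li>Last updated: <strong>5 days ago</strong></li>
    <li>Active installations: <strong>100,000+</strong></li>
    <li>Tested up to: <strong>4.9.2</strong></li>
  </ul>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

# split each "Label: value" li into a key/value pair
my %meta;
for my $li ( $tree->findvalues('//div[@class="widget plugin-meta"]//li') ) {
    my ( $key, $val ) = $li =~ m/^\s*([^:]+):\s*(.+?)\s*$/ or next;
    $meta{$key} = $val;
}

say "$_: $meta{$_}" for sort keys %meta;
```

Looping this over the hardcoded plugin URLs, then writing %meta rows into the local database, would be the next small steps.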

Best regards,

Chris


(This post was edited by Zhris on Feb 18, 2018, 1:12 PM)


dilbert
User

Feb 19, 2018, 1:35 AM

Post #5 of 14 (6650 views)
Re: [Zhris] since php-parser attempts failed i need to get a perl-approach [In reply to]

dear Chris

many many thanks - this was more than expected; I am overwhelmed by such amazing help and support.

I will work as advised - and yes, I'll keep you informed.

Again, many thanks for all you did! You are encouraging me to go ahead, and I will do so.

I am so glad to be here - in this great place!


best regards
Dilbert ;)


dilbert
User

May 6, 2018, 6:28 AM

Post #6 of 14 (4889 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

 
hello dear Chris, hello dear all - I get some errors while trying to debug the following code.




Quote

martin@linux-3645:~/dev/perl> perl eu.pl
syntax error at eu.pl line 81, near "our "
Global symbol "$iterator_organizations" requires explicit package name at eu.pl line 81.
Can't use global @_ in "my" at eu.pl line 84, near "= @_"
Missing right curly or square bracket at eu.pl line 197, at end of line
Execution of eu.pl aborted due to compilation errors.
martin@linux-3645:~/dev/perl> ^C
martin@linux-3645:~/dev/perl>


It should fetch the data of approx. 6000 entries from http://europa.eu/youth/volunteering/evs-organisation#open


see the code





Code
 
use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;

my $handler_relurl = sub { q#https://europa.eu# . $_[0] };
my $handler_trim = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r };
my $handler_val = sub { $_[0] =~ s#^[^:]+:\s*##r };
my $handler_split = sub { [ split $_[0], $_[1] ] };
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) };

my $conf =
{
url => q#https://europa.eu/youth/volunteering/evs-organisation_en#,
parent => q#//div[@class="vp ey_block block-is-flex"]#,
children =>
{
internal_url => [ q#//a/@href#, [ $handler_relurl ] ],
external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ],
title => [ q#//h4# ],
topics => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ],
location => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ],
hand => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ],
pic_number => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ],
}
};

print Dumper browse( $conf );

sub browse
{
my $conf = shift;

my $ref = [ ];

my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
my $response = $lwp_useragent->get( $conf->{url} );
die $response->status_line unless $response->is_success;
my $content = $response->decoded_content;

my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );
for my $node ( @nodes )
{
push @$ref, { };

while ( my ( $key, $val ) = each %{$conf->{children}} )
{
my $xpath = $val->[0];
my $handlers = $val->[1] // [ ];

$val = ($node->findvalues( qq#.$xpath# ))[0] // next;
$val = $_->( $val ) for @$handlers;
$ref->[-1]->{$key} = $val;
}
}

return $ref;
}

{
'internal_url' => 'https://europa.eu/youth/volunteering/organisation/948417016_en',
'external_url' => 'http://www.apd.ge',
'location' => 'Tbilisi, Georgia',
'title' => '"Academy for Peace and Development" Union',
'topics' => [
'Access for disadvantaged',
'Youth (Participation, Youth Work, Youth Policy)',
'Intercultural/intergenerational education and (lifelong)learning'
],
'pic_number' => '948417016',
'hand' => [
'Receiving',
'Sending'
]
}

our $iterator_organizations = sub

{
my ( $browser, $parent ) = @_;

my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;

my $nodes = $browser->nodes( url => $url );

my $iterator = sub
{
return shift @$nodes;
};

return ( $iterator, 1 );


our $iterator_organizations_b = sub
{
my ( $browser, $parent ) = @_;

my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;
my $uri = URI->new( $url );
my $xpath = q#//div[@class="vp ey_block block-is-flex"]#;
my $nodes = [ ];
my $page = 0;

my $results = $parent->{results};
my $page_max = $results / 21;
$page_max = int( $page_max ) == $page_max ? $page_max-- : int( $page_max ) ;

my $iterator_uri = sub
{
$uri->query_form( page => $page++ );

return $page > 2 ? undef : $uri ; # $page_max;
};

my $iterator_node = sub
{
unless ( @$nodes )
{
my $uri = $iterator_uri->( ) // return undef;

my $options = $page == 1 ? { tree => $parent->{_node} } : { url => $uri->as_string };

$nodes = $browser->nodes( %$options, xpath => $xpath );
}

return shift @$nodes;
};

return ( $iterator_node, 0 );
};

our $iterator_organization = sub
{
my ( $browser, $parent ) = @_;

my $url = $parent->{internal_url};

my $nodes = $browser->nodes( url => $url );

my $iterator = sub
{
return shift @$nodes;
};

return ( $iterator, 1 );
};


sub organizations
{
my ( $self, $options ) = ( shift, { @_ } );

my $map =
[
$Massweb::Browser::Europa::iterator_organizations,
results => q#.//span[@class="ey_badge"]#,
organizations =>
[
$Massweb::Browser::Europa::iterator_organizations_b,
internal_url => [ q#.//a/@href#, $Massweb::Browser::Europa::handler_url ],
external_url => [ q#.//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, $Massweb::Browser::handler_trim ],
title => q#.//h4#,
topics => [ q#.//div[@class="org_cord"]#, $Massweb::Browser::handler_val, $Massweb::Browser::handler_list_colon ],
location => [ q#.//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, $Massweb::Browser::handler_trim ],
hand => [ q#.//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, $Massweb::Browser::handler_trim, $Massweb::Browser::handler_list_comma ],
pic_number => [ q#.//p[contains(.,'PIC no')]#, $Massweb::Browser::handler_val ],
recruiting => [ q#boolean(.//i[@class="fa fa-user-times fa-lg"])#, $Massweb::Browser::handler_bool_rev ],
_ => \&organization,
],
];

my $organizations = $self->browse( map => $map );

return $organizations;
}

sub organization
{
my ( $self, $options ) = ( shift, { @_ } );

my $map =
[
sub { $Massweb::Browser::Europa::iterator_organization->( $_[0], $options ) },
#title => q#.//h1#,
description => q#.//div[@class="ey_vp_detail_page"]/p#,
];

my $organization = $self->browse( map => $map );

return $organization;
}



BillKSmith
Veteran

May 6, 2018, 1:18 PM

Post #7 of 14 (4872 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

Your code fails to compile, so it does not fetch any data.

The line numbers in your posted error messages are off by one. I assume that is because you did not include the initial #! line in your post. When I added it, I reproduced your error messages exactly.

I believe that the missing curly bracket belongs at line 96.

There is nothing wrong with lines 81 or 84. Those errors are caused by the block of code in lines 64 through 79, which is probably an editing error.

When I commented out that block, the syntax was correct, but I got a large number of warnings about variables used only once. Fix this much and post again if you still have problems.
Good Luck,
Bill


dilbert
User

May 6, 2018, 4:30 PM

Post #8 of 14 (4860 views)
Re: [BillKSmith] since php-parser attempts failed i need to get a perl-approach [In reply to]

hello and good evening - first of all, many many thanks for the quick help. I have had a closer look at the code.







Code
#!C:\Perl\bin\perl

use warnings;

BEGIN {
    open my $file1, "+>>", "links.txt";
    select($file1);
}

use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;



and subsequently I thought that I have to comment out the following lines: 71 to 121 -

in other words, the following lines:


Code
 
{
'internal_url' => 'https://europa.eu/youth/volunteering/organisation/948417016_en',
'external_url' => 'http://www.apd.ge',
'location' => 'Tbilisi, Georgia',
'title' => '"Academy for Peace and Development" Union',
'topics' => [
'Access for disadvantaged',
'Youth (Participation, Youth Work, Youth Policy)',
'Intercultural/intergenerational education and (lifelong)learning'
],
'pic_number' => '948417016',
'hand' => [
'Receiving',
'Sending'
]
}

our $iterator_organizations = sub

{
my ( $browser, $parent ) = @_;

my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;

my $nodes = $browser->nodes( url => $url );

my $iterator = sub
{
return shift @$nodes;
};

return ( $iterator, 1 );

};

our $iterator_organizations_b = sub
{
my ( $browser, $parent ) = @_;

my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;
my $uri = URI->new( $url );
my $xpath = q#//div[@class="vp ey_block block-is-flex"]#;
my $nodes = [ ];
my $page = 0;

my $results = $parent->{results};
my $page_max = $results / 21;
$page_max = int( $page_max ) == $page_max ? $page_max-- : int( $page_max ) ;

my $iterator_uri = sub
{
$uri->query_form( page => $page++ );

return $page > 2 ? undef : $uri ; # $page_max;
};



Well, I have to look into why it does not run at the moment - I need to figure out what stops the code from working.

Meanwhile I look forward to hearing from you.

regards martin


BillKSmith
Veteran

May 6, 2018, 9:28 PM

Post #9 of 14 (4851 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

I do not understand. The line numbers do not agree with your original post. When I remove the indicated text (lines 64 through 118) from the original post, I get a huge number of errors.
Good Luck,
Bill


Zhris
Enthusiast

May 7, 2018, 8:19 AM

Post #10 of 14 (4829 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

Hi,

It appears you've merged two different variations of the code; they are not compatible with each other. The second was an excerpt from a more comprehensive module set which I wrote merely to test various designs. I shared one of those designs so that you could use some of its concepts, particularly surrounding pagination. The "crawler" module set has radically changed now; it is very well organized and much simpler than before, but incomplete.

I don't even have a copy of the original script I posted an excerpt of above. I have attached the latest code I have with regards to Europa; it supports pagination of organizations and compiles just fine. But note, it was thrown together, so it should be used cautiously. Also note, at line 186 there is a condition limiting it to iterating just two pages; that needs to be replaced with $page_max to iterate them all. In other words, it probably needs work.

Chris


(This post was edited by Zhris on May 7, 2018, 8:20 AM)
Attachments: europa-b.pl (8.30 KB)


dilbert
User

May 7, 2018, 10:50 AM

Post #11 of 14 (4813 views)
Re: [Zhris] since php-parser attempts failed i need to get a perl-approach [In reply to]

hello dear Zhris, hello dear Bill,

first of all, many thanks to you both for the quick replies.
The first steps in creating an approach can be found here:

http://perlguru.com/gforum.cgi?post=84635;sb=post_latest_reply;so=ASC;forum_view=forum_view_collapsed;guest=52395127

We have a little script that extracts the data out of each block and cleans it up a little.

A great step: at this point the browse function is generic; it takes an input ref which contains the url and the xpaths of the parent and children in order to construct the output ref.

As I am not so advanced in Perl, I think that extending this script towards a kind of "mechanize" goes a bit over my head. I need smaller steps to arrange the parts of the job: collecting the results of 6000 pages.

But yes, it gives me a great idea of an approach we might take. It does not yet navigate across each page, but it is a great starting point to use as a basis.



Code
use strict;  
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;

my $handler_relurl = sub { q#https://europa.eu# . $_[0] };
my $handler_trim = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r };
my $handler_val = sub { $_[0] =~ s#^[^:]+:\s*##r };
my $handler_split = sub { [ split $_[0], $_[1] ] };
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) };

my $conf =
{
url => q#https://europa.eu/youth/volunteering/evs-organisation_en#,
parent => q#//div[@class="vp ey_block block-is-flex"]#,
children =>
{
internal_url => [ q#//a/@href#, [ $handler_relurl ] ],
external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ],
title => [ q#//h4# ],
topics => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ],
location => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ],
hand => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ],
pic_number => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ],
}
};

print Dumper browse( $conf );

sub browse
{
my $conf = shift;

my $ref = [ ];

my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
my $response = $lwp_useragent->get( $conf->{url} );
die $response->status_line unless $response->is_success;
my $content = $response->decoded_content;

my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );
for my $node ( @nodes )
{
push @$ref, { };

while ( my ( $key, $val ) = each %{$conf->{children}} )
{
my $xpath = $val->[0];
my $handlers = $val->[1] // [ ];

$val = ($node->findvalues( qq#.$xpath# ))[0] // next;
$val = $_->( $val ) for @$handlers;
$ref->[-1]->{$key} = $val;
}
}

return $ref;
}


Output of the first block:

Code

{
    'internal_url' => 'https://europa.eu/youth/volunteering/organisation/948417016_en',
    'external_url' => 'http://www.apd.ge',
    'location' => 'Tbilisi, Georgia',
    'title' => '"Academy for Peace and Development" Union',
    'topics' => [
        'Access for disadvantaged',
        'Youth (Participation, Youth Work, Youth Policy)',
        'Intercultural/intergenerational education and (lifelong)learning'
    ],
    'pic_number' => '948417016',
    'hand' => [
        'Receiving',
        'Sending'
    ]
}
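To make the handler chains concrete: each raw node value is piped through its handlers in order, which is what turns e.g. the topics string into the array ref above. A standalone sketch using the same handler definitions as the script (the input string is made up for illustration):

```perl
use strict;
use warnings;

# same handlers as in the script above
my $handler_val         = sub { $_[0] =~ s#^[^:]+:\s*##r };
my $handler_split       = sub { [ split $_[0], $_[1] ] };
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };

# "Topics: A; B" -> strip the "Topics:" label, then split on "; "
my $val = q#Topics: Access for disadvantaged; Social dialogue#;
$val = $_->( $val ) for ( $handler_val, $handler_split_colon );

print join( q#|#, @$val ), qq#\n#;   # Access for disadvantaged|Social dialogue
```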




Dear all, this gives me a great idea of an approach we might take: we have a great starting point to use as a basis.

A great step; from here I have to go ahead in a stepwise progression, in small bits.


(This post was edited by dilbert on May 7, 2018, 11:04 AM)


Zhris
Enthusiast

May 7, 2018, 12:00 PM

Post #12 of 14 (4807 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to] Can't Post

Only briefly tested:


Code
use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;
$| = 1;

my $handler_relurl      = sub { q#https://europa.eu# . $_[0] };
my $handler_trim        = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r };
my $handler_val         = sub { $_[0] =~ s#^[^:]+:\s*##r };
my $handler_split       = sub { [ split $_[0], $_[1] ] };
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) };

my $conf_results =
{
    url      => q#https://europa.eu/youth/volunteering/evs-organisation_en#,
    parent   => q#//body#,
    children =>
    {
        results => [ q#//span[@class="ey_badge"]# ],
    }
};

my $conf_pages =
{
    url      => q#https://europa.eu/youth/volunteering/evs-organisation_en?page=%s#,
    parent   => q#//div[@class="vp ey_block block-is-flex"]#,
    children =>
    {
        internal_url => [ q#//a/@href#, [ $handler_relurl ] ],
        external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ],
        title        => [ q#//h4# ],
        topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ],
        location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ],
        hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ],
        pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ],
    }
};

my $results = browse( $conf_results )->[0]->{results};

# 21 results per page, pages are zero-indexed: when the count divides
# evenly, the last page index is one less than results / 21.
my $page_max = $results / 21;
$page_max = int( $page_max ) == $page_max ? $page_max - 1 : int( $page_max );

my $ref = [ ];
for my $page ( 0 .. $page_max )
{
    local $conf_pages->{url} = sprintf $conf_pages->{url}, $page;
    print qq#browsing $conf_pages->{url}\n#;
    push @$ref, @{browse( $conf_pages )};
}
print Dumper $ref;

sub browse
{
    my $conf = shift;

    my $ref = [ ];

    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
    my $response = $lwp_useragent->get( $conf->{url} );
    die $response->status_line unless $response->is_success;
    my $content = $response->decoded_content;

    my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
    my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );
    for my $node ( @nodes )
    {
        push @$ref, { };

        while ( my ( $key, $val ) = each %{$conf->{children}} )
        {
            my $xpath    = $val->[0];
            my $handlers = $val->[1] // [ ];

            $val = ($node->findvalues( qq#.$xpath# ))[0] // next;
            $val = $_->( $val ) for @$handlers;
            $ref->[-1]->{$key} = $val;
        }
    }

    return $ref;
}
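The last-page arithmetic is easy to get wrong (an even division must not add an extra, empty page), so it can help to pull it into a tiny helper that is testable in isolation. A sketch, assuming 21 results per page as in the script above:

```perl
use strict;
use warnings;

# highest zero-based page index for a given result count and page size
sub last_page_index
{
    my ( $results, $per_page ) = @_;
    my $pages = int( $results / $per_page );
    # an exact division means the final page is one index earlier
    $pages-- if $pages > 0 && $results % $per_page == 0;
    return $pages;
}

print last_page_index( 42, 21 ), qq#\n#;   # 1: pages 0 and 1, both full
print last_page_index( 43, 21 ), qq#\n#;   # 2: one extra, partially filled page
```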



dilbert
User

Jun 6, 2018, 9:58 AM

Post #13 of 14 (3402 views)
Re: [Zhris] since php-parser attempts failed i need to get a perl-approach [In reply to] Can't Post

Hello and good evening dear Zhris ,


the script works great and I like it a lot.

Regarding the DB, we have a setup like the following.

The DB data from the above array would look like this (longer fields truncated for brevity):


Code
 
TABLE: organization
+--------+---------------------------+---------------------+------------+----------------------+----------------------+
| org_id | title                     | location            | pic_number | int_url              | ext_url              |
+--------+---------------------------+---------------------+------------+----------------------+----------------------+
| 1      | ZWIAZEK MLODZIEZY MNIEJSZ | Opole, Poland       | 939424146  | https://europa.eu/yo | http://www.bjdm.eu   |
| 2      | Zwiazek Mlodziezy Wiejski | Czestochowa, Poland | 947395412  | https://europa.eu/yo | http://www.zmwczesto |
| 3      | ZWIAZEK POLSKICH KAWALER\ | Krak\x{f3}w, Poland | 941314385  | https://europa.eu/yo | http://www.centrumma |
+--------+---------------------------+---------------------+------------+----------------------+----------------------+

TABLE: org_hand                           TABLE: hand_type
+-------------+--------+--------------+   +--------------+--------------+
| org_hand_id | org_id | hand_type_id |   | hand_type_id | description  |
+-------------+--------+--------------+   +--------------+--------------+
| 1           | 1      | 1            |   | 1            | Receiving    |
| 2           | 1      | 2            |   | 2            | Sending      |
| 3           | 1      | 3            |   | 3            | Coordinating |
| 4           | 2      | 1            |   +--------------+--------------+
| 5           | 2      | 2            |
| 6           | 2      | 3            |
| 7           | 3      | 1            |
+-------------+--------+--------------+

TABLE: org_topic                       TABLE: topic
+--------------+--------+----------+   +----------+-------------------------------------------------------+
| org_topic_id | org_id | topic_id |   | topic_id | description                                           |
+--------------+--------+----------+   +----------+-------------------------------------------------------+
| 1            | 1      | 1        |   | 1        | Youth (Participation, Youth Work, Youth Policy)       |
| 2            | 1      | 2        |   | 2        | Creativity and culture                                |
| 3            | 1      | 3        |   | 3        | Romas and/or other minorities                         |
| 4            | 2      | 4        |   | 4        | Access for disadvantaged                              |
| 5            | 2      | 2        |   | 5        | Early School Leaving / combating failure in education |
| 6            | 2      | 5        |   | 6        | Inclusion - equity                                    |
| 7            | 3      | 6        |   | 7        | Disabilities - special needs                          |
| 8            | 3      | 7        |   | 8        | Social dialogue                                       |
| 9            | 3      | 8        |   +----------+-------------------------------------------------------+
+--------------+--------+----------+








Again, here is an overview of the model of the DB.

Because each organisation has multiple occurrences of hand and topic, the intermediate tables are required.


Code
+------------------+                                       +------------------+
| organization     |                                       | hand_type        |
+------------------+                                       +------------------+
| org_id (PK)      |---+-+     +-------------------+       +------| hand_type_id(PK) |
| title            |   | |     | org_hand          |       |      | description      |
| location         |   | |     +-------------------+       |      +------------------+
| pic_number       |   | |     | org_hand_id (PK)  |       |
| internal_url     |   | +----<| org_id (FK)       |       |
| external_url     |   |       | hand_type_id(FK)  |>------+
+------------------+   |       +-------------------+
                       |                                          +------------------+
                       |                                          | topic            |
                       |       +-------------------+       +------| topic_id (PK)    |
                       |       | org_topic         |       |      | description      |
                       |       +-------------------+       |      +------------------+
                       |       | org_topic_id (PK) |       |
                       +------<| org_id (FK)       |       |
                               | topic_id (FK)     |>------+
                               +-------------------+
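For illustration, part of that model could be deployed to a local SQLite database with DBI along these lines (SQLite dialect, the orgs.db filename, and the exact column definitions are all assumptions; only the hand side is shown, with the seed values taken from the tables above):

```perl
use strict;
use warnings;
use DBI;

# a local SQLite file, matching the idea of a small local database
my $dbh = DBI->connect( 'dbi:SQLite:dbname=orgs.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

# lookup table with a fixed set of values, per the diagram
$dbh->do( q{
    CREATE TABLE IF NOT EXISTS hand_type (
        hand_type_id INTEGER PRIMARY KEY,
        description  TEXT NOT NULL
    )
} );

# intermediate table resolving the many-to-many organisation <-> hand_type
$dbh->do( q{
    CREATE TABLE IF NOT EXISTS org_hand (
        org_hand_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        org_id       INTEGER NOT NULL,
        hand_type_id INTEGER NOT NULL REFERENCES hand_type( hand_type_id )
    )
} );

# hand_type has finitely many values, so it can be seeded at deployment
my $seed = $dbh->prepare( 'INSERT OR IGNORE INTO hand_type VALUES ( ?, ? )' );
$seed->execute( @$_ ) for [ 1, 'Receiving' ], [ 2, 'Sending' ], [ 3, 'Coordinating' ];
```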


Again, see here: http://europa.eu/youth/volunteering/evs-organisation#open

I guess that we can narrow down the concept a bit...
What do you think?

Love to hear from you.

Zhris
Enthusiast

Jun 8, 2018, 3:06 AM

Post #14 of 14 (3262 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to] Can't Post

Hi,


Quote
the script works great


I recommend at minimum a couple of changes:

- move my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 ); out of the browse function; there's no point reinstantiating it for every request.
- we previously discussed various ways to handle the last page. The way I handle it above isn't without its limitations: if the website is updated during the process, the max page may change. Instead, use the way you suggested: check whether the next or last pagination button exists.
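Checking the pager instead could look roughly like the sketch below; note that the li[@class="pager-next"] XPath is purely a guess at the pager markup and must be verified against the live page source:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# return the href of the "next" pager link, or undef on the last page;
# the XPath here is an assumed pager structure, not the site's actual one
sub next_page_url
{
    my $content = shift;
    my $tree = HTML::TreeBuilder::XPath->new_from_content( $content );
    my ( $href ) = $tree->findvalues( q#//li[@class="pager-next"]/a/@href# );
    $tree->delete;
    return $href;
}

# the fetch loop then simply follows the trail until it ends, e.g.
# (fetch and parse_blocks are hypothetical stand-ins for the LWP
# request and the node extraction already done inside browse()):
#
# my $url = $start_url;
# while ( defined $url )
# {
#     my $content = fetch( $url );
#     push @$ref, @{ parse_blocks( $content ) };
#     $url = next_page_url( $content );
# }
```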


Quote
regarding the db


Your design looks fine, just a couple of recommendations:

- you probably don't need the org_topic relationship table between organization and topic, since topics are unique.
- you could construct the hand_type table on database deployment, since its values are finite, and pre-select it into a ref before the process.

You could update the existing code to something along the lines of the following, which helps separate browse and database logic:


Code
# replace the existing accumulation:
#   push @$ref, @{browse( $conf_pages )};
# with a per-record insert:
insert( $_ ) for @{browse( $conf_pages )};

sub insert
{
    my $organization = shift;

    ...
}
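A hypothetical body for that insert sub, using DBI with SQLite and the table/column names from the schema diagram earlier in the thread (the orgs.db filename and every identifier here are assumptions to adjust to the real deployment):

```perl
use strict;
use warnings;
use DBI;

# assumed local SQLite database; created here so the sketch is runnable
my $dbh = DBI->connect( 'dbi:SQLite:dbname=orgs.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );
$dbh->do( q{
    CREATE TABLE IF NOT EXISTS organization (
        org_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        title        TEXT,
        location     TEXT,
        pic_number   TEXT,
        internal_url TEXT,
        external_url TEXT
    )
} );

# insert one scraped organization hashref, return its new org_id
sub insert
{
    my ( $dbh, $organization ) = @_;

    my $sth = $dbh->prepare_cached( q{
        INSERT INTO organization ( title, location, pic_number, internal_url, external_url )
        VALUES ( ?, ?, ?, ?, ? )
    } );
    $sth->execute( @{$organization}{qw( title location pic_number internal_url external_url )} );
    return $dbh->last_insert_id( undef, undef, 'organization', 'org_id' );
}

# usage inside the page loop:
# insert( $dbh, $_ ) for @{ browse( $conf_pages ) };
```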


Chris

 
 

