since php-parser attempts failed i need to get a perl-approach

 



dilbert
User

Feb 16, 2018, 3:32 AM

Post #1 of 5 (1500 views)
since php-parser attempts failed i need to get a perl-approach

hello dear Perl gurus,


I tried to retrieve the contents of a div from an external site with PHP and XPath. Below is the story, and after the PHP explanations, my very first steps towards a Perl approach to this problem.


What happened:
as this is sometimes a bit tricky, I made several attempts and used various approaches in PHP - now I want to try Perl.


The PHP story, as it went:

This is an excerpt from the page, showing the relevant code. Note: I tried everything - among other things adding @ on the class attribute and appending to the end of my query. After that, I use saveHTML() to get the result. See my test:



Goal: I need the following data:

Version:
Last updated:
Active installations:
Tested up to:

See for example the following - view-source:https://wordpress.org/plugins/wp-job-manager/

Version: 1.29.3
Last updated: 5 days ago
Active installations: 100,000+

I want to have a little database that runs locally, holding this data for my favorite plugins, and I want to fetch the data automatically with a cron job.
Well, after the PHP trials, I need to know how to do this in Perl instead - I want to try it in Perl.



btw: this is my XPath: //*[@id="post-15991"]/div[4]/div[1]
this is the URL: https://wordpress.org/plugins/wp-job-manager/




see the subsequent code:


Code
<?php

$remote = "https://wordpress.org/plugins/participants-database/";
$doc = new DOMDocument();
@$doc->loadHTMLFile($remote);   // @ silences warnings about real-world, non-well-formed HTML
$xpath = new DOMXpath($doc);
$node = $xpath->query('//*[@id="post-519"]/div[4]/div[1]/ul/li[2]');
echo $node->item(0)->nodeValue;   // if the query matches nothing, item(0) is null - hence the notice below

?>


But the output looks like this:


Code
martin@linux-3645:~/dev/php> php p20.php
PHP Notice: Trying to get property of non-object in /home/martin/dev/php/p20.php on line 8
martin@linux-3645:~/dev/php> php p20.php



Background:


My way to get the XPath is to use Google Chrome. I have a webpage I want to get some data off -



e.g. from HTML lines like these:

Code
<li>Requires WordPress Version: <strong>4.3.1</strong></li>
<li>Tested up to: <strong>4.9.2</strong></li>



Background: I need this data for all my favorite plugins and want to have it in a DB or a calc sheet, so there are approx. 70 pages to scrape.

See here the list for the example - the full XPath:


Code
//*[@id="post-15991"]/div[4]/div[1]



and job-board-manager:


Code
//*[@id="post-519"]/div[4]/div[1]/ul/li[1] 
//*[@id="post-519"]/div[4]/div[1]/ul/li[2]
//*[@id="post-519"]/div[4]/div[1]/ul/li[3]
//*[@id="post-519"]/div[4]/div[1]/ul/li[7]


I used this method - "Is there a way to get the XPath in Google Chrome?":

Quote
Right-click the item you are trying to find the XPath for and choose "Inspect".
Right-click the highlighted node in the Elements panel.
Go to Copy > Copy XPath.


see the subsequent code:


Code
 
<?php

include('simple_html_dom');   // bug: should be 'simple_html_dom.php' - this failed include causes the errors below
$url = 'https://wordpress.org/plugins/wp-job-manager/';
$html = file_get_html($url);   // undefined once the include fails
$text = array();
foreach($html->find('DIV[class="widget plugin-meta"]') as $text) {   // bug: $text is clobbered as the loop variable
    $text[] = $text->plaintext;
}
print_r($headlines);   // bug: $headlines is never defined; should print the collected array

?>


and the output:
Code
 
martin@linux-3645:~/dev/php> php p100.php

PHP Warning: include(simple_html_dom): failed to open stream: No such file or directory in /home/martin/dev/php/p100.php on line 4
PHP Warning: include(): Failed opening 'simple_html_dom' for inclusion (include_path='.:/usr/share/php5:/usr/share/php5/PEAR') in /home/martin/dev/php/p100.php on line 4
PHP Fatal error: Call to undefined function file_get_html() in /home/martin/dev/php/p100.php on line 6
martin@linux-3645:~/dev/php>






The idea:


I'll try to parse the site using Perl (under perlbrew) and XML::LibXML.



Code
 
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# recover => 2 silences libxml2's complaints about real-world HTML
my $doc = $parser->load_html(location => "http://www.example.com/", recover => 2);
foreach my $x ($doc->findnodes('*xPath*')) {   # missing ')' fixed; '*xPath*' is a placeholder for a real XPath
    ...
}



Well, I think this code should give me a first approach to a working model.
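
A minimal sketch of that model, applied to the WordPress goal, might look like this. The plugin-meta class comes from the page excerpts above and may change; LWP::UserAgent is assumed for the fetching:

Code
use strict;
use warnings;
use LWP::UserAgent;
use XML::LibXML;

my $url = 'https://wordpress.org/plugins/wp-job-manager/';

my $ua  = LWP::UserAgent->new( timeout => 15 );
my $res = $ua->get($url);
die 'fetch failed: ' . $res->status_line unless $res->is_success;

# recover => 2 keeps libxml2 quiet about real-world, non-well-formed HTML
my $doc = XML::LibXML->load_html(
    string  => $res->decoded_content,
    recover => 2,
);

# the div class is taken from the page excerpts above - adjust if the markup changes
for my $li ( $doc->findnodes('//div[@class="widget plugin-meta"]//li') ) {
    my $text = $li->textContent;
    $text =~ s/\s+/ /g;   # squash whitespace
    print "$text\n";
}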


(This post was edited by dilbert on Feb 16, 2018, 6:48 AM)


Zhris
Enthusiast

Feb 18, 2018, 12:24 AM

Post #2 of 5 (1476 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

Hi Dilbert,

You didn't keep us updated on the progress you'd made with your Europa task; you were last attempting to implement paging. The code I provided, with a change to the conf, would apply here too. I have since written a comprehensive scraper module to supersede Web::Scraper, with iteration capabilities, though it needs a bit of a rework.

I haven't written PHP in years now, so I couldn't tell you what's wrong with your code off the top of my head, but PHP is perfectly capable of this task.

I have never used XML::LibXML directly to parse HTML; I typically recommend HTML::TreeBuilder::XPath. It simplifies executing XPath queries, and it inherits from HTML::TreeBuilder, a featuresome HTML parser, which in turn inherits from HTML::Element, a featuresome HTML extractor/modifier. Together they make a very powerful HTML-processing package that covers everything you'd need and more.
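
For example, a minimal sketch (the HTML is inlined here just to keep the example self-contained):

Code
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content(<<'HTML');
<ul><li>Version: <strong>1.29.3</strong></li></ul>
HTML

# findnodes returns HTML::Element based objects, so the whole
# HTML::Element toolbox (as_text, attr, ...) is available on each match
for my $li ( $tree->findnodes('//li') ) {
    print $li->as_text, "\n";   # prints "Version: 1.29.3"
}

$tree->delete;   # free the tree when done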

If you are having trouble getting to grips with the basics of web scraping in Perl, I'd be happy to go through it with you. Every scrape is different; if you don't understand the process behind one, you will have difficulty writing another.

Regards,

Chris


(This post was edited by Zhris on Feb 18, 2018, 12:26 AM)


dilbert
User

Feb 18, 2018, 8:07 AM

Post #3 of 5 (1468 views)
Re: [Zhris] since php-parser attempts failed i need to get a perl-approach [In reply to]

hi dear Chris,

great to hear from you - your new projects sound very interesting.

I'll keep you posted with all the updates on the Europa task.

After parsing each page, check for the existence of the next link at the bottom. When we arrive at page 292, there are no more pages, so we are done and can exit the loop with e.g. last.

At first glance, the issue of scraping from page to page can be solved via different approaches. We have the pagination at the bottom of the page; see for example:



Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=5


and



Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=6


and



Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7


and so forth. Well, we can set these URLs as a base:

if we have an array from which we load the URLs that need to be visited, we would come across all the pages. Note: we have more than 6000 results, and each page holds 21 little entries that each represent one record, so we have approx. 305 pages to visit.

Regarding the loop process:

Well Chris, after parsing each page, we have to check for the existence of the next link at the bottom of the page.


see:

Code
http://europa.eu/youth/volunteering/evs-organisation_en?country=&topic=&field_eyp_vp_accreditation_type=All&town=&name=&pic=&eiref=&inclusion_topic=&field_eyp_vp_feweropp_additional_mentoring_1=&field_eyp_vp_feweropp_additional_physical_environment_1=&field_eyp_vp_feweropp_additional_other_support_1=&field_eyp_vp_feweropp_other_support_text=&&page=7


The proceedings: when we arrive at page 292, there seem to be no more pages left, so we are done counting upward and can exit the loop with e.g. last.
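
A rough sketch of that loop might look like this (the next-link XPath is an assumption - check it against the real pagination markup, and note the real URL carries many more query params than this stripped-down base):

Code
use strict;
use warnings;
use URI;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

my $uri = URI->new('http://europa.eu/youth/volunteering/evs-organisation_en');
my $ua  = LWP::UserAgent->new( timeout => 15 );

my $page = 0;
while (1) {
    $uri->query_form( page => $page );

    my $res = $ua->get($uri);
    die 'fetch failed: ' . $res->status_line unless $res->is_success;

    my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->decoded_content );

    # ... extract the 21 records of the current page here ...

    # assumed XPath for the "next" pager link
    my @next = $tree->findnodes('//a[contains(., "next")]');
    $tree->delete;

    last unless @next;   # no next link -> this was the last page
    $page++;
}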


Dear Chris, I'll try to take the next steps in the Europa task. It would be great if I can go ahead with the great parser approach you revealed. Many, many thanks - I am very glad.

I'll keep you updated on the progress I make with implementing the Europa paging. The code you provided, with a change to the conf, would apply here too - of course!

Dear Chris, the following sounds very, very good:

In Reply To
I have since written a comprehensive scraper module to supersede Web::Scraper with iteration capabilities, though it needs a bit of a rework.


I would be more than glad to have some insight into this project - it sounds very promising. You know that I am learning, and with day-to-day projects like parsers and scrapers I can learn quite a lot.



In Reply To
I typically recommend HTML::TreeBuilder::XPath. It simplifies executing XPath queries, and it inherits from HTML::TreeBuilder, a featuresome HTML parser, which in turn inherits from HTML::Element, a featuresome HTML extractor/modifier. Together they make a very powerful HTML-processing package that covers everything you'd need and more.


Well, parsing with HTML::TreeBuilder::XPath is a great idea; I think I'll have to do some preliminary tests. And yes, I am pretty sure this is a great chance to learn.


My way to get the XPath: use Google Chrome. I have a webpage I want to get some data off - see here: https://wordpress.org/plugins/wp-job-manager/


Goal: I need the following data:

Code
Version:
Last updated:
Active installations:
Tested up to:



Well, I think I have to use the findvalue function.

The findvalue function in HTML::TreeBuilder::XPath returns a concatenation of any values found by the xpath query. Why does it do this, and how could a concatenation of the values be useful to anyone?

Why does it do this?

When we call findvalue, we're requesting a single scalar value. If there are multiple matches, they have to be combined into a single value somehow.

From the documentation for HTML::TreeBuilder::XPath:

findvalue ($path)

...If the path returns a NodeSet, $nodeset->xpath_to_literal is called automatically for us (and thus a Tree::XPathEngine::Literal is returned).

And from the documentation for Tree::XPathEngine::NodeSet:

xpath_to_literal()


Returns the concatenation of all the string-values of all the nodes in the list.
An alternative would be to return the Tree::XPathEngine::NodeSet object so the user could iterate through the results themselves, but the findvalues method (note the plural) already returns a list. And how could a concatenation of the values be useful to anyone?

For example:


Code
   
use strict;
use warnings 'all';
use 5.010;

use HTML::TreeBuilder::XPath;

my $content = do { local $/; <DATA> };
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

say $tree->findvalue('//p');





__DATA__
<p>HTML is just text.</p>
<p>It can still make sense without the markup.</p>

Output:

HTML is just text.It can still make sense without the markup.

Usually, though, it makes more sense to get a list of matches and iterate through them instead of relying on concatenation, so we can use findvalues (plural) whenever there might be multiple matches.
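
With the same __DATA__ as above, the plural version gives one value per matching node:

Code
use strict;
use warnings 'all';
use 5.010;

use HTML::TreeBuilder::XPath;

my $content = do { local $/; <DATA> };
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

# one string-value per matching node instead of one concatenated string
say for $tree->findvalues('//p');

__DATA__
<p>HTML is just text.</p>
<p>It can still make sense without the markup.</p>

Output:

HTML is just text.
It can still make sense without the markup.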


My way to get the XPath: use Google Chrome. These are the webpages I want to get the data off:

Code
   https://wordpress.org/plugins/wp-job-manager/ 
https://wordpress.org/plugins/participants-database/
https://wordpress.org/plugins/amazon-link/
https://wordpress.org/plugins/simple-membership/
https://wordpress.org/plugins/scrapeazon/





I want to have a little local database with this data for my favorite plugins.
Finally, I want to keep this data chart updated by fetching the data automatically with a cron job.
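
A minimal sketch of that storage side with DBI and DBD::SQLite - the table name and columns are made up for illustration. A cron entry such as 0 6 * * * perl /path/to/fetch_plugins.pl (script name hypothetical) could then run the fetch daily:

Code
use strict;
use warnings;
use DBI;

# hypothetical schema: one row per plugin per fetch
my $dbh = DBI->connect( 'dbi:SQLite:dbname=plugins.db', '', '', { RaiseError => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS plugin_meta (
    slug       TEXT,
    version    TEXT,
    updated    TEXT,
    installs   TEXT,
    fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
)
SQL

my $sth = $dbh->prepare(
    'INSERT INTO plugin_meta (slug, version, updated, installs) VALUES (?, ?, ?, ?)'
);
$sth->execute( 'wp-job-manager', '1.29.3', '5 days ago', '100,000+' );

$dbh->disconnect;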


In Reply To
If you are having trouble getting to grips with the basics of web scraping in Perl, I'd be happy to go through it with you. Every scrape is different; if you don't understand the process behind one, you will have difficulty writing another.


Well, dear Chris, I understand - I can learn with each step and with each new scraping/parsing task. This is a great new chance to dive into Perl.

Greetings
dilbert


Zhris
Enthusiast

Feb 18, 2018, 1:11 PM

Post #4 of 5 (1462 views)
Re: [dilbert] since php-parser attempts failed i need to get a perl-approach [In reply to]

Hi Dilbert,


Quote
After parsing each page, check for the existence of the next link at the bottom


That is an excellent idea, and likely your best option. In a rough script I tested, I fetched the total results using //span[@class="ey_badge"], then derived the max page (0-based) with my $page_max = $results / 21; $page_max = int( $page_max ) == $page_max ? $page_max - 1 : int( $page_max );. (For example, 6000 results at 21 per page gives 285.7..., so the last 0-based page index is 285.) But stick to your plan.


Quote
if we have an array from which we load the URLs that need to be visited, we would come across all the pages.


Yes, this would be fine. Preferably, use a URI object to update the URL's page param per iteration of a loop until the max page is reached.


Quote
I would be more than glad to have some insight into this project - it sounds very promising


It's a work in progress, a first draft. I need to modularise more components, as the interface from a user's perspective isn't too clean nor intuitive. It also needs to cover various other scenarios to ensure it's capable of every possible scrape. The code won't make much sense alone, but I plugged in Europa and here is a snippet:


Code
our $iterator_organizations = sub
{
    my ( $browser, $parent ) = @_;

    my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;

    my $nodes = $browser->nodes( url => $url );

    my $iterator = sub
    {
        return shift @$nodes;
    };

    return ( $iterator, 1 );
};

our $iterator_organizations_b = sub
{
    my ( $browser, $parent ) = @_;

    my $url   = q#https://europa.eu/youth/volunteering/evs-organisation_en#;
    my $uri   = URI->new( $url );
    my $xpath = q#//div[@class="vp ey_block block-is-flex"]#;
    my $nodes = [ ];
    my $page  = 0;

    my $results  = $parent->{results};
    my $page_max = $results / 21;
    # last 0-based page index; when the results divide evenly there is one page fewer
    $page_max = int( $page_max ) == $page_max ? $page_max - 1 : int( $page_max );

    my $iterator_uri = sub
    {
        $uri->query_form( page => $page++ );

        return $page > 2 ? undef : $uri;   # capped at 2 pages while testing - use $page_max in production
    };

    my $iterator_node = sub
    {
        unless ( @$nodes )
        {
            my $uri = $iterator_uri->( ) // return undef;

            # the first iteration reuses the tree already fetched by the parent iterator
            my $options = $page == 1 ? { tree => $parent->{_node} } : { url => $uri->as_string };

            $nodes = $browser->nodes( %$options, xpath => $xpath );
        }

        return shift @$nodes;
    };

    return ( $iterator_node, 0 );
};

our $iterator_organization = sub
{
    my ( $browser, $parent ) = @_;

    my $url = $parent->{internal_url};

    my $nodes = $browser->nodes( url => $url );

    my $iterator = sub
    {
        return shift @$nodes;
    };

    return ( $iterator, 1 );
};

#########################

sub organizations
{
    my ( $self, $options ) = ( shift, { @_ } );

    my $map =
    [
        $Massweb::Browser::Europa::iterator_organizations,
        results => q#.//span[@class="ey_badge"]#,
        organizations =>
        [
            $Massweb::Browser::Europa::iterator_organizations_b,
            internal_url => [ q#.//a/@href#, $Massweb::Browser::Europa::handler_url ],
            external_url => [ q#.//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, $Massweb::Browser::handler_trim ],
            title => q#.//h4#,
            topics => [ q#.//div[@class="org_cord"]#, $Massweb::Browser::handler_val, $Massweb::Browser::handler_list_colon ],
            location => [ q#.//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, $Massweb::Browser::handler_trim ],
            hand => [ q#.//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, $Massweb::Browser::handler_trim, $Massweb::Browser::handler_list_comma ],
            pic_number => [ q#.//p[contains(.,'PIC no')]#, $Massweb::Browser::handler_val ],
            recruiting => [ q#boolean(.//i[@class="fa fa-user-times fa-lg"])#, $Massweb::Browser::handler_bool_rev ],
            _ => \&organization,
        ],
    ];

    my $organizations = $self->browse( map => $map );

    return $organizations;
}

sub organization
{
    my ( $self, $options ) = ( shift, { @_ } );

    my $map =
    [
        sub { $Massweb::Browser::Europa::iterator_organization->( $_[0], $options ) },
        #title => q#.//h1#,
        description => q#.//div[@class="ey_vp_detail_page"]/p#,
    ];

    my $organization = $self->browse( map => $map );

    return $organization;
}


In general, the map represents the resultant data structure. The iterator's purpose is quite straightforward: it should return a node each time it is called, or undef to finish. The organizations paging iterator shifts each node off an array; once the array is empty, it calls a URL iterator which increments the page, until there are no pages left.
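
Stripped of the module's specifics, the iterator idea is roughly this - a hypothetical, self-contained illustration, not the module's actual API:

Code
use strict;
use warnings;

# build an iterator over a list; it returns one element per call, undef when exhausted
sub make_iterator {
    my @nodes = @_;
    return sub { return shift @nodes };
}

my $next = make_iterator( 'node1', 'node2', 'node3' );

while ( defined( my $node = $next->() ) ) {
    print "processing $node\n";
}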

It's too complex to go into much detail right now, but hopefully in the near future.


Quote
I want to have a little local database with this data for my favorite plugins.


Keep working at it. If you bump into a specific issue, or there's an aspect you don't understand, feel absolutely free to share it with us and we will do our best to help you move forward. Try to produce a working script: start by looping over each plugin from a hardcoded array, fetching the relevant page's content and putting it through an XPath module of your choice - see the sketch below.
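
A skeleton for that working script might look like this (HTML::TreeBuilder::XPath and the plugin-meta XPath are suggestions taken from earlier in the thread, not requirements):

Code
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

# hardcoded array of favourite plugins to start with
my @slugs = qw( wp-job-manager participants-database simple-membership );

my $ua = LWP::UserAgent->new( timeout => 15 );

for my $slug (@slugs) {
    my $url = "https://wordpress.org/plugins/$slug/";
    my $res = $ua->get($url);

    unless ( $res->is_success ) {
        warn "$slug: fetch failed: " . $res->status_line . "\n";
        next;
    }

    my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->decoded_content );

    # every meta row: Version, Last updated, Active installations, ...
    for my $value ( $tree->findvalues('//div[@class="widget plugin-meta"]//li') ) {
        ( my $text = $value ) =~ s/\s+/ /g;
        print "$slug: $text\n";
    }

    $tree->delete;
}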

Best regards,

Chris


(This post was edited by Zhris on Feb 18, 2018, 1:12 PM)


dilbert
User

Feb 19, 2018, 1:35 AM

Post #5 of 5 (1450 views)
Re: [Zhris] since php-parser attempts failed i need to get a perl-approach [In reply to]

dear Chris,

many, many thanks - this was more than expected; I am overwhelmed by such amazing help and support.


I will work as advised - and yes, I'll keep you informed.

Again, many thanks for all you did! You are encouraging me to go ahead, and I will!


I am so glad to be here, in this great place!


best regards
Dilbert ;)

 
 

