
Home: Perl Programming Help: Beginner:
Perl Documentation

wisnoskij
New User

Feb 17, 2013, 1:43 PM

Post #1 of 4 (463 views)
Perl Documentation

I am having quite a lot of trouble figuring out the limitations of some Perl syntax from the Documentation.

I look at this (http://search.cpan.org/~miyagawa/Web-Scraper-0.32/lib/Web/Scraper.pm) and think that it is really not enough information.

What if you wanted to do something like:
process "tag.class", key => 'TEXT', name => '@id'; # multiple variable creation -- this does work
OR
process "tag.class", key => 'TEXT'.substr(3,6); # modifying the data instead of copying the exact contents of specified attribute -- I have not been able to find any way to do this

Am I just not experienced enough, such that any competent Perl programmer would know all the syntax they could use, and all the modifications they could make, from the sparse examples and explanations given in the documentation? Am I just not looking in the right place for Perl documentation? Or do some Perl modules simply not have good documentation, making it necessary to read the source code and experiment?


Zhris
User

Feb 19, 2013, 1:07 PM

Post #2 of 4 (436 views)
Re: [wisnoskij] Perl Documentation

Hey,

You could modify the data after you've scraped, since I'm uncertain whether it can be done via an XPath-style expression:

untested

Code
#!/usr/bin/perl
use strict;
use warnings FATAL => qw(all);
use URI;
use Web::Scraper;

my $scrape_url = "http://www.example.com";

# scrape the text content of two elements into the keys "key" and "name"
my $scraper = scraper {
    process "tag1.class1", key  => 'TEXT';
    process "tag2.class2", name => 'TEXT';
};

my $response = $scraper->scrape( URI->new( $scrape_url ) );

# modify the scraped data after the fact
$response->{name} = substr $response->{name}, 3, 6;

printf "%s, %s\n", $response->{key}, $response->{name};


I've found documentation across CPAN to be generally clear and complete. Web::Scraper does not expand into much detail, but there is enough to experiment with. It also states that "There are many examples in the eg/ dir packaged in this distribution. It is recommended to look through these".

Documentation should provide enough information to support its readers' requirements without them having to study the source. That said, if you are competent enough to study the source, you will inevitably develop a deeper understanding of the module and its limitations.

It's also a good idea to look at the module's dependencies. In the instance of Web::Scraper, HTML::TreeBuilder::XPath and HTML::Selector::XPath appear to handle the XPath expressions, and therefore may provide additional syntax / documentation / examples.

If this is the first time you've approached web scraping in Perl: although Web::Scraper has been designed to simplify the process, it would be good to research "rawer" techniques, which give you more control at every stage of the scraping process, e.g. HTML::Element.
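As an untested sketch of that rawer approach, here is roughly how the same extract-then-modify step could be done with HTML::TreeBuilder::XPath directly (the URL and class name are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder::XPath;

# placeholder URL
my $html = get('http://www.example.com') or die "fetch failed";

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findnodes returns HTML::Element objects, so the extracted text
# can be modified with ordinary Perl before it is stored anywhere
for my $node ( $tree->findnodes('//*[@class="class1"]') ) {
    print substr( $node->as_text, 3, 6 ), "\n";
}

$tree->delete; # HTML::TreeBuilder trees should be freed explicitly
```

This gives you a hook at every stage (fetch, parse, select, transform) rather than only at the end.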

Best regards,

Chris


(This post was edited by Zhris on Feb 19, 2013, 1:18 PM)


wisnoskij
New User

Feb 19, 2013, 1:50 PM

Post #3 of 4 (428 views)
Re: [Zhris] Perl Documentation

Thank you for your response.

This is pretty much the very beginning of my Perl use, and I have gone with the bare minimum for most of my project in order to develop a deeper understanding of Perl. This project really needs the depth of CSS selector syntax to do right, though, so I am planning on sticking with Web::Scraper.

I still do not understand why the syntax of this package is the way it is, but I have experimented and read enough examples to get a far better understanding of at least some of what you can do with it. It is worth mentioning that you can nest these process calls. So, for example, if you are scraping a table, you can do:


Code
my $web_scraper = scraper {
    # the .data matches only tables with a class called "data"
    process 'body>table.data', table => scraper {
        process '>thead>tr>th', 'header[]' => 'TEXT';
        process '>tbody>tr', 'rows[]' => scraper {
            process '>td', 'cols[]' => 'TEXT';
            process '>td>a', link => '@href';
        };
    };
};


This will produce a two-dimensional-array-style data structure, with the header text stored alongside; this example even captures a link from each row. You would access it like:


Code
my $data = $web_scraper->scrape( URI->new('http://mywebsite.com') );
$data->{table}{rows}[5]{cols}[3]; # cell text at row 5, column 3
$data->{table}{rows}[2]{link};    # the link captured on row 2
$data->{table}{header}[7];        # text of the 8th column header
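To walk the whole structure, here is an untested sketch over a mocked-up result shaped like the one the nested scraper produces (note that because the inner scraper is assigned to the table key, the rows live under $data->{table}{rows}):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# mocked-up scrape result, shaped like the nested scraper above
my $data = {
    table => {
        header => [ 'Name', 'Score' ],
        rows   => [
            { cols => [ 'foo', '10' ], link => 'http://mywebsite.com/foo' },
            { cols => [ 'bar', '20' ] },
        ],
    },
};

# print each row's link (if any) and every cell paired with its header
for my $row ( @{ $data->{table}{rows} } ) {
    print "link: $row->{link}\n" if $row->{link};
    for my $i ( 0 .. $#{ $row->{cols} } ) {
        printf "%s: %s\n", $data->{table}{header}[$i], $row->{cols}[$i];
    }
}
```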



(This post was edited by wisnoskij on Feb 19, 2013, 2:08 PM)


Zhris
User

Feb 19, 2013, 2:36 PM

Post #4 of 4 (424 views)
Re: [wisnoskij] Perl Documentation

Hey,

It appears that your knowledge of Web::Scraper has greatly improved since your original post.

In answer to your query, "I still do not understand why the syntax of this package is this way": the module does not construct its object in the conventional manner; it exports a small domain-specific language (the scraper and process functions) instead. In Perl there are many ways to achieve equal outcomes, so do not necessarily regard Web::Scraper as a model example.

Remember that Web::Scraper is basically an interface to more extensive classes, with the benefit of simplifying the core scraping process for the programmer. In your instance, it uses HTML::Selector::XPath to parse CSS selectors into XPath expressions usable by HTML::TreeBuilder::XPath, which does the initial HTML parsing. If Web::Scraper doesn't do what you need directly due to its limitations, use these and other classes to do it indirectly.
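For example (untested), HTML::Selector::XPath can be used on its own to see exactly what XPath a given CSS selector compiles down to, which helps when a selector does not match what you expect:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Selector::XPath qw(selector_to_xpath);

# print the XPath that each CSS selector compiles to
print selector_to_xpath('tag.class'), "\n";
print selector_to_xpath('body > table.data'), "\n";
```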

Your example above looks to have covered the core syntax / functionality achievable through Web::Scraper.

Chris


(This post was edited by Zhris on Feb 19, 2013, 2:47 PM)
