Perl::Mechanize - how to loop within this [example]

 



dilbert
User

Oct 2, 2012, 1:16 AM

Post #1 of 8 (2362 views)
Perl::Mechanize - how to loop within this [example]

 
I am getting started with Perl programming and want to learn something.


I am currently working on a small solution. I have tried various tutorials
(Mechanize examples that I found on CPAN), but not all of them work; some of them are broken.

Now I am trying a real-world task,

which is especially interesting for me as a PHP/Perl beginner.

What we have so far: the harvesting task should be no problem if I
take WWW::Mechanize, particularly for doing the form-based search and selecting the individual entries.


See the page: http://katholisch.at/content/site/pfarrfinder/index.html
There you can see more results.

I guess the algorithm would basically be two nested loops:
the outer loop runs the form-based search,
the inner loop processes the search results.

I have approximately 10000 pages to parse.

They are organized like this:

http://Www.address/5307.html
http://Www.address/5308.html
http://Www.address/5309.html
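
Since the pages seem to differ only by a running number, the list of URLs could be generated instead of typed in. A minimal sketch, assuming the base address and number range from the list above (both placeholders, not the real site values); `page_urls` is a helper name made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: build one URL per page number.
# $base, $first and $last are assumptions for illustration only.
sub page_urls {
    my ($base, $first, $last) = @_;
    return map { "$base/$_.html" } $first .. $last;
}

my @urls = page_urls('http://Www.address', 5307, 5309);
print "$_\n" for @urls;
```

The resulting list can then be handed to whatever loop does the fetching.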




Code
 
$mech->follow_link(
    url_regex => qr{webgrab_path=http://evs2000.*?Id=\d+$},
    n         => $result_nbr,
);


Well, what do you think?


(This post was edited by dilbert on Oct 2, 2012, 1:08 PM)


wickedxter
User

Oct 2, 2012, 12:24 PM

Post #2 of 8 (2351 views)
Re: [dilbert] Perl::Mechanize - how to loop within this [example]

Post the rest of the code as well.


dilbert
User

Oct 3, 2012, 2:03 AM

Post #3 of 8 (2342 views)
Re: [wickedxter] Perl::Mechanize - how to loop within this [example]

Hello dear wickedxter,


many thanks for the quick answer; great to hear from you.


In Reply To
post the rest of the code as well....


Well, I am trying out several ways. Note that I am a beginner in Perl.


I want to combine it with the snippet below. The question is:

how do I arrange the iteration through the results, which are organized as shown here?


I have approximately 10000 pages to parse.

They are organized like this:

http://Www.address/5307.html
http://Www.address/5308.html
http://Www.address/5309.html


wickedxter, I would love it if you could give me a hint.

Greetings






Code
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;

# create the Mechanize object with autocheck switched off,
# so we don't die when a bad/malformed URL is requested
my $mech = WWW::Mechanize->new(autocheck => 0);
my %comments;
my %links;

my $target = "http://google.com";
# store the first target URL as not yet checked
$links{$target} = 0;
# pick the first URL to search
my $url = get_url();

# main loop: runs until no unchecked URL is left
while ($url ne "")
{
    # fetch the target URL
    $mech->get($url);
    # search the source for any HTML comments
    my $res = $mech->content;
    my @comment = $res =~ /<!--[^>]*-->/g;
    # store the comments in %comments and print them, if any were found
    if (@comment)
    {
        $comments{$url} = "@comment";
        say "\n$url \n---------------->\n $comments{$url}";
    }

    # loop through all the links that are on the current page
    foreach my $link ($mech->links())
    {
        my $href = $link->url();
        # exclude irrelevant stuff, such as JavaScript calls or absolute (external) links
        # you might want to add a domain check, so relevant absolute links aren't excluded
        if ($href !~ /^(#|mailto:|(f|ht)tps?:|www\.|javascript:)/)
        {
            # add a slash only if the link lacks a leading one, so the whole URL is built properly
            my $full = $href =~ m{^/} ? $target . $href : $target . "/" . $href;
            # store it in our hash of links to be searched, unless it's already present
            $links{$full} = 0 unless exists $links{$full};
        }
    }

    # mark this URL as searched and pick the next one
    $links{$url} = 1;
    $url = get_url();
}

sub get_url
{
    # return the next URL that has not been searched yet;
    # if all URLs have been searched return empty, ending the main loop
    # (foreach over keys avoids the stale iterator that returning out of each() leaves behind)
    foreach my $key (sort keys %links)
    {
        return $key if $links{$key} == 0;
    }

    return "";
}



dilbert
User

Oct 3, 2012, 11:03 AM

Post #4 of 8 (2336 views)
Re: [wickedxter] Perl::Mechanize - how to loop within this [example]

Dear wickedxter,

me again.

I need Mechanize, particularly for doing the form-based search and selecting the individual entries.

Goal: I have approximately 100 pages to parse.

They are organized like this:


Code
http://www.address/307.html 
http://www.address/308.html
http://www.address/309.html


I guess the algorithm would basically be just one nested loop:
one outer loop runs the form-based search,
and one inner loop processes the search results.


I did it like this:


Code
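use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $url = "here the urls go in ";   # placeholder for the real URL

# set a cookie named "start" with value 1 for the domain .test.com
$mech->cookie_jar->set_cookie(0,"start",1,"/",".test.com");
$mech->get($url);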



It's really just one loop, isn't it?


Code
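# @urls, %data and parse_response() are placeholders that still have to be defined
foreach my $url (@urls) {
    # $mech is the WWW::Mechanize object from above; it inherits post() from LWP::UserAgent
    my $response = $mech->post($url, \%data);
    my $result   = parse_response($response);
}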





The thing is pretty simple. I need to do two things with Mechanize:

- do the form-based search and
- select the individual entries and parse them
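
The two steps above could be arranged as two nested loops roughly like this. It is only a sketch under assumptions: the form number, the field name `ort`, the link pattern and the helper names `is_detail_link` and `harvest` are all made up for illustration, since the real form on the site has not been shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical filter: keep only links that look like numbered detail pages,
# e.g. ".../address/5722.html". The pattern is an assumption based on the
# URLs shown in this thread.
sub is_detail_link {
    my ($url) = @_;
    return $url =~ m{/\d+\.html$} ? 1 : 0;
}

sub harvest {
    my ($search_url, @terms) = @_;
    require WWW::Mechanize;    # loaded only when pages are really fetched
    my $mech = WWW::Mechanize->new(autocheck => 1);
    my @pages;

    # outer loop: run the form-based search once per search term
    for my $term (@terms) {
        $mech->get($search_url);
        $mech->submit_form(
            form_number => 1,                 # assumption: first form on the page
            fields      => { ort => $term },  # 'ort' is a made-up field name
        );

        # inner loop: visit every result entry of this search
        my @hits = grep { is_detail_link($_->url_abs->as_string) } $mech->links;
        for my $link (@hits) {
            $mech->get($link->url_abs);
            push @pages, { url => $mech->uri->as_string, html => $mech->content };
        }
    }
    return @pages;
}

# Only runs when a search page URL is given on the command line, e.g.
#   perl harvest.pl http://example.org/search.html Wien Linz
if (@ARGV) {
    my ($search_url, @terms) = @ARGV;
    printf "%s (%d bytes)\n", $_->{url}, length($_->{html})
        for harvest($search_url, @terms);
}
```

Collecting the result links before visiting them keeps the inner loop independent of whichever page Mechanize currently holds.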


wickedxter
User

Oct 3, 2012, 1:43 PM

Post #5 of 8 (2329 views)
Re: [dilbert] Perl::Mechanize - how to loop within this [example]

This is how I would process the pages, assuming page3000 through page3004 exist at the URL.

If you're processing HTML on the page, use HTML::TokeParser or HTML::TokeParser::Simple.


Code
#!/usr/bin/perl


## This is how I would go about doing what I understand you are trying to do
## EXAMPLE only

use 5.014;
use strict;
use warnings;

use WWW::Mechanize;

my $target_url = 'http://www.google.com/';
my $page = 3000;
my $format = '.html';
my $max_page_num = 4;

# one Mechanize object is enough for all requests
my $mech = WWW::Mechanize->new();
# 'Windows Mozilla' is one of the agent aliases WWW::Mechanize knows
$mech->agent_alias('Windows Mozilla');

# loop through the pages (page3000 .. page3004)
for (0..$max_page_num){
    # this combines to make the url
    my $url = $target_url . 'page' . $page . $format;

    # get the page
    $mech->get($url);

    # get all links that match the regex
    # (the empty pattern is a placeholder: put your own pattern here)
    my @links = $mech->find_all_links(url_regex => qr//);

    ### process the links and follow_link or process page.

    # move on to the next page number, so the pages are processed in order
    $page++;

}


1;



(This post was edited by wickedxter on Oct 3, 2012, 1:46 PM)


dilbert
User

Oct 3, 2012, 2:15 PM

Post #6 of 8 (2326 views)
Re: [wickedxter] Perl::Mechanize - how to loop within this [example]

Hello dear wickedxter,

many thanks for the quick reply; great to hear from you.

I will try this out tomorrow evening. In the meantime I will install all the needed Perl modules.

I will come back to this great place and report all my findings.

Many greetings

dilbert


dilbert
User

Oct 3, 2012, 2:23 PM

Post #7 of 8 (2325 views)
Re: [dilbert] Perl::Mechanize - how to loop within this [example]

Hello again,

one question, just in case I did not understand one little thing.

You say:


Quote
if you're processing HTML on the page use HTML::TokeParser or HTML::TokeParser::Simple


I think this is a good idea, but your code does not load any of the modules mentioned above, so the HTML processing is not included in your code.

Or did I get something wrong?

In other words: there are no calls to these modules, so the parsing job is not included. Your script only fetches the pages and stores the results?

I will have a closer look at the code...
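
For what it is worth, the parsing step being asked about could be plugged in roughly like this. This is only a sketch: which tag really holds the address on the target pages is an assumption (`<p>` here), the sample HTML is invented, and `paragraphs` is a helper name made up for this illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the visible text of every <p> element.
# Which tag really holds the address on the target pages is an assumption.
sub paragraphs {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot parse HTML";
    my @texts;
    while ($parser->get_tag('p')) {
        # text up to the closing </p>, with whitespace collapsed and trimmed
        my $text = $parser->get_trimmed_text('/p');
        push @texts, $text if length $text;
    }
    return @texts;
}

# Demo on an inline string; inside the fetch loop the call
# would be paragraphs($mech->content) instead.
my $sample = '<html><body><p>Stephansplatz 3</p><p> 1010  Wien </p></body></html>';
print "$_\n" for paragraphs($sample);
```

The fetch loop stays as it is; only the line that processes the page would call the helper.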


dilbert
User

Oct 5, 2012, 9:16 AM

Post #8 of 8 (2304 views)
Re: [dilbert] Perl::Mechanize - how to loop within this [example]

Me again, dilbert, the guy from good old Europe.


We have the following options here:

To print to a file instead of printing to the screen, we just have to change:

Code
say $text;

to:

Code
print $OUT_FILE "$text\n";

Some explanations:
$OUT_FILE is a filehandle for the output file, which we have to open before entering the for loop. Unlike say, print does not append a newline, so it is added explicitly.

This would work for the code as it is, but it might be different if we use the Text::CSV module, which probably has
dedicated functions or methods for printing CSV lines to a file. (To be frank, I don't use this module and don't know it,
although I should probably change this, because I use CSV files from time to time ;).
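
Opening the filehandle before the loop and writing one CSV line per record could look roughly like this. Note the assumptions: the file name `addresses.csv` is made up, and `csv_line` is a small hand-rolled stand-in so the sketch runs with core Perl only. It is not the Text::CSV API, which would be the usual choice for real data.

```perl
use strict;
use warnings;

# Hand-rolled CSV quoting (a stand-in for Text::CSV, not its API):
# quote a field if it contains a comma, a quote or a line break,
# and double any embedded quotes.
sub csv_line {
    my @fields = @_;
    for my $f (@fields) {
        $f = '' unless defined $f;
        if ($f =~ /[",\r\n]/) {
            $f =~ s/"/""/g;
            $f = qq{"$f"};
        }
    }
    return join(',', @fields);
}

# 'addresses.csv' is a made-up file name; open the file once, before the loop
open my $OUT_FILE, '>:encoding(UTF-8)', 'addresses.csv'
    or die "cannot open addresses.csv: $!";

print {$OUT_FILE} csv_line(
    'Dom- und Metropolitanpfarre St. Stephan',
    'Stephansplatz 3',
    '1010 Wien',
), "\n";

close $OUT_FILE or die "cannot close addresses.csv: $!";
```

With this layout the comma separates the fields of one address, and each record gets its own line.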

Let me describe in more detail what the output file should look like.
Should the comma separate the fields of an address, or the records?


If we take this page as an example: http://katholisch.at/content/site/pfarrfinder/address/5722.html

we have the following dataset:

Dom- und Metropolitanpfarre St. Stephan
Stephansplatz 3
1010 Wien
Telefonnummer: 515 52-3530
FAX-Nummer: 515 52-3720
E-Mail: dompfarre-st.stephan@edw.or.at
Web: http://www.st.stephan.at
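
A dataset like the one above could be split into named fields along these lines. It is a sketch under assumptions: it expects name, street and "ZIP city" as the first three lines, in that order, followed by optional "Label: value" lines; `parse_record` is a helper name made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the text lines of one record into named fields.
# Assumes name, street and "ZIP city" come first, in that order, followed by
# optional "Label: value" lines as in the dataset above.
sub parse_record {
    my @lines = grep { /\S/ } split /\n/, shift;
    s/^\s+|\s+$//g for @lines;

    my %rec;
    @rec{qw(name street city)} = splice @lines, 0, 3;
    for my $line (@lines) {
        # split at the first colon only, so "Web: http://..." stays intact
        my ($label, $value) = $line =~ /^([^:]+):\s*(.*)$/ or next;
        $rec{$label} = $value;
    }
    return \%rec;
}

my $record = parse_record(<<'END');
Dom- und Metropolitanpfarre St. Stephan
Stephansplatz 3
1010 Wien
Telefonnummer: 515 52-3530
FAX-Nummer: 515 52-3720
E-Mail: dompfarre-st.stephan@edw.or.at
Web: http://www.st.stephan.at
END

print "$_ = $record->{$_}\n" for sort keys %$record;
```

Records that lack one of the first three lines would need extra checks, which are left out here.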

well i want to have seperated each datset into these bits - in other words:
if i have a dataset that delimiters and seperates the lines that are given like that

Loosdorf Ledochowskastra&#65533;e 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4

i would be very very happy. Note: there also a Encoding issues is: see the Ledochowskastra&#65533;e - there is a sign in it "" so we have to take care for the
iso 8859 encoding dont we!?
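
On the encoding question, a minimal sketch with the core Encode module, assuming the page bytes really are ISO-8859-1 (the actual charset has not been confirmed in this thread and should be taken from the HTTP header or the page's meta tag). As far as I know, `$mech->content` usually returns already decoded text, so the explicit decode would only be needed for raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the raw bytes are ISO-8859-1 (Latin-1).
my $raw_bytes = "Ledochowskastra\xDFe 4";       # \xDF is "ß" in Latin-1

my $text = decode('ISO-8859-1', $raw_bytes);    # bytes -> Perl characters
my $utf8 = encode('UTF-8', $text);              # characters -> UTF-8 bytes

# print the UTF-8 bytes unchanged
binmode STDOUT, ':raw';
print $utf8, "\n";
```

Decoding once on input and encoding once on output avoids the replacement character seen above.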

I would love it if you could give me some hints and a helping hand. That would be very supportive.
Note: this is a great chance for me to learn a lot about Perl and about the options and power of Mechanize.

Looking forward to hearing from you.

Many greetings

 
 

