CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
INDEX
Search Posts SEARCH
POSTS
Who's Online WHO'S
ONLINE
Log in LOG
IN

Home: Perl Programming Help: Beginner:
minor changes in a script - an easy one

 



dilbert
User

Jun 23, 2013, 4:16 AM

Post #1 of 13 (697 views)
minor changes in a script - an easy one Can't Post

 
Hi there, good day dear Coding-Experts.


- want to run the following script with the URL from a to z

that means


Code
url = "http://search.cpan.org/author/?a"



Code
url = "http://search.cpan.org/author/?z"


okay - this is no perl-script. but i guess that you can help me tooo... here:

how to change the term in the code...


Code
 
import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?X"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
alk = urlparse.urljoin(url, lk)

data = { 'url':alk, 'name':name, 'cname':capname }

phtml = urllib.urlopen(alk).read()
memail = re.search('<a href="mailto:(.*?)">', phtml)
if memail:
data['email'] = memail.group(1)

print data



hwnd
User

Jun 25, 2013, 1:43 PM

Post #2 of 13 (644 views)
Re: [dilbert] minor changes in a script - an easy one [In reply to] Can't Post

Your code is currently python, to break it down..


Code
# Import modules  
import urllib
import urlparse
import re

# Url to get content from
url = "http://search.cpan.org/author/?X"

# Read in html to get ready to parse
html = urllib.urlopen(url).read()

# (lk = link, capname = NAME OF AUTHOR in bold caps, name = original name of author)
# Using a regular expression type here to parse out (lk,capname,name) from our content
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):

# Joining together the ^ url and link that we took from our regex
alk = urlparse.urljoin(url, lk)

# Storing our data in a hashtype
data = { 'url':alk, 'name':name, 'cname':capname }

# Read in html once again to use a regex to
# see if any links have mailto: ... in it
phtml = urllib.urlopen(alk).read()
memail = re.search('<a href="mailto:(.*?)">', phtml)

# If found, add the email data to our original data and print.
if memail:
data['email'] = memail.group(1)
print data



To convert our code to perl in a similar way we can use the module 'WWW::Mechanize'


Code
#!/usr/bin/perl  

use strict;
use warnings;
use WWW::Mechanize;

# Initiate our $mech object
my $mech = WWW::Mechanize->new;

# Declare our hash to push our new link data in.
my %links;

# Use scalar $id as our identifier as we loop from A to Z
# to define which letter we are on. We use $mech->get( $url )
# to retreive our web content for each $id and we use hash
# %data with map function to store the link text and join the url
# along with the url of the $id letter we are currently on, then
# use grep to catch only the urls we want with (/~NAME/) in them.
# Next we will loop through our %data hash to push our
# ($key = link text, $data{$key} = url) to our new hash %links.

for my $id ( 'A' .. 'Z' ) {
$mech->get('http://search.cpan.org/author/?' . $id);
my %data =
map { $_->text => join '', 'http://search.cpan.org', $_->url }
grep { $_->url =~ qr/\~.*?/ } $mech->find_all_links;
foreach my $key ( keys %data ) {
push @{$links{$id}}, join(':', $key, $data{$key});
}
}

use Data::Dumper;
print Dumper \%links;

__SAMPLE OUTPUT__
'A' => [
'ALECS:http://search.cpan.org/~alecs/',
'ACE:http://search.cpan.org/~ace/',
'ABREY:http://search.cpan.org/~abrey/',
],
'B' => [
'BRINZER:http://search.cpan.org/~brinzer/',
'BRIAN:http://search.cpan.org/~brian/',
'BDSYMMES:http://search.cpan.org/~bdsymmes/',
'BCH:http://search.cpan.org/~bch/',
.....
.....
],



dilbert
User

Jun 25, 2013, 7:51 PM

Post #3 of 13 (634 views)
Re: [hwnd] minor changes in a script - an easy one [In reply to] Can't Post

hello dear

many thanks for this in depth going transfer of python to perl.

a great example of what can Mechanize do



Code
# along with the url of the $id letter we are currently on, then  
# use grep to catch only the urls we want with (/~NAME/) in them.
# Next we will loop through our %data hash to push our
# ($key = link text, $data{$key} = url) to our new hash %links


by the way: if we want to catch the mails too - then we use grep and do it with it -( grep) too!?


hwnd
User

Jun 25, 2013, 8:43 PM

Post #4 of 13 (629 views)
Re: [dilbert] minor changes in a script - an easy one [In reply to] Can't Post

Yes, you just use the built in regex or use your own.


Code
grep { $_->url =~ qr/mailto:(.*?)/ } $mech->find_all_links  

my @email = $mech->find_all_links( tag => 'a', url_regex => qr/mailto:(.*?/ );



dilbert
User

Jun 25, 2013, 11:10 PM

Post #5 of 13 (625 views)
Re: [hwnd] minor changes in a script - an easy one [In reply to] Can't Post

hi there

many thanks for the quick reply

the following line



Code
  grep { $_->url =~ qr/\~.*?/ } $mech->find_all_links;


i changed to this one here



Code
 
#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

# Initiate our $mech object
my $mech = WWW::Mechanize->new;

# Declare our hash to push our new link data in.
my %links;

# Use scalar $id as our identifier as we loop from A to Z
# to define which letter we are on. We use $mech->get( $url )
# to retreive our web content for each $id and we use hash
# %data with map function to store the link text and join the url
# along with the url of the $id letter we are currently on, then
# use grep to catch only the urls we want with (/~NAME/) in them.
# Next we will loop through our %data hash to push our
# ($key = link text, $data{$key} = url) to our new hash %links.
# grep { $_->url =~ qr/\~.*?/ } $mech->find_all_links;


for my $id ( 'A' .. 'Z' ) {
$mech->get('http://search.cpan.org/author/?' . $id);
my %data =
map { $_->text => join '', 'http://search.cpan.org', $_->url }
grep { $_->url =~ qr/mailto:(.*?)/ } $mech->find_all_links
my @email = $mech->find_all_links( tag => 'a', url_regex => qr/mailto:(.*?/ );

foreach my $key ( keys %data ) {
push @{$links{$id}}, join(':', $key, $data{$key});
}
}

use Data::Dumper;
print Dumper \%links;


well i tried it out but got some errors - i will dig deeper and investigate why i get eerrors



Quote
-rw-r--r-- 1 martin users 150 26. Mai 18:27 urls.txt
martin@linux-70ce:~/perl> perl cpan.pl
syntax error at cpan.pl line 29, near "->find_all_links
my "
Global symbol "@email" requires explicit package name at cpan.pl line 29.
Global symbol "%data" requires explicit package name at cpan.pl line 31.
Global symbol "%data" requires explicit package name at cpan.pl line 32.
BEGIN not safe after errors--compilation aborted at cpan.pl line 36.
martin@linux-70ce:~/perl>


hmmm - guess that i did something worng!?`



by the way - i the previous code runs nicely and i love the sorted and very very clean view with the separators



it looks great


Quote

org/~garnadi/',
'GCHESLER:http://search.cpan.org/~gchesler/',
'GEOKEVHAT:http://search.cpan.org/~geokevhat/',
'GOODI:http://search.cpan.org/~goodi/',
'GMCH:http://search.cpan.org/~gmch/',
'GAZERRO:http://search.cpan.org/~gazerro/',
'GARTHD:http://search.cpan.org/~garthd/',
'GYU:http://search.cpan.org/~gyu/',
'GHEALTON:http://search.cpan.org/~ghealton/',
'GBUDD:http://search.cpan.org/~gbudd/',
'GSULLIVAN:http://search.cpan.org/~gsullivan/',
'GED:http://search.cpan.org/~ged/',
'GREGARYH:http://search.cpan.org/~gregaryh/',
'GOODEVLPR:http://search.cpan.org/~goodevlpr/',
'GVENKAT:http://search.cpan.org/~gvenkat/'
],
'U' => [
'UEW:http://search.cpan.org/~uew/',
'UGANSERT:http://search.cpan.org/~ugansert/',
'UNERA:http://search.cpan.org/~unera/',
'UARUN:http://search.cpan.org/~uarun/',
'UDHAY:http://search.cpan.org/~udhay/',
'UWES:http://search.cpan.org/~uwes/',


well i am sure that we find out why we get complains with the new grep...


hwnd
User

Jun 26, 2013, 11:33 AM

Post #6 of 13 (606 views)
Re: [dilbert] minor changes in a script - an easy one [In reply to] Can't Post

Firstly, you're getting those errors because you left off the semicolon ' ; ' at the end of $mech->find_all_links

Now, the below line is matching only links that start with href="~name"


Code
grep { $_->url =~ qr/\~.*?/ } $mech->find_all_links;



The below lines, you don't need both:


Code
grep { $_->url =~ qr/mailto:(.*?)/ } $mech->find_all_links; 
my @email = $mech->find_all_links( tag => 'a', url_regex => qr/mailto:(.*?/ );



To simplify and grab the 'links' and 'mailto' you want, you can push the links into hash %links and email into @email, and dump them.


Code
#!/usr/bin/perl  

use strict;
use warnings;
use WWW::Mechanize;

my %links;
my @email;

my $mech = WWW::Mechanize->new;

for ('A'..'Z') {
$mech->get('http://search.cpan.org/author/?'.$_);

push @{$links{$_}},
map { join ':', $_->text, 'http://search.cpan.org' . $_->url }
$mech->find_all_links(url_regex => qr/\~.*?/);

push @email, map { $_->url } $mech->find_all_links(url_regex => qr/mailto:(.*?)/);
}

use Data::Dumper;
print Dumper \%links;
print Dumper \@email;



dilbert
User

Jun 27, 2013, 5:47 AM

Post #7 of 13 (596 views)
Re: [hwnd] minor changes in a script - an easy one [In reply to] Can't Post

hi there

many many thanks

the following code


Code
 

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my %links;
my @email;

my $mech = WWW::Mechanize->new;

for ('A'..'Z') {
$mech->get('http://search.cpan.org/author/?'.$_);

push @{$links{$_}},
map { join ':', $_->text, 'http://search.cpan.org' . $_->url }
$mech->find_all_links(url_regex => qr/\~.*?/);

push @email, map { $_->url } $mech->find_all_links(url_regex => qr/mailto:(.*?)/);
}

use Data::Dumper;
print Dumper \%links;
print Dumper \@email;


gives back the following results



Code
 
ADM:http://search.cpan.org/~ultradm/',
'UMEMOTO:http://search.cpan.org/~umemoto/',
'UMEYUKI:http://search.cpan.org/~umeyuki/',
'UMIYOSH:http://search.cpan.org/~umiyosh/',
'UMVUE:http://search.cpan.org/~umvue/',
'UNBIT:http://search.cpan.org/~unbit/',
'UNCLE:http://search.cpan.org/~uncle/',
'UNCLEANDY:http://search.cpan.org/~uncleandy/',
'UNCLELVIS:http://search.cpan.org/~unclelvis/',
'UNDEF:http://search.cpan.org/~undef/',
'UNDERMINE:http://search.cpan.org/~undermine/',
'UNDX:http://search.cpan.org/~undx/',
'UNERA:http://search.cpan.org/~unera/',
'UNICOLET:http://search.cpan.org/~unicolet/',
'UNIEJO:http://search.cpan.org/~uniejo/',
'UNIFIEDSW:http://search.cpan.org/~unifiedsw/',
'UNISOLVE:http://search.cpan.org/~unisolve/',
'UNIXNOMAD:http://search.cpan.org/~unixnomad/',
'UNIXTOWN:http://search.cpan.org/~unixtown/',
'UNKNOWNQ:http://search.cpan.org/~unknownq/',
'UNLEARNED:http://search.cpan.org/~unlearned/',
'UNOBE:http://search.cpan.org/~unobe/',
'UNRTST:http://search.cpan.org/~unrtst/',
'UNSAVED:http://search.cpan.org/~unsaved/',
'UNSTATIK:http://search.cpan.org/~unstatik/',
'UREE:http://search.cpan.org/~uree/',
'URI:http://search.cpan.org/~uri/',
'URKLE:http://search.cpan.org/~urkle/',
'URSUS:http://search.cpan.org/~ursus/',
'USAGIJER:http://search.cpan.org/~usagijer/',
'USEDHONDA:http://search.cpan.org/~usedhonda/',
'USEOPENID:http://search.cpan.org/~useopenid/',
'USIYER:http://search.cpan.org/~usiyer/',
'USMANOV:http://search.cpan.org/~usmanov/',
'USPROCESS:http://search.cpan.org/~usprocess/',
'USSJOIN:http://search.cpan.org/~ussjoin/',
'USTIANSKY:http://search.cpan.org/~ustiansky/',
'USUALOMA:http://search.cpan.org/~usualoma/',
'UTAANI:http://search.cpan.org/~utaani/',
'UUDEREK:http://search.cpan.org/~uuderek/',
'UVOELKER:http://search.cpan.org/~uvoelker/',
'UWEH:http://search.cpan.org/~uweh/',
'UWES:http://search.cpan.org/~uwes/'
]
};



hwnd
User

Jun 27, 2013, 6:36 AM

Post #8 of 13 (592 views)
Re: [dilbert] minor changes in a script - an easy one [In reply to] Can't Post

Is that what you were wanting for ouput?


dilbert
User

Jun 27, 2013, 7:25 AM

Post #9 of 13 (587 views)
Re: [hwnd] minor changes in a script - an easy one [In reply to] Can't Post

 
hi there - many thanks for the reply

i want to have somewhat like the following ...;

that is result of the python - code that i ve posted in the initial thread.




Code
 

{'url': 'http://search.cpan.org/~xachen/', 'cname': 'XACHEN', 'name': 'Justin Cassidy', 'email': 'xachen%40gmail.com'}
{'url': 'http://search.cpan.org/~xaerxess/', 'cname': 'XAERXESS', 'name': 'Grzegorz Ro&#x17C;niecki', 'email': 'xaerxess%40gmail.com'}
{'url': 'http://search.cpan.org/~xaicron/', 'cname': 'XAICRON', 'name': 'Yuji Shimada', 'email': 'xaicron%40cpan.org'}
{'url': 'http://search.cpan.org/~xandela/', 'cname': 'XANDELA', 'name': 'Mathias J. Hennig', 'email': 'CENSORED'}
{'url': 'http://search.cpan.org/~xandre/', 'cname': 'XANDRE', 'name': 'Alexander Sousa', 'email': 'alexandersousa%40gmail.com'}
{'url': 'http://search.cpan.org/~xanni/', 'cname': 'XANNI', 'name': 'Andrew Pam', 'email': 'andrew%40sericyb.com.au'}
{'url': 'http://search.cpan.org/~xant/', 'cname': 'XANT', 'name': 'Andrea Guzzo', 'email': 'xant%40cpan.org'}
{'url': 'http://search.cpan.org/~xantus/', 'cname': 'XANTUS', 'name': 'David Davis', 'email': 'xantus%40cpan.org'}
{'url': 'http://search.cpan.org/~xaoc/', 'cname': 'XAOC', 'name': 'Brian Manning', 'email': 'xaoc%40cpan.org'}
{'url': 'http://search.cpan.org/~xaoinc/', 'cname': 'XAOINC', 'name': 'XAO Inc.', 'email': 'am%40xao.com'}
{'url': 'http://search.cpan.org/~xaos/', 'cname': 'XAOS', 'name': 'Andrew', 'email': 'andrew%40uzdesign.com'}
{'url': 'http://search.cpan.org/~xav/', 'cname': 'XAV', 'name': 'Xavier Caron', 'email': 'xcaron%40gmail.com'}
{'url': 'http://search.cpan.org/~xavier/', 'cname': 'XAVIER', 'name': 'Steve Pomeroy', 'email': 'xavier-cpan%40staticfree.info'}
{'url': 'http://search.cpan.org/~xaxxon/', 'cname': 'XAXXON', 'name': 'Zac Hansen', 'email': 'xaxxon%5Bat%5Dgoogles%5Bemail%5D'}
{'url': 'http://search.cpan.org/~xcalbet/', 'cname': 'XCALBET', 'name': 'Xavier Calbet', 'email': 'xcalbet%40yahoo.es'}
{'url': 'http://search.cpan.org/~xcezx/', 'cname': 'XCEZX', 'name': 'MAEKAWA Tsuyoshi', 'email': 'CENSORED'}
{'url': 'http://search.cpan.org/~xdr/', 'cname': 'XDR', 'name': 'Alexander Simakov', 'email': 'xdr%40cpan.org'}


well is it possible to get this reslults with perl mechanize too?!


love to hear from you


hwnd
User

Jun 27, 2013, 9:15 AM

Post #10 of 13 (574 views)
Re: [hwnd] minor changes in a script - an easy one [In reply to] Can't Post

Yes you can get the desired data structure running in threads and supposely an alternative route also, do you have access to install and use any module?


dilbert
User

Jun 27, 2013, 9:45 AM

Post #11 of 13 (565 views)
Re: [hwnd] minor changes in a script - an easy one [In reply to] Can't Post

hello thx to hear from you

i can install and use any module here. i run an opensuse linux - system version 12.3 and i can install via

a. YAST or
b. CPAN


any perl module

greetings

ps.
what would be great is to have Mechanize delivering all the data - from a to z.


hwnd
User

Jun 28, 2013, 8:06 PM

Post #12 of 13 (542 views)
Re: [dilbert] minor changes in a script - an easy one [In reply to] Can't Post

Personally I'd recommend using Web::Scraper or running WWW::Mechanize in parallel with threads or Cora because the data needs to be parsed correctly, using just Mechanize will cause an overload for what you are asking. For example run the code below and see how long it takes, imagine how long it will take to load looping from a .. z.


Code
use strict; 
use warnings;
use Data::Dumper;
use WWW::Mechanize;

my %data;
dump_data('x');

sub dump_data {
my $id = shift;

my $m = WWW::Mechanize->new;
$m->get('http://search.cpan.org/author/?'.$id);

foreach ( $m->find_all_links(url_abs_regex => qr/\~.*?/) ) {
my $link = $_->url_abs;

$m->get( $link );
my ($mail) = $m->content =~ /<a href=.*?mailto:.*?>(.*?)<\/a>/;

push @{$data{$id}}, { url => join('', $link), cname => $_->text, email => $mail };
}

print Dumper \%data;
}



dilbert
User

Jun 29, 2013, 7:17 AM

Post #13 of 13 (531 views)
Re: [hwnd] minor changes in a script - an easy one [In reply to] Can't Post

hi many thanks

looks great anyway.

well hwnd - this is for learing and digging deeper into
perl
mecha
and all the lwp-stuff

so i learn with every snippet and allmost every approach

thx for all you do.

greetings

 
 


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives