


dilbert
User

Nov 12, 2017, 10:50 AM


Views: 10653
Re: [FishMonger] Global symbol errors - all along the way... need help with a little script

Hello dear FishMonger,

many thanks for the reply, great to hear from you. You're right.

I have made some efforts in PHP and Perl; for some tasks Perl is the language of choice.

Below is the code that works; it is the base for some further changes. The new task: I want to modify the script a bit, since tailoring and tinkering is the way to learn. I want to fetch URLs that have a certain term in the URL string.

In other words, the aim is:

- I need to fetch all the URLs that contain the term "/bar".
- After fetching the URLs I want to strip the "/bar" part, so that only the base URL remains of a whole construct such as http://www.xy.com/participants-database/


But first of all, here is the code that works, the base of my weekend project:


Code
 
#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1, '+>>', 'links.txt' or die "Cannot open links.txt: $!";
select($file1);

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('http://www.cems.org/academic-members/our-members/list/');
my %visited; # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
    my $url = shift @urls;

    # Skip this URL and go on to the next one if we've
    # seen it before
    next if $visited{$url};

    my $request = HTTP::Request->new(GET => $url);
    my $response = $browser->request($request);

    # No real need to invoke printf if we're not doing
    # any formatting
    if ($response->is_error()) {print $response->status_line, "\n";}
    my $contents = $response->content();

    # Now that we've got the url's content, mark it as
    # visited
    $visited{$url} = 1;

    my ($page_parser) = HTML::LinkExtor->new(undef, $url);
    $page_parser->parse($contents)->eof;
    my @links = $page_parser->links;

    foreach my $link (@links) {
        print "$$link[2]\n";
        push @urls, $$link[2];
    }
    sleep 60;
}
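
The filtering step described at the top of this post (keep only the links whose URL contains a given term) could be sketched roughly like this. It is only a sketch under assumptions: the helper name links_containing, the sample HTML and the term /participants-database are placeholders made up for illustration, and it runs on an inline HTML string instead of a fetched page.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper (not part of the original script): return the
# absolute links found in $html whose URL contains $term.
sub links_containing {
    my ($html, $base, $term) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;

    my @matches;
    for my $link ($parser->links) {
        # Each entry looks like [ tag, attribute => url, ... ]
        my ($tag, %attrs) = @$link;
        for my $url (values %attrs) {
            push @matches, "$url" if index("$url", $term) >= 0;
        }
    }
    return @matches;
}

# Made-up HTML, for illustration only
my $html = '<a href="/participants-database/">A</a> <a href="/about">B</a>';
print "$_\n"
    for links_containing($html, 'http://www.xy.com/', '/participants-database');
```

Inside the crawler loop, the same index() test could guard the print line, so that only matching links end up in links.txt.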



The result: I got back more than 200 lines (see the output sample at the end of this post).

For the new task, assume the crawl returns the following results:



Code
 
http://www.1.com/participants-database/
http://www.2.com/participants-database/
http://www.3.com/participants-database/
http://www.4.com/participants-database/
http://www.5.com/participants-database/
http://www.6.com/participants-database/
http://www.7.com/participants-database/



First of all I have to fetch the URLs; then I have to strip the port and the path, i.e. the parts that the URI module reports via:


Code
print "PORT: ", $uri->port, "\n";
print "PATH: ", $uri->path, "\n";



so that i have the following results:


Code
http://www.1.com/ 
http://www.2.com/
http://www.3.com/
http://www.4.com/
http://www.5.com/
http://www.6.com/
http://www.7.com/
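
A minimal sketch of that stripping step, assuming the URI module is installed and that every URL is an http or https URL (other schemes may not have a host). The helper name base_url, the term /participants-database and the second and third sample URLs are placeholders made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: reduce a URL to scheme://host/
# (port, path, query and fragment are dropped).
sub base_url {
    my ($url) = @_;
    my $uri = URI->new($url);
    return $uri->scheme . '://' . $uri->host . '/';
}

my $term  = '/participants-database';   # assumed search term (the "/bar")
my @found = (
    'http://www.1.com/participants-database/',
    'http://www.2.com:80/participants-database/',   # made-up variant with a port
    'http://www.3.com/about',                       # made-up non-matching URL
);

my %seen;
for my $url (@found) {
    next unless index($url, $term) >= 0;   # keep only URLs containing the term
    my $base = base_url($url);
    next if $seen{$base}++;                # print each base URL only once
    print "$base\n";
}
```

With the sample list above this prints http://www.1.com/ and http://www.2.com/; the %seen hash is there because a crawl of 200 or more lines will probably contain the same site many times.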



Well, to achieve this I need to tailor the script a bit. And yes, I think that I need split. Why? Once I have the results (I guess 200 or more lines), I want to extract parts of the URLs using regular expressions. I have to strip each URL

- scheme://domain:port/participants-database/

so that I get the domain alone.

That can be done with Perl like so, given that the general format of a URL is scheme://domain:port/path?query_string#fragment_id

While the domain (and possibly other parts of the URL) may contain Unicode characters, in the following we assume that only ASCII characters are used. Furthermore, we assume that



Code
 
scheme consists only of the letters a-z and A-Z;
domain does not contain :, ?, # or /;
port is a natural number, :port is optional;
path does not contain ? or #, path is optional;
query_string does not contain #, ?query_string is optional;
fragment_id can contain arbitrary characters, #fragment_id is optional.

Here is my code:

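use strict;
use warnings;

my @urls = (
    "http://www.example.com/",
    "http://www80.local.com:80/",
    "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time");

foreach (@urls) {
    print "URL: $_\n";
    # One capture group per URL part, following the assumptions listed above;
    # the optional parts are undef when they are missing
    my ($scheme, $domain, $port, $path, $query, $fragment) =
        m{^([a-zA-Z]+)://([^:?#/]+)(?::(\d+))?(/[^?#]*)?(?:\?([^#]*))?(?:#(.*))?$};
    # Replace missing parts with an empty string to avoid warnings
    $_ //= '' for $scheme, $domain, $port, $path, $query, $fragment;
    print "SCHEME: $scheme, DOMAIN: $domain, PORT: $port\n";
    print "PATH: $path\n"; print "QUERY: $query\n";
    print "FRAGMENT: $fragment\n\n";
}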



Alternatively, to achieve that I can use the URI module:



Code
 
use strict;
use warnings;
use URI;

my @urls = (
    "http://www.example.com/",
    "http://www80.local.com:80/",
    "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time");

foreach (@urls) {
    my $uri = URI->new($_);
    print "URL: $_\n";
    print "SCHEME: ", $uri->scheme, "\n";
    print "DOMAIN: ", $uri->host, "\n";
    print "PORT: ", $uri->port, "\n";
    print "PATH: ", $uri->path, "\n";
    # query and fragment are undef when the URL has none
    print "QUERY: ", ($uri->query // ''), "\n";
    print "FRAGMENT: ", ($uri->fragment // ''), "\n";
}




Back to the code that already works (see above).

As a reminder, the starting URL was http://www.cems.org/academic-members/our-members/list/

Here are the results:



Code
http://www.cems.org/sites/all/themes/cems_theme/favicon.ico 
http://www.cems.org/rss/news.xml
http://www.cems.org/sites/default/files/css/css_fbccd6cf1d744a02e3d3c96b13899abc.css
http://www.cems.org/sites/default/files/css/css_cb1d8f9de90605e479255100ae34fad0.css
http://www.cems.org/sites/default/files/js/js_dcc6ca7e3b31340a2b20a3293ea00940.js
https://plus.google.com/112980751747702528942
http://www.cems.org/
http://www.cems.org/sites/all/themes/cems_theme/images/custom/cems-logo.png
http://www.cems.org/
http://www.cems.org/about/contacts
http://www.cems.org/lostpassword
https://cas.cems.org:443/cas/login?service=http://www.cems.org/cas&locale=en
http://www.cems.org/academic-members/our-members/list/
http://www.cems.org/about
http://www.cems.org/about/overview
http://www.cems.org/about/mission
http://www.cems.org/about-cems/overview/key-facts-figures
http://www.cems.org/about/alumni-profiles
http://www.cems.org/about/global
http://www.cems.org/sustainability
http://www.cems.org/sustainability/strategy
http://www.cems.org/sustainability/implementation
http://www.cems.org/sustainability/projects-profiles
http://www.cems.org/about/history
http://www.cems.org/about/organisation
http://www.cems.org/about/organisation/boards
http://www.cems.org/about/organisation/headoffice
http://www.cems.org/about/organisation/committees
http://www.cems.org/academic-members/faculty-groups
http://www.cems.org/about/organisation/student
http://www.cems.org/about/organisation/alumni
http://www.cems.org/about/contacts
http://www.cems.org/about-cems/contacts/head-office
http://www.cems.org/about/contacts/programme-managers
http://www.cems.org/students/student-life/student-board/members
http://www.cems.org/about/contacts/cems-club


Well, I am trying to get ahead; now I want to ...

Love to hear from you.


(This post was edited by dilbert on Nov 12, 2017, 11:12 AM)


Edit Log:
Post edited by dilbert (User) on Nov 12, 2017, 10:53 AM
Post edited by dilbert (User) on Nov 12, 2017, 10:56 AM
Post edited by dilbert (User) on Nov 12, 2017, 11:06 AM
Post edited by dilbert (User) on Nov 12, 2017, 11:08 AM
Post edited by dilbert (User) on Nov 12, 2017, 11:12 AM

