
dilbert
User
Nov 12, 2017, 10:50 AM
Post #11 of 11
(6006 views)
Re: [FishMonger] Global symbol errors - all along the way... need help with a little script
Hello FishMonger, many thanks for the reply; great to hear from you. You're right. I have made some efforts in both PHP and Perl, and for some tasks Perl is the language of choice.

Below is the code that works; it is the base for some further changes.

The new task: I want to modify the script a bit, since tailoring and tinkering is the way to learn. I want to fetch only the URLs that contain a certain term in the URL string, for example "/bar". After fetching those URLs I want to strip the "/bar" part, so that what remains is the base URL of the whole construct, for a URL such as http://www.xy.com/participants-database/

But first of all, here is the code that works, the base of my weekend project:
    #!C:\Perl\bin\perl
    use strict;    # You always want to include both strict and warnings
    use warnings;

    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor;

    # There was no reason for this to be in a BEGIN block (and there
    # are a few good reasons for it not to be)
    open my $file1, "+>>", "links.txt" or die "Cannot open links.txt: $!";
    select($file1);    # every print below now goes to links.txt

    # The URL I want it to start at;
    # Note that I've made this an array, @urls, rather than a scalar, $URL
    my @urls = ('http://www.cems.org/academic-members/our-members/list/');
    my %visited;    # The % sigil indicates it's a hash
    my $browser = LWP::UserAgent->new();
    $browser->timeout(5);

    while (@urls) {
        my $url = shift @urls;

        # Skip this URL and go on to the next one if we've
        # seen it before
        next if $visited{$url};

        my $request  = HTTP::Request->new(GET => $url);
        my $response = $browser->request($request);

        # No real need to invoke printf if we're not doing
        # any formatting
        if ($response->is_error()) { print $response->status_line, "\n"; }
        my $contents = $response->content();

        # Now that we've got the url's content, mark it as
        # visited
        $visited{$url} = 1;

        my ($page_parser) = HTML::LinkExtor->new(undef, $url);
        $page_parser->parse($contents)->eof;
        my @links = $page_parser->links;

        # Each link is [tag, attribute, url, ...], so element 2 is the URL
        foreach my $link (@links) {
            print "$$link[2]\n";
            push @urls, $$link[2];
        }
        sleep 60;
    }

The results: I got back more than 200 lines; a sample of the output is at the end of this post.

As described above, I now want to keep only the URLs that contain a certain term and reduce each of them to its base URL. Given the following results:
    http://www.1.com/participants-database/
    http://www.2.com/participants-database/
    http://www.3.com/participants-database/
    http://www.4.com/participants-database/
    http://www.5.com/participants-database/
    http://www.6.com/participants-database/
    http://www.7.com/participants-database/

First of all I have to fetch the URLs. Then I have to strip the port and the path, that is, the parts printed by

    print "PORT: ", $uri->port, "\n";
    print "PATH: ", $uri->path, "\n";

so that I have the following results:
    http://www.1.com/
    http://www.2.com/
    http://www.3.com/
    http://www.4.com/
    http://www.5.com/
    http://www.6.com/
    http://www.7.com/

Well, to achieve this I need to tailor the script a bit. And yes, I think I need split or regular expressions. Why: once I have the results (I guess 200 or more lines), I want to extract parts of the URLs. I have to strip each URL of the form scheme://domain:port/participants-database/ so that only the scheme and the domain remain.

That can be done with Perl like so. The general format of a URL is

    scheme://domain:port/path?query_string#fragment_id

While domain (and possibly other parts of the URL) may contain Unicode characters, in the following we assume that only ASCII characters are used. Furthermore, we assume that
- scheme only consists of letters a–z and A–Z;
- domain does not contain :, ?, # or /;
- port is a natural number, :port is optional;
- path does not contain ? or #, path is optional;
- query_string does not contain #, ?query_string is optional;
- fragment_id can contain arbitrary characters, #fragment_id is optional.

Here is my code:

    my @urls = (
        "http://www.example.com/",
        "http://www80.local.com:80/",
        "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time",
    );

    foreach (@urls) {
        print "URL: $_\n";

        # One capture group per URL part, following the assumptions
        # listed above; port, query and fragment are optional
        my ($scheme, $domain, $port, $path, $query, $fragment) =
            m{^([a-zA-Z]+)://([^:?#/]+)(?::(\d+))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$};

        # Optional parts that are absent come back undef; print them as empty
        for my $part ($port, $query, $fragment) { $part //= ''; }

        print "SCHEME: $scheme, DOMAIN: $domain, PORT: $port\n";
        print "PATH: $path\n";
        print "QUERY: $query\n";
        print "FRAGMENT: $fragment\n\n";
    }

Well, to achieve that I can also use the URI module:
    use URI;

    my @urls = (
        "http://www.example.com/",
        "http://www80.local.com:80/",
        "https://www.ex221.ac.uk:442/perl/rulez?all+q#all.time",
    );

    foreach (@urls) {
        my $uri = URI->new($_);
        print "URL: $_\n";
        print "SCHEME: ",   $uri->scheme,   "\n";
        print "DOMAIN: ",   $uri->host,     "\n";
        print "PORT: ",     $uri->port,     "\n";
        print "PATH: ",     $uri->path,     "\n";
        print "QUERY: ",    $uri->query,    "\n";
        print "FRAGMENT: ", $uri->fragment, "\n";
    }

Back to the code that already works (see above). To recall, the starting point was http://www.cems.org/academic-members/our-members/list/

See the results:
    http://www.cems.org/sites/all/themes/cems_theme/favicon.ico
    http://www.cems.org/rss/news.xml
    http://www.cems.org/sites/default/files/css/css_fbccd6cf1d744a02e3d3c96b13899abc.css
    http://www.cems.org/sites/default/files/css/css_cb1d8f9de90605e479255100ae34fad0.css
    http://www.cems.org/sites/default/files/js/js_dcc6ca7e3b31340a2b20a3293ea00940.js
    https://plus.google.com/112980751747702528942
    http://www.cems.org/
    http://www.cems.org/sites/all/themes/cems_theme/images/custom/cems-logo.png
    http://www.cems.org/
    http://www.cems.org/about/contacts
    http://www.cems.org/lostpassword
    https://cas.cems.org:443/cas/login?service=http://www.cems.org/cas&locale=en
    http://www.cems.org/academic-members/our-members/list/
    http://www.cems.org/about
    http://www.cems.org/about/overview
    http://www.cems.org/about/mission
    http://www.cems.org/about-cems/overview/key-facts-figures
    http://www.cems.org/about/alumni-profiles
    http://www.cems.org/about/global
    http://www.cems.org/sustainability
    http://www.cems.org/sustainability/strategy
    http://www.cems.org/sustainability/implementation
    http://www.cems.org/sustainability/projects-profiles
    http://www.cems.org/about/history
    http://www.cems.org/about/organisation
    http://www.cems.org/about/organisation/boards
    http://www.cems.org/about/organisation/headoffice
    http://www.cems.org/about/organisation/committees
    http://www.cems.org/academic-members/faculty-groups
    http://www.cems.org/about/organisation/student
    http://www.cems.org/about/organisation/alumni
    http://www.cems.org/about/contacts
    http://www.cems.org/about-cems/contacts/head-office
    http://www.cems.org/about/contacts/programme-managers
    http://www.cems.org/students/student-life/student-board/members
    http://www.cems.org/about/contacts/cems-club

Well, I am trying to get ahead. I would love to hear from you.
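For reference, the filter-and-strip step described in this post could be sketched roughly as follows. This is only a sketch under assumptions: the helper names base_url and bases_matching are made up for illustration, the sample list stands in for the links the crawler collects, and the pattern relies on the same ASCII-only URL assumptions listed earlier. It sticks to core Perl so it runs without extra modules; going through URI->new($url) and its scheme and host methods would probably be the more robust route.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL to "scheme://domain/",
# dropping port, path, query and fragment.
# Returns undef (in scalar context) if the string does not look like a URL.
sub base_url {
    my ($url) = @_;
    if ($url =~ m{^([a-zA-Z]+)://([^:/?#]+)}) {
        return $1 . '://' . $2 . '/';
    }
    return;
}

# Hypothetical helper: keep only the URLs that contain $term,
# reduce each one to its base URL and drop duplicates.
sub bases_matching {
    my ($term, @urls) = @_;
    my (%seen, @bases);
    for my $url (@urls) {
        next unless index($url, $term) >= 0;    # URL must contain the term
        my $base = base_url($url);
        next unless defined $base;              # skip anything unparsable
        push @bases, $base unless $seen{$base}++;
    }
    return @bases;
}

# Sample input standing in for the links collected by the crawler
my @links = (
    'http://www.1.com/participants-database/',
    'http://www.2.com/participants-database/',
    'http://www.cems.org/about/contacts',
);

print "$_\n" for bases_matching('/participants-database/', @links);
```

With the sample list this prints http://www.1.com/ and http://www.2.com/, one per line. In the crawler, the same call could be applied to the URLs taken from $$link[2] before they are printed.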
(This post was edited by dilbert on Nov 12, 2017, 11:12 AM)