Obtaining urls from source code

 



wyndcrosser
Novice

Oct 30, 2012, 9:34 PM

Post #1 of 11 (3657 views)
Obtaining urls from source code

Okay, so my company thinks that because I had a 7-week class on Perl (completed, got a decent grade) I'm a genius now (sarcasm). They've asked me for some assistance on a project. We are moving servers and networks, and it's a crazy mess. My boss wants me to pull the URLs from some of our intranet sites by viewing the source code, so we can see how we might want to configure the new SharePoint and intranet sites (we have so much fluff in our site locations that some files haven't been updated since about 2009, while people have been creating them elsewhere). So it's basically web crawling across our network. We had a huge layoff and it's been nuts.

What I thought about doing is using Perl to pull the URLs from a saved .txt or .html file and add them to an array (something I used to hate in Java, but find nicer in Perl). Everything will print out, and I can copy it into a spreadsheet/Word doc and start my Visio workflow diagram. I'd ask my scripting guru at work, but he's out with a new kid. So now you see my dilemma. Wanna assist? Thanks for reading.

Being an intern, I can't VPN in to access the intranet sites, but it's a project he'd like me to help with. So I've just been using source views from Firefox, or making my own "one or two" entries to test my code.

use strict;
use warnings;

my $htmlfile = "testhtml.txt";
my @htmlarray;
my $each_line;

open(my $dat, '<', $htmlfile) or die "Dude, no file by that name: $!";
@htmlarray = <$dat>;
close($dat);

Now this pulls my fake test HTML (if it's in the same directory) and lets me print from the array. My plan was to use a foreach loop with $each_line to acquire each line of text, and then a regular expression to verify the information before printing it out so I can copy it to Notepad/Word/etc.

Below is where I get lost. I know the idea is sound and should be rather easy to accomplish. Thanks again for any assistance.

foreach $each_line (@htmlarray)
{
    if ( !($each_line =~ /^http/) && ($each_line =~ /^\//) )
    {
        # this is where I get stuck
    }
}

Scheme

open file

add file data to array

find lines of URLs

print them out so I can copy them to a text document (probably between 1,000 and 1,500 URLs, or a lot more)
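The four steps above can be sketched end to end. The file name follows the earlier snippet, and the `href="..."` pattern is one plausible assumption about what the intranet pages contain, not the OP's actual data:

```perl
use strict;
use warnings;

# step 1: open the file (same test file name as the snippet above)
my $htmlfile = "testhtml.txt";
open(my $dat, '<', $htmlfile) or die "Dude, no file by that name: $!";

# step 2: add the file data to an array
my @htmlarray = <$dat>;
close($dat);

# steps 3-4: find the URLs inside href="..." attributes and print them
foreach my $each_line (@htmlarray) {
    while ($each_line =~ m{href="([^"]+)"}g) {
        print "$1\n";
    }
}
```

This prints one URL per line, which can then be pasted into a spreadsheet as planned.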
Perl Newbie - 7 months of PERL basics.


rovf
Veteran

Oct 31, 2012, 3:01 AM

Post #2 of 11 (3631 views)
Re: [wyndcrosser] Obtaining urls from source code [In reply to]

I don't know by which criterion you want to recognize URLs in your files. According to your code, you are searching for lines which start with "http" and start with "/" at the same time, so it is impossible to come up with a valid line.

A more reasonable approach would be to grep for all strings starting with "http://" or "https://", up to the first whitespace, but this might also return strings embedded in a comment, for instance.
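That grep-for-"http" approach could be sketched like this; the sample lines in the DATA section are made up for illustration:

```perl
use strict;
use warnings;

my @urls;
while (my $line = <DATA>) {
    # capture http:// or https:// up to the first whitespace, quote, or angle bracket
    while ($line =~ m{(https?://[^\s"'<>]+)}g) {
        push @urls, $1;
    }
}
print "$_\n" for @urls;

__DATA__
<a href="http://www.example.com/a.html">A</a>
see https://www.example.com/b.html for details
```

As noted above, this happily matches URLs inside comments too, so the output may need a manual once-over.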


Laurent_R
Veteran / Moderator

Oct 31, 2012, 5:57 AM

Post #3 of 11 (3591 views)
Re: [rovf] Obtaining urls from source code [In reply to]


Code
if ((!($each_line=~/^http/))&&($each_line=~/^\//))


is actually looking for lines not starting with http (because of the '!'), and starting with '/', which is not contradictory, but redundant, since lines starting with '/' are bound not to start with 'http'. But it is probably not the OP's intention.

@wyndcrosser: please explain what the lines you are looking for look like and what you want to extract from them.


FishMonger
Veteran / Moderator

Oct 31, 2012, 6:17 AM

Post #4 of 11 (3582 views)
Re: [wyndcrosser] Obtaining urls from source code [In reply to]

HTML::LinkExtor - Extract links from an HTML document
http://search.cpan.org/~gaas/HTML-Parser-3.69/lib/HTML/LinkExtor.pm
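A hedged sketch of how HTML::LinkExtor might be used here (it ships with the HTML-Parser distribution from CPAN; the file name testhtml.txt is taken from the OP's earlier snippet):

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # from the CPAN HTML-Parser distribution

my @links;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    # href covers <a> and <link>; other tags report src, background, etc.
    push @links, $attr{href} if defined $attr{href};
});
$parser->parse_file('testhtml.txt');   # the OP's test file
print "$_\n", for @links;
```

Using a real parser sidesteps the multi-line and comment problems that plain regexes run into.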


rovf
Veteran

Oct 31, 2012, 7:53 AM

Post #5 of 11 (3572 views)
Re: [Laurent_R] Obtaining urls from source code [In reply to]

Ah, thanks for pointing that out; I missed the '!'


wyndcrosser
Novice

Nov 1, 2012, 8:37 PM

Post #6 of 11 (3503 views)
Re: [rovf] Obtaining urls from source code [In reply to]

Yeah, I found that regex while looking at web crawlers.

I've got about 80% of it down: it finds "a href" lines and identifies which lines they are on after loading the array, so I think I just need to work on my regex formatting and I should be set.

I want to use Perl (I have used the site you guys mentioned for the URL parser) as it's something I can practice with in case I decide to do a level-200 Perl class, or higher, in the future.

If anyone has a cleaner attempt at a regex that would at least get me past the differences in source code (ahref, href, etc.), I'd greatly appreciate it.

I'll upload my code soon; for some reason I can't do file I/O with the program now... argh!?! lol
Perl Newbie - 7 months of PERL basics.


Laurent_R
Veteran / Moderator

Nov 2, 2012, 3:52 AM

Post #7 of 11 (3499 views)
Re: [wyndcrosser] Obtaining urls from source code [In reply to]

Give us a sample of your data and explain what you want to extract from it.


wyndcrosser
Novice

Nov 4, 2012, 12:07 PM

Post #8 of 11 (3462 views)
Re: [Laurent_R] Obtaining urls from source code [In reply to]

I can't give the exact code, as it's on our intranet site.

For example:

<a href="http://www.holycow.org/%myfiles/ManDept/p1.html">Program
1 - sequential statement</a><br>
<a href="http://www.holycow.org/%myfiles/MarketingDept/p1.html">Program

I've got the code to return any line starting with an <a href="http://...

My questions:
- How do I add those lines to an array? (This would be the best method, right? My boss said he'd like it that way if possible.) I've been able to do a foreach loop and add the data, but I'd like to hear your opinion.
- After your regular expression finds the lines with URLs, how do you get it to print out just http://www.whatever.co.uk.index.html, etc., without all the added details? My issue is that I want to be able to copy it into a Word/Excel doc and divide them up for my Visio layout.

Thanks guys

This is just a very basic attempt. I'm still working on it. I've got work around the house to complete as well.

while (my $line = <STDIN>) {
    # note: no space between href= and the quote, or the pattern won't match
    if ($line =~ m{<a href="http://(.+)"}) {
        print "$line\n";
    }
}
Perl Newbie - 7 months of PERL basics.

(This post was edited by wyndcrosser on Nov 4, 2012, 12:09 PM)


Laurent_R
Veteran / Moderator

Nov 4, 2012, 1:44 PM

Post #9 of 11 (3457 views)
Re: [wyndcrosser] Obtaining urls from source code [In reply to]

To add a line to an array, just use the push function on that array.

To parse the lines for URLs within quote marks, you could use something like this:

print $1 if /"([^"]*)"/;

This will capture things between quote marks. But if a URL is spread over more than one line, this will not work. There are probably other reasons why this would not work, and then you'll really need a real parser, i.e. most probably a CPAN module.
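The two suggestions above (push onto an array, then the quote-capturing regex) can be combined into one small sketch; the DATA lines mimic the sample HTML posted earlier in the thread:

```perl
use strict;
use warnings;

my @urls;
while (my $line = <DATA>) {
    # capture whatever sits between the href quote marks
    push @urls, $1 if $line =~ /href="([^"]*)"/;
}
# print only the URL, without the surrounding tag clutter
print "$_\n" for @urls;

__DATA__
<a href="http://www.holycow.org/%myfiles/ManDept/p1.html">Program 1</a>
<a href="http://www.holycow.org/%myfiles/MarketingDept/p1.html">Program 2</a>
```

This prints just the two URLs, which answers the "without all those added details" part of the question, subject to the single-line caveat above.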


wyndcrosser
Novice

Nov 7, 2012, 6:38 PM

Post #10 of 11 (3259 views)
Re: [Laurent_R] Obtaining urls from source code [In reply to]

Hey Laurent.

This is what I ended up doing.

if ($line =~ m/href\=\"(.\S+)\"/ig)  # I believe "i" is used to ignore matches
Perl Newbie - 7 months of PERL basics.

(This post was edited by wyndcrosser on Nov 16, 2012, 4:24 PM)


Laurent_R
Veteran / Moderator

Nov 7, 2012, 11:20 PM

Post #11 of 11 (3249 views)
Re: [wyndcrosser] Obtaining urls from source code [In reply to]

In a regex, the "i" modifier is to ignore case (lower case or upper case).
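For illustration, a tiny made-up example of what the "i" modifier changes:

```perl
use strict;
use warnings;

# upper-case HREF, as some hand-written intranet pages might have
my $line = '<A HREF="http://example.com/">link</A>';

if ($line =~ /href="([^"]*)"/i) {   # /i matches HREF as well as href
    print "$1\n";                    # prints http://example.com/
}
```

Without the /i, the pattern would silently miss every upper-case HREF in the pages being scanned.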

When opening a file, you should test for success or failure with something like:


Code
open my $FH, "<", $file or die "could not open $file: $!\n";

Otherwise, is your code doing what you want, or is it doing something wrong?

 
 

