
Parsing out part of an html document

 



jeff
Deleted

Nov 1, 2000, 4:12 PM

Post #1 of 7 (800 views)
Parsing out part of an html document

Hi everyone. I am having a problem with what should be a simple thing. I need to take a mountain of html files, open them, and parse out the title.

If the writer of the web file has written the title tags perfectly, i.e. <title>my title</title>, there are no problems. It is when the writer has stuck in extra spaces or carriage returns between the title open and title close tags that I run into trouble.

I am sure there is an easy way to do this. Any ideas would be wonderful.

Thanks much for reading this,

Jeff


sleuth
Enthusiast / Moderator

Nov 1, 2000, 9:01 PM

Post #2 of 7 (800 views)
Re: Parsing out part of an html document

 
A regex can take care of this. We will say that $file holds the contents of the html file you're trying to parse.

$file =~ m,<title>(.*)</title>,s;
($title = $1) =~ s!\s+!!g;

That should get the title no matter how many newlines are between the tags, then strip the extra whitespace from it. Runs such as "  " or "   " should be removed, but single spaces " " should be left alone.

That's off the top of my head. I assume the "s" modifier can be used with the match operator, although I haven't tried it yet. Logically it should work the same.

Give it a whirl though, I'll test it later tonight for you anyway.

Sleuth


sleuth
Enthusiast / Moderator

Nov 1, 2000, 9:07 PM

Post #3 of 7 (800 views)
Re: Parsing out part of an html document

 
Yes, it works. The code to take out the spaces is wrong, though; it should be like this:

($title = $1) =~ s!\s\s+! !g;

The difference is \s\s+ instead of \s+ in the pattern, and a single space " " in the replacement part.
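
To make the difference concrete, here is a small sketch run on a made-up string ($raw is only an illustration, not something from the thread). It compares the first substitution with the corrected one.

```perl
use strict;
use warnings;

# Made-up title text with a newline and runs of spaces inside it.
my $raw = "  My\n   Home   Page ";

# First attempt: every run of whitespace is deleted, so the words run together.
(my $all_removed = $raw) =~ s!\s+!!g;

# Corrected version: only runs of two or more whitespace characters
# are replaced, each by a single space.
(my $collapsed = $raw) =~ s!\s\s+! !g;

print "[$all_removed]\n";
print "[$collapsed]\n";
```

Note that \s\s+ needs at least two whitespace characters in a row, so a lone newline between two words would be left in place.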

So

$file =~ m,<title>(.*)</title>,s;
($title = $1) =~ s!\s\s+! !g;

is definitely going to do it.
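
As a rough check of the "s" modifier, the sketch below uses a made-up string in place of a real page. Without "s", "." does not match a newline, so a title split across lines is missed.

```perl
use strict;
use warnings;

# Hypothetical page contents standing in for $file; the title spans several lines.
my $file = "<html><head><title>\n  My   Title\n</title></head></html>";

# Without the "s" modifier "." cannot cross a newline, so nothing matches.
my $matched_without_s = ($file =~ m,<title>(.*)</title>,) ? 1 : 0;

# With "s" the match succeeds and $1 holds everything between the tags.
my $title;
if ($file =~ m,<title>(.*)</title>,s) {
    ($title = $1) =~ s!\s\s+! !g;
}

print "matched without s: $matched_without_s\n";
print "title: [$title]\n";
```

The result can still carry a leading space or a trailing newline, so a separate trim step may be wanted depending on how the titles are used.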

Sleuth


jeff
Deleted

Nov 2, 2000, 12:02 PM

Post #4 of 7 (800 views)
Re: Parsing out part of an html document

Thanks much Sleuth! I think my intelligence cycle is low lately.

I was wondering if you would look over this snippet of code, as I still can't get it to work. I am sure I am missing something easy.

Thanks very very much

Jeff

while (defined ($filename=glob("*.htm" ))){
open (FILE , $filename) ||
die "Cant open $filename: $!";
&read_title;
}
close(FILE) || die "cant close directory: $!";


sub read_title{

$filename=~m,<title>(.*)</title>,s;

($title=$1)=~s!\s\s\s+! !g;
print "$filename\n";
print "$title\n";
return;
}


sleuth
Enthusiast / Moderator

Nov 2, 2000, 7:33 PM

Post #5 of 7 (800 views)
Re: Parsing out part of an html document

 
Hello Jeff,

There are a few things wrong. One is that you're trying to get the title out of the file's name: the match runs against $filename, and the file is opened but never read. It looks like it should work, but after putting it to the test, I found that this is what it was doing.

Anyway, I made this, and it's a fast way. You said that all of your pages had titles; I hope so, because this code needs every page to have a title, or at least an empty <title></title> pair.
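
Here is a small sketch of why that fails, with a made-up file name and made-up contents (both are assumptions for illustration). The same pattern is applied to each string.

```perl
use strict;
use warnings;

my $filename = "index.htm";                               # hypothetical file name
my $contents = "<head><title>Home\nPage</title></head>";  # hypothetical file contents

# The file name contains no <title> tags, so this match fails.
my $from_name = ($filename =~ m,<title>(.*)</title>,s) ? $1 : undef;

# The contents do, so this one succeeds.
my $from_contents = ($contents =~ m,<title>(.*)</title>,s) ? $1 : undef;

print defined $from_name ? "name matched\n" : "name did not match\n";
print "contents gave: $from_contents\n";
```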

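Code:

# Collect the file names so they can be paired with the titles later.
while (defined ($filename=glob("*.html"))){
    push(@files, $filename);
}
{
    # With $/ undefined, <> returns each file in @ARGV as one whole string,
    # so a title split across several lines can still match.
    local @ARGV = glob "*.html";
    local $/;
    @data = <>;
}
# Each element of @data now holds the contents of one file.
foreach $line (@data){
    if ($line =~ m,<title>(.*)</title>,s){
        $c = $n++;
        ($titles[$c]=$1)=~s!\s\s+! !g;
    }
}
# Files and titles are paired by position, which is why every page needs a title.
foreach $number (0..$c){
    print "$files[$number] - $titles[$number]\n";
}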
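
If some pages might lack a title, one possible variation is to key the titles by file name rather than by position. This is only a sketch under that assumption: titles_for is a hypothetical helper name, not part of the code above, and it opens each file itself instead of going through @ARGV.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash mapping each file name to its title,
# or to undef when the file has no <title>...</title> pair.
sub titles_for {
    my @names = @_;
    my %title_of;
    foreach my $name (@names) {
        open(my $fh, '<', $name) or die "Can't open $name: $!";
        local $/;                     # undefined separator: read the whole file at once
        my $contents = <$fh>;
        close($fh);
        if (defined $contents && $contents =~ m,<title>(.*)</title>,s) {
            (my $title = $1) =~ s!\s\s+! !g;
            $title_of{$name} = $title;
        } else {
            $title_of{$name} = undef;
        }
    }
    return %title_of;
}

# Usage: print one line per file in the current directory.
my %title_of = titles_for(glob "*.html");
foreach my $name (sort keys %title_of) {
    my $title = defined $title_of{$name} ? $title_of{$name} : "(no title)";
    print "$name - $title\n";
}
```

Because each title is stored under its own file name, a page without a title cannot shift the titles of the pages after it.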

Sleuth


sleuth
Enthusiast / Moderator

Nov 3, 2000, 10:23 PM

Post #6 of 7 (800 views)
Re: Parsing out part of an html document

 
Glad it worked out for ya,

Sleuth


jeff
Deleted

Nov 4, 2000, 8:49 AM

Post #7 of 7 (800 views)
Re: Parsing out part of an html document

Thanks so much Sleuth. It worked like a charm and the results are exactly what I needed. I take the result, run it through ucfirst, then sort. Out comes a grand list of all the titles in alphabetical order.
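
That last step might look something like this sketch; the @titles list is made up for illustration. Perl's default sort compares strings character by character, so uppercase letters come before lowercase ones, which is why capitalizing first gives a cleaner alphabetical list.

```perl
use strict;
use warnings;

# Made-up titles standing in for the ones pulled from the pages.
my @titles = ("zebra facts", "about us", "Contact", "my title");

# Capitalize the first letter of each title, then sort with the default string order.
my @sorted = sort map { ucfirst($_) } @titles;

print "$_\n" for @sorted;
```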

Thanks again,
jeff

 
 

