Pattern matching issue with HTML file, please help...

 



chetwyn
Novice

Apr 25, 2006, 8:49 PM

Post #1 of 6 (1550 views)

Hi guys.

This is the example string I'm working with.

<DIV class=listingName><A
href="http://www.yellowpages.com.au/onlineSolution_moreInfo.do?z=200001&amp;iblName=Scott+%26+Sons+Plumbers+Drainers+%26+Gasfitters&amp;iblId=3639382&amp;authToken=10ad3dd6d99%7C88d8b6977f98206fe7c681b9e302225f&amp;st=cs">Scott
&amp; Sons Plumbers Drainers &amp; Gasfitters</A></DIV>
<SPAN> this is the spa </SPAN>

So I'm trying to get only the name of the business (which will of course be different for every listing), i.e.

"Scott & Sons Plumbers Drainers & Gasfitters"

It's almost working... I just can't get the "Scott" bit.

The actual file breaks after <A, then runs all the way to the end of "Scott", then a new line, then all the way to </DIV>, with the <SPAN> this is the spa </SPAN> on a new line.

This is the code I'm using so far.

open (DIR,"end.txt")||die "print, can not open file\n";
while(<DIR>){
chomp;
$_ =~ s/<A//gi;
$_ =~ s/<\/A>//gi;
$_ =~ s/(\&amp;)/\&/gi;
if($_ =~ /(D*)\</gim){
$_ =~ s/\<DIV class=listingName\>//;
$_ =~ s/\<SPAN.*//;
$_ =~ s/\<\/.*//;
print $_."\n";

}

}

The result of this is:

"& Sons Plumbers Drainers & Gasfitters"

But I need:

"Scott & Sons Plumbers Drainers & Gasfitters"

Please help... :)


(This post was edited by chetwyn on Apr 25, 2006, 8:54 PM)
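
A note on why "Scott" goes missing, judging from the code and sample above: the loop handles one line at a time, and the line that ends in `">Scott` contains no `<` character, so the `if` test fails and that line is never printed. One possible workaround, sketched here under the assumption that the whole file fits in memory, is to read it in one go and match across the line breaks. The sample is inlined in place of end.txt with the long href shortened, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample markup in place of end.txt; the href is shortened for illustration.
my $html = <<'HTML';
<DIV class=listingName><A
href="http://www.yellowpages.com.au/onlineSolution_moreInfo.do?z=200001&amp;st=cs">Scott
&amp; Sons Plumbers Drainers &amp; Gasfitters</A></DIV>
<SPAN> this is the spa </SPAN>
HTML

my @names;
# /s lets "." match newlines, so the link text may span several lines.
while ($html =~ m{<DIV\s+class=listingName>\s*<A\b[^>]*>(.*?)</A>}gis) {
    my $name = $1;
    $name =~ s/\s+/ /g;       # fold the line break into a single space
    $name =~ s/&amp;/&/gi;    # decode the one entity seen in the sample
    push @names, $name;
}
print "$_\n" for @names;
```

For a real file, the heredoc would be replaced by something like `my $html = do { local $/; <$fh> };` on an opened filehandle. This is still a regex over HTML, so it only holds up while the markup keeps this exact shape.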


KevinR
Veteran


Apr 25, 2006, 11:09 PM

Post #2 of 6 (1547 views)
Re: [chetwyn] Pattern matching issue with HTML file, please help...

This should probably have been in the beginner forum, but here is one possible way:


Code
use strict; 
use warnings;
use CGI qw(:cgi);
my $html_string = do {local $/; <DATA>};
my ($URI) = $html_string =~ /href=".+?\?([^"]+)"/i;
my $q = CGI->new($URI);
my $name = $q->param('iblName');
print $name;
__DATA__
<DIV class=listingName><A
href="http://www.yellowpages.com.au/onlineSolution_moreInfo.do?z=200001&amp;iblName=Scott+%26+Sons+Plumbers+Drainers+%26+Gasfitters&amp;iblId=3639382&amp;authToken=10ad3dd6d99%7C88d8b6977f98206fe7c681b9e302225f&amp;st=cs">Scott
&amp; Sons Plumbers Drainers &amp; Gasfitters</A></DIV>
<SPAN> this is the spa </SPAN>


I don't know how you get the HTML code so you will have to adapt the code to your needs. Using the CGI module for this might not be the most efficient way but it will probably get it right.
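
If loading CGI just to read one parameter feels heavy, the same idea can be sketched with core Perl only. This is a rough sketch rather than a proper URL parser: it assumes the parameters are separated by `&amp;` (or a bare `&`), the URL below is a shortened copy of the one in the sample, and the helper name `query_param` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the decoded value of one query-string parameter.
sub query_param {
    my ($url, $wanted) = @_;
    my ($query) = $url =~ /\?(.*)\z/s or return;
    for my $pair (split /&(?:amp;)?/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $value && $key eq $wanted;
        $value =~ tr/+/ /;                                 # "+" encodes a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;     # e.g. %26 becomes "&"
        return $value;
    }
    return;
}

my $url = 'http://www.yellowpages.com.au/onlineSolution_moreInfo.do?z=200001'
        . '&amp;iblName=Scott+%26+Sons+Plumbers+Drainers+%26+Gasfitters'
        . '&amp;iblId=3639382&amp;st=cs';

my $name = query_param($url, 'iblName');
print "$name\n";
```

Like the CGI version, this takes the name from the link's query string rather than from the visible link text, so it only helps if every listing carries an iblName parameter.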
-------------------------------------------------


davorg
Thaumaturge / Moderator

Apr 26, 2006, 1:30 AM

Post #3 of 6 (1544 views)
Re: [chetwyn] Pattern matching issue with HTML file, please help...

As you have found, parsing HTML with regular expressions is far harder than it looks like it should be. There's always one more problem to take care of.

The best way to parse HTML is to use an HTML parser - HTML::Parser or one of its subclasses.
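
As a rough illustration of that approach, here is a minimal sketch using HTML::Parser's version 3 handler API. It assumes HTML::Parser is installed from CPAN (it is not a core module), that each business name sits inside a `<DIV class=listingName>` as in the sample, and that those DIVs are not nested. The sample href is shortened and the variable names are made up.

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

my $html = <<'HTML';
<DIV class=listingName><A
href="http://www.yellowpages.com.au/onlineSolution_moreInfo.do?z=200001&amp;st=cs">Scott
&amp; Sons Plumbers Drainers &amp; Gasfitters</A></DIV>
<SPAN> this is the spa </SPAN>
HTML

my @names;
my $in_listing = 0;   # true while inside <DIV class=listingName>

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;   # tag and attribute names arrive lower-cased
        if ($tag eq 'div' && ($attr->{class} || '') eq 'listingName') {
            $in_listing = 1;
            push @names, '';
        }
    }, 'tagname, attr' ],
    text_h => [ sub {
        my ($text) = @_;
        $names[-1] .= $text if $in_listing;   # dtext arrives entity-decoded
    }, 'dtext' ],
    end_h => [ sub {
        my ($tag) = @_;
        $in_listing = 0 if $tag eq 'div';
    }, 'tagname' ],
);
$parser->parse($html);
$parser->eof;

# Collapse the line break inside the name and trim the ends.
for (@names) { s/\s+/ /g; s/^ | $//g; }
print "$_\n" for @names;
```

The parser does not care where the line breaks fall or whether attributes are quoted, which is exactly the kind of detail that keeps tripping up the line-by-line regex version.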

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


chetwyn
Novice

Apr 26, 2006, 6:10 AM

Post #4 of 6 (1538 views)
Re: [davorg] Pattern matching issue with HTML file, please help...

OK. Here's what I've done... It works pretty well. I'm not sure how to format in this forum but here it is:

#!/usr/bin/perl


srand();
$progname = $0;
$progname =~ s@(.*?)(/|\\)@@ig;
$generate = '';
$credit = <<EOT;
HTML Formatter
EOT
$amount = 0;
$replace = 0;
$lower = 0;
$upper = 0;
$quiet = 0;
$convert = 0;
$usetabs = 0;
$tags = "<html|</html|<body|</body|<head|</head|<title";
$tags .= "|<isindex|<link|<meta|<!doctype";
$tags .= "|<table|<tr|<th|<td|</th|</tr|</table|<caption";
$tags .= "|<thead|</thead|<tbody|</tbody";
$tags .= "|<p|<br|<blockquote|<hr|<center|</center|<div|</div";
$tags .= "|<col|<colgroup";
$tags .= "|<marquee|</marquee";
$tags .= "|<style|</style";
$tags .= "|<h1|<h2|<h3|<h4|<h5|<h6";
$tags .= "|<ul|</ul|<ol|</ol|<dl|</dl|<li|<dt|<dd|<dir|</dir|<menu|</menu";
$tags .= "|<map|<area|</map";
$tags .= "|<base|<basefont|<bgsound";
$tags .= "|<object|<applet|<param|</object|</applet|<embed|</embed";
$tags .= "|<frameset|<frame|<noframes|</noframes|</frame|</frameset";
$tags .= "|<form|</form|<input|<select|<option|</select|<textarea";
$tagindent = "<table|<tr|<td";
$tagindent .= "|<select|<form";
$tagindent .= "|<frameset";
$tagindent .= "|<ul|<ol|<dl|<dir|<menu|<map";
$tagunindent = "</table|</tr|</td";
$tagunindent .= "|</select|</form";
$tagunindent .= "|</frameset";
$tagunindent .= "|</ul|</ol|</dl|</dir|</menu|</map";
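use Getopt::Long; # replaces the obsolete newgetopt.pl; GetOptions sets the same $opt_* globals
GetOptions('n:i','r','l','u','c','q','h','t','todo');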
if ($opt_todo) {
print <<EOT;
$credit
To do list:
Everything's done? Everything's DONE! So bring it on
EOT
exit;
}
@files = @ARGV;
if ($#files+1 <= 0 || $opt_h) {
usage();
exit(0);
}
if (defined($opt_n) && $opt_n >= 0) { $amount = $opt_n; }
if (defined($opt_t)) { $usetabs = 1; }
if ($opt_r) { $replace = 1; }
if ($opt_l) { $lower = 1; }
if ($opt_u) { $upper = 1; }
if ($opt_c) { $convert = 1; }
if ($opt_q) { $quiet = 1; }
if (!$quiet) { print "$credit\n\nProcessing...\n"; }
chdir('.') ? print "+\n" : print "-\n";
opendir(DIR,'.');
@files = readdir(DIR);
foreach(@files) {
@lines = ();
@temp = ();
if($_ !~ m/\.(htm.*?)/gi){
next;
}
$filein = $_;
if ($replace) {
$fileout = $filein;
} else {
$fileout = "$filein.out";
}
open(i,"<$filein") || die "Can't open $_ ";
while (!eof(i)) {
$line = <i>;
push @temp, $line;
}
close(i);
splitlines();
if (!$quiet) { print "$_\n"; }
$SCRIPT = 0;
$COMMENT = 0;
$PRE = 0;
$temp = '';
foreach(@lines) {
$SCRIPT = 0 if ($line =~ m@(</script|%>)@ig);
$COMMENT = 0 if ($line =~ m@(-->|</comment)@ig);
$PRE = 0 if ($line =~ m@</pre>@ig);
$line = $_;
$SCRIPT = 1 if ($line =~ m@(<script|<%)@ig);
$COMMENT = 1 if ($line =~ m@(<!--|<comment)@ig);
$PRE = 1 if ($line =~ m@<pre@ig);
if (!$SCRIPT && !$COMMENT && !$PRE) {
$line =~ s/\t//ig;
$line =~ s/<\ /</ig;
$line =~ s/\ >/>/ig;
if ($line =~ />$/) {
$temp .= $line;
} else {
$temp .= $line." ";
}
} else {
$temp .= "\n".$line."\n";
}
}
push @temp, $temp;
splitlines();
$SCRIPT = 0;
$COMMENT = 0;
$PRE = 0;
foreach(@lines) {
$SCRIPT = 0 if ($line =~ m@(</script|%>)@ig);
$COMMENT = 0 if ($line =~ m@(-->|</comment)@ig);
$PRE = 0 if ($line =~ m@</pre>@ig);
$line = $_;
$SCRIPT = 1 if ($line =~ m@(<script|<%)@ig);
$COMMENT = 1 if ($line =~ m@(<!--|<comment)@ig);
$PRE = 1 if ($line =~ m@<pre@ig);
if (!$SCRIPT && !$COMMENT && !$PRE) {
$line =~ s/\ {2,}/\ /ig;
$line =~ s@($tags)@\n$1@ig;
if ($convert) {
$line =~ s@\xA9@&copy;@g;
$line =~ s@\xAE@&reg;@g;
}
if ($upper || $lower || $convert) {
for($i=0; $i<length($line); $i++) {
$char = substr($line,$i,1);

if ($char eq '<') { $in = 1; }
if ($char eq '>') { $in = 0; }
if ($char eq '"') {
if ($quote) { $quote = 0; }
else { $quote = 1; }
}
if ($in && !$quote) {
substr($line,$i,1) = uc($char) if $upper;
substr($line,$i,1) = lc($char) if $lower;
}
if (!$quote) {
if (ord($char) == 169) { substr($line,$i,1) = "&copy;"; }
if (ord($char) == 174) { substr($line,$i,1) = "&reg;"; }
}
}
}
}
push @temp, $line;
}
splitlines();
$indent = 0;
$SCRIPT = 0;
$COMMENT = 0;
$PRE = 0;
foreach (@lines) {
$SCRIPT = 0 if ($line =~ m@(</script|%>)@ig);
$COMMENT = 0 if ($line =~ m@(-->|</comment)@ig);
$PRE = 0 if ($line =~ m@</pre>@ig);
$line = $_;
$SCRIPT = 1 if ($line =~ m@(<script|<%)@ig);
$COMMENT = 1 if ($line =~ m@(<!--|<comment)@ig);
$PRE = 1 if ($line =~ m@<pre@ig);
$spaces = "";
if (!$SCRIPT && !$COMMENT && !$PRE) {
$line =~ s@(\ $)@@ig;
$indent -= $line =~ s@($tagunindent)@$1@ig;
$spaces = "";
for ($j=0; $j<$indent; $j++) {
for ($k=0; $k<$amount; $k++) {
if ($usetabs) {
$spaces .= "\t";
} else {
$spaces .= " ";
}
}
}
}
push @temp, $spaces.$line;
if (!$SCRIPT && !$COMMENT && !$PRE) {
$indent += $line =~ s/($tagindent)/$1/ig;
}
}
splitlines();
open (o, ">$fileout");
foreach (@lines) { print o "$_\n"; }
close(o);
}


open(x, ">Results.txt")||die"can not open result file\n";
chdir('.') ? print "+\n" : print "-\n";
opendir(DIR,'.');
@files = readdir(DIR);
foreach(@files) {
if($_ !~ m/\.(out.*?)/gi){
next;
}
# Final stretch: load each .out file into memory and pull out the listings.
open(p,"$_")||die"can not open $_\n";
my @loadit=<p>;
close(p);
foreach (@loadit){
$_ =~ s/\s+//;
$_ =~ s/\<L.*//;
if($_ eq ''){
next;
}
if($_ =~ m/class=listingName/gi){
if(!($_ =~ /\<A/gi)){
($name) = $_ =~ /\>(.*)/;
$name =~ s/&amp;/&/gi;
print x "\n$name\n";
} else {
($name) = $_ =~ /\">(.*)\</mgi;
$name =~ s/&amp;/&/gi;
print x "\n$name\n";
}
}
if($_ =~ m/class=gold/gi){
my ($address) = $_ =~ /\>(.*)/;
if(!($address =~ m/^(ph)/gi)){
print x "$address\n\n";
}
}elsif($_ =~ m/class=free/gi){
($address) = $_ =~ /\>(.*)/gi;
print x "$address\n\n";
}
}
}

close(x);
cleaner();
print "\n\nExtraction Complete\n\n";
exit;

sub splitlines {
@lines = ();
foreach(@temp) {
$line = $_;
if ($line eq "\n") {
# This preserves blank lines in script and comments.
push @lines, " ";
} else {
push @lines, split(/\n/, $line);
}
}
@temp = ();
}

sub cleaner(){
if(!(-e 'processed')){
system("md processed");
}
if(!(-e 'html_files')){
system("md html_files");
}

chdir('.') ? print "+\n" : print "-\n";
opendir(DIR,'.');
my @files = readdir(DIR);

foreach(@files) {
if($_ =~ m/\.(out.*?)/gi){
system("copy \"$_\" processed");
system ("del \"$_\"");
} elsif ($_ =~ m/\.(htm.*?)/gi){
system("copy \"$_\" html_files");
system ("del \"$_\"");
}

}
}

sub usage {
print <<EOT;
$credit
Usage: $progname [-options] filespec [filespec...]

filespec is a filename or filename pattern (e.g. *.htm)
-n Number of tabs/spaces to indent (default: 0)
-t Uses tabs instead of spaces for indenting
-r Replace original with new file
-l Convert tags to lower-case
-u Convert tags to upper-case
-c Convert (C) -> &copy; and (R) -> &reg;
-h Help
-todo To do list
EOT
}


(This post was edited by chetwyn on Apr 26, 2006, 6:12 AM)


davorg
Thaumaturge / Moderator

Apr 26, 2006, 6:16 AM

Post #5 of 6 (1532 views)
Re: [chetwyn] Pattern matching issue with HTML file, please help...

Anyone writing that much code without a) using "strict" and "warnings" and b) looking on CPAN first is just asking for trouble :)
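
To make the point a little more concrete: under `use strict`, a misspelled or never-declared variable is a compile-time error instead of a silently empty value, and `use warnings` flags things like printing an undefined value. A small sketch of a directory scan and file read in that style follows, with lexical handles and three-argument open; the helper names `html_files` and `read_lines` are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: names of the .htm/.html files in a directory.
sub html_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!\n";
    my @found = sort grep { /\.html?\z/i && -f "$dir/$_" } readdir $dh;
    closedir $dh;
    return @found;
}

# Hypothetical helper: read a whole file into a list of chomped lines.
sub read_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

for my $file (html_files('.')) {
    my @lines = read_lines($file);
    printf "%s: %d lines\n", $file, scalar @lines;
}
```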



chetwyn
Novice

Apr 26, 2006, 6:19 AM

Post #6 of 6 (1531 views)
Re: [chetwyn] Pattern matching issue with HTML file, please help...

'heh' 'heh'

 
 

