HTML::Parser

 



camelman
Novice

Sep 1, 2003, 5:39 PM

Post #1 of 22 (3361 views)
HTML::Parser

Does anyone know how to use HTML::Parser? I've been using the program below to try to strip the HTML out of a text file. According to Perl experts at another forum it should work, but it doesn't: all it outputs is one file, and it doesn't take out any tags. This is the code I have been using:

#! C:/perl/ -w

$a=0;
$outfile="output$a";
$file="C:/output/data4/output$a.txt";
print "files to process...\n\n";
print (join "|", $file);
print "\n";

while($a < 153000) {

$file =~ s/.txt//;

open(INFILE, "<$file.txt") || die;
open(OUTFILE,">C:/parsed/data4/text$outfile.txt") || die;

while (<INFILE>) {
chomp;
s/<[^>]+> //g;
print OUTFILE "$_\n";
}

close(INFILE);
close(OUTFILE);
$a++;
}



Thanks,
Camelman
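
One possible reading of the two symptoms in the code above, offered as a sketch rather than a confirmed diagnosis: $file and $outfile are built once before the loop, so every pass reopens output0.txt and rewrites the same output file, and the pattern `<[^>]+> ` only matches tags that are followed by a space. The version below rebuilds both names on each pass and drops the trailing space from the pattern. The helper names names_for, strip_tags and process_range are made up for illustration, the directories are the ones from the post, and a regex is only a rough stand-in for a real HTML parser (it will not handle tags that span lines).

```perl
use strict;
use warnings;

# Hypothetical helper: build the input and output paths for file number $n.
# The directory names are taken from the post above.
sub names_for {
    my ($n) = @_;
    return ("C:/output/data4/output$n.txt",
            "C:/parsed/data4/textoutput$n.txt");
}

# Hypothetical helper: remove anything that looks like a tag from one line.
sub strip_tags {
    my ($line) = @_;
    $line =~ s/<[^>]*>//g;    # no trailing space in the pattern
    return $line;
}

# Process files $first .. $last and return how many were written.
sub process_range {
    my ($first, $last) = @_;
    my $done = 0;
    for my $n ($first .. $last) {
        my ($in, $out) = names_for($n);    # names are rebuilt on every pass
        my $infh;
        unless (open($infh, '<', $in)) {
            warn "Skipping $in: $!\n";
            next;
        }
        open(my $outfh, '>', $out) or die "Can't write $out: $!\n";
        while (my $line = <$infh>) {
            print {$outfh} strip_tags($line);
        }
        close($infh);
        close($outfh) or die "Can't close $out: $!\n";
        $done++;
    }
    return $done;
}

# process_range(0, 152999);    # uncomment to run over the whole sequence
print strip_tags("<p>Hello <b>world</b></p>\n");
```

Using a fresh loop variable instead of $a also avoids clashing with the $a that Perl reserves for sort.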


davorg
Thaumaturge / Moderator

Sep 2, 2003, 1:32 AM

Post #2 of 22 (3356 views)
Re: [camelman] HTML::Parser

The HTML::Parser distribution contains an example program called htext which does just that.

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


camelman
Novice

Sep 2, 2003, 6:36 AM

Post #3 of 22 (3353 views)
Re: [davorg] HTML::Parser

Thanks davorg, but where would I put input and output files in the htext example?



Thanks,
Camelman


davorg
Thaumaturge / Moderator

Sep 2, 2003, 7:04 AM

Post #4 of 22 (3351 views)
Re: [camelman] HTML::Parser

Let's take a look at the program and see if we can work it out.

The obvious place to look for the input filename is in the argument to the "parse_file" method. That argument is simply a call to the function "shift". We know that "shift" without any arguments returns the first argument from @ARGV, so the input filename is the first argument to the program.

Now let's see if we can work out where the output goes. Well, the only obvious output lines are the calls to "print" in "text" and "tag". If it's not given a filehandle, then "print" will send its output to STDOUT, so it looks very much like that's what it will do here.

So the input file is given on the command line and the output goes to STDOUT. Therefore it looks to me as tho' you would call the program like this.


Code
htext input.html > output.txt


Of course, I've never run the program so I can't be 100% sure, but looking at the code that's what I think is going to happen.
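
The reasoning above can be checked with a few lines of plain Perl. This is only a sketch: it fills @ARGV by hand to stand in for a real command line, and the file names are placeholders, not anything htext requires.

```perl
use strict;
use warnings;

# Pretend the script was started as:  perl htext input.html ignored.html
# (assumed file names, for illustration only)
@ARGV = ('input.html', 'ignored.html');

# Outside any subroutine, a bare shift removes and returns $ARGV[0].
my $infile = shift;

# print with no filehandle writes to STDOUT, so a shell redirect such as
#   htext input.html > output.txt
# decides where the text ends up.
print "input file: $infile\n";
print "left in \@ARGV: @ARGV\n";
```

Inside a subroutine a bare shift works on @_ instead, which matters if the htext code is later moved into a sub.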

What do you think?

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


camelman
Novice

Sep 2, 2003, 8:58 AM

Post #5 of 22 (3348 views)
Re: [davorg] HTML::Parser

Well, since I need to do a lot of files in sequence, the command line won't work. I think I can use the loops that I was using in the first post to make the file sequence work; however, I'm not sure what variables should be written to the file. Is it just $_? Also, it seems that I open the input file at the end of the program. It's this segment, right?

Code
HTML::Parser->new(api_version => 3,
                  handlers    => [start => [\&tag, "tagname, '+1'"],
                                  end   => [\&tag, "tagname, '-1'"],
                                  text  => [\&text, "dtext"],
                                 ],
                  marked_sections => 1,
                 )->parse_file(shift) || die "Can't open file: $!\n";



I'm not sure where to put the open command. And then the output would go beneath that? I'm confused.



Thanks
Camelman


davorg
Thaumaturge / Moderator

Sep 2, 2003, 9:07 AM

Post #6 of 22 (3347 views)
Re: [camelman] HTML::Parser

Why not just create another script that calls htext for each file in your sequence?


Code
foreach my $infile (@list_of_input_files) { 
my $outfile = (some code to generate the output filename);

system "htext $infile > $outfile";
}


That looks to be about the easiest option to me.
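
A minimal sketch of such a wrapper, under some assumptions: the inputs are numbered output0.txt, output1.txt and so on, as in the first post; the directories are the ones mentioned there; and htext sits in the current directory. command_for is a hypothetical helper. The script only prints the commands it would run; swapping the print for the commented-out system call would execute them.

```perl
use strict;
use warnings;

# Assumed locations, copied from the first post in this thread.
my $in_dir  = 'C:/output/data4';
my $out_dir = 'C:/parsed/data4';

# Hypothetical helper: build the shell command for file number $n.
# Both names are generated here, so each file gets its own output.
sub command_for {
    my ($n) = @_;
    my $infile  = "$in_dir/output$n.txt";
    my $outfile = "$out_dir/output$n.txt";
    return qq{perl htext "$infile" > "$outfile"};
}

for my $n (0 .. 2) {    # a short range for a dry run
    my $cmd = command_for($n);
    print "$cmd\n";
    # system($cmd) == 0 or warn "Command failed for file $n: $?\n";
}
```

Passing the command as a single string lets the shell handle the > redirect, which is why the list form of system is not used here.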

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


camelman
Novice

Sep 2, 2003, 6:48 PM

Post #7 of 22 (3340 views)
Re: [davorg] HTML::Parser

Sounds easy to me as well. Thanks, davorg.



Camelman


camelman
Novice

Sep 2, 2003, 7:08 PM

Post #8 of 22 (3338 views)
Re: [camelman] HTML::Parser

Just a quick question: can I specify the file path of the output file like this?


Code
 system "perl parse.pl C:/output/data3/$infile > C:/parsed/$outfile";



Thanks,

Camelman


davorg
Thaumaturge / Moderator

Sep 3, 2003, 1:24 AM

Post #9 of 22 (3334 views)
Re: [camelman] HTML::Parser

What happens when you try?

--
Dave Cross, Perl Hacker, Trainer and Writer
http://www.dave.org.uk/
Get more help at Perl Monks


camelman
Novice

Sep 3, 2003, 6:21 AM

Post #10 of 22 (3332 views)
Re: [davorg] HTML::Parser

Right, sorry about that.



Camelman


camelman
Novice

Sep 5, 2003, 6:30 PM

Post #11 of 22 (3321 views)
Re: [camelman] HTML::Parser

Okay, so I did try it and it doesn't work. I used this code:


Code
#! C:/perl/ -w
@infile=<files.txt>
foreach $infile (@infile) {
my $outfile = "C:/parsed/data4/$infile";
system "perl parse.pl C:/output/data4/$infile > $outfile";
}



and I get the following error messages:

"my" variable $infile masks earlier declaration in same scope at caller.pl line 6.

"my" variable $outfile masks earlier declaration in same scope at caller.pl line 6.

syntax error at caller.pl line 4, near "$infile ("

syntax error at caller.pl line 7, near "}"

execution of caller.pl aborted due to compilation errors.

I don't understand these, especially because there is no "my" near $infile. files.txt is just a text file listing all the files to process, each on a separate line.

Thanks,

Camelman


KevinR
Veteran


Sep 5, 2003, 11:35 PM

Post #12 of 22 (3318 views)
Re: [camelman] HTML::Parser

Maybe the error in the example code you posted was just a typo made when you posted it, but it will stop the script from compiling:


Code
#! C:/perl/ -w 
@infile=<files.txt>
foreach $infile (@infile) {
my $outfile = "C:/parsed/data4/$infile";
system "perl parse.pl C:/output/data4/$infile > $outfile";
}



should be:


Code
#! C:/perl/ -w 
@infile=<files.txt>;
foreach $infile (@infile) {
my $outfile = "C:/parsed/data4/$infile";
system "perl parse.pl C:/output/data4/$infile > $outfile";
}

Note the semicolon at the end of the second line. That should get rid of the error messages.
-------------------------------------------------


(This post was edited by KevinR on Sep 5, 2003, 11:37 PM)
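
A side note on the second line of that script, which nobody in the thread tests: in Perl, angle brackets around something that is neither a filehandle nor a simple scalar are treated as a filename glob, so `<files.txt>` hands back the name files.txt itself rather than the lines stored in that file. The sketch below reads the list with open instead and removes the line ending from each entry, so that no newline ends up inside the command passed to system. read_list is a hypothetical helper, and files.txt is assumed to hold one file name per line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-empty lines of a list file,
# with the trailing newline (LF or CRLF) removed from each one.
sub read_list {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    my @names;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;
        next unless length $line;    # skip blank lines
        push @names, $line;
    }
    close($fh);
    return @names;
}

# The glob form: the pattern has no wildcards, so it comes back unchanged.
my @globbed = <files.txt>;
print "glob gives: @globbed\n";

# The open form: only attempted here if the list file is present.
if (-e 'files.txt') {
    for my $infile (read_list('files.txt')) {
        print "would process: $infile\n";
    }
}
```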


camelman
Novice

Sep 6, 2003, 6:51 AM

Post #13 of 22 (3310 views)
Re: [KevinR] HTML::Parser

You're right, Kevin, no error messages now. Unfortunately, when I run the script it just outputs one file, textoutput0.txt, and it doesn't take out the HTML. All the code I'm using is somewhere in this thread. I can't figure it out.

Thanks
Camelman


KevinR
Veteran


Sep 6, 2003, 4:14 PM

Post #14 of 22 (3304 views)
Re: [camelman] HTML::Parser

I don't know what the problem is, but let's take a look at what you have:


@infile=<files.txt>;
foreach $infile (@infile) {
my $outfile = "C:/parsed/data4/$infile";
system "perl parse.pl C:/output/data4/$infile > $outfile";
}

If you are running a Perl script (parse.pl), there is no reason to use the system function. Make parse.pl a subroutine instead of a separate script, or call it with a "require" directive.

What is parse.pl anyway?
-------------------------------------------------


camelman
Novice

Sep 6, 2003, 5:18 PM

Post #15 of 22 (3301 views)
Re: [KevinR] HTML::Parser

parse.pl is the HTML::Parser script. davorg said to use the system call in order to assign the input and output files. I don't know how else I would specify input and output files in the parse.pl script.



Camelman


KevinR
Veteran


Sep 7, 2003, 7:33 PM

Post #16 of 22 (3296 views)
Re: [camelman] HTML::Parser

OK, if you look at Dave's post, he is calling the htext script that he posted a link to. You need to be calling that script (after uploading it to your server), or paste it into your script as a subroutine.
-------------------------------------------------


camelman
Novice

Sep 8, 2003, 8:34 PM

Post #17 of 22 (3287 views)
Re: [KevinR] HTML::Parser

Alright, htext is parse.pl; I just renamed it. So as far as I know I am calling that script, correct? Also, if I were to put it in a subroutine, how would I then specify the input and output files?



Camelman


KevinR
Veteran


Sep 9, 2003, 11:15 AM

Post #18 of 22 (3281 views)
Re: [camelman] HTML::Parser

Hopefully Dave or another person can help you. I tried a few things but I can't get the HTML::Parser module to do what you want, not even using the htext script to call the parser module. Sorry.
-------------------------------------------------


camelman
Novice

Sep 9, 2003, 8:44 PM

Post #19 of 22 (3277 views)
Re: [KevinR] HTML::Parser

okay, new code:

#! C:/perl/ -w

package Example;

use strict;

require HTML::Parser;

$a=0;
while($a < 151218) {
@Example::ISA = qw(HTML::Parser);
$file="output$a.txt";
my $parser = Example->new;
$parser->parse_file("C:/output/data4/$file");
open(FILE,">C:/parsed/data4/$file");
print FILE $parser->{TEXT};
close(FILE);
$a++;
}
sub text
{
my ($self,$text) = @_;

$self->{TEXT} .= $text;
}

when I run this it gives me:

global symbol "$file" requires explicit package name at parse.pl line 12.

global symbol "$file" requires explicit package name at parse.pl line 14.

global symbol "$file" requires explicit package name at parse.pl line 15.

Execution of parse.pl aborted due to compilation errors.



Thanks,

Camelman
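
On the error messages above: under "use strict", every variable has to be declared (for example with "my") before it is used, which is what "requires explicit package name" is complaining about. $a escapes the check only because Perl reserves $a and $b for sort. Below is a minimal sketch of just the declarations, with the parsing left out; the counter name $n, the array @planned and the short range are stand-ins, not part of the posted script.

```perl
use strict;
use warnings;

my @planned;    # hypothetical list of [input path, output path] pairs

for my $n (0 .. 2) {              # the post loops to 151218; three passes show the idea
    my $file = "output$n.txt";    # "my" declares $file, which satisfies strict
    push @planned, ["C:/output/data4/$file", "C:/parsed/data4/$file"];
}

printf "%s -> %s\n", @$_ for @planned;
```

Declaring $file inside the loop also guarantees a fresh value on every pass.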


camelman
Novice

Sep 12, 2003, 5:51 PM

Post #20 of 22 (3270 views)
Re: [camelman] HTML::Parser

So if I take the $file variable out, I get this:

#! C:/perl/ -w

package Example;

use strict;

require HTML::Parser;

$a=0;

while($a < 151218) {
@Example::ISA = qw(HTML::Parser);
my $parser = Example->new;
open(FILE,">C:/parsed/data4/output1.txt");
$parser->parse_file("C:/output/data4/output1.txt");
print FILE $parser->{TEXT};
close(FILE);
$a++;
}
sub text
{
my ($self,$text) = @_;

$self->{TEXT} .= $text;
}



and this gives me:

print() on closed filehandle FILE at parse.pl line 16.

Does anyone know what I should be printing to the file, as far as variables go? Also, why isn't it printing? I opened the file for writing, correct?



Thanks

Camelman
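
The "print() on closed filehandle" warning usually means that the open call did not succeed, for instance because the target directory does not exist, and the script carried on regardless. Here is a sketch of checking the result and reporting $!; open_for_write is a hypothetical helper and the path is the one from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: try to open $path for writing.
# Returns (filehandle, undef) on success or (undef, error text) on failure.
sub open_for_write {
    my ($path) = @_;
    open(my $fh, '>', $path) or return (undef, "$!");
    return ($fh, undef);
}

my ($fh, $err) = open_for_write('C:/parsed/data4/output1.txt');
if ($fh) {
    print {$fh} "parsed text would go here\n";
    close($fh) or warn "Can't close output file: $!\n";
}
else {
    # $err holds the operating system's reason, e.g. a missing directory.
    print "Can't open output file: $err\n";
}
```

The shorter everyday form is open(...) or die "...: $!", which stops the script at the first file that cannot be opened.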


KevinR
Veteran


Sep 12, 2003, 11:15 PM

Post #21 of 22 (3267 views)
Re: [camelman] HTML::Parser

OK, the code below works (for the most part) for me. I could not test it with your exact setup, especially the while($a < 151218) part of your code, so I substituted a more modest directory that had 46 HTML files in it. I do not know why, but the script below would only parse 25 of the 46 files. Also, the parsed files retain all the white space where tags were removed, so the txt files look rather bizarre. I do not know why that is or how to change that behavior either. Give this a try and see what results you get. If you are running from a browser, uncomment the print header line, and maybe the other commented print line if you want to see something printed to the screen. This is about all the help I can give you with this problem.


Code
#! C:/perl/ -w  

use strict;
use HTML::Parser 3.00 ();
#print qq~Content-type: text/html\n\n~;
my %inside;
my @temp;
my $parser;
my $file;
my $i=0;

while ($i < 151218) {
$file="output$i.txt";
parse_files("C:/output/data4/$file");
open(FILE,">C:/parsed/data4/$file");
print FILE @temp;
close(FILE);
$i++;
#print "$i - $file: finished<br>";
}

sub tag {
my($tag, $num) = @_;
$inside{$tag} += $num;
print " "; # not for all tags
}

sub text {
return if $inside{script} || $inside{style};
push @temp, $_[0];
}

sub parse_files {
undef(@temp);
HTML::Parser->new(api_version => 3,
handlers => [start => [\&tag, "tagname, '+1'"],
end => [\&tag, "tagname, '-1'"],
text => [\&text, "dtext"],
], marked_sections => 1,
)->parse_file(shift) || die "Can't open file: $!\n";
}
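
Two guesses about the behaviour described above, neither verified against the original files: %inside is never reset between files, so an unbalanced script or style tag in one file could suppress the text of every later file, and tag prints its space to STDOUT instead of storing it in @temp. The sketch below keeps all state inside one hypothetical function, html_to_text, and collapses runs of whitespace afterwards. The handlers follow the same event and argspec scheme as htext; the whitespace clean-up is an addition, marked_sections is left out for brevity, and a space is still inserted for every tag, which can split words that contain inline markup.

```perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical helper: return the visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my %inside;    # tag nesting counts, fresh for every call
    my @chunks;    # collected text, also fresh for every call

    my $parser = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [ sub { $inside{ $_[0] }++; push @chunks, ' ' }, 'tagname' ],
            end   => [ sub { $inside{ $_[0] }--; push @chunks, ' ' }, 'tagname' ],
            text  => [ sub {
                           return if $inside{script} || $inside{style};
                           push @chunks, $_[0];
                       }, 'dtext' ],
        ],
    );
    $parser->parse($html);
    $parser->eof;

    my $text = join '', @chunks;
    $text =~ s/\s+/ /g;     # collapse whitespace left behind by removed tags
    $text =~ s/^ | $//g;    # trim the ends
    return $text;
}

print html_to_text('<p>Hello <b>world</b></p><script>var x = 1;</script>'), "\n";
```

To fit the loop above, the file would be read into a string (or the parser's parse_file method used instead of parse) and the returned text printed to the output filehandle.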

-------------------------------------------------


(This post was edited by KevinR on Sep 12, 2003, 11:23 PM)


camelman
Novice

Sep 14, 2003, 7:44 AM

Post #22 of 22 (3256 views)
Re: [KevinR] HTML::Parser

Thank you very much, KevinR.



Camelman

 
 

