
KevinR
Veteran

Sep 12, 2003, 11:15 PM
Post #21 of 22
OK, the code below works (for the most part) for me. I could not test it with your exact setup, especially the while ($a < 151218) part of your code, so I substituted a more modest directory that had 46 HTML files in it. I do not know why, but the script would only parse 25 of the 46 files. Also, the parsed files retain all the white space where tags were removed, so the txt files look rather bizarre. I do not know why that is or how to change that behavior either.

Give this a try and see what results you get. If you are running from a browser, uncomment the print header line, and maybe the other commented print line if you want to see something printed to the screen. This is about all the help I can give you with this problem.

    #!C:/perl/bin/perl -w
    # (the path above assumes a default ActivePerl install; adjust it to your perl)
    use strict;
    use HTML::Parser 3.00 ();

    #print qq~Content-type: text/html\n\n~;

    my %inside;   # nesting depth per tag name, used to skip <script> and <style>
    my @temp;     # text chunks collected from the file currently being parsed
    my $file;
    my $i = 0;

    while ($i < 151218) {
        $file = "output$i.txt";
        parse_files("C:/output/data4/$file");

        # write the collected text to a file of the same name in the parsed dir
        open(FILE, ">C:/parsed/data4/$file")
            || die "Can't write C:/parsed/data4/$file: $!\n";
        print FILE @temp;
        close(FILE);

        $i++;
        #print "$i - $file: finished<br>";
    }

    # start/end tag handler: $num is +1 for a start tag, -1 for an end tag
    sub tag {
        my ($tag, $num) = @_;
        $inside{$tag} += $num;
        print " ";   # not for all tags
    }

    # text handler: keep decoded text unless we are inside <script> or <style>
    sub text {
        return if $inside{script} || $inside{style};
        push @temp, $_[0];
    }

    # parse one HTML file and leave its text chunks in @temp
    sub parse_files {
        undef(@temp);
        HTML::Parser->new(
            api_version => 3,
            handlers    => [
                start => [\&tag,  "tagname, '+1'"],
                end   => [\&tag,  "tagname, '-1'"],
                text  => [\&text, "dtext"],
            ],
            marked_sections => 1,
        )->parse_file(shift) || die "Can't open file: $!\n";
    }

-------------------------------------------------
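On the white space: the text handler also receives the line breaks and indentation that sat between the tags, so those end up in @temp along with the real text. One possible way to tidy that up, offered only as a rough sketch, is to squeeze the white space just before writing the file. The helper name squeeze_whitespace is made up for this example (it is not part of HTML::Parser), and I have only tried it on a small hand-made list, not on real pages.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): join the collected text
# chunks and collapse the white space left behind where tags were removed.
sub squeeze_whitespace {
    my @chunks = @_;
    my $text = join '', @chunks;

    $text =~ s/[ \t\r\f]+/ /g;    # runs of spaces and tabs become one space
    $text =~ s/ ?\n ?/\n/g;       # drop spaces on either side of a newline
    $text =~ s/\n{3,}/\n\n/g;     # allow at most one blank line in a row
    $text =~ s/^\s+//;            # trim the start of the text
    $text =~ s/\s+$//;            # trim the end of the text

    return "$text\n";
}

# Small made-up example of what @temp might hold after parsing a page
my @sample = ("Title", "\n\n   \n", "  First   paragraph ", "\n", "\t\n\n\n", "Second");
print squeeze_whitespace(@sample);
```

To use it in the script, pass @temp through squeeze_whitespace() in the print statement that writes the output file, instead of printing @temp directly. It treats every page the same way, so text that depends on exact spacing (a <pre> block, for instance) would be flattened too.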
(This post was edited by KevinR on Sep 12, 2003, 11:23 PM)