HTML::Parser to strip HTML

 



albatros
Novice

Jul 30, 2011, 8:00 AM

Post #1 of 7 (3302 views)

Hi,

I'm using HTML::Parser to strip HTML tags from my files. I noticed
that the JavaScript between // <![CDATA[ and // ]]> is not
stripped. Any idea how to strip it as well?

Regards


fixles
Novice

Jul 30, 2011, 8:28 AM

Post #2 of 7 (3298 views)
Re: [albatros] HTML::Parser to strip HTML

Hi,

I tried HTML::Parser and didn't have much luck, but I managed to get what I needed with WWW::Mechanize->text(). Depending on what you're trying to do, you may be able to use it.

Thanks.


albatros
Novice

Jul 30, 2011, 9:12 AM

Post #3 of 7 (3296 views)
Re: [fixles] HTML::Parser to strip HTML

I am trying to write a Perl program that downloads TV guide listings from a site (Israel). Some programmes contain links to a sponsor (a betting site), and I want to avoid them.

Example source code:


Code
<div style="display: none;"> 
<!-- ******************************************** -->
<script type="text/javascript">
// <![CDATA[
if(!window.goA)document.write('<sc'+'ript src="http://*******************/scripts/gwloader.js?ord='+Math.floor(Math.random()*1000000000)+'" type="text/javascript"><\/sc'+'ript>');
// ]]>
</script><script type="text/javascript">
// <![CDATA[
if(window.goA)goA.addZone(*******,{displayOptions:{bannerhome:'http://*******************'}});
// ]]>
</script><script charset="iso-8859-2" src="http://*******************/js.prm?zona=*****&amp;ord=*******************;re=http*******************"></script>
<noscript><a href="http://*******************/click.prm?zona=*****" target="_blank" title="Touch me!"><img border="0" src="http://*******************/img.prm?zona=*****" alt="Atention" /></a></noscript>
</div>


and the output file contains unwanted text:

Code
// &lt;![CDATA[ 
if(!window.goA)document.write(&lt;sc'+'ript src="http://*******************/scripts/gwloader.js?ord='+Math.floor(Math.random()*1000000000)+'" type="text/javascript"&gt;&lt;\/sc'+'ript&gt;');
// ]]>

// &lt;![CDATA[
if(window.goA)goA.addZone(*******,{displayOptions:{bannerhome:'http://*******************'}});
// ]]&gt;

I would like this text to be left out of the output.
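One possible way to get this behaviour from HTML::Parser itself is its ignore_elements method, which skips both the tags and the content of the named elements. The following is a minimal sketch, not the poster's code: the helper name strip_html and the sample markup are made up for illustration, and it assumes the whole page is available as a string.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the text of $html without tags and
# without the contents of <script> and <style> elements.
sub strip_html {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Collect every text segment; 'dtext' delivers entity-decoded text.
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );

    # Skip both the tags and the content of these elements.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;

    return $text;
}

# Made-up sample that mimics the structure of the page above.
my $sample = <<'HTML';
<div><p>News at 8</p>
<script type="text/javascript">
// <![CDATA[
if (window.goA) goA.addZone(1);
// ]]>
</script>
<p>Film at 9</p></div>
HTML

print strip_html($sample);
```

Because the script element is skipped inside the parser, no second pass over the output file should be needed.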


fixles
Novice

Jul 30, 2011, 9:23 AM

Post #4 of 7 (3288 views)
Re: [albatros] HTML::Parser to strip HTML

After you've created the file, you could try a loop that only writes a line if it doesn't contain the code for the advert.
This might not be a good way: if other HTML code is on the same line, it might ruin the page. Worth a try though. I've only been coding Perl for a couple of weeks myself.

Something like:

Code
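# Open the generated file for reading
open(my $fh, '<', 'tv.html') or die "Cannot open tv.html: $!";

# Print every line except those matching the sponsor pattern
while (my $line = <$fh>) {
    print $line unless $line =~ /regex identifying sponsor/;
}
close($fh);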



albatros
Novice

Jul 30, 2011, 10:30 AM

Post #5 of 7 (3285 views)
Re: [fixles] HTML::Parser to strip HTML

The solution you suggest (stripping the text after the file has been created) is not a good fit for me. I could just as well use sed, Perl or another program to strip part of the text from the created file, but I want a solution within my Perl program to keep things simple.

Thanks anyway for your interest and answers to my questions.


kasuals
New User

Jul 30, 2011, 10:41 AM

Post #6 of 7 (3282 views)
Re: [albatros] HTML::Parser to strip HTML



CPAN can release new modules all day, but I still stand by regex replacement. Regular expressions can and will remove any HTML tags from a file when done properly. Read each line and apply a regex that removes the HTML markup it contains.

*edit* Or, in your case, can and will remove any HTML tags from a string?

*posteditedit* Scripts for this already exist. I used one back in early 2000, so I would assume they are by now thorough enough to remove anything HTML-based from an HTML file.


(This post was edited by kasuals on Jul 30, 2011, 10:47 AM)
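As a rough illustration of the regex approach, here is a sketch that works on the whole page as one string rather than line by line, since a tag or script block can span several lines. The helper name strip_tags is hypothetical, the entity table is deliberately incomplete, and the patterns assume that attribute values never contain a literal ">".

```perl
use strict;
use warnings;

# Hypothetical helper: regex-based tag stripping on a whole string.
sub strip_tags {
    my ($html) = @_;

    # Remove script and style elements together with their content.
    # /s lets . match newlines, /i ignores case.
    $html =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{}gis;

    # Remove comments, then any remaining tags.
    $html =~ s{<!--.*?-->}{}gs;
    $html =~ s{<[^>]*>}{}g;

    # Decode a few common entities (not a complete list).
    my %entity = (lt => '<', gt => '>', quot => '"', amp => '&');
    $html =~ s{&(lt|gt|quot|amp);}{$entity{$1}}g;

    return $html;
}

# Made-up sample input.
my $html = "<p>Film at 9</p>\n<script type=\"text/javascript\">\n"
         . "// <![CDATA[\nvar x = 1;\n// ]]>\n</script>\n";
print strip_tags($html);
```

Removing the script and style elements first matters: stripping only the tags would leave the JavaScript behind as plain text.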


albatros
Novice

Jul 31, 2011, 4:28 AM

Post #7 of 7 (3222 views)
Re: [kasuals] HTML::Parser to strip HTML

I found a solution:


Code
my $data = get_nice($url);
# FIXME strip sponsored links, they don't work in my file anyway
$data =~ s|\x0A||g;  # strip newlines so that .*? can match across former line breaks
$data =~ s|<script type="text/javascript">.*?</script>||g;  # strip inline scripts, including their CDATA blocks


Thanks for your help.
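The newline-stripping step is only needed because "." does not match a newline by default. A possible variant, sketched here under the assumption that the page is held in a single string, uses the /s modifier so the rest of the data keeps its line breaks, and matches script tags regardless of their attributes or letter case. The helper name remove_scripts and the sample page are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every <script>...</script> element and
# leave the rest of the page, including its newlines, untouched.
sub remove_scripts {
    my ($data) = @_;
    # /g: all occurrences, /i: any letter case, /s: . also matches newlines
    $data =~ s{<script\b[^>]*>.*?</script\s*>}{}gis;
    return $data;
}

# Made-up sample input.
my $page = "<p>Film at 9</p>\n<script type=\"text/javascript\">\n"
         . "// <![CDATA[\nvar x = 1;\n// ]]>\n</script>\n<p>News at 10</p>\n";
print remove_scripts($page);
```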

 
 

