Extracting blockquotes and text within them?

 



albinodog
New User

Mar 7, 2009, 1:28 AM

Post #1 of 2 (810 views)

OK, so I understand how to use HTML::TokeParser to extract tags and attribute values for, say, links or images. But what about extracting the data inside a tag such as a blockquote, or all bold text? How do I extract multiple instances of text contained between an opening and a closing tag? I have used the following code to extract a single blockquote, but I need to repeat this for every blockquote within the HTML source.


Code
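# Lowercase copy, used only so the tag search is case-insensitive.
my $html = lc($source);
my $start = index($html, '<blockquote');
# Search for the closing tag from $start onward; 13 is length('</blockquote>').
my $end = index($html, '</blockquote>', $start) + 13;
# Slice the original $source (not an undefined $content) so the text keeps its case.
my $blockquotes = substr($source, $start, $end - $start);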


I'd like to strip it down so that only the text is extracted from each blockquote tag, with each blockquote stored as a single line in an array. Any help is appreciated.


(This post was edited by albinodog on Mar 7, 2009, 1:29 AM)


1arryb
User

Mar 9, 2009, 8:52 AM

Post #2 of 2 (793 views)
Re: [albinodog] Extracting blockquotes and text within them?

Hi albinodog,

You might want to try a better HTML parser, like HTML::TreeBuilder. Check out this thread: http://perlguru.com/gforum.cgi?do=post_view_threaded;post=35864;sb=post_latest_reply;so=ASC;.

Your use case is slightly different: 1. you are looking for blockquote tags, not table tags; and 2. you want the unformatted text from the tag, not the raw HTML, so you'll want to call HTML::Element's as_text() method to get the tag content.
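
A minimal sketch of that approach, assuming HTML::TreeBuilder is installed from CPAN. The sample HTML and the variable names are made up for illustration; in practice $source would hold the page you fetched. Note that if blockquotes are nested, look_down() returns the inner ones as separate matches too.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input standing in for the real page source.
my $source = <<'HTML';
<html><body>
<blockquote>First <b>quoted</b>
passage.</blockquote>
<p>Not quoted.</p>
<BLOCKQUOTE>Second quoted passage.</BLOCKQUOTE>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($source);

my @blockquotes;
# In list context, look_down() returns every element matching the criteria.
# The parser lowercases tag names, so no lc() on the source is needed.
for my $element ($tree->look_down(_tag => 'blockquote')) {
    my $text = $element->as_text;   # text content only, markup stripped
    $text =~ s/\s+/ /g;             # collapse newlines and runs of whitespace
    $text =~ s/^\s+|\s+$//g;        # trim both ends
    push @blockquotes, $text;
}

$tree->delete;   # free the parse tree when done

print "$_\n" for @blockquotes;
```

Each element of @blockquotes then holds the text of one blockquote on a single line, which seems to be what the original question asks for.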

I use code like this to clean up multi-line text blocks and collapse each one onto a single line. It's not very efficient, but it deals with mismatches between Perl's default line termination and what's going on in the HTML file. Maybe someone on the board has a better method.

Code
# Get the blockquote text somehow.
...
# Normalize line termination. In DOS text files, this will add vertical whitespace.
$text =~ tr/\r/\n/;
# Turn the block into a single line by changing newlines to spaces.
$text =~ tr/\n/ /;
# Remove extra interior whitespace.
$text =~ s/\s+/ /g;
# Remove leading/trailing whitespace (/g is needed so both ends are trimmed).
$text =~ s/^\s+|\s+$//g;
...
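
One possible tightening, offered as a sketch and not a definitive version: since \s already matches carriage returns, newlines, tabs and spaces, the two tr steps can probably be folded into the single s/\s+/ /g substitution. The helper name single_line and the sample strings below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a multi-line block into one trimmed line.
sub single_line {
    my ($text) = @_;
    $text =~ s/\s+/ /g;        # \s covers \r, \n, tabs and spaces, so CRLF needs no special case
    $text =~ s/^\s+|\s+$//g;   # /g so both the leading and the trailing match are removed
    return $text;
}

# Made-up inputs: one with DOS line endings, one with blank lines and a tab.
my @raw   = ("  first line\r\nsecond line  ", "\tonly\n\none\n");
my @lines = map { single_line($_) } @raw;

print "[$_]\n" for @lines;
```

Applied with map, the helper turns a list of raw blockquote texts into a list of single-line strings.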


Cheers,

Larry

 
 

