Extracting blockquotes and text within them?


New User

Mar 7, 2009, 1:28 AM

Post #1 of 2 (893 views)

OK, so I understand how I can use HTML::TokeParser to extract tags and attribute values for, say, links or images. But what about extracting data from a tag like a blockquote, or all bold text? How do I extract multiple instances of text contained within an opening and closing tag? I have used the following code to extract a single blockquote, but I need to repeat this for every blockquote within the HTML source.

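# Lowercase copy, so the tag search is case-insensitive.
my $html = lc($source);
my $start = index($html, '<blockquote');
# 13 is the length of '</blockquote>'.
my $end = index($html, '</blockquote>') + 13;
# Cut from the original $source so the extracted text keeps its case.
my $blockquotes = substr($source, $start, $end - $start);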

I'd like to strip it down so that the text of each blockquote is extracted and stored as a single line in its own array element. Any help is appreciated.

(This post was edited by albinodog on Mar 7, 2009, 1:29 AM)


Mar 9, 2009, 8:52 AM

Post #2 of 2 (876 views)
Re: [albinodog] Extracting blockquotes and text within them?

Hi albinodog,

You might want to try a better HTML parser, like HTML::TreeBuilder. Check out this thread:;post=35864;sb=post_latest_reply;so=ASC;.

Your use case is slightly different: 1. You are looking for blockquote tags, not table tags; and 2. You want the unformatted text from the tag, not the raw HTML, so you'll want to use HTML::Element::as_text() to get the tag content.
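
As a rough sketch of that suggestion, assuming HTML::TreeBuilder (from the HTML-Tree distribution on CPAN) is installed: the helper name blockquote_texts and the sample HTML below are made up for illustration and do not come from the linked thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the plain text of every <blockquote>
# element in an HTML string, one array element per blockquote.
sub blockquote_texts {
    my ($source) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($source);
    my @texts;
    # In list context, look_down returns every matching element in
    # document order. Tag names are stored in lowercase.
    for my $bq ($tree->look_down(_tag => 'blockquote')) {
        # as_text drops the markup and keeps only the text content.
        push @texts, $bq->as_text;
    }
    $tree->delete;    # free the parse tree
    return @texts;
}

# Made-up sample input.
my $sample = <<'HTML';
<html><body>
<blockquote>First quoted passage.</blockquote>
<p>Some other text.</p>
<BLOCKQUOTE>Second <b>quoted</b> passage.</BLOCKQUOTE>
</body></html>
HTML

print "$_\n" for blockquote_texts($sample);
```

If blockquotes are nested, look_down will most likely return the inner one as well as the outer one, so the inner text would show up twice. The whitespace cleanup can then be applied to each returned string.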

I use code like this to clean up multi-line text blocks and collapse each one into a single line. It's not very efficient, but it deals with mismatches between Perl's default line termination and what's going on in the HTML file. Maybe someone on the board has a better method.

# Get the blockquote text somehow.
# Normalize line termination. In DOS text files, this will add vertical whitespace.
$text =~ tr/\r/\n/;
# Turn the block into a single line by changing newlines to spaces.
$text =~ tr/\n/ /;
# Remove extra interior whitespace.
$text =~ s/\s+/ /g;
# Remove leading/trailing whitespace (/g is needed so both ends are trimmed).
$text =~ s/^\s+|\s+$//g;
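
One possible shorter alternative to the steps above, offered only as a sketch: splitting on the special pattern ' ' skips leading whitespace and treats any run of whitespace (including \r and \n) as a single separator, so joining the pieces back with one space collapses and trims in one go. The helper name single_line and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of whitespace to one space
# and drop leading/trailing whitespace.
sub single_line {
    my ($text) = @_;
    # split ' ' ignores leading whitespace and splits on whitespace runs;
    # trailing empty fields are discarded by split.
    return join ' ', split ' ', $text;
}

# Made-up sample blocks with DOS line endings, tabs and stray spaces.
my @blocks = ("  First line\r\nsecond line  ", "\tOne\n\n\ntwo ");
my @lines  = map { single_line($_) } @blocks;

print "$_\n" for @lines;
# prints:
# First line second line
# One two
```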