Paragraph extraction

 



IsabelleFr
Novice

Feb 19, 2013, 8:10 AM

Post #1 of 9 (679 views)
Paragraph extraction

Hi, I would like to extract paragraphs from HTML files using a regex. I know it can be done with modules, but my goal is to do it with a regex. I would go for something like this:

Code
use strict;
use locale;

while (my $para = <STDIN>) {
    chomp $para;
    if ($para =~ /.*(<p>.*<\/p>).*/ig) {
        print $1."\n";
    }
}

But obviously it will not print paragraphs that span more than one line. Does anyone know how to print the paragraphs even when they span more than one line? Thank you guys :)


FishMonger
Veteran / Moderator

Feb 19, 2013, 8:13 AM

Post #2 of 9 (677 views)
Re: [IsabelleFr] Paragraph extraction

Load the entire file into a scalar then apply your regex to that scalar.

Don't forget to use the /s regex option, so that . also matches newlines (/m only changes where ^ and $ match).
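
A minimal sketch of this suggestion, assuming the file contents are already in one scalar (the hard-coded $html string and the variable names are made up for illustration). It tries the same non-greedy pattern with and without the /s modifier, which is the one that lets . match a newline:

```perl
use strict;
use warnings;

# Hypothetical sample; stands in for a whole file held in one scalar.
my $html = "<p>first\nparagraph</p>\n";

# Without /s, "." does not match a newline, so the two-line paragraph is missed.
my $without_s = $html =~ /<p>.*?<\/p>/i ? 1 : 0;

# With /s, "." matches any character, including a newline.
my ($para) = $html =~ /(<p>.*?<\/p>)/is;

print "matched without /s: $without_s\n";
print "matched with /s: $para\n";
```

The pattern here is only an illustration; it still has the usual limits of matching HTML with a regex (attributes on the tag, nested markup and so on).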


(This post was edited by FishMonger on Feb 19, 2013, 8:15 AM)


IsabelleFr
Novice

Feb 19, 2013, 5:16 PM

Post #3 of 9 (662 views)
Re: [FishMonger] Paragraph extraction

Isn't that what I do when I use $data = <STDIN>? I thought I was putting the lines / text into a scalar.


Rahul6990
Novice

Feb 19, 2013, 9:19 PM

Post #4 of 9 (652 views)
Re: [IsabelleFr] Paragraph extraction

No, you are reading <STDIN> into $data one line at a time, not the whole input in one shot.

There is one more issue with the above approach: what happens if the input file has more than one pair of <p></p> tags? I don't think the above approach will handle that.


IsabelleFr
Novice

Feb 20, 2013, 4:47 AM

Post #5 of 9 (645 views)
Re: [Rahul6990] Paragraph extraction

Hmm, I see... What should I do then?


BillKSmith
Veteran

Feb 20, 2013, 5:17 AM

Post #6 of 9 (641 views)
Re: [IsabelleFr] Paragraph extraction

The /g that you already have should take care of multiple paragraphs, once you get the regex right.

Refer to the first paragraph of the section of perldoc perlvar on INPUT_RECORD_SEPARATOR for information on how to read the entire file into a single string.
Good Luck,
Bill
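
One way this might look, as a sketch only: undefining $/ (the input record separator) inside a block makes a single read return everything up to end of file. The in-memory filehandle and the names $sample, $fh, $content and @paragraphs are assumptions made for the example; in the original script the read would come from STDIN. The non-greedy pattern is likewise just one possible choice.

```perl
use strict;
use warnings;

# Hypothetical input; an in-memory filehandle stands in for STDIN or a file.
my $sample = "<p>one</p>\n<p>two\nlines</p>\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";

my $content = do {
    local $/;    # undef record separator: one read returns the whole input
    <$fh>;
};
close $fh;

# In list context, /g returns every captured paragraph at once.
my @paragraphs = $content =~ /(<p>.*?<\/p>)/gis;

print "$_\n" for @paragraphs;
```

Using local inside a do block restores $/ afterwards, so later line-by-line reads elsewhere in the script behave as usual.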


Laurent_R
Enthusiast / Moderator

Feb 20, 2013, 2:22 PM

Post #7 of 9 (626 views)
Re: [BillKSmith] Paragraph extraction

In addition to what has been said, your regex is wrong because regex quantifiers are greedy.

If you have something like:


Code
<p>foo</p> <p>bar</p>


this will be considered as just one paragraph, whereas there are two of them.

You need to change your regex:


Code
if ($para =~ /.*(<p>.*<\/p>).*/ig)


to something like:


Code
if ($para =~ /.*(<p>[^<]+<\/p>).*/ig)


which will prevent the two paragraphs from being treated as a single one. An alternative approach is to use non-greedy quantifiers, i.e. something like this:


Code
if ($para =~ /.*(<p>.+?<\/p>).*/ig)


which will match as little as possible between the <p> and </p> tags.

I am only suggesting corrections to your regex; you'll still need to combine this with what has been pointed out before about multi-line matching.
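
To make the difference concrete, here is a small sketch comparing the three patterns on the two-paragraph sample above. The leading and trailing .* are dropped and /g is used in list context so that every match is returned; those are choices made for this illustration, and the array names are made up.

```perl
use strict;
use warnings;

my $line = '<p>foo</p> <p>bar</p>';

# Greedy: .* runs to the last </p>, so both paragraphs come back as one match.
my @greedy = $line =~ /(<p>.*<\/p>)/ig;

# Negated character class: cannot cross a "<", so each paragraph matches alone.
# Note that it would also fail on a paragraph containing inner tags such as <b>.
my @negated = $line =~ /(<p>[^<]+<\/p>)/ig;

# Non-greedy: .+? stops at the first </p> it can reach.
my @lazy = $line =~ /(<p>.+?<\/p>)/ig;

print scalar(@greedy),  " greedy match(es)\n";
print scalar(@negated), " negated-class match(es)\n";
print scalar(@lazy),    " non-greedy match(es)\n";
```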


Kenosis
User

Feb 23, 2013, 8:08 PM

Post #8 of 9 (600 views)
Re: [IsabelleFr] Paragraph extraction

You can't parse [X]HTML with regex. Or, at least, you shouldn't try--especially when you can use a module, like Mojo::DOM, that's well designed for the task:


Code
use strict; 
use warnings;
use Mojo::DOM;

my $html = <<END;
<p>foo</p> <p>
bar</p>
<p>
foo bar
</p><p> bar
foo
bar

</p>
END

my $dom = Mojo::DOM->new($html);

for my $paragraph ( $dom->find('p')->each ) {
    print $paragraph->text, "\n";
}


Output:

Quote
foo
bar
foo bar
bar foo bar


If you don't want smart whitespace trimming (notice that the text of the paragraphs has been reformatted), you can do the following:


Code
print $paragraph->text(0)



(This post was edited by Kenosis on Feb 23, 2013, 9:49 PM)


Laurent_R
Enthusiast / Moderator

Feb 24, 2013, 1:34 AM

Post #9 of 9 (578 views)
Re: [Kenosis] Paragraph extraction

While I restricted my previous post to pointing out a flaw in the OP's regex, I definitely agree with Kenosis that it is generally a bad idea to use regexes for parsing HTML, except possibly for the simplest cases.

There are many CPAN modules that can do this far better for you.

 
 

