
Kenosis
User
Feb 23, 2013, 8:08 PM
Post #8 of 9
Re: [IsabelleFr] Paragraph extraction
You can't parse [X]HTML with regex. Or, at least, you shouldn't try--especially when you can use a module, like Mojo::DOM, that's well designed for the task:
use strict;
use warnings;
use Mojo::DOM;

my $html = <<END;
<p>foo</p>
<p> bar</p>
<p> foo bar </p><p> bar foo bar </p>
END

my $dom = Mojo::DOM->new($html);

for my $paragraph ( $dom->find('p')->each ) {
    print $paragraph->text, "\n";
}

Output:

foo
bar
foo bar
bar foo bar

If you don't want smart whitespace trimming (notice that the text of the paragraphs has been reformatted), you can do the following:
print $paragraph->text(0)
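
For a concrete sense of why the regex route is fragile, here is a minimal sketch using only core Perl. The sample markup and the pattern are made up for illustration (they are not from the original question); the point is just that a pattern written for a literal <p> tag silently skips paragraphs that carry an attribute or use a different case:

```perl
use strict;
use warnings;

# Hypothetical sample markup: the second paragraph carries an attribute,
# and the third uses an uppercase tag name.
my $html = <<'END';
<p>foo</p>
<p class="note">bar</p>
<P>baz</P>
END

# A naive pattern that only recognizes the exact string "<p>" as an opening tag.
my @found = $html =~ m{<p>(.*?)</p>}gs;

# Report how many of the three paragraphs the pattern actually caught.
print scalar(@found), " of 3 paragraphs matched\n";
print "$_\n" for @found;
```

Only the first paragraph is captured. You can keep patching the pattern for attributes, case, and nesting, but a real parser already handles all of that.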
(This post was edited by Kenosis on Feb 23, 2013, 9:49 PM)