
Kenosis
User
Feb 23, 2013, 8:08 PM
Re: [IsabelleFr] Paragraph extraction
You can't parse [X]HTML with regex. Or, at least, you shouldn't try--especially when you can use a module, like Mojo::DOM, that's well designed for the task:

use strict;
use warnings;
use Mojo::DOM;

my $html = <<END;
<p>foo</p>
<p> bar</p>
<p> foo bar </p><p> bar foo bar </p>
END

my $dom = Mojo::DOM->new($html);

for my $paragraph ( $dom->find('p')->each ) {
    print $paragraph->text, "\n";
}

Output:

foo
bar
foo bar
bar foo bar

If you don't want smart whitespace trimming (notice that the text of the paragraphs has been reformatted), you can do the following:

print $paragraph->text(0)
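
If it helps to see roughly what that trimming amounts to, here is a minimal sketch in plain Perl. It is only an approximation of the behavior described above: squish_text is a made-up helper for illustration, not a Mojo::DOM function, and the module's actual rules may differ. I also believe more recent Mojolicious releases no longer trim in text at all, so check the documentation for the version you have installed; a helper along these lines could be applied to the raw text in that case.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mojo::DOM): collapse every run of
# whitespace to a single space, then strip a leading/trailing space.
sub squish_text {
    my ($text) = @_;
    $text =~ s/\s+/ /g;     # newlines, tabs, repeated spaces -> one space
    $text =~ s/^ | $//g;    # drop the single space left at either end
    return $text;
}

my $raw = "  bar   foo\n   bar  ";
print '[', squish_text($raw), "]\n";    # prints "[bar foo bar]"
```

The brackets in the print statement are only there to make any leftover leading or trailing whitespace visible.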
(This post was edited by Kenosis on Feb 23, 2013, 9:49 PM)