
Zed
New User
Sep 6, 2008, 11:08 AM
XML::Twig doctype and entity handling
I'm writing a program that needs to extract a clump of XML metadata stored inside a noncompliant HTML file and then perform a number of operations on that metadata. (Specifically, for those curious, this is part of a Mobipocket .prc to IDPF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML entities may be found in the metadata portion. In particular, &copy; is the first entity that XML::Parser chokes on in my current test data. Could someone please provide me with an example of how to get XML::Twig to recognize XHTML entities? (Or even just &copy; to get me started?)

I came up with a workaround that involves slurping the input file, using a regular expression to split the metadata out into a temporary file, and then running tidy on it. It's something of an evil hack, though, given that I have to read the results of that back into XML::Twig anyway.

My last attempt at getting XML::Twig to read this looks like this:

    $mobihtmltwig = XML::Twig->new(
        load_DTD                 => 1,
        twig_roots               => { 'metadata' => 1 },
        twig_handlers            => { 'metadata' => \&twig_cut_metadata },
        output_encoding          => 'utf8',
        pretty_print             => 'indented',
        twig_print_outside_roots => 'HTML'
    );
    $mobihtmltwig->set_doctype(
        'package',
        "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd",
        "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN"
    );
    $mobihtmltwig->entity_list->add_new_ent(copy => "©");
    print $mobihtmltwig->entity_names,"\n";
    $mobihtmltwig->parsefile($mobihtmlfile);

It dies at the parsefile command with:

    undefined entity at line 1, column 413, byte 413 at /usr/lib/perl5/XML/Parser.pm line 187

Byte 413 is the first &copy;. This is despite 'copy' being present in the entity list.

Thanks for any help.
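
For anyone hitting the same error, here is one possible stopgap. It is a sketch under assumptions, not something taken from the code above. Numeric character references need no declaration, so named XHTML entities can be rewritten to numeric form before the text reaches the parser. The `%xhtml_entity` table and the `numeric_entities` function are hypothetical names. The table is deliberately incomplete and would need the rest of the XHTML entity set filled in.

```perl
use strict;
use warnings;

# Hypothetical, partial table: XHTML entity name => Unicode code point.
# Extend it with whatever entities actually appear in the input.
my %xhtml_entity = (
    copy  => 169,
    reg   => 174,
    nbsp  => 160,
    mdash => 8212,
);

# Rewrite known named entities as numeric character references.
# Names missing from the table (including the predefined XML entities
# such as &amp; and &lt;) are left exactly as they were.
sub numeric_entities {
    my ($xml) = @_;
    $xml =~ s{&([A-Za-z][A-Za-z0-9]*);}
             { exists $xhtml_entity{$1}
                   ? sprintf('&#%d;', $xhtml_entity{$1})
                   : "&$1;" }ge;
    return $xml;
}

# Example: only &copy; is rewritten, &amp; passes through untouched.
print numeric_entities('<metadata>&copy; 2008 &amp; later</metadata>'), "\n";
```

The converted string could then be handed to XML::Twig's `parse` method rather than `parsefile`. This would avoid the temporary file, though it still depends on the metadata having been extracted into a string first.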