XML::Twig doctype and entity handling

Sep 6, 2008, 11:08 AM


I'm writing a program that needs to extract a clump of XML metadata stored inside a noncompliant HTML file and then perform a number of operations on that metadata. (Specifically, for those curious, this is part of a Mobipocket .prc to IDPF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML entities may be found in the metadata portion. In particular, &copy; is the first entity that XML::Parser chokes on in my current test data.

Could someone please provide me with an example of how to get XML::Twig to recognize XHTML entities? (Or even just &copy; to get me started?) I came up with a workaround: slurp the input file, use a regular expression to split the metadata out into a temporary file, and then run tidy on it. It's something of an evil hack, though, given that I have to read the results back into XML::Twig anyway.
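For illustration, the regex part of that workaround can be sketched in core Perl without a temporary file. This is only a sketch under assumptions: the sample HTML string, the `rights` element, the three-entry entity table, and the helper names `extract_metadata`, `named_to_numeric`, and `numeric_entities` are all hypothetical. Rewriting named entities as numeric character references is a substitute for the tidy step, not the tidy step itself.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the slurped HTML file.
my $html = '<html><head><metadata><rights>&copy; 2008 Example</rights>'
         . '</metadata></head><body><p>Text&nbsp;here</p></body></html>';

# Deliberately tiny table; a real one would cover the full XHTML entity set.
my %codepoint_for = (copy => 169, nbsp => 160, reg => 174);

# Return the first <metadata>...</metadata> block, or undef if there is none.
sub extract_metadata {
    my ($text) = @_;
    my ($block) = $text =~ m{(<metadata\b.*?</metadata>)}s;
    return $block;
}

# Map one entity name to a numeric reference; unknown names pass through.
sub named_to_numeric {
    my ($name) = @_;
    return exists $codepoint_for{$name}
        ? sprintf('&#%d;', $codepoint_for{$name})
        : "&$name;";
}

# Rewrite every named entity reference in a string.
sub numeric_entities {
    my ($xml) = @_;
    $xml =~ s/&([A-Za-z][A-Za-z0-9]*);/named_to_numeric($1)/ge;
    return $xml;
}

my $block = extract_metadata($html);
die "no metadata block found\n" unless defined $block;

my $metadata = numeric_entities($block);
print "$metadata\n";
```

Numeric character references need no declaration, so the resulting string should be acceptable to an XML parser even without a doctype. The predefined XML entities such as `&amp;` are not in the table and are left alone.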

My last attempt at getting XML::Twig to read this looks like this:

    $mobihtmltwig = XML::Twig->new(
        load_DTD => 1,
        twig_roots => { 'metadata' => 1 },
        twig_handlers => { 'metadata' => \&twig_cut_metadata },
        output_encoding => 'utf8',
        pretty_print => 'indented',
        twig_print_outside_roots => 'HTML'

    "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN");

    $mobihtmltwig->entity_list->add_new_ent(copy => "©");

    print $mobihtmltwig->entity_names,"\n";


It dies at the parsefile command with:

undefined entity at line 1, column 413, byte 413 at /usr/lib/perl5/XML/ line 187

Byte 413 is the first &copy;. This is despite 'copy' being present in the entity list.
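A possible explanation is that the underlying expat parser resolves entity references against the declarations it has read from the document's own DTD, and does not consult entries added to the twig's entity list from Perl. A minimal sketch under that assumption: declare `copy` in an internal DTD subset and prepend it to the extracted metadata before parsing. The sample string and the `rights` element are hypothetical, and this has not been checked against the Mobipocket data described above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical metadata fragment with no doctype of its own.
my $fragment = '<metadata><rights>&copy; 2008 Example</rights></metadata>';

# Internal DTD subset declaring the entity; the name after DOCTYPE
# matches the root element of the fragment.
my $doctype = '<!DOCTYPE metadata [ <!ENTITY copy "&#169;"> ]>';

my $twig = XML::Twig->new();

# With the declaration in place, the parser has a definition for
# &copy; by the time it reaches the reference.
$twig->parse($doctype . "\n" . $fragment);

print $twig->root->tag, "\n";
```

Whether the reference is later printed expanded or as `&copy;` depends on XML::Twig's output options, so no particular output text is assumed here.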

Thanks for any help.

