
ev0lution37
Novice
Nov 13, 2012, 5:22 PM
Post #1 of 2
Exporting multibyte characters in UTF-8 to XML
I've been looking for a solution to this but haven't found anything that works. My current project exports data from a MySQL database (utf8_general_ci collation) and formats it as XML. Latin-based characters have been no problem so far: I use XML::Writer and things work fine. We've recently started covering Chinese and Arabic sources, however, and they all seem to end up as junk in the XML files (usually just a mess of question marks). The files also can't be opened in a browser because of "error on line 2 at column 190: Encoding error". Here's the code portion I use to write the XML:
#Set up temporary XML output file for XML::Writer
my $output = IO::File->new(">$PassedSource.xml");

#Initiate XML::Writer
my $writer = XML::Writer->new(OUTPUT => $output);

#Write XML header and start tag.
$writer->xmlDecl("UTF-8");
$writer->startTag("objects");

#Iterates through MySQL table until all matching rows have been parsed and written to XML.
while (my ($SourceID,$SourceURL,$artTitle,$artDate,$artText,$objKey,$robName,$exID,$firstExt,$lastExt,$extYN,$lastUD) = $sth->fetchrow_array ())
{
    $FileName = $SourceID;
    $writer->startTag("ISIS");

    #### Removing illegal characters.
    # allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $artTitle =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
    # restricted:[#x1-#x8][#xB-#xC][#xE-#x1F][#x7F-#x84][#x86-#x9F]
    $artTitle =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;

    $log->debug("Article Title: $artTitle");

    #### Removing illegal characters.
    # allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $artText =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
    # restricted:[#x1-#x8][#xB-#xC][#xE-#x1F][#x7F-#x84][#x86-#x9F]
    $artText =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;

    #Write to XML file per each row in Database (aka, each article).
    $writer->startTag("articleTitle");
    $writer->characters($artTitle);
    $writer->endTag("articleTitle");

    $writer->startTag("date");
    $writer->characters($artDate);
    $writer->endTag("date");

    $writer->startTag("url");
    $writer->characters($SourceURL);
    $writer->endTag("url");

    $writer->startTag("articleText");
    $writer->characters($artText);
    $writer->endTag("articleText");

    $writer->endTag("ISIS");

    my $deleteSQL = "DELETE FROM CollectionDB.CollectionDB WHERE url='$SourceURL'"; #Clears MySQL database per loop iteration.
    my $sth = $dbh->prepare($deleteSQL);
    $sth->execute();
}

$writer->endTag("objects");
$writer->end();
$log->debug("Successfully wrote XML to temporary XML file: $output" . "\n");
$output->close();

Thanks in advance for any insight you can give.
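
One thing that may be worth ruling out, sketched here under assumptions rather than confirmed against the real data: if the strings coming back from DBI are undecoded UTF-8 bytes (which can happen when the connection is not told to decode, for example when DBD::mysql's mysql_enable_utf8 attribute is not set), then the "restricted characters" filter above works on bytes, not characters, and can delete UTF-8 continuation bytes. The sample string and the variable names below are hypothetical stand-ins, and only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for a value fetched from MySQL as undecoded bytes:
# the UTF-8 encoding of U+4E2D U+6587 (two Chinese characters, six bytes).
my $raw_bytes = "\xE4\xB8\xAD\xE6\x96\x87";

# The "restricted characters" filter from the code above.
my $restricted = qr/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]/;

# Applied to raw bytes, the filter removes the continuation bytes 0x96 and
# 0x87, so what is left is no longer valid UTF-8.
(my $damaged = $raw_bytes) =~ s/$restricted//g;

# Applied after decoding, the filter sees code points, so both characters
# survive untouched.
my $text = decode('UTF-8', $raw_bytes);
(my $clean = $text) =~ s/$restricted//g;

# Encode exactly once on the way out (an output layer such as
# ':encoding(UTF-8)' on the file handle would do the same job).
my $xml_bytes = encode('UTF-8', "<articleTitle>$clean</articleTitle>");

printf "bytes after filtering raw input:  %d\n", length $damaged;
printf "chars after filtering decoded text: %d\n", length $clean;
```

If this matches what is happening, the usual direction is to work with decoded character strings between the database and the writer and to encode only at the output handle. XML::Writer also accepts an ENCODING => 'utf-8' option in its constructor, which may be relevant here; whether it applies depends on how the handle and the incoming data are set up.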