Exporting multibyte characters in UTF-8 to XML

 



ev0lution37
Novice

Nov 13, 2012, 5:22 PM

Post #1 of 2 (4915 views)
Exporting multibyte characters in UTF-8 to XML

I've been looking around for a solution to this but haven't found anything that works.

The project I'm working on exports data from a MySQL database (utf8_general_ci collation) and formats it as XML. Latin-based characters have been no problem so far: I use XML::Writer and things work fine.

We've recently started covering Chinese and Arabic sources, however, and their text ends up as junk in the XML files: usually just a mess of question marks, and the files won't open in a browser, which reports "error on line 2 at column 190: Encoding error".

Here's the code portion I use to write to the XML:

Code
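#Load the modules used below.
use IO::File;
use XML::Writer;

#Set up temporary XML output file for XML::Writer
my $output = IO::File->new(">$PassedSource.xml")
    or die "Cannot open $PassedSource.xml: $!";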

#Initiate XML::Writer
my $writer = XML::Writer->new(OUTPUT => $output);

#Write XML header and start tag.
$writer->xmlDecl("UTF-8");
$writer->startTag("objects");


#Iterates through MySQL table until all matching rows have been parsed and written to XML.
while (my ($SourceID,$SourceURL,$artTitle,$artDate,$artText,$objKey,$robName,$exID,$firstExt,$lastExt,$extYN,$lastUD) = $sth->fetchrow_array ())
{

$FileName = $SourceID;

$writer->startTag("ISIS");

#### Removing illegal characters.
# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$artTitle =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
# restricted:[#x1-#x8][#xB-#xC][#xE-#x1F][#x7F-#x84][#x86-#x9F]
$artTitle =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;
$log->debug("Article Title: $artTitle");

#### Removing illegal characters.
# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$artText =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
# restricted:[#x1-#x8][#xB-#xC][#xE-#x1F][#x7F-#x84][#x86-#x9F]
$artText =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;


#Write to XML file per each row in Database (aka, each article).
$writer->startTag("articleTitle");
$writer->characters($artTitle);
$writer->endTag("articleTitle");

$writer->startTag("date");
$writer->characters($artDate);
$writer->endTag("date");

$writer->startTag("url");
$writer->characters($SourceURL);
$writer->endTag("url");

$writer->startTag("articleText");
$writer->characters($artText);
$writer->endTag("articleText");

$writer->endTag("ISIS");

#Clears MySQL database per loop iteration.
#The placeholder (?) lets DBI quote the URL and avoids shadowing the outer $sth.
my $deleteSQL = "DELETE FROM CollectionDB.CollectionDB WHERE url = ?";
$dbh->do($deleteSQL, undef, $SourceURL);


}

$writer->endTag("objects");
$writer->end();

$log->debug("Successfully wrote XML to temporary XML file: $PassedSource.xml\n");

$output->close();


Thanks in advance for any insight you can give.
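
One thing worth checking before the strings reach XML::Writer is whether they are decoded character strings or raw UTF-8 bytes, since the \x{...} ranges in the substitutions above only refer to code points when the text is decoded. Below is a minimal sketch, assuming the database driver hands back raw UTF-8 bytes. The helper name clean_for_xml and the sample input are hypothetical, and only the core Encode module is used:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw UTF-8 bytes into a character string and
# drop the code points that the two substitutions in the post remove.
sub clean_for_xml {
    my ($bytes) = @_;
    return '' unless defined $bytes;

    # Assumption: the input is UTF-8 encoded bytes, not yet decoded.
    my $text = decode('UTF-8', $bytes);

    # Same character classes as the original post, minus the no-op /o flag.
    $text =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    $text =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//g;
    return $text;
}

# Stand-in for a fetched column: UTF-8 bytes for two Chinese characters
# followed by a stray control character (0x07).
my $raw   = "\xE4\xB8\xAD\xE6\x96\x87\x07";
my $clean = clean_for_xml($raw);

printf "%d characters after cleaning\n", length $clean;
```

If the connection already returns decoded strings (with DBD::mysql, the mysql_enable_utf8 connection attribute is one way to get that), the decode call should be dropped, otherwise the text would be decoded twice.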


wickedxter
User

Nov 21, 2012, 8:34 PM

Post #2 of 2 (4694 views)
Re: [ev0lution37] Exporting multibyte characters in UTF-8 to XML


Code
#Initiate XML::Writer  
my $writer = XML::Writer->new(OUTPUT => $output,ENCODING => 'utf-8');


ENCODING
A character encoding; currently this must be one of 'utf-8' or 'us-ascii'. If present, it will be used for the underlying character encoding and as the default in the XML declaration.

http://search.cpan.org/~josephw/XML-Writer-0.615/Writer.pm
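
To illustrate the ENCODING option quoted above, here is a minimal sketch, assuming the article fields are already decoded Perl character strings (if they are still raw UTF-8 bytes from the database, writing them through an encoding layer would likely double-encode them). The %article hash and its contents are made up for illustration, and an in-memory handle stands in for the "$PassedSource.xml" file:

```perl
use strict;
use warnings;
use utf8;                      # the string literals below are written in UTF-8
use XML::Writer;

# Assumption: these literals stand in for decoded values fetched from the database.
my %article = (
    title => "新闻标题",
    text  => "مرحبا بالعالم",
);

# In-memory handle, so the sketch leaves no file behind.
my $xml = '';
open my $output, '>', \$xml or die "Cannot open in-memory handle: $!";

# ENCODING sets the output encoding and the default for the XML declaration.
my $writer = XML::Writer->new(OUTPUT => $output, ENCODING => 'utf-8');

$writer->xmlDecl();            # no argument: the declaration uses ENCODING
$writer->startTag("objects");
$writer->startTag("ISIS");
$writer->dataElement("articleTitle", $article{title});
$writer->dataElement("articleText",  $article{text});
$writer->endTag("ISIS");
$writer->endTag("objects");
$writer->end();
close $output or die "Cannot close handle: $!";

# $xml now holds the encoded bytes of the document.
print length($xml), " bytes written\n";
```

The same constructor call should work unchanged with the IO::File handle from the original post.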

 
 

