encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256

 



ulo
Novice

Sep 6, 2011, 1:19 AM

Post #1 of 8 (2714 views)

Hi,

I have a weird encoding problem with just a small number of unicode characters. If I want to use chr($cp) to produce the unicode character with codepoint $cp, this works fine in principle. It seems to work for all codepoints >= 0x100 without limitation, and it also works for the ascii range. Why shouldn't it?

For the range in between (0x80-0xFF) it works to produce the character and print it directly, but as soon as I try to integrate the char into some string, the replacement character U+FFFD is printed instead. It doesn't matter whether I use the '.' operator, s///, or anything else.

Check out the following program:

Code
use encoding "utf8"; 

print "Use comma to separate arguments for print-function\n";
print chr(0x0061),"\n"; # LATIN SMALL LETTER A: trivial
print chr(0x00DC),"\n"; # LATIN CAPITAL LETTER U WITH DIAERESIS: works fine
print chr(0x2184),"\n"; # LATIN SMALL LETTER REVERSED C: works fine

print "Use string concatenation instead - expect same output\n";
print chr(0x0061)."\n"; # works fine
print chr(0x00DC)."\n"; # FAILS! prints the replacement character U+FFFD instead!!!!
print chr(0x2184)."\n"; # works fine


Or, more systematic:

Code
use encoding "utf8"; 
for(0..0x120) {
print "$_\t",chr($_)."\n";
print "$_\t".chr($_)."\n";
}


Is this a bug? Can it ever be a feature??

Thanks for your help!

ps. I'm new to this forum, so please just tell me if I missed any conventions ...


rovf
Veteran

Sep 6, 2011, 3:41 AM

Post #2 of 8 (2691 views)
Re: [ulo] encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256

I modified your example a bit, and in my modification, the problem does not occur:


Code
$ perl -we 'use encoding "utf8";print chr($_)."01" for (0x0061,0x00DC,0x2184)'|od -cx


When I execute this on the command line, I get:


Code
0000000   a   0   1 357 277 275   0   1 342 206 204   0   1 
3061 ef31 bdbf 3130 86e2 3084 0031


I inserted the string "01" to make the byte order easier to recognize. We see that 0x0061 maps to 61 (as expected), 0x00DC becomes efbdbf (don't know whether this is correct, but this is certainly different from your result), and 0x2184 turns into e28684. This is with Perl 5.10.1 running on Cygwin.

Codepoint 0xFFFD would correspond to efbfbd in UTF8 encoding, right?
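
To check that by hand, here is a minimal sketch that prints the UTF-8 bytes for the codepoints used in this thread. It uses the core Encode module and does not use the encoding pragma; the helper name utf8_hex is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encoding of a codepoint as space-separated hex bytes.
sub utf8_hex {
    my ($cp) = @_;
    my $bytes = encode('UTF-8', chr($cp));
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0x0061, 0x00DC, 0x2184, 0xFFFD;
```

U+FFFD comes out as ef bf bd, and U+00DC as c3 9c, so a correct encoding of chr(0x00DC) should show up as two bytes, not three.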

If you use my example, do you get the same result as I do, or do you still get the encoding for 0xFFFD for chr(0xdc)?


ulo
Novice

Sep 6, 2011, 4:39 AM

Post #3 of 8 (2681 views)
Re: [rovf] encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256

Thanks for your answer, rovf.
Your example does exactly the same thing on my machine, and I think it reproduces my problem. I find od -cx somewhat confusing - if you use od -ctx1 (print char and hex for each single byte) you see that, again, the replacement char ef bf bd is printed:

Code
$ perl -we 'use encoding "utf8";print chr($_)."01" for (0x0061,0x00DC,0x2184)'|od -ctx1 
0000000 a 0 1 357 277 275 0 1 342 206 204 0 1
61 30 31 ef bf bd 30 31 e2 86 84 30 31
0000015
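
The reason od -cx is confusing can be sketched in Perl: od -x groups the bytes into 16-bit words in host byte order, so on a little-endian machine (assumed here, as on x86 Cygwin) each pair of bytes appears swapped. The byte string below is copied from the od -tx1 output above.

```perl
use strict;
use warnings;

# The 13 bytes written by the one-liner, as shown by od -tx1.
my $bytes = pack 'H*', '613031efbfbd3031e286843031';

# od -x reads 16-bit words; pad with a zero byte if the length is odd.
my $padded = length($bytes) % 2 ? $bytes . "\0" : $bytes;

# 'v' = 16-bit little-endian word (assumption: little-endian host).
my $little_endian = join ' ', map { sprintf '%04x', $_ } unpack 'v*', $padded;
my $byte_by_byte  = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;

print "od -x style:   $little_endian\n";
print "od -tx1 style: $byte_by_byte\n";
```

The first line reproduces 3061 ef31 bdbf 3130 86e2 3084 0031 from post #2, which is how ef bf bd ended up being read as "efbdbf".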


Another way to see it is to convert the output to utf32:

Code
$ perl -we 'use encoding "utf8";print chr($_)."01" for (0x0061,0x00DC,0x2184)'| iconv -tutf32LE | od -tx4 
0000000 00000061 00000030 00000031 0000fffd
0000020 00000030 00000031 00002184 00000030
0000040 00000031
0000044
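
The same check can be done without iconv. This is only a sketch, using the core Encode module to decode the captured bytes and list the codepoints:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes captured from the one-liner (same as the od -tx1 output).
my $bytes = pack 'H*', '613031efbfbd3031e286843031';

# Decode UTF-8 bytes into characters, then list each codepoint.
my $chars       = decode('UTF-8', $bytes);
my @code_points = map { sprintf '%08x', ord($_) } split //, $chars;

print "@code_points\n";
```

The fourth codepoint is 0000fffd, matching the iconv output.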



rovf
Veteran

Sep 6, 2011, 6:20 AM

Post #4 of 8 (2669 views)
Re: [ulo] encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256

Indeed, you are right! Silly that I couldn't see it in the first place.

However, I found a hint why these characters are treated differently. From perldoc -f chr :



Note that characters from 128 to 255 (inclusive) are by default
internally not encoded as UTF-8 for backward compatibility
reasons.
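
The difference described in that note can be made visible with utf8::is_utf8, which reports whether a string is stored internally as UTF-8. A small sketch, run without the encoding pragma:

```perl
use strict;
use warnings;

# Show the internal UTF-8 flag for characters on both sides of 0x80 and 0x100.
for my $cp (0x61, 0x7F, 0x80, 0xDC, 0xFF, 0x100, 0x2184) {
    my $char = chr($cp);
    printf "U+%04X  length=%d  utf8_flag=%s\n",
        $cp, length($char), utf8::is_utf8($char) ? 'on' : 'off';
}
```

Every string has length 1, but the flag is only on from 0x100 upwards. So chr(0x00DC) and chr(0x2184) are stored differently, even though both are single characters.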


Of course, this does not yet explain why the problem occurs only with concatenation.

At first I thought it might be related to the fact that concatenation puts chr() into scalar context, but this is not the reason. Even if I call it in list context, take the first element of the list, and concatenate that, the bug appears:


Code
print(([chr(0x00DC)]->[0])."\n")


BTW, it is not only concatenation. Interpolation also doesn't work:


Code
print "@{ [chr(0x00DC)] }\n"


Since


Code
print(chr(0x00DC),"\n")


works, I feel that it is not just a bug in chr, but something deeper in Perl's handling of Unicode strings.

If you can't find a good explanation in this forum, I suggest that you explain the issue at http://perlmonks.org/, and if they also can't explain it, I would file a bug report....


ulo
Novice

Sep 6, 2011, 7:52 AM

Post #5 of 8 (2656 views)
Re: [rovf] encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256


Quote
Note that characters from 128 to 255 (inclusive) are by default
internally not encoded as UTF-8 for backward compatibility
reasons.

This is valuable information. I'm sure the reason is hidden here.
Let's see if the monks have something to say about it ...


rovf
Veteran

Sep 7, 2011, 1:13 AM

Post #6 of 8 (2539 views)
Re: [ulo] encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256


Quote
I'm sure the reason is hidden here.


Maybe, but with this alone it is difficult to understand why print($this_character,"x") works, but print($this_character."x") does not.

Let us know your findings, in case you get an explanation...


rovf
Veteran

Sep 8, 2011, 2:11 AM

Post #7 of 8 (2512 views)
Re: [ulo] encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256

I found this in perluniintro:


Quote
Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
eight-bit character for backward compatibility with older Perls. For
arguments of 0x100 or more, Unicode characters are always produced. If
you want to force the production of Unicode characters regardless of the
numeric value, use "pack("U", ...)" instead of "\x..", "\x{...}", or
"chr()".


Maybe it helps....
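
A quick sketch of what that passage describes, run without the encoding pragma: chr() and pack("U", ...) give the same character for 0x00DC, but only the pack version is stored internally as UTF-8.

```perl
use strict;
use warnings;

my $from_chr  = chr(0x00DC);        # eight-bit character, per perluniintro
my $from_pack = pack('U', 0x00DC);  # forced Unicode character

printf "chr:  utf8_flag=%s\n", utf8::is_utf8($from_chr)  ? 'on' : 'off';
printf "pack: utf8_flag=%s\n", utf8::is_utf8($from_pack) ? 'on' : 'off';
print  "same character: ", ($from_chr eq $from_pack ? 'yes' : 'no'), "\n";
```

The two strings compare equal with eq, so the difference is only in the internal representation.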


ulo
Novice

Sep 15, 2011, 1:05 AM

Post #8 of 8 (2443 views)
Re: [rovf] encoding trouble: cannot concatenate unicode characters with codepoint between #128 and #256

I will try and give this thread a sort of wrap-up...

Perl Unicode has problems with codepoints 128-255 (0x80-0xff). This means that e.g. chr(228) or related things like \x{..} will give you trouble. As far as I can see this is known as the "Unicode Bug", or at least an important aspect thereof. Often it is not called a bug but a backwards compatibility issue. Whatever ... (see links below)

In my examples above things begin to go wrong as soon as I concatenate these characters with other strings. The links listed below also include indications why this could be so. But after years and years of perl programming with utf8, I still find it very hard to finally understand.


The sources below often suggest utf8::upgrade or utf8::encode but neither worked for me. What did work was what rovf found:
pack( "U",0x00DC)
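
For reference, here is a sketch of what utf8::upgrade and utf8::encode do on their own, in a script without the encoding pragma. How they behave together with use encoding "utf8" is not shown here and may well differ.

```perl
use strict;
use warnings;

my $upgraded = chr(0x00DC);
utf8::upgrade($upgraded);   # changes the internal storage only

my $encoded = chr(0x00DC);
utf8::encode($encoded);     # replaces the character with its UTF-8 bytes

printf "upgrade: length=%d flag=%s ord=0x%X\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 'on' : 'off', ord($upgraded);
printf "encode:  length=%d flag=%s bytes=%s\n",
    length($encoded), utf8::is_utf8($encoded) ? 'on' : 'off',
    join(' ', map { sprintf '%02x', $_ } unpack 'C*', $encoded);
```

After utf8::upgrade the string is still the single character U+00DC, just stored differently. After utf8::encode it is the two bytes c3 9c, which is a different string altogether.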


Code
use encoding "utf8"; # assumed, as in the examples above

print chr(0x0061)."-"; # works fine
print pack("U",0x0061)."-"; # works fine

print chr(0x00DC)."-"; # FAILS! prints the replacement character U+FFFD instead!!!!
print pack("U",0x00DC)."-"; # WORKS!

print chr(0x2184)."-"; # works fine
print pack("U",0x2184)."-"; # works fine
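
For comparison, a sketch of the same concatenation test without the encoding pragma, which is documented as deprecated in later Perl releases. The output layer is set explicitly on an in-memory filehandle so the resulting bytes can be inspected. Whether this setup is a suitable replacement for the original script is an assumption.

```perl
use strict;
use warnings;

# Write through an explicit UTF-8 output layer into an in-memory buffer,
# instead of relying on the encoding pragma (assumption: this is acceptable
# for the script in question).
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} chr(0x00DC) . "-";
print {$out} pack('U', 0x00DC) . "-";
close $out or die "close failed: $!";

my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $buffer;
print "$hex\n";
```

Here both variants produce c3 9c 2d, i.e. the correct UTF-8 for U+00DC followed by the dash.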


Thanks rovf for the helpful comments.

Look here:
http://perldoc.perl.org/perlunicode.html#Byte-and-Character-Semantics
http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22
http://perldoc.perl.org/perluniintro.html#Questions-With-Answers


Scheiß Encoding! ("Damn encoding!")

 
 

