hex metacharacters for characters below x100

 



JonathanPool
Novice

Mar 26, 2010, 10:16 PM

Post #1 of 29 (5308 views)
hex metacharacters for characters below x100

I'm finding that the hex metacharacter syntax works in regular expressions as I expect with characters from x100 up, but not with characters from xff down.

For example, I can match the "Ā" in a regular expression with \x{0100} or with [\x{0100}], but I cannot match the "ÿ" with \x{00ff} or [\x{00ff}] or \x{ff} or [\x{ff}] or \xff or [\xff].

I have found no Perl doc that describes this difference in treatment.

Can somebody explain this?

I'm attaching an example script and the input file it processes.
Attachments: debug.pl (0.92 KB)
  debug.txt (13 B)


7stud
Enthusiast

Mar 27, 2010, 2:36 AM

Post #2 of 29 (5299 views)
Re: [JonathanPool] hex metacharacters for characters below x100


Quote
the hex metacharacter syntax

I'm not sure where you got that term from. I would call the syntax: \x{ff} a "unicode escape sequence" to distinguish it from a regular 'hexadecimal escape sequence'.


Quote
For example, I can match the "Ā" in a regular expression with \x{0100} or with [\x{0100}]

Not me:

Code
use strict;
use warnings;
use 5.010;

use utf8;

my @strings = (
    'abc',
    'Ā',
);

for (@strings) {

    if (/\x{61}/) {    # U+0061 is 'a'
        say "matched 'a'";
    }

    if (/\x{0100}/) {
        say q{matched 'cap A with umlaut'};
    }
}

--output:--
matched 'a'



Quote
I'm finding that the hex metacharacter syntax works in regular expressions as I expect with characters from x100 up, but not with characters from xff down.


Maybe the following will alter your expectations:


Quote
Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}",
and "chr(...)" for arguments less than 0x100 (decimal 256) generate an
eight-bit character for backward compatibility with older Perls. For
arguments of 0x100 or more, Unicode characters are always produced. If
you want to force the production of Unicode characters regardless of
the numeric value, use "pack("U", ...)" instead of "\x..", "\x{...}",
or "chr()".

See perluniintro.
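
A minimal sketch of what the quoted passage describes, assuming a Perl 5.10-era or later interpreter. The variable names are illustrative only. It builds the same character, U+00FF, once with chr() and once with pack("U", ...), then reports the internal UTF-8 flag alongside the length and code point of each string.

```perl
use strict;
use warnings;

# Two ways to build the single character U+00FF (variable names are illustrative).
my $eight_bit = chr(0xFF);          # below 0x100: stored as an eight-bit character
my $forced    = pack("U", 0xFF);    # "U" asks for Unicode storage regardless of value

# The internal UTF-8 flag is the only thing that is expected to differ.
my $flag_eight_bit = utf8::is_utf8($eight_bit) ? 1 : 0;
my $flag_forced    = utf8::is_utf8($forced)    ? 1 : 0;

# Both strings hold one character with code point 255 and compare equal.
my $same = ($eight_bit eq $forced) ? 1 : 0;

printf "chr:  flag=%d length=%d ord=%d\n",
    $flag_eight_bit, length($eight_bit), ord($eight_bit);
printf "pack: flag=%d length=%d ord=%d\n",
    $flag_forced, length($forced), ord($forced);
print "equal: $same\n";
```

The point of the comparison is that the storage form is an implementation detail: at the character level the two strings are the same.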

ff is 1111 1111 in binary, which is 255 in decimal. The range 0-255 covers the 256 numbers that can be represented by one byte (= 8 bits). Usually problems along a boundary like that can be traced to single-byte character sets: ASCII uses 7 bits of the byte, giving the numbers 0-127 for its 128 characters, and an eight-bit set such as Latin-1 uses the remaining values 128-255.

Also, nothing in your post mentions an 'encoding', e.g. UTF-8. You can compare character strings to other character strings, but you can't meaningfully compare a character string to a string of UTF-8 encoded bytes (the exception being characters below \x{80}, i.e. ASCII, whose UTF-8 encoding is the same single byte). You have to decode the UTF-8 bytes, or encode the character string, before comparing them. Unfortunately, perl may convert between its internal representations automatically in certain situations, which can be very confusing.
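
To illustrate the encoding point, here is a hedged sketch of one situation in which \x{ff} fails to match "ÿ". The attached debug.pl is not reproduced in the thread, so this is an assumption about a common cause, not a diagnosis of that script. The hypothetical input is the two raw bytes C3 BF, which is what a file containing "ÿ" in UTF-8 yields when it is read without an encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 encoding of U+00FF as two raw bytes,
# as read from a file opened without an :encoding(UTF-8) layer.
my $raw     = "\xC3\xBF";
my $decoded = decode('UTF-8', $raw);   # one character, U+00FF

# The pattern \x{ff} means "the character with code point 0xFF".
my $raw_matches     = ($raw     =~ /\x{ff}/) ? 1 : 0;
my $decoded_matches = ($decoded =~ /\x{ff}/) ? 1 : 0;

print "raw bytes:      length ", length($raw),
      ", matches \\x{ff}: $raw_matches\n";
print "decoded string: length ", length($decoded),
      ", matches \\x{ff}: $decoded_matches\n";
```

Under these assumptions only the decoded string matches, because the raw string contains the characters 0xC3 and 0xBF rather than 0xFF.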


(This post was edited by 7stud on Mar 27, 2010, 3:27 AM)


JonathanPool
Novice

Mar 27, 2010, 8:43 AM

Post #3 of 29 (5278 views)
Re: [7stud] hex metacharacters for characters below x100

Thank you very much. I was relying on the statement (perluniintro) that "A user of Perl does not normally need to know nor care how Perl happens to encode its internal strings, but it becomes relevant when outputting Unicode strings to a stream without a PerlIO layer -- one with the "default" encoding." I was outputting nothing, so I didn't care.

It seems messy that Perl forces users to consider whether a codepoint is less than 0x100, when using hex representations.

To test this analysis, I have added to my script, creating variables with the "pack" function and then using those variables in regular expressions. Those patterns do indeed match the input characters with codepoints 0xff and 0x100. I am attaching the revised script.

As for your (7stud) script, I executed it on my Perl 5.10.0 system and got a different output from you. I got:

matched 'a'
matched 'cap A with umlaut'

Any idea why our outputs differ?

As for the term "metacharacter syntax", I got that from perlre (my problem arose with regular expressions):

"Characters may be specified using a metacharacter syntax much like that used in C: ... \xnn, where nn are hexadecimal digits, matches the character whose numeric value is nn."
Attachments: debug.pl (1.07 KB)


7stud
Enthusiast

Mar 27, 2010, 7:15 PM

Post #4 of 29 (5268 views)
Re: [JonathanPool] hex metacharacters for characters below x100

I don't see how that is possible. U+0100 is 'A with macron', and U+00C4 is 'A with diaeresis', which is the umlaut.

You are saying that you copied and pasted my script and got the output you posted? I am having a hard time believing that.


(This post was edited by 7stud on Mar 27, 2010, 7:17 PM)


JonathanPool
Novice

Mar 27, 2010, 7:29 PM

Post #5 of 29 (5263 views)
Re: [7stud] hex metacharacters for characters below x100

Your script contained "umlaut" hard-coded into it, but the character in your string was A with a macron. So yes, the message was incorrect, and I disregarded that. What was interesting was that your script output a success message for character 0x0100, although it had not output that when you had run the script. Yes, I copied and pasted your script, adding only a perl invocation line at the top.


7stud
Enthusiast

Mar 28, 2010, 2:27 AM

Post #6 of 29 (5252 views)
Re: [JonathanPool] hex metacharacters for characters below x100


Quote
Your script contained "umlaut" hard-coded into it, but the character in your string was A with a macron.

I don't understand what you are saying. My string has 'A with diaeresis' in it, and the regex looks for an 'A with macron'. There shouldn't be a match. On my system, perl does an automatic conversion in the regex: it converts the Unicode code point \x{0100} to UTF-8 because perl sees that the string I am searching has UTF-8 characters in it.


(This post was edited by 7stud on Mar 28, 2010, 2:31 AM)


JonathanPool
Novice

Mar 28, 2010, 8:51 AM

Post #7 of 29 (5247 views)
Re: [7stud] hex metacharacters for characters below x100

I see "cap A with umlaut" in the code posted in your first reply. I see no "A with diaeresis" in your code.

If on your system Perl converts characters into UTF-8, then I understand it finds no match. But what makes Perl do that? I believe I haven't seen that behavior.


7stud
Enthusiast

Mar 28, 2010, 12:37 PM

Post #8 of 29 (5244 views)
Re: [JonathanPool] hex metacharacters for characters below x100


Quote
I see "cap A with umlaut" in the code posted in your first reply. I see no "A with diaeresis" in your code.

As far as I can tell, there is no such thing as the first one in Unicode. See here:

http://unicode.org/charts/charindex.html


JonathanPool
Novice

Mar 28, 2010, 2:13 PM

Post #9 of 29 (5239 views)
Re: [7stud] hex metacharacters for characters below x100

True, but that's the string that was in your code quoted in your first reply to my posting. It doesn't matter what it says; it was just a quoted string.


7stud
Enthusiast

Mar 28, 2010, 4:29 PM

Post #10 of 29 (5237 views)
Re: [JonathanPool] hex metacharacters for characters below x100

I can assure you it's not. I copied and pasted the UTF-8 character C3 84 into my string. That character's Unicode code point is U+00C4, and its official name is "LATIN CAPITAL LETTER A WITH DIAERESIS". So, I'm not sure what you are talking about.


Quote
If on your system Perl converts characters into UTF-8, then I understand it finds no match. But what makes Perl do that? I believe I haven't seen that behavior.


perlunicode:

Quote
Regular Expressions

The regular expression compiler produces polymorphic opcodes. That
is, the pattern adapts to the data and automatically switches to
the Unicode character scheme when presented with data that is
internally encoded in UTF-8 -- or instead uses a traditional byte
scheme when presented with byte data.
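
A small sketch of the behaviour this passage describes, assuming Perl 5.10 or later. The variable names are illustrative. The same compiled pattern is applied to one character, U+00C4, held once in the eight-bit storage form and once in the internal UTF-8 form produced by utf8::upgrade.

```perl
use strict;
use warnings;

my $byte_form = "\xC4";         # U+00C4 in the eight-bit storage form
my $utf8_form = $byte_form;
utf8::upgrade($utf8_form);      # same character, now stored internally as UTF-8

my $pattern = qr/\x{c4}/;       # "the character with code point 0xC4"

# Apply the one pattern to both storage forms.
my @results = map { $_ =~ $pattern ? 1 : 0 } ($byte_form, $utf8_form);

print "eight-bit storage matches: $results[0]\n";
print "UTF-8 storage matches:     $results[1]\n";
```

Both forms are expected to match, which suggests that the adaptation described in the quote concerns storage and does not change which character a pattern like \x{c4} stands for.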



(This post was edited by 7stud on Mar 28, 2010, 4:35 PM)


JonathanPool
Novice

Mar 28, 2010, 4:34 PM

Post #11 of 29 (5233 views)
Re: [7stud] hex metacharacters for characters below x100

I'm talking about the code in your first reply in this thread. That code contains "umlaut" in it.


7stud
Enthusiast

Mar 28, 2010, 4:38 PM

Post #12 of 29 (5232 views)
Re: [JonathanPool] hex metacharacters for characters below x100

Ok, I see what you are saying now. You could have simply said, "The output of your program is 'A with umlaut'." I also added some info to my previous post.

I still don't know why your code finds a match for U+0100. Neither character in the string has that Unicode code point. The A's code point is U+00C4, and the euro symbol's code point is U+20AC.

Maybe instead of copying the string with the A and the euro symbol, you should try creating that string yourself in your source code.


(This post was edited by 7stud on Mar 28, 2010, 4:49 PM)


JonathanPool
Novice

Mar 28, 2010, 4:52 PM

Post #13 of 29 (5226 views)
Re: [7stud] hex metacharacters for characters below x100

Sorry for the miscommunication.

The notion that Perl uses UTF-8 internally confuses me. In perlunicode, I see "A character in Perl is logically just a number ranging from 0 to 2**31 or so." In my experience, this is how Perl behaves; I can match a character with its codepoint value, not with 2, 3, or 4 successive values corresponding to its UTF-8 encoding. If you could explain the sense in which UTF-8 is an internal representation, I would be grateful.
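
The distinction asked about here can be sketched as follows, assuming the core Encode module. A character string holding U+0100 has length 1 and is matched by its code point, while the explicitly encoded byte string has length 2 and is matched only by the byte sequence C4 80. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{0100}";               # one character: A with macron
my $bytes = encode('UTF-8', $char);   # its UTF-8 encoding as a byte string

my $char_len  = length($char);
my $byte_len  = length($bytes);
my $hex_bytes = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

my $char_hit  = ($char  =~ /\x{0100}/) ? 1 : 0;   # code point against characters
my $bytes_hit = ($bytes =~ /\x{0100}/) ? 1 : 0;   # code point against encoded bytes
my $seq_hit   = ($bytes =~ /\xC4\x80/) ? 1 : 0;   # byte sequence against encoded bytes

print "character string: length $char_len, matches \\x{0100}: $char_hit\n";
print "byte string:      length $byte_len, bytes $hex_bytes\n";
print "byte string matches \\x{0100}: $bytes_hit, matches \\xC4\\x80: $seq_hit\n";
```

Seen this way, the UTF-8 that the docs call "internal" is not visible at the character level; a program only sees the two-byte form after encoding explicitly or when reading undecoded input.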


7stud
Enthusiast

Mar 28, 2010, 11:24 PM

Post #14 of 29 (5217 views)
Re: [JonathanPool] hex metacharacters for characters below x100

As I understand it, all perl strings have a UTF-8 flag, which is either on or off. perl stores a string as single-byte characters for as long as it can, then turns the UTF-8 flag on when the string needs a character that doesn't fit in one byte.

In my opinion, the docs are poorly written because they seem to use the terms Unicode and UTF-8 as if they are equivalent--and they are different things. A case in point is your quote. I think that quote is referring to the Unicode code point. 2**31 is roughly the highest integer that can be represented by a signed 4-byte (= 32-bit) integer.


(This post was edited by 7stud on Mar 28, 2010, 11:31 PM)


7stud
Enthusiast

Mar 28, 2010, 11:39 PM

Post #15 of 29 (5213 views)
Re: [7stud] hex metacharacters for characters below x100

This from perluniintro:


Quote
Internally, Perl currently uses either whatever the native eight-bit
character set of the platform (for example Latin-1) is, defaulting to
UTF-8, to encode Unicode strings. Specifically, if all code points in
the string are 0xFF or less, Perl uses the native eight-bit character
set. Otherwise, it uses UTF-8.


Despite what that says, it's possible that perl stores the Unicode code point internally. But there may not be a way to know that. On output, perl may just encode with Latin-1 for single byte characters and UTF-8 for multi-byte characters.

It seems that today most computer languages have completely screwed up their unicode support. They try to make things transparent for beginners who don't know what unicode is by employing all kinds of implicit, automatic conversions--which just serves to make the whole system completely baffling to people who know about unicode.

There should be a pragma that allows you to turn off all automatic conversions so that you can handle the encoding and decoding of everything yourself.

Did you try copying my program and deleting the initial string, and then pasting the UTF-8 characters into the string yourself?


(This post was edited by 7stud on Mar 29, 2010, 12:27 AM)


JonathanPool
Novice

Mar 29, 2010, 12:16 AM

Post #16 of 29 (5206 views)
Re: [7stud] hex metacharacters for characters below x100

I agree completely that automatic conversion should be optional. The Unicode support documentation is quite confusing.

I have deleted the A with macron from your script and re-entered it. When I enter the precomposed character, then it is still matched by \x{0100}. When I enter 'A' followed by a combining macron, then the match fails. So you can check, I am posting here my copy with the precomposed character, which outputs both match reports. (I also changed the word to "macron".)
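
The precomposed versus combining result can be reproduced with a short sketch, assuming the core Unicode::Normalize module. 'A' followed by COMBINING MACRON (U+0304) is two characters and is not matched by \x{0100} until the string is normalized to NFC. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{0100}";     # LATIN CAPITAL LETTER A WITH MACRON
my $combining   = "A\x{0304}";    # 'A' followed by COMBINING MACRON

# Without normalization the two-character form does not contain U+0100.
my $direct_hit = ($combining =~ /\x{0100}/) ? 1 : 0;

# NFC composes the pair into the single precomposed character.
my $normalized     = NFC($combining);
my $normalized_hit = ($normalized =~ /\x{0100}/) ? 1 : 0;

print "combining form:  length ", length($combining),
      ", matches \\x{0100}: $direct_hit\n";
print "after NFC:       length ", length($normalized),
      ", matches \\x{0100}: $normalized_hit\n";
```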


Code
#!/usr/bin/perl -w
use strict;
use warnings;
use 5.010;

use utf8;

my @strings = (
    'abc',
    'Ā',
);

for (@strings) {

    if (/\x{61}/) {    # U+0061 is 'a'
        say "matched 'a'";
    }

    if (/\x{0100}/) {
        say q{matched 'cap A with macron'};
    }
}



7stud
Enthusiast

Mar 29, 2010, 12:31 AM

Post #17 of 29 (5203 views)
Re: [JonathanPool] hex metacharacters for characters below x100

I am at a loss to explain the match. Maybe try another experiment: delete the string, and this time enter only the 'A with diaeresis'--don't put the euro symbol in there. Does a regex containing \x{0100} still match the string?


7stud
Enthusiast

Mar 29, 2010, 12:56 AM

Post #18 of 29 (5199 views)
Re: [7stud] hex metacharacters for characters below x100

You can also try another test. See what happens if you 'round trip' the string. The following function is from perluniintro. It prints out the string, converting UTF-8 characters to unicode escape sequences:


Code
sub nice_string {
    join("",
        map { $_ > 255 ?                    # if wide character...
              sprintf("\\x{%04X}", $_) :    # \x{...}
              chr($_) =~ /[[:cntrl:]]/ ?    # else if control character ...
              sprintf("\\x%02X", $_) :      # \x..
              quotemeta(chr($_))            # else quoted or as themselves
        } unpack("W*", $_[0]));             # unpack Unicode characters
}


print nice_string("foo\x{0100}bar\n"), "\n";



(This post was edited by 7stud on Mar 29, 2010, 12:57 AM)


JonathanPool
Novice

Mar 29, 2010, 6:04 PM

Post #19 of 29 (5191 views)
Re: [7stud] hex metacharacters for characters below x100

Thanks. Yes, this one outputs "foo\x{0100}bar\x0A" and a newline, as it seems it should.


7stud
Enthusiast

Mar 30, 2010, 2:09 AM

Post #20 of 29 (5187 views)
Re: [JonathanPool] hex metacharacters for characters below x100

What output do you get here:


Code
use strict; 
use warnings;
use 5.010;

use utf8;

my $string = "Ä";
say utf8::is_utf8($string);

my $decimal_code = ord($string);
say $decimal_code;

printf "%x \n", $decimal_code;



JonathanPool
Novice

Mar 30, 2010, 10:10 AM

Post #21 of 29 (5176 views)
Re: [7stud] hex metacharacters for characters below x100

I get:

1
196
c4


7stud
Enthusiast

Mar 30, 2010, 10:57 AM

Post #22 of 29 (5173 views)
Re: [JonathanPool] hex metacharacters for characters below x100

Everything looks correct to me. perl says the string is a utf8 string, so it isn't chopping off any bytes. The round tripping of the string confirms that.

The string contains a character whose unicode code point is U+00C4--as reported by perl, yet you say your experiments show that a regex of \x{0100} matches that character. I don't see how that's possible, and I don't get a match when I try it.

I don't have any more ideas. Sorry.

The last thing you might want to do is post your OS, as well as the output from perl -v and perl -V, then see if someone with a similar setup gets the same regex match that you are seeing.

Here's my info:


Code
$ perl -v 

This is perl, v5.10.1 (*) built for darwin-2level-thread-multi

Copyright 1987-2009, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.



Code
$ perl -V 
Summary of my perl5 (revision 5 version 10 subversion 1) configuration:

Platform:
osname=darwin, osvers=8.11.1, archname=darwin-2level-thread-multi
uname='darwin xxxx darwin kernel version 8.11.1:
wed oct 10 18:23:28 pdt 2007; root:xnu-792.25.20~1release_i386
i386 i386 '
config_args='-Dusethreads'
hint=previous, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include',
optimize='-O3',
cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing
-pipe -I/usr/local/include -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing
-pipe -I/usr/local/include -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include'
ccversion='', gccversion='4.0.1 (Apple Computer, Inc. build 5370)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib
libs=-ldbm -ldl -lm -lc
perllibs=-ldl -lm -lc
libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib'


Characteristics of this binary (from libperl):
Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_ITHREADS
USE_LARGE_FILES USE_PERLIO
Built under darwin
Compiled at Dec 2 2009 12:56:11
@INC:
/usr/local/lib/perl5/5.10.1/darwin-2level
/usr/local/lib/perl5/5.10.1
/usr/local/lib/perl5/site_perl/5.10.1/darwin-2level
/usr/local/lib/perl5/site_perl/5.10.1
/usr/local/lib/perl5/site_perl/5.8.6
/usr/local/lib/perl5/site_perl
.



(This post was edited by 7stud on Mar 30, 2010, 11:12 AM)


JonathanPool
Novice

Mar 30, 2010, 11:53 AM

Post #23 of 29 (5163 views)
Re: [7stud] hex metacharacters for characters below x100

There seems to be a misunderstanding. I didn't find \x{0100} in a regular expression matching the U+00C4 character. I found it matching the U+0100 character (A with macron) that you included in the script in your first reply.


7stud
Enthusiast

Apr 1, 2010, 5:29 AM

Post #24 of 29 (5142 views)
Re: [JonathanPool] hex metacharacters for characters below x100

We went over that previously:


Quote
My string has 'A with diaeresis' in it, and the regex looks for an 'A with macron'.


Now you are saying that you didn't actually copy and paste my script, which is what you repeatedly claimed you did?


(This post was edited by 7stud on Apr 1, 2010, 5:32 AM)


JonathanPool
Novice

Apr 1, 2010, 6:45 AM

Post #25 of 29 (5136 views)
Re: [7stud] hex metacharacters for characters below x100

I don't understand the problem here. I am not saying that I did not copy and paste your script. What is the issue?

Your quote says that your string contains A with diaeresis. I believe this is not correct. It looks like an A with macron to me.
