CGI/Perl Guide | Learning Center | Forums | Advertise | Login
Site Search: in

  Main Index MAIN
Search Posts SEARCH
Who's Online WHO'S
Log in LOG

Home: Perl Programming Help: Frequently Asked Questions:
How can I match strings with multibyte characters?



Mar 15, 2001, 6:12 AM

Post #1 of 1 (39041 views)
How can I match strings with multibyte characters? Can't Post

How can I match strings with multibyte characters?

This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are synonymous. The following set of approaches was offered by Jeffrey Friedl, whose article in issue #5 of The Perl Journal talks about this very matter.

Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase letters encode single Martian letters (i.e. the two bytes ``CV'' make a single Martian letter, as do the two bytes ``SG'', ``VS'', ``XX'', etc.). Other bytes represent single characters, just like ASCII.

So, the string of Martian ``I am CVSGXX!'' uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character /GX/. Perl doesn't know about Martian, so it'll find the two bytes ``GX'' in the ``I am CVSGXX!'' string, even though that character isn't there: it just looks like it is because ``SG'' is next to ``XX'', but there's no real ``GX''. This is a big problem.

Here are a few ways, all painful, to deal with it:

   $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes 
# are no longer adjacent.
print "found GX!\n" if $martian =~ /GX/;

Or like this:

   @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g; 
# above is conceptually similar to: @chars = $text =~ m/(.)/g;
foreach $char (@chars) {
print "found GX!\n", last if $char eq 'GX';

Or like this:

   while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded 
print "found GX!\n", last if $1 eq 'GX';

Or like this:

   die "sorry, Perl doesn't (yet) have Martian support )-:\n";

In addition, a sample program which converts half-width to full-width katakana (in Shift-JIS or EUC encoding) is available from CPAN as

There are many double- (and multi-) byte encodings commonly used these days. Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.


Search for (options) Powered by Gossamer Forum v.1.2.0

Web Applications & Managed Hosting Powered by Gossamer Threads
Visit our Mailing List Archives