sort unicode text that respects whitespace?


Nov 10, 2010, 3:36 AM

I have a bunch of strings that I need to sort. The strings are Unicode, and I've run into an unexpected difficulty: Unicode-aware sorting by definition does not respect whitespace. Whitespace is simply ignored when strings are compared.

As I've learned, this is intended and part of the Unicode convention. Every POSIX sort will give the same result.

For working with text in a library this is ... bad, because when I sort a list of institute names I get the following result:

Institute for evil studies
Institutes aquiring wealth
Institute searching for god

instead of the intended result (where " " is sorted before "a")

Institute for evil studies
Institute searching for god
Institutes aquiring wealth

I work around this by replacing all whitespace with "aaaaaaaaaaaaaaaaaaaa" in the sort function, but that's far from elegant and takes unnecessary effort.
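For illustration, that workaround might be sketched roughly as follows. The names `ws_key`, `sort_ws` and `$PLACEHOLDER` are hypothetical and not taken from the real application. The sketch uses a plain `cmp`; under `use locale` the same key transform would apply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical placeholder: 20 x "a", as described above.
my $PLACEHOLDER = 'a' x 20;

# Build a comparison key in which every whitespace character is replaced
# by the placeholder, so a collation that ignores whitespace still "sees" it.
sub ws_key {
    my ($s) = @_;
    (my $key = $s) =~ s/\s/$PLACEHOLDER/g;
    return $key;
}

# Sort by the transformed key but return the original strings.
sub sort_ws {
    my @list = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ ws_key($_), $_ ] } @list;
}

print join("\n", sort_ws(
    "Institutes aquiring wealth",
    "Institute searching for god",
    "Institute for evil studies",
)), "\n";
```

This is fragile: it only works as long as no real entry contains a run of "a" that collides with the placeholder.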

I'm quite sure this is a common problem, and maybe there is an easy solution for it.

I looked at Unicode::Collate, but I was a bit overwhelmed by it, because my knowledge of Unicode internals is limited and I don't know anything about normalization to start with.
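A minimal sketch of what a Unicode::Collate attempt might look like, assuming its `variable => 'non-ignorable'` option is what makes whitespace count in comparisons. This is one reading of the module's documentation, not a verified fix for the real data. Normalization is left at the module's default, so nothing has to be set up for it here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Assumption: with variable => 'non-ignorable', whitespace and punctuation
# are treated as ordinary characters with their own primary weights,
# instead of being ignored at the first comparison levels.
my $collator = Unicode::Collate->new(variable => 'non-ignorable');

my @l = (
    "Institutes aquiring wealth",
    "Institute searching for god",
    "Institute for evil studies",
);

# sort() is a method of the collator object and returns the sorted list.
my @sorted = $collator->sort(@l);

binmode STDOUT, ':encoding(UTF-8)';
print join("\n", @sorted), "\n";
```

For language-specific ordering (German umlauts and the like), Unicode::Collate::Locale may be worth a look. It is not used in this sketch.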

I use en_US.UTF-8 as the locale, but any other UTF-8 locale should give the same result. I tested de_AT.UTF-8, which makes no difference for this simple example. In my real application it makes a huge difference, because the data is Unicode with lots of German umlauts, Spanish letters and even some Russian entries.

Here is my code that illustrates the problem:

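#!/usr/bin/perl -w

use strict;
use locale;                 # make cmp/sort honour LC_COLLATE
use utf8;                   # source file is UTF-8
use POSIX qw(locale_h);

# Select the collation locale; warn if it is not installed.
setlocale(LC_ALL, "en_US.UTF-8")
    or warn "locale en_US.UTF-8 not available\n";

my @l = ("Institute for evil studies",
         "Institute searching for god",
         "Institutes aquiring wealth");

# With "use locale", cmp compares according to the locale's collation rules.
@l = sort { $a cmp $b } @l;

print join("\n", @l), "\n";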

Thanks a lot,