Regular expression of Unicode to distinguish kanji and hiragana

 



solukas
New User

Nov 11, 2011, 3:58 AM

Post #1 of 6 (8202 views)

Dear all,

I have a task to distinguish kanji from hiragana in a file that contains both. The content of the file is shown below.
The first line is hiragana only and the second line is kanji only, so I have to do nothing with them. The last line contains both, so I just want to print it out.
ひゅうが
通行
乗じて

This is the code:

open (A_FILE, "<", "kata.txt");
my(@a_lines) = <A_FILE>; # read file into list

open(my $out, ">", "modified_kata.txt") or die "Can't open modified_kata.txt: $!";

foreach $a_line (@a_lines)
{
    $sentence = $a_line;
    if (($sentence =~ /\p{InHiragana}/) && ($sentence =~ /\p{InCJKUnifiedIdeographs}/)){
        print $out $sentence . "\n";
    }
}

It seems that Perl cannot recognise the function /\p{}/. The result is still wrong even if I put use utf8; at the top. Do you have any suggestions?

I am a newbie at handling Unicode. Thanks very much for your help!!

Kind regards,

Luke


rovf
Veteran

Nov 11, 2011, 4:47 AM

Post #2 of 6 (8200 views)
Re: [solukas] Regular expression of Unicode to distinguish kanji and hiragana


Quote
It seems that Perl cannot recognise the function /\p{}/


Aside from the fact that this is not a *function*, do you have reason to believe that your Perl version does not recognize \p at all, or merely that it doesn't recognize the named block (i.e. InHiragana)? Either way, please let us know which Perl version you are using.

Did you try to use script names instead of block names (i.e. \p{Hiragana} instead of \p{InHiragana})? This seems to make more sense in your application anyway.
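To illustrate the difference between the two kinds of names, here is a minimal sketch. A block property such as \p{InCJKUnifiedIdeographs} tests whether a character falls inside one fixed code point range, while a script property such as \p{Hiragana} or \p{Han} tests which writing system the character belongs to. The sample strings and the has_both helper are assumptions made for illustration, and the text is written as \x{...} escapes so that it is already a character string.

```perl
use strict;
use warnings;

# Illustrative sample text, written as code point escapes and
# mirroring the three lines from the original post.
my $hiragana = "\x{3072}\x{3085}\x{3046}\x{304C}";   # hiragana only
my $kanji    = "\x{901A}\x{884C}";                   # kanji only
my $mixed    = "\x{4E57}\x{3058}\x{3066}";           # kanji followed by hiragana

# Hypothetical helper: returns 1 if the string holds at least one
# hiragana character and at least one Han (kanji) character.
sub has_both {
    my ($text) = @_;
    return ($text =~ /\p{Hiragana}/ && $text =~ /\p{Han}/) ? 1 : 0;
}

# U+3005 (ideographic iteration mark) belongs to the Han script but
# lies outside the CJK Unified Ideographs block.
my $iteration_mark = "\x{3005}";
my $by_script = ($iteration_mark =~ /\p{Han}/)                    ? 1 : 0;
my $by_block  = ($iteration_mark =~ /\p{InCJKUnifiedIdeographs}/) ? 1 : 0;

printf "hiragana only: %d\n", has_both($hiragana);
printf "kanji only:    %d\n", has_both($kanji);
printf "mixed:         %d\n", has_both($mixed);
printf "U+3005 script match: %d, block match: %d\n", $by_script, $by_block;
```

Only the mixed line reports 1, and U+3005 matches by script but not by block, which is one reason the script names tend to fit this kind of task better.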


solukas
New User

Nov 14, 2011, 1:38 AM

Post #3 of 6 (7858 views)
Re: [rovf] Regular expression of Unicode to distinguish kanji and hiragana

Thanks a lot for your reply. I am using ActivePerl 5.12.1 Build 1201 at the moment. I have tried both $sentence =~ /\p{Hiragana}/ and $sentence =~ /\p{InHiragana}/; neither of them works.

Should I have use utf8; at the top?

Actually, there is more to my application, but it is mainly built on that if condition. Once that is sorted, the other things will not be a problem.

Many thanks for your help indeed!!

Kindest regards,

Luke


rovf
Veteran

Nov 14, 2011, 2:33 AM

Post #4 of 6 (7849 views)
Re: [solukas] Regular expression of Unicode to distinguish kanji and hiragana

use utf8 just tells Perl that the source code itself is encoded in UTF-8, so as long as your code doesn't contain literal kana or kanji, I don't think you need it.

Could it be that the content of $sentence is broken? You could, just for testing, put some literal Unicode code points (\x{....}) into $sentence and see whether the pattern matches then.
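A minimal sketch of that test might look like the following. It builds the third sample line from literal code points and, for comparison, produces the raw UTF-8 bytes of the same text with Encode::encode. The assumption here is that a plain open(..., "<", ...) without a decoding layer hands the program such bytes; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string built from literal code points U+4E57 U+3058 U+3066.
my $chars = "\x{4E57}\x{3058}\x{3066}";

# The same text as raw UTF-8 bytes (assumed to be what an undecoded
# read from the file would return).
my $bytes = encode('UTF-8', $chars);

for my $case ([characters => $chars], [bytes => $bytes]) {
    my ($label, $text) = @$case;
    my $hira = ($text =~ /\p{Hiragana}/) ? 'yes' : 'no';
    my $han  = ($text =~ /\p{Han}/)      ? 'yes' : 'no';
    printf "%-10s length=%d hiragana=%s han=%s\n",
        $label, length($text), $hira, $han;
}
```

The character string has length 3 and matches both properties. The byte string has length 9 and matches neither, because each byte is treated as a separate character with a code point below 256.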

BTW, did you read http://perldoc.perl.org/perluniintro.html?


solukas
New User

Nov 15, 2011, 2:32 AM

Post #5 of 6 (7663 views)
Re: [rovf] Regular expression of Unicode to distinguish kanji and hiragana

Dear Rovf,

The program works now that I have added
use Encode qw/encode decode/; and wrapped the string in decode("utf-8", $sentence).

However, I do not quite understand why it works after I "decode" it.

use Encode qw/encode decode/;

open (A_FILE, "<", "kata.txt");
my(@a_lines) = <A_FILE>; # read file into list

open(my $out, ">", "modified_kata.txt") or die "Can't open modified_kata.txt: $!";

foreach $a_line (@a_lines)
{
    $sentence = $a_line;
    if (((decode("utf-8",$sentence) =~ /\p{Hiragana}/) && (decode("utf-8", $sentence) =~ /\p{CJKUnifiedIdeographs}/))){
        print $out $sentence . "\n";
    }
}

Thanks very much for your help!

Luke


rovf
Veteran

Nov 15, 2011, 3:47 AM

Post #6 of 6 (7662 views)
Re: [solukas] Regular expression of Unicode to distinguish kanji and hiragana

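Hmmm.... This confirms that your string did not contain correct Japanese characters: it held the raw UTF-8 bytes from the file, and decode turns those bytes into characters that \p{...} can classify.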

If you have time, try this alternative: Instead of using decode in the regexp, use the :utf8 layer when reading from the file into memory.
If this works too, this would be the cleaner solution.
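A minimal sketch of that alternative, under a few assumptions: it uses the stricter :encoding(UTF-8) layer rather than the plain :utf8 layer, it uses the script name \p{Han} for kanji, and it writes a temporary file as a stand-in for kata.txt so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for kata.txt: three lines holding hiragana
# only, kanji only, and a mix of both.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':encoding(UTF-8)');
print {$tmp_fh} "\x{3072}\x{3085}\x{3046}\x{304C}\n",
                "\x{901A}\x{884C}\n",
                "\x{4E57}\x{3058}\x{3066}\n";
close($tmp_fh) or die "Can't close $path: $!";

# The input layer decodes UTF-8 while reading, so each line arrives
# as a character string and needs no decode() call in the regexp.
open(my $in, '<:encoding(UTF-8)', $path) or die "Can't open $path: $!";
my @mixed;
while (my $line = <$in>) {
    chomp $line;
    push @mixed, $line if $line =~ /\p{Hiragana}/ && $line =~ /\p{Han}/;
}
close($in);

printf "%d mixed line(s) found\n", scalar @mixed;
```

With this sample file the script reports one mixed line. When writing the matching lines to another file, the output handle would need a matching encoding layer as well, since the strings are now characters rather than bytes.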
