Regular expression / regex substitution on Unicode text

 



thomas.hedden
New User

Feb 2, 2010, 8:38 AM

Post #1 of 2 (3907 views)
Regular expression / regex substitution on Unicode text

I have a large file encoded in Unicode that I need to convert to CSV. In general, I know how to do this by regular expression substitutions using sed or Perl, but one problem I am having is that I need to put a quotation mark at the end of each line to protect the last field. The usual regex substitution ...
s/$/"/
... works fine for 7-bit ASCII text, but when I run this on my Unicode text file, the double quotation mark appears at the BEGINNING of the FOLLOWING line, not at the end of the line on which it's supposed to appear.
The file came from a Windows system, but piping through dos2unix doesn't seem to make any difference.
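One possible explanation, assuming the lines end in CR LF: `$` matches just before the final "\n", which is after the "\r", so the quote lands between the two. A minimal sketch on a made-up sample line (the data and the helper `show` are illustrative, not taken from the actual file):

```perl
use strict;
use warnings;

# Hypothetical sample line with a Windows-style CR LF ending.
my $line = "a,b\r\n";

# `$` matches just before the trailing "\n", i.e. after the "\r".
(my $naive = $line) =~ s/$/"/;

# Consuming an optional "\r" together with the "\n" puts the quote first.
(my $fixed = $line) =~ s/\r?\n\z/"\n/;

# Illustrative helper: make control characters visible when printing.
sub show {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

print show($naive), "\n";   # a,b\r"\n
print show($fixed), "\n";   # a,b"\n
```

A viewer that treats a bare "\r" as a line break would show the quote from the first result at the start of a new line.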
I've tried loading the Encode module ("use Encode;") and using it with several different encodings, but I get the same result.
Perhaps I'm doing something wrong.
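Loading Encode by itself does not change how data is read; the bytes still have to be decoded explicitly before the regex runs. A minimal sketch, assuming the file is UTF-16LE with CR LF line endings (the post does not say which encoding it is) and using a hypothetical helper name, `quote_line_ends`:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-16LE bytes and returns raw UTF-16LE
# bytes with a double quote added before every line ending.
sub quote_line_ends {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);   # bytes -> Perl characters
    $text =~ s/\r?\n/"\n/g;                  # drop CR, put quote before LF
    return encode('UTF-16LE', $text);        # characters -> bytes
}

# Made-up two-line sample, encoded the way the file is assumed to be.
my $in  = encode('UTF-16LE', "a,\x{416}\r\nb,c\r\n");
my $out = quote_line_ends($in);
```

This sketch does not handle a byte order mark or a last line without a line ending. For a real file, an `:encoding(UTF-16LE)` layer on the filehandle would do the decoding step instead of the explicit `decode` call.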
Does anyone know of a special library function intended for this purpose, a Perl pragma, etc., that would accomplish this easily? This should be a trivial problem.
Thanks in advance for any suggestions.
Tom


thomas.hedden
New User

Feb 2, 2010, 7:42 PM

Post #2 of 2 (3885 views)
Re: [thomas.hedden] Regular expression / regex substitution on Unicode text

As I mentioned, the regex ...
s/$/"/
... puts the `"' at the beginning of the following line,
and piping through dos2unix doesn't matter one
way or the other. Using `\n' instead of `$' doesn't
make any difference.
However, I discovered an interesting fact: The regex ...
s/\r/"/ # use `\r' instead of `$' or `\n'
... gives the expected result, and does so without
piping through dos2unix making any difference!
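One way this result could come about, assuming the data is UTF-16LE and is being matched as raw bytes: CR is the byte pair 0D 00, so swapping the single byte 0D for 22 leaves 22 00, a well-formed `"`, whereas inserting one byte before 0A shifts every later two-byte unit. A sketch on a made-up sample (`hex_dump` is an illustrative helper):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up sample: "a" followed by CR LF, as UTF-16LE bytes 61 00 0D 00 0A 00.
my $bytes = encode('UTF-16LE', "a\r\n");

# Replace the CR byte (0D) with the quote byte (22); the length is unchanged.
(my $sub_cr = $bytes) =~ s/\r/"/;

# Insert the quote byte before the LF byte (0A). This gives the same byte
# stream as s/$/"/ applied to a line that was read up to the 0A byte.
(my $sub_eol = $bytes) =~ s/\n/"\n/;

# Illustrative helper: space-separated hex bytes.
sub hex_dump {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $s;
}

print hex_dump($sub_cr), "\n";    # 61 00 22 00 0A 00
print hex_dump($sub_eol), "\n";   # 61 00 0D 00 22 0A 00
```

Under the same assumption, the CR and LF bytes are never adjacent (0D 00 0A 00), which might also be why dos2unix made no difference.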
I also found that a C program using a wchar_t
declaration behaves similarly. That is, a character
that appears as if it should be output BEFORE the
EOL actually appears after it, if it is matched as '\n',
but if it is matched as '\r' then it is output as
expected.
My immediate problem is solved, but I wonder
whether this shows a bug in Perl or in its
regular expression engine ...
This seems to be a clear case where Unicode
text is handled differently than non-Unicode text.
Any opinions?
Tom

 
 

