unknown character parsing problem

 



britantyo
Novice

Feb 13, 2011, 11:18 AM

Post #1 of 2 (664 views)
unknown character parsing problem

Okay y'all, I need help here.

Let's get straight to the problem.
I am trying to read a file whose content is very awkward: some of it is digits and the rest is unknown (non-printable) characters. Here is a sample of the content as displayed in vim:


Code
00195264140000000xxx   00000S7000000034                 ^A1r^A3S^R00000363    01596587    5231052xxx         10 139766FFFFFF^@^@ULF2^@^T^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^F^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^F^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A2^G^A^CN^@^@^@^@^@LK@                                                            ^@^@^@^@^A1^A"U2^@^A5231052445         10PRYNK 
00195264140000000xxx 00100S7000000035 ^A3S^R^A3S^R577 5231052466 10 4899FFFFFFFF^@^@ULF2^@^T^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^]e^@^@^@^@^D^@^@^@^@^B<80>^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ U<80>^@^@^@^E^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A1b^@<9a>P<85>^@^@^@^@^B<80> ^@d^@^P^A1'^@<8a><9a><9d>U0^@^A5231052xxx 10PRYNK


Hehe, that is actually only 2 lines. And here is the file as read with Perl's "open":


Code
00195264140000002xxx   00000S7000000296                 1&#65533;s3S581                     5512006xxx         10 936479FFFFFFULF22h&#65533;&#1519;Q1-                                                            2h&#65533;&#65533;MRU15512006386         10PRYNP 
00195264140000002xxx 00000S7000000297 1&#65533;s3S573 5512006xxx 10 0614FFFFFFFFULF22f&#65533;&#65533;
&#65533;&#65533;&#65533; d1&#65533;&#65533;&#65533;&#65533;&#65533;U05512006391 10PRYNP


The documentation says each line is around 500 characters long, and that is what I need to parse. I'm okay with that, I've done it a couple of times. I only need to keep the visible characters and get rid of the unknown ones.

The problem is that sometimes these unknown characters create a new line where there isn't supposed to be one, and that affects the whole job.

Here is an example of the broken output from my parsing, using "unpack" on the lines read via "open":


Code
9526414000581xxxx 
&#65533;550
9526414000581xxxx
&#65533;550
9526414000581xxxx
&#65533;551
9526414000581xxxx
&#65533;42510xxx 00009
9526414000581xxxx
&#65533;00000829 01120


See, the output is ruined because of these unknown characters.

Is there any trick to solve this problem?

Thanks perlguru, and thanks to the experts.
Be free to decide your dream, Put the details of your dream in your head and heart And don't give up 'till you drop dead.


rovf
Veteran

Feb 16, 2011, 5:08 AM

Post #2 of 2 (648 views)
Re: [britantyo] unknown character parsing problem

If I understand you right, you want to extract from a file only those characters with certain properties. You seem to classify the characters into "visible characters" and "unknown characters". Right?

Maybe you want to do something similar to the 'strings' utility?

http://unixhelp.ed.ac.uk/CGI/man-cgi?strings
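As a rough illustration of the 'strings' idea in Perl, the sketch below pulls out every run of printable ASCII bytes of a minimum length. The helper name printable_runs, the sample data, and the choice of 0x20-0x7E as "printable" are assumptions made for this example; the default minimum of 4 mirrors the default of 'strings'.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of at least $min_len bytes in the
# printable ASCII range (0x20-0x7E), similar in spirit to 'strings'.
sub printable_runs {
    my ($bytes, $min_len) = @_;
    $min_len = 4 unless defined $min_len;
    # In list context, a /g match with one capture group returns all captures.
    return $bytes =~ /([\x20-\x7E]{$min_len,})/g;
}

# Made-up sample that loosely imitates the record shown in the question.
my $sample = "00195264\x00\x00ULF2\x00\x14\x00\x01\x02ab\x005231052 10PRYNK\n";

# "ab" is shorter than the minimum length, so it is skipped.
print "$_\n" for printable_runs($sample);
```

Note that this throws away the position of each run inside the record, so it fits best when only the text matters and not the column it came from.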

So, basically, the question boils down to how to distinguish a "visible character" (to stay with your terminology) from an "unknown character", right?

Of course this means that you need to make clear what exactly a "visible character" is, and what an "unknown" one is. For instance (and ignoring Unicode issues for a moment), how would you classify the character with a hexadecimal representation of 0x88?

Actually, Perl comes with support for POSIX character classes, so if your idea of character visibility matches that of POSIX, the regular expression /[[:print:]]/ would match a "printable" character. See

perldoc perlre

for details.
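One way the [[:print:]] idea could be applied to the stray-newline problem is sketched below. It rests on assumptions that are not confirmed anywhere in this thread: that the records have a fixed byte length, that replacing an unprintable byte with a space (rather than deleting it) is acceptable so the column offsets for "unpack" stay put, and that Perl 5.14 or later is available for the /a modifier, which restricts [[:print:]] to ASCII. The names blank_unprintable and read_records, and the record length of 12, are made up for the demo.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every byte that is not printable ASCII with a
# space, keeping the record length (and therefore unpack offsets) unchanged.
sub blank_unprintable {
    my ($record) = @_;
    (my $clean = $record) =~ s/[^[:print:]]/ /ga;
    return $clean;
}

# Hypothetical helper: read fixed-length records from a filehandle.
# read() counts bytes instead of looking for "\n", so a 0x0A byte inside a
# record does not split it.
sub read_records {
    my ($fh, $record_len) = @_;
    binmode $fh;
    my @records;
    while (read($fh, my $buf, $record_len)) {
        push @records, blank_unprintable($buf);
    }
    return @records;
}

# Demo on an in-memory "file" holding two 12-byte records; the first one
# contains an embedded newline and a NUL byte.
my $data = "AB\x0ACD\x00EF1234" . "GHIJKL\x01MN567";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @records = read_records($fh, 12);
close $fh;

print "[$_]\n" for @records;
```

If the real records end in a genuine line terminator, that byte would have to be counted in the record length; the true length has to come from the file's documentation, which only says "around 500".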

 
 

