Reading MS Word files



Sep 28, 2001, 8:33 AM

Post #1 of 4 (1081 views)

I can (with no problem) open, edit and close a text file (.txt).
How can I do the same with an MS Word file (.doc)?
Right now I'm just trying to print the contents of a simple Word file
like this: "This is a test file for perl"
using this:
use CGI qw(:all);
print header;
open(TF, "perltest.doc") || die "Can't open perltest.doc: $!";
my $text = <TF>; # read the first line of the file
close(TF);
print "perltest is -- $text";

I get this
perltest is -- ࡱ;


Sep 30, 2001, 1:07 AM

Post #2 of 4 (1075 views)
Re: Reading MS Word files

First off, an MS Word doc is not a text file; it is a Rich Text document in the most general sense (Microsoft calls the format Rich Text+, as they put other things in the file like revision tracking, etc.).

You need to get a module that will parse the data in the Word document down to a text format (if one exists, and if it did I wouldn't know where to get it). I doubt that such a module exists, as converting RTF down to text is an ugly process and removes all document formatting. No one ever gets what they think they will get when running documents through conversions.

Maybe someone knows more on the subject. My advice is not to beat your head on this too much. But then again, don't give up either!

Good luck.

Sean Shrum


Sep 30, 2001, 5:37 PM

Post #3 of 4 (1068 views)
Re: Reading MS Word files

I found this and read a little in perldoc,
but I have not had time to try it.
It might add two lines of text to a *.doc file and then save it as a *.doc file???
Don't worry about checking this unless you just want to, 'cause like I said I haven't had time to check it myself yet...

use strict;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';

my(@line) = ('Here is the first line of text.',
'Here is the second line of text.');
my($outputFile) = 'perltest.doc';

# Start Word; 'Quit' is called on it when the object is destroyed
my($word) = Win32::OLE -> new('Word.Application', 'Quit');

my($doc) = $word -> Documents -> Add(); # Create a new document
my($range) = $doc -> {Content};

# Put the first line in, then append each remaining line as a new paragraph
$range -> {Text} = $line[0];
for (my $i=1; $i <= $#line; $i++)
{
    $range -> InsertParagraphAfter();
    $range -> InsertAfter($line[$i]);
}

# Print every paragraph back out (Word counts paragraphs from 1)
my($paragraphCount) = $doc -> Paragraphs -> Count();
for (my $i=1; $i <= $paragraphCount; $i++)
{
    print "$i: ", $doc -> Paragraphs($i) -> Range -> {Text}, "\n";
}

$doc -> SaveAs($outputFile);
$doc -> Close();
$word -> Quit();
# Success.
print "Success \n";


Oct 1, 2001, 2:48 AM

Post #4 of 4 (1059 views)
Re: Reading MS Word files

Don't forget these files are very complex. airo's example actually opens Word (invisibly, though), using OLE/COM.
A Word document can be saved as RTF (which is a bit like HTML, and is readable as ASCII). Native Word files, however, are binary.
That means individual bytes (or combinations of them) in the file represent data. For example: 4 bytes for a size, then 4 bytes giving the size of the following text, then ASCII characters containing the text (maybe the author field in Word), and so on. If Microsoft has documented this structure, you may be able to use it. Still, it's difficult to make use of that in Perl (with unpack). Word documents can also contain OLE objects, VBA code, styles, images and settings. You need to take care of all of those.
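
To make the "size, then length, then text" idea a bit more concrete, here is a minimal sketch of reading such a record with unpack. The layout is purely hypothetical (a 4-byte little-endian record size, a 4-byte text length, then the ASCII text); it is NOT the real Word format, and the names parse_record and $record are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical layout (NOT the real Word format):
#   4 bytes  total record size   (little-endian unsigned 32-bit)
#   4 bytes  length of the text  (little-endian unsigned 32-bit)
#   N bytes  ASCII text
sub parse_record {
    my ($bytes) = @_;
    # "V" reads one 32-bit little-endian unsigned integer.
    my ($size, $text_len) = unpack('V V', $bytes);
    # Skip the 8 header bytes, then take exactly $text_len bytes.
    my $text = substr($bytes, 8, $text_len);
    return ($size, $text_len, $text);
}

# Build a sample record with pack, then read it back.
my $author = 'Jane Doe';
my $record = pack('V V a*', 8 + length($author), length($author), $author);
my ($size, $len, $text) = parse_record($record);
print "size=$size len=$len text=$text\n";
```

Even in this toy case you have to know the exact byte order and field widths up front, which is why doing the same for a real Word file by hand gets painful quickly.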

Either use plain text files, or make an OLE/COM connection using the sample from airo above. If you program in VBA (in Word), you'd recognize these lines.
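
For the original question (printing the text of an existing .doc), the OLE/COM route might look roughly like the sketch below. It is untested here and assumes Windows with Word installed; the helper names read_word_text and split_paragraphs are my own. It uses the same Win32::OLE calls as airo's sample, plus Documents->Open from the Word object model.

```perl
use strict;
use warnings;
use File::Spec;

# Word's Range text separates paragraphs with carriage returns ("\r").
sub split_paragraphs {
    my ($text) = @_;
    return grep { length } split /\r/, $text;
}

# Open an existing .doc through Word itself and return its plain text.
# Win32::OLE is loaded lazily so split_paragraphs stays usable elsewhere.
sub read_word_text {
    my ($file) = @_;
    require Win32::OLE;
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Can't start Word: ", Win32::OLE->LastError();
    # Word does not resolve relative paths against the script's
    # directory, so hand it an absolute path.
    my $doc = $word->Documents->Open(File::Spec->rel2abs($file))
        or die "Can't open $file: ", Win32::OLE->LastError();
    my $text = $doc->{Content}->{Text};
    $doc->Close(0);   # 0 = do not save changes
    return $text;
}

# Usage: perl readdoc.pl perltest.doc
if (@ARGV) {
    print "$_\n" for split_paragraphs(read_word_text($ARGV[0]));
}
```

This starts a hidden copy of Word for every run, so it is slow and only makes sense on a machine where Word is actually installed.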

