parsing .docx

 



orange
User

Jun 28, 2017, 12:34 AM

Post #1 of 2 (4246 views)

To parse a .docx file, I first extract document.xml from it with 7z.
Then I parse it with the standard LibXML parser.
But before I can do that, I have to filter the XML like this:


Code
$content =~ s/<\/w:t><\/w:r><w:r w:rsidR(Pr|)=".{8}"><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/>(<w:lang w:val="en-IN"\/>|)<\/w:rPr><w:t( xml:space="preserve"|)>//g; 

$content =~ s/<\/w:t><\/w:r><w:r w:rsidR(Pr|)=".{8}"( w:rsidRPr=".{8}"|)><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/>(<w:bCs\/>|<w:b\/>|)<\/w:rPr><w:t( xml:space="preserve"|)>//g;

$content =~ s/<\/w:t><\/w:r><w:r (w:rsidR=".{8}" |)w:rsidRPr=".{8}"><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/>(<w:bCs\/>|<w:b\/><w:i\/>|<w:lang w:val="en-IN"\/>|)<\/w:rPr><w:t( xml:space="preserve"|)>//g;

$content =~ s/<\/w:t><\/w:r><w:proofErr w:type="(spellStart|spellEnd)"\/><w:r( w:rsidR=".{8}" w:rsidRPr=".{8}"| w:rsidRPr=".{8}"|)><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/><w:lang w:val="en-IN"\/><\/w:rPr><w:t( xml:space="preserve"|)>//g;

$content =~ s/<\/w:t><\/w:r><w:proofErr w:type="spellEnd"\/><w:proofErr w:type="gramEnd"\/><w:r( w:rsidR=".{8}" w:rsidRPr=".{8}"| w:rsidRPr=".{8}"|)><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/><w:lang w:val="en-IN"\/><\/w:rPr><w:t xml:space="preserve">//g;

$content =~ s/<\/w:t><\/w:r><w:proofErr w:type="gramEnd"\/><w:r( w:rsidRPr=".{8}"|)><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/>(<w:lang w:val="en-IN"\/>|)<\/w:rPr><w:t xml:space="preserve">//g;

$content =~ s/<\/w:t><\/w:r><w:bookmarkStart w:id="\d+" w:name="OLE_LINK\d+"\/>(<w:bookmarkStart w:id="\d+" w:name="OLE_LINK\d+"\/>|)(<w:bookmarkStart w:id="\d+" w:name="OLE_LINK\d+"\/>|)<w:r( w:rsidR=".{8}" w:rsidRPr=".{8}"|)><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/>(<w:lang w:val="en-IN"\/>|)<\/w:rPr><w:t( xml:space="preserve"|)>//g;

$content =~ s/<\/w:t><\/w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/>(<w:b\/>|<w:lang w:val="en-IN"\/>|)<\/w:rPr><w:t( xml:space="preserve"|)>//g;

$content =~ s/<\/w:t><\/w:r><w:r w:rsidR=".{8}"><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/><w:b\/><w:i\/><\/w:rPr><w:t>//g;

$content =~ s/<\/w:t><\/w:r><w:proofErr w:type="gramEnd"\/><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"\/><\/w:rPr><w:t>//g;



Is there a better way to filter out those tags?
Thanks.


Laurent_R
Veteran / Moderator

Jun 28, 2017, 9:29 AM

Post #2 of 2 (4234 views)
Re: [orange] parsing .docx

Perhaps you could use a CPAN module aimed at MS Word documents, such as Text::Extract::Word or Win32::Word::Writer (http://search.cpan.org/~johanl/Win32-Word-Writer-0.02/lib/Win32/Word/Writer.pm).
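
If you would rather stay with LibXML, it may be possible to skip the regex filtering altogether: instead of deleting the run markup between text fragments, ask the parser for the text nodes directly and join them per paragraph. The following is only a minimal sketch under some assumptions: document.xml has already been extracted (for example with 7z, as in the original post), the `w:` prefix is bound to the usual WordprocessingML namespace, and the helper name `paragraphs_from_document_xml` and the inline sample XML are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# WordprocessingML main namespace, normally bound to "w:" in document.xml.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Hypothetical helper: returns one string per <w:p> paragraph.
sub paragraphs_from_document_xml {
    my ($xml) = @_;

    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => $W_NS);

    my @paragraphs;
    for my $p ($xpc->findnodes('//w:p')) {
        # Concatenate every <w:t> inside this paragraph. The <w:r>, <w:rPr>,
        # <w:proofErr> and <w:bookmarkStart> elements in between are simply
        # never selected, so nothing has to be stripped beforehand.
        push @paragraphs,
            join '', map { $_->textContent } $xpc->findnodes('.//w:t', $p);
    }
    return @paragraphs;
}

# Made-up sample imitating text split across several runs.
my $sample = <<'XML';
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r w:rsidR="00AB12CD"><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/></w:rPr><w:t xml:space="preserve">Hello </w:t></w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r w:rsidRPr="00EF34AB"><w:rPr><w:b/></w:rPr><w:t>wor</w:t></w:r>
      <w:proofErr w:type="spellEnd"/>
      <w:r><w:t>ld</w:t></w:r>
    </w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
XML

my @paragraphs = paragraphs_from_document_xml($sample);
print "$_\n" for @paragraphs;
```

For a real file you would read the extracted document.xml and pass its contents to the helper (or use `load_xml(location => ...)` instead of `string`). Note that this sketch ignores tabs and line breaks (`w:tab`, `w:br`), and `//w:p` also matches paragraphs nested inside other paragraphs (text boxes, for instance), so their text could show up twice.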

 
 

