Information Extraction from PDF



lovboy
New User

May 24, 2010, 2:18 PM



Hi Guys,

This is my first post here and, as that suggests, I am very new to Perl.

I am looking for an elegant design for the following problem. I would really appreciate it if you could guide me through it and point me to the right modules and libraries.

Task: I am trying to extract information from a PDF page like this one (page 872): http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM071436.pdf

I need to produce the output in this format:

Drug Name      Approval number      Patent number
---------      ---------------      -------------
ABC            020977               5034394
XYZ            020977               5089500

How do you think I should approach this problem?

Thanks,
lovboy


Bianca
User

Jun 7, 2010, 1:23 AM


Re: [lovboy] Information Extraction from PDF

Have you tried CAM::PDF? http://search.cpan.org/~cdolan/CAM-PDF-1.52/lib/CAM/PDF.pm


deepeshtronics
Novice

Jul 31, 2010, 10:10 AM


Re: [lovboy] Information Extraction from PDF

Hi,

I would like to help you with this.
Follow the steps below:

1] Install the following modules on your machine:

CAM::PDF
Compress::Raw::Zlib
Text::PDF (0.29)

2] Convert your PDF file into a text file using the code below:

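#!/usr/bin/perl

use warnings;
use strict;

use CAM::PDF;

# The PDF to convert is given as the first command-line argument.
my $file_name = shift or die "Usage: $0 pdf_name > output_text_file_name\n";
my $pdf = CAM::PDF->new($file_name)
    or die "Cannot open '$file_name': $CAM::PDF::errstr\n";

# Print the extracted text of every page, in order.
for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);
    print $text if defined $text && length $text;
}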

Run the script with the following command:
perl script_name pdf_name > output_text_file_name

Once you are done with the above steps, kindly let me know. Based on the pattern of the lines in the converted text file, we will pick out the relevant information with another Perl script and write it to a separate output file.
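
To give a rough idea of what that second script might look like, here is a minimal sketch. It is only an illustration: the layout it assumes (a drug name on its own line in capitals, followed by lines holding a 6-digit approval number and a 7-digit patent number) is hypothetical, and the real regular expressions will have to be adapted once we see the actual converted text. The sub names extract_rows and format_table are made up for this sketch, and the sample data is just the example from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical layout of the converted text (an assumption, not the real
# file): a drug name on its own line in capitals, followed by lines of
# the form "<6-digit approval number> <7-digit patent number>".
my $sample = <<'END_TEXT';
ABC
020977 5034394
XYZ
020977 5089500
END_TEXT

# Return a list of [drug, approval, patent] rows found in the text.
sub extract_rows {
    my ($text) = @_;
    my ($drug, @rows);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*([A-Z][A-Z ]*?)\s*$/) {
            $drug = $1;                     # remember the current drug heading
        }
        elsif (defined $drug && $line =~ /^\s*(\d{6})\s+(\d{7})\b/) {
            push @rows, [$drug, $1, $2];    # one row per patent line
        }
    }
    return @rows;
}

# Format the rows as a fixed-width table like the one in the question.
sub format_table {
    my @rows = @_;
    my $fmt  = "%-12s %-19s %s\n";
    my $out  = sprintf $fmt, 'Drug Name', 'Approval number', 'Patent number';
    $out .= sprintf $fmt, '-' x 12, '-' x 19, '-' x 16;
    $out .= sprintf $fmt, @$_ for @rows;
    return $out;
}

print format_table(extract_rows($sample));
```

To run it against the real data, you would read output_text_file_name into a string instead of using $sample, after adjusting the two patterns to whatever the converted text actually looks like.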

Thanks