Information Extraction from PDF

May 24, 2010, 2:18 PM

Hi Guys,

This is my first post here, and I am very new to Perl.

I am looking for an elegant design for the following problem. I would appreciate it if you could guide me through it and point me to the right modules and libraries.

Task: I am trying to extract information from this kind of a PDF page (page 872) - http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM071436.pdf

I need to provide the output in this format:

Drug Name    Approval number    Patent number
---------    ---------------    -------------
ABC          020977             5034394
XYZ          020977             5089500

How do you think I should approach this problem?



Jun 7, 2010, 1:23 AM

Have you tried CAM::PDF? http://search.cpan.org/~cdolan/CAM-PDF-1.52/lib/CAM/PDF.pm


Jul 31, 2010, 10:10 AM

I would like to help you with this.
Follow the steps below:

1] Install the following module on your machine: CAM::PDF (the script in step 2 uses it)

2] Convert your PDF file into a text file using the code given below


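use warnings;
use strict;

use CAM::PDF;

# Path to the PDF is taken from the first command-line argument
my $file_name = shift or die "Usage: $0 pdf_name\n";
my $pdf = CAM::PDF->new($file_name) or die "Cannot open $file_name: $CAM::PDF::errstr\n";

# Print the text of every page in order
for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);
    print $text if $text;
}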

Run the script with the following command:
perl script_name pdf_name > output_text_file_name

Once you are done with the above steps, please let me know. Based on the patterns in the converted text file, we will pick out the relevant information with another Perl script and write it to a separate output file.
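
To give a rough idea of what that second script might look like, here is a minimal sketch. The layout it expects is only an assumption until the converted text has been inspected: a drug name on a line of its own in capital letters, followed by data lines that start with a 6-digit approval number and contain a 7-digit patent number. The function name extract_rows and the sample data are made up for illustration, so the patterns will almost certainly need adjusting to the real file.

```perl
use strict;
use warnings;

# Hypothetical layout (an assumption, not taken from the real PDF):
#   a drug name on a line of its own, followed by data lines that start
#   with a 6-digit approval number and contain a 7-digit patent number.
sub extract_rows {
    my ($text) = @_;
    my ($drug, @rows);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*(\d{6})\b.*?\b(\d{7})\b/) {
            # Data line: pair it with the most recent drug name seen.
            push @rows, [ $drug // 'UNKNOWN', $1, $2 ];
        }
        elsif ($line =~ /^\s*([A-Z][A-Z \-]*[A-Z])\s*$/) {
            # Name line: remember it for the data lines that follow.
            $drug = $1;
        }
    }
    return @rows;
}

# Small made-up sample so the sketch runs without the PDF.
my $sample = join "\n",
    'ABC',
    '020977 001 5034394',
    'XYZ',
    '020977 001 5089500';

# Print the rows under the column headings asked for in the first post.
printf "%-12s %-19s %s\n", 'Drug Name', 'Approval number', 'Patent number';
printf "%-12s %-19s %s\n", '-' x 12, '-' x 19, '-' x 16;
printf "%-12s %-19s %s\n", @$_ for extract_rows($sample);
```

For the real data, the sample string would be replaced by the contents of output_text_file_name, and the two regular expressions tuned to whatever the converted text actually looks like.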