Extracting Data from a File



GDrew
New User

Nov 11, 2010, 9:28 AM



Hello - I am a .NET developer with very limited Perl knowledge. I am trying to read some values from several files and write them out as a CSV file. I've read through several Perl books but haven't come across an example of how to do this. Can anyone show me how this is done in Perl?

I have some files in a directory tree that looks like this:

1. - root
...1.1 - html
......1.1.1 - html2010
............file1.html
............file2.html
............file3.html
............etc
......1.1.2 - html2010
............file1.html
............file2.html
............file3.html

I need to read the content attribute of the "description" and "keywords" meta tags (test1, test2, test3, testk1, testk2, etc.) from each file and output the values as a CSV file.

The html pages look something like this:


Code
<!-- File 1 --> 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>Test Page1</title>
<meta name="description" content="test1,test2,test3,test4,test5"/>
<meta name="keywords" content="testk1,testk2,testk3,testk4,testk5"/>
</head>
<body>
Body of the page 1
</body>
</html>

<!-- File 2 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>Test Page2</title>
<meta name="description" content="test6,test7,test8,test9,test10"/>
<meta name="keywords" content="testk6,testk7,testk8,testk9,testk10"/>
</head>
<body>
Body of the page 2
</body>
</html>



Zhris
Enthusiast

Nov 12, 2010, 4:14 AM


Re: [GDrew] Extracting Data from a File

Hi,

I'm not 100% sure about your directory structure, so the following script only processes the HTML files in a single directory. I have used HTML::TreeBuilder / HTML::Element rather than other modules such as HTML::HeadParser because I prefer the level of control they give.

Untested:

Code
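#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $dirpath = '1.1.1 - html2010';
# Quote the pattern: glob splits an unquoted pattern on whitespace,
# and this directory name contains spaces.
my @files = glob(qq{"$dirpath/*.html"});

my $csv = 'output.csv';
open my $csv_fh, '>', $csv or die "cannot open $csv: $!";
foreach my $file (@files) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file);

    # Select each meta tag by its name attribute rather than by position,
    # so the order of the tags in <head> does not matter.
    my @fields;
    foreach my $name ('description', 'keywords') {
        my $meta    = $tree->look_down('_tag', 'meta', 'name', $name);
        my $content = $meta ? $meta->attr('content') : '';
        $content = '' unless defined $content;
        $content =~ s/"/""/g;            # CSV escapes a double quote by doubling it
        push @fields, qq{"$content"};    # quote the field: the values contain commas
    }
    print $csv_fh join(',', @fields), "\n";

    $tree->delete;    # free the tree's memory before the next file
}
close $csv_fh or die "cannot close $csv: $!";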
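
The script above reads one directory only, while the layout in the question has HTML files in several subdirectories. One possible way to collect them all is the core File::Find module. The following is a minimal sketch, not part of the original script: the helper name find_html_files and the default directory name 'root' are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every .html file found at any depth
# below $root, sorted by path.
sub find_html_files {
    my ($root) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and
            # $File::Find::name is the path including $root.
            push @found, $File::Find::name if -f $_ && /\.html\z/i;
        },
        $root,
    );
    my @sorted = sort @found;
    return @sorted;
}

# 'root' is an assumed directory name; pass your own as the first argument.
my $root = @ARGV ? $ARGV[0] : 'root';
if (-d $root) {
    print "$_\n" for find_html_files($root);
}
```

The list this returns could replace the @files array built with glob, so the same per-file loop would then cover every subdirectory.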


Chris


(This post was edited by Zhris on Nov 12, 2010, 4:55 AM)