
Extracting Data from a File

 



GDrew
New User

Nov 11, 2010, 9:28 AM

Post #1 of 2 (1327 views)
Extracting Data from a File

Hello, I am a .NET developer with very limited Perl knowledge. I am trying to read some values from several files and output them as a CSV file. I've been reading through several Perl books, but haven't come across any examples of how to get this done. Can anyone show me an example of how this is done in Perl?

I have some files in a directory structure that looks like this:

1. - root
...1.1 - html
......1.1.1 - html2010
............file1.html
............file2.html
............file3.html
............etc
......1.1.2 - html2010
............file1.html
............file2.html
............file3.html

I need to read the content of the "description" and "keywords" meta tags (test1,test2,test3,testk1,testk2,etc) from each file and output it as a CSV file.

The html pages look something like this:


Code
<!-- File 1 --> 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>Test Page1</title>
<meta name="description" content="test1,test2,test3,test4,test5"/>
<meta name="keywords" content="testk1,testk2,testk3,testk4,testk5"/>
</head>
<body>
Body of the page 1
</body>
</html>

<!-- File 2 -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en" dir="ltr">
<head>
<title>Test Page2</title>
<meta name="description" content="test6,test7,test8,test9,test10"/>
<meta name="keywords" content="testk6,testk7,testk8,testk9,testk10"/>
</head>
<body>
Body of the page 2
</body>
</html>



Zhris
Enthusiast

Nov 12, 2010, 4:14 AM

Post #2 of 2 (1317 views)
Re: [GDrew] Extracting Data from a File [In reply to]

Hi,

I'm not 100% sure about your directory structure, so the following script only processes the HTML files in a single directory. I have used HTML::TreeBuilder / HTML::Element rather than other modules such as HTML::HeadParser because I prefer the level of control.

Untested:

Code
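#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $dirpath = '1.1.1 - html2010';
# glob splits its argument on whitespace, so quote the pattern
# to keep the spaces in the directory name intact.
my @files = glob(qq("$dirpath/*.html"));

my $csv = 'output.csv';
open my $csv_fh, '>', $csv or die "cannot open $csv: $!";
foreach my $file (@files) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file) or die "cannot parse $file: $!";

    # Look each meta tag up by name rather than by position.
    my @fields;
    foreach my $name ('description', 'keywords') {
        my $meta = $tree->look_down(_tag => 'meta', name => $name);
        my $content = $meta ? $meta->attr('content') : '';
        $content = '' unless defined $content;
        # The content itself contains commas, so quote each CSV field.
        $content =~ s/"/""/g;
        push @fields, qq("$content");
    }
    print $csv_fh join(',', @fields), "\n";

    $tree->delete;    # free the tree before parsing the next file
}
close $csv_fh or die "cannot close $csv: $!";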
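
The script above reads a single directory. If the files are spread over several subdirectories, as the tree in the first post suggests, something along these lines could collect them first. This is only a sketch using the core File::Find module: the sub name find_html_files and the default directory name 'root' are my own assumptions, not part of the original script, and the resulting list would replace the glob call.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every .html file under $root (sorted),
# searching all subdirectories as well.
sub find_html_files {
    my ($root) = @_;
    my @found;
    find(
        {
            # no_chdir keeps $_ set to the full path of each entry
            no_chdir => 1,
            wanted   => sub { push @found, $_ if -f $_ && /\.html\z/i },
        },
        $root
    );
    return sort @found;
}

# 'root' is an assumed directory name; pass a different one as the first argument.
my $root = @ARGV ? $ARGV[0] : 'root';
if (-d $root) {
    print "$_\n" for find_html_files($root);
}
```

Because File::Find walks the directories itself, spaces in names such as "1.1.1 - html2010" need no special quoting here.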


Chris


(This post was edited by Zhris on Nov 12, 2010, 4:55 AM)

 
 

