
7stud
Enthusiast
Sep 22, 2010, 2:26 AM
Post #2 of 6
Re: [dilbert] data Parsing for Newbie (23 Tsd Html-files)
"First task is to take all the 25 thousand html-files and to strip out (parse) the address sets contained therein. This is a Perl task! Sure thing!"

That is an HTML parsing task, sure thing. But since you haven't shown any of the HTML, it is impossible to say how to extract the data. Either way, you will need to use one of Perl's HTML processing modules, such as HTML::TreeBuilder, to extract the data you want from the HTML.
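Since the real pages weren't posted, here is only a rough sketch of what the HTML::TreeBuilder step might look like. The markup, the td tag, and the class name 'address' are all made up for illustration; you would have to adapt the look_down() criteria to whatever your files actually contain.

```perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder;

# Hypothetical markup: the real pages were not shown, so the
# table cell and its class name 'address' are assumptions.
my $html = '<html><body><table><tr>'
         . '<td class="address">Example School, Main Street 1</td>'
         . '</tr></table></body></html>';

# Parse the HTML string into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Find every <td> element whose class attribute is 'address'.
my @cells = $tree->look_down(_tag => 'td', class => 'address');

# Pull the plain text out of each matching element.
my @addresses = map { $_->as_text } @cells;
say for @addresses;

# Free the tree's memory; worth doing when looping over thousands of files.
$tree->delete;
```

For the files on disk, you would probably loop over the file names and build each tree with HTML::TreeBuilder->new_from_file($path) instead of new_from_content().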
"How should I do this second task!?"

In the data you extract from the HTML page, look for a string that matches a regex beginning with 'ID-Number:', and capture everything after the colon. For example:
use strict;
use warnings;
use 5.010;

my $str = 'ID-Number: 2210202';

if ($str =~ /ID-Number: (.+)/) {
    my $id = $1;
    say $id;
}

--output:--
2210202

Do the same for the url. Combine the strings.
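A minimal sketch of the "do the same for the url, combine the strings" step could look like the following. The sample text, the 'URL:' label, and the comma separator are my assumptions, since the actual data wasn't shown.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical text extracted from one page; the 'URL:' label
# and the example address are assumptions, not the real data.
my $text = "ID-Number: 2210202\nURL: http://www.example.com/school";

# In list context a match returns its captures, so each variable
# is undef if the corresponding label is missing.
my ($id)  = $text =~ /ID-Number:\s*(\S+)/;
my ($url) = $text =~ /URL:\s*(\S+)/;

# Combine the two strings only when both were found.
my $combined;
if (defined $id and defined $url) {
    $combined = "$id,$url";
    say $combined;
}
```

--output:--
2210202,http://www.example.com/school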
(This post was edited by 7stud on Sep 22, 2010, 2:27 AM)