data Parsing for Newbie (many Html-files)

 



dilbert
User

Sep 22, 2010, 2:02 AM

Post #1 of 6 (2418 views)
data Parsing for Newbie (many Html-files)

Hi,

I've got 25 Tsd files, all stored in one folder.

Each page contains addresses (see below). Each data set has a unique ID number!

The first task is to take all 25 thousand HTML files and strip out (parse) the address sets they contain.
This is a Perl-task! Sure thing!


Here is one data set:

Name: Mister Miller
Adresse:
Telefon:
Fax:
ID-Nummer: 2210202
Mail-Adress: Mister_Miller@hotmail.com
Website: short url: http://www.TheWEBsite.org/[ID-Number - here 2210202]


The second task can be done with Perl:

The last line of the address set holds a URL in a short form that is built from two pieces:


http://www.TheWEBsite.org/[ID-Number - here 2210202]

To rebuild the original URL I have to put the two pieces together and request it: short URL: http://www.TheWEBsite.org/[ID-Number]


How should I do this second task?

I look forward to hearing from you.


(This post was edited by dilbert on Sep 22, 2010, 3:28 PM)


7stud
Enthusiast

Sep 22, 2010, 2:26 AM

Post #2 of 6 (2414 views)
Re: [dilbert] data Parsing for Newbie (23 Tsd Html-files)


Quote
The first task is to take all 25 thousand HTML files and strip out (parse) the address sets they contain.
This is a Perl-task! Sure thing!


That is an HTML parsing task, sure thing! But since you haven't shown any of the HTML, it is impossible to know how to extract the data. In any case, you will need to use one of Perl's HTML processing modules, like HTML::TreeBuilder, to extract the data you want from the HTML.


Quote
how should i do this second task!?


In the data you extract from the HTML page, look for a string that matches a regex that begins with 'ID-Number:', and then capture everything after the colon. For example:


Code
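use strict;
use warnings;
use 5.010;

# Sample line as it would appear in the text extracted from the HTML.
my $str = 'ID-Number: 2210202';

# Capture everything after "ID-Number: ".
if ($str =~ /ID-Number: (.+)/) {
    my $id = $1;
    say $id;
}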

--output:--
2210202


Do the same for the url. Combine the strings.
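
A minimal sketch of that combining step, assuming the extracted text contains a line like 'ID-Number: 2210202' and that the base URL is the http://www.TheWEBsite.org/ from the first post. The sub name build_short_url is made up for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Assumption: base URL taken from the first post; adjust for the real site.
my $base_url = 'http://www.TheWEBsite.org/';

# Hypothetical helper: pull the ID out of the extracted text and
# append it to the base URL. Returns undef if no ID is found.
sub build_short_url {
    my ($text) = @_;
    my ($id) = $text =~ /ID-Number:\s*(\d+)/;
    return defined $id ? $base_url . $id : undef;
}

say build_short_url('ID-Number: 2210202');
```

With the sample line this prints http://www.TheWEBsite.org/2210202. Matching \d+ instead of .+ keeps trailing whitespace or markup out of the URL.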


(This post was edited by 7stud on Sep 22, 2010, 2:27 AM)


dilbert
User

Sep 22, 2010, 3:11 AM

Post #3 of 6 (2406 views)
Re: [7stud] data Parsing for Newbie (23 Tsd Html-files)

Hello 7stud,

Many thanks for the quick reply! It all sounds great.
I will respond later and give you more details.

For now, it all looks good and seems to suit the needs here.

Until later,

dilbert


dilbert
User

Sep 22, 2010, 3:34 PM

Post #4 of 6 (2393 views)
Re: [7stud] data Parsing for Newbie (23 Tsd Html-files)

Hello 7stud, hello all,

Again, many thanks for the answer!

Here I am, back again.


This is what I want to get. I want to gather a set of information:

country: countryname
name: myname
School-type: Type one
Adress: 20000 New York, Broadway 16
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
index-number: 26666932002
Webmaster: Linus Thorwald
site registerd at: 08.03.2010
Website:


I can rebuild a URL from the index number, but for that I need to combine the pieces and then somehow request the result. How can I do this?


Besides this single result, I have thousands of files in one folder. How do I do this matching across that many files?




I look forward to hearing from you.

Regards,
dilbert

See the HTML here (more below):



Code
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div> 


I have to extract the index number and append it to the short URL http://www.the_search_site.org/ (here: 26666932002).


How should I proceed to gather the results mentioned above?



Below is the shortened HTML of one result:



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">


<!-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=-->
<title>result-title: MyName, New York </title>
<img src=""Contryname" title="Contryname" />
<div style="width: 40em;">
<div style="display: inline;"><div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type: Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16
<br />
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
<br />
index-number: 26666932002 <br />
Webmaster: <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p> </div>
<div>
<p class="ta_left einzel_text">
</p></div>
<br /><div><p class="ta_left einzel_text">registered at: 08.03.2010</p></div>
</div>
</div>
</div>
</div>

<!-- einzelergebnis.html?Id=26666932002&treffer=2139&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=-->
</html>
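
A minimal sketch of how the fields could be pulled out of the fragment above. It assumes every saved page has the same layout as this one and uses plain regexes on a trimmed copy of the HTML. That is fragile, and a real parser such as HTML::TreeBuilder would be the more robust choice. The sub name parse_school and the hash keys are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Trimmed copy of the HTML from this post, used as sample input.
my $html = <<'HTML';
<div class="logo_homepage"><a class="img_inl" href="http://www.the_search_site.org/26666932002"></a></div>
<div class="fm_linkeSpalte"><h2>My name</h2>
<span class="schulart_text">School-type: Type one</span>
<p class="einzel_text">Adress: 20000 New York, Broadway 16
<br />
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
<br />
index-number: 26666932002 <br />
Webmaster: <a href="mailto: webmaster@the-site.com" class="p1">Linus Thorwald</a><br /></p> </div>
<br /><div><p class="ta_left einzel_text">registered at: 08.03.2010</p></div>
HTML

# Hypothetical helper: returns a hash ref of the fields it could find.
# A field that does not match is left undefined.
sub parse_school {
    my ($page) = @_;
    my %rec;
    ($rec{name})       = $page =~ m{<h2>\s*(.*?)\s*</h2>}s;
    ($rec{type})       = $page =~ m{School-type:\s*(.*?)\s*</span>}s;
    ($rec{address})    = $page =~ m{Adress:\s*(.*?)\s*<br}s;
    ($rec{phone})      = $page =~ m{Telefon:\s*([\d-]+)};
    ($rec{fax})        = $page =~ m{Fax:\s*([\d-]+)};
    ($rec{id})         = $page =~ m{index-number:\s*(\d+)};
    ($rec{webmaster})  = $page =~ m{Webmaster:\s*<a[^>]*>\s*(.*?)\s*</a>}s;
    ($rec{registered}) = $page =~ m{registered at:\s*([\d.]+)};
    ($rec{short_url})  = $page =~ m{class="img_inl"\s+href="([^"]+)"};
    return \%rec;
}

my $rec = parse_school($html);
say "$_: ", $rec->{$_} // '' for sort keys %$rec;
```

The labels (Adress, Telefon, index-number) are matched exactly as they appear in the page, so any variation in the real files would need its own pattern.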



7stud
Enthusiast

Sep 22, 2010, 4:28 PM

Post #5 of 6 (2391 views)
Re: [dilbert] data Parsing for Newbie (23 Tsd Html-files)

Sorry, you've lost me. I don't know what you are trying to do. Write this down:

1)

2)

3)

4)

Then next to each number use one sentence to describe what you want to do in the order you want to do them. Don't use terms like Tsd file. I have no idea what that is.


(This post was edited by 7stud on Sep 22, 2010, 4:29 PM)


dilbert
User

Sep 22, 2010, 4:52 PM

Post #6 of 6 (2386 views)
Re: [7stud] data Parsing for Newbie (23 Tsd Html-files)

Hello 7stud,

Many thanks for the quick reply!

This is the page where I gathered the information: http://schulweb.de/de/schulsuche/liste.html?trefferzahlauswahl=alle&x=29&y=9&kategorie=&region=de&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=

I have gathered all the results of this page: "Treffer 1 - 10517 von 10517" (results 1 - 10517 of 10517).
This means I have more than 10,000 files in a folder. I got them with httrack, a good tool.

So I have all the pages with the detailed information, for example:
http://schulweb.de/de/schulsuche/einzelergebnis.html?Id=3122800&treffer=623&auswahl_1=0&auswahl_2=0&auswahl_3=0&suchtext=&kategorie=&region=de&trefferzahlauswahl=alle&trefferzahl=10517&list_anfang=0&sort=

Each page contains a set of information:


1) I have more than 10,000 files in a folder, and they all look the same. They contain information, and I want to gather this set of information.

2) If I can parse one file, then I am able to do it with all the others.

3) How do I parse a file to get the information (the addresses mentioned above, with 5 lines of text [see also below])?

4) After I have the addresses, I have to get the URL. It is written down as a combination with an ID number.

5) The address data set contains this ID number. I only have to add it to the URL and then I get the short URL.

See the equivalents: 1 is the same as 2, and both lead to 3:

1. http://schulweb.de/schule.html?id=3122800
2. http://schulweb.de/3122800
3. http://www.bbs-peine.de/
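
The step from 2 to 3 looks like an HTTP redirect. Assuming it is one (that is a guess; it could also be a meta refresh or a frame, which this would not follow), a rough sketch with the core module HTTP::Tiny could look like this. The sub names short_url and final_url are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;
use HTTP::Tiny;

# Hypothetical helper: build the short URL (equivalent no. 2) from an ID.
sub short_url {
    my ($id) = @_;
    return "http://schulweb.de/$id";
}

# Hypothetical helper: request a URL and return the URL that finally
# answered, or undef on failure. HTTP::Tiny follows redirects by default.
sub final_url {
    my ($url) = @_;
    my $res = HTTP::Tiny->new(timeout => 10)->get($url);
    return $res->{success} ? $res->{url} : undef;
}

# Only touch the network when an ID is given on the command line,
# e.g.: perl resolve.pl 3122800
if (@ARGV) {
    my $short = short_url($ARGV[0]);
    my $final = final_url($short);
    say defined $final ? "$short -> $final" : "$short could not be fetched";
}
```

With more than 10,000 IDs it would be polite to pause between requests, for example with sleep 1.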

So you see, after parsing the information I have to rebuild the URL, since I want to get the full data set:


country: countryname
name: myname
School-type: Type one
Adress: 20000 New York, Broadway 16
Telefon: 053333052-9899-0, Fax: 053333052-9899-55
index-number: 26666932002
Webmaster: Linus Thorwald
site registerd at: 08.03.2010
Website:
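
Points 1) to 5) above could be sketched roughly as follows: walk through every saved page in one folder, pull out the index-number, and build the short URL from it. This assumes the files end in .htm or .html, that each contains a line like "index-number: 26666932002", and that short URLs have the form of equivalent no. 2. The sub names id_from_file and urls_for_dir are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Assumption: short URLs have the form shown as equivalent no. 2 above.
my $base_url = 'http://schulweb.de/';

# Hypothetical helper: read one saved page and return its ID, or undef.
sub id_from_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                          # slurp the whole file at once
    my $html = <$fh> // '';
    close $fh;
    my ($id) = $html =~ /index-number:\s*(\d+)/;
    return $id;
}

# Hypothetical helper: map every HTML file in a folder to its short URL.
sub urls_for_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @files = sort grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    my %url_for;
    for my $file (@files) {
        my $path = "$dir/$file";
        my $id   = id_from_file($path);
        if (defined $id) {
            $url_for{$path} = $base_url . $id;
        }
        else {
            warn "No index-number found in $path\n";
        }
    }
    return \%url_for;
}

# Usage: perl collect.pl /path/to/folder
if (@ARGV) {
    my $urls = urls_for_dir($ARGV[0]);
    say "$_ => $urls->{$_}" for sort keys %$urls;
}
```

Files without an index-number are reported with a warning instead of stopping the run, so one odd page does not block the other 10,000.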

 
 

