Data parsing for a newbie - need to optimize the CSV output

 



dilbert
User

Feb 24, 2011, 8:50 AM


Hello and good day, dear community,

I like this place. It is a great place for sharing ideas and knowledge! But by far the most impressive thing I have learned is how supportive this community is. I am overwhelmed by this experience. This forum has so many great folks.

I have a little parser that parses a site with 6150 records, but I need the result in CSV format. First of all, here is the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

I need all the data, separated into the following fields:

[PHP]
number
schoolnumber
school-name
Address
Street
Postal Code
phone
fax
School-type
website

[/PHP]

BTW, compare this list with the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

Well, I have a script, and I am very interested in what you think about it. It does not capture all the fields yet; I need more of them!

[PHP]
#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

my $total_records = 0;
my $alpha = "x";
my $results = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $percent = 0;

workDir();
chdir $processdir;
processURL();
print "\nPress <enter> to continue\n";
<>;
my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
processData();
close $outfile;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n";
unlink 'processing.html';

sub processURL {
print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n";
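# getstore returns the HTTP status code, which is always true, so test it explicitly
my $rc = getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html');
die "Unable to get page (HTTP status $rc)" unless is_success($rc);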

open my $fh, '<', 'tempfile.html' or die "Unable to read tempfile.html: $!";
while ( my $line = <$fh> ) {
# The hit line carries the range shown and the total record count
if ( $line =~ /Treffer <b>(\d+) - (\d+)<\/b> \w+ \w+ <b>(\d+)/ ) {
$total_records = $3;
print "Total records to process is $total_records\n";
}
}
close $fh;
unlink 'tempfile.html';
}

sub processData {
while ( $range <= $total_records) {
my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
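# getstore returns the HTTP status code, which is always true, so test it explicitly
my $rc = getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html');
die "Unable to get page (HTTP status $rc)" unless is_success($rc);
$te->parse_file('processing.html');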
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
cleanup(@$row);
# Add a table column delimiter in this case ||
print $outfile join("||", @$row)."\n";
}
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
}
}

sub cleanup {
for ( @_ ) {
s/\s+/ /g;
}
}

sub workDir {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}

[/PHP]
Output:
[PHP]

1||9752||Deutsche Schule Alamogordo USA Alamogorde - New Mexico || ||Deutschsprachige Auslandsschule||
2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET Alexandrien - Ägypten || ||Begegnungsschule (Auslandsschuldienst)||
3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503 Altötting ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de
4||0006||Max-Reger-Gymnasium Amberg Kaiser-Wilhelm-Ring 7 92224 Amberg ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de
[/PHP]
Here || is the field delimiter.


My problem is that I need more fields. I need the following split out separately:

[PHP]
name: Volksschule Abenberg (Grundschule)
street: Güssübelstr. 2
postal-code and town: 91183 Abenberg
fax and telephone: 09178/215 09178/905060
type of school: Volksschulen
website: home.t-online.de/home/vs-abenberg [/PHP]
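
To show the kind of split I mean, here is a minimal sketch, not a finished solution. It assumes the name cell still contains its line breaks (so it would have to run before `cleanup` replaces all whitespace, newlines included, with single spaces). It also assumes that the last line is the postal code plus town, the line before it the street, and everything above that the school name. The helper names `cell_lines`, `split_address` and `split_phone` are made up for this example, and treating the first number as phone and the second as fax is only a guess.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split a multi-line cell into trimmed, non-empty lines.
sub cell_lines {
    my ($cell) = @_;
    my @lines;
    for my $line ( split /\n+/, defined $cell ? $cell : '' ) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @lines, $line if length $line;
    }
    return @lines;
}

# ASSUMPTION: last line = "postal town", line before it = street,
# all remaining lines = school name (which may itself span several lines).
sub split_address {
    my ($cell) = @_;
    my @lines = cell_lines($cell);
    my %h = ( name => '', street => '', postal => '', town => '' );
    return %h unless @lines;

    my $last = pop @lines;
    if ( $last =~ /^(\d{5})\s+(.+)$/ ) {    # assumes a five-digit postal code
        @h{qw(postal town)} = ( $1, $2 );
        $h{street} = pop @lines if @lines > 1;
    }
    else {
        push @lines, $last;    # no postal code found: keep everything as name
    }
    $h{name} = join ' ', @lines;
    return %h;
}

# ASSUMPTION: first line = phone, second line = fax.
sub split_phone {
    my ($cell) = @_;
    my ( $phone, $fax ) = cell_lines($cell);
    return ( defined $phone ? $phone : '', defined $fax ? $fax : '' );
}

# Example cells built from the Abenberg record above.
my %school = split_address(
    "Volksschule Abenberg (Grundschule)\n Güssübelstr. 2\n91183 Abenberg\n");
my ( $phone, $fax ) = split_phone("09178/215\n09178/905060");

print join( '||', @school{qw(name street postal town)}, $phone, $fax ), "\n";
```

Records without a postal code line (like the schools abroad in the output above) end up entirely in the name field with this approach, which may or may not be what is wanted.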

Well, how do I add more fields?
This obviously has to be done in this line here, doesn't it?

[PHP]my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
[/PHP]
But how? I tried out several things, but they didn't help. I always got bad results. Btw: I played around and tried another solution. With it I get good CSV data, but unfortunately it has no spider logic...

[PHP]
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d; # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

my $te = HTML::TableExtract->new();
$te->parse($html);

my @cols = qw(
rownum
number
name
phone
type
website
);

my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {

# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for @$row;

# load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;

# derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/} = split /\n+/, $h{phone};

# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
[/PHP]

So this second solution gives me good CSV data, but unfortunately it has no spider logic.
How can I add the spider logic here?
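
What I imagine is roughly the following. It is only a sketch under assumptions, not something I know to be right. It reuses the paging idea from the first script: read the total from the "Treffer" line, then step the `s` parameter in blocks of 50. It prints every table row as CSV. I am assuming that `s` is a zero-based offset, so the last page starts below the total. The helper names `page_offsets`, `page_url`, `total_records` and `crawl` are made up, and the one-second pause between requests is just my own choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $base     = 'http://192.68.214.70/km/asps/schulsuche.asp';
my $per_page = 50;

# Offsets for the "s" parameter.
# ASSUMPTION: "s" is a zero-based offset, stepped as in the first script.
sub page_offsets {
    my ( $total, $step ) = @_;
    my @offsets;
    for ( my $s = 0 ; $s < $total ; $s += $step ) { push @offsets, $s }
    return @offsets;
}

# Build the URL for one result page.
sub page_url {
    my ( $letter, $offset ) = @_;
    return "$base?q=$letter&a=$per_page&s=$offset";
}

# Total hit count, taken from the "Treffer <b>1 - 50</b> ... <b>N" line
# that the first script also matches. Returns 0 if the line is missing.
sub total_records {
    my ($html) = @_;
    return $html =~ /Treffer\s*<b>\d+\s*-\s*\d+<\/b>[^<]*<b>(\d+)/ ? $1 : 0;
}

sub crawl {
    my ($letter) = @_;

    # Loaded at run time so the helpers above work without the CPAN modules.
    require LWP::Simple;
    require HTML::TableExtract;
    require Text::CSV;

    binmode STDOUT, ':encoding(UTF-8)';
    my $csv = Text::CSV->new( { binary => 1, eol => "\n" } )
        or die "Cannot create Text::CSV object\n";

    my $first = LWP::Simple::get( page_url( $letter, 0 ) );
    die "Unable to get first page\n" unless defined $first;

    for my $offset ( page_offsets( total_records($first), $per_page ) ) {
        my $html = $offset == 0
            ? $first
            : LWP::Simple::get( page_url( $letter, $offset ) );
        unless ( defined $html ) {
            warn "Skipping offset $offset: fetch failed\n";
            next;
        }
        $html =~ tr/\r//d;         # strip the carriage returns
        $html =~ s/&nbsp;/ /g;     # expand the spaces

        # No headers given, so every table on the page is extracted.
        my $te = HTML::TableExtract->new;
        $te->parse($html);
        for my $table ( $te->tables ) {
            for my $row ( $table->rows ) {
                $csv->print( \*STDOUT, [ map { defined $_ ? $_ : '' } @$row ] );
            }
        }
        sleep 1;                   # be polite to the server
    }
}

if (@ARGV) { crawl( $ARGV[0] ) }
else       { warn "usage: $0 <search-letter>\n" }
```

The hash-slice splitting from the second script would go where `$csv->print` is called, so each row gets divided into the derived fields before it is written.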

I look forward to any and all help!

 
 

