loop routine to capture matching data

 



artperl
Novice

Apr 7, 2015, 8:27 AM

Post #1 of 11 (4649 views)

Hello gurus,

I have an HTML file (see 20150327_102932_18253.body) containing several tables.
There are 3 sections that I would like to consolidate by combining the data in matching column categories.
In the attachment there is info for SITE0 (1st section), then for SITE2 (2nd section), then the total/summary (3rd section).

I would like to consolidate the data and regenerate a single table (still in HTML format) with all the data from the 3 sections, with the primary column showing the distinct values combined from all of them. See 20150327_102932_18253_target.body.

Note: View in a web browser for visualization.

I would greatly appreciate your help with how to code this in Perl. Thanks much!
Attachments: 20150327_102932_18253.body (14.7 KB)
  20150327_102932_18253_target.body (8.81 KB)


Zhris
Enthusiast

Apr 7, 2015, 10:17 AM

Post #2 of 11 (4639 views)
Re: [artperl] loop routine to capture matching data

Hi,

HTML::TableExtract is capable of handling this task. If you get stuck post your attempt and any issues.

Chris


(This post was edited by Zhris on Apr 7, 2015, 10:20 AM)


artperl
Novice

Apr 7, 2015, 5:46 PM

Post #3 of 11 (4623 views)
Re: [Zhris] loop routine to capture matching data

Hi Chris,

Thanks much for the input!... That is good to know.
On the other hand, I tried this but it is not working... I'm pretty sure I did something wrong. Here is my code:

#!/usr/bin/perl
use HTML::TableExtract;

my $sFile = '/home/gexval/Scripts/IN/20150327_102932_18253.body';
print "Extracting... \n";


$te = HTML::TableExtract->new(headers => [qw(Binning Name Cat)]);
$te->parse($sFile);

foreach $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}


Zhris
Enthusiast

Apr 7, 2015, 7:06 PM

Post #4 of 11 (4621 views)
Re: [artperl] loop routine to capture matching data

Hi,

Because you are providing the file path as opposed to the file contents, you will want to use the parse_file method instead.


Code
# $te->parse($sFile);      # before: parse() expects the HTML content itself
$te->parse_file($sFile);   # after: parse_file() takes a file path



If you have any further issues, please post any errors you receive. This is just the beginning of your task: you will need to think about how you want to combine the tables, perhaps by extending just one of them. The final example in the module's synopsis will probably be of most use to you.

Chris


(This post was edited by Zhris on Apr 7, 2015, 7:13 PM)


artperl
Novice

Apr 7, 2015, 7:23 PM

Post #5 of 11 (4615 views)
Re: [Zhris] loop routine to capture matching data

wow!... that was fast Chris!... thanks!...
That was indeed the issue. So I now have working code to extract the data:

#!/usr/bin/perl
use HTML::TableExtract;
use strict;
use warnings;

my $sFile = '/home/gexval/Scripts/IN/20150327_102932_18253.body';
print "Extracting... \n";


my $te = HTML::TableExtract->new(headers => ['Binning', 'Name', 'Cat', 'Total count']);
$te->parse_file($sFile);

foreach my $ts ($te->tables)
{
    print "my ts now is $ts \n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

While this is working, I'm curious about another error I encountered:
In the line foreach my $row ($ts->rows), if I use @$ts instead, I get the error: "Not an ARRAY reference..."


Zhris
Enthusiast

Apr 7, 2015, 7:50 PM

Post #6 of 11 (4612 views)
Re: [artperl] loop routine to capture matching data

Hi,

What are you trying to achieve with @$ts? That treats the reference held by $ts as an array reference, which it is not: $ts is a single table object. You can assign all table objects to an array using the following:


Code
my @ts = $te->tables;


Or an array reference:


Code
my $ts = [ $te->tables ]; # now @$ts will work.
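
To make the difference concrete, here is a minimal sketch in plain Perl. It does not use HTML::TableExtract at all: the strings and the My::FakeTable class are made-up stand-ins for the real table objects, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Stand-ins for what $te->tables would return (hypothetical data).
my @tables_list = ( 'table0', 'table1', 'table2' );

# An array reference wrapping the same list.
my $tables_ref = [ @tables_list ];

# @$tables_ref dereferences the array reference back into a list.
my $count_from_ref = scalar @$tables_ref;

# A single object (here a blessed hash, as a stand-in) is a reference,
# but not an ARRAY reference, so dereferencing it with @ dies.
my $obj   = bless { }, 'My::FakeTable';
my $error = '';
eval { my @oops = @$obj; 1 } or $error = $@;

print "count via reference: $count_from_ref\n";
print "error: $error";
```

The eval is only there so the sketch can show the error message instead of dying.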


Chris


artperl
Novice

Apr 10, 2015, 10:25 AM

Post #7 of 11 (4216 views)
Re: [Zhris] loop routine to capture matching data

Thanks much Chris!...
I am now able to extract the data from the 3 tables in the html file:
1~GOOD~P~272
3~LEAK~F~24
5~CONT~F~1
7~IDDQ~F~9
18~LBIST~F~1
19~MBIST~F~5
23~POR~F~54
46~EPHY_ADC~F~1
56~PVT_MON~F~1
57~LDO~F~27
58~xDSL~F~20
61~DGASP~F~5
76~SWREG~F~1
1~GOOD~P~305
3~LEAK~F~20
7~IDDQ~F~11
18~LBIST~F~2
19~MBIST~F~5
21~SCAN~F~1
23~POR~F~34
40~EPHY_Func_inLoop~F~1
53~USB2~F~3
57~LDO~F~22
58~xDSL~F~13
61~DGASP~F~4

I'm now trying to figure out how to put these together in a single table.
I'm frying my brain over how to store them temporarily in an array and then write them to a new HTML file with a single table (instead of the 3 in the original).
I'd much appreciate your input...


Zhris
Enthusiast

Apr 10, 2015, 2:17 PM

Post #8 of 11 (4179 views)
Re: [artperl] loop routine to capture matching data

Hi,

Could you please post your code so I can see where you are up to? There are many ways you could accomplish this task; my preference would be to do it all via HTML::TableExtract and modify the document as it stands. It's actually pretty tricky, and worth reading through all the relevant documentation. You won't get the exact source as per your example output, but the rendered HTML will look identical.

The first problem you will encounter is the fact that your input html is malformed:


Code
</table> 
<tr>
<td width="160" height="21" bgcolor="#CCECFF"><b>Site 0</b></td>
<td width="590" height="21" bgcolor="#F8F8F8"><HR></td>
</tr>
<table border="0" cellspacing="1" width="98%" style="font-size: 12pt; border-collapse: collapse" bordercolor="#111111" cellpadding="0">


Since you will need to use HTML::TableExtract in tree mode, it's best to make the HTML well-formed again; otherwise HTML::TreeBuilder will try its best to repair it for you by wrapping other tables around sections of your HTML. The approach I would take is dirty: use a regular expression to remove the malformed sections while assigning the site to the corresponding table's id attribute:


Code
use HTML::TableExtract qw/tree/; 

my $filepath = '/home/gexval/Scripts/IN/20150327_102932_18253.body';
open my $handle, '<', $filepath or die "cannot open '$filepath': $!";
my $string = do { local $/ = undef; <$handle> };
close $handle;

$string =~ s{</table>\s*\K<tr>.+?<b>site\s(\d+)</b>.+?</tr>\s*<table}{<table id="Site$1"}sig; # fix malformed html. Retrieve id, remove html, assign associated table id attribute.

my $html_tableextract = HTML::TableExtract->new( );
$html_tableextract->parse( $string );


Now you can fetch the tables list via the tables method. Based on the position of tables you want to keep and / or delete, you can modify this list until it contains just the 2 individual tables, making sure you also take a reference to the combined table.

Next you want to insert new columns into the combined table. You can do this via HTML::ElementTable's maxcol method ( calling the tree method on a HTML::TableExtract::Table object will return a HTML::ElementTable object ):


Code
my $combined_tree = $combined->tree; 
my $col_max_old = $combined_tree->maxcol;
my $col_max_new = $combined_tree->maxcol( $col_max_old + $number_of_individual_tables );


At this point you may want to choose two representative cells of the combined table (one head cell and one body cell) to fetch suitable head attributes and row attributes via HTML::Element's all_external_attr method:


Code
my %attributes = $combined_tree->cell( $row_i, $col_i )->all_external_attr;


You can now iterate over each individual table somewhat as you are already doing, and construct a hash where the keys are the bins and the values are the counts:


Code
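# $counts->{$bin} = $count;
# @rows is assumed to hold one individual table's rows, e.g. my @rows = $table->rows;
my $counts = { };
$counts->{$_->[0]->as_text} = $_->[3]->as_text for ( @rows[1 .. $#rows] ); # skip the header row (index 0).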
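
To illustrate the bin-to-count hash idea on its own, here is a minimal sketch in plain Perl using a few of the "~"-separated rows from post #7. Which rows belong to which site, and the names %site_rows, %counts, %names and @combined, are assumptions for illustration only; in the real script the values would come from the table rows rather than from strings.

```perl
use strict;
use warnings;

# A few rows in the bin~name~cat~count form shown earlier in the thread.
# The assignment of rows to sites is assumed here for illustration.
my %site_rows = (
    Site0 => [ '1~GOOD~P~272', '3~LEAK~F~24', '5~CONT~F~1' ],
    Site2 => [ '1~GOOD~P~305', '3~LEAK~F~20', '21~SCAN~F~1' ],
);

my ( %counts, %names );
for my $site ( sort keys %site_rows ) {
    for my $line ( @{ $site_rows{$site} } ) {
        my ( $bin, $name, undef, $count ) = split /~/, $line;
        $counts{$site}{$bin} = $count;    # one bin => count hash per site
        $names{$bin}         = $name;     # remember every bin seen
    }
}

# Union of all bins across sites, sorted numerically.
my @bins = sort { $a <=> $b } keys %names;

# One combined row per bin; a bin missing from a site counts as 0.
my @combined;
for my $bin (@bins) {
    my @per_site = map { $counts{$_}{$bin} // 0 } sort keys %site_rows;
    push @combined, join ',', $bin, $names{$bin}, @per_site;
}

print "$_\n" for @combined;
```

Looking counts up by bin is what lets a bin that appears in only one site still get a row, with 0 filled in for the other.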


You can then use the table's id to set the value of the head cell, and the attributes you collected earlier to set that cell's attributes in the corresponding column. Then iterate over the rows, look up each count in your hash to set the cell's value, and again use the attributes you collected earlier to set the cell's attributes.

Finally output your modified document:


Code
print $html_tableextract->as_HTML( undef, "\t" );


I realise some of the above won't make much sense yet, but I wanted to give you hints at a direction you could take, as per your request. You may find it easier to rebuild the document from scratch instead of modifying it. Let's see how you get on and I'll try to help you achieve your goal. As stated above, start by reading HTML::TableExtract's documentation in depth, along with the objects it creates and the modules it subclasses along the way. Try some stuff out, then report back.

Chris


(This post was edited by Zhris on Apr 10, 2015, 2:31 PM)


artperl
Novice

Apr 13, 2015, 5:34 PM

Post #9 of 11 (3947 views)
Re: [Zhris] loop routine to capture matching data

Hi Chris,

Really appreciate your advice.
I see an interesting piece of code you provided:
$string =~ s{</table>\s*\K<tr>.+?<b>site\s(\d+)</b>.+?</tr>\s*<table}{<table id="Site$1"}sig;

So does it check and move on to the next line?
I ask because I may be able to use that syntax: I'm trying to figure out how to clean up a file and take out <tr></tr> pairs that are on different lines (with no data in between).


Zhris
Enthusiast

Apr 13, 2015, 8:12 PM

Post #10 of 11 (3935 views)
Re: [artperl] loop routine to capture matching data

Hi,

No problem. That regexp removes the extraneous tr blocks that sit between tables but aren't contained inside a table. It doesn't work line by line but on the document string as a whole: it matches the opening tr, everything in between, and the closing tr. It is very specific to the input data you provided and won't necessarily work on another document. If your HTML had been well-formed, I would not have used a regexp.
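
For the empty <tr></tr> clean-up you mention, the same whole-string idea could be sketched like this. The sample HTML is made up, and the pattern assumes bare <tr> tags with no attributes and nothing but whitespace between the pair, so treat it as a starting point rather than a general solution.

```perl
use strict;
use warnings;

# Hypothetical fragment with empty <tr></tr> pairs split across lines.
my $html = join "\n",
    '<table>',
    '<tr>',
    '</tr>',
    '<tr><td>1</td><td>GOOD</td></tr>',
    '<TR>',
    '',
    '</TR>',
    '</table>',
    '';

# \s also matches newlines, so a single substitution over the whole
# string removes pairs that span several lines. /g repeats the match,
# /i ignores case.
( my $cleaned = $html ) =~ s{<tr>\s*</tr>\s*}{}gi;

print $cleaned;
```

Reading the file line by line would never see <tr> and </tr> in the same string, which is why the file is slurped into one scalar first.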

Chris


(This post was edited by Zhris on Apr 13, 2015, 8:13 PM)


Zhris
Enthusiast

Apr 21, 2015, 5:43 PM

Post #11 of 11 (3406 views)
Re: [artperl] loop routine to capture matching data

Hi,

I guess you have already completed this task. I just wanted to post my HTML::TableExtract solution in case anyone who comes across this post in the future finds it useful:


Code
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract qw/tree/;

my $filepath = '/home/gexval/Scripts/IN/20150327_102932_18253.body';
open my $filehandle, '<', $filepath or die "cannot open '$filepath': $!";
my $string = do { local $/ = undef; <$filehandle> };
close $filehandle;

$string =~ s{</table>\s*\K<tr>.+?<b>site\s(\d+)</b>.+?</tr>\s*<table}{<table id="Site$1"}sig; # fix malformed html. Retrieve id, remove html, assign associated table id attribute.

my $html_tableextract = HTML::TableExtract->new( );
$html_tableextract->parse( $string );

my @tables = $html_tableextract->tables; # assign tables to list.
shift @tables; # remove table 0 from list.
my $combined = pop @tables; # remove table -1 ( combined ) from list.
pop( @tables )->tree->delete; # remove table -2 from list, then delete it from our document.

my $combined_tree = $combined->tree;
my $col_max_old = $combined_tree->maxcol;
my $col_max_new = $combined_tree->maxcol( $col_max_old + @tables ); # insert new col per individual by increasing maxcol.
my @combined_rows = _rows_tree( $combined_tree, undef, [ 0, ( $col_max_old + 1 ) .. $col_max_new ] ); # retrieve rows. Use tree to ensure the changes we make below are reflected in our modified tree i.e. in new cols. Col slice for efficiency i.e. ignore unused cols.
my %th_attributes = $combined_rows[0]->[0]->all_external_attr; # retrieve hash of default th attributes from suitable cell.
my %td_attributes = $combined_rows[1]->[0]->all_external_attr; # retrieve hash of default td attributes from suitable cell.

for my $individual_i ( 0 .. $#tables ) # iterate over remaining tables in list ( individuals ).
{
    my $individual = $tables[$individual_i];
    my $individual_tree = $individual->tree;
    my $id = $individual_tree->id; # retrieve id from table id attribute.
    my @individual_rows = $individual->rows; # retrieve rows.

    my $counts = { };
    $counts->{$_->[0]->as_text} = $_->[3]->as_text for ( @individual_rows[1 .. $#individual_rows] ); # construct counts hash.

    $individual_tree->delete; # delete individual table from our document.

    my $col_i = ( @tables - $individual_i ) * ( -1 ); # convert individual i to negative col i.
    _set( $_->[$col_i], \%th_attributes, [ 'b', "$id count" ] ) for ( $combined_rows[0] );
    _set( $_->[$col_i], \%td_attributes, [ 'b', $counts->{$_->[0]->as_text} // 0 ] ) for ( @combined_rows[1 .. $#combined_rows] );
}

print $html_tableextract->as_HTML( undef, "\t" ); # output document.

# todo: the functions below would be better suited to a subclass of html table extract / html element table.

# like html table extract's rows method, but on tree level. This lets us be consistent with html table extract in tree mode when dealing with trees.
sub _rows_tree
{
    my ( $tree, $row_slice, $col_slice ) = @_;

    $row_slice //= [ 0 .. $tree->maxrow ];
    $col_slice //= [ 0 .. $tree->maxcol ];

    return map { my $row_i = $_; [ map { my $col_i = $_; $tree->cell( $row_i, $col_i ) } @$col_slice ] } @$row_slice;
}

# set element attributes and content in one hit.
sub _set
{
    my ( $element, $attributes, $content ) = @_;

    if ( defined $attributes )
    {
        $element->attr( $_ => $attributes->{$_} ) for keys %$attributes; # for rather than map: no return list is needed.
    }
    $element->replace_content( $content ) if defined $content;
}
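
The negative column index in the loop is the least obvious step, so here is a minimal standalone sketch of just that arithmetic. The row contents and table names are made up; only the expression for $col_i is taken from the script above.

```perl
use strict;
use warnings;

# Hypothetical row: one original column followed by two appended columns.
my @row    = ( 'Bin', 'new_a', 'new_b' );
my @tables = ( 'Site0', 'Site2' );    # stand-ins for the individual tables

my @col_for;
for my $individual_i ( 0 .. $#tables ) {
    # Same arithmetic as the script: count back from the end of the row,
    # so the first individual lands in the first appended column.
    my $col_i = ( @tables - $individual_i ) * ( -1 );
    push @col_for, $col_i;
    $row[$col_i] = "$tables[$individual_i] count";
}

print "@col_for\n";              # indices used for each individual
print join( ',', @row ), "\n";   # the row after filling the new columns
```

Counting from the end means the loop does not need to know how many columns the combined table had originally.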


Chris

 
 

