Building a string in pieces

 



PapaGeek
User

Aug 3, 2013, 1:23 PM

Post #1 of 10 (690 views)
Building a string in pieces

Again, remember that I am a Perl newbie!

I’m working on a project to read stock quotes from the internet. The code at the bottom of this post is my first step. It reads the URL, gets rid of all of the web spacing, and then does a custom conversion of the page into a text stream: no fonts, no styles, etc.

I want to call the new sub-function with two parameters: the URL, and an optional disk file to hold the results.

I’m currently writing the stripped-down stream to a file on disk. I’d like to build it as an in-memory stream, or whatever that is called in Perl, save that stream to disk if the optional path is given, and then return the stream as a single string to the caller.

In the case of getting a stock quote, I could ask for the page that contains multiple quotes, split the result on the word “table>”, search for the piece that contains the column headings Symbol and Time & Price, and finally search that table for the lines that start with each requested symbol. But before I do that, I have to build the returned string in pieces and return it as a single string.

What are the best practices for doing this in Perl?

Pardon the code, it is newbie stuff!



Code
use Modern::Perl '2013'; 

use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;

my $URL = get("http://finance.yahoo.com/quotes/C,FB,DELL");

# Get rid of spacing in raw URL
$URL =~ s/\r//g;
$URL =~ s/\n//g;
$URL =~ s/\t/ /g;
for ( my $count = 25 ; $count > 0 ; $count--)
{
$URL =~ s/  / /g; # replace each pair of spaces with a single space
}

my @sections = split(/</,$URL);

my $inBody = 0; # are we currently in the body of the web page
my $inStyle = 0; # are we currently defining a style in the body of the web page
my $inScript = 0; # are we currently defining a script in the body of the web page

open ( my $page, '>',"webpage.txt");

foreach my $line (@sections)
{
my @parts = split(/>/,$line); # split each section on >
my $numParts = scalar(@parts);

if ($numParts == 0) {next;}

my @pieces = split(/ /,$parts[0]);
my $thisCommand = "<" . $pieces[0] . ">";

print "$line\n";

if ( $inBody )
{
if ($thisCommand =~ /<\/BODY>/i )
{
$inBody = 0;
next;
}
}
else
{
if ($thisCommand =~ /<BODY>/i )
{
$inBody = 1;
}
else
{
next;
}
}

# We are in the body of the page, process the commands

my $commandOut = processCommand($thisCommand);

if ( $inStyle )
{
if ($thisCommand =~ /<\/STYLE>/i ) { $inStyle = 0;}
next;
}
else
{
if ($thisCommand =~ /<STYLE>/i )
{
$inStyle = 1;
next;
}
}

if ( $inScript )
{
if ($thisCommand =~ /<\/SCRIPT>/i ) { $inScript = 0;}
next;
}
else
{
if ($thisCommand =~ /<SCRIPT>/i )
{
$inScript = 1;
next;
}
}


if ( $commandOut ) { print $page $commandOut; }


if ($numParts == 2)
{
print $page $parts[1];
}

if ($numParts > 2)
{
print "How did I get here??\n";
}

}

close ( $page );

#my $Parsed = $Format->format($TreeBuilder);

#print $Parsed;

sub processCommand
{
my ($thisCommand) = @_;

if ($thisCommand =~ /<body>/i ) { return "\n";}

if ($thisCommand =~ /<div>/i ) { return undef;}
if ($thisCommand =~ /<\/div>/i ) { return undef;}

if ($thisCommand =~ /<center>/i ) { return undef;}
if ($thisCommand =~ /<\/center>/i ) { return undef;}

if ($thisCommand =~ /<em>/i ) { return undef;}
if ($thisCommand =~ /<\/em>/i ) { return undef;}

if ($thisCommand =~ /<span>/i ) { return undef;}
if ($thisCommand =~ /<\/span>/i ) { return undef;}

if ($thisCommand =~ /<b>/i ) { return undef;}
if ($thisCommand =~ /<\/b>/i ) { return undef;}

if ($thisCommand =~ /<a>/i ) { return "{";}
if ($thisCommand =~ /<\/a>/i ) { return "}";}
if ($thisCommand =~ /<br>/i ) { return " ";}

if ($thisCommand =~ /<\/li>/i ) { return "</li>\n";}
if ($thisCommand =~ /<\/p>/i ) { return "</p>\n";}
if ($thisCommand =~ /<\/tr>/i ) { return "</tr>\n";}

if ($thisCommand =~ /<table>/i ) { return "\n<table>";}
if ($thisCommand =~ /<\/table>/i ) { return "</table>\n";}

if ($thisCommand =~ /<tr>/i ) { return "\n<tr>";}
if ($thisCommand =~ /<\/tr>/i ) { return "</tr>\n";}

if ($thisCommand =~ /<td>/i ) { return "|";}
if ($thisCommand =~ /<\/td>/i ) { return "|";}
if ($thisCommand =~ /<th>/i ) { return "|";}
if ($thisCommand =~ /<\/th>/i ) { return "|";}

return $thisCommand;
}


Here is the table section of the resulting return that contains the actual quotes:


Code
<table> <thead> 
<tr>|Symbol||Time & Price||Chg & % Chg||Day's Low & High||Volume||Avg Vol||Mkt Cap||Chart||More Info|||</tr>
</thead><tbody>
<tr>|{C}||Aug 2||53.00||+0.1400||+0.26%||52.50||53.05||15,440,120||31,193,000||161.173B||{<img>}||{Chart}, {News}, {Stats}, {Options}, {Board}||<form><button></button></form>|</tr>

<tr>|{FB}||Aug 2||38.05||+0.56||+1.50%||37.50||38.49||73,058,424||50,288,100||91.59B||{<img>}||{Chart}, {News}, {Stats}, {Options}, {Board}||<form><button></button></form>|</tr>

<tr>|{DELL}||Aug 2||13.68||+0.73||+5.60%||13.55||13.68||108,752,381||25,202,800||24.02B||{<img>}||{Chart}, {News}, {Stats}, {Options}, {Board}||<form><button></button></form>|</tr>
</tbody></table>



Laurent_R
Veteran / Moderator

Aug 3, 2013, 3:07 PM

Post #2 of 10 (682 views)
Re: [PapaGeek] Building a string in pieces

Hi,
from what you said, I understand that your code does what you want and that you are looking for best practices. I do not see any bad practices in your code, except that it is not very "perlish": it systematically uses a C-style syntax rather than the more idiomatic Perl shortcuts.

This is a first suggestion to reduce your code from 153 lines to about 103 lines using simple Perl constructs:

Code
use Modern::Perl '2013';  

use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;

my $URL = get("http://finance.yahoo.com/quotes/C,FB,DELL");

# Get rid of spacing in raw URL
$URL =~ s/[\r\n]//g;
$URL =~ s/\t/ /g;
$URL =~ s/  / /g for 0..25; # replace each pair of spaces with a single space

my @sections = split(/</,$URL);

my $inBody = 0; # are we currently in the body of the web page
my $inStyle = 0; # are we currently defining a style in the body of the web page
my $inScript = 0; # are we currently defining a script in the body of the web page

open ( my $page, '>',"webpage.txt");

foreach my $line (@sections)
{
my @parts = split(/>/,$line); # split each section on >
my $numParts = @parts;

next unless $numParts;

my @pieces = split(/ /,$parts[0]);
my $thisCommand = "<" . $pieces[0] . ">";

print "$line\n";

if ( $inBody )
{
# leaving the body: clear the flag, then skip the closing tag
if ( $thisCommand =~ /<\/BODY>/i ) { $inBody = 0; next; }
}
else
{
next unless $thisCommand =~ /<BODY>/i;
$inBody = 1;
}

# We are in the body of the page, process the commands

my $commandOut = processCommand($thisCommand);

if ( $inStyle )
{
# skip everything inside a style block, including its closing tag
$inStyle = 0 if $thisCommand =~ /<\/STYLE>/i;
next;
}
elsif ( $thisCommand =~ /<STYLE>/i )
{
$inStyle = 1;
next;
}

if ( $inScript )
{
# skip everything inside a script block, including its closing tag
$inScript = 0 if $thisCommand =~ /<\/SCRIPT>/i;
next;
}
elsif ( $thisCommand =~ /<SCRIPT>/i )
{
$inScript = 1;
next;
}

if ( $commandOut ) { print $page $commandOut; }


print $page $parts[1] if $numParts == 2;

print "How did I get here??\n" if $numParts > 2;
}

close ( $page );

sub processCommand
{
my ($thisCommand) = @_;

if ($thisCommand =~ /<body>/i ) { return "\n";}

return undef if $thisCommand =~ /<div>/i or $thisCommand =~ /<\/div>/i
or $thisCommand =~ /<center>/i or $thisCommand =~ /<\/center>/i
or $thisCommand =~ /<em>/i or $thisCommand =~ /<\/em>/i
or $thisCommand =~ /<span>/i or $thisCommand =~ /<\/span>/i
or $thisCommand =~ /<b>/i or $thisCommand =~ /<\/b>/i;


if ($thisCommand =~ /<a>/i ) { return "{";}
if ($thisCommand =~ /<\/a>/i ) { return "}";}
if ($thisCommand =~ /<br>/i ) { return " ";}

if ($thisCommand =~ /<\/li>/i ) { return "</li>\n";}
if ($thisCommand =~ /<\/p>/i ) { return "</p>\n";}
if ($thisCommand =~ /<\/tr>/i ) { return "</tr>\n";}

if ($thisCommand =~ /<table>/i ) { return "\n<table>";}
if ($thisCommand =~ /<\/table>/i ) { return "</table>\n";}

if ($thisCommand =~ /<tr>/i ) { return "\n<tr>";}
if ($thisCommand =~ /<\/tr>/i ) { return "</tr>\n";}

return "|" if $thisCommand =~ /<td>/i or $thisCommand =~ /<\/td>/i
or $thisCommand =~ /<th>/i or $thisCommand =~ /<\/th>/i;
return $thisCommand;
}

I have tried to get the code to do exactly the same thing but, since I can't test it, I might have made an error here or there.
I guess that not everybody will agree with my changes. My logic is that the shorter the code (so long as it does not get cryptic), the easier it is to keep it bug-free. In particular, IMHO, it is much easier to avoid bugs when you can see more code on your screen than when you have to scroll all the time. Therefore, I would suggest a second, shorter version, still trying to keep the same algorithm, with a few other improvements (such as the way the file is opened):


Code
use Modern::Perl '2013';  

use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;

my $URL = get("http://finance.yahoo.com/quotes/C,FB,DELL");

# Get rid of spacing in raw URL
$URL =~ s/[\r\n]//g;
$URL =~ s/\t/ /g;
$URL =~ s/  / /g for 0..25; # replace each pair of spaces with a single space

my @sections = split(/</,$URL);

my $inBody = 0; # are we currently in the body of the web page
my $inStyle = 0; # are we currently defining a style in the body of the web page
my $inScript = 0; # are we currently defining a script in the body of the web page

open my $page, '>', "webpage.txt" or die "cannot open webpage.txt: $!\n";

foreach my $line (@sections) {
my @parts = split(/>/,$line);
my $numParts = @parts;
next unless $numParts;
my @pieces = split(/ /,$parts[0]);
my $thisCommand = "<" . $pieces[0] . ">";
print "$line\n";
if ( $inBody ) {
# leaving the body: clear the flag, then skip the closing tag
if ( $thisCommand =~ /<\/BODY>/i ) { $inBody = 0; next; }
} else {
next unless $thisCommand =~ /<BODY>/i;
$inBody = 1;
}
my $commandOut = processCommand($thisCommand);
if ( $inStyle ) {
# skip everything inside a style block, including its closing tag
$inStyle = 0 if $thisCommand =~ /<\/STYLE>/i;
next;
} elsif ( $thisCommand =~ /<STYLE>/i ) {
$inStyle = 1;
next;
}
if ( $inScript ) {
# skip everything inside a script block, including its closing tag
$inScript = 0 if $thisCommand =~ /<\/SCRIPT>/i;
next;
} elsif ( $thisCommand =~ /<SCRIPT>/i ) {
$inScript = 1;
next;
}

print $page $commandOut if $commandOut;
print $page $parts[1] if $numParts == 2;
print "How did I get here??\n" if $numParts > 2;
}

close ( $page );

sub processCommand {
local $_ = shift; # the bare regexes below match against $_

return "\n" if /<body>/i;
return undef if /<div>/i or /<\/div>/i or /<center>/i or /<\/center>/i or /<em>/i or /<\/em>/i
or /<span>/i or /<\/span>/i or /<b>/i or /<\/b>/i;

return "{" if /<a>/i;
return "}" if /<\/a>/i;
return " " if /<br>/i;
my $thisCommand = $_;
if ($thisCommand =~ /<\/li>/i ) { return "</li>\n";}
if ($thisCommand =~ /<\/p>/i ) { return "</p>\n";}
if ($thisCommand =~ /<\/tr>/i ) { return "</tr>\n";}

if ($thisCommand =~ /<table>/i ) { return "\n<table>";}
if ($thisCommand =~ /<\/table>/i ) { return "</table>\n";}

if ($thisCommand =~ /<tr>/i ) { return "\n<tr>";}

return "|" if /<td>/i or /<\/td>/i or /<th>/i or /<\/th>/i;
return $thisCommand;
}

I am now down to about 75 lines, less than half the original line count. Again, I haven't tested the changes, so there may be an error here or there. I could probably cut it down further, but it might start to become a bit cryptic, and I don't want that. In my view, the code above is at least as clear as your original code (possibly clearer), and the fact that it is less than half the size of the original program makes it easier to develop and debug.
I would probably go even further if I knew the context better.


(This post was edited by Laurent_R on Aug 3, 2013, 3:20 PM)


2teez
Novice

Aug 3, 2013, 4:26 PM

Post #3 of 10 (678 views)
Re: [PapaGeek] Building a string in pieces

Hi PapaGeek,
Welcome to programming in Perl once again. Before I say anything else: please, please NEVER parse an HTML file using regexes like you are doing, unless it is a single line of HTML. Rather, make good use of the modules you have listed in your script.

That being said, here are some good practices like you requested:
1. NEVER parse HTML files with regexes; use modules.
2. Always check the return value of the function "open", like this:

Code
 open my $fh, '>', $filename or die "can't open file: $!";

or use:

Code
use autodie;

3. The same principle applies to the function "close".
4. Use a tested module to parse your URLs as well, instead of using regexes.
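Points 2 and 3 can be sketched with the core autodie pragma. This is only an illustration: the temporary file and the path /no/such/dir/quotes.txt are made up for the example and are not part of the original script.

```perl
use strict;
use warnings;
use autodie;                      # open/close now die with a descriptive message on failure
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this illustration.
my ($tmp_fh, $filename) = tempfile();
close $tmp_fh;

open my $out, '>', $filename;     # no "or die" needed: autodie checks the call
print {$out} "C|53.00\n";
close $out;

open my $in, '<', $filename;
my $line = <$in>;
close $in;
unlink $filename;

# A failing open raises an exception that eval can trap.
my $ok = eval { open my $fh, '<', '/no/such/dir/quotes.txt'; 1 };

print "read back: $line";
print "open failed as expected\n" unless $ok;
```

With autodie in effect, forgetting an "or die" no longer lets a failed open or close pass silently.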

Below is a script that does what you wanted but leaves out the subroutine part, which I believe you can figure out. Wink. All that remains is wrapping it in a sub and doing a little rearrangement.


Code
use warnings; 
use strict;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI;

my $url = URI->new('http://finance.yahoo.com/quotes/C,FB,DELL');

my $browser = LWP::UserAgent->new;

my $rep = $browser->get($url);

if ( $rep->is_success ) {
my $tree = HTML::TreeBuilder->new;
$tree->parse( $rep->decoded_content() );
$tree->eof;
my ($table) = $tree->look_down(
_tag => q{table},
summary => q{Collection of symbols and their associated quotes}
);
for ( $tree->look_down( _tag => q{span}, class => q{wrapper} ) ) {
if ( $_->as_text =~ /(Info|Board)$/i ) {
print $_->as_text, $/;
}
else { print ' ', $_->as_text; }
}

$tree->delete(); ## In recent HTML::TreeBuilder version may not need this
}
else {
die $rep->status_line();
}

Produces...

Code
 Symbol Time & Price Chg & % Chg Day's Low & High Volume Avg Vol Mkt CapMore Info 
C Aug 2 53.00 +0.1400 +0.26% 52.50 53.05 15,440,120 31,193,000 161.173B Chart, News, Stats, Options, Board
FB Aug 2 38.05 +0.56 +1.50% 37.50 38.49 73,058,424 50,288,100 91.59B Chart, News, Stats, Options, Board
DELL Aug 2 13.68 +0.73 +5.60% 13.55 13.68 108,752,381 25,202,800 24.02B Chart, News, Stats, Options, Board

There are other modules you can still check out, like HTML::TableExtract from CPAN.
NOTE: I used LWP::UserAgent instead of LWP::Simple.
Hope this helps.

UPDATE:
I used HTML::TreeBuilder to parse the HTML file.
Really, instead of using this:

Code
... 
for ( $tree->look_down( _tag => q{span}, class => q{wrapper} ) ) {
...

as it was used above, one could simply say:

Code
my ($table) = $tree->look_down( 
_tag => q{table},
summary => q{Collection of symbols and their associated quotes}
);
for ( $table->look_down( _tag => q{tr} ) ) {
....

for each table row in that particular table, which would parse the same thing.


(This post was edited by 2teez on Aug 3, 2013, 4:46 PM)


PapaGeek
User

Aug 3, 2013, 8:19 PM

Post #4 of 10 (666 views)
Re: [PapaGeek] Building a string in pieces

Thank you for the replies so far, and yes, I do know that my code does not look “perlish”. Like I said, I’m a newbie, and that should come with time. I will look at every change suggested here; they are all appreciated.

My original question was how do I build this page as a string? The current code says:


Code
open ( my $page, '>',"webpage.txt"); 

Loop for each HTML command
if ( $commandOut ) { print $page $commandOut; }
print $page $parts[1];
close ($page);

I want the code to look like this: (This is of course pseudo-code, looking for the real perl methods)


Code
my $inMemoryPage = memory::stringBuilder->new 
Loop for each HTML command
if ( $commandOut ) { print $inMemoryPage $commandOut; }
print $inMemoryPage $parts[1];
my $pageString = $inMemoryPage->extractString();
$inMemoryPage->delete();
if ($fileRequested)
{
open ( my $page, '>',$fileRequested);
print $page ,$pageString;
close ($page);
}
return $pageString;


Build the reply string in memory, not as a disk file.
Create a disk file only if requested
return the reply as a single string to the caller.

I then want to use a regular expression something like:
<tr>|{(.*)}||Aug 2||(.*)||
To parse out the symbols and prices from the returned file.

I am familiar with the HTML::TreeBuilder process, but it did not give me a lot of control over which commands to pass on and which ones to hide. That is why I’m writing my own HTML custom parser, but I want it to return a single string, not a disk file!
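For what it is worth, the pattern above would need its pipes and braces escaped before Perl treats them as literal characters. A rough sketch, assuming the returned string has rows in the format shown in the first post (the sample rows here are shortened by hand, and %price is just a name chosen for the example):

```perl
use strict;
use warnings;

# Sample rows in the format shown earlier in the thread (shortened for the example).
my $page = join "\n",
    '<tr>|{C}||Aug 2||53.00||+0.1400||+0.26%|</tr>',
    '<tr>|{FB}||Aug 2||38.05||+0.56||+1.50%|</tr>',
    '<tr>|{DELL}||Aug 2||13.68||+0.73||+5.60%|</tr>';

my %price;
# \| \{ \} match the literal separators; [^|]+ stops at the next pipe
# instead of running greedily to the end of the line the way .* would.
while ( $page =~ m/<tr>\|\{([^}]+)\}\|\|[^|]+\|\|([^|]+)\|\|/g ) {
    $price{$1} = $2;    # symbol => last price
}

print "$_ => $price{$_}\n" for sort keys %price;
```

The date column is matched with [^|]+ rather than a hard-coded "Aug 2", so the same pattern keeps working on other days.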


BillKSmith
Veteran

Aug 3, 2013, 9:26 PM

Post #5 of 10 (663 views)
Re: [Laurent_R] Building a string in pieces

Laurent's processCommand function can be shortened even more by combining commands that require similar processing.

(untested)


Code
sub processCommand { 
local $_ = shift;
return
m/<body>/i ? "\n" :
m/<\/? (?: div | center | em | span | b ) >/xi ? undef :
m/<a>/i ? "{" :
m/<\/a>/i ? "}" :
m/<br>/i ? " " :
m/ < ( tr | table ) > /xi ? "\n<$1>" :
m/ <\/ ( li | p | tr | table ) > /xi ? "</$1>\n" :
m/<\/? ( td | th ) > /xi ? "|" :
$_
;
}

Good Luck,
Bill


PapaGeek
User

Aug 4, 2013, 9:52 PM

Post #6 of 10 (637 views)
Re: [BillKSmith] Building a string in pieces

Bill, Thank you for an excellent reply. I will definitely change my code to use this style (Perlish!)

The code was written because HTML::TreeBuilder trims down the web page the way someone else decided it should be done. My code, and especially with your change, gives me full control over what the page looks like.

But, I will ask my original question again!

I wanted to create the resulting page in memory. To that end I have modified the code as:


Code
my $pageStr = ""; 

if ( $commandOut ) { $pageStr .= $commandOut}

$pageStr .= $parts[1];


The page returned from Yahoo Finance was 62,162 characters.

My process created the pseudo text only file of 9,889 characters by performing 1,841 of the .= appends to the in memory string of the file.

Is 1,841 appends efficient? Or is there a more “Perlish” way to create the output string in memory from 1,841 pieces?


BillKSmith
Veteran

Aug 5, 2013, 6:08 AM

Post #7 of 10 (622 views)
Re: [PapaGeek] Building a string in pieces

As I understand your original question, you already write your results to a disk file, but you want them in a string.

Of course, you could "slurp" (read with $/=undef) the file back into a string.
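A minimal sketch of the slurp idiom, using a temporary file created just for the example (the file contents and variable names are illustrative, not taken from the original program):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small scratch file to read back.
my ($fh, $path) = tempfile();
print {$fh} "line one\nline two\n";
close $fh or die "cannot close $path: $!";

my $content = do {
    local $/;                          # undef record separator: read everything at once
    open my $in, '<', $path or die "cannot open $path: $!";
    <$in>;                             # scalar context: returns the whole file
};
unlink $path;

print length($content), " characters read\n";
```

Localizing $/ inside the do block keeps the change from leaking into the rest of the program.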

Another possibility is to open the file as a memory file rather than a disk file. (Refer to perldoc -f open.)

Quote
Since v5.8.0, Perl has built using PerlIO by default. Unless
you've changed this (such as building Perl with "Configure
-Uuseperlio"), you can open filehandles directly to Perl scalars
via:

open($fh, ">", \$variable) || ..
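A sketch of how that could fit the requested sub, assuming the pieces are already available in a list. The name build_page and its arguments are hypothetical; in the real program the print calls would sit inside the existing parsing loop.

```perl
use strict;
use warnings;

# build_page is a hypothetical helper; $pieces stands in for the parser output.
sub build_page {
    my ($pieces, $file) = @_;          # $file is optional

    my $pageString = '';
    open my $mem, '>', \$pageString or die "cannot open in-memory handle: $!";
    print {$mem} $_ for @$pieces;      # same print calls as with a disk file
    close $mem or die "cannot close in-memory handle: $!";

    if ( defined $file ) {             # write to disk only when a path was given
        open my $out, '>', $file or die "cannot open $file: $!";
        print {$out} $pageString;
        close $out or die "cannot close $file: $!";
    }
    return $pageString;
}

my $text = build_page( [ "\n<table>", "\n<tr>", "|", "Symbol", "|", "</tr>\n" ] );
print $text;
```

The only change from writing to webpage.txt is the third argument to open: a reference to a scalar instead of a file name.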


I do not know enough about your application to comment on efficiency except to note that it is seldom important if your program runs in available memory and in an acceptable time. It is far more important that your program be clear to people (including yourself). Perl provides the tools to do this.
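For comparison, here is a small sketch of the two usual ways to assemble a string from pieces: appending with .= as you go, or collecting the pieces and joining once. The piece list is made up for the example. For a couple of thousand short pieces either should be perfectly adequate, so the choice is mostly about which reads better in the surrounding code.

```perl
use strict;
use warnings;

my @pieces = ( "\n<tr>", "|", "C", "|", "53.00", "|", "</tr>\n" );

# Option 1: append each piece as it is produced.
my $appended = '';
$appended .= $_ for @pieces;

# Option 2: collect the pieces in an array and join once at the end.
my @collected;
push @collected, $_ for @pieces;
my $joined = join '', @collected;

print "same result\n" if $appended eq $joined;
```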
Good Luck,
Bill


PapaGeek
User

Aug 7, 2013, 3:18 AM

Post #8 of 10 (592 views)
Re: [BillKSmith] Building a string in pieces





Thanks for the code you recommended. It makes the function easier to follow.

I changed my column separators to \t instead of |, and here is the resulting code.


Code
	return 
m/<body>/i ? "\n" :
m/<\/? ( div | center | em | span | b ) >/xi ? undef :
m/<a>/i ? "{" :
m/<\/a>/i ? "}" :
m/<br>/i ? " " :
m/<tr>/i ? "\n<tr>\t" :
m/<table>/i ? "\n<table>" :
m/ <\/ ( li | p | table ) > /xi ? "</$1>\n" :
m/ <\/ ( td | th ) > /xi ? "\t" :
m/ < ( td | th ) > /xi ? undef :
$_ ;


I also dropped the ?: that you had in your example. It works the same with and without the ?: . What exactly does that expression do?


BillKSmith
Veteran

Aug 7, 2013, 4:03 AM

Post #9 of 10 (589 views)
Re: [PapaGeek] Building a string in pieces

The parentheses are needed in the regular expressions to contain the alternations. The ?: changes them to "non-capturing" parentheses: their result is not stored in $1, $2, etc. I have developed the habit of using plain ones only when the result is needed. This makes the processing slightly faster and, more importantly, the expression is easier to modify, because changes to the non-capturing ones do not renumber the results.
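A short illustration of the difference (the sample tags and variable names are made up for the example):

```perl
use strict;
use warnings;

my $tag = '</table>';

# Capturing group: the alternative that matched is stored in $1.
my $captured;
if ( $tag =~ m/ <\/ ( li | p | table ) > /xi ) {
    $captured = $1;
    print "captured: $captured\n";
}

# Non-capturing group: same kind of alternation, but it does not take a
# capture slot, so the next plain group is still number one.
my ($name) = '<td class="x">' =~ m/ < (?: td | th ) \s+ (\w+) = /xi;
print "first capture: $name\n";
```

If the (?: ... ) in the second pattern were a plain group, the attribute name would move to $2 and any code reading $1 would have to change.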
Good Luck,
Bill


PapaGeek
User

Aug 8, 2013, 7:15 AM

Post #10 of 10 (569 views)
Re: [BillKSmith] Building a string in pieces

Thanks, Bill. Another thing to add to my regex cheat sheet.

 
 

