Home: Perl Programming Help: Regular Expressions:
Screen Scraping from HTML page - help


New User

Feb 4, 2007, 9:15 PM

Post #1 of 5 (11509 views)
Screen Scraping from HTML page - help

Thanks in advance. I've been hammering at this for a week with no luck; I've even used RegexBuddy and still can't crack the problem. I have a NASDAQ premarket action stocks page from which I want to pull out the stock symbol, yesterday's close, today's open, the gap %, and the premarket volume.

I've got a script that pulls the HTML down and puts it in a file. I can parse out the stock symbol, but I am stuck on how to get out the close, open, gap, and volume data. I think my problem is with the second grouping (and, if I could figure it out, the 3rd, 4th, and 5th groupings): the 2nd set of parentheses doesn't seem to tell Perl to save the match (yesterday's close) to $2 (assume the stock symbol would be in $1).

Here's the nasdaq page -

Here's my regex so far - am using single line (/s option) and greedy matching

If it helps, here's a snippet of the text with an example of where I'd like to get the data. Per RegexBuddy, my regex gets me all the way through the $6.75 but does not understand that I want it to save the 6.75 to $2. Is my problem with strings and numbers? Appreciate any and all help!

<a target="_top"
Secure Computing
<td align="right"><nobr><b>
<td align="right"><nobr><b>
<td align="right"
<td align="right">564,434</td>
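For readers landing here later: a single /s regex with one pair of parentheses per field does work, provided each capture group is followed by literal text that pins it down. The sketch below is hypothetical — it assumes the rows look roughly like the fragment above (the ticker SCUR and the attribute values are stand-ins, and the real NASDAQ markup may differ):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical HTML shaped like the snippet above: symbol link,
# then close, open, gap, and volume cells.
my $html = <<'HTML';
<a target="_top" href="#">SCUR</a>
Secure Computing
<td align="right"><nobr><b>$6.75</b></nobr></td>
<td align="right"><nobr><b>$7.10</b></nobr></td>
<td align="right">5.19%</td>
<td align="right">564,434</td>
HTML

# One /s regex, five capture groups.  Each (.*?) is non-greedy,
# so it stops at the first occurrence of the literal that follows it.
my ($sym, $close, $open, $gap, $vol) = $html =~ m{
    <a[^>]*>([A-Z.]+)</a>     # capture 1: ticker symbol
    .*? \$([\d.]+)            # capture 2: yesterday's close
    .*? \$([\d.]+)            # capture 3: today's open
    .*? ([\d.]+)%             # capture 4: gap percent
    .*? >([\d,]+)</td>        # capture 5: premarket volume
}sx;

print "$sym close=$close open=$open gap=$gap% vol=$vol\n";
```

This prints "SCUR close=6.75 open=7.10 gap=5.19% vol=564,434" for the sample above. Note the m{...} delimiters, which keep the / in </a> and </td> from ending the pattern, and /x, which allows the comments inside the regex.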

New User

Feb 5, 2007, 2:52 PM

Post #2 of 5 (11501 views)
Re: [rtwolfe] Screen Scraping from HTML page - help [In reply to]

Are you doing all the stocks listed on the page, or are you just interested in a particular one?

For Intel, for example, the individual quote page may be easier to scrape, since you get HTML like this:

   <td >Share Volume:</td>
   <td >63,234,844</td>
   <td >Previous Close:</td>
   <td >$&nbsp;21.23</td>

# Example: use \Q ... \E to quote literal delimiters.  The first delimiter
# locates a "spot" on the page, and $2 grabs whatever sits between the
# second and third delimiters:
# if ($Line =~ m{\Qput-1st-delimiter-here\E(.*?)\Qput-2nd-delimiter-here\E(.*?)\Qput-3rd-delimiter-here\E}i) { ... }

if ($Line =~ m{\QPrevious Close\E(.*?)\Qnbsp;\E(.*?)\Q</td>\E}i) {
    print $2;
}

(Note the m{...} delimiters: with plain m/.../, the / inside </td> would end the pattern early.)


This should give you "21.23" (provided you've escaped what needs to be escaped, etc.)

Just a thought...

(This post was edited by Watts on Feb 5, 2007, 3:06 PM)

New User

Feb 5, 2007, 7:43 PM

Post #3 of 5 (11489 views)
Re: [rtwolfe] Screen Scraping from HTML page - help [In reply to]

Hi Watts:

To answer your question, I do want several stocks from the pre-market gapping page: specifically the ten stocks that are up in pre-market and the ten that are down. I'm not interested in the 10 with the most volume, just the highest % change, either up or down.

So I still need to parse the page at the link shown earlier.

Appreciate any suggestions related to the HTML text on the NASDAQ page -

I expect it is something really small that is keeping my draft regex from working; I just can't 'see' it. Hopefully, with new sets of eyes, someone else will 'get it'. Thanks again.


Apr 2, 2007, 9:39 AM

Post #4 of 5 (11353 views)
Re: [rtwolfe] Screen Scraping from HTML page - help [In reply to]

This question was answered and resolved on another forum when it was originally posted here.

New User

Apr 16, 2007, 7:14 PM

Post #5 of 5 (11261 views)
Re: [KevinR] Screen Scraping from HTML page - help [In reply to]

I worked on a project very much like this a few years ago. Modules such as HTML::TokeParser::Simple are well suited to the task. Avoid using regexes, as they tend to get very messy very quickly.

There is also an HTML template module which was very useful for scraping stock data.
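To make the HTML::TokeParser::Simple suggestion concrete, here is a minimal sketch of the parser approach applied to label/value table rows like the Intel fragment posted above. The HTML string is a made-up stand-in; the module is on CPAN, not in core:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;   # CPAN module, not in core Perl

# Hypothetical fragment shaped like the label/value rows quoted earlier.
my $html = '<tr><td>Previous Close:</td><td>$ 21.23</td>'
         . '<td>Share Volume:</td><td>63,234,844</td></tr>';

my $p = HTML::TokeParser::Simple->new(string => $html);

# Walk every <td>, collecting its trimmed text.  Adjacent cells
# then pair up naturally as label => value.
my @cells;
while ($p->get_tag('td')) {
    push @cells, $p->get_trimmed_text('/td');
}
my %data = @cells;

# Pull the number out of the "Previous Close:" cell.
my ($close) = $data{'Previous Close:'} =~ /([\d.]+)/;
print "Previous close: $close\n";   # 21.23
```

Because the parser tracks tags for you, this keeps working even if attributes, whitespace, or line breaks in the markup change, which is exactly where hand-rolled regexes get messy.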

