Parsing Web Page - Regex

 



edlong
New User

Dec 6, 2013, 8:37 AM

Post #1 of 4 (633 views)
Parsing Web Page - Regex

Hello,

While parsing a simple web page, a regex that works for some strings does not work for others that seem to be exactly the same. The full program is attached.
This is the example output where the issue occurs:
DATA NOT FOUND: Total Bedrooms:

The HTML that I am parsing looks like this:

Code
 <div class="field-items"> 
<div class="field-item odd">
<div class="field-label-inline-first">
Total Bedrooms:&nbsp;</div>
5 </div>
</div>


Thanks in advance!

This is perl 5, version 16, subversion 3 (v5.16.3) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2012, Larry Wall

Binary build 1603 [296746] provided by ActiveState http://www.ActiveState.com
Built Mar 13 2013 13:31:10
Attachments: webtest.pl (5.88 KB)


FishMonger
Veteran / Moderator

Dec 6, 2013, 8:53 AM

Post #2 of 4 (630 views)
Re: [edlong] Parsing Web Page - Regex

In almost all cases, using a regex to parse HTML is a mistake because the result is very fragile.

You should be using one of the HTML parsers on CPAN, such as HTML::Parser.

http://search.cpan.org/~gaas/HTML-Parser-3.71/Parser.pm

If you scroll down to the "SEE ALSO" section at the bottom, you'll find a list of related modules.
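
As a rough illustration of the parser approach, here is a minimal sketch using HTML::TokeParser, which ships in the same HTML-Parser distribution. It runs against the snippet from the first post. The helper name field_value is hypothetical, and the sketch assumes every label lives in a div with class "field-label-inline-first" and that the value is the text right after that div closes; real pages may need more care.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The snippet from the first post.
my $html = <<'HTML';
<div class="field-items">
<div class="field-item odd">
<div class="field-label-inline-first">
Total Bedrooms:&nbsp;</div>
5 </div>
</div>
HTML

# Strip leading/trailing whitespace, including the character that
# &nbsp; decodes to (0xA0), which \s does not always match.
sub trim {
    my ($text) = @_;
    $text =~ s/^[\s\xA0]+|[\s\xA0]+$//g;
    return $text;
}

# Hypothetical helper: return the text following the labelled <div>,
# or undef if the label is not found.
sub field_value {
    my ($doc, $label) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "cannot create parser";
    while (my $tag = $p->get_tag('div')) {
        my $class = $tag->[1]{class} // '';
        next unless $class eq 'field-label-inline-first';
        # Text inside the label div; entities are decoded by get_text.
        next unless trim($p->get_text('/div')) eq $label;
        $p->get_tag('/div');          # step over the label's closing tag
        return trim($p->get_text);    # text up to the next tag
    }
    return;
}

my $bedrooms = field_value($html, 'Total Bedrooms:');
print defined $bedrooms ? "Total Bedrooms: $bedrooms\n" : "DATA NOT FOUND\n";
```

Because the parser decodes entities and does not care where the line breaks fall, the lookup keeps working if the page's whitespace changes.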


Kenosis
User

Dec 6, 2013, 9:07 AM

Post #3 of 4 (625 views)
Re: [edlong] Parsing Web Page - Regex

Heed FishMonger's advice, and immerse yourself in the canonical "You can't parse [X]HTML with regex. Because HTML can't be parsed by regex."


edlong
New User

Dec 6, 2013, 10:38 AM

Post #4 of 4 (613 views)
Re: [Kenosis] Parsing Web Page - Regex

After reviewing the options, TokeParser did the trick for me. It's not as pretty as I'd like, but I think that is primarily due to how ugly the HTML is.

I'm still curious why the regex didn't work, though. I understand this isn't the best method for parsing HTML.

Thanks for the help!
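
The regex in webtest.pl is attached rather than quoted, so the actual cause cannot be confirmed from the thread. One guess, based only on the snippet in the first post: the label is followed by an &nbsp; entity and the value sits on the next line, so a pattern that expects the closing tag right after the colon will not match. Both patterns below are hypothetical illustrations, not the ones from the attachment.

```perl
use strict;
use warnings;

# The snippet from the first post, as one string with its line breaks.
my $html = <<'HTML';
<div class="field-items">
<div class="field-item odd">
<div class="field-label-inline-first">
Total Bedrooms:&nbsp;</div>
5 </div>
</div>
HTML

# Hypothetical "naive" pattern: expects </div> directly after the label.
# The literal "&nbsp;" between the colon and the tag defeats it.
my ($naive) = $html =~ /Total Bedrooms:\s*<\/div>\s*(\d+)/;

# Hypothetical tolerant pattern: allows &nbsp; or whitespace after the
# label, then whitespace (including the newline) before the value.
my ($value) = $html =~ /Total Bedrooms:(?:&nbsp;|\s)*<\/div>\s*(\d+)/;

print defined $naive ? "naive: $naive\n" : "naive: no match\n";     # prints "naive: no match"
print defined $value ? "tolerant: $value\n" : "tolerant: no match\n"; # prints "tolerant: 5"
```

Other labels on the same page may simply lack the entity or the line break, which would explain why one pattern works for some fields and not others. Reading the page line by line would also prevent any match that spans two lines.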


(This post was edited by edlong on Dec 6, 2013, 12:45 PM)

 
 

