Parsing Web Page - Regex

New User

Dec 6, 2013, 8:37 AM

While parsing a simple web page, a regex that works for some fields fails for others, even though the strings appear to be exactly the same. The complete, runnable program is attached.
This is the example output where the issue occurs:
DATA NOT FOUND: Total Bedrooms:

The HTML that I am parsing looks like this:

 <div class="field-items"> 
<div class="field-item odd">
<div class="field-label-inline-first">
Total Bedrooms:&nbsp;</div>
5 </div>

Thanks in advance!

This is perl 5, version 16, subversion 3 (v5.16.3) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2012, Larry Wall

Binary build 1603 [296746] provided by ActiveState http://www.ActiveState.com
Built Mar 13 2013 13:31:10
Attachments: webtest.pl (5.88 KB)

Veteran / Moderator

Dec 6, 2013, 8:53 AM

Re: [edlong] Parsing Web Page - Regex

In almost all cases, using a regex to parse HTML is a mistake because the approach is very fragile.

You should be using one of the HTML parsers on CPAN, such as HTML::Parser.

If you scroll down to the "SEE ALSO" section at the bottom of its documentation, you'll find a list of related modules.

Dec 6, 2013, 9:07 AM

Re: [edlong] Parsing Web Page - Regex

Heed FishMonger's advice, and immerse yourself in the canonical "You can't parse [X]HTML with regex. Because HTML can't be parsed by regex."

New User

Dec 6, 2013, 10:38 AM

Re: [Kenosis] Parsing Web Page - Regex

After reviewing the options, TokeParser did the trick for me. It's not as pretty as I'd like, but I think that is primarily due to how ugly the HTML is.
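
For anyone landing here later, this is a minimal sketch of the HTML::TokeParser approach, run against the markup from the first post. It is not the code actually used (that isn't shown in the thread); the helper name field_value and the check on the "field-label" class prefix are assumptions made for illustration, and a closing </div> was added so the sample is complete.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup from the first post; the final </div> is added here
# so that the outer "field-items" element is closed.
my $html = <<'HTML';
<div class="field-items">
<div class="field-item odd">
<div class="field-label-inline-first">
Total Bedrooms:&nbsp;</div>
5 </div>
</div>
HTML

# Hypothetical helper: return the text that follows a given field label,
# or nothing if the label is not found.
sub field_value {
    my ($doc, $label) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot parse document";

    while (my $tag = $p->get_tag('div')) {
        # Assumption: label <div>s carry a class starting with "field-label".
        my $class = $tag->[1]{class} // '';
        next unless $class =~ /\bfield-label/;

        # Text inside the label <div>, with entities decoded.
        my $found = $p->get_trimmed_text('/div');
        $found =~ s/[\s\xA0]+\z//;    # drop the decoded &nbsp; at the end
        next unless $found eq $label;

        $p->get_tag('/div');                    # step past the label's </div>
        return $p->get_trimmed_text('/div');    # value text up to the next </div>
    }
    return;
}

print field_value($html, 'Total Bedrooms:'), "\n";
```

The explicit strip of \xA0 is there because &nbsp; decodes to a non-breaking space, which \s does not necessarily match on a byte string.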

I'm still curious why the regex didn't work, though. I understand that it isn't the best method for parsing HTML.

Thanks for the help!

(This post was edited by edlong on Dec 6, 2013, 12:45 PM)
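
On the open question of why the regex failed: the original pattern is only in the attachment and not shown in the thread, so the following is a guess at one common cause rather than a diagnosis. In the markup above, the label is followed by "&nbsp;</div>" and a newline before the value, so a pattern that expects the number right after the label will not match. Both patterns below are hypothetical.

```perl
use strict;
use warnings;

# Markup as shown in the first post.
my $html = <<'HTML';
<div class="field-label-inline-first">
Total Bedrooms:&nbsp;</div>
5 </div>
HTML

# Hypothetical pattern that assumes the value directly follows the label.
# It fails here because "&nbsp;</div>" sits between the label and the number.
my ($naive) = $html =~ /Total Bedrooms:\s*(\d+)/;

# A more tolerant pattern: skip any &nbsp; entities and whitespace,
# then the closing tag, then whitespace (including the newline).
my ($tolerant) = $html =~ m{Total Bedrooms:(?:&nbsp;|\s)*</div>\s*(\d+)};

print defined $naive    ? "naive: $naive\n"       : "naive: no match\n";
print defined $tolerant ? "tolerant: $tolerant\n" : "tolerant: no match\n";
```

Even the tolerant version breaks as soon as the markup changes slightly, which is the fragility the replies above warn about.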