Jul 30, 2009, 7:13 AM
Post #5 of 8
Re: [sarthakganguly] Mechanize : parsing a html content
There is no general answer to your question: each page is different and needs its own parsing strategy. The one thing working in your favor is that sites like blogs are usually generated by a program, so their pages have a regular structure. You'll have to download one of the pages and study its source. Once you understand the structure, you can use the methods of your HTML parser package to zero in on the parts of the page you're interested in. In your particular case, you need to identify some HTML artifact (e.g. a tag name, CSS class, or piece of text) that always marks the article. Then you search the parse tree (assuming you're working with HTML::TreeBuilder) for that artifact and extract the article text.
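To make the "find the marker, then extract" idea concrete, here is a minimal sketch using HTML::TreeBuilder's look_down and as_text. The sample HTML and the class name "post-body" are made up for illustration; your page will have its own marker, and with Mechanize you would pass $mech->content instead of the inline string.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page: the "post-body" class is an assumption, not taken
# from any real site. With WWW::Mechanize, use $mech->content here.
my $html = <<'HTML';
<html><body>
  <div class="sidebar"><p>Links</p></div>
  <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first element matching
# every criterion (here: tag name and class attribute).
my $article = $tree->look_down(_tag => 'div', class => 'post-body');

# In list context look_down returns all matches below the article.
my @paragraphs = $article
    ? map { $_->as_text } $article->look_down(_tag => 'p')
    : ();

print "$_\n" for @paragraphs;

$tree->delete;    # release the parse tree when done
```

If the marker is text rather than an attribute, look_down also accepts a code reference as a criterion, so you can test each element's as_text yourself.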
Sometimes no single marker is enough and you have to walk down the tag hierarchy step by step (e.g. find the tables, get the div with the id or class "123456" inside the 3rd inner table, get the table inside that div, then get the text of the paragraphs inside its table cells). It can get tedious.
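That kind of step-by-step descent might look roughly like the sketch below. The HTML is invented, and I'm assuming "123456" is an id attribute (swap in class => '123456' if it's a class on your page). Each look_down starts from the element found in the previous step.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented layout for illustration only.
my $html = <<'HTML';
<html><body>
<table><tr><td>nav</td></tr></table>
<table><tr><td>ads</td></tr></table>
<table><tr><td>
  <div id="123456">
    <table><tr><td><p>Alpha</p></td><td><p>Beta</p></td></tr></table>
  </div>
</td></tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Step 1: all tables, in document order; take the third one.
my @tables = $tree->look_down(_tag => 'table');
my $third  = $tables[2] or die "expected at least three tables\n";

# Step 2: the marker div inside that table (id is an assumption).
my $div = $third->look_down(_tag => 'div', id => '123456')
    or die "marker div not found\n";

# Step 3: the table nested inside the div.
my $inner = $div->look_down(_tag => 'table')
    or die "inner table not found\n";

# Step 4: text of every paragraph inside the inner table's cells.
my @texts = map { $_->as_text }
            map { $_->look_down(_tag => 'p') }
            $inner->look_down(_tag => 'td');

print "$_\n" for @texts;

$tree->delete;
```

Checking each step with "or die" is worth the extra lines: when the site changes its layout, you find out which level of the hierarchy moved instead of getting an empty result.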
(This post was edited by 1arryb on Jul 30, 2009, 7:15 AM)