Mechanize : parsing a html content

 



sarthakganguly
Novice

Jul 26, 2009, 11:42 AM

Post #1 of 8 (1216 views)
Mechanize : parsing a html content

Hi,
I am writing a bot that searches blogs, and I am using the Mechanize module for it. I start with one URL (or it might be 100). After I get the HTML of a page from Mechanize's content function, I want to parse only the content part and exclude everything else on the page (ads, blogroll, menu, header, footer, etc.). Can you suggest a generalized approach for this? I don't want to write a specific regex just to exclude the blogroll, and another for each of the other things.
I hope I have made myself clear enough. Any suggestions will be highly appreciated.
Thank you.
Regards,
Sarthak


KevinR
Veteran


Jul 26, 2009, 12:06 PM

Post #2 of 8 (1215 views)
Re: [sarthakganguly] Mechanize : parsing a html content

Look into one of the many HTML parsing modules listed on CPAN.
-------------------------------------------------


1arryb
User

Jul 27, 2009, 9:16 AM

Post #3 of 8 (1203 views)
Re: [KevinR] Mechanize : parsing a html content

Elaborating on Kevin's response a bit: WWW::Mechanize->content(), called without arguments, should return the raw HTML of the fetched page. You can then pass that content to HTML::TreeBuilder, or another HTML parser of your choice, to do whatever you need that the WWW::Mechanize module doesn't support itself.
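
A minimal sketch of that hand-off, assuming WWW::Mechanize and HTML::TreeBuilder are installed from CPAN. The helper name title_and_text is made up for illustration, and the URL comes from the command line rather than from anything in this thread.

```perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;

# Hypothetical helper: parse raw HTML, return the page title and body text.
sub title_and_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my $title_el = $tree->look_down( _tag => 'title' );
    my $body_el  = $tree->look_down( _tag => 'body' );
    my $title = $title_el ? $title_el->as_text : '';
    my $text  = $body_el  ? $body_el->as_text  : '';

    $tree->delete;    # release the tree's memory explicitly
    return ( $title, $text );
}

# Only touches the network when a URL is given on the command line.
if (@ARGV) {
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get( $ARGV[0] );
    my ( $title, $text ) = title_and_text( $mech->content );
    print "$title\n\n$text\n";
}
```

Note that as_text returns the text of the whole subtree, so at this point the ads, menus and blogroll are still mixed in with the article; narrowing down to the article is a separate step.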

Cheers,

Larry


sarthakganguly
Novice

Jul 30, 2009, 4:10 AM

Post #4 of 8 (1173 views)
Re: [1arryb] Mechanize : parsing a html content

Hi,
Thanks for the reply. Let me elaborate on my main question. Say, for example, I have fetched a blogspot page. That page will contain a blogroll, a navigation menu, comments, ads and the article. How can I fetch only the article part of the page and leave out everything else?
Most articles are placed inside a particular div, but that div's class or id varies from site to site. So what I am asking is: how do I build an algorithm that will fetch only the article from any given page?
Any suggestions?
Thank you.
Regards,
Sarthak


1arryb
User

Jul 30, 2009, 7:13 AM

Post #5 of 8 (1171 views)
Re: [sarthakganguly] Mechanize : parsing a html content

Hi sarthakganguly,

There is no general answer to your question: Each page is different, and so will require a unique parsing strategy. The only good thing is that sites like blogs are usually written by a program, and so have a regular structure. You'll have to download one of the pages and study it. Once you understand the page structure, you can use the methods of your HTML parser package to zero in on the parts of the page you're interested in. In your particular case, you need to identify some HTML artifact (e.g. a tag name, CSS class, or text) that always marks the article. Then you search the parse tree (assuming you're working with HTML::TreeBuilder) for that artifact and extract the article text.
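
As a rough sketch of that search, assuming HTML::TreeBuilder: the helper extract_article and the marker list below are illustrative guesses, not something taken from any particular site, and would need to be replaced with whatever you find when you study the real pages.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Candidate markers, tried in order. These class/id values are assumptions;
# replace them with the ones found by inspecting the target pages.
my @CANDIDATES = (
    { _tag => 'div', class => qr/\bpost-body\b/ },
    { _tag => 'div', class => qr/\bentry-content\b/ },
    { _tag => 'div', id    => 'content' },
);

# Hypothetical helper: return the text of the first element that matches
# one of the candidate markers, or undef when none of them match.
sub extract_article {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text;
    for my $criteria (@CANDIDATES) {
        my $node = $tree->look_down(%$criteria) or next;
        $text = $node->as_text;
        last;
    }
    $tree->delete;
    return $text;
}
```

Returning undef when nothing matches makes it easy to log the pages that need a new marker instead of silently collecting the wrong text.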

Sometimes you have to do 2D analysis of the tag hierarchy (e.g. find the tables, get the CSS div called "123456" inside the 3rd inner table, get the table inside the "123456" div, get the text of the paragraphs inside the table cells). It can get tedious.
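
That kind of drill-down could look something like the sketch below, again assuming HTML::TreeBuilder. It reads "3rd inner table" as the third table in document order and "123456" as a class value; both readings, and the helper name nested_paragraphs, are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: walk table -> div -> table -> paragraphs and return
# the paragraph texts, or an empty list if any step of the path is missing.
sub nested_paragraphs {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @texts;

    # In list context look_down returns every match in document order.
    my @tables = $tree->look_down( _tag => 'table' );
    if ( my $third = $tables[2] ) {
        my $div   = $third->look_down( _tag => 'div', class => '123456' );
        my $inner = $div && $div->look_down( _tag => 'table' );
        if ($inner) {
            @texts = map { $_->as_text } $inner->look_down( _tag => 'p' );
        }
    }

    $tree->delete;
    return @texts;
}
```

Each step checks that the previous one found something, because a path like this breaks as soon as the site changes its layout.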

Good luck,

Larry


(This post was edited by 1arryb on Jul 30, 2009, 7:15 AM)


sarthakganguly
Novice

Jul 30, 2009, 12:26 PM

Post #6 of 8 (1165 views)
Re: [1arryb] Mechanize : parsing a html content

Hi 1arryb,
Thanks a lot for your reply. I am actually doing the same thing you suggested, with the HTML::TreeBuilder module. I was just wondering if there is something more sophisticated and robust than that. Anyway, thanks a lot for your reply.

Thank you.
Regards,
Sarthak


KevinR
Veteran


Jul 30, 2009, 12:43 PM

Post #7 of 8 (1163 views)
Re: [sarthakganguly] Mechanize : parsing a html content

If the site has an RSS feed you might be able to take advantage of that instead of parsing the HTML.
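
A minimal sketch of the feed route, assuming the XML::RSS module from CPAN and a feed that is actually RSS (XML::RSS does not read Atom, so an Atom feed would need a different module). The helper name feed_items is made up, and fetching the feed XML is left out.

```perl
use strict;
use warnings;
use XML::RSS;

# Hypothetical helper: parse an RSS document (passed as a string) and return
# one hash reference per item with its title, link and description.
sub feed_items {
    my ($xml) = @_;
    my $rss = XML::RSS->new;
    $rss->parse($xml);
    return map {
        +{
            title => $_->{title},
            link  => $_->{link},
            body  => $_->{description},
        }
    } @{ $rss->{items} };
}
```

Whether the description holds the full article or only a summary depends on how the blog publishes its feed, so it is worth checking a few entries by hand first.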
-------------------------------------------------


1arryb
User

Jul 30, 2009, 12:45 PM

Post #8 of 8 (1162 views)
Re: [sarthakganguly] Mechanize : parsing a html content

Hi sarthakganguly,

"Robust" and "HTML scrape" are words that don't normally go together. If you need robust, you have to talk to the site maintainer and see if you can negotiate some kind of structured data feed.

Good luck,

Larry

 
 

