Replying to:
crawl & analyze generic web pages. by mixxed
Post:
Hi, I'm new to this forum, so bear with me a bit :)

I'm looking for a script/app that would perform the following actions in the first stage of the project.

Start from: a database with links and priorities.

1. Read the list and determine the highest-priority link to crawl.
2. Crawl the page and store it locally.
3. Create a database of such links.
4. Analyze the page to determine:
   - primary content areas (exclude header/footer, banners, features)
   - type of page (blog, forum, news, etc.)
   - page last modified, commented, etc.
   - various page-related or site-related parameters
   - other various scores (based on formulas I provide)
5. Analyze the primary content:
   - similarity to previously known contents (trained Bayesian or something similar)
   - recency of the content
   - retrieve the main keywords
6. Store the scores and values from #4 and #5 in a database.

I'm expecting this to be a 1-2 month project. Is there anyone here who could get this done and provide a quote?

M.
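
To make the first steps a bit more concrete, here is a rough sketch of steps 1 and 2 (pick the highest-priority link, crawl it, store the page locally). Everything in it is an assumption for illustration, not part of the spec: the SQLite storage, the `links` and `pages` table layouts, the function names, and the convention that a larger `priority` number means "crawl sooner". The fetch step is passed in as a function so the selection and storage logic can be tried without touching the network.

```python
import sqlite3
from typing import Callable, Optional


def init_db(conn: sqlite3.Connection) -> None:
    """Create the hypothetical tables: a link queue and a local page store."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS links (
            url      TEXT PRIMARY KEY,
            priority REAL NOT NULL,              -- assumption: larger = more urgent
            crawled  INTEGER NOT NULL DEFAULT 0  -- 0 = pending, 1 = done
        );
        CREATE TABLE IF NOT EXISTS pages (
            url  TEXT PRIMARY KEY,
            html TEXT NOT NULL
        );
        """
    )


def next_link(conn: sqlite3.Connection) -> Optional[str]:
    """Step 1: return the pending link with the highest priority, if any."""
    row = conn.execute(
        "SELECT url FROM links WHERE crawled = 0 "
        "ORDER BY priority DESC, url LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def crawl_one(conn: sqlite3.Connection, fetch: Callable[[str], str]) -> Optional[str]:
    """Step 2: fetch the next link, store the page, and mark the link as crawled."""
    url = next_link(conn)
    if url is None:
        return None  # nothing left to crawl
    html = fetch(url)
    with conn:  # commit both writes together, or roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )
        conn.execute("UPDATE links SET crawled = 1 WHERE url = ?", (url,))
    return url
```

In real use, `fetch` could be a small wrapper around `urllib.request.urlopen` that returns the decoded body. The analysis and scoring from steps 4 to 6 would then read from `pages` and write into further tables.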