Hi, I'm new to this forum, so bear with me a bit :)
I'm looking for a script/app that would perform the following actions in the first stage of the project:
Start from: a database with links and priorities.
1. Read the list and determine the highest-priority link to crawl.
2. Crawl the page and store it locally.
3. Create a database of such links.
4. Analyze the page to determine:
   - the primary content areas (excluding header/footer, banners, features)
   - the type of page (blog, forum, news, etc.)
   - when the page was last modified, last commented on, etc.
   - various page-related or site-related parameters
   - various other scores (based on formulas I provide)
5. Analyze the primary content:
   - similarity to previously known content (trained Bayesian classifier or something similar)
   - recency of the content
   - retrieve the main keywords
6. Store the scores and values from #4 and #5 in a database.
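To make step 1 and the keyword part of step 5 a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration only: the `links` table layout, the rule that a larger `priority` number means "crawl sooner", the tiny stopword list, and the function names `next_link`, `mark_crawled` and `top_keywords`. Keywords are picked by plain word frequency, which is a stand-in and not the Bayesian similarity scoring from step 5.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical schema: the real database layout is not specified in this post.
SCHEMA = """
CREATE TABLE IF NOT EXISTS links (
    url      TEXT PRIMARY KEY,
    priority INTEGER NOT NULL,          -- assumption: larger number = crawl sooner
    crawled  INTEGER NOT NULL DEFAULT 0
)
"""

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def next_link(conn):
    """Step 1: return the uncrawled URL with the highest priority, or None."""
    row = conn.execute(
        "SELECT url FROM links WHERE crawled = 0 "
        "ORDER BY priority DESC, url ASC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_crawled(conn, url):
    """Flag a URL as done so it is not selected again."""
    conn.execute("UPDATE links SET crawled = 1 WHERE url = ?", (url,))
    conn.commit()


def top_keywords(text, n=3):
    """Step 5 (keywords only): the n most frequent non-stopword words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]
```

A crawl loop would call `next_link`, fetch and store the page, run the analysis, and then call `mark_crawled`, repeating until `next_link` returns `None`. Fetching, content-area detection and the custom scores are left out here on purpose.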
I'm expecting this to be a 1-2 month project. Is there anyone here who could get this done and provide a quote?