crawl & analyze generic web pages. by mixxed
Hi, I'm new to this forum so bear with me a bit :)
I'm looking for a script/app that would perform the following actions in the first stage of the project:
Start from: a database with links and priorities.
1. read the list and determine the highest-priority link to crawl
2. crawl the page and store it locally.
3. create a database of such links
4. analyze the page to determine:
- primary content areas (excluding header/footer, banners, features)
- type of page (blog, forum, news, etc.)
- when the page was last modified, last commented on, etc.
- various page-related or site-related parameters
- various other scores (based on formulas I provide)
5. analyze the primary content
- similarity to previously known content (trained Bayesian classifier or something similar)
- recency of the content
- retrieve the main keywords
6. store scores & values from #4 and #5 in a database.
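To make steps 1 and 5 a bit more concrete, here is a rough sketch in Python using only the standard library. It is just an illustration of the idea, not a spec: the `links` table and its columns, the helper names (`next_link`, `mark_crawled`, `main_keywords`), the stopword list, and the rule that a higher number means higher priority are all assumptions. Keyword extraction here is plain word frequency, standing in for whatever method ends up being used.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical schema: one row per link, higher number = higher priority.
SCHEMA = """
CREATE TABLE IF NOT EXISTS links (
    url      TEXT PRIMARY KEY,
    priority INTEGER NOT NULL,
    crawled  INTEGER NOT NULL DEFAULT 0
)
"""

# Tiny illustrative stopword list (assumption); a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def next_link(conn):
    """Step 1: return the uncrawled URL with the highest priority, or None."""
    row = conn.execute(
        "SELECT url FROM links WHERE crawled = 0 "
        "ORDER BY priority DESC, url ASC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_crawled(conn, url):
    """Flag a link as done so it is not picked again."""
    conn.execute("UPDATE links SET crawled = 1 WHERE url = ?", (url,))
    conn.commit()


def main_keywords(text, n=5):
    """Step 5: naive keyword extraction by word frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO links (url, priority) VALUES (?, ?)",
        [("http://example.com/a", 1), ("http://example.com/b", 5)],
    )
    url = next_link(conn)          # picks the priority-5 link first
    print(url)
    mark_crawled(conn, url)
    print(next_link(conn))         # then the remaining priority-1 link
    print(main_keywords("crawler crawler pages pages pages index", n=2))
```

The actual fetching (step 2) and content-area detection (step 4) are left out on purpose, since those depend on the sites involved.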
I'm expecting this to be a 1-2 month project.
Is there anyone here who could get this done and provide a quote?