Folks,
To build a web crawler like Google Bot, I made the following notes:
* Teach bot not to crawl links it has already crawled (a duplicate-links filter).
* Teach bot to identify file types, so it does not crawl download links.
* Teach bot to obey the rules listed in the robots.txt file. That means teaching it the file's instruction format so it can make sense of the rules.
* Teach bot to avoid loop traps.
* Teach bot to avoid crawling large files that would overload it or drain its resources.
* Teach bot to stay on the starting domain when only the initial website is being crawled.
* Teach bot to learn word synonyms to figure out what the crawled page is about.
* Teach bot not to visit links that are over 255 chars in length.
* Teach bot not to lose control of itself (I don't want crooks using the bot in any way to spread spam, malware or viruses. But how do I do this?).
* Teach bot to …….
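A few of the items above (duplicate filter, file-type check, robots.txt rules, stay on domain, 255-char limit) can be combined into one link filter. Below is a minimal Python sketch using only the standard library, not a finished crawler. The class name `LinkFilter`, the user agent `MyBot`, and the extension list are made up for illustration, and it assumes the robots.txt content has already been downloaded and is passed in as a list of lines.

```python
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser

MAX_URL_LENGTH = 255  # limit taken from the list above
# Hypothetical set of extensions treated as download links; adjust as needed.
SKIPPED_EXTENSIONS = {".zip", ".exe", ".pdf", ".iso", ".mp4"}


class LinkFilter:
    """Decides whether a discovered link should be queued for crawling."""

    def __init__(self, start_url, robots_lines=None, user_agent="MyBot"):
        self.start_host = urlparse(start_url).netloc.lower()
        self.user_agent = user_agent
        self.seen = set()  # normalized URLs that were already accepted
        self.robots = RobotFileParser()
        # Assumption: the caller fetched robots.txt and passes its lines here.
        self.robots.parse(robots_lines or [])

    def normalize(self, base_url, href):
        """Resolve a relative link and drop the #fragment part."""
        absolute = urljoin(base_url, href)
        without_fragment, _ = urldefrag(absolute)
        return without_fragment

    def should_crawl(self, base_url, href):
        url = self.normalize(base_url, href)
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            return False  # mailto:, javascript:, ftp: and so on
        if len(url) > MAX_URL_LENGTH:
            return False  # overly long links
        if parts.netloc.lower() != self.start_host:
            return False  # stay on the starting domain
        path = parts.path.lower()
        if any(path.endswith(ext) for ext in SKIPPED_EXTENSIONS):
            return False  # looks like a download link
        if not self.robots.can_fetch(self.user_agent, url):
            return False  # disallowed by robots.txt
        if url in self.seen:
            return False  # duplicate link
        self.seen.add(url)
        return True
```

Usage would look something like `f = LinkFilter("https://example.com/", robots_lines)` followed by `f.should_crawl(page_url, href)` for every link found on a page. This sketch does not cover loop traps or large files; a crawl-depth limit and a check of the `Content-Length` header before downloading are common ways to approach those, but they are not shown here.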
What else should be on my list?