Folks,
As you know, I am trying to build my own web crawler.
I am not too bothered about the ranking algorithm; I have my own up my sleeve. Plus, I can worry about the ranking algo once the indexing is finished, and that won’t happen unless I complete my crawler first.
I was originally planning to associate keywords/phrases with a link on your website (e.g. your site homepage) from three sources: the anchor texts of all links throughout your website that point to that link, the anchor texts of all links on other sites (foreign domains) that point to it, and the keywords/phrases found in the page’s own meta tags.
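To make the plan above concrete, here is a rough sketch using only Python's standard library. The names `AnchorAndMetaParser`, `index_page` and the in-memory `index` dict are made up for illustration; a real crawler would also need fetching, URL normalisation and persistent storage, none of which is shown here.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorAndMetaParser(HTMLParser):
    """Collects (target URL, anchor text) pairs and meta terms from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # (absolute target URL, anchor text)
        self.meta_terms = []  # terms from this page's own meta tags
        self._href = None     # target of the <a> currently open, if any
        self._text = []       # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = urljoin(self.base_url, attrs["href"])
            self._text = []
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "keywords":
                # Keywords are conventionally comma-separated.
                self.meta_terms += [k.strip() for k in content.split(",") if k.strip()]
            elif name == "description" and content.strip():
                self.meta_terms.append(content.strip())

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.links.append((self._href, text))
            self._href = None


def index_page(index, page_url, html):
    """Add one crawled page's meta terms and outgoing anchor texts to the index."""
    parser = AnchorAndMetaParser(page_url)
    parser.feed(html)
    # Meta terms describe the crawled page itself.
    index[page_url].update(t.lower() for t in parser.meta_terms)
    # Anchor texts describe the pages being linked to, internal or foreign.
    for target, text in parser.links:
        index[target].add(text.lower())


# Maps URL -> set of associated keywords/phrases.
index = defaultdict(set)
```

Calling `index_page(index, url, html)` for every crawled page, whether on your site or on a foreign domain, fills `index` from all three sources, because each anchor text is filed under the URL it points to rather than the page it was found on.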
But what if a new page on your website pops up on the internet and it has no meta tags and no foreign domains linking to it? In that case, the only keywords/phrases I can associate with that new link are the anchor texts of the links pointing to it from within your website. If only a handful of other pages link to it, then only a handful of keywords/phrases get associated with it. That is no good.
I know Google and the like analyse the crawled page’s content using word synonyms and associate those synonym keywords with the page. That way, the new page has a better chance of being found under any of those keywords. But I ain’t getting into synonyms yet.
And so my main question is:
What other data should I associate with a crawled page, apart from the anchor texts of all internal pages linking to it and the page’s own meta tags?