
web crawler issues

I am hosting a site that contains a lot of PDFs.

My bandwidth is getting hosed by crawlers because they keep downloading the PDFs.

Is there a way to check an IP address and determine whether it belongs to an actual person or a web crawler?

Can I do this in PHP?

If not, is there another resource I can look into for fixing this problem?

PHP

4 Comment(s)

@DARTHTAMPON (author), May 03, 2006: I have to be able to do it on the server side, though. I OCR my PDFs and need the text to be available to the crawlers, but not the PDFs themselves.
@balloonbuffoon, May 03, 2006: Why don't you check the referrer and make sure it's from your domain? You only want these PDFs linked from your own pages, right?

--Steve
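
A minimal sketch of the referrer check suggested here, assuming PHP 7.1 or later. The function name `is_own_referrer` and the host names are hypothetical and are not from the thread. The `Referer` header is supplied by the client, so it can be missing or forged; treat this as a rough filter only.

```php
<?php
// Hypothetical helper: true when the Referer header points at one of our own hosts.
function is_own_referrer(?string $referer, array $allowedHosts): bool
{
    if ($referer === null || $referer === '') {
        return false; // no Referer sent (direct visit, privacy setting, many crawlers)
    }
    $host = parse_url($referer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // malformed URL or no host component
    }
    return in_array(strtolower($host), $allowedHosts, true);
}

// Possible usage in the script that serves the PDF (host names are placeholders):
// $ok = is_own_referrer($_SERVER['HTTP_REFERER'] ?? null, ['www.example.com', 'example.com']);
```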
@DARTHTAMPON (author), May 04, 2006: Yes, but the way the setup works is that users can access the PDFs from anywhere. The text is hidden behind the PDF for crawlers. I just need a way to tell whether a visitor is a crawler and, if so, leave the PDF code out of the page. The way I am doing it now is to look at my server files and see which IPs are hitting my site the most. Then I do a whois on each IP to get the range of IPs used by the search company. With this information, I have a script that checks the user's IP against my list of crawlers and, on page load, cuts the PDF code so that only the text is sent.
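
The IP check described above might look roughly like the following sketch. It assumes IPv4 addresses, ranges written in CIDR notation, and a 64-bit PHP 7.1+ build. The function names and the ranges in the usage comment are hypothetical placeholders (documentation address blocks), not real crawler ranges and not the script from this post.

```php
<?php
// True if the IPv4 address falls inside the CIDR range, e.g. "192.0.2.0/24".
function ip_in_cidr(string $ip, string $cidr): bool
{
    // A bare address with no "/bits" part is treated as a /32.
    [$subnet, $bits] = array_pad(explode('/', $cidr, 2), 2, '32');
    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    if ($ipLong === false || $subnetLong === false || !ctype_digit($bits)) {
        return false; // malformed address or prefix length
    }
    $bits = (int) $bits;
    if ($bits > 32) {
        return false;
    }
    // Build the netmask; a /0 mask matches every address.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// True if the address is inside any range on the hand-maintained crawler list.
function is_known_crawler_ip(string $ip, array $ranges): bool
{
    foreach ($ranges as $cidr) {
        if (ip_in_cidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}

// Possible usage on page load (placeholder ranges):
// $sendTextOnly = is_known_crawler_ip($_SERVER['REMOTE_ADDR'], ['192.0.2.0/24']);
```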

The way I am doing it now works, but it can be time-consuming. It's a never-ending job too, since I have to check my server files every day to keep the crawlers from getting my PDFs.

Does anybody know of a way to determine, from an IP address or a whois lookup, whether the visitor is a crawler or a regular person?
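
One technique that avoids maintaining the list by hand, and which is not what the post above does, is a forward-confirmed reverse DNS lookup: resolve the IP to a host name, check that the name belongs to a crawler's domain, then resolve the name back and confirm it returns the same IP. Google, for example, documents this method for verifying Googlebot. The sketch below assumes PHP 7.1+. The function name, the injectable resolver parameters, and the domain in the usage comment are assumptions for illustration; the domains to trust must come from each search engine's own documentation.

```php
<?php
// Hypothetical helper: forward-confirmed reverse DNS check for a crawler IP.
// $reverse and $forward default to PHP's gethostbyaddr() and gethostbynamel();
// they are parameters only so the logic can be exercised without network access.
function is_verified_crawler(
    string $ip,
    array $crawlerDomains,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns the IP unchanged when
    // there is no PTR record, and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // Step 2: the host name must end in one of the trusted crawler domains.
    $matches = false;
    foreach ($crawlerDomains as $domain) {
        $suffix = '.' . strtolower($domain);
        if (substr($host, -strlen($suffix)) === $suffix) {
            $matches = true;
            break;
        }
    }
    if (!$matches) {
        return false;
    }

    // Step 3: forward lookup must include the original IP, otherwise the
    // PTR record could simply have been faked by whoever controls that IP.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}

// Possible usage (domain list is an assumption to verify per search engine):
// $isBot = is_verified_crawler($_SERVER['REMOTE_ADDR'], ['googlebot.com']);
```

DNS lookups are slow compared with a list check, so caching the result per IP would probably be worthwhile on a busy site.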