
Making Sphider ignore disallowed pages?

To be clear: I want to ignore the disallow instruction and go ahead and retrieve the page anyway.

I tried Sphider, Sphider-plus, and some mods that claim to "ignore robots", but none of them seem to be enough.
I'm trying to index a third-party website to help users find other people's posts, since the owner seems too busy with the "sales, sales, sales" side of things.
The problem is it seems they deliberately don't want us to find help, because they also added a "disallow" rule.

I can browse the pages in a normal browser, and I even changed Sphider's user agent to Firefox's, with no success.

Is it even possible to crawl a website as if it were a browser, beyond faking the user agent? In other words: how many ways does a server have to figure out whether it's a robot or a human reading the pages?
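On that last question: the user agent is only one signal. A server can also look at the other headers a real browser always sends (Accept, Accept-Language, cookies, Referer), at request timing, and at whether JavaScript runs. A minimal sketch of sending browser-like headers with Python's stdlib (the URL is a placeholder, not the real site):

```python
import urllib.request

# Hypothetical URL, for illustration only.
URL = "http://example.com/forum/viewtopic.php?id=1"

# A spoofed User-Agent alone still stands out if the other headers a
# real browser sends are missing; supply a plausible set of them.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:10.0) "
                  "Gecko/20100101 Firefox/10.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

# Build the request object; no network traffic happens until it is opened.
req = urllib.request.Request(URL, headers=BROWSER_HEADERS)

# urllib normalizes header names to "Xxxx-yyyy" capitalization.
print(req.get_header("User-agent"))
```

Even with all of these set, a server that requires JavaScript (or checks behavioral signals like request rate) can still tell a simple crawler apart from a browser.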

I could be wrong about all this, and there could be other instructions/rules in robots.txt or somewhere else, but bear with me.

Thanks.

PHP

3 Comment(s)

@Charles (Mar 22, 2012) — The internet is designed to spread information, not keep it safe. As with life itself, the best that you can do is ask politely for the spiders to leave you alone.
@sergiozambrano (author) (Mar 22, 2012) — [QUOTE]The internet is designed to spread information, not keep it safe. As with life itself, the best that you can do is ask politely for the spiders to leave you alone.[/QUOTE]
Ahem… amen?

What?

Did you read my description or just the title?

Is that an answer? Or just your signature on an empty post?
@sergiozambrano (author) (Mar 27, 2012) — Stupidly, I didn't check HOW the links appear, just where they pointed.

It seems the links open the pages I want with JavaScript, which Sphider can't process.

At least now I know how the pages are addressed, and I can increment the query string while downloading. That won't index the original pages, but I'll be able to build a DB I can work with.

Is there any PHP script or Mac software (or a Firefox/Chrome extension?) that can download webpages from a URL range?

Any ideas?
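Incrementing a query-string id and saving each page is a few lines in any scripting language. A minimal sketch with Python's stdlib, where the base URL pattern is a hypothetical stand-in for the real site's pattern:

```python
import urllib.error
import urllib.request

# Hypothetical URL pattern; adjust the base and the id range to the real site.
BASE = "http://example.com/post.php?id={}"

def build_urls(start, end):
    """Return the page URLs for the inclusive id range start..end."""
    return [BASE.format(i) for i in range(start, end + 1)]

def download_range(start, end, ua="Mozilla/5.0"):
    """Fetch every page in the range and save each one as page_<id>.html."""
    for i in range(start, end + 1):
        url = BASE.format(i)
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                with open("page_{}.html".format(i), "wb") as out:
                    out.write(resp.read())
        except urllib.error.URLError as err:
            # Skip ids that 404 or time out instead of aborting the run.
            print("skipping", url, "-", err)

if __name__ == "__main__":
    download_range(1, 10)
```

If you'd rather not write a script at all, curl has built-in URL globbing that does the same thing from the command line, e.g. `curl -o "page_#1.html" "http://example.com/post.php?id=[1-100]"`.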