
Screen Scraping Problems

I'm trying to scrape the data from [URL="http://www.stjohns.ca/business/businessdirectory/index.jsp?actionmode=&pageNumber=&printNumber=&keyWords=&street=&businessClass=0&businessClassSelect=0&businessSubCategorySelect=0&selectedBusinessClass=+++++++++++++++++++++++++&frmButton=List+All"]this page[/URL]. However, when I use PHP's file_get_contents($url), it only retrieves the side menus, the header and the footer. All the business data is skipped.

To be sure that this wasn't caused by some quirk in PHP, I also used a Ruby script to do the same thing. Ruby also skips the business data. Any ideas on what's going on here? I need to be able to do this in PHP.

PHP

5 Comment(s)

@bonecone (author), Nov 14, 2009: Okay, I found what was causing the problem, but I don't know what to do about it.

When you go to http://www.stjohns.ca/business/businessdirectory/index.jsp, you are given a randomly generated session variable. Then when you click on the "List All" button it checks to see if this variable has been set. If not, then you are redirected back to the index.jsp page. This prevents you from linking directly to search results.

So, is there any way of getting around this one?
@donatello, Nov 15, 2009: Looks like they did that to stop people like us from screen scraping.

You can try putting it into an iframe...

I happened to be fiddling with some screen scrapers when I spotted your post, so I tried it with two of them... same issue.
@bonecone (author), Nov 15, 2009: I found a solution afterwards. I downloaded a PHP class called PHPCrawl, which is able to accept cookies just like a browser. First I got the crawler to visit the search page in order to get the cookie. Then I got it to visit the search results page, and it was able to retrieve all of the source.
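
For anyone who wants to try the same two-step cookie round trip without PHPCrawl, here is a rough sketch using only PHP's built-in stream functions. It is not the code I actually ran: the function names cookieHeaderFrom and fetchWithSession are made up for illustration, and it assumes the site tracks the session with an ordinary Set-Cookie header.

```php
<?php
// Hypothetical helper: build the value of a "Cookie:" request header from raw
// response header lines such as "Set-Cookie: JSESSIONID=abc; Path=/".
function cookieHeaderFrom(array $responseHeaders): string
{
    $pairs = [];
    foreach ($responseHeaders as $line) {
        // Header names are case-insensitive, so match without regard to case.
        if (stripos($line, 'Set-Cookie:') === 0) {
            $value = trim(substr($line, strlen('Set-Cookie:')));
            // Keep only the leading name=value part; drop Path, Expires, etc.
            $pairs[] = trim(explode(';', $value, 2)[0]);
        }
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: visit $entryUrl first to obtain the session cookie,
// then request $resultsUrl with that cookie attached.
// Returns the page source as a string, or false on failure.
function fetchWithSession(string $entryUrl, string $resultsUrl)
{
    $headers = get_headers($entryUrl);
    if ($headers === false) {
        return false;
    }

    $http = ['method' => 'GET'];
    $cookies = cookieHeaderFrom($headers);
    if ($cookies !== '') {
        $http['header'] = 'Cookie: ' . $cookies . "\r\n";
    }

    $context = stream_context_create(['http' => $http]);
    return file_get_contents($resultsUrl, false, $context);
}
```

You would call it as fetchWithSession($searchPageUrl, $listAllUrl), where the first URL is the index.jsp search page and the second is the "List All" results URL. If the site also checks other things (referrer, hidden form fields), this sketch would need more work.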
@donatello, Nov 16, 2009: Do you have a link to that PHP class?

Sounds like something fun to play with next weekend.