
PHP Web Scraping

Are there any good resources for learning web scraping with PHP? I’m mostly looking for a good book, but good articles would be welcome too.

What legal issues are there with data scraping? From what I’ve read, scraping itself seems to be fine; what matters is how you use the information you scrape.

PHP


@ShrineDesigns (Apr 14, 2014): If you are scraping via HTTP, the specification documents are a good starting point. If you are gonna be scraping over protocols other than HTTP (e.g. FTP, or raw TCP/UDP), or would like to get down and dirty with scraping, I would recommend using the sockets and streams extensions over the cURL extension. I've written wrapper classes for HTTP over sockets, so I'm sure I can answer some questions on that aspect.
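
For anyone wondering what "HTTP over sockets" looks like in practice, here is a minimal sketch using PHP's streams API. It is not ShrineDesigns' wrapper class, just an illustration. The function names (`buildGetRequest`, `parseResponse`, `fetchOverSocket`) and the User-Agent string are made up for this example. It assumes plain HTTP on port 80 and sends an HTTP/1.0 request so that the server does not answer with chunked transfer encoding, which this sketch does not decode.

```php
<?php

/** Build a minimal GET request. HTTP/1.0 keeps the reply un-chunked. */
function buildGetRequest(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "User-Agent: example-scraper/0.1\r\n" // hypothetical UA string
         . "Connection: close\r\n\r\n";
}

/** Split a raw response into status code, headers (lower-cased names) and body. */
function parseResponse(string $raw): array
{
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines = explode("\r\n", $head);
    // The status line looks like "HTTP/1.0 200 OK".
    $parts = explode(' ', (string) array_shift($lines), 3);
    $status = isset($parts[1]) ? (int) $parts[1] : 0;

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed header lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

/** Open a TCP connection, send the request, read until the server closes. */
function fetchOverSocket(string $host, string $path = '/', int $port = 80, float $timeout = 10.0): array
{
    $socket = stream_socket_client("tcp://{$host}:{$port}", $errno, $errstr, $timeout);
    if ($socket === false) {
        throw new RuntimeException("Connect failed: {$errstr} ({$errno})");
    }
    stream_set_timeout($socket, (int) $timeout);
    fwrite($socket, buildGetRequest($host, $path));
    $raw = stream_get_contents($socket);
    fclose($socket); // free the connection as soon as we are done

    return parseResponse($raw === false ? '' : $raw);
}
```

Calling `fetchOverSocket('example.com')` should give you an array with `status`, `headers` and `body` keys. The upside of this approach is full control over every byte sent; the downside is that redirects, TLS, compression and chunked bodies are all on you, which is the work cURL normally does for free.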
@jedaisoul (Apr 14, 2014): The legal issue is that nowadays there is generally a presumption of copyright. You may not reproduce whole articles, etc. Even partial extracts can breach copyright if they materially damage the owner's commercial interests, e.g. by giving away the ending of a book or film.
@VNAsian (author, Apr 15, 2014): Thanks for the posts, guys. I've started fooling around with the cURL library. What are the advantages of using sockets over cURL?

What about copyright on images? Let's say I crawl social networks looking for images and information on specific people whose names were typed into a text box.
@mukeshpatel (Jul 25, 2014): If you are using web scraping or web extraction to improve your business, then it's fine.

Most people use web scraping or web extraction tools for e-commerce or online stores, to compare product prices.

I have a list of articles on web data extraction, but I cannot post links here. If a moderator allows me to post a link, I will share it; otherwise, contact me.

Thanks!
@KalobTaulien (Aug 01, 2014): Hey,

Here's a great resource for gathering and parsing DOM: http://simplehtmldom.sourceforge.net/

It handles images, meta tags, etc., with selectors very similar to jQuery's.
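
To give a feel for DOM-based extraction, here is a rough sketch that pulls image URLs and meta tags out of a page. Note that it deliberately does not use Simple HTML DOM; it uses PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) with XPath queries instead of jQuery-style selectors, so it assumes that extension is enabled. The function name `extractImagesAndMeta` is invented for this example.

```php
<?php

/**
 * Collect <img src> URLs and <meta name/content> pairs from an HTML string.
 * Uses the bundled DOM extension, not Simple HTML DOM.
 */
function extractImagesAndMeta(string $html): array
{
    $result = ['images' => [], 'meta' => []];
    if (trim($html) === '') {
        return $result; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is often invalid: buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Every <img> that actually has a src attribute.
    foreach ($xpath->query('//img[@src]') as $img) {
        $result['images'][] = $img->getAttribute('src');
    }

    // Every <meta> with both name and content, keyed by name.
    foreach ($xpath->query('//meta[@name][@content]') as $tag) {
        $result['meta'][$tag->getAttribute('name')] = $tag->getAttribute('content');
    }

    return $result;
}
```

You would pass it the body of a page you have already downloaded, e.g. `extractImagesAndMeta($response['body'])`. XPath is more verbose than CSS-style selectors, but it ships with PHP and copes reasonably well with sloppy markup.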

[B]Warning:[/B] Scraping is taxing on both your bandwidth and your CPU. Remember to free resources when possible.

Disclaimer: You should always have permission from web admins and owners to scrape their site. Copyrights and such can make life a nightmare if you cross the wrong person.
@tracknut (Aug 01, 2014):
[QUOTE]What about any copyright on images? Let's say I crawl social networks looking for images and information on specific people whose names were typed into a text box.[/QUOTE]

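Assume any image is copyrighted. You can use it only if you have the owner's explicit permission, the owner has released it under a license that allows your use, or it is in the public domain (for example, much of what is on Wikimedia Commons is freely licensed or public domain). Stuff you just find on social networks is almost always going to be copyrighted.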
@Gravy (Aug 03, 2014): My rules for scraping:

• Don't publish images or text that you scrape if the owner doesn't want you to.
  - I would not use images scraped from a social media site (people like their privacy, and you don't know the source of many images).
  - I may consider using product images, prices and such scraped from stores for a review site (they want ads).

• Don't scrape too fast; be patient and don't put stress on their servers.
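
The "don't scrape too fast" rule can be enforced in code rather than left to good intentions. Below is a small sketch of a per-host throttle that makes sure at least a given number of seconds pass between two requests to the same host. The class name `HostThrottle` and its interface are hypothetical, and the right delay depends on the site; the clock and sleep functions are injectable only so the logic can be checked without actually waiting.

```php
<?php

/** Enforces a minimum delay between consecutive requests to the same host. */
final class HostThrottle
{
    private $minDelay;
    private $lastRequestAt = [];
    private $clock;
    private $sleeper;

    public function __construct(float $minDelaySeconds, ?callable $clock = null, ?callable $sleeper = null)
    {
        $this->minDelay = $minDelaySeconds;
        // Defaults use the real clock and a real sleep.
        $this->clock = $clock ?? function (): float {
            return microtime(true);
        };
        $this->sleeper = $sleeper ?? function (float $seconds): void {
            usleep((int) round($seconds * 1000000));
        };
    }

    /** Blocks until it is polite to hit $host again; returns the seconds waited. */
    public function wait(string $host): float
    {
        $now = ($this->clock)();
        $waited = 0.0;

        if (isset($this->lastRequestAt[$host])) {
            $remaining = $this->minDelay - ($now - $this->lastRequestAt[$host]);
            if ($remaining > 0) {
                ($this->sleeper)($remaining);
                $waited = $remaining;
            }
        }

        // Record when this request is actually going out.
        $this->lastRequestAt[$host] = $now + $waited;

        return $waited;
    }
}
```

In a crawl loop you would create one instance, e.g. `$throttle = new HostThrottle(2.0);`, and call `$throttle->wait($host)` right before each request. Tracking the delay per host means a slow site does not hold up requests to other sites.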

Some people say to NEVER use regex for scraping and to only use a DOM parser. Each approach has its advantages and disadvantages. Many sites aren't coded well and are difficult to scrape data from, especially those whose structure changes frequently. For many of these sites, and for throwaway projects, regex can be a good option.
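
As an illustration of the regex side of that trade-off, here is a quick sketch that grabs dollar prices from a page no matter which tags wrap them. The function name `extractPrices` and the price format (a `$` sign, optional thousands separators, optional two-digit cents) are assumptions for the example; a real site may format prices differently.

```php
<?php

/**
 * Pull dollar prices such as "$1,299.00" or "$5" out of arbitrary markup.
 * Returns floats in the order they appear in the document.
 */
function extractPrices(string $html): array
{
    // Group 1: digits with optional ",ddd" groups. Group 2: optional ".dd" cents.
    $pattern = '/\$\s?(\d+(?:,\d{3})*)(\.\d{2})?/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $prices = [];
    foreach ($matches as $m) {
        // Strip thousands separators, then re-attach the cents if present.
        $number = str_replace(',', '', $m[1]) . ($m[2] ?? '');
        $prices[] = (float) $number;
    }

    return $prices;
}
```

Because the pattern ignores the surrounding markup, it keeps working when the site reshuffles its `div`s and classes, which is where DOM selectors tend to break. The flip side is that it will happily match any dollar amount on the page, including ones you did not want, so it needs a sanity check on the results.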
@servicesindia (Aug 07, 2014): Thanks for this post. For scraping in PHP, read this: Programmer
@fellowthetru (Aug 07, 2014): I do agree with Gravy.

But consider this (IMHO) before you start to learn:

[B]The Good[/B] – There’s not much that’s good about web scraping. Unless you’re looking to use unsavory tactics, steal competitors’ content and pricing, or use other sites’ intellectual property, web scraping is just all around bad.

[B]The Bad[/B] – The really bad news about web scraping is that it can lead to the theft of your content, which, if used on other sites, can significantly affect your SEO performance and rankings. It can also give competitors access to your proprietary pricing and product information, which ultimately gives them a leg up in the marketplace with the customers you’re actively seeking.

[B]The Ugly[/B] – In a nutshell, web scraping can be a huge detriment to your brand. It can threaten your sales and conversions, lower your site’s SEO rankings or even get you blacklisted, negate the benefits of the content you’ve worked hard to produce, and can cause you to spend even more resources to make up for its damaging effects.