
Mass scraping for image checks

This sounds like a lot more work than it is worth, but I was “tasked” by my
company to crawl our entire domain and look for lost images (we go through
Scene7 by Adobe, if anyone is familiar with it). This seems like a fairly major
undertaking, so thanks to everyone [b]well[/b] in advance. I have a PHP
script that was posted in an old thread and need to expand on it dramatically.

First things first, though: I get a division by zero error on line 28.
[color=red]FIXED: Someone forgot to define block size =D[/color]

[code=php]
<?php

// First set
$urls[0] = 'www.domain.com/shoes.html';
$urls[1] = 'www.domain.com/guys.html';
$urls[2] = 'www.domain.com/girls.html';
$urls[3] = 'www.domain.com/youth.html';
$urls[4] = 'www.domain.com/skate.html';
$urls[5] = 'www.domain.com/dyoc';

// create all the individual cURL handles and set their options
$curl_handles = array();
foreach ($urls as $url) {
    $curl_handles[$url] = curl_init();
    curl_setopt($curl_handles[$url], CURLOPT_URL, $url);
    // return the response body instead of printing it straight to the output
    curl_setopt($curl_handles[$url], CURLOPT_RETURNTRANSFER, true);
    // set other curl options here
}

// start going through the cURL handles and running them
$curl_multi_handle = curl_multi_init();

// someone forgot to define block size =D
// (defined once, before the loop; with this value everything runs as a single block)
$block_size = count($urls);

$i = 0; // count where we are in the list so we can break up the runs into smaller blocks
$block = array(); // to accumulate the curl_handles for each group we'll run simultaneously

foreach ($curl_handles as $a_curl_handle) {
    $i++; // increment the position-counter

    // add the handle to the curl_multi_handle and to our tracking "block"
    curl_multi_add_handle($curl_multi_handle, $a_curl_handle);
    $block[] = $a_curl_handle;

    // check to see if we've got a "full block" to run or if we're at the end of our list of handles
    if (($i % $block_size == 0) or ($i == count($curl_handles))) {
        // run the block
        $running = null;
        do {
            // track the previous loop's number of handles still running so we can tell if it changes
            $running_before = $running;

            // run the block or check on the running block and get the number of sites still running in $running
            curl_multi_exec($curl_multi_handle, $running);

            // if the number of sites still running changed, print out a message with the number of sites that are still running
            if ($running != $running_before) {
                echo("Waiting for $running sites to finish...\n");
            }

            // wait for activity on any handle instead of spinning the CPU
            if ($running > 0) {
                curl_multi_select($curl_multi_handle, 1.0);
            }
        } while ($running > 0);

        // once the number still running is 0, curl_multi_ is done, so check the results
        foreach ($block as $handle) {
            // HTTP response code
            $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);

            // cURL error number
            $curl_errno = curl_errno($handle);

            // cURL error message
            $curl_error = curl_error($handle);

            // output if there was an error
            if ($curl_error) {
                echo("&nbsp;&nbsp;&nbsp;*** cURL error: ($curl_errno) $curl_error\n");
            }

            // remove the (used) handle from the curl_multi_handle
            curl_multi_remove_handle($curl_multi_handle, $handle);
        }

        // reset the block to empty, since we've run its curl_handles
        $block = array();
    }
}

// close the curl_multi_handle once we're done
curl_multi_close($curl_multi_handle);
?>
[/code]
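
For what it's worth, the modulo check in the script is only there to split the handles into blocks, and that is where the division by zero came from. Here is a minimal sketch of the same grouping done with array_chunk(), which refuses a block size below 1 instead of failing inside the loop. The helper name split_into_blocks() and the block size of 2 are made up for illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper (not from the original script): split a URL list
// into blocks of at most $block_size entries.
function split_into_blocks(array $urls, int $block_size): array
{
    // Guard against the undefined/zero block size that caused the error.
    if ($block_size < 1) {
        throw new InvalidArgumentException('block size must be at least 1');
    }
    return array_chunk($urls, $block_size);
}

// Placeholder page names, block size of 2 chosen arbitrarily.
$blocks = split_into_blocks(
    array('a.html', 'b.html', 'c.html', 'd.html', 'e.html'),
    2
);

foreach ($blocks as $n => $block) {
    echo "Block $n: " . implode(', ', $block) . "\n";
}
```

Each block could then be added to the curl_multi handle and run before moving on to the next one, which is presumably what the counter logic was aiming for.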

So I guess the first milestone is to start reading the DOM for URLs to add to
$urls[]. I can see the immediate need for array_unique() on each iteration.
My biggest “issue” is running through thousands of links without visiting the
same ones repeatedly. A comparison array?
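
One possible shape for the "comparison array" idea, as a rough sketch rather than a finished crawler: pull href and src values out of a page with DOMDocument, and keep every URL already queued as a key in an associative array so that isset() can reject repeats. The function names extract_urls() and enqueue_new() and the sample HTML are assumptions for illustration. Relative URLs are not resolved against the domain here, which a real crawl would need.

```php
<?php
// Hypothetical helper: collect link and image URLs from an HTML string.
function extract_urls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so suppress the parser warnings.
    @$doc->loadHTML($html);

    $found = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $found[] = $href;
        }
    }
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $found[] = $src;
        }
    }
    return $found;
}

// Hypothetical helper: queue only URLs that have not been seen before.
// $seen is the comparison array; URLs are its keys, so each check is one isset().
function enqueue_new(array $candidates, array &$seen, array &$queue): void
{
    foreach ($candidates as $url) {
        if (!isset($seen[$url])) {
            $seen[$url] = true;
            $queue[] = $url;
        }
    }
}

$seen  = array();
$queue = array();

// Sample markup standing in for a fetched page.
$html = '<a href="/shoes.html">Shoes</a><a href="/guys.html">Guys</a>'
      . '<a href="/shoes.html">Shoes again</a><img src="/img/logo.png">';

enqueue_new(extract_urls($html), $seen, $queue);
enqueue_new(array('/guys.html', '/girls.html'), $seen, $queue);

print_r($queue);
```

Compared with calling array_unique() on every iteration, the keyed array avoids rescanning the whole list each time, which should matter once the list reaches thousands of links.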
