
PHP DOMDocument.loadHTMLFile() – ERROR

[code=php]
# Load a queued address into a DOMDocument
function loadHTML($url) {

    # Create the DOM document object
    $this->dom = new DOMDocument();

    # Load the HTML file into the DOM document with errors suppressed:
    # bad markup usually triggers various DOM warnings, and we cannot control bad markup ;)
    @$this->dom->loadHTMLFile($url);

    # Create an XPath object for querying the DOM tree
    $this->xpath = new DOMXPath($this->dom);

    $this->address = $url;
}
[/code]
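The `@` operator in the method above hides markup warnings, but it also hides a failed load. A minimal sketch of one alternative, assuming the HTML is already in a string: libxml's internal error buffer collects the warnings so they can be inspected or discarded without reaching the console. The helper name `loadHtmlCollectingErrors` is hypothetical and not part of the crawler.

```php
<?php
// Hypothetical helper: parse HTML while collecting libxml errors
// instead of silencing everything with the @ operator.
function loadHtmlCollectingErrors(string $html): array
{
    // Route libxml warnings into an internal buffer; remember the old setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    // Copy the buffered errors (an array of LibXMLError objects), then reset the buffer.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // A failed parse is reported as null rather than hidden.
    return [$loaded ? $dom : null, $errors];
}
```

The returned error list can be logged only for pages that fail, which keeps the status checkpoints readable without losing the failure itself.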

I am working on a custom crawler solution for my company, and all is well thus far (kinda). I'm using DOMDocument::loadHTMLFile() with XPath queries to fetch the data from the HTML elements.

Anyway, the deal is this. When constructing the page queue, I collect all the links relative to the current URL and compile them. Later I grab 25 in a batch and crawl those, and once they are done I grab another 25. I enabled error suppression on the DOMDocument object because shoddy HTML markup fills the console with errors while I'm printing out status checkpoints.
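For reference, the batching described above could look roughly like the sketch below. This is not the actual crawler code: `crawlInBatches` and the `$crawlPage` callback (which is assumed to return the links found on a page) are hypothetical names used only for illustration.

```php
<?php
// Hypothetical sketch of a batched crawl queue.
// $crawlPage is assumed to take a URL and return an array of links found on it.
function crawlInBatches(array $startUrls, callable $crawlPage, int $batchSize = 25): array
{
    $queue = array_values(array_unique($startUrls));
    $seen = array_fill_keys($queue, true); // URLs already queued or crawled
    $batches = [];

    while ($queue !== []) {
        // Take up to $batchSize URLs off the front of the queue.
        $batch = array_splice($queue, 0, $batchSize);
        $batches[] = $batch;

        foreach ($batch as $url) {
            foreach ($crawlPage($url) as $link) {
                // Queue each newly discovered link exactly once.
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $queue[] = $link;
                }
            }
        }
    }

    // Return the batches in the order they were processed.
    return $batches;
}
```

Tracking seen URLs in an array keyed by URL keeps the duplicate check cheap even for a queue of several thousand pages.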

Now that you have the history, here is my problem. A series of pages (and their child pages) kept getting passed over, and the crawler would complete without any error output. That changed when I disabled DOMDocument error suppression and tested a parent page that had been parsed, but whose child pages had all been skipped.

[url]http://cableorganizer.com/video-projection-screens/artscreen.htm[/url] For example, this page loads fine when tested in a browser. If you load that link with DOMDocument::loadHTMLFile(), you immediately receive an error:

DOMDocument::loadHTMLFile() : Failed to load stream

I don't get it. If it works in the browser, and the crawler successfully crawls 4500+ pages without a hitch, why does this one page generate a failed stream error?
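One way to narrow this down is to fetch the page separately and look at the HTTP status before handing the markup to DOMDocument. The sketch below assumes the failure might come from the HTTP response (for example a non-200 status or a redirect), which the post does not confirm. The function names and the `ExampleCrawler/0.1` user agent are hypothetical.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from response header lines.
function parseStatusCode(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Redirects produce several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetch a URL and report status and body instead of failing silently.
function fetchForDiagnosis(string $url): array
{
    $context = stream_context_create(['http' => [
        'ignore_errors' => true,               // return the body even on 4xx/5xx responses
        'timeout'       => 10,
        'user_agent'    => 'ExampleCrawler/0.1', // assumption: any identifying string
    ]]);

    // Warnings are silenced here because the failure is reported via the return value.
    $body = @file_get_contents($url, false, $context);

    // The HTTP wrapper fills $http_response_header in the local scope.
    $headers = $http_response_header ?? [];

    return [
        'status' => parseStatusCode($headers),
        'body'   => $body === false ? null : $body,
    ];
}
```

If the status is 200 and the body is present, the string can be passed to DOMDocument::loadHTML() instead of loadHTMLFile(), which separates network problems from parsing problems.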

Anyone care to enlighten me?

At first I thought it was a URL length problem, but some of our URLs loaded with GET variable data are nearly twice the length. Maybe there is an unspoken structure bug or something? All help and suggestions are very much appreciated.

Thanks,
Chad

P.S. I had this same error earlier in the program with a less substantial page. I couldn't find a fix, so I tried renaming the page's file, and it worked. However, this page is indexed by Google and linked from various sections, so changing the file name is not an option.
