Parse HTML code to array

Hi,

I am developing an application that will be installed on a website with thousands of HTML pages and will extract the values and attributes of certain tags.

For example, <title> tags, meta tags, the href attributes of hyperlinks, etc.

Extracting simple tags like <title> is straightforward, but when it comes to tags like <a> and <img>, I find it hard to extract them properly.

At the moment, I can extract the <head>...</head> and <body>...</body> sections separately and build arrays of tags like <a> and <img>, but I couldn't separate their attributes.

Sample:

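[code=php]
$result = file_get_contents('http://www.google.com');

// Cut out everything from the opening <body tag through </body>.
$bodyStart = strpos($result, '<body');
$bodyEnd = strpos($result, '</body>');
$bodyEnd += 7; // strlen('</body>')
$bodyLength = $bodyEnd - $bodyStart;
$body = substr($result, $bodyStart, $bodyLength);

// Collect every <a>...</a> element (s: dot matches newlines, i: case-insensitive, U: ungreedy).
// \b keeps the pattern from also matching tags such as <abbr> or <address>.
preg_match_all('(<a\b.*</a>)siU', $body, $matching_data);[/code]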

The above outputs an array something like this:

array(1) {
[0]=>
array(28) {
[0]=>
string(96) "<a href="http://images.google.com/imghp?hl=en&tab=wi" onclick=gbar.qs(this) class=gb1>Images</a>"
[1]=>
string(91) "<a href="http://maps.google.com/maps?hl=en&tab=wl" onclick=gbar.qs(this) class=gb1>Maps</a>"
[2]=>
string(92) "<a href="http://news.google.com/nwshp?hl=en&tab=wn" onclick=gbar.qs(this) class=gb1>News</a>"
[3]=>


}
}

As you can see, that code can collect similar tags into separate arrays, but given a single array element, how do I separate its attributes?

I mean, I could explode the tag string on " or on spaces to parse the attributes. But as you can see, even Google omits the quotes where they would be better used, and some attributes have values that contain spaces.

So, is anyone aware of a method with some intelligence for grabbing attributes and their values? Regular expressions may help here, but they are still a mystery to me, and I don't know how to write a custom one.
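
For illustration, here is a minimal sketch of the regular-expression route asked about above. The function name parse_tag_attributes is hypothetical, and the pattern assumes it is given only the opening tag (text between tags that contains "=" would confuse it). It accepts double-quoted, single-quoted, and unquoted values, so a quoted value may contain spaces. It assumes PHP 7 or later for the type declarations.

[code=php]<?php
// Hypothetical helper: split one opening tag into an attribute => value map.
function parse_tag_attributes(string $tag): array
{
    // Group 1: attribute name.
    // Group 2: double-quoted value, group 3: single-quoted value,
    // group 4: unquoted value (runs until whitespace or ">").
    $pattern = '/([a-zA-Z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/';

    $attributes = array();
    if (preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Only one of groups 2-4 can hold the value; take the first non-empty one.
            $value = '';
            for ($i = 2; $i <= 4; $i++) {
                if (isset($m[$i]) && $m[$i] !== '') {
                    $value = $m[$i];
                    break;
                }
            }
            $attributes[strtolower($m[1])] = $value;
        }
    }
    return $attributes;
}

// Usage with one of the opening tags from the output above.
$tag = '<a href="http://images.google.com/imghp?hl=en&tab=wi" onclick=gbar.qs(this) class=gb1>';
var_dump(parse_tag_attributes($tag));
[/code]

This is only a sketch: a regular expression does not understand HTML comments, scripts, or badly broken markup, which is the usual argument for a real parser instead.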

An end result like the following would be great. Sorry, the var_dump below may be out of order; it is just to give you an idea of what I need in the end.

array(1) {
  ['html']=>
  array(2) {
    ['head']=>
    array(2) {
      ['meta']=>
      array(2) {
        ['keywords']
        ['description']
      }
      ['title'](string)
    }
    ['body']=>
    array(n) {
      ['links']=>
      array(n) {
        [0]=>
          href(string)

        [n]=>
          href(string)
      }
      ['imgs']=>
      array(n) {
        [0]=>
          src(string)
          alt(string)

        [n]=>
          src(string)
          alt(string)
      }
    }
  }
}

Thanks and Best Regards


3 Comments

@NogDog (Mar 08, 2009): Have you looked into the [url=http://php.net/dom]DOM[/url] classes/methods yet?
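
As a rough sketch of what the DOM route could look like for the structure requested in the question: the function name extract_page_data and the array layout are assumptions made for illustration, loosely following the requested var_dump (without the outer 'html' key). The "@" operator is used here only to hide parser warnings on imperfect markup, and the input is assumed to be a non-empty HTML string.

[code=php]<?php
// Hypothetical helper: collect title, meta tags, links and images from an HTML string.
function extract_page_data(string $html): array
{
    $dom = new DOMDocument();
    // "@" hides the parser warnings that real-world HTML tends to trigger.
    @$dom->loadHTML($html);

    $data = array(
        'head' => array('title' => '', 'meta' => array()),
        'body' => array('links' => array(), 'imgs' => array()),
    );

    // First <title> element, if any.
    $titles = $dom->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['head']['title'] = trim($titles->item(0)->textContent);
    }

    // <meta name="..." content="..."> pairs, keyed by lowercased name.
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name !== '') {
            $data['head']['meta'][$name] = $meta->getAttribute('content');
        }
    }

    // getAttribute() returns the value whether or not it was quoted in the source.
    foreach ($dom->getElementsByTagName('a') as $a) {
        $data['body']['links'][] = array('href' => $a->getAttribute('href'));
    }

    foreach ($dom->getElementsByTagName('img') as $img) {
        $data['body']['imgs'][] = array(
            'src' => $img->getAttribute('src'),
            'alt' => $img->getAttribute('alt'),
        );
    }

    return $data;
}

// Usage with a small made-up page.
$html = '<html><head><title>Demo</title><meta name="keywords" content="a, b"></head>'
      . '<body><a href=maps.html class=gb1>Maps</a><img src="x.png" alt="two words"></body></html>';
var_dump(extract_page_data($html));
[/code]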
@GUIR (author, Mar 08, 2009): Hi,

Thanks for your prompt reply.

I tried DOM, but so far my attempts at parsing HTML have just produced errors like the following. Either the scanned HTML pages do not follow standard HTML conventions, or I need to do something more...

[CODE]Warning: DOMDocument::loadHTML() [function.DOMDocument-loadHTML]: Unexpected end tag : input in Entity, line: 121 in /home/......./public_html/Stats/Spider/spider.php on line 40[/CODE]

[code=php]<?php
$result = file_get_contents('http://www.google.com');

$dom = new DOMDocument();
// Must be set before loading to have any effect.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($result);

// getElementsByTagName() takes the bare tag name, without angle brackets.
$content = $dom->getElementsByTagName('body');

$out = array();
foreach ($content as $item) {
    $out[] = $item->nodeValue;
}

var_dump($out);
?>[/code]
@NogDog (Mar 08, 2009): Those warnings aren't fatal, but of course if the document is sufficiently malformed then you might not be able to retrieve everything. You could try turning off strict validation just to see if that gets rid of the error. If so, it has probably parsed OK and is just not strictly valid.
[code=php]
$dom = new domDocument();
$dom->strictErrorChecking = false;
// rest of $dom stuff...
[/code]
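
If the warnings still appear with that setting, another option that may be worth trying is libxml_use_internal_errors(), which tells libxml to collect parse errors internally instead of emitting PHP warnings. The snippet below is a minimal sketch; the sample HTML string is made up for illustration, and which errors get reported depends on the installed libxml version.

[code=php]<?php
// Made-up sample with a stray </input> end tag and an unquoted href.
$html = '<html><body><p>Some text</input><a href=page.html>Link</a></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch whatever the parser complained about, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable despite the malformed markup.
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href'), "\n";
}
[/code]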