Hi,
I am developing an application that is to be installed on a website with thousands of HTML pages and extract the values/attributes of certain tags.
For example, <title> tags, meta tags, href attributes of hyperlinks, etc.
Extracting simple tags like <title> is straightforward, but when it comes to tags like <a>, <img>, etc., I find it hard to extract them properly.
At the moment, I am able to extract the <head>…</head> and <body>…</body> sections separately, and to build arrays of tags like <a> and <img>. But I couldn't separate their attributes.
Sample:
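[code=php]
// Fetch the page source.
$result = file_get_contents('http://www.google.com');
// Cut out everything from "<body" up to and including "</body>".
$bodyStart = strpos($result, '<body');
$bodyEnd = strpos($result, '</body>');
$bodyEnd += 7; // strlen('</body>')
$bodyLength = $bodyEnd - $bodyStart;
$body = substr($result, $bodyStart, $bodyLength);
// Collect every <a ...>...</a> element (s: dot matches newlines, i: case-insensitive, U: ungreedy).
// The \b keeps tags such as <abbr> or <address> from matching.
preg_match_all('(<a\b.*</a>)siU', $body, $matching_data);
var_dump($matching_data);
[/code]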
The above outputs an array something like this:
array(1) {
  [0]=>
  array(28) {
    [0]=>
    string(96) "<a href="
    [1]=>
    string(91) "<a href="
    [2]=>
    string(92) "<a href="
    [3]=>
    …
    …
  }
}
As you can see, with that code it is possible to collect similar tags into separate arrays, but given a single array element, how do I separate its attributes?
I mean, I could use " or a space to explode the tag string and parse the attributes. But as you can see, even Google omits the quotes in places where it would be much better to use them. And certain tags may have attributes whose values contain spaces.
So, is any of you aware of a method with some intelligence to grab attributes and their values? [b]Regular expressions may help here, but they are still a mystery to me, and I don't know how to write a custom one for this.[/b]
I think an example of the final result would help. Sorry, my var_dump below may be out of order, but it is just to give you an idea of what I need in the end.
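To make the problem concrete, here is a minimal sketch of a regular expression that accepts double-quoted, single-quoted and unquoted attribute values in a single tag string. The function name parse_tag_attributes is hypothetical, and the sketch assumes that no attribute value contains a ">" character; it is only an illustration, not a complete HTML parser.

```php
<?php
// Hypothetical helper (not part of the code above): turn the attributes of
// one tag string into a name => value array.
function parse_tag_attributes(string $tag): array
{
    // Only look at the opening tag, so text between <a> and </a> is ignored.
    // Assumption: no attribute value contains a ">" character.
    if (!preg_match('/^<[^>]*>/', $tag, $open)) {
        return [];
    }

    // name = "double quoted" | 'single quoted' | unquoted
    $pattern = '/([a-zA-Z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/';
    preg_match_all($pattern, $open[0], $matches, PREG_SET_ORDER);

    $attributes = [];
    foreach ($matches as $m) {
        // Exactly one of groups 2, 3 and 4 took part in the match; groups
        // after it are missing from $m, groups before it are empty strings.
        $value = $m[4] ?? $m[3] ?? $m[2] ?? '';
        $attributes[strtolower($m[1])] = $value;
    }
    return $attributes;
}

// Usage: unquoted and quoted values in the same tag.
$example = parse_tag_attributes('<a href=index.html title="Home page">Home</a>');
// $example is ['href' => 'index.html', 'title' => 'Home page']
```

Each alternative in the pattern covers one quoting style, so a value containing spaces is kept whole as long as it is quoted, and an unquoted value ends at the first whitespace or ">".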
array(1) {
  ['html']=>
  array(2) {
    ['head']=>
    array(2) {
      ['meta']=>
      array(2) {
        ['keywords']
        ['description']
      }
      ['title']
    }
    ['body']=>
    array(n) {
      ['links']=>
      array(n) {
        [0]=>
        href(string)
        …
        …
        [n]=>
        href(string)
      }
      ['imgs']=>
      array(n) {
        [0]=>
        src(string)
        alt(string)
        …
        …
        [n]=>
        src(string)
        alt(string)
      }
      …
      …
    }
  }
}
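For illustration only, a structure like the one sketched above could probably be built with PHP's bundled DOM extension instead of hand-written string functions. The function name extract_page_data is hypothetical, and the sketch assumes the DOM extension is enabled and that only the keywords and description meta tags are wanted.

```php
<?php
// Hypothetical helper: build a nested array shaped like the dump above
// from an HTML string, using the DOM extension (assumed to be enabled).
function extract_page_data(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parse warnings internally
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // <meta name="keywords|description" content="...">
    $meta = [];
    foreach ($doc->getElementsByTagName('meta') as $tag) {
        $name = strtolower($tag->getAttribute('name'));
        if ($name === 'keywords' || $name === 'description') {
            $meta[$name] = $tag->getAttribute('content');
        }
    }

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode !== null ? trim($titleNode->textContent) : '';

    // href of every <a> that has one; quoting style does not matter here.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = ['href' => $a->getAttribute('href')];
        }
    }

    // src and alt of every <img>; a missing attribute becomes ''.
    $imgs = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $imgs[] = [
            'src' => $img->getAttribute('src'),
            'alt' => $img->getAttribute('alt'),
        ];
    }

    return ['html' => [
        'head' => ['meta' => $meta, 'title' => $title],
        'body' => ['links' => $links, 'imgs' => $imgs],
    ]];
}
```

The parser takes care of unquoted attribute values and values containing spaces, so no exploding on " or spaces is needed with this approach.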
Thanks and Best Regards