Which Line Is Causing the Issue?

@novice2022 Oct 15.2022

PHP programmers,

Can you see the crawler code here?
https://bytenota.com/parsing-an-xml-sitemap-in-php/

Here is the simple code that I need to modify:

[code]
// sitemap url or sitemap file
$sitemap = 'https://bytenota.com/sitemap.xml';

// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement) {
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;

    // print out the properties
    echo 'url: '. $url . '<br>';
    echo 'lastmod: '. $lastmod . '<br>';
    echo 'changefreq: '. $changefreq . '<br>';
    echo 'priority: '. $priority . '<br>';

    echo '<br>---<br>';
}
[/code]

I got that code from a tutorial. It assumes the XML sitemap (the starting point of the crawl) lists only HTML links and no further XML files.

The XML sitemap I was working on lists more XML sitemaps:
https://www.rocktherankings.com/sitemap_index.xml
Those other XML sitemaps then list the site's HTML files. That means the tutorial code above did not work and showed a blank page, because I have to write more code for the crawler to go one level deeper to find the site's HTML files. So the crawler should start on an XML file, find more XML files listed in it, and then visit those XML files to finally find the HTML links.
Now, look at this modification of the tutorial code:

[code]
$extracted_urls = array();
$crawl_xml_files = array();

// sitemap url or sitemap file
$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.

// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

    $path = $urlElement;
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE

    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
    echo $urlElement; //DELETE IN DEV MODE

    if($ext=='xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

        $crawl_xml_files[] = $url;
    }
    elseif($ext=='html' || $ext=='htm' || $ext=='shtml' || $ext=='shtm' || $ext=='php' || $ext=='py') //This means, the links found on the current page are the site's html pages and are not not links to further xml sitemaps.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

        $extracted_urls[] = $extracted_url;

        // get properties of url (non-xml files)
        $extracted_urls[] = $extracted_url = $urlElement->loc;
        $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod;
        $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
        $extracted_priorities[] = $extracted_priority = $urlElement->priority;
    }
}

print_r($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE
echo count($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE

if(!EMPTY($crawl_xml_files))
{
    foreach($crawl_xml_files AS $crawl_xml_file)
    {
        // Further sitemap url or sitemap file
        $sitemap = "$crawl_xml_file"; //Has more xml files.

        // get sitemap content
        $content = file_get_contents($sitemap);

        // parse the sitemap content to object
        $xml = simplexml_load_string($content);

        // retrieve properties from the sitemap object
        foreach ($xml->url as $urlElement)
        {
            $path = $urlElement;
            $ext = pathinfo($path, PATHINFO_EXTENSION);
            echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE

            echo $urlElement; //DELETE IN DEV MODE

            if($ext=='xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

                $crawl_xml_files[] = $url;
            }
            elseif($ext=='html' || $ext=='htm' || $ext=='shtml' || $ext=='shtm' || $ext=='php' || $ext=='py') //This means, the links found on the current page are the site's html pages and are not not links to further xml sitemaps.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

                $extracted_urls[] = $extracted_url;

                // get properties of url (non-xml files)
                $extracted_urls[] = $extracted_url = $urlElement->loc;
                $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod;
                $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
                $extracted_priorities[] = $extracted_priority = $urlElement->priority;
            }
        }
    }
}

echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

//Display all found html links.
print_r($extracted_urls); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_last_mods); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_changefreqs); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_priorities); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
[/code]

It does not work. I get this echoed:

157

172

188

205

231
The extension is:
237
[The same three lines repeat 35 more times, 36 times in total.]
Array ( )
0
308
Array ( )

Warning: Undefined variable $extracted_last_mods in C:Program FilesXampphtdocsWorkbuzzTemplatescrawler_Test.php on line 313

Warning: Undefined variable $extracted_changefreqs in C:Program FilesXampphtdocsWorkbuzzTemplatescrawler_Test.php on line 315

Warning: Undefined variable $extracted_priorities in C:Program FilesXampphtdocsWorkbuzzTemplatescrawler_Test.php on line 317

Where do you think I went wrong? On which particular lines?
Remember, I am trying to build the crawler on the skeleton of the code from that tutorial, because I understand that code without much trouble.
Skeleton of this tutorial code:
https://bytenota.com/parsing-an-xml-sitemap-in-php/

So I am working from code that I do understand. I hope you understand.
At least, if someone can point out where I am going wrong, then I reckon I can fix it from there. Right now, I am scratching my head. I get the feeling it is failing to scrape the found XML links and failing to spot the right extensions of the found links, hence the undefined variable warnings.

Thanks


PHP


@novice2022authorOct 15.2022 — #I have been testing and re-testing, and I have found out that this line is failing:

```php
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    $path = $urlElement; //THIS LINE IS FAILING TO EXTRACT THE EXTENSION
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE
```

@NogDog

Mate, why is that line failing?

I switched it to the following, but no luck:

```php
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MOD

    //File Extension extraction not working
    $path_parts = pathinfo($urlElement);
    echo 'The extension is: ' .$ext = $path_parts['extension']; echo '<br>';
```

Do you at least see any errors in either of my two snippets for extracting the file extension? If so, could you kindly fix this part in my previous post's code? Then that code, which is failing now, should work.

@novice2022authorOct 15.2022 — #@ginerjm

Mind chiming in ?

@ginerjmOct 15.2022 — #When I am debugging and find something odd I usually try and see what I'm dealing with that's not working. Try that.

@novice2022authorOct 22.2022 — #@ginerjm#1647887

Cheers.

I have found the extension is not getting extracted as I get echoed:

233

The extension is:

247

[The same three lines repeat 35 more times, 36 times in total.]

Array ( )

0

334

Array ( )

So, this line is not working to extract the url's file extension:

```php
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MOD

    //File Extension extraction not working
    echo $path = $urlElement;
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE
```

Why is that ?

@novice2022authorOct 22.2022 — #Folks,

I do not understand why the code in my previous post is failing to extract the file extension of the urls found on a page.

The context of the above post's code is this:

```php
$crawl_xml_files = array();
$extracted_urls = array();
$extracted_last_mods[] = array();
$extracted_changefreqs[] = array();
$extracted_priorities[] = array();

// sitemap url or sitemap file
$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';

// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);

// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MOD

    //File Extension extraction not working
    echo $path = $urlElement;
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE

    /*
    //File Extension extraction not working
    $path_parts = pathinfo($urlElement);
    echo 'The extension is: ' .$ext = $path_parts['extension']; echo '<br>';
    */

    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
    echo $urlElement; //DELETE IN DEV MODE

    if($ext=='xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

        $crawl_xml_files[] = $urlElement;
    }
    elseif($ext=='html' || $ext=='htm' || $ext=='shtml' || $ext=='shtm' || $ext=='php' || $ext=='py') //This means, the links found on the current page are the site's html pages and are not not links to further xml sitemaps.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

        //$extracted_urls[] = $urlElement;

        // get properties of url (non-xml files)
        echo $extracted_url;
        echo $extracted_lastmod;
        echo $extracted_changefreq;
        echo $extracted_priority;
    }
}

print_r($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE
echo count($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE

if(!EMPTY($crawl_xml_files))
{
    foreach($crawl_xml_files AS $crawl_xml_file)
    {
        // Further sitemap url or sitemap file
        $sitemap = "$crawl_xml_file"; //Has more xml files.

        // get sitemap content
        $content = file_get_contents($sitemap);

        // parse the sitemap content to object
        $xml = simplexml_load_string($content);

        // retrieve properties from the sitemap object
        foreach ($xml->url as $urlElement)
        {
            echo __LINE__; echo '<br>'; //DELETE IN DEV MOD

            //File Extension extraction not working
            echo $path = $urlElement;
            $ext = pathinfo($path, PATHINFO_EXTENSION);
            echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE

            /*
            //File Extension extraction not working
            $path_parts = pathinfo($urlElement);
            echo 'The extension is: ' .$ext = $path_parts['extension']; echo '<br>';
            */

            echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
            echo $urlElement; //DELETE IN DEV MODE

            if($ext=='xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

                $crawl_xml_files[] = $urlElement;
            }
            elseif($ext=='html' || $ext=='htm' || $ext=='shtml' || $ext=='shtm' || $ext=='php' || $ext=='py') //This means, the links found on the current page are the site's html pages and are not not links to further xml sitemaps.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

                //$extracted_urls[] = $urlElement;

                // get properties of url (non-xml files)
                $extracted_urls[] = $extracted_url = $urlElement->loc;
                $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod;
                $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
                $extracted_priorities[] = $extracted_priority = $urlElement->priority;

                echo $extracted_url;
                echo $extracted_lastmod;
                echo $extracted_changefreq;
                echo $extracted_priority;
            }
        }
    }
}

echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

//Display all found html links.
print_r($extracted_urls); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_last_mods); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_changefreqs); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_priorities); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
```

I get echoed:

236

The extension is:

250

[The same three lines repeat 35 more times, 36 times in total.]

Array ( )

0

337

Array ( )

Array ( [0] => Array ( ) )

Array ( [0] => Array ( ) )

Array ( [0] => Array ( ) )

If I separate the following part of the code from the script and test it on its own, it works fine:

```php
$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';

$path = $sitemap;
$ext = pathinfo($path, PATHINFO_EXTENSION);
echo 'The extension is: ' .$ext;
```

I say it works fine because it echoes:

**177**

**The extension is: xml**

@novice2022authorOct 22.2022 — #@Nogdog

Do you mind helping with my previous post? I am greatly puzzled as to what is malfunctioning in my code.

Thanks!

@ginerjmOct 22.2022 — #The last piece works because you have placed a true path/file value into the variable. For the rest of your code, we don't know what you are using because you are not showing the contents that are being examined, unless it is truly an empty value, which your debugging seems to be showing. In that case you are trying to examine a string that is not a filename. So how about echoing the contents of that file, maybe the first 2-3 rows or 100 bytes?
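A minimal sketch of what that kind of check can reveal, assuming a small inline sitemap (the `example.com` URL is made up so nothing is fetched). It suggests that casting the `<url>` element itself to a string gives no usable path, while its `<loc>` child gives the URL:

```php
<?php
// Hypothetical sitemap content, inlined so the sketch runs without network access.
$content = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/about.html</loc>
    <lastmod>2022-10-10</lastmod>
  </url>
</urlset>
XML;

// Look at the raw input first: the first 100 bytes show which root element we have.
echo substr($content, 0, 100), "\n";

$xml = simplexml_load_string($content);

foreach ($xml->url as $urlElement) {
    // <url> only contains child elements, so its own text is just whitespace.
    $fromElement = pathinfo((string) $urlElement, PATHINFO_EXTENSION);

    // <loc> holds the actual URL text.
    $fromLoc = pathinfo((string) $urlElement->loc, PATHINFO_EXTENSION);

    var_dump($fromElement);
    var_dump($fromLoc);
}
```

If the extension comes back empty in the real script, dumping `(string) $urlElement` next to `(string) $urlElement->loc` is a quick way to see which of the two actually carries the URL.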

@SempervivumOct 22.2022 — #@novice2022#1648015 I used the xml file you posted initially:

https://bytenota.com/parsing-an-xml-sitemap-in-php/

and added a var_dump in order to make the structure visible. Output:

```
object(SimpleXMLElement)[1]
  public 'sitemap' => 
    array (size=4)
      0 => 
        object(SimpleXMLElement)[2]
          public 'loc' => string 'https://www.rocktherankings.com/post-sitemap.xml' (length=48)
          public 'lastmod' => string '2022-10-10T10:23:30+00:00' (length=25)
      1 => 
        object(SimpleXMLElement)[3]
          public 'loc' => string 'https://www.rocktherankings.com/page-sitemap.xml' (length=48)
          public 'lastmod' => string '2022-10-21T15:20:35+00:00' (length=25)
      2 => 
        object(SimpleXMLElement)[4]
          public 'loc' => string 'https://www.rocktherankings.com/case-study-sitemap.xml' (length=54)
          public 'lastmod' => string '2022-06-29T14:29:51+00:00' (length=25)
      3 => 
        object(SimpleXMLElement)[5]
          public 'loc' => string 'https://www.rocktherankings.com/glossary-sitemap.xml' (length=52)
          public 'lastmod' => string '2022-09-25T13:57:50+00:00' (length=25)
```

The xml contains an array in the top level named `sitemap`. Each element in this array is an object containing the members `loc`, which is obviously the url, and `lastmod`.

Knowing about this, the code can be made runnable easily:

```php
$sitemap = 'sitemap.xml';
// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);
var_dump($xml);
// Init arrays
$crawl_xml_files = [];
$extracted_urls = [];
$extracted_last_mods = [];
// retrieve properties from the sitemap object
foreach ($xml->sitemap as $item) {
    // provide path of current xml/html file
    $path = (string)$item->loc;
    // get pathinfo
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' . $ext;
    echo '<br>'; //DELETE IN DEV MODE

    echo $item; //DELETE IN DEV MODE

    if ($ext == 'xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
    {
        echo __LINE__;
        echo '<br>'; //DELETE IN DEV MODE

        $crawl_xml_files[] = $path;
    } elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means, the links found on the current page are the site's html pages and are not links to further xml sitemaps.
    {
        echo __LINE__;
        echo '<br>'; //DELETE IN DEV MODE

        $extracted_urls[] = $path;

        // get properties of url (non-xml files)
        // $extracted_urls[] = $extracted_url = $urlElement->loc;
        $extracted_last_mods[] = $extracted_lastmod = $item->lastmod;
        // $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
        // $extracted_priorities[] = $extracted_priority = $urlElement->priority;
    }
}
var_dump($crawl_xml_files);
var_dump($extracted_urls);
var_dump($extracted_last_mods);
```

@novice2022authorOct 22.2022 — #@ginerjm#1648017

You are right. The string is a url. Is that the issue ?

@novice2022authorOct 22.2022 — #@Sempervivum#1648018

Thank you very much! I appreciate it.

I can see a few changes you made to my script.

1.

My Original Code

```php
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MOD

    //File Extension extraction not working
    echo $path = $urlElement;
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' .$ext; echo '<br>'; //DELETE IN DEV MODE
    echo $urlElement; //DELETE IN DEV MODE
```

Your Modified Code

```php
// retrieve properties from the sitemap object
foreach ($xml->sitemap as $item) {
    // provide path of curren xml/html file
    $path = (string)$item->loc;
    // get pathinfo
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' . $ext;
    echo '<br>'; //DELETE IN DEV MODE
    echo $item; //DELETE IN DEV MODE
```

2.

My Original Code

```php
if($ext=='xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE

    $crawl_xml_files[] = $urlElement;
```

Your Modified Code

```php
if ($ext == 'xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
{
    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE

    $crawl_xml_files[] = $path;
```

@novice2022authorOct 22.2022 — #@Sempervivum#1648018

I just do not understand one part of your code:

```php
$sitemap = 'sitemap.xml';
```

Where did you get that url?

Wasn't the sitemap url this one?

https://www.rocktherankings.com/post-sitemap.xml

How did it change to a shorter version in your code?

@SempervivumOct 22.2022 — #@novice2022#1648021 I simply downloaded the content of that URL to my local file system. That way it resides in the same path as my PHP file.

The essential modification is this:

`foreach ($xml->sitemap as $item) {`

As the xml structure contains an array named `sitemap` I had to use this name.

I'm using $item for the current item in a loop just out of habit, you can change this name according to your preferences.
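Rather than hard-coding `sitemap` or `url`, the root element name can tell the two layouts apart. A minimal sketch, assuming the usual sitemaps.org layout where a `<sitemapindex>` root wraps `<sitemap>` children and a `<urlset>` root wraps `<url>` children; the helper name `read_sitemap` and the `example.com` URLs are made up for illustration:

```php
<?php
// Hypothetical helper: returns the root element name and every <loc> value found.
function read_sitemap(string $content): array
{
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $xml = simplexml_load_string($content);
    if ($xml === false) {
        return ['type' => 'invalid', 'locs' => []];
    }

    // getName() returns the name of the root element, e.g. "sitemapindex" or "urlset".
    $type = $xml->getName();
    $children = ($type === 'sitemapindex') ? $xml->sitemap : $xml->url;

    $locs = [];
    foreach ($children as $child) {
        $locs[] = trim((string) $child->loc);
    }

    return ['type' => $type, 'locs' => $locs];
}

// Inline sample documents so the sketch needs no network access.
$index = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       . '<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>'
       . '</sitemapindex>';
$pages = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       . '<url><loc>https://example.com/hello-world/</loc></url>'
       . '</urlset>';

print_r(read_sitemap($index));
print_r(read_sitemap($pages));
```

With this approach the file extension of each link no longer decides whether it is a sitemap or a page; the type of the document that listed it does.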

@novice2022authorOct 22.2022 — #@Sempervivum#1648022

Ok. Thanks.

Do you know of any good sitemap that lists XML files which, when opened, list further XML files, and so on to more than 3 levels deep, where the XML files finally list HTML files?

I want to test our code.

@SempervivumOct 22.2022 — #@novice2022#1648023 Unfortunately I do not. In the past I dealt with XML and evaluating it by simplexml but not with sitemaps.

@novice2022authorOct 22.2022 — #@Sempervivum#1648024

What is the difference between the two ?

**Is it like simplehtmldom thingy ?**

@novice2022authorOct 22.2022 — #@Sempervivum

The part of your code that I modified is working:

```php
var_dump($crawl_xml_files);
var_dump($extracted_urls);
var_dump($extracted_last_mods);
var_dump($extracted_changefreqs);
var_dump($extracted_priorities);
```

But why is this not working?

```php
foreach($crawl_xml_files as $crawl_xml_file)
{
    echo 'Xml File to crawl: ' .$crawl_xml_file;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_urls as $extracted_url)
{
    echo 'Extracted Url: ' .$extracted_url;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_last_mods as $extracted_last_mod)
{
    echo 'Extracted last Mod: ' .$extracted_last_mod;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_changefreqs as $extracted_changefreq)
{
    echo 'Extracted Change Frequency: ' .$extracted_changefreq;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_priorities as $extracted_priority)
{
    echo 'Extracted Priority: ' .$extracted_priority;
}
```

Full Code

```php
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
$sitemap = 'https://bytenota.com/sitemap.xml';
// get sitemap content
$content = file_get_contents($sitemap);

// parse the sitemap content to object
$xml = simplexml_load_string($content);
var_dump($xml);
// Init arrays
$crawl_xml_files = [];
$extracted_urls = [];
$extracted_last_mods = [];
$extracted_changefreqs = [];
$extracted_priorities = [];

// retrieve properties from the sitemap object
//foreach ($xml->url as $urlElement)
foreach ($xml->sitemap as $item) {
    // provide path of curren xml/html file
    //$path = $urlElement;
    $path = (string)$item->loc;
    // get pathinfo
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' . $ext;
    echo '<br>'; //DELETE IN DEV MODE

    echo $item; //DELETE IN DEV MODE

    if ($ext == 'xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
    {
        echo __LINE__;
        echo '<br>'; //DELETE IN DEV MODE

        //$crawl_xml_files[] = $urlElement;
        $crawl_xml_files[] = $path;
    } elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means, the links found on the current page are the site's html pages and are not not links to further xml sitemaps.
    {
        echo __LINE__;
        echo '<br>'; //DELETE IN DEV MODE

        $extracted_urls[] = $path;

        // get properties of url (non-xml files)
        $extracted_urls[] = $extracted_url = $urlElement->loc;
        $extracted_last_mods[] = $extracted_lastmod = $item->lastmod;
        $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
        $extracted_priorities[] = $extracted_priority = $urlElement->priority;
    }
}
var_dump($crawl_xml_files);
var_dump($extracted_urls);
var_dump($extracted_last_mods);
var_dump($extracted_changefreqs);
var_dump($extracted_priorities);

foreach($crawl_xml_files as $crawl_xml_file)
{
    echo 'Xml File to crawl: ' .$crawl_xml_file;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_urls as $extracted_url)
{
    echo 'Extracted Url: ' .$extracted_url;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_last_mods as $extracted_last_mod)
{
    echo 'Extracted last Mod: ' .$extracted_last_mod;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_changefreqs as $extracted_changefreq)
{
    echo 'Extracted Change Frequency: ' .$extracted_changefreq;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE

foreach($extracted_priorities as $extracted_priority)
{
    echo 'Extracted Priority: ' .$extracted_priority;
}

echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
```

@novice2022authorOct 22.2022 — #@Sempervivum#1648022

> As the xml structure contains an array named sitemap I had to use this name.

Mmm. So does that mean not all websites' XML sitemaps have the same structure?

If not, then how does Googlebot manage to crawl every website's XML sitemap?

I am trying to build a crawler that will work on any XML sitemap. How can I achieve this?

Do we need to add some code to the crawler that inspects the XML sitemap's structure?

How would I code this?

@novice2022authorOct 22.2022 — #@Sempervivum

Btw, your code works on infinite levels deep. Yes ?

Xml file lists more xml files.

Those xml files, in level 1, list more xml files.

Those xml files, in level 2, list more xml files.

And so on. Until a level lists html files.

In above example, imagine the html files were found on level 10 deep.

Our code should work on infinite levels as I do not want to restrict the levels.

And so, I ask, can your code go to infinite levels ?

My code only went 2 levels deep.

Do not forget my previous 2 replies to you.

@SempervivumOct 22.2022 — #@novice2022#1648026

Taking a look at the output of the first var_dump tells that the array at the top level is named `url` in this sitemap.

After modifying this it turns out that many of the URLs don't have an extension.
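For URLs like these, it may help to take the extension from the URL's path component only. A minimal sketch, assuming a made-up helper name `url_extension` and made-up `example.com` URLs; `parse_url()` and `pathinfo()` are the standard PHP functions:

```php
<?php
// Hypothetical helper: extension of the URL's path component, or '' if there is none.
function url_extension(string $url): string
{
    // Work on the path only, so the host name and query string cannot be
    // mistaken for part of a file name.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }

    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

$samples = [
    'https://example.com/sitemap.xml',
    'https://example.com/blog/my-post/',
    'https://example.com/page.php?id=3',
    'https://example.com',
];

foreach ($samples as $sample) {
    echo $sample, ' => "', url_extension($sample), '"', "\n";
}
```

Pretty URLs such as `/blog/my-post/` simply have no extension, so an `if`/`elseif` on the extension alone will skip them. Treating everything listed under `url` as a page, whatever its extension, is probably the safer rule.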

@novice2022authorOct 22.2022 — #@nogdog

If not too much to ask, can you help me here ?

https://forum.webdeveloper.com/d/401610-which-line-is-causing-the-issue/17

@SempervivumOct 22.2022 — #PS:
>Btw, your code works on infinite levels deep. Yes ?

No, it does not. It needs adjustments if it should do.
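One possible adjustment is recursion: a function that calls itself for every nested sitemap it finds. This is a minimal sketch, not the code from this thread. The function name `collect_page_urls`, the `$fetch` callback and the inline documents are assumptions for illustration, and in a real crawler `$fetch` could wrap `file_get_contents()`. As far as I know the sitemaps.org protocol does not expect index files to list other index files, so deep nesting should be rare, but the depth limit and the `$seen` list guard against loops either way.

```php
<?php
// Hypothetical recursive collector: follows <sitemapindex> documents until it
// reaches <urlset> documents and returns every page URL found.
function collect_page_urls(string $sitemapUrl, callable $fetch, array &$seen, int $depthLeft = 10): array
{
    // Stop on loops and on excessive nesting.
    if ($depthLeft < 0 || isset($seen[$sitemapUrl])) {
        return [];
    }
    $seen[$sitemapUrl] = true;

    $content = $fetch($sitemapUrl);
    if (!is_string($content)) {
        return [];
    }

    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    if ($xml === false) {
        return [];
    }

    $pages = [];
    if ($xml->getName() === 'sitemapindex') {
        // This document lists further sitemaps: go one level deeper for each.
        foreach ($xml->sitemap as $item) {
            $child = trim((string) $item->loc);
            $pages = array_merge($pages, collect_page_urls($child, $fetch, $seen, $depthLeft - 1));
        }
    } else {
        // This document lists pages.
        foreach ($xml->url as $item) {
            $pages[] = trim((string) $item->loc);
        }
    }

    return $pages;
}

// Made-up documents, three levels deep, so the sketch needs no network access.
$documents = [
    'https://example.com/sitemap_index.xml' =>
        '<sitemapindex><sitemap><loc>https://example.com/level1.xml</loc></sitemap></sitemapindex>',
    'https://example.com/level1.xml' =>
        '<sitemapindex><sitemap><loc>https://example.com/level2.xml</loc></sitemap></sitemapindex>',
    'https://example.com/level2.xml' =>
        '<urlset><url><loc>https://example.com/a/</loc></url><url><loc>https://example.com/b.html</loc></url></urlset>',
];
$fetch = function (string $url) use ($documents) {
    return $documents[$url] ?? null;
};

$seen = [];
$pages = collect_page_urls('https://example.com/sitemap_index.xml', $fetch, $seen);
print_r($pages);
```

The depth limit of 10 is an arbitrary safety net; raising it changes how deep the crawl may go without changing the logic.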

@novice2022authorOct 22.2022 — #@Sempervivum#1648031

Do you mind adjusting for our learning purpose ?

@novice2022authorOct 22.2022 — #@Sempervivum

Do you know what is odd ?

Check this xml sitemap out:

https://bytenota.com/sitemap.xml

Do you see all those urls ? They exist. Right ?

But the vardump only shows url in the first position only!

Look:

**object(SimpleXMLElement)#1 (1) { ["url"]=> array(528) { [0]=> object(SimpleXMLElement)#2 (4) { ["loc"]=> object(SimpleXMLElement)#530 (0) { } ["lastmod"]=> object(SimpleXMLElement)#531 (0) { } ["changefreq"]=> object(SimpleXMLElement)#532 (0) { } ["priority"]=> object(SimpleXMLElement)#533 (0) { } } [1]=> object(SimpleXMLElement)#3 (4) { ["loc"]=> object(SimpleXMLElement)#533 (0) { } ["lastmod"]=> object(SimpleXMLElement)#532 (0) { } ["changefreq"]=> object(SimpleXMLElement)#531 (0) { } ["priority"]=> object(SimpleXMLElement)#530 (0) { } } [2]=> object(SimpleXMLElement)#4 (4) { ["loc"]=> object(SimpleXMLElement)#530 (0) { } ["lastmod"]=> object(SimpleXMLElement)#531 (0) { } ["changefreq"]=> object(SimpleXMLElement)#532 (0) { } ["priority"]=> object(SimpleXMLElement)#533 (0) { } } [3]=> object(SimpleXMLElement)#5 (4) { ["loc"]=> object(SimpleXMLElement)#533 (0) { } ["lastmod"]=> object(SimpleXMLElement)#532 (0) { } ["changefreq"]=> object(SimpleXMLElement)#531 (0) { } ["priority"]=> object(SimpleXMLElement)#530 (0) { } } [4]=> object(SimpleXMLElement)#6 (4) { ["loc"]=> object(SimpleXMLElement)#530 (0) { } ["lastmod"]=> object(SimpleXMLElement)#531 (0) { } ["changefreq"]=> object(SimpleXMLElement)#532 (0) { } ["priority"]=> object(SimpleXMLElement)#533 (0) { } } [5]=> object(SimpleXMLElement)#7 (4) { ["loc"]=> object(SimpleXMLElement)#533 (0) { } ["lastmod"]=> object(SimpleXMLElement)#532 (0) { } ["changefreq"]=> object(SimpleXMLElement)#531 (0) { } ["priority"]=> object(SimpleXMLElement)#530 (0) { } } [6]=> object(SimpleXMLElement)#8 (4) { ["loc"]=> object(SimpleXMLElement)#530 (0) { } ["lastmod"]=> object(SimpleXMLElement)#531 (0) { } ["changefreq"]=> object(SimpleXMLElement)#532 (0) { } ["priority"]=> object(SimpleXMLElement)#533 (0) { } } [7]=> object(SimpleXMLElement)#9 (4) { ["loc"]=> object(SimpleXMLElement)#533 (0) { } ["lastmod"]=> object(SimpleXMLElement)#532 (0) { } ["changefreq"]=> object(SimpleXMLElement)#531 (0) { } ["priority"]=> 
object(SimpleXMLElement)#530 (0) { } } [8]=> object(SimpleXMLElement)#10 (4) { ["loc"]=> object(SimpleXMLElement)#530 (0) { } ["lastmod"]=> object(SimpleXMLElement)#531 (0) { } ["changefreq"]=> object(SimpleXMLElement)#532 (0) { } ["priority"]=> object(SimpleXMLElement)#533 (0) { } } [9]=> object(SimpleXMLElement)#11 (4) { ["loc"]=> object(SimpleXMLElement)#533 (0) { } ["lastmod"]=> object(SimpleXMLElement)#532 (0) { } ["changefreq"]=> object(SimpleXMLElement)#531 (0) { } ["priority"]=> object(SimpleXMLElement)#530 (0) { } } [10]=> object(SimpleXMLElement)#12 (4) { ["loc"]=> object(SimpleXMLElement)#530 (0) { } ["lastmod"]=> object(SimpleXMLElement)#531 (0) { } ["changefreq"]=> object(SimpleXMLElement)#532 (0) { } ["priority"]=> object(SimpleXMLElement)#533 (0) { } } [11]=>**

CODE

    $sitemap = 'https://bytenota.com/sitemap.xml';
    // get sitemap content
    $content = file_get_contents($sitemap);

    // parse the sitemap content to object
    $xml = simplexml_load_string($content);
    var_dump($xml);
    // Init arrays
    $crawl_xml_files = [];
    $extracted_urls = [];
    $extracted_last_mods = [];
    $extracted_changefreqs = [];
    $extracted_priorities = [];

    // retrieve properties from the sitemap object
    //foreach ($xml->url as $urlElement)
    foreach ($xml->sitemap as $item) {
        // provide path of curren xml/html file
        //$path = $urlElement;
        $path = (string)$item->loc;
        // get pathinfo
        $ext = pathinfo($path, PATHINFO_EXTENSION);
        echo 'The extension is: ' . $ext;
        echo '<br>'; //DELETE IN DEV MODE

        echo $item; //DELETE IN DEV MODE

        if ($ext == 'xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
        {
            echo __LINE__;
            echo '<br>'; //DELETE IN DEV MODE

            //$crawl_xml_files[] = $urlElement;
            $crawl_xml_files[] = $path;
        } elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means, the links found on the current page are the site's html pages and are not not links to further xml sitemaps.
        {
            echo __LINE__;
            echo '<br>'; //DELETE IN DEV MODE

            $extracted_urls[] = $path;

            // get properties of url (non-xml files)
            $extracted_urls[] = $extracted_url = $urlElement->loc;
            $extracted_last_mods[] = $extracted_lastmod = $item->lastmod;
            $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
            $extracted_priorities[] = $extracted_priority = $urlElement->priority;
        }
    }
    var_dump($crawl_xml_files);
    var_dump($extracted_urls);
    var_dump($extracted_last_mods);
    var_dump($extracted_changefreqs);
    var_dump($extracted_priorities);

    foreach($crawl_xml_files as $crawl_xml_file)
    {
        echo 'Xml File to crawl: ' .$crawl_xml_file;
    }

    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_urls as $extracted_url)
    {
        echo 'Extracted Url: ' .$extracted_url;
    }

    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_last_mods as $extracted_last_mod)
    {
        echo 'Extracted last Mod: ' .$extracted_last_mod;
    }

    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_changefreqs as $extracted_changefreq)
    {
        echo 'Extracted Change Frequency: ' .$extracted_changefreq;
    }

    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE

    foreach($extracted_priorities as $extracted_priority)
    {
        echo 'Extracted Priority: ' .$extracted_priority;
    }

    echo __LINE__;
    echo '<br>'; //DELETE IN DEV MODE
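
A note on the mismatch between the dump and the loop: the dump lists a single top-level key, `url`, with 528 entries and no `sitemap` key, while the loop iterates `$xml->sitemap`. A loop over a child name that does not exist in the document simply runs zero times, which would leave every array empty. The `elseif` branch also still reads `$urlElement`, which only exists in the commented-out `foreach`. The following is a minimal sketch of how one might check this before looping; the inline XML and the `example.com` URLs are made-up stand-ins, not the real bytenota sitemap.

```php
<?php
// Hypothetical sample standing in for a <urlset> sitemap like the one dumped above.
$content = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/a.html</loc>
    <lastmod>2022-10-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/b.html</loc>
  </url>
</urlset>
XML;

$xml = simplexml_load_string($content);

// Count each kind of child first, to see which foreach can ever run.
$sitemapCount = count($xml->sitemap); // 0: there are no <sitemap> children here
$urlCount     = count($xml->url);     // 2: the entries live under <url>

echo "sitemap entries: $sitemapCount\n";
echo "url entries: $urlCount\n";

// Casting a SimpleXMLElement to string returns the element's text content.
$firstLoc = (string) $xml->url[0]->loc;
echo "first loc: $firstLoc\n";
```

If the `sitemap` count is 0 and the `url` count is not, the document is a plain URL list and the loop has to iterate `$xml->url` instead.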

@novice2022authorOct 22.2022 — #@Sempervivum

Can you show me the code that you used to echo this ?

    object(SimpleXMLElement)[1]
      public 'sitemap' =>
        array (size=4)
          0 =>
            object(SimpleXMLElement)[2]
              public 'loc' => string 'https://www.rocktherankings.com/post-sitemap.xml' (length=48)
              public 'lastmod' => string '2022-10-10T10:23:30+00:00' (length=25)
          1 =>
            object(SimpleXMLElement)[3]
              public 'loc' => string 'https://www.rocktherankings.com/page-sitemap.xml' (length=48)
              public 'lastmod' => string '2022-10-21T15:20:35+00:00' (length=25)
          2 =>
            object(SimpleXMLElement)[4]
              public 'loc' => string 'https://www.rocktherankings.com/case-study-sitemap.xml' (length=54)
              public 'lastmod' => string '2022-06-29T14:29:51+00:00' (length=25)
          3 =>
            object(SimpleXMLElement)[5]
              public 'loc' => string 'https://www.rocktherankings.com/glossary-sitemap.xml' (length=52)
              public 'lastmod' => string '2022-09-25T13:57:50+00:00' (length=25)

@SempervivumOct 22.2022 — #I have to finish this for today as other tasks are waiting.

To adjust the code I would have to read the sitemap documentation first. Please do this on your own. This resource seems appropriate:

https://en.wikipedia.org/wiki/Sitemaps

@SempervivumOct 22.2022 — #Obviously the sitemap I used was not the one from your initial posting.

    <?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//www.rocktherankings.com/wp-content/plugins/wordpress-seo/css/main-sitemap.xsl"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
    <loc>https://www.rocktherankings.com/post-sitemap.xml</loc>
    <lastmod>2022-10-10T10:23:30+00:00</lastmod>
    </sitemap>
    <sitemap>
    <loc>https://www.rocktherankings.com/page-sitemap.xml</loc>
    <lastmod>2022-10-21T15:20:35+00:00</lastmod>
    </sitemap>
    <sitemap>
    <loc>https://www.rocktherankings.com/case-study-sitemap.xml</loc>
    <lastmod>2022-06-29T14:29:51+00:00</lastmod>
    </sitemap>
    <sitemap>
    <loc>https://www.rocktherankings.com/glossary-sitemap.xml</loc>
    <lastmod>2022-09-25T13:57:50+00:00</lastmod>
    </sitemap>
    </sitemapindex>
    <!-- XML Sitemap generated by Yoast SEO -->

@novice2022authorOct 26.2022 — #@Sempervivum#1648036

Any chance you can get your crawler code above to work (in its simplest form) on any site? On any XML sitemap?

You know, I have Ubot Studio. With it I can build desktop bots (.exe). I can easily build a bot to auto-visit domains, find their sitemaps and extract links. But that would mean I would have to keep my home PC on 24/7 to crawl the whole web. I'd rather webmasters came to my webform and submitted their XML sitemaps so my web bot (.php) can then crawl their links. That way, I won't have to keep my PC on 24/7, since the web crawler will be on the VPS host side and not on mine. Hence all the fuss to build a PHP web crawler.

There are tonnes of PHP crawlers online. Free ones. But I do not understand their code as it is OOP, and I do not like building my website with other people's code. I get no satisfaction that way. I prefer to learn and build things myself and then use my own little baby. My own home-built Frankenstein. I get a kick out of it that way.

So, if you can, just make as small an amendment as possible so that it detects the structure (in your case you manually detected an array named "sitemap"). Because if I keep the code as is, it won't work on other sites since their structure will be different.

I think you understand.

Maybe I'll call my web crawler "Frankenstein Crawler" or "Semp"? Lol!

Once you have helped me with that, I will go and try to memorise the code, make slight changes (so it is not an exact copy of your code) and then set my crawler loose on the www. :)

@SempervivumOct 26.2022 — #@novice2022#1648071 Did you read the article on Wikipedia I linked to?

@novice2022authorOct 26.2022 — #@Sempervivum#1648072

I saw the link a few days back when I was shutting down my PC to roll off to bed. Today, I forgot about it. Sorry.

Checking now.

Btw, I am curious.

If each website's XML sitemap has different tree names, then how can a general XML sitemap crawler extract all the site links? I mean, the crawler will be programmed to look for a certain named parent and child. It won't know what the site's XML tree nodes are called (what the parent is called, what the children are called, etc.) in order to look for those particular nodes.

Most likely, the parser has some AI to detect the node names: it extracts the parent and child names and then uses these to find or extract the site links. That's the bit of code I need.

Anyway, I get the feeling your Wikipedia link answers this, but if it does not then you are welcome to answer it.

Cheers!
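
On the question of node names: the sitemaps.org protocol fixes them, so no name detection should be needed. A page list has the root `<urlset>` with `<url>` children, and a sitemap index has the root `<sitemapindex>` with `<sitemap>` children; both kinds of entry carry a `<loc>`. One way to tell the two apart is to read the root element's name with `SimpleXMLElement::getName()`. The sketch below assumes that approach; the function name `sitemap_kind` and the sample XML are made up for illustration.

```php
<?php
/**
 * Classify a sitemap document by its root element name.
 * Returns 'urlset', 'index', 'unknown' or 'invalid'.
 * (Hypothetical helper, not part of the code posted in this thread.)
 */
function sitemap_kind(string $content): string
{
    // Collect parse errors internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);

    $xml = simplexml_load_string($content);
    if ($xml === false) {
        return 'invalid'; // not well-formed XML
    }

    switch ($xml->getName()) {
        case 'urlset':
            return 'urlset';  // children are <url> entries (page links)
        case 'sitemapindex':
            return 'index';   // children are <sitemap> entries (more sitemaps)
        default:
            return 'unknown'; // some other XML document
    }
}

// Made-up sample documents.
$pageList = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          . '<url><loc>https://example.com/</loc></url></urlset>';
$index    = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          . '<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap></sitemapindex>';

echo sitemap_kind($pageList), "\n";
echo sitemap_kind($index), "\n";
```

Checking the root name rather than the file extension of each link also covers page URLs that have no extension at all.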

@novice2022authorOct 26.2022 — #@Sempervivum

I was wondering why your wiki link was looking familiar.

I read this a week earlier:

https://www.sitemaps.org/protocol.html

Nevertheless, this is what I learnt to use:

    foreach ($xml->url as $urlElement) //Extracts html file urls.

    foreach ($xml->sitemap as $urlElement) //Extracts Sitemap Urls.
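
Those two loops can be combined into one routine that follows a sitemap index down to the page links. What follows is a rough sketch under stated assumptions, not a finished crawler: the function name `collect_sitemap_urls`, the `$fetch` callable, the depth limit and the in-memory sample files are all hypothetical. The fetcher is passed in so the example runs without network access; on a live site one could pass `'file_get_contents'` instead.

```php
<?php
/**
 * Collect page URLs from a sitemap, following nested sitemap indexes.
 *
 * $fetch    callable taking a URL and returning the XML string, or false on failure
 * $maxDepth how many levels of nested sitemaps to follow
 * $seen     sitemap URLs already visited, to avoid fetching one twice
 */
function collect_sitemap_urls(string $sitemapUrl, callable $fetch, int $maxDepth = 3, array &$seen = []): array
{
    if ($maxDepth < 0 || isset($seen[$sitemapUrl])) {
        return [];
    }
    $seen[$sitemapUrl] = true;

    $content = $fetch($sitemapUrl);
    if (!is_string($content)) {
        return []; // fetch failed
    }

    libxml_use_internal_errors(true); // no warnings on malformed XML
    $xml = simplexml_load_string($content);
    if ($xml === false) {
        return [];
    }

    $pages = [];

    // <url> entries: page links. This loop runs zero times on a sitemap index.
    foreach ($xml->url as $urlElement) {
        $pages[] = trim((string) $urlElement->loc);
    }

    // <sitemap> entries: further sitemaps. This loop runs zero times on a page list.
    foreach ($xml->sitemap as $sitemapElement) {
        $child = trim((string) $sitemapElement->loc);
        $pages = array_merge($pages, collect_sitemap_urls($child, $fetch, $maxDepth - 1, $seen));
    }

    return $pages;
}

// Made-up in-memory "site" so the sketch runs offline.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$fakeFiles = [
    'https://example.com/sitemap_index.xml' =>
        "<sitemapindex xmlns=\"$ns\"><sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap></sitemapindex>",
    'https://example.com/post-sitemap.xml' =>
        "<urlset xmlns=\"$ns\"><url><loc>https://example.com/hello/</loc></url>"
        . "<url><loc>https://example.com/world/</loc></url></urlset>",
];
$fetch = function (string $url) use ($fakeFiles) {
    return $fakeFiles[$url] ?? false;
};

$pages = collect_sitemap_urls('https://example.com/sitemap_index.xml', $fetch);
print_r($pages);
```

Because each loop is simply skipped when its element is absent, the same function handles a plain page list and a sitemap index without checking file extensions.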
