PHP programmers,
Can you take a look at the crawler code here?
Here is the simple code that I need to modify:
[code]
// sitemap url or sitemap file
$sitemap = 'https://bytenota.com/sitemap.xml';
// get sitemap content
$content = file_get_contents($sitemap);
// parse the sitemap content to object
$xml = simplexml_load_string($content);
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement) {
    // get properties
    $url = $urlElement->loc;
    $lastmod = $urlElement->lastmod;
    $changefreq = $urlElement->changefreq;
    $priority = $urlElement->priority;
    // print out the properties
    echo 'url: ' . $url . '<br>';
    echo 'lastmod: ' . $lastmod . '<br>';
    echo 'changefreq: ' . $changefreq . '<br>';
    echo 'priority: ' . $priority . '<br>';
    echo '<br>---<br>';
}
[/code]
I got that code from a tutorial. It assumes the sitemap XML file (the starting point of the crawl) lists only HTML links and no further XML files.
The XML sitemap I was working on, however, listed more XML sitemaps,
and those sitemaps in turn listed the site's HTML pages. That means the code in my original post did not work and showed a blank page, because I need more code for the crawler to go one level deeper to find the site's HTML files. So the crawler should start on an XML file, find the further XML files listed in it, and then visit those XML files to finally find the HTML links.
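To make that two-level idea concrete, here is a rough sketch of the flow I have in mind. It is not tested against the real site. It assumes the standard sitemap protocol layout, where an index file has a `<sitemapindex>` root with `<sitemap><loc>` entries and a regular sitemap has a `<urlset>` root with `<url><loc>` entries. The names `extractLocs`, `collectPageUrls`, `$fakeSite` and the example.com URLs are all made up for illustration, and the fetcher is a stand-in so the sketch runs without network access.

```php
<?php
// Hypothetical helper: parse one sitemap document and return its type
// ('index', 'urlset' or 'invalid') together with all <loc> values found.
function extractLocs(string $xmlString): array
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return ['type' => 'invalid', 'locs' => []];
    }

    $locs = [];
    if ($xml->getName() === 'sitemapindex') {
        // An index file lists further sitemaps in <sitemap> elements.
        foreach ($xml->sitemap as $entry) {
            $locs[] = trim((string) $entry->loc);
        }
        return ['type' => 'index', 'locs' => $locs];
    }

    // Otherwise treat it as a regular sitemap with <url> elements.
    foreach ($xml->url as $entry) {
        $locs[] = trim((string) $entry->loc);
    }
    return ['type' => 'urlset', 'locs' => $locs];
}

// Hypothetical helper: start at one sitemap and follow index files
// until page URLs are found. $fetch maps a URL to its XML content.
function collectPageUrls(string $sitemapUrl, callable $fetch, int $depth = 0): array
{
    if ($depth > 3) {
        return []; // arbitrary guard against endless nesting
    }
    $content = $fetch($sitemapUrl);
    if (!is_string($content)) {
        return []; // fetch failed
    }
    $result = extractLocs($content);
    if ($result['type'] !== 'index') {
        return $result['locs'];
    }
    $pages = [];
    foreach ($result['locs'] as $childUrl) {
        $pages = array_merge($pages, collectPageUrls($childUrl, $fetch, $depth + 1));
    }
    return $pages;
}

// Made-up sample data standing in for a real site.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$fakeSite = [
    'https://example.com/sitemap_index.xml' =>
        '<sitemapindex xmlns="' . $ns . '">'
        . '<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>'
        . '<sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>'
        . '</sitemapindex>',
    'https://example.com/post-sitemap.xml' =>
        '<urlset xmlns="' . $ns . '">'
        . '<url><loc>https://example.com/first-post/</loc></url>'
        . '<url><loc>https://example.com/second-post/</loc></url>'
        . '</urlset>',
    'https://example.com/page-sitemap.xml' =>
        '<urlset xmlns="' . $ns . '">'
        . '<url><loc>https://example.com/about/</loc></url>'
        . '</urlset>',
];

// Stand-in fetcher; a real crawl would call file_get_contents() here.
$fetch = function (string $url) use ($fakeSite) {
    return $fakeSite[$url] ?? false;
};

$pages = collectPageUrls('https://example.com/sitemap_index.xml', $fetch);
print_r($pages);
```

The sketch decides what to do from the root element name rather than from file extensions, which is only one possible design choice.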
Now, look at my modification of the tutorial code:
[code]
$extracted_urls = array();
$crawl_xml_files = array();
// sitemap url or sitemap file
$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
// get sitemap content
$content = file_get_contents($sitemap);
// parse the sitemap content to object
$xml = simplexml_load_string($content);
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
    $path = $urlElement;
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    echo 'The extension is: ' . $ext; echo '<br>'; //DELETE IN DEV MODE
    echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
    echo $urlElement; //DELETE IN DEV MODE
    if ($ext == 'xml') //This means the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
        $crawl_xml_files[] = $url;
    }
    elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means the links found on the current page are the site's html pages and are not links to further xml sitemaps.
    {
        echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
        $extracted_urls[] = $extracted_url;
        // get properties of url (non-xml files)
        $extracted_urls[] = $extracted_url = $urlElement->loc;
        $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod;
        $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
        $extracted_priorities[] = $extracted_priority = $urlElement->priority;
    }
}
print_r($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE
echo count($crawl_xml_files); echo '<br>'; //DELETE IN DEV MODE
if (!empty($crawl_xml_files))
{
    foreach ($crawl_xml_files as $crawl_xml_file)
    {
        // Further sitemap url or sitemap file
        $sitemap = "$crawl_xml_file"; //Has more xml files.
        // get sitemap content
        $content = file_get_contents($sitemap);
        // parse the sitemap content to object
        $xml = simplexml_load_string($content);
        // retrieve properties from the sitemap object
        foreach ($xml->url as $urlElement)
        {
            $path = $urlElement;
            $ext = pathinfo($path, PATHINFO_EXTENSION);
            echo 'The extension is: ' . $ext; echo '<br>'; //DELETE IN DEV MODE
            echo $urlElement; //DELETE IN DEV MODE
            if ($ext == 'xml') //This means the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
                $crawl_xml_files[] = $url;
            }
            elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means the links found on the current page are the site's html pages and are not links to further xml sitemaps.
            {
                echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
                $extracted_urls[] = $extracted_url;
                // get properties of url (non-xml files)
                $extracted_urls[] = $extracted_url = $urlElement->loc;
                $extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod;
                $extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq;
                $extracted_priorities[] = $extracted_priority = $urlElement->priority;
            }
        }
    }
}
echo __LINE__; echo '<br>'; //DELETE IN DEV MODE
//Display all found html links.
print_r($extracted_urls); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_last_mods); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_changefreqs); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
print_r($extracted_priorities); //DELETE IN DEV MODE
echo '<br>'; //DELETE IN DEV MODE
[/code]
It does not work. I get this echoed:
48
157
172
188
205
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
231
The extension is:
237
Array ( )
0
308
Array ( )
Warning: Undefined variable $extracted_last_mods in C:Program FilesXampphtdocsWorkbuzzTemplatescrawler_Test.php on line 313
Warning: Undefined variable $extracted_changefreqs in C:Program FilesXampphtdocsWorkbuzzTemplatescrawler_Test.php on line 315
Warning: Undefined variable $extracted_priorities in C:Program FilesXampphtdocsWorkbuzzTemplatescrawler_Test.php on line 317
Where do you think I went wrong? On which particular lines?
Remember, I am trying to build the crawler on the skeleton of the tutorial code shown at the top, because I can follow that code without much trouble.
That is why I am working from code I do understand. I hope that makes sense.
At least, if someone can point out where I am going wrong, I reckon I can fix it from there. Right now I am scratching my head. I get the feeling it is failing to scrape the XML links it finds and failing to spot the right extensions of those links, hence the undefined variable errors.
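One way to test the extension suspicion in isolation might be a small check like the one below. It is only a sketch: `urlExtension` is a made-up helper name and the sample URLs are invented, not taken from the real sitemap. It takes the path part of each URL with `parse_url()` and then asks `pathinfo()` for the extension, so a URL that ends in a slash should come back with an empty extension and would match neither branch of an extension-based `if`/`elseif`.

```php
<?php
// Hypothetical helper: extension of the path part of a URL,
// ignoring any query string. Returns '' when there is none.
function urlExtension(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return ''; // no path component, or the URL could not be parsed
    }
    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

// Invented sample URLs, for illustration only.
$samples = [
    'https://example.com/post-sitemap.xml',
    'https://example.com/some-post/',
    'https://example.com/page.html?x=1',
];

foreach ($samples as $sample) {
    printf("%s => '%s'\n", $sample, urlExtension($sample));
}
```

If the real links look like the second sample, checking the extension alone would never classify them as pages.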
Thanks