I have two scripts I’ve been tinkering with for a site-analytics tool I’m building, and I’d like to combine them so that the email extractor parses an entire site and lists every email address it finds.
This is more for fun and site forensics than for spam harvesting; there are already plenty of harvesting tools out there, and PHP would be an impractical choice for that anyway.
Anyway, I have a link extractor that works fine:
[code=php]
<!-- Link Extractor START -->
<?php
// findlinks.php
// php code example: find links in an html page
// mallsop.com 2006 gpl
// (cleaned up: escaped quotes restored; removed eregi*/ereg* calls
// replaced with preg_* / str_ireplace equivalents)
$self = htmlspecialchars($_SERVER["PHP_SELF"]);
echo "<form method=\"post\" action=\"$self\">\n";
echo "<p><table align=\"absmiddle\" width=\"100%\" bgcolor=\"#cccccc\" name=\"tablesiteopen\" border=\"0\">\n";
echo "<tr><td align=\"left\">";
if (isset($_POST["FindLinks"])) {
    $urlname = trim($_POST["urlname"]);
    if ($urlname == "") {
        echo "Please enter a URL.<br>\n";
    }
    else { // open the html page and parse it
        $page_title = "n/a";
        $links = array();
        if ($handle = @fopen($urlname, "r")) { // must be able to read it
            $content = "";
            while (!feof($handle)) {
                $content .= fread($handle, 1024);
            }
            fclose($handle);
            $lines = preg_split("/\r?\n|\r/", $content); // turn the content into rows
            $is_title = false;
            $is_href = false;
            $index = 0;
            foreach ($lines as $val) {
                if (preg_match("#<title>(.*)</title>#i", $val, $title)) {
                    $page_title = $title[1];
                    $is_title = true;
                }
                if (preg_match("#<a href=(.*)</a>#i", $val, $alink)) {
                    $newurl = $alink[1];
                    $newurl = str_ireplace(' target="_blank"', "", $newurl);
                    $newurl = trim($newurl);
                    $pos1 = strpos($newurl, "/>");
                    if ($pos1 !== false) {
                        $newurl = substr($newurl, 1, $pos1);
                    }
                    $pos2 = strpos($newurl, ">");
                    if ($pos2 !== false) {
                        $newurl = substr($newurl, 1, $pos2);
                    }
                    $newurl = str_replace("\"", "", $newurl);
                    $newurl = str_replace(">", "", $newurl);
                    if (stripos($newurl, "http") === false) { // relative link
                        if (strpos($newurl, "/") === 0) {
                            $newurl = substr($newurl, 1);
                        }
                        $newurl = $urlname . $newurl;
                    }
                    // put in array of found links
                    $links[$index] = $newurl;
                    $index++;
                    $is_href = true;
                }
            } // foreach lines done
            echo "<p><b>Page Summary</b><br>\n";
            echo "<b>Url:</b> " . htmlspecialchars($urlname) . "<br>\n";
            if ($is_title) {
                echo "<b>Title:</b> " . htmlspecialchars($page_title) . "<br>\n";
            }
            else {
                echo "No title found<br>\n";
            }
            echo "<b>Links:</b><br>\n";
            if ($is_href) {
                foreach ($links as $myval) {
                    echo "Link: " . htmlspecialchars($myval) . "<br>\n";
                }
            }
            else {
                echo "No links found<br>\n";
            }
            echo "End</p>\n";
        } // fopen handle ok
        else {
            echo "<br>The url " . htmlspecialchars($urlname) . " does not exist or there was an fopen error.<br>";
        }
        echo "<br /><br /><h4><a href=\"http://www.site-search.org/link-extractor.php\" title=\"Link Extractor\">Try Again</a></h4>";
    } // end else urlname given
} // end if FindLinks submitted
else {
    echo "<br /><br />\n";
    echo "<p><h2>Link Extractor</h2><br>\n";
    echo "File or URL: <input type=\"text\" name=\"urlname\" value=\"http://\" maxlength=\"255\" size=\"80\">\n";
    echo "<input type=\"submit\" name=\"FindLinks\" value=\"Extract Links\"><br></p>\n";
    echo "<br /><br />\n";
}
echo "</td></tr>";
echo "</table></p>";
echo "</form></body></html>\n";
?>
<!-- Link Extractor END -->
[/code]
You can see it in action on this page:
[URL="http://www.site-search.org/link-extractor.php"]Link Extractor[/URL]
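One caveat with the line-by-line matching: an anchor tag split across two lines of source will never match. Running a single regex over the whole document avoids that; here's a rough sketch of the idea (`extract_links()` is a hypothetical helper name I made up, not part of the script above):

```php
<?php
// Sketch: collect the quoted href of every anchor tag in one pass
// over the raw HTML, instead of splitting into lines first.
function extract_links($html, $base = '')
{
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
    $links = array();
    foreach (array_unique($m[1]) as $url) {
        // Resolve root-relative links against the base URL, much like
        // the original script's string concatenation does.
        if ($base !== '' && stripos($url, 'http') !== 0) {
            $url = rtrim($base, '/') . '/' . ltrim($url, '/');
        }
        $links[] = $url;
    }
    return $links;
}

// Usage:
$html = '<p><a href="/about">About</a> <a href="http://example.com/">Home</a></p>';
print_r(extract_links($html, 'http://example.com'));
```

It still won't cope with unquoted href values or malformed markup, but it's a step up from per-line matching.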
Then I have this quirky and rather crash-prone email extractor that I am still fiddling with:
[code=php]
<!-- Email Extractor START -->
<?php
###############################################################
# Email Extractor 1.0
###############################################################
# Visit http://www.zubrag.com/scripts/ for updates
###############################################################
$the_url = isset($_REQUEST['url']) ? htmlspecialchars($_REQUEST['url']) : '';
?>
<form method="post">
Please enter full URL of the page to parse (including http://):<br />
<input type="text" name="url" size="65" value="http://<?php echo str_replace('http://', '', $the_url); ?>"/><br />
or enter text directly into textarea below:<br />
<textarea name="text" cols="50" rows="15"></textarea>
<br />
<input type="submit" value="Parse Emails" />
</form>
<?php
$text = '';
if (!empty($_REQUEST['url'])) {
    // fetch data from specified url
    $text = @file_get_contents($_REQUEST['url']);
}
elseif (!empty($_REQUEST['text'])) {
    // get text from textarea
    $text = $_REQUEST['text'];
}

// parse emails
if (!empty($text)) {
    $res = preg_match_all(
        "/[a-z0-9]+([_\.-][a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}/i",
        $text,
        $matches
    );
    if ($res) {
        foreach (array_unique($matches[0]) as $email) {
            echo htmlspecialchars($email) . "<br />";
        }
    }
    else {
        echo "No emails found.";
    }
}
?>
<!-- Email Extractor END -->
[/code]
Here is the script in action: [URL="http://www.site-search.org/email-extractor.php"]email extractor[/URL]
I would like to combine the two so that the combined script crawls the entire site and extracts every email address it finds.
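One minimal way to glue them together is a small breadth-first crawler: keep a queue of same-host URLs, fetch each page once, pull emails out with the extractor's regex, and feed the page's links back into the queue. Here is a sketch along those lines (all function names are hypothetical, not from either script; real use would also want timeouts, a politeness delay, and robots.txt handling):

```php
<?php
// Sketch of combining the two scripts: crawl a site breadth-first,
// collecting email addresses from every fetched page.

function find_emails($html)
{
    // Same pattern the email extractor uses.
    preg_match_all(
        "/[a-z0-9]+([_\.-][a-z0-9]+)*@([a-z0-9]+([.-][a-z0-9]+)*)+\.[a-z]{2,}/i",
        $html,
        $m
    );
    return array_values(array_unique($m[0]));
}

function find_links($html)
{
    // Quoted href values of anchor tags.
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
    return array_values(array_unique($m[1]));
}

function crawl_site($start, $max_pages = 50)
{
    $host    = parse_url($start, PHP_URL_HOST);
    $queue   = array($start);
    $visited = array();
    $emails  = array();

    while ($queue && count($visited) < $max_pages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue; // fetch failed; skip this page
        }

        $emails = array_merge($emails, find_emails($html));

        foreach (find_links($html) as $link) {
            // Resolve root-relative links; stay on the same host.
            if (strpos($link, '/') === 0) {
                $link = 'http://' . $host . $link;
            }
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_values(array_unique($emails));
}

// Usage (hypothetical URL):
// foreach (crawl_site('http://www.example.com/') as $email) {
//     echo $email . "<br />\n";
// }
```

The `$max_pages` cap matters: without it, a site with calendar pages or query-string permutations can keep a naive crawler busy forever.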