Combining my LinkExtractor and Email Extractor…

@donatello Dec 04, 2010

I have two scripts that I've been fiddling with for a site analytics tool I've been playing with, and I would like to combine them so that the email extractor parses the entire site and lists every email address it finds.

This is more for fun and site forensics than for use as a spambot email harvester; there are plenty of spam harvesting tools out there already, and PHP is an impractical means of performing that task.

Anyway, I have a link extractor that works fine:

[code=php]
<!-- Link Extractor START -->
<?php
// findlinks.php
// php code example: find links in an html page
// mallsop.com 2006 gpl

echo "<form method=post action=\"" . htmlspecialchars($_SERVER["PHP_SELF"]) . "\"> \n";
echo "<p><table align=\"absmiddle\" width=\"100%\" bgcolor=\"#cccccc\" name=\"tablesiteopen\" border=\"0\">\n";
echo "<tr><td align=left>";
if (!empty($_POST["FindLinks"])) {
  $urlname = trim($_POST["urlname"] ?? "");
  if ($urlname == "") {
    echo "Please enter a URL. <br>\n";
  }
  else {
    // open the html page and parse it
    $page_title = "n/a";
    $links[0] = "n/a";
    //$meta_descr = "n/a";
    //$meta_keywd = "n/a";

    if ($handle = @fopen($urlname, "r")) { // must be able to read it
      $content = "";
      while (!feof($handle)) {
        $part = fread($handle, 1024);
        $content .= $part;
        // if (stripos($part, "</head>") !== false) break;
      }
      fclose($handle);
      $lines = preg_split("/\r?\n|\r/", $content); // turn the content into rows

      // boolean
      $is_title = false;
      //$is_descr = false;
      //$is_keywd = false;
      $is_href = false;
      $index = 0;

      //$close_tag = ($xhtml) ? " />" : ">"; // new in ver. 1.01
      foreach ($lines as $val) {
        // case-insensitive match for the page title
        if (preg_match("#<title>(.*)</title>#i", $val, $title)) {
          $page_title = $title[1];
          $is_title = true;
        }
        // case-insensitive match for an anchor on this row
        if (preg_match("#<a href=(.*)</a>#i", $val, $alink)) {
          $newurl = $alink[1];
          $newurl = str_ireplace(' target="_blank"', "", $newurl);
          $newurl = trim($newurl);
          $pos1 = strpos($newurl, "/>");
          if ($pos1 !== false) {
            $newurl = substr($newurl, 1, $pos1);
          }
          $pos2 = strpos($newurl, ">");
          if ($pos2 !== false) {
            $newurl = substr($newurl, 1, $pos2);
          }
          // strip leftover quotes and angle brackets
          $newurl = str_replace("\"", "", $newurl);
          $newurl = str_replace(">", "", $newurl);

          //if (stripos($newurl, "http") === false) { // local
          //  $newurl = "http://".$_SERVER["HTTP_HOST"]."/".$newurl;
          //}
          if (stripos($newurl, "http") === false) { // local
            $pos1 = strpos($newurl, "/");
            // strict comparison: only strip a leading slash, not the first character of "page.html"
            if ($pos1 === 0) {
              $newurl = substr($newurl, 1);
            }
            $newurl = $urlname.$newurl;
          }

          // put in array of found links
          $links[$index] = $newurl;
          $index++;
          $is_href = true;
        }
      } // foreach lines done

      echo "<p><b>Page Summary</b><br>\n";
      echo "<b>Url:</b> ".$urlname."<br>\n";
      if ($is_title) {
        echo "<b>Title:</b> ".$page_title."<br>\n";
      }
      else {
        echo "No title found<br>\n";
      }
      echo "<b>Links:</b><br>\n";
      if ($is_href) {
        foreach ($links as $myval) {
          echo "Link: ".$myval."<br>\n";
        }
      }
      else {
        echo "No links found<br>\n";
      }
      echo "End</p>\n";
    } // fopen handle ok
    else {
      echo "<br>The url $urlname does not exist or there was an fopen error.<br>";
    }
    echo "<br /><br /><h4><a href=\"http://www.site-search.org/link-extractor.php\" title=\"Link Extractor\">Try Again</a></h4>";
  } // end else urlname given
} // else find links now submit
else {
  $urlname = ""; // or whatever page you like
  echo "<br /><br />\n";
  echo "<p><h2>Link Extractor</h2><br>\n";
  echo "File or URL: <input type=\"TEXT\" name=\"urlname\" value=\"http://\" maxlength=\"255\" size=\"80\">\n";
  echo "<input type=\"SUBMIT\" name=\"FindLinks\" value=\"Extract Links\"></font><br></p> \n";
  echo "<br /><br />\n";
}
echo "</td></tr>";
echo "</table></p>";
echo "</form></BODY></HTML>\n";

?>
<!-- Link Extractor END -->
[/code]

You can see it in action on this page:
[URL="http://www.site-search.org/link-extractor.php"]Link Extractor[/URL]

Then I have this quirky and rather crash-prone email extractor that I am still fiddling with:

[code=php]
<!-- Email Extractor START -->
<?php

###############################################################
# Email Extractor 1.0
###############################################################
# Visit http://www.zubrag.com/scripts/ for updates
###############################################################

$the_url = isset($_REQUEST['url']) ? htmlspecialchars($_REQUEST['url']) : '';
?>

<form method="post">
  Please enter full URL of the page to parse (including http://):<br />
  <input type="text" name="url" size="65" value="http://<?php echo str_replace('http://', '', $the_url); ?>"/><br />
  or enter text directly into textarea below:<br />
  <textarea name="text" cols="50" rows="15"></textarea>
  <br />
  <input type="submit" value="Parse Emails" />
</form>

<?php
$url = isset($_REQUEST['url']) ? trim($_REQUEST['url']) : '';
// the form pre-fills "http://", so treat that bare prefix as "no URL given"
if ($url !== '' && $url !== 'http://') {
  // fetch data from specified url (false on failure)
  $text = @file_get_contents($url);
}
elseif (isset($_REQUEST['text']) && !empty($_REQUEST['text'])) {
  // get text from text area
  $text = $_REQUEST['text'];
}

// parse emails
if (!empty($text)) {
  // the domain part has no nested quantifier, which limits backtracking on long inputs
  $res = preg_match_all(
    "/[a-z0-9]+([_\.-][a-z0-9]+)*@[a-z0-9]+([.-][a-z0-9]+)*\.[a-z]{2,}/i",
    $text,
    $matches
  );

  if ($res) {
    foreach (array_unique($matches[0]) as $email) {
      echo $email . "<br />";
    }
  }
  else {
    echo "No emails found.";
  }
}

?>
<!-- Email Extractor END -->
[/code]

Here is the script in action: [URL="http://www.site-search.org/email-extractor.php"]email extractor[/URL]

I would like to combine the two so that the script will parse the entire site and extract all found email addresses.
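One way the combination could look is a small breadth-first crawl: fetch a page, collect its email addresses, queue any links on the same host, and repeat until the queue is empty or a page limit is hit. The sketch below is only an illustration, not a drop-in replacement for either script. It assumes PHP 7 or later, and the function names (`extract_emails`, `extract_links`, `crawl_emails`), the `$fetch` callback, and the `$maxPages` limit are all hypothetical. Link resolution is deliberately simple: it ignores ports, `../` segments and `<base>` tags.

```php
<?php
// Hypothetical helpers; none of these names come from the two scripts above.

/** Return the unique email addresses found in a block of text. */
function extract_emails(string $text): array
{
    preg_match_all(
        '/[a-z0-9]+(?:[_.-][a-z0-9]+)*@[a-z0-9]+(?:[.-][a-z0-9]+)*\.[a-z]{2,}/i',
        $text,
        $m
    );
    return array_values(array_unique($m[0]));
}

/** Return absolute http(s) URLs for the <a href> values in $html (simplified resolution). */
function extract_links(string $html, string $baseUrl): array
{
    $base = parse_url($baseUrl);
    if (empty($base['scheme']) || empty($base['host'])) {
        return [];
    }
    $root = $base['scheme'] . '://' . $base['host']; // port ignored for brevity
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1); // directory of the current page

    preg_match_all('/<a\s[^>]*?href\s*=\s*["\']([^"\'#]+)/i', $html, $m);
    $links = [];
    foreach ($m[1] as $href) {
        $href = trim($href);
        if ($href === '') {
            continue;
        }
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href;                 // already absolute
        } elseif (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) || strpos($href, '//') === 0) {
            continue;                         // mailto:, javascript:, protocol-relative
        } elseif ($href[0] === '/') {
            $links[] = $root . $href;         // root-relative
        } else {
            $links[] = $root . $dir . $href;  // relative to the current directory
        }
    }
    return array_values(array_unique($links));
}

/** Crawl pages on the start URL's host and return every email found, lowercased. */
function crawl_emails(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = [$startUrl];
    $seen    = [$startUrl => true];
    $emails  = [];
    $fetched = 0;

    while ($queue && $fetched < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        $fetched++;
        if (!is_string($html) || $html === '') {
            continue; // fetch failed or page was empty
        }
        foreach (extract_emails($html) as $email) {
            $emails[strtolower($email)] = true;
        }
        foreach (extract_links($html, $url) as $link) {
            // stay on the same host and never queue a URL twice
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return array_keys($emails);
}
```

On a live site the callback could be as simple as `function ($url) { return @file_get_contents($url); }`, which needs `allow_url_fopen` enabled. Passing the fetcher in as a callback also makes it possible to try the crawl against an array of canned pages first, and the page limit keeps a large site from running the script into its time limit.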
