
Combining my LinkExtractor and Email Extractor…

I have two scripts that I’ve been fiddling with for a site analytics tool, and I would like to combine them so that the email extractor parses the entire site and lists all emails found.

This is more for fun and site forensics than a spambot email harvester: there are already plenty of spam harvesting tools out there, and PHP is an impractical means of performing that task.

Anyway, I have a link extractor that works fine:

[code=php]
<!-- Link Extractor START -->
<?php
// findlinks.php
// php code example: find links in an html page
// mallsop.com 2006 gpl
// The ereg*() calls are replaced with preg_*()/str_*() equivalents,
// because the POSIX regex functions were removed in PHP 7.

echo "<form method=\"post\" action=\"" . htmlspecialchars($_SERVER["PHP_SELF"]) . "\">\n";
echo "<p><table align=\"absmiddle\" width=\"100%\" bgcolor=\"#cccccc\" name=\"tablesiteopen\" border=\"0\">\n";
echo "<tr><td align=\"left\">";
if (!empty($_POST["FindLinks"])) {
    $urlname = trim($_POST["urlname"]);
    if ($urlname == "") {
        echo "Please enter a URL. <br>\n";
    }
    else { // open the html page and parse it

        $page_title = "n/a";
        $links[0] = "n/a";
        //$meta_descr = "n/a";
        //$meta_keywd = "n/a";

        if ($handle = @fopen($urlname, "r")) { // must be able to read it
            $content = "";
            while (!feof($handle)) {
                $part = fread($handle, 1024);
                $content .= $part;
                // if (stripos($part, "</head>") !== false) break;
            }
            fclose($handle);
            $lines = preg_split("/\r?\n|\r/", $content); // turn the content into rows

            // boolean
            $is_title = false;
            //$is_descr = false;
            //$is_keywd = false;
            $is_href = false;
            $index = 0;

            //$close_tag = ($xhtml) ? " />" : ">"; // new in ver. 1.01
            foreach ($lines as $val) {
                if (preg_match("#<title>(.*)</title>#i", $val, $title)) {
                    $page_title = $title[1];
                    $is_title = true;
                }
                if (preg_match("#<a href=(.*)</a>#i", $val, $alink)) {

                    $newurl = $alink[1];
                    $newurl = str_ireplace(' target="_blank"', "", $newurl);
                    $newurl = trim($newurl);
                    $pos1 = strpos($newurl, "/>");
                    if ($pos1 !== false) {
                        $newurl = substr($newurl, 1, $pos1);
                    }
                    $pos2 = strpos($newurl, ">");
                    if ($pos2 !== false) {
                        $newurl = substr($newurl, 1, $pos2);
                    }
                    $newurl = str_replace("\"", "", $newurl);
                    $newurl = str_replace(">", "", $newurl);

                    //if (stripos($newurl, "http") === false) { // local
                    // $newurl = "http://".$_SERVER["HTTP_HOST"]."/".$newurl;
                    // }
                    if (stripos($newurl, "http") === false) { // local
                        $pos1 = strpos($newurl, "/");
                        if ($pos1 === 0) { // strict test: strpos() returns false when there is no slash
                            $newurl = substr($newurl, 1);
                        }
                        $newurl = $urlname.$newurl;
                    }

                    // put in array of found links
                    $links[$index] = $newurl;
                    $index++;
                    $is_href = true;

                }

            } // foreach lines done

            echo "<p><b>Page Summary</b><br>\n";
            echo "<b>Url:</b> ".htmlspecialchars($urlname)."<br>\n";
            if ($is_title) {
                echo "<b>Title:</b> ".htmlspecialchars($page_title)."<br>\n";
            }
            else {
                echo "No title found<br>\n";
            }
            echo "<b>Links:</b><br>\n";
            if ($is_href) {
                foreach ($links as $myval) {
                    echo "Link: ".htmlspecialchars($myval)."<br>\n";
                }
            }
            else {
                echo "No links found<br>\n";
            }
            echo "End</p>\n";
        } // fopen handle ok
        else {
            echo "<br>The url ".htmlspecialchars($urlname)." does not exist or there was an fopen error.<br>";
        }
        echo "<br /><br /><h4><a href=\"http://www.site-search.org/link-extractor.php\" title=\"Link Extractor\">Try Again</a></h4>";
    } // end else urlname given
} // else find links now submit
else {
    $urlname = ""; // or whatever page you like
    echo "<br /><br />\n";
    echo "<p><h2>Link Extractor</h2><br>\n";
    echo "File or URL: <input type=\"TEXT\" name=\"urlname\" value=\"http://\" maxlength=\"255\" size=\"80\">\n";
    echo "<input type=\"SUBMIT\" name=\"FindLinks\" value=\"Extract Links\"><br></p> \n";
    echo "<br /><br />\n";
}
echo "</td></tr>";
echo "</table></p>";
echo "</form></BODY></HTML>\n";

?>
<!-- Link Extractor END -->

[/code]

You can see it in action on this page:
[URL="http://www.site-search.org/link-extractor.php"]Link Extractor[/URL]

Then I have this quirky and rather crash-prone email extractor that I am still fiddling with:

[code=php]
<!-- Email Extractor START -->
<?php

###############################################################
# Email Extractor 1.0
###############################################################
# Visit http://www.zubrag.com/scripts/ for updates
###############################################################

$the_url = isset($_REQUEST['url']) ? htmlspecialchars($_REQUEST['url']) : '';
?>

<form method="post">
Please enter full URL of the page to parse (including http://):<br />
<input type="text" name="url" size="65" value="http://<?php echo str_replace('http://', '', $the_url); ?>"/><br />
or enter text directly into textarea below:<br />
<textarea name="text" cols="50" rows="15"></textarea>
<br />
<input type="submit" value="Parse Emails" />
</form>

<?php
$url = isset($_REQUEST['url']) ? trim($_REQUEST['url']) : '';
if (preg_match('#^https?://.+#i', $url)) {
    // fetch data from specified url (http/https only); the bare "http://"
    // default from the form is ignored so the textarea branch stays reachable
    $text = @file_get_contents($url);
}
elseif (!empty($_REQUEST['text'])) {
    // get text from text area
    $text = $_REQUEST['text'];
}

// parse emails
if (!empty($text)) {
    // the nested "(...)+" around the domain part is dropped: it matched the
    // same addresses but could backtrack heavily on long input
    $res = preg_match_all(
        "/[a-z0-9]+(?:[_.-][a-z0-9]+)*@[a-z0-9]+(?:[.-][a-z0-9]+)*\.[a-z]{2,}/i",
        $text,
        $matches
    );

    if ($res) {
        foreach (array_unique($matches[0]) as $email) {
            echo $email . "<br />";
        }
    }
    else {
        echo "No emails found.";
    }
}

?>
<!-- Email Extractor END -->
[/code]

Here is the script in action: [URL="http://www.site-search.org/email-extractor.php"]email extractor[/URL]

I would like to combine the two so that the script will parse the entire site and extract all found email addresses.
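
One way the combination could look, as a minimal sketch rather than a finished tool: crawl pages on the starting host breadth-first, run the email pattern over each page, and queue any same-host links that have not been visited yet. The function names (`extract_emails`, `extract_hrefs`, `crawl_emails`), the `$fetch` callable, and the `$maxPages` cap are all hypothetical and appear in neither script above. Only absolute, scheme-relative, and root-relative links are followed, and the regex-based href matching will miss unquoted attributes. The email pattern is the one from the extractor above with its outer nested `+` dropped, which should match the same addresses with less backtracking.

```php
<?php
// Hypothetical helpers; a sketch, not part of either original script.

// Return the unique email addresses found in a block of text.
function extract_emails($text)
{
    $pattern = '/[a-z0-9]+(?:[_.-][a-z0-9]+)*@[a-z0-9]+(?:[.-][a-z0-9]+)*\.[a-z]{2,}/i';
    preg_match_all($pattern, $text, $matches);
    return array_values(array_unique($matches[0]));
}

// Return the values of quoted href attributes on <a> tags.
function extract_hrefs($html)
{
    preg_match_all('/<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    return $matches[1];
}

// Breadth-first crawl of a single host, collecting emails from each page.
// $fetch is any callable that takes a URL and returns HTML, or false on failure.
function crawl_emails($startUrl, callable $fetch, $maxPages = 50)
{
    $scheme = parse_url($startUrl, PHP_URL_SCHEME);
    $host   = parse_url($startUrl, PHP_URL_HOST);
    $port   = parse_url($startUrl, PHP_URL_PORT);
    $origin = $scheme . '://' . $host . ($port ? ':' . $port : '');

    $queue  = array($startUrl);
    $seen   = array($startUrl => true);
    $emails = array();
    $pages  = 0;

    while ($queue && $pages < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        $pages++;
        if ($html === false) {
            continue; // unreachable page: skip it
        }
        foreach (extract_emails($html) as $email) {
            $emails[strtolower($email)] = true; // array keys de-duplicate
        }
        foreach (extract_hrefs($html) as $href) {
            if (strpos($href, '//') === 0) {
                $href = $scheme . ':' . $href;      // scheme-relative link
            } elseif (strpos($href, '/') === 0) {
                $href = $origin . $href;            // root-relative link
            }
            $href = preg_replace('/#.*$/', '', $href); // drop any fragment
            // Skip other hosts, mailto: links, unresolved relative links, repeats.
            if (parse_url($href, PHP_URL_HOST) !== $host || isset($seen[$href])) {
                continue;
            }
            $seen[$href] = true;
            $queue[]     = $href;
        }
    }
    return array_keys($emails);
}

// Example usage (performs real HTTP requests, so it is left commented out):
// $emails = crawl_emails('http://www.example.com/', function ($url) {
//     return @file_get_contents($url);
// });
// echo implode("<br />\n", $emails);
```

Passing the fetcher in as a callable keeps the crawl logic separate from the HTTP code, so either `file_get_contents()` or a cURL-based function could be plugged in, and the page cap guards against the script running until it times out on a large site.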


1 Comment

@donatello (author), Dec 04, 2010: There is another link extractor that I'm using, which sometimes displays relative URLs instead of absolute URLs (not good), but it does display hyperlinked URLs.

Here is that alternate script:
[code=php]
<!-- URL Extractor START -->
<table width="700" border="0" cellpadding="2" cellspacing="0" class="bodytext">
<tr>
<td width="600"><form name="form1" method="post" action="">
<div align="left">
<table width="100%" border="0" cellspacing="0" cellpadding="2">
<tr>
<td align="left"><p>Enter URL: <input name="url" size="80" type="text" class="bodytext" id="url" value="http://<?php echo isset($_POST['url']) ? htmlspecialchars(str_replace('http://', '', $_POST['url'])) : ''; ?>"></p></td><td>&nbsp;<input name="Submit" type="submit" class="bodytext" value="Extract Links">
</td>
</tr>
</table>
</div>
</form></td>
</tr>
</table>

<?php
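$url = isset($_POST["url"]) ? trim($_POST["url"]) : "";
if (preg_match('#^https?://.+#i', $url)) { // only fetch once a real URL has been submitted
    $var = fread_url($url);

    // PHP 5.4+ rejects call-time pass-by-reference, so pass $matches rather than &$matches
    preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"']+".
        "(.*?)[\"']+.*?>"."([^<]+|.*?)?<\/a>/",
        $var, $matches);

    foreach ($matches[1] as $link)
    {
        $link = htmlspecialchars($link);
        echo "<a href=\"$link\">$link</a><br />";
    }
}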


// The fread_url function allows you to get a complete
// page. If CURL is not installed replace the contents with
// a fopen / fget loop

function fread_url($url,$ref="")
{
    $html = ""; // initialised so the fopen branch can append to it safely
    if(function_exists("curl_init")){
        $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; ".
            "Windows NT 5.0)";
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_USERAGENT, $user_agent );
        curl_setopt( $ch, CURLOPT_HTTPGET, 1 );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, 1 );
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_REFERER, $ref );
        curl_setopt( $ch, CURLOPT_COOKIEJAR, 'cookie.txt' );
        $html = curl_exec($ch);
        if ($html === false) {
            $html = ""; // request failed: return an empty page
        }
        curl_close($ch);
    }
    else{
        $hfile = @fopen($url,"r");
        if($hfile){
            while(!feof($hfile)){
                $html.=fgets($hfile,1024);
            }
            fclose($hfile);
        }
    }
    return $html;
}

?>

<!-- URL Extractor END -->

[/code]


Here is that script in action:

URL Extractor script
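
The relative-URL problem mentioned in this comment could be handled by resolving each extracted href against the URL of the page it came from. The following is a rough sketch under that assumption; `resolve_url` is a hypothetical helper that is not part of either script, and it covers scheme-relative, root-relative, and path-relative links (including `./` and `../`) without being a full RFC 3986 implementation.

```php
<?php
// Hypothetical helper: turn an href found on the page at $base into an absolute URL.
function resolve_url($base, $href)
{
    // Already absolute (http:, https:, mailto:, ...): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }
    $b = parse_url($base);
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href; // scheme-relative link
    }
    $origin   = $b['scheme'] . '://' . $b['host']
              . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = isset($b['path']) ? $b['path'] : '/';

    // Split the path from any ?query or #fragment so only the path is normalised.
    preg_match('/^([^?#]*)(.*)$/s', $href, $m);
    list(, $path, $suffix) = $m;

    if ($path === '') {
        return $origin . $basePath . $suffix; // "?page=2" or "#top" style link
    }
    if ($path[0] !== '/') {
        // Path-relative: prepend the directory part of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    // Keep a trailing slash when the path pointed at a directory.
    $trailing = ($out && preg_match('#/\.{0,2}$#', $path)) ? '/' : '';

    return $origin . '/' . implode('/', $out) . $trailing . $suffix;
}
```

In the extractor's output loop, each match could be passed through `resolve_url($url, $link)` before it is echoed, so that every listed link is absolute.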