I want to dynamically convert some HTML documents to plain text. The HTML documents contain links like this:
[code=html]<a href="http://www.domain.tld/file.ext">interesting link text</a>[/code]
and in my plain text doc I want them converted to:
[code=html]interesting link text (http://www.domain.tld/file.ext)[/code]
How would I do that?
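[code=php]<?php
// Single-quoted pattern so that \1 stays a backreference to the opening quote
// (inside a double-quoted PHP string, "\1" would be read as an octal escape).
$regex = '/<a(?:.*?)href\s*=\s*(["\'])(.*?)\1(?:.*?)>(.*?)<\/a>/';
$mats = array();
$html = '<a class="test" href="test.html">this is a test</a>blah<a href=\'#test\'>test</a>';
preg_match_all($regex, $html, $mats);
echo '<textarea rows="40" cols="80">';
print_r($mats);
echo '</textarea>';
?>[/code]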
This code demonstrates the regex on some example text. It's probably not perfect, but it's a start.
[code=php]
<?php
function replace_links($text)
{
    # $1 is the href value, $2 is the link text
    $search = '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
    $replace = '$2 ($1)';
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
EOD;
echo replace_links($test);
?>
[/code]
[quote]Thanks to everyone who is trying to help with this. Nogdog, I have a couple of questions for you.
First: could that be modified slightly to strip out mailto: so only the email address shows?
Example: <a href="mailto:[email protected]">Bokeh</a>
converted to
Bokeh ([email protected])[/quote]
[code=php]
<?php
function replace_links($text)
{
    # First pass removes "mailto:" from the href; second pass converts every link.
    $search = array('/(<a\s[^>]*href=[\'"])mailto:([^\'"]+[\'"][^>]*>([^<]+)<\/a>)/i',
                    '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i');
    $replace = array('$1$2',
                     '$2 ($1)');
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
EOD;
echo replace_links($test);
?>
[/code]
[quote]Second: how long have you been working with PCRE?[/quote]
[quote]And third: roughly how much time did you spend working out the logic and getting that regex together, if you don't mind me asking?[/quote]
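[code=php]# Note: [(mailto:)] and [(http:\/\/)] are character classes (any one of the
# listed characters, repeated), not the literal strings "mailto:" and "http://".
$search = '/<a\s[^>]*href=[\'"]*[(mailto:)]*([(http:\/\/)]+[^\'"\s]+)[\'"\s][^>]*>([^<]+)<\/a>/i';
[/code]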
I don't know yet if it is flaky, though. This also rejects links without the http scheme. Thanks for the help.
[code=php]
function replace_links($text)
{
    # quoted and unquoted mailto links first, then any other quoted or unquoted href
    $search = array(
        '/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
        '/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i',
        '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
        '/<a\s[^>]*href=([^\s>]+)[^>]*>([^<]+)<\/a>/i');
    $replace = '$2 ($1)';
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=http://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
EOD;
echo replace_links($test);
[/code]
[quote]OK. One last thing! Can we ignore links without a scheme?[/quote]
[code=php]function replace_links($text, $protocol = TRUE)
{
    $search = array();
    $search[] = '/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
    $search[] = '/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i';
    if(!empty($protocol)){
        # only convert links with an explicit http:// or ftp:// scheme
        $search[] = '/<a\s[^>]*href=[\'"](http:\/\/[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=(http:\/\/[^\s>]+)[^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=[\'"](ftp:\/\/[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=(ftp:\/\/[^\s>]+)[^>]*>([^<]+)<\/a>/i';
    }else{
        # convert every link, with or without a scheme
        $search[] = '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=([^\s>]+)[^>]*>([^<]+)<\/a>/i';
    }
    $replace = '$2 ($1)';
    return preg_replace($search, $replace, $text);
}[/code]
[code=php]
function replace_links($text)
{
    $search = array(
        '/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
        '/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i',
        '/<a\s[^>]*href=[\'"](h?[ft]tp:[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
        '/<a\s[^>]*href=(h?[ft]tp:[^\s>]+)[^>]*>([^<]+)<\/a>/i',
        '/<a\s[^>]*>([^<]*)<\/a>/i'); # anything left (e.g. local links): keep only the text
    $replace = array('$2 ($1)',
                     '$2 ($1)',
                     '$2 ($1)',
                     '$2 ($1)',
                     '$1');
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
EOD;
echo replace_links($test);
[/code]
[quote]You could simplify things by moving all of the regular expressions into one regular expression, making them alternatives.[/quote]
I can't see how you could do that, because the replacement array is based on the corresponding source array. By the way, stephen, did you have an example of a multipart/related or relative email?
[quote](alt1|alt2|alt3)[/quote]
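For what it's worth, here is a minimal sketch of how alternatives can share one replacement string. It relies on the fact that a capture group which took no part in the match expands to an empty string in the replacement, so the groups can simply be concatenated. The function name and the example URLs are made up for illustration, and it only covers double-quoted, single-quoted and unquoted href values.

[code=php]
<?php
// Hypothetical helper name; a sketch, not a complete solution.
function link_to_text_sketch($html)
{
    // Three alternatives for the href value: double-quoted, single-quoted, unquoted.
    // Only one alternative can match, so only one of groups 1-3 is non-empty.
    $pattern = '/<a\s[^>]*href=(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>(.*?)<\/a>/is';
    // Groups that did not take part in the match expand to an empty string,
    // so $1$2$3 yields whichever URL was actually captured.
    return preg_replace($pattern, '$4 ($1$2$3)', $html);
}

echo link_to_text_sketch('<a href="http://a.example/">A</a> and <a href=http://b.example/>B</a>');
[/code]

With that input, both the quoted and the unquoted link come out in the "text (url)" form from a single pattern and a single replacement.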
[quote]Nogdog, one last problem I hope. It falls over with this:
More complex: <a id='test' href='http://www.w3.org/TR/html4/' title="link to w3.org">HTML <b>4.01</b> Specification</a>
The nested tag upsets it.[/quote]
[code=php]
function replace_links($text)
# convert HTML links to textual representations
# "<a href="http://a.b.com/">test</a>" -> "test (http://a.b.com/)"
{
    # define regexp components for main regexp:
    $start = '<a\s[^>]*href=';                  # start of A link
    $mail_q = '[\'"]mailto:([^\'"]+)[\'"]';     # quoted mailto
    $mail_u = 'mailto:([^\s>]+)';               # unquoted mailto
    $link_q = '[\'"](h?[ft]tp:[^\'"]+)[\'"]';   # quoted http or ftp link
    $link_u = '(h?[ft]tp:[^\s>]+)';             # unquoted http or ftp link
    $end = '[^>]*>(.+?)<\/a>';                  # end of A link (lazy: stops at the first </a>)
    # only one of groups 1-4 matches; the others are empty in the replacement
    $search = array("/$start(?:$mail_q|$mail_u|$link_q|$link_u)$end/i",
                    '/<a\s[^>]*>(.*?)<\/a>/i'); # local file or other non-match
    $replace = array('$5 ($1$2$3$4)', '$1');
    return preg_replace($search, $replace, $text);
}
### TEST ###
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
Even more complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML <b>4.01</b> Specification</a>
EOD;
echo replace_links($test);
[/code]
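One thing to note about the nested-tag case: allowing any characters in the link text means an inner tag such as <b> ends up in the output. If fully plain text is wanted, one option is to run the captured text through strip_tags() in a callback. The following is only a sketch under that assumption; the function name is made up, and it handles quoted href values only.

[code=php]
<?php
// Hypothetical helper; a sketch that assumes quoted href values.
function links_to_plain_text($html)
{
    // \1 re-matches the opening quote; lazy (.*?) stops at the first </a>;
    // the s flag lets the link text span several lines.
    $pattern = '/<a\s[^>]*href=(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';
    return preg_replace_callback($pattern, function ($m) {
        // $m[2] is the URL, $m[3] the link text, possibly containing nested tags.
        $text = trim(strip_tags($m[3]));
        return $text . ' (' . trim($m[2]) . ')';
    }, $html);
}

echo links_to_plain_text("<a id='test' href='http://www.w3.org/TR/html4/'>HTML <b>4.01</b> Specification</a>");
[/code]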
[code=php]'/<img\s[^>]*alt=[\'"]([^\'"]+)[\'"][^>]*>/i'
'/<a\s[^>]*>([^<]*)<\/a>/i'[/code]
'/<a\s[^>]*[color=red](?!http:)[/color]>([^<]*)<\/a>/i'
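[code=php]<?php
function delete_local_links($text)
{
    # The lookahead must sit right after "<a " so it can scan the attributes;
    # placed just before ">" it always succeeds and every link would be stripped.
    $search = array('/<a\s(?![^>]*href=[\'"]?(?:https?|ftp|mailto):)[^>]*>(.+?)<\/a>/i');
    $replace = array('$1');
    return preg_replace($search, $replace, $text);
}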
### TEST ###
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
Even more complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML <b>4.01</b> Specification</a>
EOD;
echo delete_local_links($test);
?>[/code]
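To show why the position of the negative lookahead matters, here is a small comparison. It is a sketch with made-up variable names and a shortened input, checking only for http: as in the pattern quoted earlier.

[code=php]
<?php
$html = '<a href="http://www.w3.org/">W3C</a> and <a href="/files/test.php">a local link</a>';

// Lookahead just before ">": the next character there is always ">", never "http:",
// so the assertion always succeeds and every link is stripped.
$late = '/<a\s[^>]*(?!http:)>(.*?)<\/a>/i';

// Lookahead right after "<a ": it scans ahead through the tag's attributes
// and rejects the match when an http: href is found.
$early = '/<a\s(?![^>]*href=["\']?http:)[^>]*>(.*?)<\/a>/i';

$late_result = preg_replace($late, '$1', $html);
$early_result = preg_replace($early, '$1', $html);

echo $late_result, "\n";   // both links reduced to their text
echo $early_result, "\n";  // only the local link reduced to its text
[/code]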