
REGEX advice! Converting html to plain text

I want to dynamically convert some html documents to plain text. The html documents contain links like this:

[code=html]<a href="http://www.domain.tld/file.ext">interesting link text</a>[/code]

and in my plain text doc I want them converted to:

[code=html]interesting link text (http://www.domain.tld/file.ext)[/code]

How would I do that?

PHP

24 Comments(s)

@bathurst_guy, Sep 23, 2005: I have a PHP script for a client's site. They wanted a CMS that let them enter plain text into a textarea and, depending on how it was entered, converted it to headings, paragraphs, tables and hyperlinks. I also needed it to convert back to plain text when they wanted to edit the information. I can give you that script if you want to have a look and see whether it helps. The only difference is that my format comes out as:

http://www.domain.tld/file.ext interesting link text

in the textarea, but it shouldn't be hard to adapt to what you want.
@bokeh (author), Sep 23, 2005: If it were coming from the textarea I would have the raw data, so things would be easier. What I am having trouble with is extracting the links from the HTML and replacing them based on their own content. If I use preg_replace I need to know the replacement when I write the script, but I want the replacement based on the part being replaced.
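For what it's worth, the replacement does not have to be fixed in advance: the replacement string passed to preg_replace() can refer back to whatever each capture group matched. A minimal sketch, assuming double-quoted href values and link text without nested tags; the function name link_to_text is made up for illustration and does not come from the thread:

[code=php]
<?php
// Hypothetical helper: rewrite each link as "text (url)" by referring
// back to the captured groups in the replacement string.
function link_to_text($html)
{
    // Group 1 = href value, group 2 = link text.
    $pattern = '/<a\s[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>/i';
    // $1 and $2 are filled in per match, so the replacement follows the input.
    return preg_replace($pattern, '$2 ($1)', $html);
}

echo link_to_text('<a href="http://www.domain.tld/file.ext">interesting link text</a>');
[/code]

If the replacement needs real logic rather than simple reordering, preg_replace_callback() takes a function that receives the matches for each hit.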
@HaganeNoKokoro, Sep 23, 2005: Here's a regex I've been playing with:[code=php]<?php
// Capture groups: 1 = quote character, 2 = href value, 3 = link text.
// Single-quoted so that \1 reaches PCRE as a backreference.
$regex = '/<a(?:.*?)href\s*=\s*(["\'])(.*?)\1(?:.*?)>(.*?)<\/a>/';
$mats = array();
preg_match_all($regex, '<a class="test" href="test.html">this is a test</a>blah<a href=\'#test\'>test</a>', $mats);
echo '<textarea rows="40" cols="80">';
print_r($mats);
echo '</textarea>';
?>[/code]
This code demonstrates the regex on some example text. It's probably not perfect, but it's a start.
@NogDog, Sep 23, 2005: Here's what I came up with, for better or worse:
[code=php]
<?php
function replace_links($text)
{
$search = '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
$replace = '\2 (\1)';
return(preg_replace($search, $replace, $text));
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
EOD;

echo replace_links($test);
?>
[/code]
@Stephen_Philbin, Sep 23, 2005: I'd go for the W3C DOM. I don't think a regexp could do it; I can't imagine a regexp accounting for every possible attribute appearing in any order. I'd rather just use the DOM to say "grab the child text node and the href attribute value and spit 'em out in this format".
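A rough sketch of what the DOM route could look like with PHP's bundled DOM extension. It assumes the input is an HTML fragment (not a full document) in an encoding loadHTML() handles by default; the function name links_to_text_dom and the wrapper div are illustrative choices, not something from the thread:

[code=php]
<?php
// Hypothetical helper: convert links to "text (url)" using the DOM
// instead of a regex, then return the remaining plain text.
function links_to_text_dom($html)
{
    // Collect parser warnings about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Copy the node list first: it is live and changes as nodes are replaced.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $a) {
        $text = $a->textContent;           // text of all descendants, tags dropped
        $href = $a->getAttribute('href');  // empty string if there is no href
        $plain = ($href === '') ? $text : $text . ' (' . $href . ')';
        $a->parentNode->replaceChild($doc->createTextNode($plain), $a);
    }

    // The wrapper div is the first div in document order.
    return $doc->getElementsByTagName('div')->item(0)->textContent;
}

echo links_to_text_dom('See <a id="x" href="http://www.w3.org/TR/html4/">HTML <b>4.01</b> Specification</a>.');
[/code]

Attribute order, quoting style and nested tags inside the link are all left to the parser here, which is the main attraction of this approach.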
@bokeh (author), Sep 23, 2005: Thanks to everyone who is trying to help with this. NogDog, I have a couple of questions for you.

First: could that be modified slightly to strip out mailto: so only the email address shows?

Example: <a href="mailto:[email protected]">Bokeh</a>

converted to

Bokeh ([email protected])

Second: How long have you been working with PCRE?

And third: roughly how much time did you spend working out the logic and putting that REGEX together, if you don't mind me asking?
@bokeh (author), Sep 23, 2005: One other area where it falls over is unquoted attribute values like this:

<a href=http://www.google.com/ alt=123>Google</a>
@NogDog, Sep 23, 2005:
[quote]Thanks to everyone who is trying to help with this. Nogdog, I have a couple of questions for you,

first: could that be modified slightly to strip out mailto: so only the email address shows.

Example: <a href="mailto:[email protected]">Bokeh</a>

converted to

Bokeh ([email protected])[/quote]

There's probably a slicker way to do it, but this seems to work:
[code=php]
<?php
function replace_links($text)
{
$search = array('/(<a\s[^>]*href=[\'"])mailto:([^\'"]+[\'"][^>]*>([^<]+)<\/a>)/i',
'/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i');
$replace = array('\1\2',
'\2 (\1)');
return(preg_replace($search, $replace, $text));
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
EOD;

echo replace_links($test);
?>
[/code]

[quote]Second: How long have you been working with PCRE?[/quote]
I've dabbled with Perl over the last several years to automate various tasks at work, some of which used Perl regexps to manipulate text. I started dabbling a bit with PHP 2-3 years ago to help maintain and upgrade some tools, and really started getting into it on my own in the last year or so. So using the PCRE regexps in PHP was not too daunting, since I'd essentially learned the basics already. I've learned a lot more over the last year from participating in this forum and trying to solve the various regexp "challenges" posted here.
[QUOTE]and third: roughly how much time did you spend working out the logic and getting that REGEX together if you don't mind me asking?[/QUOTE]
Well, I wasn't timing myself, but I'd guess about half an hour total, maybe a bit less, including one false start in the logic plus testing and debugging a few stupid PHP errors (as usual).
@bokeh (author), Sep 23, 2005: I was guessing you were asleep because of your time zone, so I had a play with the pattern and came up with this:[code=php]# Note: [(mailto:)] and [(http:\/\/)] are character classes (sets of single characters), not groups.
$search = '/<a\s[^>]*href=[\'"]*[(mailto:)]*([(http:\/\/)]+[^\'"\s]+)[\'"\s][^>]*>([^<]+)<\/a>/i';[/code]
I don't know yet if it is flaky, though. This also rejects links without the http schema. Thanks for the help.
@NogDog, Sep 23, 2005: Unemployed right now, so keeping weird hours. Actually, I'd still normally be asleep by now, but I can't pass up this puzzle.

Here's what I just came up with:
[code=php]
function replace_links($text)
{
$search = array(
'/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
'/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i',
'/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
'/<a\s[^>]*href=([^\s>]+)[^>]*>([^<]+)<\/a>/i');
$replace = '\2 (\1)';
return(preg_replace($search, $replace, $text));
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=http://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
EOD;

echo replace_links($test);
[/code]

I first tried to come up with a single regexp that handled all conditions, but that just made my head hurt. So I broke it down into the four contingencies you see above. Enjoy!
@bokeh (author), Sep 23, 2005: OK. One last thing! Can we ignore links without a schema?
@NogDog, Sep 23, 2005: [QUOTE]Ok. One last thing! Can we ignore links without a schema?[/QUOTE]
Probably, but I don't know what that means. Anyway, I'm off to bed, so I'll check back in a few hours.
@bokeh (author), Sep 23, 2005: I meant the protocol (http://). Something like the following, but it is starting to get unwieldy now:
[code=php]function replace_links($text, $protocol = TRUE)
{
$search[] = '/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
$search[] = '/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i';
if(!empty($protocol)){
$search[] = '/<a\s[^>]*href=[\'"](http:\/\/[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
$search[] = '/<a\s[^>]*href=(http:\/\/[^\s>]+)[^>]*>([^<]+)<\/a>/i';
$search[] = '/<a\s[^>]*href=[\'"](ftp:\/\/[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
$search[] = '/<a\s[^>]*href=(ftp:\/\/[^\s>]+)[^>]*>([^<]+)<\/a>/i';
}else{
$search[] = '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
$search[] = '/<a\s[^>]*href=([^\s>]+)[^>]*>([^<]+)<\/a>/i';
}
$replace = '\2 (\1)';
return(preg_replace($search, $replace, $text));
}[/code]
@NogDog, Sep 23, 2005:
[code=php]
function replace_links($text)
{
$search = array(
'/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
'/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i',
'/<a\s[^>]*href=[\'"](h?[ft]tp:[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',
'/<a\s[^>]*href=(h?[ft]tp:[^\s>]+)[^>]*>([^<]+)<\/a>/i',
'/<a\s[^>]*>([^<]*)<\/a>/i');
$replace = array('\2 (\1)',
'\2 (\1)',
'\2 (\1)',
'\2 (\1)',
'\1');
return(preg_replace($search, $replace, $text));
}
# test:
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
EOD;

echo replace_links($test);
[/code]
@bokeh (author), Sep 23, 2005: Thanks for all the time you have spent on this, NogDog. I only started playing with PHP earlier this year, so my knowledge is still pretty basic. I have just bought a couple of O'Reilly books on regular expressions, Mastering Regular Expressions and the pocket reference, but there just aren't enough hours in the day.
@felgall, Sep 23, 2005: You could simplify things by combining all of the regular expressions into one, making them alternatives:

(alt1|alt2|alt3)
@bokeh (author), Sep 23, 2005: [QUOTE]You could simplify things by moving all of the regular expressions into the one regular expression making them alternatives

(alt1|alt2|alt3)[/QUOTE]
I can't see how you could do that, because each entry in the replacement array corresponds to the pattern at the same position in the search array. By the way, Stephen, did you have an example of a multipart/related or relative email?
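One possible way around the replacement-array concern, offered as a sketch rather than a tested solution: alternation in PCRE is written with parentheses and |, and a branch-reset group (?|...) restarts the group numbering in every alternative, so a single replacement string can serve all branches. This assumes a PCRE build that supports branch reset (PCRE 7.2 or later); the function name replace_links_single and the example address are placeholders, not from the thread:

[code=php]
<?php
// Hypothetical helper: one pattern with four alternatives for the href value.
function replace_links_single($text)
{
    // Inside (?| ... ) every alternative's capture is group 1.
    $href = '(?|[\'"]mailto:([^\'"]+)[\'"]'  // quoted mailto
          . '|mailto:([^\s>]+)'              // unquoted mailto
          . '|[\'"]([^\'"]+)[\'"]'           // quoted URL
          . '|([^\s>]+))';                   // unquoted URL
    // The link text is therefore always group 2.
    $pattern = '/<a\s[^>]*href=' . $href . '[^>]*>([^<]+)<\/a>/i';
    return preg_replace($pattern, '$2 ($1)', $text);
}

echo replace_links_single('<a href=mailto:someone@example.com id=test>this</a>');
[/code]

The order of the alternatives matters: the mailto branches come first so the generic URL branches never see a mailto: value.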
@bokeh (author), Sep 24, 2005: NogDog, one last problem I hope. It falls over with this:

More complex: <a id='test' href='http://www.w3.org/TR/html4/' title="link to w3.org">HTML <b>4.01</b> Specification</a>

The nested tag upsets it.
@bokeh (author), Sep 24, 2005: Also, the following is an unrelated issue. I have an HTML document that will be sent by email. I want to strip out any <img> or <a> tags that don't have a full URL starting with http, and just leave either the link text or the alt text.
@NogDog, Sep 24, 2005: [QUOTE]Nogdog, one last problem I hope, it falls over with this:

More complex: <a id='test' href='http://www.w3.org/TR/html4/' title="link to w3.org">HTML <b>4.01</b> Specification</a>

The nested tag upsets it.[/QUOTE]

Latest enhancements for the above situation, along with general streamlining and some commenting:
[code=php]
function replace_links($text)
# convert HTML links to textual representations
# "<a href="http://a.b.com/">test</a>" -> "test (http://a.b.com/)"
{
# define regexp components for main regexp:
$start = '<a\s[^>]*href='; # start of A link
$mail_q = '[\'"]mailto:([^\'"]+)[\'"]'; # quoted mailto
$mail_u = 'mailto:([^\s>]+)'; # unquoted mailto
$link_q = '[\'"](h?[ft]tp:[^\'"]+)[\'"]'; # quoted http or ftp link
$link_u = '(h?[ft]tp:[^\s>]+)'; # unquoted http or ftp link
$end = '[^>]*>(.+?)<\/a>'; # end of A link (lazy, so two links on one line stay separate)

# only one of groups 1-4 matches per link; group 5 is the link text
$search = array("/$start(?:$mail_q|$mail_u|$link_q|$link_u)$end/i",
'/<a\s[^>]*>(.*?)<\/a>/i'); # local file or other non-match
$replace = array('\5 (\1\2\3\4)', '\1');
return(preg_replace($search, $replace, $text));
}
### TEST ###
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
Even more complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML <b>4.01</b> Specification</a>
EOD;

echo replace_links($test);
[/code]
@bokeh (author), Sep 24, 2005: Thanks for that. It seems to work flawlessly. Don't worry about the last thing I posted above, as I will have a go myself and come back for help when I get stuck. Thanks again.
@bokeh (author), Sep 24, 2005: OK! I have these two REGEXes:
[code=php]'/<img\s[^>]*alt=[\'"]([^\'"]+)[\'"][^>]*>/i'
'/<a\s[^>]*>([^<]*)<\/a>/i'[/code]

The first finds all image tags and extracts the alt text; the second matches all links and extracts the link text. How do I modify them so that they don't match if those tags contain 'http:'?
@NogDog, Sep 24, 2005: I think a negative lookahead assertion (don't that sound fancy!) is needed. The added part is (?!http:):

'/<a\s[^>]*(?!http:)>([^<]*)<\/a>/i'

See the "Assertions" section of http://www.php.net/manual/en/reference.pcre.pattern.syntax.php for more info. I've only played with them a couple of times, so no guarantees.
@bokeh (author), Sep 24, 2005: Hmm! I tried that and it still matches every single link. Very strange! Here is the code:
[code=php]<?php
function delete_local_links($text)
{
$search = array("/<a\s[^>]*(?!http:)>(.+)<\/a>/i");
$replace = array('\1');
return(preg_replace($search, $replace, $text));
}
}


### TEST ###
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
Even more complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML <b>4.01</b> Specification</a>
EOD;
echo delete_local_links($test);
?>[/code]
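A likely explanation, with a hedged sketch of one fix: in the pattern above the lookahead is tested only at the position just before the closing ">", where the next characters can never be "http:", so the assertion always succeeds. Moving the lookahead to the start of the tag and letting it scan up to the ">" rejects any tag that contains "http:" anywhere. The function name delete_local_links_sketch is made up here, and treating ftp: and mailto: links as "local" is an assumption based on the stated requirement:

[code=php]
<?php
// Hypothetical variant: strip <a> tags whose attributes do not contain
// "http:", keeping only their link text; other links are left untouched.
function delete_local_links_sketch($text)
{
    // (?![^>]*http:) looks ahead from just after "<a " across the whole
    // tag and fails the match if "http:" appears before the closing ">".
    // (.*?) is lazy and /s lets it span newlines and nested tags.
    $search = '/<a\s(?![^>]*http:)[^>]*>(.*?)<\/a>/is';
    return preg_replace($search, '$1', $text);
}

echo delete_local_links_sketch('Here is <a href="/files/test.php">a local link</a> and <a href="http://www.w3.org/">W3C</a>.');
[/code]

The same idea should carry over to the <img> pattern by placing (?![^>]*http:) right after "<img\s".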