REGEX advice! Converting html to plain text

@bokehSep 23.2005

I want to dynamically convert some html documents to plain text. The html documents contain links like this:

[code=html]<a href="http://www.domain.tld/file.ext">interesting link text</a>[/code]

and in my plain text doc I want them converted to:

[code=html]interesting link text (http://www.domain.tld/file.ext)[/code]

How would I do that?


@bathurst_guySep 23.2005 — #I have a PHP script for a client's site. They wanted a CMS that let them enter plain text into a textarea and, depending on how it was entered, converted it to headings, paragraphs, tables and hyperlinks. I also needed it to convert back to plain text if they wanted to edit the information. I can give you that script if you want to have a look and see if it helps. The only difference is that my format comes out as:

http://www.domain.tld/file.ext interesting link text

in the textarea, but it shouldn't be hard to adapt to what you want.

@bokehauthorSep 23.2005 — #If it was coming from the textarea I would have the raw data, so things would be easier. What I am having trouble with is extracting the links from the html and replacing them based on their own content. If I use preg_replace I need to know the replacement when I write the script, but I want the replacement based on the part being replaced.

@HaganeNoKokoroSep 23.2005 — #Here's a regex I've been playing with:

[code=php]<?php
// Captures: 1 = quote character, 2 = href value, 3 = link text.
// Single-quoted so that \1 stays a backreference instead of an octal escape.
$regex = '/<a(?:.*?)href\s*=\s*(["\'])(.*?)\1(?:.*?)>(.*?)<\/a>/';
$mats = array();
preg_match_all($regex, '<a class="test" href="test.html">this is a test</a>blah<a href=\'#test\'>test</a>', $mats);
echo '<textarea rows="40" cols="80">';
print_r($mats);
echo '</textarea>';
?>[/code]

This code demonstrates the regex on some example text. It's probably not perfect, but it's a start.

@NogDogSep 23.2005 — #Here's what I came up with, for better or worse:

[code=php]
<?php
function replace_links($text)
{
    // \1 = href value, \2 = link text
    $search = '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
    $replace = '\2 (\1)';
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
EOD;

echo replace_links($test);
?>
[/code]

@Stephen_PhilbinSep 23.2005 — #I'd go in for using the W3C DOM. I don't think a regexp could do it. I can't imagine a regexp would be able to account for unique instances of many possible attributes in any order. I'd rather just use the DOM to say "grab the child text node and the href attribute value and spit 'em out in this format".
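
A minimal sketch of the DOM approach suggested above, assuming PHP's bundled DOM extension (`DOMDocument`) is available. The function name `links_to_text_dom` is made up for illustration and is not code from the thread. Real-world markup may need extra care with character encoding.

```php
<?php
// Hypothetical helper: rewrites each <a href="..."> as "link text (href)"
// using the DOM extension instead of a regex, then returns the plain text.
function links_to_text_dom($html)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for sloppy HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array so replacing nodes is safe.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $a) {
        $text = $a->textContent;          // includes the text of nested tags such as <b>
        $href = $a->getAttribute('href'); // empty string when there is no href
        if (strpos($href, 'mailto:') === 0) {
            $href = substr($href, 7);     // show only the address
        }
        $plain = ($href === '') ? $text : "$text ($href)";
        $a->parentNode->replaceChild($doc->createTextNode($plain), $a);
    }
    // textContent of the root element is the document with all markup removed.
    return $doc->documentElement->textContent;
}
```

Calling `links_to_text_dom()` on a fragment returns the text with every link rewritten. Attribute order and quoting style are handled by the parser, so they do not matter here. Note that this sketch also strips all other tags, which may or may not be what you want.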

@bokehauthorSep 23.2005 — #Thanks to everyone who is trying to help with this. Nogdog, I have a couple of questions for you,

first: could that be modified slightly to strip out mailto: so only the email address shows.

Example: <a href="mailto:[email protected]">Bokeh</a>

converted to

Bokeh ([email protected])

Second: How long have you been working with PCRE?

and third: roughly how much time did you spend working out the logic and getting that REGEX together if you don't mind me asking?

@bokehauthorSep 23.2005 — #One other area it falls over is non quoted strings like this:

<a href=http://www.google.com/ alt=123>Google</a>

@NogDogSep 23.2005 — #[quote]Thanks to everyone who is trying to help with this. Nogdog, I have a couple of questions for you,

first: could that be modified slightly to strip out mailto: so only the email address shows.

Example: <a href="mailto:[email protected]">Bokeh</a>

converted to

Bokeh ([email protected])[/quote]
There's probably a slicker way to do it, but this seems to work:

[code=php]
<?php
function replace_links($text)
{
    // First pattern strips "mailto:" from the href, second converts the link.
    $search = array('/(<a\s[^>]*href=[\'"])mailto:([^\'"]+[\'"][^>]*>([^<]+)<\/a>)/i',
                    '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i');
    $replace = array('\1\2',
                     '\2 (\1)');
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
EOD;

echo replace_links($test);
?>
[/code]

[quote]Second: How long have you been working with PCRE?[/quote]
I've dabbled with Perl over the last several years for automating various tasks at work, for some of which I used Perl regexps to manipulate text. I started dabbling a bit with PHP 2-3 years ago to help maintain/upgrade some tools, and really started getting into it on my own in the last year or so. So using the PCRE regexps in PHP was not too daunting since I'd essentially learned the basics of them already. I've learned a lot more over the last year since I've been participating in this forum and trying to solve various regexp "challenges" posted here.
[quote]and third: roughly how much time did you spend working out the logic and getting that REGEX together if you don't mind me asking?[/quote]
Well, I wasn't timing myself, but I'd guess about a half hour total, maybe a bit less, including one false start in the logic plus testing and debugging a few stupid PHP errors (as usual).

@bokehauthorSep 23.2005 — #I was guessing you were asleep because of your time zone so I had a play with the pattern and came up with this:

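[code=php]// Note: [(mailto:)] and [(http:\/\/)] are character classes, not groups,
// so they match any run of those characters rather than the literal words.
$search = '/<a\s[^>]*href=[\'"]*[(mailto:)]*([(http:\/\/)]+[^\'"\s]+)[\'"\s][^>]*>([^<]+)<\/a>/i';[/code]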

I don't know yet if it is flaky, though. This also rejects links without the http schema. Thanks for the help.

@NogDogSep 23.2005 — #Unemployed right now, so keeping weird hours. Actually, I'd still normally be asleep by now, but I can't pass up this puzzle.

Here's what I just came up with:

[code=php]
function replace_links($text)
{
    $search = array(
        '/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',  # quoted mailto
        '/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i',            # unquoted mailto
        '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',         # quoted link
        '/<a\s[^>]*href=([^\s>]+)[^>]*>([^<]+)<\/a>/i');                  # unquoted link
    $replace = '\2 (\1)';
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="http://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=http://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
EOD;

echo replace_links($test);
[/code]

I first tried to come up with a single regexp that handled all conditions, but that just made my head hurt. So I broke it down into the four contingencies you see above. Enjoy!

@bokehauthorSep 23.2005 — #Ok. One last thing! Can we ignore links without a schema?

@NogDogSep 23.2005 — #[quote]Ok. One last thing! Can we ignore links without a schema?[/quote]
Probably, but I don't know what that means. Anyway, I'm off to bed, so I'll check back in a few hours.

@bokehauthorSep 23.2005 — #I meant the protocol (http://). Something like the following but it is starting to get unwieldy now:

[code=php]function replace_links($text, $protocol = TRUE)
{
    $search = array();
    $search[] = '/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
    $search[] = '/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i';
    if (!empty($protocol)) {
        // only links that start with an explicit http:// or ftp://
        $search[] = '/<a\s[^>]*href=[\'"](http:\/\/[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=(http:\/\/[^\s>]+)[^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=[\'"](ftp:\/\/[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=(ftp:\/\/[^\s>]+)[^>]*>([^<]+)<\/a>/i';
    } else {
        // any href value
        $search[] = '/<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i';
        $search[] = '/<a\s[^>]*href=([^\s>]+)[^>]*>([^<]+)<\/a>/i';
    }
    $replace = '\2 (\1)';
    return preg_replace($search, $replace, $text);
}[/code]

@NogDogSep 23.2005 — #

[code=php]
function replace_links($text)
{
    $search = array(
        '/<a\s[^>]*href=[\'"]mailto:([^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',      # quoted mailto
        '/<a\s[^>]*href=mailto:([^\s>]+)[^>]*>([^<]+)<\/a>/i',                # unquoted mailto
        '/<a\s[^>]*href=[\'"](h?[ft]tp:[^\'"]+)[\'"][^>]*>([^<]+)<\/a>/i',    # quoted http or ftp link
        '/<a\s[^>]*href=(h?[ft]tp:[^\s>]+)[^>]*>([^<]+)<\/a>/i',              # unquoted http or ftp link
        '/<a\s[^>]*>([^<]*)<\/a>/i');                                         # anything left: keep text only
    $replace = array('\2 (\1)',
                     '\2 (\1)',
                     '\2 (\1)',
                     '\2 (\1)',
                     '\1');
    return preg_replace($search, $replace, $text);
}
# test:
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
EOD;

echo replace_links($test);
[/code]

@bokehauthorSep 23.2005 — #Thanks for all the time you have spent on this, Nogdog. I only started playing with PHP earlier this year so my knowledge is still pretty basic. I have just bought a couple of O'Reilly books on regex, Mastering Regular Expressions and the pocket reference, but there just aren't enough hours in the day.

@felgallSep 23.2005 — #You could simplify things by moving all of the regular expressions into one regular expression, making them alternatives:

(alt1|alt2|alt3)

@bokehauthorSep 23.2005 — #[quote]You could simplify things by moving all of the regular expressions into one regular expression, making them alternatives:

(alt1|alt2|alt3)[/quote]
I can't see how you could do that, because each replacement in the array is tied to the corresponding search pattern. By the way, Stephen, did you have an example of a multipart/related or relative email?
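
One way the alternation idea might work without a parallel replacement array is `preg_replace_callback`, which hands each match to a function that builds the replacement. This is only a sketch: the names `replace_links_cb` and `format_link` and the single combined pattern are assumptions for illustration, not code from the thread.

```php
<?php
// Hypothetical sketch: one pattern whose href part has three alternatives
// (double-quoted, single-quoted, unquoted); the replacement is built per match.
function replace_links_cb($text)
{
    $pattern = '/<a\s[^>]*href=(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>(.*?)<\/a>/is';
    return preg_replace_callback($pattern, 'format_link', $text);
}

function format_link($m)
{
    // At most one of groups 1-3 is non-empty, so concatenating them yields the href.
    $href = $m[1] . $m[2] . $m[3];
    // Show only the address for mailto: links.
    $href = preg_replace('/^mailto:/i', '', $href);
    // Group 4 is the link text, which may still contain nested tags.
    return $m[4] . ' (' . $href . ')';
}
```

Because the callback sees every captured group, it does not matter which alternative matched. The trade-off is that the formatting logic moves out of the replacement string and into PHP code.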

@bokehauthorSep 24.2005 — #Nogdog, one last problem I hope, it falls over with this:

More complex: <a id='test' href='http://www.w3.org/TR/html4/' title="link to w3.org">HTML <b>4.01</b> Specification</a>

The nested tag upsets it.

@bokehauthorSep 24.2005 — #Also, the following is an unrelated issue. I have an html document that will be sent by email. I want to strip out any <img> or <a> tags that don't have a full URL starting with http and leave just the link text or the alt text.

@NogDogSep 24.2005 — #[quote]Nogdog, one last problem I hope, it falls over with this:

More complex: <a id='test' href='http://www.w3.org/TR/html4/' title="link to w3.org">HTML <b>4.01</b> Specification</a>

The nested tag upsets it.[/quote]
Latest enhancements for above situation along with general streamlining and some commenting:

[code=php]
function replace_links($text)
# convert HTML links to textual representations
# "<a href="http://a.b.com/">test</a>"  -> "test (http://a.b.com/)"
{
    # define regexp components for main regexp:
    $start  = '<a\s[^>]*href=';                    # start of A link
    $mail_q = '[\'"]mailto:([^\'"]+)[\'"]';        # quoted mailto
    $mail_u = 'mailto:([^\s>]+)';                  # unquoted mailto
    $link_q = '[\'"](h?[ft]tp:[^\'"]+)[\'"]';      # quoted http or ftp link
    $link_u = '(h?[ft]tp:[^\s>]+)';                # unquoted http or ftp link
    $end    = '[^>]*>(.+?)<\/a>';                  # end of A link (lazy: stops at the first </a>)

    # only one of groups 1-4 can match, so \1\2\3\4 is whichever address was found
    $search = array("/$start(?:$mail_q|$mail_u|$link_q|$link_u)$end/i",
                    '/<a\s[^>]*>(.*?)<\/a>/i');    # local file or other non-match
    $replace = array('\5 (\1\2\3\4)', '\1');
    return preg_replace($search, $replace, $text);
}
### TEST ###
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
Even more complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML <b>4.01</b> Specification</a>
EOD;

echo replace_links($test);
[/code]

@bokehauthorSep 24.2005 — #Thanks for that. It seems to work flawlessly. Don't worry about the last thing I posted above, as I will have a go myself and come back for help when I get stuck. Thanks again.

@bokehauthorSep 24.2005 — #Ok! I have these two REGEXes:

[code=php]'/<img\s[^>]*alt=[\'"]([^\'"]+)[\'"][^>]*>/i'
'/<a\s[^>]*>([^<]*)<\/a>/i'[/code]

The first finds all image tags and extracts the alt text; the second matches all links and extracts the link text. How do I modify them so that they don't find a match if those tags contain 'http:'?

@NogDogSep 24.2005 — #I think a negative lookahead assertion (don't that sound fancy!) is needed:

'/<a\s[^>]*[color=red](?!http:)[/color]>([^<]*)<\/a>/i'

See the "Assertions" section of http://www.php.net/manual/en/reference.pcre.pattern.syntax.php for more info. I've only played with them a couple times, so no guarantees.

@bokehauthorSep 24.2005 — #Hmm! I tried that and it still matches every single link. Very strange! Here is the code:

[code=php]<?php
function delete_local_links($text)
{
    $search = array('/<a\s[^>]*(?!http:)>(.+)<\/a>/i');
    $replace = array('\1');
    return preg_replace($search, $replace, $text);
}


### TEST ###
$test = <<<EOD
This is a test. <a href="ftp://www.google.com/">Google</a>
More complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML 4.01 Specification</a>
And <a href="mailto:[email protected]">this</a> is an email.
This is a test. <a href=http://www.google.com/>Google</a>
More complex: <a id=test href=ftp://www.w3.org/TR/html4/
title="link to w3.org">HTML 4.01 Specification</a>
And <a href=mailto:[email protected] id=test>this</a> is an email.
Here is <a href="/files/test.php">a local link</a> to a file.
Even more complex: <a id='test' href='http://www.w3.org/TR/html4/'
title="link to w3.org">HTML <b>4.01</b> Specification</a>
EOD;
echo delete_local_links($test);
?>[/code]
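
A likely reason every link still matches: a lookahead is tested at the spot where it sits in the pattern. Placed after `[^>]*`, it is checked immediately before the closing `>`, where the next characters are never `http:`, so the assertion always succeeds. A possible fix, sketched below under that assumption, is to put the lookahead right after the tag name and let it scan the whole opening tag. The function name `delete_non_http_links` is hypothetical.

```php
<?php
// Hypothetical variant of delete_local_links(): unwraps only <a> and <img>
// tags whose opening tag does not contain "http:" anywhere.
function delete_non_http_links($text)
{
    $search = array(
        // (?![^>]*http:) fails the match if "http:" occurs before the tag's closing ">".
        '/<a\s(?![^>]*http:)[^>]*>(.*?)<\/a>/is',
        '/<img\s(?![^>]*http:)[^>]*alt=[\'"]([^\'"]+)[\'"][^>]*>/i');
    // Keep the link text or the alt text, drop the tag itself.
    $replace = array('\1', '\1');
    return preg_replace($search, $replace, $text);
}
```

With this version a link such as `<a href="/files/test.php">a local link</a>` is reduced to its text, while tags that contain `http:` are left untouched. Links using `ftp:` or `mailto:` are also unwrapped, since the check looks only for `http:`.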
