Regex matching unicode values

@tarsus, May 19, 2009

I’m trying to write a script that will take a string and replace some of the more common special characters that show up when someone copies and pastes text, like curly quotes and em dashes.

Please note, I am aware of htmlspecialchars(), and this will not work for me. I do not want all HTML special characters to be replaced, just select characters. Therefore, I need a way to match them specifically.

Observe this simple test. The Unicode value for the curly left quote is 8220, and the PHP regex syntax for matching Unicode values is supposed to be \x{0000} with the /u modifier on the regex.

[code=php]
$str = "Here's a quote: “My quote”";
echo (preg_match('/\x{8220}/u', $str)) ? 'match' : 'no match';
[/code]

With this simple test, I get “no match.” Am I misunderstanding something about the way the literal character can be matched with the unicode value? (I know I could simply use /“/, but I’d prefer a method that doesn’t require special characters in the code itself.)


@tarsus (author), May 19, 2009: I just discovered that 8220 is the decimal code for the left double quote. What's needed is the hexadecimal value, which should be 201C. I tried using that in the regex instead; still no match.
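
The decimal/hexadecimal mix-up can be checked directly. This is a minimal sketch, assuming the test string is built from explicit UTF-8 bytes so that the result does not depend on how the script file is saved; the variable names are illustrative only.

```php
<?php
// 8220 is the decimal code point; \x{...} in a PCRE pattern expects hexadecimal.
$hex = strtoupper(dechex(8220)); // "201C"

// "\xE2\x80\x9C" and "\xE2\x80\x9D" are the UTF-8 bytes for U+201C and U+201D.
$str = "Here's a quote: \xE2\x80\x9CMy quote\xE2\x80\x9D";

$wrong = preg_match('/\x{8220}/u', $str);         // looks for U+8220, a CJK ideograph
$right = preg_match('/\x{' . $hex . '}/u', $str); // looks for U+201C

echo $hex, ': ', $wrong ? 'match' : 'no match', ' / ', $right ? 'match' : 'no match', "\n";
```

With a UTF-8 subject, only the hexadecimal form finds the quote.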

@Malgrim, May 19, 2009:

[code=php]
$str = "Here's a quote: “My quote”";
echo (preg_match('/\x{201C}/u', $str)) ? 'match' : 'no match';
[/code]

works fine for me ...

@tarsus (author), May 19, 2009: !? It does?

I have double checked; it still doesn't match for me. What version of PHP is your platform? Mine is 5.2.1.

Just to be absolutely clear: When you say it "worked," you do mean you got the text "match" instead of "no match," right?

@Malgrim, May 19, 2009: PHP Version 5.2.8

and yes, that's what I meant.

The problem is much more likely the encoding of the source code file: the file itself should be saved as UTF-8, or you might have to convert the string before matching.
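
One way to tell whether encoding is the culprit: with the /u modifier, preg_match() returns false (an error) rather than 0 when the subject is not valid UTF-8, and the ternary in the test script prints "no match" in both cases. This is a minimal sketch, assuming the file was saved as Windows-1252 so that the curly quotes became the single bytes 0x93 and 0x94.

```php
<?php
// "\x93" and "\x94" stand in for curly double quotes saved as Windows-1252
// (an assumption about what a non-UTF-8 editor would write to disk).
$str = "Here's a quote: \x93My quote\x94";

$result = preg_match('/\x{201C}/u', $str);
$badUtf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($result);  // bool(false): the match never ran
var_dump($badUtf8); // bool(true): the subject is not valid UTF-8
```

Checking for a strict `false` before the ternary distinguishes "invalid input" from a genuine non-match.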

@tarsus (author), May 19, 2009: How do you do that?

@Mindzai, May 19, 2009: It should be a setting in your editor's preferences, although most default to UTF-8 these days. Which are you using? If it's the string itself, you can use utf8_encode().

@tarsus (author), May 19, 2009: For now, the string I'm testing with is simply defined in the script, as you see it above. I got the string by typing it in Microsoft Word and copying/pasting it.

I changed the script to this:

[code=php]
$str = utf8_encode("Here's a quote: “My quote”");
echo (preg_match('/\x{201C}/u', $str)) ? 'match' : 'no match';
[/code]

And still get "no match."
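
A likely reason utf8_encode() does not help here: it treats its input as ISO-8859-1, where byte 0x93 is a control character rather than a curly quote. The sketch below uses iconv() with 'ISO-8859-1' as a stand-in for utf8_encode() (an assumption, made so that it also runs on PHP versions where utf8_encode() is deprecated), and it assumes the pasted quote reached the script as the Windows-1252 byte 0x93.

```php
<?php
$raw = "\x93"; // left curly quote as a Windows-1252 byte (assumed input)

// The same conversion utf8_encode() performs: ISO-8859-1 to UTF-8.
$converted = iconv('ISO-8859-1', 'UTF-8', $raw);
$found = preg_match('/\x{201C}/u', $converted);

// The result is U+0093 (bytes c2 93), a control character,
// not U+201C (bytes e2 80 9c), so the pattern cannot match.
echo bin2hex($converted), ' ', $found ? 'match' : 'no match', "\n";
```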

@NogDog, May 19, 2009:
[quote]For now, the string I'm testing with is simply defined in the script, as you see it above. I got the string by typing it in Microsoft Word and copying/pasting it....[/quote]
It may not be UTF-8 then; it might be in one of M$'s own character sets. I used [url=http://www.charles-reace.com/blog/2008/10/15/filtering-ms-word-text/]this solution[/url] on one web site I worked on where people were copying M$ Word resumes into a form's textarea.
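
If the text really is in a Microsoft code page, converting it explicitly before matching is another option. This is a minimal sketch, assuming the input is Windows-1252 and that the iconv extension is available (mb_convert_encoding() would be an alternative).

```php
<?php
// Windows-1252 bytes for the curly double quotes (assumed input).
$raw = "Here's a quote: \x93My quote\x94";

// Convert to UTF-8 so that the /u modifier can decode the subject.
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);

echo preg_match('/\x{201C}/u', $utf8) ? 'match' : 'no match', "\n";
```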

@Mindzai, May 20, 2009: Very handy, thanks.

I made a slight modification just to make it a bit easier to see what is replaced by what. Hope you don't mind me posting it here (I'll remove it if so).

EDIT: Well, I did, but the forum seems to have messed it up by converting the replacement characters. You get the idea, though!

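[code=php]
<?php
// Replaces HTML-significant characters and common "smart" punctuation bytes
// with HTML entities. It works on single bytes, so it suits single-byte input
// (Mac Roman or Windows-1252); the same bytes occur inside UTF-8 sequences.
function filterText($text)
{
    $map = array(
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        // Mac Roman bytes
        chr(212) => '&#8216;', // left single quote
        chr(213) => '&#8217;', // right single quote
        chr(210) => '&#8220;', // left double quote
        chr(211) => '&#8221;', // right double quote
        chr(208) => '&#8211;', // en dash
        chr(209) => '&#8212;', // em dash
        chr(201) => '&#8230;', // ellipsis
        // Windows-1252 bytes
        chr(145) => '&#8216;', // left single quote
        chr(146) => '&#8217;', // right single quote
        chr(147) => '&#8220;', // left double quote
        chr(148) => '&#8221;', // right double quote
        chr(150) => '&#8211;', // en dash
        chr(151) => '&#8212;', // em dash
        chr(133) => '&#8230;'  // ellipsis
    );

    return str_replace(array_keys($map), array_values($map), $text);
}
?>
[/code]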
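
The byte-based map above targets single-byte encodings. For text that is already UTF-8, the approach from the original question (code-point escapes with the /u modifier, no special characters in the source) can be sketched as follows. The function name `normalizePunctuation` and the choice of plain ASCII replacements are assumptions made for illustration, not something taken from the thread.

```php
<?php
// Hypothetical helper: replaces selected typographic characters in UTF-8 text
// with plain ASCII equivalents, using only code-point escapes in the patterns.
function normalizePunctuation($text)
{
    $map = array(
        '/[\x{2018}\x{2019}]/u' => "'",   // single curly quotes
        '/[\x{201C}\x{201D}]/u' => '"',   // double curly quotes
        '/\x{2013}/u'           => '-',   // en dash
        '/\x{2014}/u'           => '--',  // em dash
        '/\x{2026}/u'           => '...'  // ellipsis
    );

    return preg_replace(array_keys($map), array_values($map), $text);
}

// The input is written as explicit UTF-8 bytes so the file encoding does not matter.
$clean = normalizePunctuation("\xE2\x80\x9CMy quote\xE2\x80\x9D \xE2\x80\x94 done\xE2\x80\xA6");
echo $clean, "\n"; // "My quote" -- done...
```

Because the patterns are plain ASCII, the script behaves the same whether the file is saved as UTF-8 or as a single-byte encoding; only the input string needs to be valid UTF-8.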

@NogDog, May 20, 2009:
[QUOTE]Very handy, thanks.

I made a slight modification just to make it a bit easier to see what is replaced by what. Hope you don't mind me posting it here (I'll remove it if so).
[/QUOTE]

Any code I post on my blog is freeware, as far as I'm concerned -- that's why it's there. Besides, I...uh...re-used it from Chris Shiflett's blog, and I think that code was submitted in a reader's comment in [i]his[/i] blog. So, long story short, I feel no particular ownership for that code.
