
Regex matching unicode values

I’m trying to write a script that will take a string and replace some of the more common special characters that show up when someone copies and pastes text, like curly quotes and em dashes.

Please note, I am aware of htmlspecialchars(), and this will not work for me. I do not want all HTML special characters to be replaced, just select characters. Therefore, I need a way to match them specifically.

Observe this simple test. The Unicode value for the curly left quote is 8220, and the PHP regex syntax for matching Unicode values is supposed to be \x{0000} with the /u modifier on the regex.

[code=php]
$str = "Here's a quote: “My quote”";
echo (preg_match('/\x{8220}/u', $str)) ? 'match' : 'no match';[/code]

With this simple test, I get “no match.” Am I misunderstanding something about the way the literal character can be matched with the unicode value? (I know I could simply use /“/, but I’d prefer a method that doesn’t require special characters in the code itself.)

PHP

10 Comments

@tarsus (author), May 19, 2009: I just discovered that 8220 is the decimal code for the left double quote. What's needed is the hexadecimal, which should be 201C. I tried using that in the regex instead; still no match.
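The decimal-to-hex step can be checked with a minimal sketch. The subject string here is written with UTF-8 byte escapes, which is an assumption made so that the result does not depend on how the script file is saved; $decimal, $hex and $pattern are illustrative names only.

[code=php]
<?php
// 8220 is the decimal code point; PCRE's \x{...} escape expects hexadecimal.
$decimal = 8220;
$hex = strtoupper(dechex($decimal)); // "201C"

// Single quotes keep the backslash intact, so PCRE sees \x{201C}.
$pattern = '/\x{' . $hex . '}/u';

// "\xE2\x80\x9C" and "\xE2\x80\x9D" are the UTF-8 bytes for U+201C and U+201D.
$str = "Here's a quote: \xE2\x80\x9CMy quote\xE2\x80\x9D";

echo preg_match($pattern, $str) ? 'match' : 'no match'; // match
[/code]

If this prints "match" while the pasted-in version does not, the pattern is probably fine and the bytes in the string are the likelier culprit.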
@Malgrim, May 19, 2009:
[code=php]
$str = "Here's a quote: “My quote”";
echo (preg_match('/\x{201C}/u', $str)) ? 'match' : 'no match';
[/code]

works fine for me ...
@tarsus (author), May 19, 2009: !? It does?

I have double-checked; it still doesn't match for me. What version of PHP are you running? Mine is 5.2.1.

Just to be absolutely clear: When you say it "worked," you do mean you got the text "match" instead of "no match," right?
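One way to tell a genuine non-match from a failed call: preg_match() returns 0 when nothing matches but false on an error, and the ternary in the test prints "no match" for both. The sketch below separates the two cases. describeMatch() is a hypothetical helper, and the first test string assumes the quotes were stored as Windows-1252 bytes.

[code=php]
<?php
// Hypothetical helper: reports what preg_match() actually returned.
function describeMatch($pattern, $subject)
{
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        // With the /u modifier, a subject that is not valid UTF-8 is an error.
        return preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'error: subject is not valid UTF-8'
            : 'error: code ' . preg_last_error();
    }
    return $result === 1 ? 'match' : 'no match';
}

// Assumed Windows-1252 bytes for the curly quotes (0x93 and 0x94).
echo describeMatch('/\x{201C}/u', "\x93My quote\x94"), "\n";
// The same text as UTF-8 bytes.
echo describeMatch('/\x{201C}/u', "\xE2\x80\x9CMy quote\xE2\x80\x9D"), "\n";
[/code]

Under those assumptions the first call reports the UTF-8 error and the second reports a match.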
@Malgrim, May 19, 2009: PHP Version 5.2.8

and yes, that's what I meant.

The problem is much more likely the page encoding (of the source code file): it should itself be UTF-8 encoded, or you might have to convert the string before matching.
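A rough sketch of converting the string before matching, assuming the mbstring extension is available and that anything which is not already valid UTF-8 came from a Windows-1252 source. toUtf8() is a hypothetical helper name, not something from this thread.

[code=php]
<?php
// Hypothetical helper: returns $text as UTF-8. Assumes that input which is
// not already valid UTF-8 is Windows-1252 (an assumption, not a detection).
function toUtf8($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Curly double quotes as Windows-1252 bytes (0x93 and 0x94).
$cp1252 = "\x93My quote\x94";
$utf8 = toUtf8($cp1252);

echo preg_match('/\x{201C}/u', $utf8) ? 'match' : 'no match'; // match
[/code]

Checking validity first avoids double-encoding text that is already UTF-8.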
@tarsus (author), May 19, 2009: How do you do that?
@Mindzai, May 19, 2009: It should be a setting in your editor's preferences, although most default to UTF-8 these days. Which one are you using? If it's the string itself, you can use utf8_encode(), which converts ISO-8859-1 text to UTF-8.
@tarsus (author), May 19, 2009: For now, the string I'm testing with is simply defined in the script, as you see it above. I got the string by typing it in Microsoft Word and copying/pasting it.

I changed the script to this:
[code=php]
$str = utf8_encode("Here's a quote: “My quote”");
echo (preg_match('/\x{201C}/u', $str)) ? 'match' : 'no match';[/code]


And still get "no match."
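A possible explanation, sketched under the assumption that the pasted quote ended up in the file as the Windows-1252 byte 0x93: utf8_encode() treats its input as ISO-8859-1, where 0x93 is a control character rather than a curly quote. The sketch uses mb_convert_encoding() from the mbstring extension to show both conversions side by side; the variable names are illustrative.

[code=php]
<?php
// Assumption: the left double quote was stored as the Windows-1252 byte 0x93.
$byte = "\x93";

// Same conversion utf8_encode() performs: ISO-8859-1 to UTF-8.
// Here 0x93 becomes the control character U+0093.
$fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Converting from Windows-1252 yields U+201C instead.
$fromCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($fromLatin1), "\n"; // c293
echo bin2hex($fromCp1252), "\n"; // e2809c

echo preg_match('/\x{201C}/u', $fromLatin1), "\n"; // 0
echo preg_match('/\x{201C}/u', $fromCp1252), "\n"; // 1
[/code]

Both results are valid UTF-8, so the regex runs without error in each case; only the Windows-1252 conversion produces the character the pattern is looking for.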
@NogDog, May 19, 2009:
[quote]For now, the string I'm testing with is simply defined in the script, as you see it above. I got the string by typing it in Microsoft Word and copying/pasting it....[/quote]
It may not be UTF-8 then; it might be in one of M$'s own character sets. I used [url=http://www.charles-reace.com/blog/2008/10/15/filtering-ms-word-text/]this solution[/url] on one web site I worked on where people were copying M$ Word resumes into a form's textarea.
@Mindzai, May 20, 2009: ^^ Very handy, thanks.

I made a slight modification just to make it a bit easier to see what is replaced by what. I hope you don't mind me posting it here (I'll remove it if so).

EDIT: Well, I did, but the forum seems to have messed it up by converting the replacement characters. You get the idea, though!

[code=php]
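<?php
// Replaces HTML-significant characters and single-byte "smart" punctuation
// with HTML entities. Assumes $text is single-byte text, not UTF-8.
function filterText($text)
{
    $map = array(
        // '&' must come first so the entities added below are not re-escaped
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        // Mac OS Roman bytes
        chr(212) => '&#8216;', // left single quote
        chr(213) => '&#8217;', // right single quote
        chr(210) => '&#8220;', // left double quote
        chr(211) => '&#8221;', // right double quote
        chr(209) => '&#8212;', // em dash
        chr(208) => '&#8211;', // en dash
        chr(201) => '&#8230;', // ellipsis
        // Windows-1252 bytes
        chr(145) => '&#8216;', // left single quote
        chr(146) => '&#8217;', // right single quote
        chr(147) => '&#8220;', // left double quote
        chr(148) => '&#8221;', // right double quote
        chr(151) => '&#8212;', // em dash
        chr(150) => '&#8211;', // en dash
        chr(133) => '&#8230;'  // ellipsis
    );

    return str_replace(array_keys($map), $map, $text);
}
?>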
[/code]
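For text that is already UTF-8, the same table-driven idea might look like the sketch below. replaceSmartPunctuation() is a hypothetical name, and the ASCII replacements are an assumption, since the original question does not say what the characters should become. The keys are written as byte escapes so the script itself contains no special characters.

[code=php]
<?php
// Hypothetical helper (not from the thread): maps a few UTF-8 "smart"
// punctuation characters to plain ASCII. Assumes $text is valid UTF-8.
function replaceSmartPunctuation($text)
{
    $map = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '--',  // U+2014 em dash
        "\xE2\x80\xA6" => '...'  // U+2026 ellipsis
    );

    // strtr() with an array does not re-scan text it has already replaced.
    return strtr($text, $map);
}

echo replaceSmartPunctuation("\xE2\x80\x9CMy quote\xE2\x80\x9D \xE2\x80\x94 done\xE2\x80\xA6");
// "My quote" -- done...
[/code]

Only the listed characters are touched, so other HTML special characters pass through unchanged.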
@NogDog, May 20, 2009:
[QUOTE]^^ Very handy, thanks.

I made a slight modification just to make it a bit easier to see what is replaced by what. I hope you don't mind me posting it here (I'll remove it if so).
[/QUOTE]

Any code I post on my blog is freeware, as far as I'm concerned -- that's why it's there. Besides, I...uh...re-used it from Chris Shiflett's blog, and I think that code was submitted in a reader's comment on [i]his[/i] blog. So, long story short, I feel no particular ownership of that code.