Non-English RegExp for removing non-alphanumeric

I have used this script:

[CODE]someString.replace(/[^A-Za-z0-9 .]/g, '')[/CODE]

…many times to remove every character that is not alphanumeric, a "." or a " ", but I'm having to rethink its use as I start working on string replacement for languages other than American English. The reason is that this RegExp also strips out special characters such as "ó" and "ñ". I'm not certain, but I think it would also remove all double-byte characters, such as words in various Asian languages.
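To make the problem concrete, here is a small sketch of what that expression does to accented and Asian-language text. The helper name `stripNonAlnum` and the sample strings are made up for illustration; the non-ASCII characters are written as escapes so the file encoding cannot change the result.

```javascript
// Hypothetical helper wrapping the expression from the post above.
function stripNonAlnum(someString) {
  // Keep only ASCII letters, digits, spaces and periods.
  return someString.replace(/[^A-Za-z0-9 .]/g, '');
}

// Plain English punctuation is removed as intended.
console.log(stripNonAlnum('Hello, world!'));             // Hello world

// "El niño comió." loses its accented letters.
console.log(stripNonAlnum('El ni\u00F1o comi\u00F3.'));  // El nio comi.

// "東京 2011" loses both kanji; only the space and the digits survive.
console.log(JSON.stringify(stripNonAlnum('\u6771\u4EAC 2011'))); // " 2011"
```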

Has anyone run into this problem and have they found a simple coding solution to catch all non-English special characters?

Yours,
Dave

@Kor (Feb 24, 2011): Yes, it is possible, but you should specify the special characters you want to allow, and serve the page with a suitable character encoding, probably UTF-8 in your case.
[CODE]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/JavaScript">
// Remove every character that is not an ASCII letter or one of the listed accented letters.
// A-Za-z is used instead of A-z, which would also let [ \ ] ^ _ ` through.
// The i flag makes the uppercase accented letters match their lowercase forms too.
function valid(f) {
  f.value = f.value.replace(/[^A-Za-zÇÑÀÁÈÉÍÌÏÓÒÚÙÜ]/ig, '');
}
</script>
</head>
<body><br>
<form id="myform" action="">
<input name="mytext" type="text" onkeyup="valid(this)" onblur="valid(this)">
</form>
</body>
</html>
[/CODE]

@Sylvan012 (author, Feb 24, 2011): So what you're suggesting is to build a library of uppercase and lowercase special characters and use that for the RegExp?

Makes sense...

Is there a single or simpler command, however, that would include double-byte or all non-English-but-still-alphanumeric characters that you know of?

Yours,

Dave
@Kor (Feb 24, 2011):
[QUOTE]Is there a single or simpler command, however, that would include double-byte or all non-English-but-still-alphanumeric characters that you know of?[/QUOTE]

Well, in a way yes, but a RegExp range has nothing to do with single-byte or double-byte characters; it is defined by the numeric codes of the characters. A range like [a-z] covers the codes from 97 (a) to 122 (z), which are the same in ASCII and in Unicode. If the special characters you want to allow sit in a [I]continuous[/I] range of codes, you only need the first and the last character of that range. Keep in mind that JavaScript compares Unicode values, not the positions from an extended-ASCII code page.
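The point that a range follows the numeric character codes can be checked directly. The sketch below is only an illustration; `inRange` is a hypothetical helper that assumes single-character inputs.

```javascript
// The end points of [a-z] by numeric code.
console.log('a'.charCodeAt(0)); // 97
console.log('z'.charCodeAt(0)); // 122

// Hypothetical helper: does the single character ch fall between first and last?
function inRange(ch, first, last) {
  const code = ch.charCodeAt(0);
  return code >= first.charCodeAt(0) && code <= last.charCodeAt(0);
}

// The manual comparison and the RegExp range give the same answer.
for (const ch of ['a', 'm', 'z', 'A', '{', '\u00F1']) {
  console.log(JSON.stringify(ch), inRange(ch, 'a', 'z'), /^[a-z]$/.test(ch));
}

// Pitfall: [A-z] also matches the six characters between Z (90) and a (97).
console.log(/[A-z]/.test('_')); // true, because _ has code 95
```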
@Sylvan012 (author, Feb 24, 2011):
[QUOTE]If the special characters you want to allow are in a [I]continuous[/I] range, you may use only the first and the last term of the range.[/QUOTE]

Ooohhh... Ok, that's an excellent observation I'd not thought of.

Off-hand, do you know of a good online resource that could provide an incremental list of all such characters?

I'll search on my own, certainly, but if you've got experience with a good resource like that, I thought I'd ask.

Yours,

Dave
@Kor (Feb 24, 2011): http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm

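For instance [Ç-Ñ] covers all the characters from Ç (Unicode U+00C7, decimal 199) to Ñ (U+00D1, decimal 209). Note that JavaScript compares Unicode values, so the positions given in the extended ASCII table above (128 = Ç, 165 = Ñ) do not apply inside a RegExp.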
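A quick way to see what a range like [Ç-Ñ] matches in JavaScript is to enumerate it. This is a sketch only: `rangeMembers` is a hypothetical helper, and it assumes both end characters are single UTF-16 code units.

```javascript
// List every character between two end characters, by numeric code.
function rangeMembers(first, last) {
  const out = [];
  for (let code = first.charCodeAt(0); code <= last.charCodeAt(0); code++) {
    out.push(String.fromCharCode(code));
  }
  return out;
}

const members = rangeMembers('\u00C7', '\u00D1'); // Ç to Ñ
console.log(members.join(' ')); // Ç È É Ê Ë Ì Í Î Ï Ð Ñ

// Every listed character is matched by the equivalent RegExp range.
console.log(members.every(ch => /^[\u00C7-\u00D1]$/.test(ch))); // true

// Lowercase é (U+00E9) lies outside the range unless the i flag is used.
console.log(/[\u00C7-\u00D1]/.test('\u00E9'));  // false
console.log(/[\u00C7-\u00D1]/i.test('\u00E9')); // true
```

Listing the members first makes it obvious when a range pulls in characters you did not intend, such as Ð here.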
@Sylvan012 (author, Feb 24, 2011):
[QUOTE]For instance [Ç-Ñ] should cover all the characters from Ç to Ñ.[/QUOTE]

Whoa... Ok, that's pretty useful and awesome! Thank you!
@Kor (Feb 24, 2011): If you are interested in the matter, here's some additional information about handling special characters, ASCII codes, foreign alphabets, Unicode and regular expressions.

You mentioned Asian characters and double-byte characters. In fact there are several ways of encoding text (an encoding and a character set are different animals), and a code point can take 1, 2, 3 or 4 bytes depending on the encoding. Code points are a mapping between characters and numbers. The Unicode encodings have somewhat "mysterious" names: UTF-32 always uses 4 bytes per code point, UTF-16 uses 2 or 4, and UTF-8 uses 1, 2, 3 or 4.
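The byte counts per code point can be observed from JavaScript itself. This is a rough sketch: `sizes` is a hypothetical helper, and it assumes an environment where `TextEncoder` is available as a global (browsers and current Node.js).

```javascript
// TextEncoder always produces UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: storage needed for one character in each encoding.
function sizes(ch) {
  return {
    utf8Bytes: encoder.encode(ch).length,
    utf16Bytes: ch.length * 2,       // string length counts 16-bit code units
    utf32Bytes: [...ch].length * 4   // spreading a string iterates by code point
  };
}

console.log(sizes('a'));         // UTF-8: 1, UTF-16: 2, UTF-32: 4
console.log(sizes('\u00F1'));    // ñ  -> UTF-8: 2, UTF-16: 2, UTF-32: 4
console.log(sizes('\u6771'));    // 東 -> UTF-8: 3, UTF-16: 2, UTF-32: 4
console.log(sizes('\u{1F600}')); // emoji U+1F600 -> UTF-8: 4, UTF-16: 4, UTF-32: 4
```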

Take care also about the difference between a bit and a byte.

When it comes to regular expressions, a range of characters can be written not only with the literal characters, but also with their Unicode values in hexadecimal notation, for example \u00C7 for Ç. A detailed explanation and standard:

http://unicode.org/reports/tr18/
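As an illustration of ranges written with Unicode values, and of the property classes that the report above describes, here is a minimal sketch. It assumes a JavaScript engine that supports Unicode property escapes (`\p{...}` with the `u` flag, added in ES2018); the variable names and the sample string are made up.

```javascript
// Range written with hexadecimal Unicode escapes: À (U+00C0) to ÿ (U+00FF).
// Assumption: Latin-1 letters are enough. This block also contains × and ÷.
const latin1 = /[^A-Za-z0-9 .\u00C0-\u00FF]/g;

// Unicode property escapes: \p{L} is any letter, \p{M} any combining mark,
// \p{N} any number, in any script. The u flag is required.
const anyScript = /[^\p{L}\p{M}\p{N} .]/gu;

// "Niño_東京_2011!" written with escapes.
const sample = 'Ni\u00F1o_\u6771\u4EAC_2011!';

console.log(sample.replace(latin1, ''));    // Niño2011
console.log(sample.replace(anyScript, '')); // Niño東京2011
```

The second pattern is the closest thing to a single command for "keep letters and digits from every language", at the cost of requiring a reasonably recent engine.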

And an additional note about regular expressions in JavaScript. People usually think that RegExp notation is universal. That is more or less true, but there are some small differences (in fact, a few incomplete implementations) from one language to another. JavaScript's regular expressions are similar to Perl's.