I thought I'd share my favorite regex of all time:
[ -~]
Any ideas what this regexp matches?
It matches all ASCII characters from the space to the tilde. What are those characters? They're all printable characters!
Take a look at the ASCII table. The printable characters start at the space and end at the tilde:

[ -~] matches all printable ascii characters (light blue background)
I love this.
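A quick way to check the claim is to run the character class over all 128 ASCII codes and see what matches. This is only a sketch in Python (the post itself names no language), and the names `PRINTABLE` and `matched` are made up for illustration.

```python
import re

# The character class from the post: space (0x20) through tilde (0x7E).
PRINTABLE = re.compile(r"[ -~]")

# Collect every 7-bit ASCII character that the class matches.
matched = [chr(code) for code in range(128) if PRINTABLE.fullmatch(chr(code))]

print(len(matched))                            # 95
print(matched[0] == " ", matched[-1] == "~")   # True True
print("".join(matched))
```

The 33 codes left out are the control characters 0 to 31 plus DEL (127).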
Update:
Would you like to wear this regular expression? I just had an idea to make it a t-shirt as this blog post is getting so popular. Check out this t-shirt that I made:
This shirt is available through Teespring. Teespring is like Kickstarter for t-shirts. If 50 people buy the shirt, everyone gets their shirt. Otherwise no one gets charged and no one gets the shirt.



Comments
I'm curious: what do you use it for? It's a good regex, I agree, though I'm not sure where it could properly be used.
P.S. Also, it's Latin-only:
$ echo München | sed 's/[^ -~]/\ /g'
M nchen
$ echo Привет, мир | sed 's/[^ -~]/\ /g'
,
ascii only!
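The same substitution can be sketched in Python. The function name `blank_non_printable` is hypothetical, and note one assumption: Python's `re` works on code points, so each non-ASCII letter becomes exactly one space, whereas sed's behaviour depends on the locale.

```python
import re

def blank_non_printable(text: str) -> str:
    """Replace every character outside space..tilde with a single space."""
    return re.sub(r"[^ -~]", " ", text)

print(blank_non_printable("München"))       # M nchen
print(repr(blank_non_printable("Привет, мир")))
```

Only the comma and the original space survive in the second string; every Cyrillic letter is blanked.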
ASCII only means it's practically useless for, you know, the other 6,599,505,516 people who live on Earth.
Strange, I grepped his whole blog post, and couldn't get a match for "useful for everyone", "unicode compliant", "recommended", "best practice", "i18n", "Good idea", or "Drop this bad boy into production TODAY".
But maybe I'm using the wrong character classes.
/slow_clap
>It matches all ASCII characters from the space to the tilde. What are those characters? They're all printable characters!
你他妈的 is printable and, believe it or not, even Texans need it... If it can't match Unicode chars, it's useless for everybody...
He didn't say "all THE printable characters".
It's not even every ASCII printable character. What about áãàéíóúùèõñ etc?
It's practically useless.
The ASCII character set ranges from 0 to 127 only. You are talking about the Extended ASCII Character set, which is a different beastie.
It's only useful in my country...
As if those 6,599,505,516 people don't use Base64 and tons of other encodings that work on top of ASCII.
Yeah, Base64 solves everything. You can just write a regex to match all printable Unicode characters in UTF-8 with Base64 encoding.
You made an off-by-one error. I just had a daughter born. It's 6,599,505,517 people.
Put down your pitchfork. This post is not a political statement about string encoding.
There is a lot of data out there with encoding issues, and a lot of it is very thorny to work through. The worst case is when you don't even know what corruption or encoding some given data has. In these cases it is, at the very least, useful to know whether or not your data contains non-ASCII characters, and that use IS applicable to everyone.
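As a rough sketch of that kind of check, the inverse class can be run over raw bytes before any decoding is attempted. Everything here is an assumption for illustration: the names `SUSPECT`, `has_suspect_bytes`, and `suspect_offsets`, and the choice to also allow tab, LF, and CR.

```python
import re
from itertools import islice

# Bytes pattern: anything that is not printable ASCII, tab, LF or CR.
SUSPECT = re.compile(rb"[^ -~\t\n\r]")

def has_suspect_bytes(data: bytes) -> bool:
    """True if the data contains at least one byte outside the allowed set."""
    return SUSPECT.search(data) is not None

def suspect_offsets(data: bytes, limit: int = 10) -> list:
    """Return up to `limit` (offset, byte value) pairs for offending bytes."""
    return [(m.start(), m.group()[0]) for m in islice(SUSPECT.finditer(data), limit)]

print(has_suspect_bytes(b"plain text\n"))             # False
print(has_suspect_bytes("München".encode("utf-8")))   # True
print(suspect_offsets(b"a\xffb\x00"))                 # [(1, 255), (3, 0)]
```

Working on bytes sidesteps the question of which encoding the data is in, which is exactly the unknown in the scenario described.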
It has plenty of uses, especially when processing or migrating legacy data which may have formatting codes embedded into it for brevity. I have used this regex, and variations, many times over the years for doing just that.
I've used the inverse of this (with \n) to search for accidentally inserted characters like tabs and non-breaking spaces in text files. I've also tried to edit some files to remove characters other than [\n\040-\176]. And I have some Japanese word lists that can be parsed with regex because some fields only contain ASCII and some non-ASCII characters.
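A sketch of that search in Python, using the octal form of the class from the comment above. The helper name `find_stray` and the report format are invented for this example.

```python
import re
import unicodedata

# Inverse of [\n\040-\176]: anything that is neither a newline nor printable ASCII.
STRAY = re.compile(r"[^\n\040-\176]")

def find_stray(text: str) -> list:
    """Return (line, column, code point, name) for each stray character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for m in STRAY.finditer(line):
            ch = m.group()
            name = unicodedata.name(ch, "<unnamed>")  # control characters have no name
            hits.append((line_no, m.start() + 1, f"U+{ord(ch):04X}", name))
    return hits

sample = "ok line\n\tindented\nprice:\u00a05"
for hit in find_stray(sample):
    print(hit)
# (2, 1, 'U+0009', '<unnamed>')
# (3, 7, 'U+00A0', 'NO-BREAK SPACE')
```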
[[:print:]] matches あ by default in Ruby 1.9 but not 1.8.
Unicode sucks anyway.
Everyone who uses the internet should learn and use English, else GTFO.
ETL data cleaning from bad sources... a must in production database environments.
Well, it matches every *ASCII* printable character.
Not as good as it seems, but looks good though ;)
It's not even every ASCII printable character. What about áãàéíóúùèõñ etc?
áãàéíóúùèõñ are not ASCII. ASCII is a 7 bit code (as shown in the table above).
You confused ASCII with a codepage like ISO-8859. ASCII consists of exactly the 128 codes shown in the table above.
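The distinction is easy to see in code. This is a small Python sketch, not anything from the post: it checks that none of those letters fall in the 0 to 127 range and shows that the byte used for one of them depends on which encoding or code page is chosen.

```python
import unicodedata

# Normalise to precomposed form so each accented letter is one code point.
text = unicodedata.normalize("NFC", "áãàéíóúùèõñ")

print(text.isascii())                      # False
print(all(ord(ch) > 127 for ch in text))   # True

a_acute = "\u00e1"  # á
print(a_acute.encode("latin-1"))   # b'\xe1'      (ISO-8859-1, one byte)
print(a_acute.encode("utf-8"))     # b'\xc3\xa1'  (UTF-8, two bytes)

try:
    a_acute.encode("ascii")
except UnicodeEncodeError as exc:
    print("not ASCII:", exc.reason)
```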
Haters gonna hate...*sigh*
Cute quiz, however for an actual piece of code in a live program I'd much rather see the self-describing /[[:print:]]/
+1 for readability. Code is meant to be human readable. Thus something that's obvious is better than the cryptic.
That being said, anyone who has to do regexps will occasionally create some gnarly patterns.
Also, not all regex parsers support [[:print:]] -- but all of them (that I know of) support character ranges.
Isn't it just the same as /./?
If not, what's the difference?
/./ will capture ALL characters, printable or not (by default, everything except a newline).
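The difference can be sketched in Python by testing both patterns against a few single characters. The helper `compare` is hypothetical, and the dot is used with its default flags (no DOTALL).

```python
import re

def compare(ch: str) -> tuple:
    """Return (matches '.', matches '[ -~]') for a single character."""
    return (bool(re.fullmatch(r".", ch)), bool(re.fullmatch(r"[ -~]", ch)))

samples = {"letter": "a", "tab": "\t", "bell": "\x07", "e-acute": "\u00e9", "newline": "\n"}
for name, ch in samples.items():
    dot, printable = compare(ch)
    print(f"{name:8} dot={dot!s:5} printable={printable}")
```

The dot accepts the tab, the bell, and the accented letter, while the character class accepts only the plain letter; neither matches the newline.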
Cute, except that it matches spaces but not tabs. Behaviour with alternate code pages may be unpredictable. This kind of thing is good if you know exactly what you are doing and why it works in that context. In some other context it's liable not to do what you think.
A tab is a control character, not a printable character.
Why would you even need a blog for this!
Sweet. My favourite is from Mail::RFC822::Address cf: http://www.ex-parrot.com/pdw/Mail-RFC822-Address/Mail-RFC822-Address.html
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
\t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
Where's the T-shirt?
It only comes in XXXXL
Sorry this regex is too big to put into a t-shirt. ;)
That's the joke!
Is that Windows 8 written in perl?
Where could I use this in a real life example? Why would I want to match all ASCII characters?
…although others said so before: Why would you want to limit yourself to ASCII?
Notwithstanding that limitation: Clever! May even be useful, *if* your input data should be limited to ASCII characters (usernames & passwords come to mind, although umlauts etc. in a password will make it harder to brute force)
The question isn't so much about limiting it to ASCII as about why you would want to match ASCII characters at all. What purpose does it serve to match only ASCII characters?
How about if you're filtering data to pass to a legacy system that only understands ASCII?
Archiving 7-bit ASCII documents in the Planetary Data System
http://pds.jpl.nasa.gov/
e.g.
tr -d '\n\r -~' < blah.txt | wc -c
should return zero. I've used this 'trick' many times.
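A Python sketch of the same check, assuming the allowed set is LF, CR, and the full space-to-tilde range from the post. The name `count_disallowed` is made up; it deletes every allowed byte and counts what is left, as the pipeline above does.

```python
import re

# Bytes that are allowed in a 7-bit printable ASCII document.
ALLOWED = re.compile(rb"[\n\r -~]")

def count_disallowed(data: bytes) -> int:
    """Number of bytes left after deleting LF, CR and space..tilde."""
    return len(ALLOWED.sub(b"", data))

print(count_disallowed(b"hello {world}\r\n"))      # 0
print(count_disallowed("Å".encode("utf-8")))       # 2
print(count_disallowed(b"a\tb"))                   # 1
```

For a real file, the bytes could come from something like `open(path, "rb").read()`, where `path` is whatever file is being checked.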
How would I use this, then? Drop all invalid characters? Replace them all with a fixed value? Check for bad strings and abort? Besides being kind of hard to read, this is pretty limiting in what your options are when you encounter such a character.
I'd use something like #map which is easier to understand, and more flexible in what it can do when it encounters a non-ASCII character.
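For what it's worth, the regex does not force one policy. Here is a Python sketch of the options listed above (drop, replace, abort) plus a per-character mapping; all function names and the `<U+XXXX>` marker format are invented for illustration.

```python
import re

NON_PRINTABLE = re.compile(r"[^ -~]")

def drop(text: str) -> str:
    """Remove every character outside space..tilde."""
    return NON_PRINTABLE.sub("", text)

def replace(text: str, placeholder: str = "?") -> str:
    """Swap each offending character for a fixed placeholder."""
    return NON_PRINTABLE.sub(lambda _: placeholder, text)

def reject(text: str) -> str:
    """Return the text unchanged, or raise if it contains an offending character."""
    m = NON_PRINTABLE.search(text)
    if m:
        raise ValueError(f"non-printable character {m.group()!r} at index {m.start()}")
    return text

def mark(text: str) -> str:
    """Map each offending character to a visible code point marker."""
    return NON_PRINTABLE.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)

print(drop("München"))      # Mnchen
print(replace("München"))   # M?nchen
print(mark("München"))      # M<U+00FC>nchen
```

The callback form of `re.sub` plays the role of the #map approach: each match is handled individually.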
clever!
man re_format :print: works better
Please stop discriminating against those of us with names including ÅÄÖ. Or even 그레.
Yay, it's my favorite too!
D:
I so angly, lound-eye
(╯°□°)╯︵ ┻━┻
Thanks! I'll be putting this puppy to work tomorrow!
Just curious, but in what context are you intending to use this?
A tshirt with [^ -~] would be funnier.
and the caption could be "I want to believe" or even "for all the important things we don't see" :)
Or "In control"
I prefer "The regex that will change the world"
s/world/word/
I'd love a t-shirt, but would you consider adding a quantifier? i.e.:
[ -~]+
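With the quantifier, the pattern matches whole runs of printable characters, not single ones. A Python sketch of one possible use, pulling readable strings out of binary data; the function name `printable_runs` and the minimum length of 4 are assumptions.

```python
import re

def printable_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII bytes that are at least min_len long."""
    return [run for run in re.findall(rb"[ -~]+", data) if len(run) >= min_len]

blob = b"\x00\x01hello world\xff\xfeab\x00/usr/bin\x7f"
print(printable_runs(blob))   # [b'hello world', b'/usr/bin']
```

The two-byte run `ab` is found by the regex but dropped by the length filter, and DEL (0x7F) is not part of any run.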
At last, a perfect solution for validating text sent to my Daisy Wheel printer... although I think I scrapped it 30 years ago...
What's with people?
The title of this post is "My favorite regex of all time". It's one man's opinion. It's subjective. The man has a favorite regex, he's not advocating its use.
He even states, "[ -~] matches all printable ascii characters (light blue background)" and he's correct. Why do people feel the need to point out that it doesn't work for some situation when he never said it would?
I work with a lot of legacy print streams. I can think of several places that I might be able to use this. Thanks for sharing.
Thanks Jack. My thoughts exactly. Finally someone that doesn't have a huge stick up their ass.
Well, this is also an example of why I detest regexes: seeing this in code wouldn't clearly document the intent of the developer.
There are three kinds of programmers out there: Those who understand regexes, those who don't, and those who pretend that they understand regexes.
I have yet to personally meet programmers falling in the first two groups!
If 50 people buy that shirt, not only do they get the shirt, they get a free "you're a tool" look from everyone they explain it to.
Nice regex, good to parse some slug urls easily
Another great article. Thanks for sharing your insight! Cheers
Cool! And check out all the a-holes with their heads up their butts!
Hey, I need to find all the strings that are not email IDs and remove them from a text file.
That means I need to find all the words that do not contain '@'.
Is there a regex to do that?
Use sed:
This will print all lines that don't contain @.
Had cause to write the regex the other day in an attempt to match all names starting with exactly one underscore.
It's not right for that purpose, though, and I had to correct it to something less pleasing using negative lookahead (because it doesn't match a name consisting of just a single underscore which is a name starting with exactly one underscore).
(PS, comment system seems a bit borked if you have a space as the third “letter” in your name.)
[^ -~]+|& # x 2 6 6 5 ;
(no spaces after &)