Famous Perl One-Liners Explained, Part VII: Handy Regular Expressions

This is the seventh part of a nine-part article on famous Perl one-liners. Perl is not Perl without regular expressions, therefore in this part I will come up with and explain various Perl regular expressions. Please see part one for the introduction of the series.
Famous Perl one-liners is my attempt to create "perl1line.txt" that is similar to "awk1line.txt" and "sed1line.txt" that have been so popular among Awk and Sed programmers, and Unix sysadmins. I will release the perl1line.txt in the next part of the series.
The article on famous Perl one-liners consists of nine parts:
- Part I: File spacing.
- Part II: Line numbering.
- Part III: Calculations.
- Part IV: String creation and array creation.
- Part V: Text conversion and substitution.
- Part VI: Selective printing and deleting of certain lines.
- Part VII: Handy regular expressions (this part).
- Part VIII: Release of perl1line.txt.
- Part IX: Release of Perl One-Liners e-book.
After I am done with the next part of the article, I will release the whole article series as a pdf e-book! Please subscribe to my blog to be the first to get it. You can also follow me on Twitter.
And here are today's one-liners:
109. Match something that looks like an IP address.
/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
This regex doesn't guarantee that the thing that got matched is in fact a valid IP. All it does is match something that looks like an IP: four numbers of one to three digits each, separated by dots. For example, it matches a valid IP 81.198.240.140 and it also matches an invalid IP such as 923.844.1.999.
Here is how it works. The ^ at the beginning of the regex is an anchor that matches the beginning of the string. Next, \d{1,3} matches one, two or three consecutive digits. The \. matches a dot. The $ at the end is an anchor that matches the end of the string. It's important to use both the ^ and $ anchors, otherwise strings like foo213.3.1.2bar would also match.
This regex can be simplified by grouping the first three repeated \d{1,3}\. expressions:
/^(\d{1,3}\.){3}\d{1,3}$/
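Here is a minimal sketch that runs the simplified regex against a few strings. The helper name looks_like_ip and the sample inputs are my own choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string merely looks like a dotted quad.
sub looks_like_ip {
    my ($s) = @_;
    return $s =~ /^(\d{1,3}\.){3}\d{1,3}$/ ? 1 : 0;
}

# The second string is not a valid IP but still matches; the last two do not match.
for my $s ('81.198.240.140', '923.844.1.999', 'foo213.3.1.2bar', '1.2.3') {
    printf "%-16s %s\n", $s, looks_like_ip($s) ? 'match' : 'no match';
}
```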
110. Test if a number is in range 0-255.
/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/
Here is how it works. A number can have one, two or three digits. If it's a one-digit number, we allow it to be anything in [0-9]. If it has two digits, we also allow any combination of [0-9][0-9]. However, if it's a three-digit number, it has to be either one hundred-something or two hundred-something. If it's one hundred-something, then 1[0-9][0-9] matches it. If it's two hundred-something, then it's either something up to 249, which is matched by 2[0-4][0-9], or it's 250-255, which is matched by 25[0-5].
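A quick way to convince yourself that the alternation covers exactly 0-255 is to count how many integers in a wider range it accepts. This is only a sketch; the variable names and the 0..300 test range are assumptions.

```perl
use strict;
use warnings;

my $in_range = qr/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/;

# Collect every integer from 0 to 300 that the regex accepts.
my @accepted = grep { /$in_range/ } 0 .. 300;

print scalar(@accepted), " numbers accepted, largest is $accepted[-1]\n";
# prints: 256 numbers accepted, largest is 255
```

Note that the two-digit branch [0-9][0-9] also accepts strings with a leading zero, such as 09. Whether that is acceptable depends on what you do with the number.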
111. Match an IP address.
my $ip_part = qr/([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;
if ($ip =~ /^($ip_part\.){3}$ip_part$/) {
    say "valid ip";
}
This regexp combines the previous two. The qr/.../ operator compiles the regular expression, which is then stored in the $ip_part variable. Then $ip_part is used to match the four parts of the IP address. Note that say has to be enabled first, for example with use feature 'say' or by running the code with perl -E.
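The same idea can be wrapped in a small function and exercised on a few inputs. This is a sketch under my own assumptions: the names $octet and is_ipv4 are made up, and I use non-capturing groups (?:...) because the captured text is not needed here.

```perl
use strict;
use warnings;

# One part of the address: a number in the range 0-255.
my $octet = qr/(?:[0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;

# Hypothetical helper: true if all four dot-separated parts are in 0-255.
sub is_ipv4 {
    my ($ip) = @_;
    return $ip =~ /^(?:$octet\.){3}$octet$/ ? 1 : 0;
}

for my $ip ('81.198.240.140', '255.255.255.255', '923.844.1.999', '256.1.1.1') {
    printf "%-16s %s\n", $ip, is_ipv4($ip) ? 'valid ip' : 'invalid ip';
}
```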
112. Check if the string looks like an email address.
/.+@.+\..+/
This regex makes sure that the string looks like an email address. Notice that I say "looks like". It doesn't guarantee it is an email address. Here is how it works: first it matches something up to the @ symbol, then it matches as much as possible until it finds a dot, and then it matches some more. If this succeeds, then it's something that at least looks like an email address, with the @ symbol and a dot in it.
For example, cats@catonmat.net matches but cats@catonmat doesn't because the regex can't match the dot \. that is necessary.
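To see how loose this check is, here is a small sketch that tries it on a few strings. The helper name looks_like_email and the inputs are assumptions; I also escape the @ as a precaution against array interpolation, which does not change what the regex matches.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has an @ followed later by a dot.
sub looks_like_email {
    my ($s) = @_;
    return $s =~ /.+\@.+\..+/ ? 1 : 0;
}

# The third string contains spaces yet still matches, because . matches
# (almost) any character. The last one is rejected for lack of a dot.
for my $s ('cats@catonmat.net', 'cats@catonmat', 'a b@c d.e', 'root@localhost') {
    printf "%-20s %s\n", $s, looks_like_email($s) ? 'match' : 'no match';
}
```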
A much more robust way to check if a string is a valid email address is to use the Email::Valid module:
use Email::Valid;
print (Email::Valid->address('john@example.com') ? 'valid email' : 'invalid email');
113. Check if the string is a decimal number.
Checking if the string is a number is really difficult. I based my regex and explanation on the one in Perl Cookbook.
Perl offers \d that matches digits 0-9. So we can start with:
/^\d+$/
This regex matches one or more digits \d starting at the beginning of the string ^ and ending at the end of the string $. However this doesn't match numbers such as +3 and -3. Let's modify the regex to match them:
/^[+-]?\d+$/
Here the [+-]? means match an optional plus or a minus before the digits. This now matches +3 and -3 but it doesn't match -0.3. Let's add that:
/^[+-]?\d+\.?\d*$/
Now we have expanded the previous regex by adding \.?\d*, which matches an optional dot followed by zero or more digits. Now we're in business and this regex also matches numbers like -0.3 and 0.3.
A much better way to match a decimal number is to use the Regexp::Common module, which offers various useful regexes. For example, to match an integer you can use $RE{num}{int} from Regexp::Common.
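Here is a small sketch that compares the final regex with looks_like_number from the core Scalar::Util module on a few inputs. The helper name is_decimal and the test strings are assumptions.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper wrapping the final regex from this one-liner.
sub is_decimal {
    my ($s) = @_;
    return $s =~ /^[+-]?\d+\.?\d*$/ ? 1 : 0;
}

# The regex rejects .5 and 1e5, while looks_like_number accepts both.
for my $s ('3', '+3', '-0.3', '.5', '1e5', 'abc') {
    printf "%-5s regex=%d looks_like_number=%d\n",
        $s, is_decimal($s), looks_like_number($s) ? 1 : 0;
}
```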
How about positive hexadecimal numbers? Here is how:
/^0x[0-9a-f]+$/i
This matches the hex prefix 0x followed by the hex number itself. The /i flag at the end makes sure that the match is case insensitive. For example, 0x5af matches and 0X5Fa matches, but 97 doesn't, because it's just a decimal number.
It's better to use $RE{num}{hex} because it supports negative numbers, decimal places and number grouping.
Now how about octal? Here is how:
/^0[0-7]+$/
Octal numbers are prefixed by 0, which is followed by octal digits 0-7. For example, 013 matches but 09 doesn't, because it's not a valid octal number.
It's better to use $RE{num}{oct} for the same reasons as above.
Finally binary:
/^[01]+$/
Binary base consists of just 0s and 1s. For example, 010101 matches but 210101 doesn't, because 2 is not a valid binary digit.
It's better to use $RE{num}{bin} for the same reasons as above.
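The three regexes can be tried side by side. In this sketch the %re hash and the bases_matched helper are hypothetical names of mine. It also shows that, because the binary regex has no prefix, a string such as 010101 satisfies the octal regex as well.

```perl
use strict;
use warnings;

my %re = (
    hex    => qr/^0x[0-9a-f]+$/i,
    octal  => qr/^0[0-7]+$/,
    binary => qr/^[01]+$/,
);

# Hypothetical helper: comma-separated names of the regexes that match $s.
sub bases_matched {
    my ($s) = @_;
    return join ',', grep { $s =~ $re{$_} } sort keys %re;
}

for my $s ('0x5af', '0X5Fa', '97', '013', '09', '010101', '210101') {
    printf "%-7s %s\n", $s, bases_matched($s) || 'no match';
}
```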
114. Check if a word appears twice in the string.
/(word).*\1/
This regex matches word followed by something or nothing at all, followed by the same word. Here the (word) captures the word in group 1 and \1 refers to contents of group 1, therefore it's almost the same as writing /(word).*word/
For example, silly things are silly matches /(silly).*\1/, but silly things are boring doesn't, because silly is not repeated in the string.
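Here is a sketch of the same check as a function, plus a generalized variant that finds any repeated word rather than a fixed one. Both helper names are assumptions, and the \b word-boundary version is my extension, not part of the original one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: does $word occur at least twice in $str?
# \Q...\E quotes any regex metacharacters in $word.
sub word_repeats {
    my ($str, $word) = @_;
    return $str =~ /(\Q$word\E).*\1/ ? 1 : 0;
}

# Hypothetical helper: return the first whole word that occurs twice, or undef.
sub first_repeated_word {
    my ($str) = @_;
    return $str =~ /\b(\w+)\b.*\b\1\b/ ? $1 : undef;
}

print word_repeats('silly things are silly', 'silly'), "\n";    # 1
print word_repeats('silly things are boring', 'silly'), "\n";   # 0
print first_repeated_word('silly things are silly'), "\n";      # silly
```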
115. Increase all numbers by one in the string.
$str =~ s/(\d+)/$1+1/ge
Here we use the substitution operator s///. It matches all integers (\d+), puts them in capture group 1, then it replaces them with their value incremented by one $1+1. The g flag makes sure it finds all the numbers in the string, and the e flag evaluates $1+1 as a Perl expression.
For example, this 1234 is awesome 444 gets turned into this 1235 is awesome 445.
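A short sketch of the substitution in action. The helper name increment_integers is an assumption. The extra inputs show two side effects worth knowing about: each run of digits is treated as a separate integer, so 3.14 becomes 4.15, and leading zeros are lost because the captured text is evaluated as a number.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $str with every integer incremented.
sub increment_integers {
    my ($str) = @_;
    (my $copy = $str) =~ s/(\d+)/$1+1/ge;
    return $copy;
}

print increment_integers('this 1234 is awesome 444'), "\n";  # this 1235 is awesome 445
print increment_integers('3.14'), "\n";                      # 4.15
print increment_integers('version 009'), "\n";               # version 10
```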
116. Extract HTTP User-Agent string from the HTTP headers.
/^User-Agent: (.+)$/
HTTP headers are formatted as Key: Value pairs. It's very easy to parse such strings: you just instruct the regex engine to save the Value part in the $1 group variable.
For example, if the HTTP headers contain:
Host: localhost:8000
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
then the regular expression, applied line by line (or to the whole block with the /m flag), will extract the Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US) string.
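If the headers are held in a single multi-line string, the /m flag lets ^ and $ match at line boundaries. Here is a minimal sketch; the user_agent helper and the shortened header block are assumptions. Real HTTP headers end lines with \r\n, so in that case the captured value would still carry a trailing \r that needs stripping.

```perl
use strict;
use warnings;

# A shortened header block, joined with plain newlines for this example.
my $headers = join "\n",
    'Host: localhost:8000',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)',
    'Accept-Encoding: gzip,deflate,sdch';

# Hypothetical helper: return the User-Agent value, or undef if absent.
sub user_agent {
    my ($h) = @_;
    return $h =~ /^User-Agent: (.+)$/m ? $1 : undef;
}

print user_agent($headers), "\n";
# prints: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)
```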
117. Match printable ASCII characters.
/[ -~]/
This is really tricky and smart. To understand it, take a look at man ascii. You'll see that space starts at value 0x20 and the ~ character is 0x7e. All the characters between a space and ~ are printable. This regular expression matches exactly that. The [ -~] defines a range of characters from space till ~. This is my favorite regexp of all time.
You can invert the match by placing ^ as the first character in the character class:
/[^ -~]/
This matches the opposite of [ -~].
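On its own, [ -~] matches a single character. Here is a sketch of two common ways to apply it to a whole string: checking that every character is printable, and stripping the ones that are not. The helper names are assumptions, and I use \A and \z so that a trailing newline is not silently accepted.

```perl
use strict;
use warnings;

# True if every character of the string is printable ASCII (space through ~).
# An empty string also passes, because of the * quantifier.
sub is_printable_ascii {
    my ($s) = @_;
    return $s =~ /\A[ -~]*\z/ ? 1 : 0;
}

# Return a copy of the string with everything outside space..~ removed.
sub strip_nonprintable {
    my ($s) = @_;
    (my $copy = $s) =~ s/[^ -~]//g;
    return $copy;
}

print is_printable_ascii("Hello, World!"), "\n";   # 1
print is_printable_ascii("tab\there"), "\n";       # 0
print strip_nonprintable("tab\there\n"), "\n";     # tabhere
```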
118. Match text between two HTML tags.
m|<strong>([^<]*)</strong>|
This regex matches everything between <strong>...</strong> HTML tags. The trick here is the ([^<]*), which matches as much as possible until it finds a < character, which starts the next tag.
Alternatively you can write:
m|<strong>(.*?)</strong>|
But this is a little different. For example, if the HTML is <strong><em>hello</em></strong>, the first regex doesn't match at all: ([^<]*) stops immediately at the < of <em>, having matched zero characters, and the regex then requires </strong> at that position but finds <em> instead. The second regex does match, because (.*?) matches as little as possible until </strong> can match, which leaves <em>hello</em> in the group.
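The difference between the two regexes is easy to check with a short sketch. The two helper names are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper using the [^<]* version.
sub strong_text_strict {
    my ($html) = @_;
    return $html =~ m|<strong>([^<]*)</strong>| ? $1 : undef;
}

# Hypothetical helper using the non-greedy .*? version.
sub strong_text_lazy {
    my ($html) = @_;
    return $html =~ m|<strong>(.*?)</strong>| ? $1 : undef;
}

my $nested = '<strong><em>hello</em></strong>';
print defined strong_text_strict($nested) ? 'match' : 'no match', "\n";  # no match
print strong_text_lazy($nested), "\n";                                   # <em>hello</em>
```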
However, don't use regular expressions for matching and parsing HTML. Use modules like HTML::TreeBuilder to accomplish the task more cleanly.
119. Replace all <b> tags with <strong>
$html =~ s|<(/)?b>|<$1strong>|g
Here I assume that the HTML is in the variable $html. The <(/)?b> matches the opening and closing <b> tags and captures the optional closing-tag slash in group $1. The matched tag is then replaced with either <strong> or </strong>, depending on whether it was an opening or closing tag.
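Here is a sketch of the substitution wrapped in a function. The helper name b_to_strong is an assumption. I also capture with (/?) instead of (/)?, which is my variation rather than the original one-liner: it makes $1 an empty string instead of undef for opening tags, so the code runs without an "uninitialized value" warning under use warnings.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $html with <b> and </b> replaced.
sub b_to_strong {
    my ($html) = @_;
    (my $out = $html) =~ s|<(/?)b>|<$1strong>|g;
    return $out;
}

# <br> is left alone because the regex requires > right after the b.
print b_to_strong('<b>bold</b> and <br>'), "\n";
# prints: <strong>bold</strong> and <br>
```

Tags with attributes, such as <b class="x">, are not matched by this regex.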
120. Extract all matches from a regular expression.
my @matches = $text =~ /regex/g;
Here the regular expression gets evaluated in list context, which makes it return all the matches. The matches get put in the @matches variable.
For example, the following regex extracts all numbers from a string:
my $text = "10 hello 25 moo 31 foo";
my @nums = $text =~ /\d+/g;
@nums now contains (10, 25, 31).
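Here is a slightly longer sketch of list-context matching. It also shows two related behaviors: when the regex contains a capture group, the captured text is returned instead of the whole match, and assigning through an empty list gives the number of matches. The variable names are assumptions.

```perl
use strict;
use warnings;

my $text = "10 hello 25 moo 31 foo";

# No capture groups: each whole match is returned.
my @nums = $text =~ /\d+/g;

# With a capture group: only the captured text is returned.
my @words = $text =~ /\d+ (\w+)/g;

# Assigning through an empty list counts the matches.
my $count = () = $text =~ /\d+/g;

print "@nums\n";    # 10 25 31
print "@words\n";   # hello moo foo
print "$count\n";   # 3
```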
Perl one-liners explained e-book
I've now written the "Perl One-Liners Explained" e-book based on this article series. I went through all the one-liners, improved explanations, fixed mistakes and typos, added a bunch of new one-liners, added an introduction to Perl one-liners and a new chapter on Perl's special variables.
Have Fun!
Thanks for reading the article! In the next part I am releasing the perl1line.txt that will contain all the one-liners in a single file.



Comments
I created a one-liner that checks, for point 119, how many numbers between 0 and 255 it matches. This number should be 256, but it's not.
perl -E 'for (0..255) { $count+=m/^(([0-9])|([0-9][0-9])|([12][0-5][0-5]))$/}END{say$count}'
I was confident it was right. I never checked it, I only trusted my reasoning. Now it will be interesting to find where I went wrong.
Edit: Aha, 106 - 109, 116 - 119, ..., 156 - 199 don't get matched, and 206-209, 216-219, 226-229, 236-239, 246-249.
Here is the fix:
Matches 0-9 or 10-99 or 100-199 or 200-249 or 250-255.
Here is the original, broken one that I had:
Nicely done, but #119 is in error as of this writing. It matches a single digit, or two digits, or a one or two followed by two digits under six. That means it won't match its own one-liner number! Or any other useful numbers like 192 and 168 and 127, which come up a lot in IP addresses.
I believe a correct regexp would be:
/^1?\d{1,2}|2[0-4]\d|25[0-5]$/
That matches:
- any one or two digits, optionally starting with "1", so that's 0-9, 00-99, and also 100-199 and (redundantly but harmlessly) 10-19 again
- 200 to 249
- 250 to 255
I haven't run a test, but I'm reasonably confident that it's right.
Yours is correct:
$ perl -E 'for (0..255) { $count+=m/^1?\d{1,2}|2[0-4]\d|25[0-5]$/}END{say$count}'
256
109 is just wrong. The assumption that the last two digits are below 5 for 3-digit numbers is wrong.
It doesn't match: 106, 107, ..., 160, 161, ... , 199, ..., 206, 207, ... 216, ..., 249
Right! That's where it went wrong.
Adding to my previous comment: if you want to allow redundant zeroes, which your regexp does partially, change mine to
/^[01]?\d{1,2}|2[0-4]\d|25[0-5]$/
That's: 0-9, 00-99, 00-09, 000-099, 10-19, 100-199 for the first alternative, then 200-249 for the second and 250-255 for the third as before. It's redundant, but that doesn't matter.
Easier would be combining regexps with a little binary/boolean algebra, but that's straying outside the purity of the topic...
/^\d+$/ is faulty. Firstly, /$/ doesn't just match end-of-string, it will also match at a newline followed by end-of-string. So the regexp matches "1\n" as well as "1". Many of your regexps suffer this flaw.
Secondly, /\d/ is not a synonym for /[0-9]/. It also matches many Unicode digit-like characters. So your regexp matches "\x{666}", which probably doesn't look sufficiently like a number for whatever you were planning to do with it. Many of your regexps have this flaw too.
So /^\d+$/ should be /\A[0-9]+\z/, and many of the other regexps should be amended similarly.
Thanks for commenting on the issues with /^\d+$/.
Nice. Good little reference. Just wanted to mention that there is a missing '+' from 110's hex example.
Fixed!
#112 attempts to match an email _address_ not an email :-)
And the real regex to match an email address (as defined in rfc822) is far more complex. You can see it at the bottom of the source code for Email::Valid.
https://metacpan.org/source/RJBS/Email-Valid-0.185/lib/Email/Valid.pm
#113 checks that the string is a positive integer - not any number. It doesn't allow the use of a minus sign or a decimal point.
You're probably better using the number regexes in Regexp::Common or the function looks_like_number in Scalar::Util.
#115 doesn't increment numbers, it increments integers. And it has a strange effect on floating point numbers. For example, 3.14 is changed to 4.15.
$ perl -E'$_=3.14; s/(\d+)/$1+1/ge; say'
4.15
Thanks for the insightful comments Dave!
Why [0-9][0-9] and not [1-9][0-9]?
Is 09 considered valid?
Depends on what you do.
#112 is VERY bad. If you ever have to validate an email address, you might want to spend a few minutes and find a regex that does it right. It is a complex problem that does deserve some time. Do NOT roll your own solution just because you [think you] can.
In this case, enforcing the dot in the domain part is a bad idea not only because it violates the standard (that would be RFC 822) but also because there are legitimate use cases, for example:
* special domains (like "localhost" for example (there *are* others). Administrative mail on linux gets delivered to root@localhost by default.)
* gtlds
* intranet-only applications where clients have a dns search domain
I know of one company that has two dozen (no kidding) different functions in their flagship product that are supposed to validate email addresses. Most of them are non-trivial and definitely took some time to implement. Not one of them is correct.
(Same commenter as previous comment.)
#109 is bad too because other notations for IPs are valid.
Here is one example:
Oh and *then* there is IPv6.
Dear Peter Krumin, I know it's boring but please take a look at the specs from time to time, especially if you want to publish something like this. This I-don't-care-if-I-violate-the-standard attitude that we see so often these days (especially but not exclusively in corporate environments) really bugs me because in the end we all suffer from it. Right?
I think this chapter on Regular Expressions needs to be expanded or split into several parts, because there are so many other useful regex examples.
IMHO, point 110 is better written as /^(((2[0-4])?|1?[0-9]?)[0-9]|25[0-5])$/
perl -e 'print ~~grep { /^(((2[0-4])?|1?[0-9]?)[0-9]|25[0-5])$/ } (0 .. 300)'
256
Very interesting article suite btw :)
`my $t` should be `my $text`
Regarding 109: in general it is a nasty approach to match something between the start and end of the string, as you may run into funny problems with it. I'd recommend using \b, which matches a word boundary (please look into the Perl regexp specs for it).
Regarding 112: Although criticized a lot there is one significant problem - you match .+ before @ which means "everything before @". Please note "everything" may include spaces and other things you do not want to see.
117: Important to note you are matching a single character this way, not characters. :)
Did not check others in details but well - good idea of collecting this kind of stuff and good luck!
Jaanis Kruumins