This article is part of the article series "Perl One-Liners Explained."
Perl One Liners

This is the seventh part of a nine-part article on famous Perl one-liners. Perl is not Perl without regular expressions, so in this part I will present and explain various Perl regular expressions. Please see part one for the introduction of the series.

Famous Perl one-liners is my attempt to create "perl1line.txt" that is similar to "awk1line.txt" and "sed1line.txt" that have been so popular among Awk and Sed programmers, and Unix sysadmins. I will release the perl1line.txt in the next part of the series.

After I am done with the next part of the article, I will release the whole article series as a pdf e-book! Please subscribe to my blog to be the first to get it. You can also follow me on Twitter.

Awesome news: I have written an e-book based on this article series.

And here are today's one-liners:

109. Match something that looks like an IP address.

/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/

This regex doesn't guarantee that the thing that got matched is in fact a valid IP. All it does is match something that looks like an IP. It matches one to three digits followed by a dot, three times over, and then one to three digits more. For example, it matches a valid IP 81.198.240.140 and it also matches an invalid IP such as 923.844.1.999.

Here is how it works. The ^ at the beginning of the regex is an anchor that matches the beginning of the string. Next \d{1,3} matches one, two or three consecutive digits. The \. matches a dot. The $ at the end is an anchor that matches the end of the string. It's important to use both ^ and $ anchors, otherwise strings like foo213.3.1.2bar would also match.

This regex can be simplified by grouping the first three repeated \d{1,3}\. expressions:

/^(\d{1,3}\.){3}\d{1,3}$/
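To see what the grouped form accepts and rejects, here is a minimal sketch that runs it over a few sample strings. The helper name looks_like_ip and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string merely looks like a dotted quad.
sub looks_like_ip {
    my ($str) = @_;
    return $str =~ /^(\d{1,3}\.){3}\d{1,3}$/ ? 1 : 0;
}

for my $candidate ('81.198.240.140', '923.844.1.999', 'foo213.3.1.2bar', '1.2.3') {
    printf "%-18s %s\n", $candidate,
        looks_like_ip($candidate) ? 'looks like an IP' : 'no match';
}
```

The first two strings match (the second is not a valid IP, as noted above), while the last two are rejected by the anchors and by the missing fourth group.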

110. Test if a number is in range 0-255.

/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/

Here is how it works. A number can be one digit, two digits or three digits long. If it's a one-digit number then we allow it to be anything [0-9]. If it's a two-digit number, we also allow it to be any combination of [0-9][0-9]. However if it's a three-digit number, it has to be either one hundred-something or two hundred-something. If it's one hundred-something, then 1[0-9][0-9] matches it. If it's two hundred-something then it's either something up to 249, which is matched by 2[0-4][0-9], or it's 250-255, which is matched by 25[0-5].
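One way to check a range regex like this is to count how many integers it accepts. The sketch below tests every integer from 0 to 999; the variable names are chosen just for this example.

```perl
use strict;
use warnings;

my $in_range = qr/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/;

# Count how many of the integers 0..999 the regex accepts.
my $count = grep { /$in_range/ } 0 .. 999;
print "$count\n";
```

This prints 256, one match for each of 0 through 255 and none above. Note that the two-digit alternative also accepts strings with a leading zero such as 07, which may or may not be what you want.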

111. Match an IP address.

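# Use / as the qr delimiter: the pattern itself contains |, so qr|...| would end early.
my $ip_part = qr/([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;
if ($ip =~ /^($ip_part\.){3}$ip_part$/) {
    say "valid ip";   # say needs "use feature 'say'" (or perl -E)
}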

This regexp combines the previous two. The qr/.../ operator compiles the regular expression, and the result is stored in the $ip_part variable. Then $ip_part is used to match the four parts of the IP address.
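Here is a self-contained sketch of the same idea wrapped in a function. The names $octet and is_valid_ipv4 are hypothetical, and the groups are non-capturing (?:...) only because the captured text isn't needed.

```perl
use strict;
use warnings;

# One part of the address: a number in the range 0-255.
my $octet = qr/(?:[0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;

# Hypothetical helper: true if the string is four 0-255 parts joined by dots.
sub is_valid_ipv4 {
    my ($ip) = @_;
    return $ip =~ /^(?:$octet\.){3}$octet$/ ? 1 : 0;
}

for my $ip ('81.198.240.140', '923.844.1.999', '256.1.1.1', '1.2.3') {
    printf "%-16s %s\n", $ip, is_valid_ipv4($ip) ? 'valid ip' : 'not an ip';
}
```

Only the first sample is reported as valid.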

112. Check if the string looks like an email address.

/.+@.+\..+/

This regex makes sure that the string looks like an email address. Notice that I say "looks like". It doesn't guarantee it is an email address. Here is how it works: first it matches something up to the @ symbol, then it matches as much as possible until it finds a dot, and then it matches some more. If this succeeds, then it's something that at least looks like an email address, with the @ symbol and a dot in it.

For example, cats@catonmat.net matches but cats@catonmat doesn't because the regex can't match the dot \. that is necessary.
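A short sketch shows how loose this check is. The helper name looks_like_email and the extra sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the one-liner's regex.
sub looks_like_email {
    my ($str) = @_;
    return $str =~ /.+@.+\..+/ ? 1 : 0;
}

for my $candidate ('cats@catonmat.net', 'cats@catonmat', 'no-at-sign.example', 'a b@c d.e') {
    printf "%-20s %s\n", $candidate,
        looks_like_email($candidate) ? 'looks like an email address' : 'no match';
}
```

The last sample contains spaces and still matches, because .+ accepts any characters. That is why this regex is only a rough sanity check.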

A much more robust way to check if a string is a valid email address would be to use the Email::Valid module:

use Email::Valid;
print (Email::Valid->address('john@example.com') ? 'valid email' : 'invalid email');

113. Check if the string is a decimal number.

Checking if the string is a number is really difficult. I based my regex and explanation on the one in Perl Cookbook.

Perl offers \d that matches digits 0-9. So we can start with:

/^\d+$/

This regex matches one or more digits \d starting at the beginning of the string ^ and ending at the end of the string $. However this doesn't match numbers such as +3 and -3. Let's modify the regex to match them:

/^[+-]?\d+$/

Here the [+-]? means match an optional plus or a minus before the digits. This now matches +3 and -3 but it doesn't match -0.3. Let's add that:

/^[+-]?\d+\.?\d*$/

Now we have expanded the previous regex by adding \.?\d*, which matches an optional dot followed by zero or more digits. Now we're in business and this regex also matches numbers like -0.3 and 0.3.
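To check which strings the final regex accepts, here is a small sketch. The helper name looks_like_decimal and the sample strings are chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the final regex from this section.
sub looks_like_decimal {
    my ($str) = @_;
    return $str =~ /^[+-]?\d+\.?\d*$/ ? 1 : 0;
}

for my $candidate ('3', '+3', '-0.3', '3.', '.5', '1e5', 'abc') {
    printf "%-5s %s\n", $candidate,
        looks_like_decimal($candidate) ? 'decimal' : 'no match';
}
```

The first four samples match. A string like .5 does not, because the regex requires at least one digit before the optional dot, and 1e5 does not, because exponents aren't handled.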

A much better way to match a decimal number is to use the Regexp::Common module, which offers various useful regexes. For example, to match an integer you can use $RE{num}{int} from Regexp::Common.

How about positive hexadecimal numbers? Here is how:

/^0x[0-9a-f]+$/i

This matches the hex prefix 0x followed by the hex number itself. The /i flag at the end makes sure that the match is case insensitive. For example, 0x5af matches, 0X5Fa matches but 97 doesn't, because it's just a decimal number.

It's better to use $RE{num}{hex} because it supports negative numbers, decimal places and number grouping.

Now how about octal? Here is how:

/^0[0-7]+$/

Octal numbers are prefixed by 0, which is followed by octal digits 0-7. For example, 013 matches but 09 doesn't, because it's not a valid octal number.

It's better to use $RE{num}{oct} because of the same reasons as above.

Finally binary:

/^[01]+$/

Binary base consists of just 0s and 1s. For example, 010101 matches but 210101 doesn't, because 2 is not a valid binary digit.

It's better to use $RE{num}{bin} because of the same reasons as above.
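The three hand-written regexes above can be tried side by side. This is only a sketch using the plain regexes (not Regexp::Common); the %patterns hash and the number_formats helper are hypothetical names.

```perl
use strict;
use warnings;

my %patterns = (
    hex    => qr/^0x[0-9a-f]+$/i,
    octal  => qr/^0[0-7]+$/,
    binary => qr/^[01]+$/,
);

# Hypothetical helper: names of all patterns the string matches, sorted.
sub number_formats {
    my ($str) = @_;
    return grep { $str =~ $patterns{$_} } sort keys %patterns;
}

for my $sample ('0x5af', '0X5Fa', '97', '013', '09', '010101', '210101') {
    my @formats = number_formats($sample);
    printf "%-7s %s\n", $sample, @formats ? "@formats" : 'no match';
}
```

Notice that 010101 is reported as both binary and octal. The regexes only describe what a string looks like, so the same string can be valid in more than one base.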

114. Check if a word appears twice in the string.

/(word).*\1/

This regex matches word followed by something or nothing at all, followed by the same word. Here the (word) captures the word in group 1 and \1 refers to contents of group 1, therefore it's almost the same as writing /(word).*word/

For example, silly things are silly matches /(silly).*\1/, but silly things are boring doesn't, because silly is not repeated in the string.
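Here is a minimal sketch that takes the word as a parameter. The helper name appears_twice is made up, and \Q...\E is added so that any regex metacharacters in the word are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word occurs at least twice in $str.
sub appears_twice {
    my ($word, $str) = @_;
    return $str =~ /(\Q$word\E).*\1/ ? 1 : 0;
}

print appears_twice('silly', 'silly things are silly')  ? "twice\n" : "not twice\n";
print appears_twice('silly', 'silly things are boring') ? "twice\n" : "not twice\n";
```

The first call reports "twice" and the second "not twice".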

115. Increase all numbers by one in the string.

$str =~ s/(\d+)/$1+1/ge

Here we use the substitution operator s///. It matches all integers (\d+), puts them in capture group 1, then it replaces them with their value incremented by one $1+1. The g flag makes sure it finds all the numbers in the string, and the e flag evaluates $1+1 as a Perl expression.

For example, this 1234 is awesome 444 gets turned into this 1235 is awesome 445.
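The example above can be run as a short script. This sketch copies the string first so the original stays untouched; the variable names are chosen for illustration. It also shows what happens to a number with a decimal point, since \d+ sees the digits on each side of the dot as two separate integers.

```perl
use strict;
use warnings;

my $str = 'this 1234 is awesome 444';

# Copy $str, then increment every run of digits in the copy.
(my $incremented = $str) =~ s/(\d+)/$1+1/ge;
print "$incremented\n";   # this 1235 is awesome 445

# Each side of the decimal point is incremented separately.
(my $float = '3.14') =~ s/(\d+)/$1+1/ge;
print "$float\n";         # 4.15
```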

116. Extract HTTP User-Agent string from the HTTP headers.

/^User-Agent: (.+)$/

HTTP headers are formatted as Key: Value pairs. It's very easy to parse such strings: you just instruct the regex engine to save the Value part in the $1 group variable.

For example, if the HTTP headers contain,

Host: localhost:8000
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

Then the regular expression will extract the Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US) string.
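Here is a sketch of the extraction, assuming the headers sit in a single string with one header per line separated by "\n". In that case the /m flag is needed so that ^ and $ match at line boundaries rather than only at the start and end of the whole string. The variable names are made up for this example.

```perl
use strict;
use warnings;

# Assumption: all headers are in one string, separated by "\n".
my $headers = join "\n",
    'Host: localhost:8000',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)',
    'Accept-Encoding: gzip,deflate,sdch';

# In list context the match returns the captured group.
my ($user_agent) = $headers =~ /^User-Agent: (.+)$/m;
print "$user_agent\n";
```

If you process the headers one line at a time instead, the regex works as written without /m.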

117. Match printable ASCII characters.

/[ -~]/

This is really tricky and smart. To understand it, take a look at man ascii. You'll see that space starts at value 0x20 and the ~ character is 0x7e. All the characters between a space and ~ are printable. This regular expression matches exactly that. The [ -~] defines a range of characters from space till ~. This is my favorite regexp of all time.

You can invert the match by placing ^ as the first character in the group:

/[^ -~]/

This matches the opposite of [ -~].
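Since each of these character classes matches a single character, one practical use is counting the non-printable characters in a string. The helper name count_non_printable is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: number of characters outside the printable ASCII range.
sub count_non_printable {
    my ($str) = @_;
    my $count = () = $str =~ /[^ -~]/g;   # list-context match, then count the matches
    return $count;
}

print count_non_printable('Hello, world! ~'), "\n";   # 0
print count_non_printable("tab\there\n"), "\n";       # 2 (the tab and the newline)
```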

118. Match text between two HTML tags.

m|<strong>([^<]*)</strong>|

This regex matches everything between <strong>...</strong> HTML tags. The trick here is the ([^<]*), which matches as much as possible until it finds a < character, which starts the next tag.

Alternatively you can write:

m|<strong>(.*?)</strong>|

But this is a little different. For example, if the HTML is <strong><em>hello</em></strong> then the first regex doesn't match anything, because a < immediately follows <strong>, so ([^<]*) matches zero characters and the regex then expects </strong> but finds <em> instead. The second regex matches <em>hello</em> because the (.*?)</strong> matches as little as possible until it finds </strong>, which happens to be <em>hello</em>.

However, don't use regular expressions for matching and parsing HTML. Use modules like HTML::TreeBuilder to accomplish the task more cleanly.
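The difference between the two regexes is easy to see in a short sketch. The sample strings and variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $plain  = '<strong>hello</strong>';
my $nested = '<strong><em>hello</em></strong>';

# First form: everything up to the next < character.
my ($plain_text)   = $plain  =~ m|<strong>([^<]*)</strong>|;
my ($nested_first) = $nested =~ m|<strong>([^<]*)</strong>|;

# Second form: as little as possible up to </strong>.
my ($nested_second) = $nested =~ m|<strong>(.*?)</strong>|;

print "plain, first regex:   $plain_text\n";
print "nested, first regex:  ", defined $nested_first ? $nested_first : '(no match)', "\n";
print "nested, second regex: $nested_second\n";
```

The first regex extracts hello from the plain string and finds no match in the nested one, while the second regex extracts <em>hello</em>.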

119. Replace all <b> tags with <strong>

$html =~ s|<(/)?b>|<$1strong>|g

Here I assume that the HTML is in variable $html. Next the <(/)?b> matches the opening and closing <b> tags, captures the optional closing tag slash in group $1 and then replaces the matched tag with either <strong> or </strong>, depending on if it was an opening or closing tag.
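Here is a runnable sketch of the substitution with a made-up sample string. It uses a slightly different capture, (/?) instead of (/)?, so that group 1 is always set (to an empty string for opening tags); with use warnings enabled, the original form would warn about an uninitialized $1 on every opening tag.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <b>more</b>';

# (/?) always participates in the match, so $1 is '' or '/'.
(my $converted = $html) =~ s|<(/?)b>|<$1strong>|g;
print "$converted\n";   # <strong>bold</strong> and <strong>more</strong>
```

Tags that carry attributes, such as <b class="x">, are not touched by this regex.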

120. Extract all matches from a regular expression.

my @matches = $text =~ /regex/g;

Here the regular expression gets evaluated in list context, which makes it return all the matches. The matches get put in the @matches variable.

For example, the following regex extracts all numbers from a string:

my $text = "10 hello 25 moo 31 foo";
my @nums = $text =~ /\d+/g;

@nums now contains (10, 25, 31).
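Here is a complete sketch of extracting all matches. The second half uses a made-up settings string to show that when the regex contains capture groups, a list-context match with /g returns the captured groups instead of the whole matches.

```perl
use strict;
use warnings;

my $text = "10 hello 25 moo 31 foo";
my @nums = $text =~ /\d+/g;
print "@nums\n";   # 10 25 31

# With capture groups, the list holds the captures: key, value, key, value, ...
my %settings = "width=80 height=24" =~ /(\w+)=(\d+)/g;
print "$_ => $settings{$_}\n" for sort keys %settings;
```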

Perl one-liners explained e-book

I've now written the "Perl One-Liners Explained" e-book based on this article series. I went through all the one-liners, improved explanations, fixed mistakes and typos, added a bunch of new one-liners, added an introduction to Perl one-liners and a new chapter on Perl's special variables.

Have Fun!

Thanks for reading the article! In the next part I am releasing the perl1line.txt that will contain all the one-liners in a single file.


Comments

November 10, 2011, 11:56

I created a one-liner that checks how many numbers between 0 and 255 the regex in point 119 matches. This number should be 256, but it's not.

perl -E 'for (0..255) { $count+=m/^(([0-9])|([0-9][0-9])|([12][0-5][0-5]))$/}END{say$count}'
November 10, 2011, 12:03

I was confident it was right. I never checked it, I only trusted my reasoning. Now it will be interesting to find where I went wrong.

Edit: Aha, 106 - 109, 116 - 119, ..., 156 - 199 don't get matched, and 206-209, 216-219, 226-229, 236-239, 246-249.

Here is the fix:

/^(([0-9])|([0-9][0-9])|(1[0-9][0-9])|(2[0-4][0-9])|(25[0-5]))$/

Matches 0-9 or 10-99 or 100-199 or 200-249 or 250-255.

Here is the original, broken one that I had:

/^(([0-9])|([0-9][0-9])|([12][0-5][0-5]))$/
Eric TF Bat
November 10, 2011, 11:59

Nicely done, but #119 is in error as of this writing. It matches a single digit, or two digits, or a one or two followed by two digits under six. That means it won't match its own one-liner number! Or any other useful numbers like 192 and 168 and 127, which come up a lot in IP addresses.

I believe a correct regexp would be:

/^1?\d{1,2}|2[0-4]\d|25[0-5]$/

That matches:

- any one or two digits, optionally starting with "1", so that's 0-9, 00-99, and also 100-199 and (redundantly but harmlessly) 10-19 again
- 200 to 249
- 250 to 255

I haven't run a test, but I'm reasonably confident that it's right.

November 10, 2011, 12:06

Yours is correct:

$ perl -E 'for (0..255) { $count+=m/^1?\d{1,2}|2[0-4]\d|25[0-5]$/}END{say$count}'
256
Dane
November 10, 2011, 12:01

109 is just wrong. The assumption that the last two digits are below 5 for 3-digit numbers is wrong.

It doesn't match: 106, 107, ..., 160, 161, ... , 199, ..., 206, 207, ... 216, ..., 249

November 10, 2011, 12:11

Right! That's where it went wrong.

Eric TF Bat
November 10, 2011, 12:04

Adding to my previous comment: if you want to allow redundant zeroes, which your regexp does partially, change mine to

/^[01]?\d{1,2}|2[0-4]\d|25[0-5]$/

That's: 0-9, 00-99, 00-09, 000-099, 10-19, 100-199 for the first alternative, then 200-249 for the second and 250-255 for the third as before. It's redundant, but that doesn't matter.

Easier would be combining regexps with a little binary/boolean algebra, but that's straying outside the purity of the topic...

Zefram
November 10, 2011, 12:51

/^\d+$/ is faulty. Firstly, /$/ doesn't just match end-of-string, it will also match at a newline followed by end-of-string. So the regexp matches "1\n" as well as "1". Many of your regexps suffer this flaw.

Secondly, /\d/ is not a synonym for /[0-9]/. It also matches many Unicode digit-like characters. So your regexp matches "\x{666}", which probably doesn't look sufficiently like a number for whatever you were planning to do with it. Many of your regexps have this flaw too.

So /^\d+$/ should be /\A[0-9]+\z/, and many of the other regexps should be amended similarly.

November 10, 2011, 15:01

Thanks for commenting on the issues with /^\d+$/.

November 10, 2011, 14:21

Nice. Good little reference. Just wanted to mention that there is a missing '+' from 110's hex example.

November 10, 2011, 17:21

Fixed!

November 10, 2011, 14:27

#112 attempts to match an email _address_ not an email :-)

And the real regex to match an email address (as defined in rfc822) is far more complex. You can see it at the bottom of the source code for Email::Valid.

https://metacpan.org/source/RJBS/Email-Valid-0.185/lib/Email/Valid.pm

November 10, 2011, 14:29

#113 checks that the string is a positive integer - not any number. It doesn't allow the use of a minus sign or a decimal point.

You're probably better off using the number regexes in Regexp::Common or the function looks_like_number in Scalar::Util.

November 10, 2011, 14:32

#115 doesn't increment numbers, it increments integers. And it has a strange effect on floating point numbers. For example, 3.14 is changed to 4.15.

$ perl -E'$_=3.14; s/(\d+)/$1+1/ge; say'
4.15

November 10, 2011, 15:00

Thanks for the insightful comments Dave!

Aleksis Brezas
November 11, 2011, 13:25

Why [0-9][0-9] and not [1-9][0-9]?

Is 09 considered valid?

November 11, 2011, 15:33

Depends on what you do.

you
November 13, 2011, 14:28

#112 is VERY bad. If you ever have to validate an email address, you might want to spend a few minutes and find a regex that does it right. It is a complex problem that does deserve some time. Do NOT roll your own solution just because you [think you] can.

In this case, enforcing the dot in the domain part is a bad idea, not only because it violates the standard (that would be RFC 822) but also because there are legitimate use cases. For example:

* special domains (like "localhost" for example (there *are* others). Administrative mail on linux gets delivered to root@localhost by default.)
* gtlds
* intranet-only applications where clients have a dns search domain

I know of one company that has two dozen (no kidding) different functions in their flagship product that are supposed to validate email addresses. Most of them are non-trivial and definitely took some time to implement. Not one of them is correct.

you
November 14, 2011, 21:09

(Same commenter as previous comment.)

#109 is bad too because other notations for IPs are valid.

Here is one example:

# Find out the IP of example.com
$ dig +short example.com A
192.0.43.10

# Convert it into another notation
$ python2 -c "print 192*256**3 + 0*256**2 + 43*256**1 + 10*256**0"
3221236490

# Demo, just to prove that this really works
$ ping 3221236490
PING 3221236490 (192.0.43.10) 56(84) bytes of data.
64 bytes from 192.0.43.10: icmp_req=1 ttl=241 time=101 ms
64 bytes from 192.0.43.10: icmp_req=2 ttl=241 time=101 ms
^C
--- 3221236490 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 101.011/101.265/101.519/0.254 ms

Oh and *then* there is IPv6.

Dear Peter Krumin, I know it's boring but please take a look at the specs from time to time, especially if you want to publish something like this. This I-don't-care-if-I-violate-the-standard attitude that we see so often these days (especially but not exclusively in corporate environments) really bugs me because in the end we all suffer from it. Right?

November 15, 2011, 05:40

I think this chapter on Regular Expressions needs to be expanded or split into several parts, because there are so many other useful regex examples.

lco
January 23, 2012, 12:06

IMHO, point 110 better written /^(((2[0-4])?|1?[0-9]?)[0-9]|25[0-5])$/

perl -e 'print ~~grep { /^(((2[0-4])?|1?[0-9]?)[0-9]|25[0-5])$/ } (0 .. 300)'
256

Very interesting article suite btw :)

myq
August 16, 2012, 23:04

`my $t` should be `my $text`

Jaanis
April 19, 2013, 01:46

Regarding 109: in general it is a nasty approach to match something between the start and end of the string, as you may run into funny problems with it. I'd recommend using \b, which matches a word boundary (please look into the Perl regexp specs for it).

Regarding 112: Although criticized a lot there is one significant problem - you match .+ before @ which means "everything before @". Please note "everything" may include spaces and other things you do not want to see.

117: Important to note you are matching a single character this way, not characters. :)

Did not check others in details but well - good idea of collecting this kind of stuff and good luck!

Jaanis Kruumins
