Perl One-Liners Explained, Part VII: Handy Regular Expressions

This is the seventh part of a nine-part article on Perl one-liners. Perl is not Perl without regular expressions, so in this part I will present and explain various handy Perl regular expressions. Please see part one for the introduction of the series.

Perl one-liners is my attempt to create "perl1line.txt" that is similar to "awk1line.txt" and "sed1line.txt" that have been so popular among Awk and Sed programmers, and Unix sysadmins. I will release the perl1line.txt in the next part of the series.

The article on Perl one-liners consists of nine parts:

Part I: File spacing.
Part II: Line numbering.
Part III: Calculations.
Part IV: String creation and array creation.
Part V: Text conversion and substitution.
Part VI: Selective printing and deleting of certain lines.
Part VII: Handy regular expressions (this part).
Part VIII: Release of perl1line.txt.
Part IX: Release of Perl One-Liners e-book.

After I am done with the next part of the article, I will release the whole article series as a pdf e-book! Please subscribe to my blog to be the first to get it. You can also follow me on Twitter.

Awesome news: I have written an e-book based on this article series. Check it out:

Perl book

And here are today's one-liners:

109. Match something that looks like an IP address.

/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/

This regex doesn't guarantee that the thing that got matched is in fact a valid IP. All it does is match something that looks like an IP. It matches four numbers of one to three digits each, separated by dots. For example, it matches a valid IP 81.198.240.140 and it also matches an invalid IP such as 923.844.1.999.

Here is how it works. The ^ at the beginning of the regex is an anchor that matches the beginning of the string. Next \d{1,3} matches one, two or three consecutive digits. The \. matches a literal dot. The $ at the end is an anchor that matches the end of the string. It's important to use both ^ and $ anchors, otherwise strings like foo213.3.1.2bar would also match.

This regex can be simplified by grouping the first three repeated \d{1,3}\. expressions:

/^(\d{1,3}\.){3}\d{1,3}$/
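A quick way to try the grouped pattern is a small script along these lines. It is only a sketch: the looks_like_ip helper name and the sample strings are my own choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has the dotted-quad shape.
sub looks_like_ip {
    my ($str) = @_;
    return $str =~ /^(\d{1,3}\.){3}\d{1,3}$/ ? 1 : 0;
}

# A valid IP, an invalid one with the right shape, an unanchored
# lookalike, and a string with only three parts.
for my $candidate ('81.198.240.140', '923.844.1.999', 'foo213.3.1.2bar', '1.2.3') {
    printf "%-18s %s\n", $candidate,
        looks_like_ip($candidate) ? 'looks like an IP' : 'no match';
}
```

The first two strings match because only the shape is checked, not the value of each number.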

110. Test if a number is in range 0-255.

/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/

Here is how it works. A number can have one, two or three digits. If it's a one-digit number then we allow it to be anything [0-9]. If it has two digits, we also allow any combination of [0-9][0-9]. However, if it's a three-digit number, it has to be either one hundred-something or two hundred-something. If it's one hundred-something, then 1[0-9][0-9] matches it. If it's two hundred-something then it's either something up to 249, which is matched by 2[0-4][0-9], or it's 250-255, which is matched by 25[0-5].
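One way to convince yourself that the alternation covers exactly 0-255 is to run it over a range of integers. This is a sketch; the $in_range and @accepted names are made up for the example.

```perl
use strict;
use warnings;

# The range regex from above, compiled once with qr//.
my $in_range = qr/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/;

# Collect every integer from 0 to 999 that the regex accepts.
my @accepted = grep { /$in_range/ } 0 .. 999;

print "accepted ", scalar(@accepted), " numbers, from $accepted[0] to $accepted[-1]\n";
```

Note that the two-digit branch also accepts strings with a leading zero such as 07, while 007 is rejected because a three-digit string has to start with 1 or 2.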

111. Match an IP address.

use feature 'say';

my $ip_part = qr/([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;
if ($ip =~ /^($ip_part\.){3}$ip_part$/) {
 say "valid ip";
}

This regex combines the previous two. The qr/.../ operator compiles the regular expression and puts it in the $ip_part variable. Then $ip_part is used to match the four parts of the IP address.
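Wrapped in a subroutine, the same idea might look like the sketch below. The is_valid_ip name is hypothetical, and I use non-capturing groups (?:...) here only because the captured text isn't needed.

```perl
use strict;
use warnings;

# One part of the address, 0-255. qr// compiles the pattern once for reuse.
my $ip_part = qr/(?:[0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;

# Hypothetical helper: true only if all four parts are in range 0-255.
sub is_valid_ip {
    my ($ip) = @_;
    return $ip =~ /^(?:$ip_part\.){3}$ip_part$/ ? 1 : 0;
}

for my $candidate ('81.198.240.140', '923.844.1.999', '255.255.255.255', '256.1.1.1') {
    printf "%-18s %s\n", $candidate, is_valid_ip($candidate) ? 'valid ip' : 'invalid ip';
}
```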

112. Check if the string looks like an email address.

/.+@.+\..+/

This regex makes sure that the string looks like an email address. Notice that I say "looks like". It doesn't guarantee it is an email address. Here is how it works: first it matches something up to the @ symbol, then it matches as much as possible until it finds a dot, and then it matches some more. If this succeeds, then it's something that at least looks like an email address, with the @ symbol and a dot in it.

For example, cats@catonmat.net matches but cats@catonmat doesn't because the regex can't find the dot that is required after the @ symbol.
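To see how loose "looks like" really is, here is a small sketch. The looks_like_email helper and the third sample string are assumptions of mine, chosen to show a false positive.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has the
# something@something.something shape anywhere in it.
sub looks_like_email {
    my ($str) = @_;
    return $str =~ /.+@.+\..+/ ? 1 : 0;
}

# Single quotes keep Perl from interpolating the @ sign.
for my $candidate ('cats@catonmat.net', 'cats@catonmat', 'not really @ an address. at all') {
    printf "%-34s %s\n", $candidate,
        looks_like_email($candidate) ? 'looks like an email' : 'no match';
}
```

The last string matches too, because the pattern has no anchors and . happily matches spaces.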

A much more robust way to check if a string is a valid email is to use the Email::Valid module:

use Email::Valid;
print (Email::Valid->address('john@example.com') ? 'valid email' : 'invalid email');

113. Check if the string is a decimal number.

Checking if the string is a number is really difficult. I based my regex and explanation on the one in Perl Cookbook.

Perl offers \d that matches digits 0-9. So we can start with:

/^\d+$/

This regex matches one or more digits \d starting at the beginning of the string ^ and ending at the end of the string $. However this doesn't match numbers such as +3 and -3. Let's modify the regex to match them:

/^[+-]?\d+$/

Here the [+-]? means match an optional plus or a minus before the digits. This now matches +3 and -3 but it doesn't match -0.3. Let's add that:

/^[+-]?\d+\.?\d*$/

Now we have expanded the previous regex by adding \.?\d*, which matches an optional dot followed by zero or more digits. Now we're in business and this regex also matches numbers like -0.3 and 0.3.

A much better way to match a decimal number is to use the Regexp::Common module, which offers various useful regexes. For example, to match an integer you can use $RE{num}{int} from Regexp::Common.
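The final regex is easy to probe with a handful of inputs. The sketch below uses a hypothetical is_decimal helper and sample strings I picked to show the edges.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the final regex from above.
sub is_decimal {
    my ($str) = @_;
    return $str =~ /^[+-]?\d+\.?\d*$/ ? 1 : 0;
}

for my $candidate ('3', '+3', '-3', '-0.3', '0.3', '3.', '.5', '1.2.3', 'abc') {
    printf "%-6s %s\n", $candidate, is_decimal($candidate) ? 'decimal' : 'no match';
}
```

Two edge cases are worth knowing: 3. is accepted because the digits after the dot are optional, while .5 is rejected because the regex requires at least one digit before the dot.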

How about positive hexadecimal numbers? Here is how:

/^0x[0-9a-f]+$/i

This matches the hex prefix 0x followed by the hex number itself. The /i flag at the end makes sure that the match is case insensitive. For example, 0x5af matches, 0X5Fa matches but 97 doesn't, because it's just a decimal number.

It's better to use $RE{num}{hex} because it supports negative numbers, decimal places and number grouping.

Now how about octal? Here is how:

/^0[0-7]+$/

Octal numbers are prefixed by 0, which is followed by octal digits 0-7. For example, 013 matches but 09 doesn't, because it's not a valid octal number.

It's better to use $RE{num}{oct} for the same reasons as above.

Finally binary:

/^[01]+$/

Binary base consists of just 0s and 1s. For example, 010101 matches but 210101 doesn't, because 2 is not a valid binary digit.

It's better to use $RE{num}{bin} for the same reasons as above.
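The three hand-written patterns can be tried side by side with a sketch like this one. The %pattern hash and the number_kinds helper are names I made up for the example.

```perl
use strict;
use warnings;

# The hex, octal and binary regexes from above, keyed by name.
my %pattern = (
    hex    => qr/^0x[0-9a-f]+$/i,
    octal  => qr/^0[0-7]+$/,
    binary => qr/^[01]+$/,
);

# Hypothetical helper: names of the patterns that accept the string,
# in alphabetical order.
sub number_kinds {
    my ($str) = @_;
    return grep { $str =~ $pattern{$_} } sort keys %pattern;
}

for my $candidate ('0x5af', '0X5Fa', '97', '013', '09', '010101', '210101') {
    my @kinds = number_kinds($candidate);
    printf "%-8s %s\n", $candidate, @kinds ? join(', ', @kinds) : 'no match';
}
```

A string such as 010101 is accepted by both the octal and the binary pattern, so these regexes check the form of a string and can't tell you which base was intended.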

114. Check if a word appears twice in the string.

/(word).*\1/

This regex matches word followed by something or nothing at all, followed by the same word. Here the (word) captures the word in group 1 and \1 refers to contents of group 1, therefore it's almost the same as writing /(word).*word/

For example, silly things are silly matches /(silly).*\1/, but silly things are boring doesn't, because silly is not repeated in the string.
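The same backreference trick can be generalized to find any repeated word rather than one fixed word. Note that this is a variation, not the regex above: it swaps the literal word for \w+ and adds \b word boundaries so that only whole words count. The repeated_word helper name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first whole word that appears
# again later in the string, or undef if there is none.
sub repeated_word {
    my ($str) = @_;
    return $str =~ /\b(\w+)\b.*\b\1\b/ ? $1 : undef;
}

for my $sentence ('silly things are silly', 'silly things are boring') {
    my $word = repeated_word($sentence);
    printf "%-26s %s\n", $sentence, defined $word ? "repeats '$word'" : 'no repeated word';
}
```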

115. Increase all numbers by one in the string.

$str =~ s/(\d+)/$1+1/ge

Here we use the substitution operator s///. It matches all integers (\d+), puts them in capture group 1, then it replaces them with their value incremented by one $1+1. The g flag makes sure it finds all the numbers in the string, and the e flag evaluates $1+1 as a Perl expression.

For example, this 1234 is awesome 444 gets turned into this 1235 is awesome 445.
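Here is the substitution as a small runnable sketch. The increment_numbers helper and the extra sample strings are my own additions; they show how the regex treats every run of digits as a separate integer.

```perl
use strict;
use warnings;

# Hypothetical helper: works on a copy, so the caller's string is untouched.
sub increment_numbers {
    my ($str) = @_;
    $str =~ s/(\d+)/$1+1/ge;   # /e evaluates $1+1 as Perl code
    return $str;
}

for my $sample ('this 1234 is awesome 444', 'version 9.99', 'file007') {
    printf "%-26s => %s\n", $sample, increment_numbers($sample);
}
```

Two things to keep in mind: in 9.99 the 9 and the 99 are incremented independently, giving 10.100, and leading zeros are lost because 007 is treated as the number 7.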

116. Extract HTTP User-Agent string from the HTTP headers.

/^User-Agent: (.+)$/

HTTP headers are formatted as Key: Value pairs. It's very easy to parse such strings: you just instruct the regex engine to save the Value part in the $1 capture variable.

For example, if the HTTP headers contain,

Host: localhost:8000
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

Then the regular expression will extract the Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US) string.
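If all the headers are in a single multi-line string rather than being read line by line, the regex needs the /m flag so that ^ and $ match at line boundaries. The sketch below assumes the headers are in a hypothetical $headers variable with lines separated by "\n".

```perl
use strict;
use warnings;

# A shortened copy of the example headers, joined with newlines.
my $headers = join "\n",
    'Host: localhost:8000',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)',
    'Accept-Encoding: gzip,deflate,sdch';

# List context returns the captured group; /m makes ^ and $ work per line.
my ($user_agent) = $headers =~ /^User-Agent: (.+)$/m;

print defined $user_agent ? "User-Agent is: $user_agent\n" : "no User-Agent header\n";
```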

117. Match printable ASCII characters.

/[ -~]/

This is really tricky and smart. To understand it, take a look at man ascii. You'll see that space starts at value 0x20 and the ~ character is 0x7e. All the characters between a space and ~ are printable. This regular expression matches exactly that. The [ -~] defines a range of characters from space till ~. This is my favorite regexp of all time.

You can invert the match by placing ^ as the first character in the character class:

/[^ -~]/

This matches the opposite of [ -~].
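Both forms are easy to check against the ASCII table with a few lines of Perl. This is a sketch; @printable and strip_unprintable are names I chose for the example.

```perl
use strict;
use warnings;

# Every ASCII code whose character is matched by [ -~].
my @printable = grep { chr($_) =~ /[ -~]/ } 0 .. 127;
printf "%d printable characters, from 0x%02x to 0x%02x\n",
    scalar(@printable), $printable[0], $printable[-1];

# Hypothetical helper: the inverted class deletes everything else.
sub strip_unprintable {
    my ($str) = @_;
    $str =~ s/[^ -~]//g;
    return $str;
}

print strip_unprintable("tab\there and a newline\n"), "\n";
```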

118. Match text between two HTML tags.

m|<strong>([^<]*)</strong>|

This regex matches everything between <strong>...</strong> HTML tags. The trick here is the ([^<]*), which matches as much as possible until it finds a < character, which starts the next tag.

Alternatively you can write:

m|<strong>(.*?)</strong>|

But this is a little different. For example, if the HTML is <strong><em>hello</em></strong> then the first regex doesn't match anything because a < immediately follows <strong> and ([^<]*) can't match past a < character. The second regex matches <em>hello</em> because the (.*?) matches as little as possible until it finds </strong>, and everything before that is <em>hello</em>.

However, don't use regular expressions for matching and parsing HTML. Use modules like HTML::TreeBuilder to accomplish the task more cleanly.
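For experimenting with the difference between the two patterns (not as a recommendation for real HTML parsing), a sketch like this can help. The helper names and the two sample strings are my own.

```perl
use strict;
use warnings;

# First pattern: the content may not contain a < character.
sub text_without_tags {
    my ($html) = @_;
    my ($text) = $html =~ m|<strong>([^<]*)</strong>|;
    return $text;
}

# Second pattern: non-greedy, the content may contain nested tags.
sub text_non_greedy {
    my ($html) = @_;
    my ($text) = $html =~ m|<strong>(.*?)</strong>|;
    return $text;
}

for my $html ('<strong>hello</strong>', '<strong><em>hello</em></strong>') {
    my ($first, $second) = (text_without_tags($html), text_non_greedy($html));
    printf "%-34s first: %-10s second: %s\n", $html,
        defined $first  ? $first  : '(none)',
        defined $second ? $second : '(none)';
}
```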

119. Replace all <b> tags with <strong>.

$html =~ s|<(/)?b>|<$1strong>|g

Here I assume that the HTML is in the variable $html. Next the <(/)?b> matches the opening and closing <b> tags, captures the optional closing tag slash in group $1 and then replaces the matched tag with either <strong> or </strong>, depending on whether it was an opening or closing tag.
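Here is a runnable sketch of the replacement. It uses a slight variation, (/?) instead of (/)?, so that $1 is an empty string rather than undefined for opening tags, which keeps it quiet under use warnings. The b_to_strong helper name and the sample HTML are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: works on a copy of the HTML and returns the result.
sub b_to_strong {
    my ($html) = @_;
    # (/?) captures "/" for closing tags and "" for opening tags.
    $html =~ s|<(/?)b>|<$1strong>|g;
    return $html;
}

print b_to_strong('<b>bold</b> and <br> and <b>more</b>'), "\n";
```

The <br> tag is left alone because the pattern requires > right after the b. For the same reason, tags with attributes or uppercase <B> tags are not touched.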

120. Extract all matches from a regular expression.

my @matches = $text =~ /regex/g;

Here the regular expression gets evaluated in list context, which makes it return all the matches. The matches get put in the @matches variable.

For example, the following regex extracts all numbers from a string:

my $text = "10 hello 25 moo 31 foo";
my @nums = $text =~ /\d+/g;

@nums now contains (10, 25, 31).
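When the regex contains capture groups, a /g match in list context returns the captured groups of every match instead of the whole matches. The sketch below shows both forms; the sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

# No capture groups: each whole match is returned.
my $text = '10 hello 25 moo 31 foo';
my @nums = $text =~ /\d+/g;
print "numbers: @nums\n";

# Two capture groups: the flat list of captures can be assigned to a hash.
my %size = 'width=80 height=24' =~ /(\w+)=(\d+)/g;
print "$_ is $size{$_}\n" for sort keys %size;
```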

Perl one-liners explained e-book

I've now written the "Perl One-Liners Explained" e-book based on this article series. I went through all the one-liners, improved explanations, fixed mistakes and typos, added a bunch of new one-liners, added an introduction to Perl one-liners and a new chapter on Perl's special variables. Please take a look:

Perl book

Have Fun!

Thanks for reading the article! In the next part I am releasing the perl1line.txt that will contain all the one-liners in a single file.