Computer Science December 14, 2009

# Recursive Regular Expressions

The regular expressions we use in our daily lives are actually not that "regular." Most of the languages support some kind of extended regular expressions that are computationally more powerful than the "regular" regular expressions as defined by the formal language theory.

For instance, the so often used capture buffers add auxiliary storage to the regular expressions that allow them to match an arbitrary pattern repeatedly. Or look-ahead assertions that allow the regular expression engine to peek ahead before it making a decision. These extensions make regular expressions powerful enough to describe some context-free grammars.

The Perl programming language has an especially rich with regex engine. One of the engine's features is the lazy regular subexpressions. The lazy regular subexpressions are expressed as `(??{ code })`, where the "`code`" is arbitrary Perl code that gets executed when the moment this subexpression may match.

This allows us to construct something really interesting - we can define a regular expression that has itself in the "`code`" part. The result is a recursive regular expression!

One of the classical problems that a regular expression can't match is the language `0n1n`, i.e., a string with a number of zeroes followed by an equal number of ones. Surprisingly, using the lazy regular subexpressions this problem becomes tractable!

Here is a Perl regular expression that matches `0n1n`:

```\$regex = qr/0(??{\$regex})?1/;
```

This regular expression matches a `0` followed by itself zero or one time, followed by a one. If the itself part doesn't match, then the string this regular expression matches is `01`. If the itself part matches, the string this regular expression matches is `00(\$regex)?11`, which is `0011` if `\$regex` doesn't match or it's `000(\$regex)?111` if it matches, ..., etc.

Here is a Perl program that matches `050000150000`:

```#!/usr/bin/perl

\$str = "0"x50000 . "1"x50000;
\$regex = qr/0(??{\$regex})*1/;

if (\$str =~ /^\$regex\$/) {
print "yes, it matches"
}
else {
print "no, it doesn't match"
}
```

Now let's look at the Yo Dawg regular expression in the picture above. Can you guess what it does? It matches a fully parenthesized expression such as `(foo(bar())baz)` or balanced parentheses `((()()())())`.

```\$regex = qr/
\(                 # (1) match an open paren (
(                # followed by
[^()]+         #   (3) one or more non-paren character
|                # OR
(??{\$regex})   #   (5) the regex itself
)*               # (6) repeated zero or more times
\)                 # (7) followed by a close paren )
/x;
```

Here is how to think about this regular expression. For an expression to be fully parenthesized, it has to start with an open paren, so we match it (point `(1)` in the regex). It also has to end with close paren, so we match a close paren at the end (point `(7)`). Now we have to think what can be in-between the parens? Well, we can either have some text that is neither an open paren or closed paren (point `(3)`) OR we can have another fully parenthesized expression! (point `(5)`). And all this may be repeated either zero times (point `(6)`) to match the smallest fully parenthesized expression `()` or more times to match a more complex expression.

Without the `/x` flag (that allows multiline regexes), it can be written more compactly:

```\$regex = qr/\(([^()]+|(??{\$regex}))*\)/;
```

But please don't use these regular expressions in production as they are too cryptic. Use Text::Balanced or Regexp::Common Perl modules.

And finally, in Perl 5.10 you can use recursive capture buffers instead of lazy code subexpressions to achieve the same result.

Here is a regular expression that matches `0n1n` and uses the recursive capture buffer syntax `(?N)`:

```my \$rx = qr/(0(?1)*1)/
```

The `(?1)*` says "match the first group zero or more times," where the first group is the whole regular expression.

You can try to rewrite the regular expression that matches balanced parens as an exercise.

Have fun!

whoa. nice !

Hummm, doesn't work here...

```
C:\Temp>perl
#!/usr/bin/perl

my \$str = \$str = "0"x50000 . "1"x50000;
my \$regex = qr/0(??{\$regex})*1/;

if (\$str =~ /^\$regex\$/) {
print "yes, it matches"
}
else {
print "no, it doesn't match"
}
^Z
no, it doesn't match
C:\Temp>perl -v

This is perl, v5.10.1 built for MSWin32-x86-multi-thread
(with 2 registered patches, see perl -V for more detail)

```

annhltr, I'd appreciate longer comments. :) Something with more insight.

Samuel, interesting. I just changed "my \$regex" to "\$regex" and it works.

```\$ perl
#!/usr/bin/perl

\$str = "0"x50 . "1"x50;
\$regex = qr/0(??{\$regex})*1/;

if (\$str =~ /^\$regex\$/) {
print "yes, it matches\n"
}
else {
print "no, it doesn't match\n"
}
```

I am not exactly sure why the lexically scoped variable changed whether it matches or not. I am investigating this now.

Update: Can't do that because lexical variables aren't visible during their own initialization! That is, you can't use a lexical in the same statement that declares it. I didn't think about it when I wrote those regexes.

I fixed the article everywhere and it should now be correct. :)

I did NOT try it ASAP. Hence did not comment more.
Guess I'll not..

Holy cow, I thought I was pretty hot stuff with Perl. I guess with Perl you really can learn something new every day, thank you!

I'm still boggling that this works:

perl -e 'if (\$ARGV[0] =~ qr/^(\(([^()]+|(?1))*\))\$/) { print "Match!\n"; }'

What's the big-O performance compared to a typical recursive-descent parser?

So now we can create a freakin fast oneline XML parser which should never be used on a production server. :)

dude, you don't need a recursive descent parser to solve paren problem. All you need is two counters - one for open parens, and one for closed. Let's call those A and B. A correct paren expression starts with an open paren, so A gets set to 1 at the start. A correct paren expression also has one closing paren at the end, therefore B should equal A at the end. Now what happens in between? In between we can only have valid paren expression again, therefore at the end we should always check if A-B==0. Also if at any moment B>A, there was an unexpected closed paren and the expression is invalid.

Therefore bigO for paren problem with this algorithm is O(n).

The bigO for the yo-dawg regex is at least O(n^2) from the T(n) = T(n-2) + O(n) recursion recurrence (at each recursion we eliminate two parenthesis). "at least" because it also depends on the regex engine.

Martin Boer :D

This approach is really interesting. It is worth pointing out that you can achieve the same match non-recursively with two regexes. With my versions of perl (5.10 on Windows and 5.8 Linux) this approach turns out to be much faster even though it appears to have more steps.

```                  Rate     recursive non_recursive
recursive      41580/s            --          -93%
non_recursive 606061/s         1358%            --
```
```use Benchmark qw(cmpthese);

use strict;
use warnings;

use vars qw(\$recursive_regex);

my \$str = "0"x50 . "1"x50;
\$recursive_regex = qr/0(??{\$recursive_regex})*1/;

sub recursive {
if (\$str =~ /^\$recursive_regex\$/) {
# do nothing - printing to stdout would
# be the most expensive operation here
}
}

sub non_recursive {
\$str =~ /^(0*)/;
my \$num_zeroes = length \$1;
if (\$str =~ /^0{\$num_zeroes}1{\$num_zeroes}\$/) {
# do nothing
}
}

cmpthese(400000,
{
recursive => \&recursive,
non_recursive => \&non_recursive,
});
```

Addendum to my prior comment - not all solutions which can be solved recursively can be reduced so simply. This recursive regex technique could be useful for such cases, although you may be better off in that case with a parser generator.

These extensions make regular expressions powerful enough to describe some context-free grammars.

Discutable wording, as (purely) regular languages already are a subset of context-free languages.

Interesting post! Now this opens us to regex injection attacks. A post on that would be great peter! :D

bluestorm, I don't know how to say it differently. What I mean is that they can decide if some language is purely context-free (not just regular).

Jaskirat, I'd need some case studies from real world applications then! But yes, it's possible:

```\$rx = qr/(??{system("ls")})/;
\$str = "foo";
\$str =~ /\$rx/;
```

Lists files in the current directory.

Yo dawg, I was going to read this post. The hip hop reference stopped me dead on my track. You come off as a straight douchebag with that silly picture and your shitty hiphop allegory- first of all it doesnt make sense...
2ndly take that picture off before xzibit bitch smacks u in estonia...

no wonder google doesnt want you

Jealous much? You didn't read the post by your own admission, but then feel entitled to criticize him and make a comment on an article you didn't read? You are the only douche here, bushleague.

This is very cool!

Here's mine:

perl -le '\$_ = "(foo(bar())baz)"; print "match" if /^( \( (?: [^()]* (?1)* [^()]* ) \) )\z/x;'

This article is very useful for anyone working with Regular Expression. These recursive reg-ex techniques could be useful for most of the cases to solve the problems so simply

I think you have a mistake in your last example with the "my \$rx = qr/(0(?1)*1)/". It matches the string "001011", which is not of the form 0^n 1^n.

this can be solved with sed and without recursion:

\$ perl -e 'print "0"x10 . "1"x10 . "\n"' > x
\$ cat x
00000000001111111111
\$ cat x.sed
# match 0n1n
/^00*11*\$/{
h
:again
/^00*11*\$/{
s/01//
t again
}
/^\$/{
x
p
}
}

\$ sed -n -f x.sed x
00000000001111111111