Yo dawg, I heard you liked regular expressions, so I put a regex in your regex so you can match while you match!

The regular expressions we use in our daily lives are actually not that "regular." Most of the languages support some kind of extended regular expressions that are computationally more powerful than the "regular" regular expressions as defined by the formal language theory.

For instance, the so often used capture buffers add auxiliary storage to the regular expressions that allow them to match an arbitrary pattern repeatedly. Or look-ahead assertions that allow the regular expression engine to peek ahead before it making a decision. These extensions make regular expressions powerful enough to describe some context-free grammars.

The Perl programming language has an especially rich with regex engine. One of the engine's features is the lazy regular subexpressions. The lazy regular subexpressions are expressed as (??{ code }), where the "code" is arbitrary Perl code that gets executed when the moment this subexpression may match.

This allows us to construct something really interesting - we can define a regular expression that has itself in the "code" part. The result is a recursive regular expression!

One of the classical problems that a regular expression can't match is the language 0n1n, i.e., a string with a number of zeroes followed by an equal number of ones. Surprisingly, using the lazy regular subexpressions this problem becomes tractable!

Here is a Perl regular expression that matches 0n1n:

$regex = qr/0(??{$regex})?1/;

This regular expression matches a 0 followed by itself zero or one time, followed by a one. If the itself part doesn't match, then the string this regular expression matches is 01. If the itself part matches, the string this regular expression matches is 00($regex)?11, which is 0011 if $regex doesn't match or it's 000($regex)?111 if it matches, ..., etc.

Here is a Perl program that matches 050000150000:

#!/usr/bin/perl

$str = "0"x50000 . "1"x50000;
$regex = qr/0(??{$regex})*1/;

if ($str =~ /^$regex$/) {
  print "yes, it matches"
}
else {
  print "no, it doesn't match"
}

Now let's look at the Yo Dawg regular expression in the picture above. Can you guess what it does? It matches a fully parenthesized expression such as (foo(bar())baz) or balanced parentheses ((()()())()).

$regex = qr/
  \(                 # (1) match an open paren (
    (                # followed by
      [^()]+         #   (3) one or more non-paren character
    |                # OR
      (??{$regex})   #   (5) the regex itself
    )*               # (6) repeated zero or more times
  \)                 # (7) followed by a close paren )
/x;

Here is how to think about this regular expression. For an expression to be fully parenthesized, it has to start with an open paren, so we match it (point (1) in the regex). It also has to end with close paren, so we match a close paren at the end (point (7)). Now we have to think what can be in-between the parens? Well, we can either have some text that is neither an open paren or closed paren (point (3)) OR we can have another fully parenthesized expression! (point (5)). And all this may be repeated either zero times (point (6)) to match the smallest fully parenthesized expression () or more times to match a more complex expression.

Without the /x flag (that allows multiline regexes), it can be written more compactly:

$regex = qr/\(([^()]+|(??{$regex}))*\)/;

But please don't use these regular expressions in production as they are too cryptic. Use Text::Balanced or Regexp::Common Perl modules.

And finally, in Perl 5.10 you can use recursive capture buffers instead of lazy code subexpressions to achieve the same result.

Here is a regular expression that matches 0n1n and uses the recursive capture buffer syntax (?N):

my $rx = qr/(0(?1)*1)/

The (?1)* says "match the first group zero or more times," where the first group is the whole regular expression.

You can try to rewrite the regular expression that matches balanced parens as an exercise.

Have fun!

Comments

annhltr Permalink
December 14, 2009, 13:12

whoa. nice !

Samuel Permalink
December 14, 2009, 13:44

Hummm, doesn't work here...


C:\Temp>perl
#!/usr/bin/perl

my $str = $str = "0"x50000 . "1"x50000;
my $regex = qr/0(??{$regex})*1/;

if ($str =~ /^$regex$/) {
  print "yes, it matches"
}
else {
  print "no, it doesn't match"
}
^Z
no, it doesn't match
C:\Temp>perl -v

This is perl, v5.10.1 built for MSWin32-x86-multi-thread
(with 2 registered patches, see perl -V for more detail)

December 14, 2009, 13:45

annhltr, I'd appreciate longer comments. :) Something with more insight.

December 14, 2009, 13:49

Samuel, interesting. I just changed "my $regex" to "$regex" and it works.

$ perl
#!/usr/bin/perl

$str = "0"x50 . "1"x50;
$regex = qr/0(??{$regex})*1/;

if ($str =~ /^$regex$/) {
  print "yes, it matches\n"
}
else {
  print "no, it doesn't match\n"
}

I am not exactly sure why the lexically scoped variable changed whether it matches or not. I am investigating this now.

Update: Can't do that because lexical variables aren't visible during their own initialization! That is, you can't use a lexical in the same statement that declares it. I didn't think about it when I wrote those regexes.

I fixed the article everywhere and it should now be correct. :)

annhltr Permalink
December 14, 2009, 15:55

I did NOT try it ASAP. Hence did not comment more.
Guess I'll not..

dude Permalink
December 14, 2009, 21:01

Holy cow, I thought I was pretty hot stuff with Perl. I guess with Perl you really can learn something new every day, thank you!

I'm still boggling that this works:

perl -e 'if ($ARGV[0] =~ qr/^(\(([^()]+|(?1))*\))$/) { print "Match!\n"; }'

What's the big-O performance compared to a typical recursive-descent parser?

Martin Boer Permalink
December 14, 2009, 21:20

So now we can create a freakin fast oneline XML parser which should never be used on a production server. :)

December 14, 2009, 22:07

dude, you don't need a recursive descent parser to solve paren problem. All you need is two counters - one for open parens, and one for closed. Let's call those A and B. A correct paren expression starts with an open paren, so A gets set to 1 at the start. A correct paren expression also has one closing paren at the end, therefore B should equal A at the end. Now what happens in between? In between we can only have valid paren expression again, therefore at the end we should always check if A-B==0. Also if at any moment B>A, there was an unexpected closed paren and the expression is invalid.

Therefore bigO for paren problem with this algorithm is O(n).

The bigO for the yo-dawg regex is at least O(n^2) from the T(n) = T(n-2) + O(n) recursion recurrence (at each recursion we eliminate two parenthesis). "at least" because it also depends on the regex engine.

December 14, 2009, 22:29

Martin Boer :D

December 15, 2009, 03:05

This approach is really interesting. It is worth pointing out that you can achieve the same match non-recursively with two regexes. With my versions of perl (5.10 on Windows and 5.8 Linux) this approach turns out to be much faster even though it appears to have more steps.

                  Rate     recursive non_recursive
recursive      41580/s            --          -93%
non_recursive 606061/s         1358%            --
use Benchmark qw(cmpthese);

use strict;
use warnings;

use vars qw($recursive_regex);

my $str = "0"x50 . "1"x50;
$recursive_regex = qr/0(??{$recursive_regex})*1/;

sub recursive {
  if ($str =~ /^$recursive_regex$/) {
    # do nothing - printing to stdout would
    # be the most expensive operation here
  }
}

sub non_recursive {
  $str =~ /^(0*)/;
  my $num_zeroes = length $1;
  if ($str =~ /^0{$num_zeroes}1{$num_zeroes}$/) {
    # do nothing
  }
}

cmpthese(400000,
     {
     recursive => \&recursive,
     non_recursive => \&non_recursive,
     });
December 15, 2009, 03:07

Addendum to my prior comment - not all solutions which can be solved recursively can be reduced so simply. This recursive regex technique could be useful for such cases, although you may be better off in that case with a parser generator.

bluestorm Permalink
December 15, 2009, 07:57

These extensions make regular expressions powerful enough to describe some context-free grammars.

Discutable wording, as (purely) regular languages already are a subset of context-free languages.

Jaskirat Permalink
December 15, 2009, 09:27

Interesting post! Now this opens us to regex injection attacks. A post on that would be great peter! :D

December 15, 2009, 09:49

bluestorm, I don't know how to say it differently. What I mean is that they can decide if some language is purely context-free (not just regular).

December 15, 2009, 09:54

Jaskirat, I'd need some case studies from real world applications then! But yes, it's possible:

$rx = qr/(??{system("ls")})/;
$str = "foo";
$str =~ /$rx/;

Lists files in the current directory.

December 17, 2009, 04:09

Yo dawg, I was going to read this post. The hip hop reference stopped me dead on my track. You come off as a straight douchebag with that silly picture and your shitty hiphop allegory- first of all it doesnt make sense...
2ndly take that picture off before xzibit bitch smacks u in estonia...

no wonder google doesnt want you

Gregg Permalink
July 01, 2013, 01:46

Jealous much? You didn't read the post by your own admission, but then feel entitled to criticize him and make a comment on an article you didn't read? You are the only douche here, bushleague.

December 30, 2009, 09:51

This is very cool!

Here's mine:

perl -le '$_ = "(foo(bar())baz)"; print "match" if /^( \( (?: [^()]* (?1)* [^()]* ) \) )\z/x;'

May 06, 2010, 11:02

This article is very useful for anyone working with Regular Expression. These recursive reg-ex techniques could be useful for most of the cases to solve the problems so simply

bob Permalink
July 14, 2011, 03:46

I think you have a mistake in your last example with the "my $rx = qr/(0(?1)*1)/". It matches the string "001011", which is not of the form 0^n 1^n.

Oleg Permalink
September 22, 2011, 16:42

this can be solved with sed and without recursion:

$ perl -e 'print "0"x10 . "1"x10 . "\n"' > x
$ cat x
00000000001111111111
$ cat x.sed
# match 0n1n
/^00*11*$/{
h
:again
/^00*11*$/{
s/01//
t again
}
/^$/{
x
p
}
}

$ sed -n -f x.sed x
00000000001111111111

Belinda Permalink
December 08, 2014, 08:44

Nice site, nice and easy on the eyes and great content too. Do you need many drafts to make a post?
Christmas wallpaper 2014 thank you

Leave a new comment

(why do I need your e-mail?)

(Your twitter name, if you have one. (I'm @pkrumins, btw.))

Type the word "server_180": (just to make sure you're a human)

Please preview the comment before submitting to make sure it's OK.

Advertisements