I am now on Twitter! Meet me on Twitter here (my nick is pkrumins.)
Or on Google Buzz and Facebook.
The regular expressions we use in our daily lives are actually not that “regular.” Most of the languages support some kind of extended regular expressions that are computationally more powerful than the “regular” regular expressions as defined by the formal language theory.
For instance, the so often used capture buffers add auxiliary storage to the regular expressions that allow them to match an arbitrary pattern repeatedly. Or look-ahead assertions that allow the regular expression engine to peek ahead before it making a decision. These extensions make regular expressions powerful enough to describe some context-free grammars.
The Perl programming language has an especially rich with regex engine. One of the engine’s features is the lazy regular subexpressions. The lazy regular subexpressions are expressed as (??{ code }), where the “code” is arbitrary Perl code that gets executed when the moment this subexpression may match.
This allows us to construct something really interesting - we can define a regular expression that has itself in the “code” part. The result is a recursive regular expression!
One of the classical problems that a regular expression can’t match is the language 0n1n, i.e., a string with a number of zeroes followed by an equal number of ones. Surprisingly, using the lazy regular subexpressions this problem becomes tractable!
Here is a Perl regular expression that matches 0n1n:
$regex = qr/0(??{$regex})?1/;
This regular expression matches a 0 followed by itself zero or one time, followed by a one. If the itself part doesn’t match, then the string this regular expression matches is 01. If the itself part matches, the string this regular expression matches is 00($regex)?11, which is 0011 if $regex doesn’t match or it’s 000($regex)?111 if it matches, …, etc.
Here is a Perl program that matches 050000150000:
#!/usr/bin/perl
$str = "0"x50000 . "1"x50000;
$regex = qr/0(??{$regex})*1/;
if ($str =~ /^$regex$/) {
print "yes, it matches"
}
else {
print "no, it doesn't match"
}
Now let’s look at the Yo Dawg regular expression in the picture above. Can you guess what it does? It matches a fully parenthesized expression such as (foo(bar())baz) or balanced parentheses ((()()())()).
$regex = qr/
\( # (1) match an open paren (
( # followed by
[^()]+ # (3) one or more non-paren character
| # OR
(??{$regex}) # (5) the regex itself
)* # (6) repeated zero or more times
\) # (7) followed by a close paren )
/x;
Here is how to think about this regular expression. For an expression to be fully parenthesized, it has to start with an open paren, so we match it (point (1) in the regex). It also has to end with close paren, so we match a close paren at the end (point (7)). Now we have to think what can be in-between the parens? Well, we can either have some text that is neither an open paren or closed paren (point (3)) OR we can have another fully parenthesized expression! (point (5)). And all this may be repeated either zero times (point (6)) to match the smallest fully parenthesized expression () or more times to match a more complex expression.
Without the /x flag (that allows multiline regexes), it can be written more compactly:
$regex = qr/\(([^()]+|(??{$regex}))*\)/;
But please don’t use these regular expressions in production as they are too cryptic. Use Text::Balanced or Regexp::Common Perl modules.
And finally, in Perl 5.10 you can use recursive capture buffers instead of lazy code subexpressions to achieve the same result.
Here is a regular expression that matches 0n1n and uses the recursive capture buffer syntax (?N):
my $rx = qr/(0(?1)*1)/
The (?1)* says “match the first group zero or more times,” where the first group is the whole regular expression.
You can try to rewrite the regular expression that matches balanced parens as an exercise.
Have fun!
Did you like this post? Subscribe here:
If you really enjoyed the post, I'd appreciate a gift from my geeky Amazon book wishlist. Books would make me more educated and I could write even better posts. Thanks! :)

(10 votes, average: 4.1 out of 5)
|
|
|


December 14th, 2009 at 1:12 pm
whoa. nice !
December 14th, 2009 at 1:44 pm
Hummm, doesn’t work here…
C:\Temp>perl #!/usr/bin/perl my $str = $str = "0"x50000 . "1"x50000; my $regex = qr/0(??{$regex})*1/; if ($str =~ /^$regex$/) { print "yes, it matches" } else { print "no, it doesn't match" } ^Z no, it doesn't match C:\Temp>perl -v This is perl, v5.10.1 built for MSWin32-x86-multi-thread (with 2 registered patches, see perl -V for more detail)December 14th, 2009 at 1:45 pm
annhltr, I’d appreciate longer comments. :) Something with more insight.
December 14th, 2009 at 1:49 pm
Samuel, interesting. I just changed “my $regex” to “$regex” and it works.
$ perl #!/usr/bin/perl $str = "0"x50 . "1"x50; $regex = qr/0(??{$regex})*1/; if ($str =~ /^$regex$/) { print "yes, it matches\n" } else { print "no, it doesn't match\n" }I am not exactly sure why the lexically scoped variable changed whether it matches or not. I am investigating this now.
Update: Can’t do that because lexical variables aren’t visible during their own initialization! That is, you can’t use a lexical in the same statement that declares it. I didn’t think about it when I wrote those regexes.
I fixed the article everywhere and it should now be correct. :)
December 14th, 2009 at 3:55 pm
I did NOT try it ASAP. Hence did not comment more.
Guess I’ll not..
December 14th, 2009 at 9:01 pm
Holy cow, I thought I was pretty hot stuff with Perl. I guess with Perl you really can learn something new every day, thank you!
I’m still boggling that this works:
perl -e ‘if ($ARGV[0] =~ qr/^(\(([^()]+|(?1))*\))$/) { print “Match!\n”; }’
What’s the big-O performance compared to a typical recursive-descent parser?
December 14th, 2009 at 9:20 pm
So now we can create a freakin fast oneline XML parser which should never be used on a production server. :)
December 14th, 2009 at 10:07 pm
dude, you don’t need a recursive descent parser to solve paren problem. All you need is two counters - one for open parens, and one for closed. Let’s call those A and B. A correct paren expression starts with an open paren, so A gets set to 1 at the start. A correct paren expression also has one closing paren at the end, therefore B should equal A at the end. Now what happens in between? In between we can only have valid paren expression again, therefore at the end we should always check if A-B==0. Also if at any moment B>A, there was an unexpected closed paren and the expression is invalid.
Therefore bigO for paren problem with this algorithm is O(n).
The bigO for the yo-dawg regex is at least O(n^2) from the T(n) = T(n-2) + O(n) recursion recurrence (at each recursion we eliminate two parenthesis). “at least” because it also depends on the regex engine.
December 14th, 2009 at 10:29 pm
Martin Boer :D
December 15th, 2009 at 3:05 am
This approach is really interesting. It is worth pointing out that you can achieve the same match non-recursively with two regexes. With my versions of perl (5.10 on Windows and 5.8 Linux) this approach turns out to be much faster even though it appears to have more steps.
Rate recursive non_recursive recursive 41580/s -- -93% non_recursive 606061/s 1358% --use Benchmark qw(cmpthese); use strict; use warnings; use vars qw($recursive_regex); my $str = "0"x50 . "1"x50; $recursive_regex = qr/0(??{$recursive_regex})*1/; sub recursive { if ($str =~ /^$recursive_regex$/) { # do nothing - printing to stdout would # be the most expensive operation here } } sub non_recursive { $str =~ /^(0*)/; my $num_zeroes = length $1; if ($str =~ /^0{$num_zeroes}1{$num_zeroes}$/) { # do nothing } } cmpthese(400000, { recursive => \&recursive, non_recursive => \&non_recursive, });December 15th, 2009 at 3:07 am
Addendum to my prior comment - not all solutions which can be solved recursively can be reduced so simply. This recursive regex technique could be useful for such cases, although you may be better off in that case with a parser generator.
December 15th, 2009 at 7:57 am
Discutable wording, as (purely) regular languages already are a subset of context-free languages.
December 15th, 2009 at 9:27 am
Interesting post! Now this opens us to regex injection attacks. A post on that would be great peter! :D
December 15th, 2009 at 9:49 am
bluestorm, I don’t know how to say it differently. What I mean is that they can decide if some language is purely context-free (not just regular).
December 15th, 2009 at 9:54 am
Jaskirat, I’d need some case studies from real world applications then! But yes, it’s possible:
$rx = qr/(??{system("ls")})/; $str = "foo"; $str =~ /$rx/;Lists files in the current directory.
December 16th, 2009 at 1:24 am
[…] Recursive Regular Expressions – good coders code, great reuse (tags: regex recursion perl) […]
December 16th, 2009 at 10:48 am
[…] Recursive Regular Expressions […]
December 16th, 2009 at 3:00 pm
[…] 原文在此。rex译于2009年12月15~16日,翻译过程中使用的是ubuntu 9.10+firefox+prism+google […]
December 17th, 2009 at 4:09 am
Yo dawg, I was going to read this post. The hip hop reference stopped me dead on my track. You come off as a straight douchebag with that silly picture and your shitty hiphop allegory- first of all it doesnt make sense…
2ndly take that picture off before xzibit bitch smacks u in estonia…
no wonder google doesnt want you
December 18th, 2009 at 6:48 am
[…] Recursive Regular Expressions […]
December 30th, 2009 at 9:51 am
This is very cool!
Here’s mine:
perl -le ‘$_ = “(foo(bar())baz)”; print “match” if /^( \( (?: [^()]* (?1)* [^()]* ) \) )\z/x;’