Compile Time Regular Expressions

A constant theme in programming languages is the struggle between progression and regression. Java paired garbage collection with an inadequate type system. It wasn't until 2004 that Java had generics. Python style function signatures were a breath of fresh air after Perl5 style @_ parsing, but Python did away with variable declarations and now basic typos are runtime errors. This has nothing to with Python's type system.

A lamentable omission from Python is first class regular expression support. Python is willing to add special syntax for ternary expressions x if test else y, this comparison thing 0 < x < 10, decorators @foo(), the in keyword for membership and loops, that weird printing redirection << thing in Python2, with, and yield, and a dozen or so infix or prefix operators. All this, but nothing for the lowly regular expression. Perl, of course, has first class regular expressions. Look at the code below.

use feature 'say';

$text = 'meet me at room # 345';
if ($text =~ /\d{3}/) {
    say "Found"
} else {
    say "Not Found"
}

Yes, it's typical code. A string is searched and then a message is printed based on the search. The key thing to notice is the /\d{3}/, that's the regular expression. And look, it's not a string. The consequence of this is that Perl can compile regular expressions before the script starts running. Compare this with the Python code below.

import re

pattern = re.compile('(\\d{3})')
text = 'meet me at room # 345'

if pattern.search(text):
    print('Found')
else:
    print('Not Found')

The semantics are identical to the Perl example. A string is searched and a result is based on the result, but the regular expression is not created at compile time. Instead, a string representing the regular expression is compiled into an object that is later used. Python's regular expression library mitigates this issue by supporting strings. The code can be rewritten to inline the regular expression, but at the cost of speed. The regular expression is recompiled on each iteration.

import re

text = 'meet me at room # 345'

# We recompile every time we search the string.
for _ in range(1000):
    if re.search('(\\d{3})', text):
        print('Found')
    else:
        print('Not Found')

# We only compile once before the loop runs.
pattern = re.compile('(\\d{3})')
for _ in range(1000):
    if re.search(text):
        print('Found')
    else:
        print('Not Found')

Go is similar.

package main

import "regexp"
import "fmt"

func main() {
	pattern:= regexp.MustCompile("\\d{3}")
	text := "meet me at room # 345"
	if pattern.MatchString(text) {
		fmt.Println("Found");
	} else {
		fmt.Println("Not found.");
	}
}

Lisp

In contrast, Lisp supports compile time regular expressions, like Perl, but they don't require compiler modifications. Compile time regualar expressions can be added.

(let ((text "meet me at room # 345"))
  (print
    (if (#~m/\d{3}/ text)
      "Found"  
      "Not Found")))

In Let Over Lambda, Doug Hoyte adds support for compile time regular expressions within 30 lines of Lisp code. He didn't write the regular expression library, but he did add compile time support for it. 0

+cl-ppcre
(defun |#~-reader| (stream sub-char numarg)
  (declare (ignore sub-char numarg))
  (let ((mode-char (read-char stream)))
    (cond
      ((char= mode-char #\m)
	 (match-mode-ppcre-lambda-form
	   (segment-reader stream
			   (read-char stream)
			   1)))
      ((char= mode-char #\s)
	 (subst-mode-ppcre-lambda-form
	   (segment-reader stream
			   (read-char stream)
			   2)))
      (t (error "Unknown #~~ mode character")))))

#+cl-ppcre
(set-dispatch-macro-character #\# #\~ #'|#~-reader|)

And it's very fast. Faster than Perl!

First, CL-PPCRE is fast. Really fast. When compiled with a good native code compiler, benchmarks suggest that for most regular expressions CL-PPCRE is roughly twice as fast as Perl, often much faster. And Perl has one of the fastest non-lisp regular expression engines around: a highly optimised engine written in C. How is this possible? Surely Perl's low-level implementation should have a performance edge over anything written in a high-level language like lisp.

How can this be?

Because programs compiled with a C compiler don't have access to the C compiler, Perl is unable to compile regular expressions all the way down to machine code.

What's interesting is the potential applications in other domains. Regular languages, finite automata, are just a static data structure. Other static data structures could benefit from a similar system. A more interesting solution would be a way to mark functions or expression patterns so that they support compile time evaluation.