Fast Compilation with Lisp

While reading about finite automata and regular languages, I realized something: compilation is faster than interpretation, and programming languages are only one specific case of the interpretation-versus-compilation choice.

Put another way, compilers and interpreters are two ways of handling an abstract syntax tree. The interpreter is data structure driven: it follows the abstract syntax tree, performing actions as it goes. The compiler is also data structure driven: it takes an abstract syntax tree and converts it to an executable. The key insight is that the abstract syntax tree is just one possible data structure. Any data structure driven program can be viewed as either an interpreter or a compiler.

As an example, consider the scanner in a compiler. The job of the scanner is to split the source text stream into tokens for the parser. There are two ways to implement a scanner: one is to hand code the fastest scanner you can; the other is to identify the regular expressions describing the tokens, compile the regular expressions into a table, and then use the table to drive the scanner. It is common knowledge that hand-written scanners are faster than table-driven scanners. [Programming languages] However, this is a false dichotomy: there is a third option.

In a compiler, the scanner acts as a kind of compiler itself: it takes a stream of text and produces tokens. The parser, in turn, acts as a kind of compiler: it takes the stream of tokens and returns an abstract syntax tree. I use "compiler" to mean a transformation on static data: a transformation that can be done on input known ahead of time. It's hard to imagine a scanner or parser that works differently.

interpreter1.png

Figure 1: An interpreter takes program input and source code at the same time. A compiler first translates the source code into an optimized executable, which then maps program input to output.

A table-driven scanner already contains a compiler: the tool that transforms a regular language into a table is a kind of compiler itself. But how is the table handled when the scanner takes its stream of text and creates tokens? In a table-driven scanner, the table is interpreted: a program walks the table to process the stream of source code. The obvious question is, "If the table is constant, why not compile the table?"

interpreter2.png

Figure 2: Walking a static data structure is like interpreting a programming language, slow and easy.

This has been done. Consider Doug Hoyte's description of Edi Weitz's Common Lisp regular expression library:

[…] CL-PPCRE is fast. Really fast. When compiled with a good native code compiler, benchmarks suggest that for most regular expressions CL-PPCRE is roughly twice as fast as Perl, often much faster. And Perl has one of the fastest non-lisp regular expression engines around: a highly optimised engine written in C. How is this possible? Surely Perl's low-level implementation should have a performance edge over anything written in a high-level language like lisp.

Let Over Lambda, Chapter 4, Doug Hoyte

My guess is that Edi Weitz compiles the regular expressions into a table and then compiles the resulting table. While this sounds difficult, compiling code in Lisp is very easy: all you have to do is walk the data structure and build up a large lambda. Paul Graham's On Lisp includes a chapter in which he compiles a decision tree.

By representing the nodes as closures, we are able to transform our twenty-questions network entirely into code. As it is, the code will have to look up the node functions by name at runtime. However, if we know that the network is not going to be redefined on the fly, we can add a further enhancement: we can have node functions call their destinations directly, without having to go through a hash-table.

The end of the chapter sums up the basic idea.

"To represent with closures" is another way of saying "to compile,"

To make our system faster, we'll generate jumps directly, using Common Lisp's support for goto (tagbody and go).
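Before getting to finite state machines, here is a minimal sketch of the "walk the data structure and build up a lambda" idea. The decision-tree format and the names compile-tree and tree->function are my own illustration, not code from On Lisp or from the library described below; the point is only that backquote makes it easy to turn static data into a lambda form that compile can turn into native code.

(defun compile-tree (tree)
  ;; Walk a decision tree of the form (PREDICATE YES-SUBTREE NO-SUBTREE),
  ;; where leaves are plain symbols, and return the body of a lambda.
  (if (atom tree)
      `',tree
      (destructuring-bind (predicate yes no) tree
        `(if (,predicate input)
             ,(compile-tree yes)
             ,(compile-tree no)))))

(defun tree->function (tree)
  ;; Wrap the generated body in a lambda and compile it to native code.
  (compile nil `(lambda (input) ,(compile-tree tree))))

;; (funcall (tree->function '(minusp negative (zerop zero positive))) 5)
;; => POSITIVE

The whole decision tree is compiled away into nested ifs; nothing walks the tree at runtime.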

Example

To make my point, I created a small library for working with finite state machines. There are two ways to use a state machine with my library: it can be interpreted with the scan method, or it can be compiled directly into a two-tier jump table. To show the distinction between the two methods, I'll need an example state machine, example input, and a measurement of each case.

My finite state machine is an example taken from Sipser.

sisper1.png

Figure 3: This finite state machine tracks a bounded position state. The u can be interpreted as "Up". The d can be interpreted as "Down".

Below, I define an FSM equivalent to the state diagram. I start by defining the state-transition function with a hash table, and then I create an object containing the start state, alphabet, list of states, end states, and transition table.

(defparameter *transition-table-example-2* (make-hash-table :test #'equalp))
(let ((table *transition-table-example-2*))
  (setf (gethash #(base #\u) table) 'u)
  (setf (gethash #(base #\d) table) 'd)
  (setf (gethash #(u #\u) table) 'uu)
  (setf (gethash #(u #\d) table) 'base)
  (setf (gethash #(d #\u) table) 'base)
  (setf (gethash #(d #\d) table) 'dd)
  (setf (gethash #(uu #\u) table) 'uuu)
  (setf (gethash #(uu #\d) table) 'u)
  (setf (gethash #(uuu #\u) table) 'uuu)
  (setf (gethash #(uuu #\d) table) 'uu)
  (setf (gethash #(dd #\u) table) 'd)
  (setf (gethash #(dd #\d) table) 'ddd)
  (setf (gethash #(ddd #\u) table) 'dd)
  (setf (gethash #(ddd #\d) table) 'ddd))

(defparameter *automata-example-2*
  (make-instance 'finite-automata
		 :start-state 'base
		 :alphabet (list #\u #\d)
		 :states (list 'base 'u 'd 'uu 'dd 'uuu 'ddd)
		 :end-states (list 'base)
		 :transition-table *transition-table-example-2*))
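
The finite-automata class itself is not shown in this post. A minimal definition consistent with the initargs above and the accessors used later (states, alphabet, transition-table) might look like the following; the library's actual definition may differ.

(defclass finite-automata ()
  ;; A guess at the shape of the class: the initargs match the MAKE-INSTANCE
  ;; call above, and the readers match the accessors used by the compiler below.
  ((start-state      :initarg :start-state      :reader start-state)
   (alphabet         :initarg :alphabet         :reader alphabet)
   (states           :initarg :states           :reader states)
   (end-states       :initarg :end-states       :reader end-states)
   (transition-table :initarg :transition-table :reader transition-table)))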

Next, I need some input. Random input should be fine, so I'll generate a random sequence of 1000 valid characters.

(defparameter *inputs-example-1*
  (flet ((random-input ()
	   (aref #(#\u #\d) (random 2))))
    (coerce (loop for i below 1000
	       collect (random-input))
	    'string)))

The input is shown below.

"uuududdduddduududududddduuudududduudddduuududududuuduuuuuduudduuuuuuddduddduduuuudduudududduudduuduudduduududduuduuuuduuudddddddudduuuuduuuduuuudududdduduudddddddduuuddduduuuudduddduuuduuuudduuduudddddududududduuuduuuududdddduuuuuuduududuuuduuuuuduuddduududuududuududduddududduuuuuududududuuuduuududduudddddddduududduuuudddduuuuduuuddddududduudduduuudddddudddddddduuuuduudddddududduudduuuududuuduuuduudduuuuddddudduddduddudduduuududddduddduudduuduuuuddduuuudduduudduuduududduuuduuudddduddduuuuuudduddududddduuududddudddudddudduuduuduuddduddudddduuudduudduduuduuddduudduduududduuddudduduudddddduududuuuuuddduudddduddudddddddduduuduuduududdduddduddduudduduuuuuduuuduuuddududuuuddddddddduuddudduududdddududuudddduuudduudduuduuududuudduduuuuudududuuududdudduduuudduuuudududduddduuuuududuuudddddduuuuuuddduududduudduuuuuduuuuduududuudududuudduuuuduuuddududuuuudduuuududdddduuduuddudduuddduududdududuuduuddduuuudduuudddduddddduuuuuudduudddudduuudududuuduudddddduudduduuddudududdududuuddddduduuuudduuududuuu"

I can benchmark the code using the trivial-benchmark package.

(ql:quickload 'trivial-benchmark)
(benchmark:with-timing (10)
  'test)

This results in the following output:

-                SAMPLES  TOTAL  MINIMUM  MAXIMUM  MEDIAN  AVERAGE  DEVIATION  
REAL-TIME        10       0      0        0        0       0        0.0        
RUN-TIME         10       0      0        0        0       0        0.0        
USER-RUN-TIME    10       0      0        0        0       0        0.0        
SYSTEM-RUN-TIME  10       0      0        0        0       0        0.0        
PAGE-FAULTS      10       0      0        0        0       0        0.0        
GC-RUN-TIME      10       0      0        0        0       0        0.0        
BYTES-CONSED     10       0      0        0        0       0        0.0        
EVAL-CALLS       10       0      0        0        0       0        0.0        
NIL

With an example state machine, test input, and a benchmark tool, we can now compare interpretation to compilation.

Table Driven Interpreter

The table-driven interpreter uses the scan method to walk the table until the input is exhausted.
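
The library's scan method is not reproduced in this post. A minimal interpreter in the same spirit, written against the class sketched above and returning the final state like the compiled version does, could look like this (the actual scan may differ, for example in how it checks end states):

(defmethod scan-sketch ((automata finite-automata) input)
  ;; Walk the hash-table transition table one character at a time.
  ;; The table is keyed by #(STATE CHARACTER) vectors under EQUALP,
  ;; matching the table defined earlier.
  (loop with state = (start-state automata)
        for char = (read-char input nil nil)
        while char
        do (setf state (gethash (vector state char)
                                (transition-table automata)))
        finally (return state)))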

(benchmark:with-timing (100000)
  (with-input-from-string (input *inputs-example-1*)
    (scan *automata-example-2* input)))

The timing results are below:

-                SAMPLES  TOTAL       MINIMUM  MAXIMUM  MEDIAN  AVERAGE    DEVIATION  
REAL-TIME        100000   12.187      0        0.016    0       0.000122   0.000351   
RUN-TIME         100000   12.208      0        0.012    0       0.000122   0.000692   
USER-RUN-TIME    100000   12.036      0        0.012    0       0.00012    0.000688   
SYSTEM-RUN-TIME  100000   0.184       0        0.004    0       0.000002   0.000086   
PAGE-FAULTS      100000   0           0        0        0       0          0.0        
GC-RUN-TIME      100000   0.276       0        0.012    0       0.000003   0.00013    
BYTES-CONSED     100000   3214350368  0        1142480  32768   32143.504  5735.312   
EVAL-CALLS       100000   0           0        0        0       0          0.0        

Compiled Table

The compiler transforms the automaton's transition table into machine code. The transformation is done outside of the benchmark timing because it can be done at compile time; the compiled scanner is reused on each iteration. Unlike the table-driven interpreter, the scan method is not used: compile-automata produces a lambda form which, once evaluated, can be called with funcall.

(let ((scanner (eval (compile-automata *automata-example-2*))))
  (benchmark:with-timing (100000)
    (with-input-from-string (input *inputs-example-1*)
      (funcall scanner input))))

The disassembly of the scanner function can be inspected with disassemble:

(let ((scanner (eval (compile-automata *automata-example-2*))))
  (disassemble scanner))

The timing results are below:

-                SAMPLES  TOTAL     MINIMUM  MAXIMUM  MEDIAN  AVERAGE    DEVIATION  
REAL-TIME        100000   1.606     0        0.012    0       0.000016   0.000131   
RUN-TIME         100000   1.596     0        0.012    0       0.000016   0.000254   
USER-RUN-TIME    100000   1.424     0        0.012    0       0.000014   0.00024    
SYSTEM-RUN-TIME  100000   0.196     0        0.004    0       0.000002   0.000089   
PAGE-FAULTS      100000   19        0        3        0       0.00019    0.016431   
GC-RUN-TIME      100000   0.012     0        0.012    0       0.0        0.000038   
BYTES-CONSED     100000   12953552  0        130960   0       129.53552  2090.5718  
EVAL-CALLS       100000   0         0        0        0       0          0.0        
NIL

Conclusion

The compiled version is roughly 8 times faster (12.2 seconds versus 1.6 seconds of total real time over 100,000 runs), yet the compiler implementation is less than 20 lines of code.

(defun compile-automata (automata)
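  ;; Each state becomes a TAGBODY label followed by an ECASE that reads one
  ;; character and jumps (GO) to the label of the next state; end of input
  ;; returns the current state from the surrounding BLOCK.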
  (let ((escape (gensym "ESCAPE-"))
	(eof-value (gensym "EOF-VALUE-")))
    `(lambda (input)
       (declare (optimize (safety 0) (speed 3)))
       (block ,escape
	 (tagbody
	    ,@(loop for state in (states automata)
		 append (list state
			      `(ecase (read-char input nil ',eof-value)
				 ,@(loop for input in (alphabet automata)
				      collect `(,input (go ,(gethash (vector state input) (transition-table automata)))))
				 (',eof-value (return-from ,escape ',state))))))))))

The compiler is dense, so an example will help explain what happens. The output of compiling the example state machine is shown below.

(LAMBDA (REGULAR-LANGUAGE-COMPILER::INPUT)
  (DECLARE (OPTIMIZE (SAFETY 0) (SPEED 3)))
  (BLOCK #:ESCAPE-726
    (TAGBODY
     BASE
      (ECASE (READ-CHAR REGULAR-LANGUAGE-COMPILER::INPUT NIL '#:EOF-VALUE-727)
	(#\u (GO U))
	(#\d (GO D))
	((QUOTE #:EOF-VALUE-727) (RETURN-FROM #:ESCAPE-726 'BASE)))
     U
      (ECASE (READ-CHAR REGULAR-LANGUAGE-COMPILER::INPUT NIL '#:EOF-VALUE-727)
	(#\u (GO UU))
	(#\d (GO BASE))
	((QUOTE #:EOF-VALUE-727) (RETURN-FROM #:ESCAPE-726 'U)))
     D
      (ECASE (READ-CHAR REGULAR-LANGUAGE-COMPILER::INPUT NIL '#:EOF-VALUE-727)
	(#\u (GO BASE))
	(#\d (GO DD))
	((QUOTE #:EOF-VALUE-727) (RETURN-FROM #:ESCAPE-726 'D)))
     UU
      (ECASE (READ-CHAR REGULAR-LANGUAGE-COMPILER::INPUT NIL '#:EOF-VALUE-727)
	(#\u (GO UUU))
	(#\d (GO U))
	((QUOTE #:EOF-VALUE-727) (RETURN-FROM #:ESCAPE-726 'UU)))
     DD
      (ECASE (READ-CHAR REGULAR-LANGUAGE-COMPILER::INPUT NIL '#:EOF-VALUE-727)
	(#\u (GO D))
	(#\d (GO DDD))
	((QUOTE #:EOF-VALUE-727) (RETURN-FROM #:ESCAPE-726 'DD)))
     UUU
      (ECASE (READ-CHAR REGULAR-LANGUAGE-COMPILER::INPUT NIL '#:EOF-VALUE-727)
	(#\u (GO UUU))
	(#\d (GO UU))
	((QUOTE #:EOF-VALUE-727) (RETURN-FROM #:ESCAPE-726 'UUU)))
     DDD
      (ECASE (READ-CHAR REGULAR-LANGUAGE-COMPILER::INPUT NIL '#:EOF-VALUE-727)
	(#\u (GO DD))
	(#\d (GO DDD))
	((QUOTE #:EOF-VALUE-727) (RETURN-FROM #:ESCAPE-726 'DDD))))))

This goto-laden code is a two-tier jump table: each ecase dispatches on the input character, and each go jumps to the label of the next state.

The same state machine, written by hand, is shown below.

(with-input-from-string (input *inputs-example-1*)
  (symbol-macrolet ((value (read-char input nil 'eof-value)))
    (block escape
      (tagbody
       base (ecase value
	      (#\u (go u))
	      (#\d (go d))
	      (eof-value (return-from escape 'base)))
       u (ecase value
	   (#\u (go uu))
	   (#\d (go base))
	   (eof-value (return-from escape 'u)))
       d (ecase value
	   (#\u (go base))
	   (#\d (go dd))
	   (eof-value (return-from escape 'd)))
       uu (ecase value
	    (#\u (go uuu))
	    (#\d (go u))
	    (eof-value (return-from escape 'uu)))
       dd (ecase value
	    (#\u (go d))
	    (#\d (go ddd))
	    (eof-value (return-from escape 'dd)))
       uuu (ecase value
	     (#\u (go uuu))
	     (#\d (go uu))
	     (eof-value (return-from escape 'uuu)))
       ddd (ecase value
	     (#\u (go dd))
	     (#\d (go ddd))
	     (eof-value (return-from escape 'ddd)))))))

To write the compiler, I first wrote this state machine by hand; then I generalized it into the compiler.

The benchmark results are impressive, but the interpreter is very slow: it uses a hash table rather than a 2D array, and an interpreter that translates states and inputs into array indices might perform better. On the other hand, a better interpreter might end up more complicated than my simple compiler.
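
As a rough sketch of what such an array-driven interpreter might look like (the names and details are my own and are untested against the library), the transition table could be lowered to a 2D array once, and the interpreter could then step through it with aref instead of gethash:

(defun build-transition-array (automata)
  ;; Rows are state indices, columns are alphabet indices; each cell holds
  ;; the index of the next state.
  (let* ((states (states automata))
         (alphabet (alphabet automata))
         (array (make-array (list (length states) (length alphabet)))))
    (loop for state in states and row from 0
          do (loop for char in alphabet and col from 0
                   do (setf (aref array row col)
                            (position (gethash (vector state char)
                                               (transition-table automata))
                                      states))))
    array))

(defun scan-with-array (automata array input)
  ;; Interpret INPUT against the precomputed ARRAY and return the symbol
  ;; naming the final state, like the compiled scanner does.
  (let ((states (states automata))
        (alphabet (coerce (alphabet automata) 'vector)))
    (loop with state = (position (start-state automata) states)
          for char = (read-char input nil nil)
          while char
          do (setf state (aref array state (position char alphabet)))
          finally (return (nth state states)))))

;; (let ((array (build-transition-array *automata-example-2*)))
;;   (with-input-from-string (input *inputs-example-1*)
;;     (scan-with-array *automata-example-2* array input)))

Even this still pays for a generic aref and a position lookup per character, so it is not obvious it would close the gap with the compiled version.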