cl-irregsexp is an ASDF-installable library for fast text matching. It goes beyond the facilities allowed by traditional regular expressions, while also making the matchers easier to write and maintain.
Traditional regexps are everywhere and well-understood. There are already several fine regular expression toolkits for Common Lisp. They are both complete and mature. Why use cl-irregsexp?
(register-groups-bind (method url version-major version-minor) ("(\\S+)\\s+(\\S+)\\s+HTTP/(\\d+).(\\d+)$" line) ...)write
(match-bind (method (+ (space)) url (+ (space)) "HTTP/" (version-major (integer)) "." (version-minor (integer)) (last)) line ...)which might be clearer and doesn't involve mentioning the captured variables twice.
Write to the CL-IRREGSEXP-devel mailing list. Or add to the wiki.
Tarball of the the latest git commit.
Here are some snapshots.
To check out the tree locally:
git clone http://common-lisp.net/projects/cl-irregsexp/cl-irregsexp.git
The string matcher is generated with macros, and is completely inlined. It does not build a state machine (DFA) lookup table but uses case statements to implement the state machine in native code.
First the matcher description is translated to an intermediate form with the following primitives:
A few transforms are applied to the intermediate form, then it is output as Lisp, which will hopefully be compiled to efficient native machine code by the Lisp environment.
One particular case that has been optimised a little is that of searching for a constant string. The algorithm used is Boyer-Moore, but implementation has some unusual facets.
It is very irritating when people two different spellings for the same word. cl-irregsexp can sometimes search for both nearly as fast as it can search for just one.
To demonstrate its efficiency, I chose to compare how long it takes to search for "indecipherable" or "undecipherable" in a long string, among many regex implementations.
The haystack was one million random lowercase letters followed by "undecipherable". Each implementation reads the haystack into memory and then prints out how many milliseconds it took to search it for the two possibilities. Just to be safe, it also has to print the offset where the needle was found.
To make the test results less susceptible to slow starts from caches and branch prediction units, most implementations repeat the search 1000 times and print out the average number of milliseconds. The implementations all abort if the string is ever found at an incorrect position.
Implementation | Time (s), smaller is better |
---|---|
cl-ppcre.clozure | 214.70 |
- | 214.56 |
- | 214.19 |
- | 213.39 |
- | 215.26 |
cl-ppcre.sbcl | 101.44 |
- | 101.44 |
- | 101.42 |
- | 101.41 |
- | 101.42 |
ruby.rb | 41.18 |
- | 41.82 |
- | 42.08 |
- | 41.41 |
- | 42.30 |
python.py | 36.49 |
- | 36.56 |
- | 36.81 |
- | 36.63 |
- | 36.50 |
pcre | 32.12 |
- | 32.11 |
- | 32.11 |
- | 31.98 |
- | 33.01 |
perl.pl | 24.55 |
- | 24.56 |
- | 24.52 |
- | 24.60 |
- | 24.56 |
cl-irregsexp.clozure | 12.02 |
- | 12.04 |
- | 12.02 |
- | 12.11 |
- | 12.10 |
re2 | 10.09 |
- | 10.08 |
- | 10.09 |
- | 10.08 |
- | 10.10 |
cl-irregsexp.sbcl | 6.24 |
- | 6.24 |
- | 6.25 |
- | 6.24 |
- | 6.23 |
The source code for all is available in the cl-irregsexp download in the bench/ directory.
Ruby has the smallest and neatest benchmark program. It gets the award for concision.
Perl has the implementation involving the most one character global variables. It wins the second prize for speed and the first prize for being impenetrable to language dilettanti.
I also tried with clisp (a Common Lisp implementation that compiles to byte code). It took a couple of minutes with cl-irregsexp but a massive 2622.48s with cl-ppcre, so I did not include it in the results.
If you fancy sending in an example for another regexp implementation, I'd be pleased to add it!
Here are the commands to make the data in the bash shell.
$ cat /dev/urandom | tr -d -c abcdefghijklmnopqrstuvwxyz | dd bs=1 count=1000000 > test-data $ echo undecipherable >> test-data
The code generated calculates a skip distance by doing a (case (peek-one-character) ...). SBCL unfortunately translates this into a sequence of conditional jumps apparently independently of the number of different cases. At some point it is surely better to use a jump table.