REGEX

This code is released into the public domain; you may do anything you wish with it.

The regular-expression engine here is inspired by the one in the book “Software Tools in Pascal” by B.W. Kernighan and P.J. Plauger, Addison-Wesley, 1981. (ISBN 0-201-10342-7). It was originally written in PDP-11 assembly and Pascal (in 1984) to run on a Terak 8510a running UCSD Pascal; the Pascal files were then converted to a homebrew language called P2 for use on a Terak 8510a running a much-modified version of UCSD Pascal. In 1985, it was rewritten in 68000 assembly language and C for use on the Atari ST; modifications, tweaks, and enhancements have been applied over the years since then. The current version will probably run on any 680x0 machine that has a C compiler and a motorola-format 680x0 assembler on it.

Installing the regex library on your system is fairly simple; move libregex.a into your library directory and move regex.h into your include-file directory.

Synopsis

char *compile(char *pattern) - compile regular expressions
char *execute(char *start, char *end) - match strings with compiled regular expressions
extern char *RElastp; - pointer at last char matched
extern char *REargs[]; - arguments array

Description

This is a simple regular-expression compiler/interpreter inspired by the one in Kerningham & Plauger’s Software Tools in Pascal. To use it, you feed compile() a regular expression, then execute() each text object you want to do searches on.

Compile() understands regular expressions containing the following tokens:

$
matches end of line (end of text or \n)
^
matches start of line (start of text or previous char is \n)
.
any character
[xxx]
characters matching or not matching a set of characters. [abc], for example, matches a single a, b, or c. [a-z] matches any lower- case alphabetic (to match a literal -, it must be the first or last character in the set. If the first character in the set is a ^, it means match anything except characters found in the set; the only way you can match a ] character is to have it be the first character in the set (to match ] and -, you’d need to do []-])
*
match the previous token 0 or more times. Doesn’t work on ^, $, *, or +, though.
+
match the previous token 1 or more times.
\<
matches the beginning of a word.
\>
matches the end of a word.
\(
indicates the beginning of a text argument, Arguments may not be nested, and there are only 10 arguments allowed.
\)
indicates the end of a text argument.
\digit
matches a previously matched argument (see \( and \) )
\
removes special meaning from all other tokens (for example, \* matches a *.)
{
introduces an arbitrary closure. An arbitrary closure is defined like {minumum,maximum} and matches the previous token at least minumum times but not more than maximum times. (For example, to match the previous token zero or more times, use {0,0}. Note that this is the * token – for +, it’s {1,0} - using 0 for the maximum means it can match an infinite number of times.)

If compile() successfully compiles the string you give it, it will return a null pointer; otherwise it will return a string describing the reason that it couldn’t compile the expression. The compiled pattern can be found in the string _REpattern, which is a pascal-style string (1 word length, then arbitrary data out to that length.)

Execute() tries to match the last pattern fed to compile() against whatever text string you feed it. It doesn’t care what the format of that string is, just that you give it a pointer to the start and end of the string. If it finds a match to the pattern in the string, it returns the start of the pattern; otherwise it returns a null pointer.

On a successful match, execute() fills in a couple of globals -= it fills in RElastp with a pointer at the last character matched and REargs with pointers to the start and end of each argument contained in the pattern.

Example

To match “hello” on stdin:

char *msg;
char line[200];
char *args[20];     /* 10 arguments, 2 pointers each */
char *p;

msg = compile("hello");
if (msg != (char *)0) {
    fprintf(stderr, "RE: %s\n", msg);
    exit(0);
}

while (gets(line))
    if (p=execute(line, line+strlen(line)-1))
        puts(line);

Bugs

  1. The code-buffer for the compiled strings is restricted to 1k compiled strings. If you do many [] strings, these will tend to eat a lot of this code-buffer and may cause buffer overflow errors.
  2. Closure will only match single-character items.
  3. The repeat-counts for closure can’t be any greater than 255, unless you’re doing infinite repeats.

Author

david parsons (Orc), 25-Mar-1991

Source Code

execute.s
The regular expression scanner (68k assembly language.)
regex.h
library headerfile.
re_token.h
headerfile for compile.c
compile.c
the regular-expression compiler.
gre.c
a simple grep for seeing how it works.
makefile
a makefile for it.