% regex % Jessica L. Parsons % Wed Jan 16 14:21:30 PST 2008 #REGEX > This code is released into the public domain; you may do > anything you wish with it. The regular-expression engine here is inspired by the one in the book "Software Tools in Pascal" by B.W. Kernighan and P.J. Plauger, Addison-Wesley, 1981. (ISBN 0-201-10342-7). It was originally written in PDP-11 assembly and Pascal (in 1984) to run on a Terak 8510a running UCSD Pascal; the Pascal files were then converted to a homebrew language called P2 for use on a Terak 8510a running a much-modified version of UCSD Pascal. In 1985, it was rewritten in 68000 assembly language and C for use on the Atari ST; modifications, tweaks, and enhancements have been applied over the years since then. The current version will probably run on any 680x0 machine that has a C compiler and a motorola-format 680x0 assembler on it. Installing the regex library on your system is fairly simple; move `libregex.a` into your library directory and move `regex.h` into your include-file directory. ##Synopsis
char *compile(char *pattern) - compile regular expressions
char *execute(char *start, char *end) - match strings with compiled regular expressions
extern char *RElastp; - pointer at last char matched
extern char *REargs[]; - arguments array
##Description This is a simple regular-expression compiler/interpreter inspired by the one in Kerningham & Plauger's Software Tools in Pascal. To use it, you feed `compile()` a regular expression, then `execute()` each text object you want to do searches on. `Compile()` understands regular expressions containing the following tokens: =`$`= matches end of line (end of text or `\n`) =`^`= matches start of line (start of text or previous char is `\n`) =`.`= any character =`[`xxx`]`= characters matching or not matching a set of characters. `[abc]`, for example, matches a single a, b, or c. `[a-z]` matches any lower- case alphabetic (to match a literal -, it must be the first or last character in the set. If the first character in the set is a `^`, it means match anything _except_ characters found in the set; the only way you can match a `]` character is to have it be the first character in the set (to match `]` and `-`, you'd need to do `[]-]`) =`*`= match the previous token 0 or more times. Doesn't work on `^`, `$`, `*`, or `+`, though. =`+`= match the previous token 1 or more times. =`\<`= matches the beginning of a word. =`\>`= matches the end of a word. =`\(`= indicates the beginning of a text argument, Arguments may not be nested, and there are only 10 arguments allowed. =`\)`= indicates the end of a text argument. =`\ `_digit_= matches a previously matched argument (see `\(` and `\)` ) =`\ `= removes special meaning from all other tokens (for example, `\*` matches a `*`.) =`{`= introduces an arbitrary closure. An arbitrary closure is defined like `{`_minumum_`,`_maximum_`}` and matches the previous token at least _minumum_ times but not more than _maximum_ times. (For example, to match the previous token zero or more times, use {0,0}. Note that this is the `*` token -- for `+`, it's {1,0} - using 0 for the _maximum_ means it can match an infinite number of times.) If `compile()` successfully compiles the string you give it, it will return a null pointer; otherwise it will return a string describing the reason that it couldn't compile the expression. The compiled pattern can be found in the string `_REpattern`, which is a pascal-style string (1 word length, then arbitrary data out to that length.) `Execute()` tries to match the last pattern fed to `compile()` against whatever text string you feed it. It doesn't care what the format of that string is, just that you give it a pointer to the start and end of the string. If it finds a match to the pattern in the string, it returns the start of the pattern; otherwise it returns a null pointer. On a successful match, execute() fills in a couple of globals -= it fills in RElastp with a pointer at the last character matched and REargs with pointers to the start and end of each argument contained in the pattern. ##Example To match "hello" on stdin: char *msg; char line[200]; char *args[20]; /* 10 arguments, 2 pointers each */ char *p; msg = compile("hello"); if (msg != (char *)0) { fprintf(stderr, "RE: %s\n", msg); exit(0); } while (gets(line)) if (p=execute(line, line+strlen(line)-1)) puts(line); ##Bugs 1. The code-buffer for the compiled strings is restricted to 1k compiled strings. If you do many [] strings, these will tend to eat a lot of this code-buffer and may cause buffer overflow errors. 2. Closure will only match single-character items. 3. The repeat-counts for closure can't be any greater than 255, unless you're doing infinite repeats. ##Author jessica l. parsons (Orc), 25-Mar-1991 ##Source Code =[`execute.s`](execute.s)= The regular expression scanner (68k assembly language.) =[`regex.h`](regex.h)= library headerfile. =[`re_token.h`](re_token.h)= headerfile for compile.c =[`compile.c`](compile.c)= the regular-expression compiler. =[`gre.c`](gre.c)= a simple grep for seeing how it works. =[`makefile`](makefile)= a makefile for it.