Applied Code Reading: Debugging FreeBSD Regex
When the code we're trying to read is inscrutable, inserting print statements and running various test cases can be two invaluable tools. Earlier today I fixed a tricky problem in the FreeBSD regular expression library. The code, originally written by Henry Spencer in the early 1990s, is by far the most complex I've ever encountered. It implements sophisticated algorithms with minimal commenting. Also, to avoid code repetition and increase efficiency, the 1200 line long main part of the regular expression execution engine is included in the compiled C code three times after modifying various macros to adjust the code's behavior: the first time the code targets small expressions and operates with bit masks on long integers, the second time the code handles larger expressions by storing its data in arrays, and the third time the code is also adjusted to handle multibyte characters. Here is how I used test data and print statements to locate and fix the problem.
The Bug
The problem described in the original bug report is simple. In the following command
echo "ab" | sed -E "s/(()|.)(b)/[&]/"
echo "ab" | sed -E "s/(.|())(b)/[&]/"
To experiment with the code without dragging the complete implementation of sed with me, I first wrote a small test harness that takes as arguments a regular expression and a string and prints the matching part.
#include <stdio.h>
#include <regex.h>
int
main(int argc, char *argv[])
{
regex_t r;
regmatch_t m[10];
int i;
if (argc != 3) {
fprintf(stderr, "usage: %s regex string\n", argv[0]);
return (1);
}
if (regcomp(&r, argv[1], REG_EXTENDED) != 0) {
fprintf(stderr, "%s: regcomp of %s failed\n", argv[0], argv[1]);
return (1);
}
if (regexec(&r, argv[2], 10, m, REG_TRACE - REG_TRACE) != 0) {
fprintf(stderr, "%s: string %s doesn't match RE %s\n", argv[0], argv[2], argv[1]);
return (1);
}
for (i = m[0].rm_so; i < m[0].rm_eo; i++)
putchar(argv[2][i]);
return (0);
}
$ cc -g -o harness -I ../locale regcomp.c regerror.c regexec.c regfree.c harness.c
$ ./harness '(()|.)(b)' ab
b
$ ./harness '(.|())(b)' ab
ab
Debug Print Statements
I guessed that
running a debugger on code that has been included multiple times and
macro-defined to death wouldn't be very productive, so I decided to trace
the code through print statements.
I used to think that debug print statements are for people who
can't use a debugger, but over the years I've realized that
debug print statements are a valuable tool.
I searched through the code for the word print
guessing that the
original author would have used the same technique, and I was right.
I quickly found code like the following.
#ifdef REDEBUG
static void print(struct match *m, char *caption, states st, int ch, FILE *d);
#endif
#ifdef REDEBUG
#define SP(t, s, c) print(m, t, s, c, stdout)
#define AT(t, p1, p2, s1, s2) at(m, t, p1, p2, s1, s2)
#define NOTE(str) { if (m->eflags®_TRACE) printf("=%s\n", (str)); }
REDEBUG
I immediately recompiled the code
with the corresponding macro defined.
$ cc -g -DREDEBUG -o harness -I ../locale regcomp.c regerror.c regexec.c regfree.c harness.c
The Wrong Character
Sadly, the debugging output appeared cryptic to my eyes.saft b 8, 9, 10, 11, 12, 13 sboweow \37777777577 8, 9, 10, 11, 12, 13 =dissecting diss b-\0 1-13 slow b-\0 2-9 sstart b 2 saft b 8, 9 sboweow \37777777577 8, 9 slow \0-\0 9-13 sstart 9But I had on my hands two very similar test cases, one of which worked correctly. So I decided to compare the output of the two.
Correct | Wrong |
---|---|
=dissecting diss a-\0 1-13 slow a-\0 2-9 sstart a 2 |
=dissecting diss b-\0 1-13 slow b-\0 2-9 sstart b 2 |
fast
matching routine was refusing
to match the a character in the erroneous case.
The Fast Matcher
By adding some additional debugging output to the fast matching routine
I guessed that it worked by checking in a loop whether the states
reachable from a given input character are different from the states
reachable from the previous one.
If the two state sets are the same, this means that no progress is being
made and the character pointer coldp
advances to ignore
that character.
if (EQ(st, fresh))
coldp = p;
Correct | Wrong |
---|---|
start a 1, 2, 3, 5, 6, 7, 8, 9, 10, 11 Make cold point to a (same states before and after) boweow a 1, 2, 3, 5, 6, 7, 8, 9, 10, 11 aft a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 aft b 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 |
start a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 Make cold point to a (same states before and after) boweow a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 aft a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 Make cold point to b (same states, i.e. no progress) aft b 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 |
The State Transition Bug
So who adds those states?
Another routine, step
, is, according to its comment,
responsible for mapping a set of states reachable before a character
to the set reachable after it.
This routine works by following the operators of the parsed regular
expression and adding the corresponding states to a set which initially
contains only the starting state.
I added some debugging output, but initially all I got were very large
meaningless numbers.
OP=1744830464 1, OP=2013268920 1, 2Switching the output to hexadecimal showed me that I was looking at some small bit values located on the number's most significant bits.
OP=68000000 1, OP=78000000 1, 2,and a quick search for the constants used in the corresponding
case
labels found me the code that defined them.
#define OPSHIFT ((unsigned)27)
// [...]
#define OANY (5L<<OPSHIFT) /* . - */
// [...]
#define OLPAREN (13L<<OPSHIFT) /* ( fwd to ) */
#define ORPAREN (14L<<OPSHIFT) /* ) back to ( */
#define OCH_ (15L<<OPSHIFT) /* begin choice fwd to OOR2 */
#define OOR1 (16L<<OPSHIFT) /* | pt. 1 back to OOR1 or OCH_ */
#define OOR2 (17L<<OPSHIFT) /* | pt. 2 fwd to OOR2 or O_CH */
#define O_CH (18L<<OPSHIFT) /* end choice back to OOR1 */
Correct | Wrong |
---|---|
OP=68000000 O_LPAREN 1, OP=78000000 O_CH_ 1, 2, OP=28000000 OANY 1, 2, 3, 5, OP=80000000 OOR1 1, 2, 3, 5, OP=88000000 OOR2 1, 2, 3, 5, OP=68000000 OLPAREN 1, 2, 3, 5, 6, OP=70000000 ORPAREN 1, 2, 3, 5, 6, 7, OP=90000000 O_CH 1, 2, 3, 5, 6, 7, 8, OP=70000000 O_RPAREN 1, 2, 3, 5, 6, 7, 8, 9, OP=68000000 1, 2, 3, 5, 6, 7, 8, 9, 10, OP=10000000 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, OP=70000000 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, |
OP=68000000 O_LPAREN 1, OP=78000000 O_CH_ 1, 2, OP=68000000 OLPAREN 1, 2, 3, 6, OP=70000000 ORPAREN 1, 2, 3, 4, 6, OP=80000000 OOR1 1, 2, 3, 4, 5, 6, OP=88000000 OOR2 1, 2, 3, 4, 5, 6, 8, OP=28000000 OANY 1, 2, 3, 4, 5, 6, 7, 8, OP=90000000 O_CH 1, 2, 3, 4, 5, 6, 7, 8, OP=70000000 O_RPAREN 1, 2, 3, 4, 5, 6, 7, 8, 9, OP=68000000 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, OP=10000000 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, OP=70000000 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, |
O_CH_
stands for the start of a choice
subexpression,
OOR1
and
OOR2
for the points immediately before and after the operator, and
O_CH
for the end of the choice subexpression.
From the behavior of fast
I already knew I was chasing
a rogue state.
So I noticed in the above that in the erroneous case an additional
state was added after
OOR1
.
The code handling this case certainly seemed tricky:
if (ISSTATEIN(aft, here)) {
for (look = 1; OP(s = g->strip[pc+look]) != O_CH; look += OPND(s))
assert(OP(s) == OOR2);
FWD(aft, aft, look);
}
FWD
macro set the state corresponding to
O_CH
, but I found the value of look
fishy for that
purpose; the start value of 1 ringed to me an off-by-one alarm bell.
For the small-sized regular expressions FWD
is implemented
with bit twiddling operators
#define FWD(dst, src, n) ((dst) |= ((unsigned long)(src)&(here)) << (n))
OOR1
and
O_CH
cases.
The output was enlightening.
Correct | Wrong |
---|---|
OP=68000000 O_LPAREN 1, OP=78000000 O_CH_ 1, 2, OP=28000000 OANY 1, 2, 3, 5, OP=80000000 OOR1 1, 2, 3, 5, OP=88000000 OOR2 1, 2, 3, 5, OP=68000000 OLPAREN 1, 2, 3, 5, 6, OP=70000000 ORPAREN 1, 2, 3, 5, 6, 7, OP=90000000 O_CH 1, 2, 3, 5, 6, 7, 8, OCH FWD set=100 << 1 OP=70000000 O_RPAREN 1, 2, 3, 5, 6, 7, 8, 9, OP=68000000 1, 2, 3, 5, 6, 7, 8, 9, 10, OP=10000000 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, OP=70000000 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, |
OP=68000000 O_LPAREN 1, OP=78000000 O_CH_ 1, 2, OP=68000000 OLPAREN 1, 2, 3, 6, OP=70000000 ORPAREN 1, 2, 3, 4, 6, OP=80000000 OOR1 1, 2, 3, 4, 5, 6, look for O_CH OOR1 FWD set=20 << 3 OP=88000000 OOR2 1, 2, 3, 4, 5, 6, 8, OP=28000000 OANY 1, 2, 3, 4, 5, 6, 7, 8, OP=90000000 O_CH 1, 2, 3, 4, 5, 6, 7, 8, OCH FWD set=100 << 1 OP=70000000 O_RPAREN 1, 2, 3, 4, 5, 6, 7, 8, 9, OP=68000000 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, OP=10000000 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, OP=70000000 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, |
20 << 3 == 100
whereas
100 << 1 == 200
, so an additional wrong state was being introduced.
Fix and Test
In most cases, once you've found a bug the fix is easy.
I simply added a + 1
to the code
if (ISSTATEIN(aft, here)) {
for (look = 1; OP(s = g->strip[pc+look]) != O_CH; look += OPND(s))
assert(OP(s) == OOR2);
FWD(aft, aft, look + 1);
}
O_CH
.
This indeed corrected the behavior for the specific problem.
But how could I be sure that my fix did not break something else? Fortunately, this regex implementation contains a regression test suite with a few hundreds of test cases. So, in the test specification file I added two lines corresponding to the two test cases of the bug report
# PR 130504 (.|())(b) - ab ab (()|.)(b) - ab aband then removed my fix to verify that the test would indeed fail.
$ make sh mkh -i _REGEX_H_ ../regex2.h ../reg*.c >regex.tmp [...] ./re < tests 477: ERE matched `b' instead *** Error code 1Finally, I reintroduced the fix and was happy to see that all the tests passed without a problem.
$ make sh mkh -i _REGEX_H_ ../regex2.h ../reg*.c >regex.tmp [...] ./re < tests ./re -el < tests ./re -er < tests
In the end I guess I hadn't read more than 10% of the 3400 lines of the code. My essential problem-solving tools were running the code, adding debug print statements, and differential debugging.
Read and post comments