A good type system covers a lot of the small bugs. Unit tests, careful design, and a sound mind can get you through some bigger ones. But sometimes complex bugs in large codebases call for heavy-duty debugging tools.
One such tool is Delta Debugging, which repeatedly shrinks recursive data structures to find a minimal counterexample that still exhibits the bug. You’ll appreciate the technique if you’ve ever used
git bisect to locate a small breaking change in a large codebase.
- Introduces the Delta Debugging technique
- Implements a general-purpose delta-debugging tool
- Uses Haskell’s FFI to control a C chess engine
- Locates an error introduced into the chess engine
In my last article, I implemented a chess engine in C. I’ve introduced an error into its move generator. Recall that the move generation involved 5 functions:
The “gold standard” for testing a chess engine is to compare the output of the above
perft function with published reference values. The bug that we’ll be finding is easily visible from the starting state of the board, so
new_game will be the only
gamestate value I’ll expose from the engine.
In the FFI section, I’ll define Haskell wrappers for everything:
To detect the bug, we’ll need a reference version of
perft to compare ours against. I’ll use Roce Chess, so our Haskell will literally invoke a
roce38 process, set the board state, and ask it to calculate a
I’ve removed the
IO constraint using
unsafePerformIO to simplify the article’s progression, but the final solution will be general enough to work over
Here’s a variation on QuickCheck’s
If you have an HTML file that crashes your web browser,
minimize would remove tags or characters until it is a minimal crash-causing file. This variant would remove only a single tag or character at a time, while
Arbitrary removes entire batches for performance.
perft takes a
Gamestate and a
Depth. The “smallest possible” means that the
Depth is the smallest.
And here’s how to apply this to find a specific position and move that differ between the two engines:
The FEN string can be pasted directly into a program like
XBoard to visualize the position, and “e1e2” is a move that Rocechess emits that my engine doesn’t.
At this point, you would have to dig into the C to see why the King at e1 is not being allowed to move northward to e2. I’ll spoil that by revealing the change that I made:
My experience across many different software projects is that most of the work in debugging is in creating a small isolated test case that demonstrates the problem, and not in the actual fix.
Divisible is the main reason I wrote this article, and allows us to generalize
minimize above. Since the context is debugging a chess engine, I’ll use the language of deductive logic to describe how we’ll use it.
The idea is that to prove that the value of
perft g 6 is correct, you need two components:
- A way to split
perft g 6into subproblems
perft g' 5.
- A way to prove base cases correct:
perft g 0 == 1
Since we’re debugging, such a proof is impossible because we have a failing testcase in-hand. However, we get to define what “proof” means in our context, so we’ll use “Correct, or produces a concrete minimal counterexample”:
There are actually 3 different typeclasses, corresponding to 3 different proof techniques:
Contravariantgeneralizes Modus Tollens.
Divisiblegeneralizes Conjunction Introduction.
Decidablegeneralizes Disjunction Elimination.
In most programs, you start with values obtained from sensory data like files or system calls, and apply functions to convert those into more abstract or useful information. This is reversed when debugging: We start with a goal in mind(a failing testcase), and work “backwards” to find the first place with a problem. Thus,
Decidable have type signatures that seem “impossible” in forward-thinking programs:
In a proof context,
contramap is a way of reducing a goal of type
a to a goal of type
f b represents a sensor that detects erroneous
b values, then we can convert it into a sensor that detects erroneous
a values by converting the
bs and then scanning the
A goal will usually have a function type like
type Goal a = a -> Bool. The mystery behind
Contravariant’s impossible signature goes away when you realize we don’t have a value of type
b in the first place: We only have a function, so we’re waiting on code upstream to deliver one to us.
Here’s a concrete implementation
contramap lets us solve a goal by relating it to an easier goal, and then solving the easier one. We might also want to split the goal into multiple easier subgoals:
divide is reasonable enough, but in a proof context
conquer seems to let us prove anything at all.
divide can be generalized to arbitrary tuples, and
conquer is the empty conjuction:
divide takes a collection of subproblems and solutions for those subproblems, and combines them into a solution for the larger problem. Here’s the tuple example generalized to arbitrarily-large lists:
divideList above isn’t fully-general because it requires every subproblem to be of the same type.
conquer is an assertion that no further subproblems need to be checked. Recall that our
Cx type means “Either all children are correct, or we’ve found suspicious evidence”.
conquer will correspond to this first case.
Here’s how to implement
divide for our counterexample detector:
Since both subproblems must be clean in order for the parent to be clean, we can can return early if we find an error in the left branch.
contramap corresponds to the logical contrapositive, while
divide corresponds to conjuction. The other major tool is disjunction, which is provided by
lose is a placeholder when you know that a branch in
choose is impossible yet are required to provide a proof. There are only 3 possible implementations:
- An exception, using
- A constant distinct from
I don’t see any examples of the 3rd type in the
contravariant package, but I imagine that it could be used to write backwards backtracking algorithms similarly to how
guard :: Bool -> [()] is used to prune branches using
choose in our context says that if you are trying to locate errors in a value of type
a, can split it into one of two subproblems, and can locate errors in both subproblems, then you can locate errors by figuring out which subproblem is appropriate for
a and reusing its proof. Here’s the implementation:
The main difference between
Divisible is that
Decidable branches to precisely one of its subproblems, while
Divisible combines subproblems together.
decideList to give a feel for how
Decidable works in practice before moving on to reimplement
The idea here is that
(Integer, b) consists of a branch selector and a value
Left would be branch 0, while
Right would be branch 1.
caseConstructor either chooses the current branch, or threads the
a value along to the next one.
decideList isn’t fully-general because it requires every subproblem to be of the same type.
We’re ready to generalize our original
minimize function. Here it is again:
We generalize it in 3 steps:
Cxtype contains an
a -> Maybe a, which can represent both the
a -> Booland
a -> aportions of
Reducible a => Cx a a -> Cx a a.
Second, I’ve generalized
Decidable fconstraint, causing its signature to be
(Decidable f, Reducible a) => f a -> f a.
find :: (a -> Bool) -> [a] -> Maybe a. So I’ve generalized it to
findD :: Decidable f => f a -> f [a].
In our specific context, here’s how you might read these in English:
minimizeD: A minimal error is either in one of the
reductionsfrom the current node, or the current node itself(
findD: A minimal error is either nonexistent because the list was empty(
conquer), the current element(
p), or in the remainder of the list(
And here’s a function with a simpler type that can be easily applied:
One of the reasons I wrote an FFI to a C chess engine was to demonstrate that the terms and typeclasses I’m pushing around aren’t just abstract nonsense but have actual computational meaning. Nobody criticizes C for being too abstract, so I felt that it would give a good counterbalance to the control code in
pred instantiates a separate system process every time it’s called. It seems reasonable that a good solution should create as few subprocesses as possible. Even ignoring my use of
unsafePerformIO, a good solution should avoid recursing too deeply or invoking expensive functions unnecessarily.
I’ll add a global counter that we increment every time we create a Rocechess process. I’ll use the counter to detect any hidden performance issues.
timeTicks forces its input by printing it as output, and compares the counter before and after. Here’s the output using the original
So a total of 14 processes are spawned calculating reference
perft values to find this minimal testcase. How does our slick one-liner fare?
The generalized algorithm generates 18x as many calls as our simple one, to find the same failing game position.
To solve this problem, I’ll first diagnose the problem and then introduce a new typeclass
Refinable to solve it.
Let’s compare the two side-by-side:
f on each element emitted by
minimizeD pred. With
minimize, there is no recursion if the node evaluates to
Nothing, since we know all children are correct.
We could stop the recursion by only passing
But now the counterexample is only 1 step smaller, and not a minimal counterexample.
What we need is to refine the failure case. First we run
pred to scan for an element in a parent node(a coarse failure), then we run
minimizeD pred to reduce it once we’ve confirmed that one is present(a fine failure).
Here’s how to implement that:
divide lets us abort a proof as soon as we find a counterexample,
implies lets us abort a proof as soon as we prove that no counterexample exists:
Both optimizations are used here: The first in
findD to abort searching when we find an element, and the second we’ll add explicitly:
Now the numbers are almost equal:
The extra node is because
minimizeD scans the root node, while
minimize immediately recurses into it. Thus, we’ve inadvertently fixed a minor inconsistency in
The last component is the interface between the C++ engine and the Haskell control code. To start with, here are C++ functions I’ll expose:
Since GHC doesn’t support passing
struct values on the stack, I allocate temporary memory and pass in a pointer for the C++ code to write to.
Each wrapped function follows a regular pattern of allocating memory, writing arguments to it, calling the function, and reading the result. For brevity, I’ll only show a single one:
That covers my own chess engine. We also need to call out to Rocechess to use as a reference. Here’s the function that does that:
Note the use of
tickCounter to measure the number of times we invoke this expensive command.
Why not Arbitrary?
I mainly used
Reducible in this article because it is a simpler typeclass than
Arbitrary. There’s not much reason to use
Arbitrary for our chess example. This is because
perft test failures are monotone: A successful test implies that all transitive children are successful. A real example that isn’t monotone that I’ve come across would be a compiler that emitted a jump to a non-existent label:
In this case, we want
minimize to return just
goto label2, because that is the smallest statement that exhibits the bug. However,
shrinkList in the
QuickCheck package employs an optimization that drops an exponentially-decreasing percentage of elements from a list for performance, which means it could drop
This still has a jump to a non-existent label, but only because our shrinking dropped it. It’s possible to restore the monotone property in full by defining the error condition in sufficient detail; but it can also be restored by only reducing to immediate children. It may be easier to check that
label1 can be excluded from the immediate children because
goto label1 depends on it, than to check whether an entire set of statements can be safely removed.
We generalized a function for reducing a failing testcase to a minimal state. In particular, there is an instance for
Compose allowing the resulting algorithm to work over any
Applicative. This allows one to augment the reduction process with benchmarking information, experiment with clearly-defined optimizations, and test novel reduction algorithms on simple pure values which automatically generalize to the full application.
The code for this article is available on Github.
Future articles may cover:
- Further optimizing testcase reduction tools
- Generalizing other common utilities with abstract typeclasses
- More techniques for debugging programs
About the Author
Michael Burge helps companies efficiently process their large datasets. This includes everything from database and schema design for new projects, problem diagnosis and repair for existing projects hitting scaling limits, and even bespoke SQL compilers and databases for exceptionally difficult cases. Hire Michael to benefit from his extensive knowledge and experience.