Debugging C with Haskell's Divisible
A good type system covers a lot of the small bugs. Unit tests, careful design, and a sound mind can get you through some bigger ones. But sometimes complex bugs in large codebases call for heavy-duty debugging tools.
One such tool is Delta Debugging, which repeatedly shrinks recursive data structures to find a minimal counterexample that still exhibits the bug. You’ll appreciate the technique if you’ve ever used git bisect
to locate a small breaking change in a large codebase.
This article:
- Introduces the Delta Debugging technique
- Implements a general-purpose delta-debugging tool
- Uses Haskell’s FFI to control a C chess engine
- Locates an error introduced into the chess engine
The Setup
In my last article, I implemented a chess engine in C. I’ve introduced an error into its move generator. Recall that the move generation involved 5 functions:
The “gold standard” for testing a chess engine is to compare the output of the above perft
function with published reference values. The bug that we’ll be finding is easily visible from the starting state of the board, so new_game
will be the only gamestate
value I’ll expose from the engine.
In the FFI section, I’ll define Haskell wrappers for everything:
To detect the bug, we’ll need a reference version of perft
to compare ours against. I’ll use Roce Chess, so our Haskell will literally invoke a roce38
process, set the board state, and ask it to calculate a perft
value.
I’ve removed the IO
constraint using unsafePerformIO
to simplify the article’s progression, but the final solution will be general enough to work over IO
.
Reducible
Here’s a variation on QuickCheck’s Arbitrary
typeclass:
If you have an HTML file that crashes your web browser, minimize
would remove tags or characters until it is a minimal crash-causing file. This variant would remove only a single tag or character at a time, while Arbitrary
removes entire batches for performance.
Recall that perft
takes a Gamestate
and a Depth
. The “smallest possible” means that the Depth
is the smallest.
And here’s how to apply this to find a specific position and move that differ between the two engines:
The FEN string can be pasted directly into a program like XBoard
to visualize the position, and “e1e2” is a move that Rocechess emits that my engine doesn’t.
At this point, you would have to dig into the C to see why the King at e1 is not being allowed to move northward to e2. I’ll spoil that by revealing the change that I made:
My experience across many different software projects is that most of the work in debugging is in creating a small isolated test case that demonstrates the problem, and not in the actual fix.
Divisible
Divisible
is the main reason I wrote this article, and allows us to generalize minimize
above. Since the context is debugging a chess engine, I’ll use the language of deductive logic to describe how we’ll use it.
The idea is that to prove that the value of perft g 6
is correct, you need two components:
- A way to split
perft g 6
into subproblemsperft g' 5
. - A way to prove base cases correct:
perft g 0 == 1
Since we’re debugging, such a proof is impossible because we have a failing testcase in-hand. However, we get to define what “proof” means in our context, so we’ll use “Correct, or produces a concrete minimal counterexample”:
There are actually 3 different typeclasses, corresponding to 3 different proof techniques:
Contravariant
generalizes Modus Tollens.Divisible
generalizes Conjunction Introduction.Decidable
generalizes Disjunction Elimination.
Contravariant
In most programs, you start with values obtained from sensory data like files or system calls, and apply functions to convert those into more abstract or useful information. This is reversed when debugging: We start with a goal in mind(a failing testcase), and work “backwards” to find the first place with a problem. Thus, Contravariant
, Divisible
, and Decidable
have type signatures that seem “impossible” in forward-thinking programs:
In a proof context, contramap
is a way of reducing a goal of type a
to a goal of type b
: If f b
represents a sensor that detects erroneous b
values, then we can convert it into a sensor that detects erroneous a
values by converting the a
s to b
s and then scanning the b
s.
A goal will usually have a function type like type Goal a = a -> Bool
. The mystery behind Contravariant
’s impossible signature goes away when you realize we don’t have a value of type a
or b
in the first place: We only have a function, so we’re waiting on code upstream to deliver one to us.
Here’s a concrete implementation
Divisible
contramap
lets us solve a goal by relating it to an easier goal, and then solving the easier one. We might also want to split the goal into multiple easier subgoals:
divide
is reasonable enough, but in a proof context conquer
seems to let us prove anything at all. divide
can be generalized to arbitrary tuples, and conquer
is the empty conjuction:
In general, divide
takes a collection of subproblems and solutions for those subproblems, and combines them into a solution for the larger problem. Here’s the tuple example generalized to arbitrarily-large lists:
The divideList
above isn’t fully-general because it requires every subproblem to be of the same type.
Then, conquer
is an assertion that no further subproblems need to be checked. Recall that our Cx
type means “Either all children are correct, or we’ve found suspicious evidence”. conquer
will correspond to this first case.
Here’s how to implement divide
for our counterexample detector:
Since both subproblems must be clean in order for the parent to be clean, we can can return early if we find an error in the left branch.
Decidable
contramap
corresponds to the logical contrapositive, while divide
corresponds to conjuction. The other major tool is disjunction, which is provided by choose
.
lose
is a placeholder when you know that a branch in choose
is impossible yet are required to provide a proof. There are only 3 possible implementations:
- An exception, using
undefined
orabsurd
. const conquer
- A constant distinct from
conquer
I don’t see any examples of the 3rd type in the contravariant
package, but I imagine that it could be used to write backwards backtracking algorithms similarly to how guard :: Bool -> [()]
is used to prune branches using MonadPlus
.
choose
in our context says that if you are trying to locate errors in a value of type a
, can split it into one of two subproblems, and can locate errors in both subproblems, then you can locate errors by figuring out which subproblem is appropriate for a
and reusing its proof. Here’s the implementation:
The main difference between Decidable
and Divisible
is that Decidable
branches to precisely one of its subproblems, while Divisible
combines subproblems together.
I’ll implement decideList
to give a feel for how Decidable
works in practice before moving on to reimplement minimize
.
The idea here is that (Integer, b)
consists of a branch selector and a value b
. For Either
, Left
would be branch 0, while Right
would be branch 1. caseConstructor
either chooses the current branch, or threads the a
value along to the next one.
As with divideList
, decideList
isn’t fully-general because it requires every subproblem to be of the same type.
minimize
We’re ready to generalize our original minimize
function. Here it is again:
We generalize it in 3 steps:
-
First, our
Cx
type contains ana -> Maybe a
, which can represent both thea -> Bool
anda -> a
portions ofminimize
. Sominimize
becomesReducible a => Cx a a -> Cx a a
. -
Second, I’ve generalized
Cx
to aDecidable f
constraint, causing its signature to be(Decidable f, Reducible a) => f a -> f a
. -
Third,
minimize
callsfind :: (a -> Bool) -> [a] -> Maybe a
. So I’ve generalized it tofindD :: Decidable f => f a -> f [a]
.
In our specific context, here’s how you might read these in English:
minimizeD
: A minimal error is either in one of thereductions
from the current node, or the current node itself(id
).findD
: A minimal error is either nonexistent because the list was empty(conquer
), the current element(p
), or in the remainder of the list(findD p
).
And here’s a function with a simpler type that can be easily applied:
Performance
One of the reasons I wrote an FFI to a C chess engine was to demonstrate that the terms and typeclasses I’m pushing around aren’t just abstract nonsense but have actual computational meaning. Nobody criticizes C for being too abstract, so I felt that it would give a good counterbalance to the control code in minimizeD
.
Recall that pred
instantiates a separate system process every time it’s called. It seems reasonable that a good solution should create as few subprocesses as possible. Even ignoring my use of unsafePerformIO
, a good solution should avoid recursing too deeply or invoking expensive functions unnecessarily.
I’ll add a global counter that we increment every time we create a Rocechess process. I’ll use the counter to detect any hidden performance issues.
timeTicks
forces its input by printing it as output, and compares the counter before and after. Here’s the output using the original minimize
:
So a total of 14 processes are spawned calculating reference perft
values to find this minimal testcase. How does our slick one-liner fare?
The generalized algorithm generates 18x as many calls as our simple one, to find the same failing game position.
Refinable
To solve this problem, I’ll first diagnose the problem and then introduce a new typeclass Refinable
to solve it.
Let’s compare the two side-by-side:
Notice that minimize
calls f
on each element emitted by reductions
, while minimizeD
calls minimizeD pred
. With minimize
, there is no recursion if the node evaluates to Nothing
, since we know all children are correct.
We could stop the recursion by only passing pred
to findD
:
But now the counterexample is only 1 step smaller, and not a minimal counterexample.
What we need is to refine the failure case. First we run pred
to scan for an element in a parent node(a coarse failure), then we run minimizeD pred
to reduce it once we’ve confirmed that one is present(a fine failure).
Here’s how to implement that:
Just as divide
lets us abort a proof as soon as we find a counterexample, implies
lets us abort a proof as soon as we prove that no counterexample exists:
Both optimizations are used here: The first in findD
to abort searching when we find an element, and the second we’ll add explicitly:
Now the numbers are almost equal:
The extra node is because minimizeD
scans the root node, while minimize
immediately recurses into it. Thus, we’ve inadvertently fixed a minor inconsistency in minimize
.
FFI
The last component is the interface between the C++ engine and the Haskell control code. To start with, here are C++ functions I’ll expose:
Since GHC doesn’t support passing struct
values on the stack, I allocate temporary memory and pass in a pointer for the C++ code to write to.
Each wrapped function follows a regular pattern of allocating memory, writing arguments to it, calling the function, and reading the result. For brevity, I’ll only show a single one:
That covers my own chess engine. We also need to call out to Rocechess to use as a reference. Here’s the function that does that:
Note the use of tickCounter
to measure the number of times we invoke this expensive command.
Why not Arbitrary?
I mainly used Reducible
in this article because it is a simpler typeclass than Arbitrary
. There’s not much reason to use Reducible
over Arbitrary
for our chess example. This is because perft
test failures are monotone: A successful test implies that all transitive children are successful. A real example that isn’t monotone that I’ve come across would be a compiler that emitted a jump to a non-existent label:
In this case, we want minimize
to return just goto label2
, because that is the smallest statement that exhibits the bug. However, shrinkList
in the QuickCheck
package employs an optimization that drops an exponentially-decreasing percentage of elements from a list for performance, which means it could drop label1
and goto label2
:
This still has a jump to a non-existent label, but only because our shrinking dropped it. It’s possible to restore the monotone property in full by defining the error condition in sufficient detail; but it can also be restored by only reducing to immediate children. It may be easier to check that label1
can be excluded from the immediate children because goto label1
depends on it, than to check whether an entire set of statements can be safely removed.
Conclusion
We generalized a function for reducing a failing testcase to a minimal state. In particular, there is an instance for Compose
allowing the resulting algorithm to work over any Monad
or Applicative
. This allows one to augment the reduction process with benchmarking information, experiment with clearly-defined optimizations, and test novel reduction algorithms on simple pure values which automatically generalize to the full application.
The code for this article is available on Github.
Future articles may cover:
- Further optimizing testcase reduction tools
- Generalizing other common utilities with abstract typeclasses
- More techniques for debugging programs