Amazon Redshift is a SQL-based database with a frontend similar to PostgreSQL. The backend compiles queries to C++ code, which execute against a columnnar datastore. When I last benchmarked it for a former employer, well-tuned queries were about 3-5x slower than hand-rolled C++. But I also uncovered an interesting technique that lets you close that gap for CPU-bound queries.
This article will cover:
Writing a chess engine in C
Turning that C into plain x86-64 instructions
Executing those instructions on a running Amazon Redshift database
Here’s the first game I successfully played against the engine, on a real Redshift database:
If you’d like to try it on your live production database, you can simply run the files create-apply-move.sql and create-best-move.sql in the Github repository. These will create two functions pf_apply_move and pf_best_move.
Both take as input a FEN string representing the board. The initial state of the board is rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1.
pf_best_move takes a FEN string and a search depth, and produces a move with the highest expected score:
pf_apply_move will produce a FEN string representing the new board after a move:
Most moves are 4 characters, and represent the file and rank for the source and destination location. You can specify which promotion a pawn gets by adding /R, /N, /Q, /B, so that a7a8/Q is a move that moves the pawn on a7 to a8 promoting it to a Queen.
Invalid input usually results in an exception being thrown.
The Simplest Example
Let’s take the following C as an example:
You can turn this into assembly instructions with gcc:
So our shellcode is 0xb82a000000c3. Here’s a Python script that you can run on your own computer to execute it:
When I found it, Amazon had recently launched Python User-Defined Functions(UDFs). The Python translates pretty directly to this SQL:
which works exactly as expected:
When is it appropriate to use this? Practically speaking these UDFs are really slow to call(10,000 calls/second), so it was actually pretty hard to find a case to use them. But if you have a small number of CPU-intensive calls, then the overhead doesn’t matter as much and the benefit of C is larger.
Like a chess engine.
A chess engine consists of several components:
We’re going to write our own chess engine in C, because we have some very restrictive deployment requirements and C gives us the best balance of control, readability, and ease of debugging. If we used a language like Go, Haskell, or Python, we would have to deploy not only our code but also the implementation of these languages. Our C will map directly to reasonable assembly, and carries no additional implementation that we would need to debug.
There are 64 squares on a chess board, and modern Intel CPUs are 64-bit. So we’ll use bitsets to represent where each piece is on the board:
Each bitboard consists of 64 bits, which are set to 1 if the corresponding square contains that piece. current_player_bb is set to 1 if White controls the piece on that square.
castle_flags are 4 bits: Whether white or black can castle kingside or queenside.
Pawns are allowed to move 2 spaces on their first move. If player A uses this to avoid an enemy pawn’s capture, the En Passant rule allows player B to capture it regardless. The en_passant_sq is the coordinate of the skipped square from the previous move, or a special “invalid coordinate” value if no double-move occurred.
It will be convenient to develop the engine by assuming that White is always the player ready to move. This lets us hardcode the positions for castling, en passant, and pawn double-movement. We’ll flip the board after every move, and use the is_white property to track the real color of the current player.
Bitsets are easy to efficiently combine:
We also need to know how to map an index from 0-63 to a bitset’s bit. Here’s how to do that:
The decision to put the file coordinate first was intentional. Since each file is a contiguous byte, changing the endianness of a 64-bit bitset will change the order of the rows without changing their contents:
Here’s two examples of how to construct one of these gamestate values:
Finally, we’ll need some bit-twiddling helper functions:
A common operation is to efficiently list all possible moves from a given position. With one exception, we’ll say that a move is a pair of coordinates:
Once we’ve obtained a move, we can use apply_move to update our board, including removing captured pieces, updating whether the king can still castle, and swapping the board.
The “one exception” is that a pawn can promote to either a rook, knight, bishop, or queen once it reaches the 8th rank. Since we only need 6 bits to store the 64 chessboard positions, I placed the promotion choice in the to field:
Notice that the promotable pieces start at 4 and count down to 0. This simplifies the move generation code later.
You might imagine that we would keep a list or array of these move values. That would require dynamic memory allocation, which would complicate the deployment of our chess engine. It will be simpler to keep a small data structure on the stack that we can use to generate the moves on-demand:
If we had such a data structure, here’s how we could use it to count the number of moves:
The bitboards make this pretty easy to implement. An iterator’s job is to find the next available move, so it needs to:
Find the piece type
Find the specific piece on the board
Find a move made by that piece
If that piece is a pawn moving to promote, find a promotion piece
Here’s a direct translation of that:
The location of the specific piece is the first bit that’s set within the specific_piece bitboard
The gamestate has bitboards for every piece type, so if we clear bits when we’re done we can calculate piece_type from the first nonzero bitboard.
Here I’ve unrolled piece_type to make it look more like the gamestate type:
Since we already have a gamestate when we create the iterator, I actually reused the data structure to be both. It’s probably a better idea to use the 4-element type above, but here’s the real iterator type from my engine:
Most of the fields on a gamestate only need to be restricted to the current player to be directly usable:
reset_iterator_moves sets the current_piece_bb bitset to have all of the available moves for the current piece. If the piece has no available moves, then advance_iterator has the logic for finding the next available piece and its moves.
At this point, it’s mostly simple case analysis to figure out the information from our gamestate. I’ll show one last example below before moving on.
And here’s how to implement one of those:
I’ve omitted the below from this article for brevity. You can find the full implementation of these on Github.
The iterator generates “pseudo-legal” moves, with each call to advance_iterator being O(1). Pseudo-legal moves are moves that could be legal, but require expensive additional information to make a final decision.
For example, a king cannot move into check. Check detection requires knowing whether the opponent has a pseudo-legal move that attacks the king. My naive implementation applies the move and iterates over all of the opponent’s moves to see if any attack the king. Castling has a check detection on as many as 4 squares, and draws also have some restrictions.
Besides the efficiency, psuedo-legal moves can’t be merged with legal moves because the rules depend on it. Consider the scenario below:
Black has pinned White’s bishop to his king.
It is not legal for the white bishop to move to d5, because it leaves his king in check.
When testing legality, an engine might apply the move anyways to see if it results in check.
After moving to d5, it would not be legal for the black bishop to capture the white king, since it leaves his own king in check.
However, it is pseudo-legal for the black bishop to capture the king, which is what the check calculation uses.
An engine that only had the concept of legal moves would thus contain a bug allowing the white king to move into check.
Besides generating a list of moves, our engine also needs to choose the best one. Suppose we had a function true_score that gave us a perfectly accurate rating of the board, in the sense that true_score(g1) > true_score(g2) implies that g1 is more favorable to White than g2. Then we could calculate the perfect move as follows:
The minus sign on true_score is because it is scoring the opponent’s turn.
So the problem reduces to finding a good “score” for a board position. Here’s a perfect one:
The idea behind true_score is that your last move is either the last move in the game or the one before that. If you make the last move, it means you lose or draw, so you get a -1 score. Otherwise, you’ve won or drawn, and since we’re counting the player that draws as having lost, you must’ve won. Since true_score(g1) = 1 > true_score(g2) = -1 implies that White wins in g1 and loses in g2, true_score is perfectly accurate when comparing leaf nodes or 2nd-to-leaf nodes. Recursion handles the rest.
It’s not computationally feasible to calculate true_score. However, rather than recurse all the way out to checkmate, we could stop after N moves and guess whether we’ll be checkmated. The guess could take the form of a negative number, with lower numbers meaning we’re more likely to be checkmated.
Since we only ever take the maximum of two scores, the actual score values that evaluate returns don’t matter at all and only establish an ordering.
The personality of a chess engine is in the evaluation function. You can make it play more aggressively by raising the score for being able to attack enemy units, or for having pieces in the center.
Here are the 3 basic scoring functions we’ll be using:
Notice that we’ve moved true_scores checkmate score into score_availability.
Before we do anything else, we need to be certain we don’t have obvious bugs in our engine, like generating invalid moves, crashes, missing moves, or having a sign error in our search algorithm. And we can’t easily debug this code once it’s on Redshift.
The two big components are our move generation, and our search algorithm. The search algorithm is tested by unit tests with a few clear-cut initial positions that have one obviously-good move. The move generation is more interesting.
The gold standard for testing move generation is to count the number of states N moves deep from several initial positions, and compare it with known reference values. Here are the first few:
The first entry is 1 because there is exactly one possible board when you start a game of chess and don’t do anything to it.
The second entry is 20 because there are 20 possible moves from the starting position: 4 knight moves + 8 pawn single-jumps + 8 pawn double-jumps.
The third entry is 400 = 20*20 because black has the same 20 moves as white. Since there are two independent choices, they multiply.
If it works, you can have great confidence in the correctness of your move generator. But when it inevitably turns up errors, all you get is that an error exists: “Expected 197,281 moves; found 198,303”.
To debug a perft test failure, you implement divide. divide(n) calculates perft(n-1) for each possible move. If you compare the output of divide with another chess engine and find a difference on some move, then checking perft(n-1) has reduced the size of your search space. You can repeat divide until you find a move that doesn’t exist on one of the two engines.
I used Roce when debugging my chess engine. I recommend writing a unit test whenever perft uncovers an error, since you’ll inevitably want to “optimize” your move generation later and it’s much easier to work with small constructive tests.
With this, we’re essentially done with development for now. Now we’ll take the result and deploy it onto Amazon Redshift.
How do we actually get this thing running on Redshift? First we need to get it running locally using a Python script. Then we’ll deploy it to Redshift using a Python UDF.
Recall the original example used to validate our approach:
which we turned into shellcode by running gcc:
A couple things to note about this:
The target x is the first symbol, so jumping to the shellcode pointer will begin execution there.
There are no absolute addresses.
There is only a single section
It uses common instructions that our CPU supports.
Lets look at a few examples to see how these assumptions can fail:
Using Multiple Symbols
Lets use a recursive function so that lower optimization levels emit a symbol for it:
Notice that factorial has been placed first in the disassembly:
So if we jump to the beginning of our shellcode, we will end up in factorial. And because factorial is a static function, I don’t believe it is required to be callable with the standard calling convention like custom_main is, so a function like it may even be unsafe to call.
Here’s how to add an offset that we can use to jump to the correct function:
If offset is 0, it will invoke factorial and print 120.
If offset is 26, it will invoke custom_main and print 34, which is the sum of 0! + 1! + 2! + 3! + 4!.
Handling Absolute Addresses
You might think all assembly code either uses relative addresses, use hardware or OS-provided absolute addresses, or obtains absolute addresses at runtime from the OS. You might think this because the kernel cannot place programs at fixed locations in memory: If you run a self-modifying program twice, its address space would conflict and the code could not be shared.
Here’s a C program to prove that notion wrong:
On my machine, this prints 0x400430. We can see the same number in the output of readelf -l a.out:
This by itself doesn’t mean anything, because ELF headers have no meaning when we are directly jumping to bytes in memory. However, the example C code clearly takes the address of _start, and this absolute address is visible in the disassembly:
Notice the mov $0x400430,%esi line. The constant 0x400430 is “backwards” in the actual byte encoding to the left of it: 30 04 40 00. It’s backwards because Intel CPUs use a “little-endian” architecture where the bytes are arranged least-significant to most-significant, while the bits inside each byte are arranged most-significant to least-significant. People are taught in school to write numbers from most-significant to least-significant, so the individual bytes seem correct to us but their order seems backwards.
It’s important to look at the actual bytes here because the disassembler will sometimes “lie” to you by converting relative addresses to absolute addresses. The callq instruction is not an example of an absolute address, because the bytes c2 fe ff ff are actually the signed integer representation of -318. The final executable locates _printf@plt exactly 318 bytes before this call instruction, 0x400539 - 0x400400 = 313, and the remaining 5 bytes are the callq instruction itself.
So how does the kernel do this? If we ran two different programs that wanted to start at address 0x400430, they can’t both be physically placed there. However, Intel CPUs have a Memory Management Unit that translates virtual addresses into physical addresses. The kernel is one step up from our C code, but even it doesn’t truly have that level of access: If you run your computer in a virtual machine, the guest OS is behind another layer of indirection with Extended Page Tables.
When the kernel loads this ELF file and grants it a PID, it uses the MMU to map 0x400430 to a real address that we don’t have access to. The kernel-visible address space is mounted at /dev/mem, which only root can access. It’s also available as a core dump in /proc/kcore so that you can use tools like gdb to debug the kernel.
Having any absolute addresses - even virtual ones - is a huge problem for us, because we haven’t told the Linux kernel running Amazon Redshift to map our shellcode at address 0x400430. We can resolve this in two ways:
Map our shellcode at the correct virtual address
Tell the compiler not to emit absolute addresses
And here’s how to do these:
The printf adds a lot of noise, so we’ll use a cut-down example:
Here, custom_main is our intended entry point. It calls the function at absolute address 0x400000 and passes the return value along.
My python happens to have its own entry point at 0x49d9b0, which is somewhat close to the desired absolute address. If it overlapped 0x400000, we would not easily be able to use Python for our loader. In such a case, we could recompile python with a different absolute address or use a loader written in C. See mmap-test.c in the Github repository for an example of a loader written in C.
Since there’s no conflict, here’s how to use mmap to assign our shellcode to a specific location in virtual memory:
The other possibility is to simply use relative addresses for everything. I’ll go through a couple sources of absolute addresses that I encountered and show how each can be suppressed:
gcc likes to place any function named main into its own ELF section. This breaks address continuity with other sections. Here’s a simple example:
Notice that main encodes the address of y as simply 0. This will generate a segmentation fault at runtime, so how can it be correct?
In addition to the .text sections above, there are several other types of sections. One is a relocation section, which provides a list of placeholder values that the linker needs to fill in later. Here’s how to see the relocations in an object file:
In our main function, the 4-byte address of y starts at offset 1, so there is a 32-bit relocation located at offset 1. The relocation includes the intended symbol name, which the linker can use when matching names in different files.
You could use a linker to remove this relocation, or use a different name so that gcc doesn’t place it in a different section:
Now they’ve been merged into the same section, but there is still a relocation in custom_main:
Using the -fPIC` flag will generate Position-Independent code, and finally eliminates this relocation:
Notice that I used the static modifier on y. This allows gcc to conclude that y will not be replaced at link-time with another function with the same name, which allows -fPIC to generate an instruction-relative address.
A developer practicing Test-Driven Development has provided the following definition of the nth prime number:
Unfortunately, gcc has placed the jump table within an .rodata section and left a relocation in .text. The relocation causes a segmentation fault at runtime:
Not only is there a relocation to .rodata to get the jump table, but the jump table itself has relocations back to .text. It’s possible to use a different jumptable in prime, or to use the jumptable in a different function; so the linker has to resolve two layers of indirection.
If we’re willing to take a performance hit, gcc provides the -fno-jump-tables option to avoid this issue:
If we want the jump table but don’t want the relocations, we can always run a linker to resolve them for us. In a future article, maybe I’ll show how one can write a custom linker to resolve simple relocations.
Unsupported CPU Instructions
Our chess engine uses a lot of bitwise operators to manipulate bitsets. Unfortunately, gcc emits a call to a library function even at the highest optimization levels:
Newer CPUs have instructions for counting the number of bits in a value, and more. We can tell gcc to use them using the -march parameter:
The Amazon Redshift node type that I’m testing on uses Intel Xeon E5-2670v2 CPUs, so you would set -march=ivybridge. By default, my gcc emits code for the x86-64 target, whose specification was originally released in 2000. So make sure you’re getting the full benefit of your newer CPU instructions.
Using a Loader for Staging
Once the code has been built and any relocations removed, we’ll want to test it locally before we load it into Redshift. Here is the actual entry point for our C code:
g_str is a FEN string representing the board state, depth is passed to the negamax function to get more accurate scoring, and m_dest is a buffer that we will write the string representation of the engine’s move to.
After the chess.c file has been compiled, we need to extract the shellcode string and offset to custom_main into Python variables. Here’s a Haskell script that will read the object file and print those:
For testing, I wrote the above output to a file named shellcode_bytes.py and included the line from shellcode_bytes import shellcode, offset in the Python loader.
The final SQL is built by pasting the above output into the corresponding loader:
Check out the Github repository for the full code.
A chess engine was implemented and deployed to an Amazon Redshift node, and a sample game was played. If you want to learn more, I recommend checking out: