Haskell programmers often code in ivory towers with their heads in the cloud. In this multi-part article series, we’ll get our feet wet diving deep below C level.
I create Pyramid: A dialect of the Scheme programming language that targets the Ethereum Virtual Machine(EVM). Pyramid Scheme is implemented using the appropriately-named Racket. The Pyramid compiler is currently 3512 lines of code, and includes code from Structure and Interpretation of Computer Programs.
This article covers the high-level design of the Pyramid compiler: The compiler’s components and Pyramid’s runtime environment.
People interested in Scheme, compilers, or Ethereum will enjoy this series. At the end of it, I’ll use Pyramid to take preorders for a new book: Destabilizing Nation-states with Math: An Ethereum Hacker’s Handbook. Readers are encouraged to subscribe to the mailing list to receive new articles.
The Pyramid compiler turns plain text into executable EVM code. Its 5 components are:
- Parser: Converts plain text into an AST. Racket makes this easy with its read built-in.
- Compiler: The SICP Scheme compiler targets an abstract machine with 5 registers, 8 operations, a stack, and 13 built-in operations.
- Code Generator: Converts the abstract machine code into EVM assembly.
- Serializer: Converts EVM assembly into deployable EVM bytecode.
- Debugger: A Google Sheets script that simulates disassembled EVM bytecode.
An example Pyramid program is
. The Pyramid compiler outputs a long hexadecimal string, which can be deployed to a real Ethereum blockchain using the Web3 library:
The complete syntax of Pyramid is
. All other language features are provided by the Pyramid Standard Library, which are Procedure and Variable definitions installed into the program’s environment at startup.
After Pyramid is parsed, it is compiled to Abstract Machine Code.
I’ll show examples of each syntax translated to abstract machine code, before giving the full language. The compiler is documented in SICP.
Variables can be defined, modified, and accessed:
(define x 1234) compiles to
. The integer
1234 is Boxed, and a pointer to it stored in the
val register. Plain integers have no type: They could be confused for pointers or symbols. Boxing attaches a type to the value. Boxed values are always pointers.
The primitive operation
define-variable! takes three arguments: A name to define, a value to set it to, and an environment. The
env register holds the current environment.
The next two lines are similar:
val register holds one argument, or the output of an operation. In general, the last Pyramid statement executed stored its result in
if statement evaluates to either its True Branch or its False Branch. The expression
10 because the condition is
true and 10 is the true branch.
if cannot be a standard library procedure, because procedure calls evaluate all arguments. The
if statement has two branches, but only evaluates one.
if statement compiles to
if stores its result in
val, because each branch stores its result in
A Procedure Definition is a Variable Definition set to a Lambda. A Lambda creates a Compiled Procedure, which is a code pointer and an environment. Compiled Procedures are entered by jumping to the code pointer, and exited by jumping to the
Here is the simplest Procedure Definition:
Which compiles to:
After compiling the Lambda body, it is skipped over using the first
goto because procedures must be explicitly called to execute their body.
When Compiled Procedures are entered, their arguments are in the
argl register. The Lambda first moves the arguments to the environment, and then runs the body.
An explicit call(or Application) looks like
. It looks shorter, but the compiled code
A Procedure can be either a Compiled Procedure created from a Lambda, or a built-in Primitive Procedure. They are currently similar at the EVM level, but may be optimized in the future. For example, Primitive Procedures could pass their arguments on the stack rather than allocating a linked list.
Abstract Machine Language
These tables summarize the abstract machine that the compiler targets.
There are 8 instructions:
13 primitive operations:
The Abstract Machine is a good intermediate target, but doesn’t know anything about the EVM. The Code Generator has detailed knowledge of EVM memory, instructions, stack, and storage. It has two responsibilities:
- Convert abstract machine code into almost EVM assembly
- Initialize the environment
The target EVM pseudo-assembly has these primitives:
The Serializer turns this into deployable bytecode.
The Code Generator must implement each instruction, expression, and operation in the previous section. The abstract instructions generally map to a single assembly instruction:
Because abstract instructions evaluate their arguments, the final EVM assembly can be longer than zero or one instructions. The complexity of the code generator is in the expressions.
Expressions leave their result on the stack. The expressions are single instructions, except for operations and list constants:
|label||Given to Serializer|
|op||Custom EVM assembly|
Operations use the stack for input and output. Their arguments are themselves expressions, although operations do not take operations as arguments.
Expressions evaluate to plain 256-bit words: Integers are 256-bits, Symbols are up to 32 8-bit ASCII characters(which totals 256 bits), and everything else is a 256-bit pointer to a dynamically-allocated object.
Memory is allocated by increasing the
allocator register. If the value in
n, then increasing
allocator by 32 reserves the memory addresses
[n, n+32). Memory is freed by decreasing the allocator: You cannot free an address without also freeing all address allocated after it.
Pyramid does not use a garbage collector. Complex garbage collectors can have unpredictable gas usage, and there are no long-lived EVM executions. In the future, the Pyramid compiler may optimize memory by freeing dead variables or placing variables on the stack.
Pyramid values are either Tagged or Untagged(alternatively, Boxed or Unboxed). Untagged values are single 256-bit words that either point to tagged values or temporarily live on the stack. Tagged values are stored in allocated memory.
There are 7 Primitive Types of tagged values:
|0||Fixnum||1||A 256-bit integer|
|1||Symbol||1||Exactly 32 8-bit ASCII characters|
|2||Compiled Procedure||2||A code pointer, and closure environment pointer|
|3||Primitive Procedure||1||A code pointer|
|4||Pair||2||Two tagged values|
|6||Nil||0||Only the tag word|
A tagged value is a pointer to a
1 + Size group of 256-bit words. The first word is the type tag from the above table; the rest depend on the type.
Derived Types are combinations of primitive types. Documentation uses them to specify which primitive values are valid arguments:
|Any||Any tagged value|
|List X||Nil or (Pair X List)||A linked list of pairs. Used to pass function arguments.|
|Environment||List Frame||Holds all variable bindings|
|Frame||Pair (List Symbol) (List Any)||A lexical scope. The Global Frame has the standard library and top-level user-defined variables.|
|Procedure||Compiled or Primitive||Any callable value|
Untagged values have no runtime type information, but are used in these contexts:
|Code||Offset into deployed Ethereum bytecode|
|Memory||Unaligned offset into EVM memory|
Most memory is allocated, but there are some reserved memory locations:
|0x00||invalid||N/A||The 0 pointer is never used|
|0x20||env||Standard Library||Environment currently in scope|
|0x40||proc||1337||Procedure about to be called|
|0x60||continue||31337||Return address from a Procedure|
|0x80||argl||337||List of arguments|
|0xa0||val||N/A||Result of last executed statement|
|0xc0||nil||6(the Nil tag)||One shared copy of the Nil object|
|0xe0||allocator||0x100||Next memory address to allocate|
|0x100||N/A||First allocated memory address|
The initial values of
argl are useful when debugging the compiler. They shouldn’t be observable to user code.
Before the user’s code runs, reserved memory locations are set to their initial values and the Pyramid Standard Library is installed.
env writes a constant to memory.
env first gets a newly-allocated empty Environment, and then the Pyramid Standard Library is installed by defining several Primitive Procedure objects. Pyramid’s
define syntax could be used for this, except that users don’t currently have a way to create Primitive Procedure values.
This article used this example at the beginning:
This requires 3 standard library functions:
|=||Fixnum Fixnum -> Boolean||1 if two numbers are equal; 0 otherwise|
|*||Fixnum Fixnum -> Fixnum||Multiplication|
|-||Fixnum Fixnum -> Fixnum||Subtraction|
A Primitive Procedure is a tagged label. It is called by jumping to the label with two items on the stack: A pointer to an argument list and a return address. The code following the label should:
- Move arguments from the list to the stack
- Unbox integers
- Apply the EQ, MUL, or SUB assembly instructions
- Box the result(except for EQ)
- Jump to the return address
If Pyramid supported inline EVM assembly, then users could define their own Primitive Procedures as a special type of Lambda. This is better than adding primitives to the code generator, because users could remove functions they don’t need. Shorter programs are easier to audit. Expect to see this implemented in the future.
Most compiled programs directly execute on their target platform. The EVM is different in that compiled programs are actually Program Initializers. This initializer generates a new program that handles all future transactions and messages.
The Pyramid Serializer prepends a Loader to the Pyramid program that returns the Pyramid program:
programSize is the size in bytes of the compiled Pyramid bytecode, and
afterLoader is the size in bytes of the loader(usually 14).
The Serializer also calculates the byte offset of each label, and adds it to a Symbol Table. Labels may be defined after they are used, so the Serializer emits a 0 as a placeholder and adds the offset to a Relocation Table. After the bytecode is generated, the Symbol Table is matched with the Relocation Table and the bytecode edited in-place.
The Serializer also figures out the correct size for pushes without a size: Ethereum supports different sizes of
In this article, we looked at the high-level design of the Pyramid Scheme compiler. To keep it accessible, I left out the compiler’s implementation code. In my next article “Debugging a Compiler Backend with Google Sheets”, I’ll cover the code generator’s implementation, including developing custom debugging tools.
The code is available on Github.
Future Pyramid articles may cover:
- Optimizations: Lexical addressing, liveness analysis, early unboxing
- Package Manager: Users will want to share code. EVM languages can have “internal” and “external” modules
- Contract Development: Development of a contract that takes pre-orders for a book.
Future Ethereum articles may cover:
- Security: Conducting a security audit, common pitfalls, vulnerability scanners
- Contract Libraries: Organizing larger codebases, CALL vs. CALLCODE vs. DELEGATECALL
- Formal Verification: Get an absolute 100% guarantee that your contracts are free of errors
- Frontend: Build an ecommerce store using a contract as the backing datastore