DSLs for Ethereum Contracts
A good smart contract language is a $1 billion problem.
Why? Look at the amounts lost in some recent hacks:
How are people changing their development process in response?
Ethereum is leaning towards SMT solvers for general use, and theorem provers for parts of the design. Cardano aims for more upfront design, leaning heavily on type-based tools and the K Framework.
In this article, I present Language-oriented Programming as an alternative. Experts create specialized languages for modeling, tokens, or atomic cross-chain swaps. Contract developers use them to describe the relevant parts of their contract.
Experts know the rules to write secure contracts. However, those rules are not consistently followed. Contract developers are not experts, and even experts can slip up when reviewing a large amount of code.
Specialized languages consistently apply security rules, and reduce the amount of code to be reviewed.
Pyramid is a language for writing smart contracts using language-oriented programming. People can develop new programming languages that interoperate with every other language.
As a technical demonstration, I develop a C compiler and port a C chess engine to run on Ethereum’s Virtual Machine(EVM). Contract developers are not expected to use C, but it’s a simple realistic language that almost all programmers know.
Finally, I give a status update on Pyramid for people interested in using the language.
Pyramid Modeling Language
With smart contracts, “the code is law”. But who can read and trust 1000 lines of subtle code?
To start with, it’s not often clear what it means for a contract to be “correct”. People know something’s a problem after they’ve lost money, not before.
I present a language for describing security properties, where every requirement fits in a tweet. I apply this language to Ethereum’s ERC-20 token standard.
Tokens are tradeable virtual currencies. As an example, there are many gamers or Ethereum miners with video cards who could rent them to AI researchers: A token could be exchanged for GPU time.
Ethereum or Cardano have their own base currencies Ether and Ada, so you might wonder why people issue new currencies rather than exchange the base currency. There are three common reasons:
- Diffusion: People want to invest in an idea, not in individual actions. A Vanguard Index Fund allows you to invest in “the US economy” without reviewing company business plans, and a token allows you to invest in “a marketplace for GPUs” without leasing space in a data center and sourcing video cards.
- Commitment: Since tokens are only usable for a specific purpose, investors as a whole can only exit by creating real productive demand. This makes the idea more likely to succeed.
- Restriction: Some tokens need additional trading restrictions. Examples: Allowing users to purchase tokens with a credit card or bank transfer, or checking that users are Accredited Investors.
Since Diffusion, Commitment, and Restriction all apply to our hypothetical GPU marketplace, a new token would be useful. So that it is tradeable on exchanges, it implements the ERC-20 interface:
Solidity accurately describes the data flow for these methods. But that doesn’t tell you what these methods do. For that, we need laws.
I’ll use the hypothetical Pyramid Modeling Language(PML) to describe properties that can hold for contracts implementing this interface. Here’s one:
1. Addresses hold the tokens
This law states that if you add up all the tokens that every user holds, you’ll get the totalSupply()
.
Not every ERC-20 contract satisfies this law: If there is a two-week “holding period” before you can sell after buying, the developer may include unvested tokens in totalSupply
but only count vested tokens in user’s balanceOf
.
However, if it’s true for your particular contract, you can state with just one line that your contract cannot have tokens minted or destroyed by hiding them in things other than user’s balances.
A PML formula has 3 classes of free variables:
- The
@
sigil means the identifier refers to a contract method. - A
$
sigil can be substituted with any value to generate a new formula. - An unmarked variable stands for terms in the modeling language.
A formula with no $
sigils is said to be an expression. Some expressions - such as ones involving the =
operator - are propositions. A law is a formula that reduces to propositions after $
-substitution.
The sum
special form substitutes all possible values for $address
into the formula @balanceOf $address
and sums the results.
Here’s another law:
2. Transfers don’t change the total supply
The initial state of a contract is arbitrary. The =
“forks the universe” from this initial state, and compares two different actions in each.
On the left-hand side, we call the contract’s totalSupply
method. On the right-hand side, we first call transfer
with any two arguments and then call totalSupply
. =
requires the values returned by totalSupply
to be the same.
A sequence of expressions evaluates to the last expression. A law must hold in all possible initial states and assignments to $
variables.
Not every ERC-20 satisfies this property: Some tokens will “tax” every transaction in the token. Most do, however, and it’s important to know if these two lines hold for your favorite token.
There are even stronger versions of this property:
3. Nothing changes the total supply after initialization
The initialize
method is not part of the ERC-20 standard, but some contracts will have it anyways. This law then states that “After the contract has been initialized, there is no possible way to change the totalSupply
of the token”. The _
stands for “Any sequence of method calls”.
Since totalSupply
is unsigned, it satisfies @totalSupply >= 0
. This lets property #3 rule out integer overflow errors that generate huge amounts of tokens. The PoWHCoin hack was an example of this.
Each =
places a constraint on totalSupply
. If the supply ever changes, it is impossible to give $result
two different values so one of the two constraints will be violated.
Properties 1,2, and 3 are universal properties: They hold in all possible situations. Some laws have conditions before they hold. We can use if
to restrict them:
4. Delegated transfers don’t change the total supply
Delegated transfers allow users to grant other users or contracts permission to spend their tokens. A cryptocurrency exchange with thousands of tokens might require this permission in order to enable trading.
This law mainly checks that transferFrom
does not throw an exception, because the left-hand of =
did not throw an exception.
The first line of the if
expression is the condition. The other lines are the consequent and alternative.
The { X ; Y }
is the indentation-insensitive syntax for sequentially executing two statements.
Property 4 isn’t very useful because it is imprecise. I mainly used it to introduce the syntax. Here’s a more precise specification of transfer
:
5. If a transfer changes a user’s balance, they sent or received.
Here, there are two additional syntaxes:
- The
@{$from}transfer
syntax constrains$from
to the address that calledtransfer
. - The
&
and|
operators combine multiple propositions.
Using PML
This modeling language says a lot in only a few lines. But what can you do with a specification?
- Communication: Security auditors and token investors could use this language to talk about contracts without changing the development process. No extra code has to be written.
- Test Generation: Replace the
$
variables with random or developer-chosen values and test whether the properties hold. Some care must be taken generating addresses and cryptographic hashes. - Model Checking: The specifications can be sent to a model-checker like Z3. If the contract itself can also be converted to Z3, then it may be possible to formally prove the contract’s security properties.
A custom language can communicate intent more efficiently than implementation, and allows special-purpose tools to check that intent.
The goal of Pyramid is to allow experts to create special-purpose languages just like this modeling language. DSLs can be specifications, executable code, or test frameworks.
In the next section, I’ll show the nuts-and-bolts of how I created a new language.
C Compiler
Pyramid builds on Racket. Since Racket focuses on language development there are libraries for implementing Type Systems, Semantics Engineering, Parsing, and Program Synthesis and Verification.
Using Pyramid as a code generator, anyone can create advanced smart contract languages.
Pyramid’s flagship language is a Scheme dialect, but in this section I create a C compiler that can compile a Chess Engine that I wrote in a previous article. The engine successfully calculated a Chess perft
.
I based my implementation on Matthew Butterick’s Beautiful Racket, so that interested readers can use his well-written book as a guide. Matthew is a lawyer and web designer who uses language-oriented programming to publish books like Practical Typography and Typography for Lawyers.
Overview
A custom language begins with the first line:
Every Pyramid-based language has a #lang
line. ceagle
is the name of a Racket collection that describes how to read and interpret the rest of the file.
Only this first line has a predefined syntax - everything else can be completely customized.
The Ceagle C compiler has the following compiler stages:
Stage | Output |
---|---|
Lexer | Token sequence/type information |
Parser | Parse Tree |
Expander | Syntax Tree |
Simplifier | Syntax Tree |
Compiler | Pyramid Syntax Tree |
Macros | Pyramid Syntax |
The parser uses the brag DSL for defining BNF grammars.
The interesting work happens in the Compiler
and Macros
modules.
A custom Pyramid language’s expander turns the source code’s syntax into a Translation Unit - a bundle of Scheme together with metadata like the dependencies, language-specific parse tree, and language-specific AST.
Pyramid and Ceagle are implemented in Typed Racket, so the struct
above includes type annotations.
Translation units are topologically sorted by dependencies
. A translation unit exports every definition.
A custom language has two components: The Reader and Expander. Ceagle’s Reader runs the Lexer and Parser stages to create a Syntax Object; while the Expander is a macro that runs the Expander, Simplifier, and Compiler stages on this syntax to form a Translation Unit.
Custom languages export a make-translation-unit
function. The execute?
argument lets them treat the entrypoint module differently: Ceagle invokes the main
function at the end.
expand-translation-unit
and compile-translation-unit
refer to a C translation unit, not a Pyramid translation-unit
.
C’s #include
only works with C source, but Ceagle has a non-standard #require
that allows it to import any Pyramid-based language with a guarantee that they will only be compiled once. These are extracted from the parse tree by require-stxs
above.
Most of the interesting work is in compile-translation-unit
. The Beautiful Racket book mentioned above covers everything before it in greater detail.
To compile a translation unit, I’ll provide rules for compiling all 4 classes of abstract syntax:
- Declarations
- Statements
- Expressions
- Types
The types module defines all types used in Ceagle, and I’ve duplicated the important ones below.
Declarations
A C translation unit consists of top-level declarations: typedef
, global variables, and function definitions. The entrypoint module should additionally define a function named main
.
Here are the definitions of these:
To reduce the number of parentheses, I like to define a plural version c-declarations
of my singular types c-declaration
. I will only explicitly define the singular versions in this article.
Type Aliases
typedef
s are implemented by remembering the name
and type
association in a table for the rest of the translation unit to use. Since it must return Pyramid syntax as well, it returns an empty begin
form:
The destruct
macro defines a variable for each field in the struct
. So x-name
refers to (c-decl-type-name x)
.
See the Types
section for more details on how typechecking is done.
Variable Declarations
The c-decl-var
is a variable declaration int x = 5;
. There are two parts:
- Remembering that
x
’s type isint
in a table - Compiling the declaration to the form
(define x initializer)
, whereinitializer
is either the expression given or a default value that depends on the type.
Conceptually, variables are all pointers to sizeof(type)
bytes. However, values do not have identity so each type has a variable-definer
macro that assigns a fresh memory location to a value.
Macros expect and return syntax, so the shrink-pyramid
function reverses the Expander stage of the compiler, converting the AST back into syntax.
For the int x = 5
case, the compiler emits a use of the %c-define-fixnum
macro. This is written in Pyramid, and looks very similar to an ordinary Racket macro:
The %c-noinline
expands to (set! name name)
and prevents the Pyramid optimizer from attempting to inline the variable: This will not be necessary in a future version of the compiler.
Function Declarations
A function declaration looks like int main() { return 0; }
. A function has several requirements:
- The function name and signature needs to be remembered for later.
- Variables defined inside the function should not escape the body.
- Arguments are passed as rvalues, so new variables need to be initialized from them.
- A
return
statement completes the current function call with the given value. - Functions should be ordinary lambdas so they are usable with other Pyramid programs.
Putting all that together, we get this:
Since Pyramid is a Scheme dialect, it supports Scheme’s call/cc
. The return
statement simply calls a continuation named return
with the value given.
break
and continue
are also implemented as continuations. The goto
statement cannot be implemented as a continuation like this, because it allows you to jump to labels you haven’t visited yet.
This covers declarations. Since I just mentioned break
, continue
, and goto
, I’ll cover statements next.
Statements
There are 4 categories of statements:
- Labels are places that can be jumped to. C supports labels directly, but most programmers don’t know that the
switch
statement’scase
anddefault
clauses are considered labels. C switch statements are not equivalent to cascading if/else statements:
In this example, a value of x=2
would cause the switch to execute the return
skipping the three while
loops. This is not easily translatable to cascading if/else statements.
- Scopes allow you to define variables within the body. Many also allow you to
break
orcontinue
. - Continuations abort a surrounding scope using a Scheme continuation.
- Expressions can be used as statements.
Every entry in Labels is implemented as an inline assembly label and jump
instruction. A switch
statement emits a jumptable for its case
and default
labels.
A C if
statement compiles to a Scheme if
expression:
The for
, while
, and do while
loops are implemented using continuations for break
and continue
, which are also used to check the loop condition:
A C block is like a Scheme begin
, but it needs to hide any locally-declared variables. So we wrap it in an immediately-invoked lambda:
This does mean that a goto
between blocks could miscompile, since the λ
creates a new continuation frame which needs to be released. The solution is to hoist all variable definitions to the beginning of the function and use set!
to initialize them, but since I don’t use cross-block gotos in my example program I opted to keep it simple.
And finally, an expression statement is compiled just like the expression it wraps.
Statements mapped fairly cleanly onto Scheme. Expressions have more details to consider.
Expressions
Every expression can be compiled in two modes: rvalue
(“Result Value”) or lvalue
(“Location Value”).
Variables consist of both a location value and a result value. The location is the address in memory the variable is stored, while the result is the contents of the address.
Expressions compilable as location values allow assignment operators like +=
to modify the value at that location. Expressions compilable as result values can be read, but not necessarily written.
Location and result values satisfy the following two laws, for variable addresses and dereferences:
Ceagle tries to fit both lvalues and rvalues into a single 32-byte word. struct
values are allocated in memory and represented as pointers, even in rvalue context. However, struct
lvalues refer to the original struct while struct
rvalues are copied, so they are not interchangeable.
A 32-byte c-type-fixed
would have a 32-byte result value, but its associated address may be restricted to 3 or 4 bytes for efficiency.
Because rvalue
s do not have identity, they must be copied to create an lvalue
.
Here are all the C expressions:
I disagree with the C standard on string representation: Strings are length :: data
, not data :: 0
.
A c-variable
in lvalue context is the memory address of the variable, and in rvalue context is a copy of the variable’s value.
A c-field-access
or c-array-access
in lvalue context are the offset memory address from the base, and in rvalue context is the value at that memory address.
The other expressions mostly follow recursively from these.
Types
Typechecking is simple. The compiler remembers types when it comes across them, and typechecks expressions using previously-remembered type information.
If an expression can’t be typechecked, an error is thrown.
Here are all possible types:
The typespace
member of c-type-alias
is used to disambiguate x
, struct x
, and union x
. These can all refer to different types, even though each has x
for a name
.
Because Ethereum has 256-bit words, Ceagle has a non-standard __bits N
token which smaller standard types can be defined in terms of:
Only integer types smaller than 32 bytes are supported. Smaller integers are stored as 256-bit words too, but the compiler sign-extends them as necessary after every arithmetic operation:
Future Work
This article used Pyramid to implement the C programming language on Ethereum.
Pyramid still needs one last major technical component before it can be used by general developers. Look forward to the next article in the series:
“How I made my language 10,000x faster …by deferring all optimization until this article”
Beyond that, here is a table summarizing some ideas I’ve had for future work:
Feature | Priority | What it gets us |
---|---|---|
Optimizer | Required | Feasible to deploy Pyramid contracts |
Reference Documentation | Required | Possible for developers to learn the language |
Solidity ABI | Required(for Ethereum) | Possible to interface with Metamask |
Cardano Backend | Required(for Cardano) | Possibly to deploy Pyramid code to Cardano |
Type Systems | Optional | Reduce time spent debugging simple errors |
Multi-contract libraries | Optional | Deploy multiple interdependent contracts |
Contract testing libraries | Optional | Reduces risk of errors in deployed contracts |
Library management | Optional | Easier to fetch and use libraries |
Formal IR semantics | Optional | Reduces risk of errors in Pyramid compiler |
Solidity or LLL support | Optional | Use existing libraries in Pyramid |
Required
means “I’m the only person in the world who’s capable of doing this, and it’s necessary for anyone else to start using Pyramid”.
The optimizer and documentation are self-explanatory, and I’ll describe the other items:
Solidity ABI
Ethereum tools make “method calls” on the contract by encoding function name and signature along with the data. If Pyramid had a library implementing this, then tools like Metamask could call Pyramid contracts.
A direct translation of the ERC-20 interface is
. This would:
- Generate a dispatch table to the appropriate Pyramid function using the name and signature.
- Convert the function’s inputs and output using the Solidity ABI and the declared signature.
Importantly, exports
would be a macro and not a built-in language feature. So people could experiment with alternative ABIs without needing to change the compiler.
Cardano Backend
Pyramid’s code generator currently targets the Ethereum Virtual Machine. It shouldn’t be too difficult to also support Cardano’s LLVM derivative, IELE. This would allow Pyramid developers to write smart contracts that work on either Cardano or Ethereum.
This refers only to deploying to these blockchains. Atomic cross-chain trading is a different feature necessary to communicate between Cardano and Ethereum.
Type Systems
One problem with Ceagle is that its type system is incompatible with other type systems. If many languages have incompatible type systems, language boundaries could become awkward.
A recent paper introduced a technique for implementing type systems as macros that attach syntax properties to the code they transform. They were able to build on simpler type systems when implementing more complex type systems(all the way up through System F - similar to Haskell or ML).
This technique should make it easier to transparently call across language boundaries into languages with differing type systems.
Multi-Contract Libraries
Many Ethereum applications are networks of contracts. A game might have a token-trading contract, a high-level interface, a contract implementing the rules of the game, and a contract that stores player’s data.
Pyramid already lets you manually write this code, but it would be convenient to use an abstraction like Racket’s Units to safely link together contracts with declared interfaces.
Safely upgrading a contract while preserving its security properties is a common problem.
Contract testing
Testing libraries are the main tools used to check that a contract is free of errors.
Unit-testing | Implemented |
Randomized testing | Not Implemented |
Model-checking | Not Implemented |
Theorem-proving | Not Implemented |
Modeling languages like PML could be used to generate random testcases or fed into a model-checker like Z3. Some limited checking can also be done with just strongly-typed contract interfaces.
Together with a formal IR semantics, a dependently-typed language could emit end-to-end verified code.
Library management
Pyramid will either use Racket’s raco
or use the Nix package manager to handle libraries.
Nix promises reproducible builds, so anyone can verify that a contract was deployed correctly. It may make sense to retest contracts during the build process.
Formal IR semantics
What is the meaning of a program?
It’s common to define larger languages as reductions to simpler languages. This makes it easier to identify compiler errors.
A tool like PLT Redex could be used to specify these semantics, which also makes future formal verification work easier.
Solidity or LLL
Most Ethereum developers currently use Solidity. Since Pyramid allows people to develop languages as libraries, I could write a Solidity library for Pyramid that allows existing contract developers to use it without changing any of their existing code.
There is also a small community of “Low-Level Lisp” developers, who have expressed interest in Pyramid.
Conclusion
In this article, I presented language-oriented programming as an alternative way to program smart contracts. If you are interested, I recommend following me on Twitter, subscribing to the mailing list, or joining the Pyramid Discord channel. Links are available on the website’s sidebar.