Recently, I made an emulator for the Nintendo Entertainment Console(NES) - a game console first released in 1983.
In this article, I’ll talk about how I used Rust to develop the emulator. I’ll cover questions like:
- What features does the emulator support? What games can it play?
- How did I approach the problem of emulating the NES?
- Did Rust’s type system or borrow checker interfere? Were there performance issues?
Table of Contents:
- Rust Language
Super Mario Bros is beatable on my emulator:
- Runs at a stable 60 FPS(can go up to 430 FPS in “headless” mode)
- Use an Xbox 360 controller for game input
- Savestates allow you to save at any time. Impress your friends with flawless Goomba-stomping
- Video recording works with savestates to remember a single chain of controller inputs.
- Donkey Kong
- Super Mario Bros
There are a few remaining tasks that would make it more usable:
- Support for more mappers will allow it to play more games
- The keyboard should be usable as an input device
- The audio sounds different, though I don’t know the words to describe the problem
Rust seems like it was a fine choice for this project. Using features like iterators and traits generally didn’t slow down the program(though see below for one exception). Since an NES is a fixed hardware device, no dynamic allocation1 should be necessary and Rust makes it easy to reason about already-allocated memory with its ownership model.
People frequently want to run emulators on strange embedded systems. C is probably still a better choice for those cases: It’s more difficult to find someone who can implement a Rust compiler than a C compiler.
But Rust seems usable in any project where C++ is a viable language.
I emulated the NES mostly by emulating each individual component separately. These include:
- MOS Technology 6502 CPU
- Custom Picture Processing Unit(PPU)
- Custom Audio Processing Unit(APU)
- Writable memory(RAM) or read-only memory(ROM)
- Cartridges with custom circuitry
These components are either Clocked or are mapped to one of two Address Spaces. I define each component as a C Struct, and implement one of two Traits to specify how it interacts with other components.
For video/audio/controller IO with the host OS, I used the SDL library.
Here is a table with all structures in my program. They generally correspond to actual hardware components like RAM or internal timers.
|Structure Name||Component||Clocked?||Address Space?||Description|
|Nes||T||F||Top-level hardware component. Has all others as members|
|Ppu||PPU||T||T||Produces a 256x240 pixel display|
|PpuRegisters||PPU||F||F||Hides some tricky internal PPU state|
|PaletteControl||PPU||F||T||Stores which 13 colors have been chosen of 64 possible|
|CpuPpuInterconnect||PPU||F||T||Maps certain PPU registers to CPU address space|
|Sprite||PPU||F||F||Represents a 4-byte entry in the OAM table|
|Apu||APU||T||T||Generates audio samples|
|Frame Counter||APU||T||F||Generates a clock signal every quarter or half frame that other audio components react to.|
|Length Counter||APU||T||F||Silences an audio channel after a certain number of clock cycles.|
|Linear Counter||APU||T||F||Silences an audio channel according to a timer. Has slightly different timing than the length counter|
|Triangle||APU||T||F||Audio channel for a triangle wave|
|Sweep||APU||T||F||Dynamically changes the pitch of an audio channel|
|Envelope||APU||T||F||Dynamically changes the volume of an audio channel|
|Pulse||APU||T||F||Audio channel for a pulse/square wave|
|Noise||APU||T||F||Audio channel for pseudo-random noise|
|Dmc||APU||T||F||Audio channel for premade audio samples|
|Ram||F||T||A fixed-size block of readable and writable memory|
|Rom||F||T||A fixed-size block of read-only memory|
|NullAddressSpace||F||T||Represents unmapped address space. Returns 0 when read, and does nothing on write.|
|Mapper||F||T||Splits the address space into regions, which are assigned to other
|Joystick||Input||F||T||Communicates controller inputs to the CPU|
|Ines||Cartridge||F||F||A game cartridge, represented in iNES format|
In this section, I’ll talk about the most important of these, and each of the traits that link them together.
A clock cycle is the smallest discrete step that a component can make: There should be no observable changes outside of a clock cycle.
The CPU is an example of a clocked component. A single CPU instruction might:
- Request an 8-bit value at an address
- Attempt to store an 8-bit value at an address
- Calculate an addition using the Arithmetic Logic Unit(ALU)
Generally, an emulation that does more work at once is faster than a component that simulates each individual clock cycle.
A clock-accurate emulation of a clocked component is the most accurate2, but programmers - even rugged ones in the 1980s - generally don’t depend on the clock-by-clock details of the CPU. It’s common to assume that a bitwise AND instruction takes 6 clock cycles, but not that it does a memory read on clock #2 and a memory write on clock #6. So it can be an acceptable loss of accuracy to run the entire CPU instruction in one clock cycle, and then do nothing for the next 5.
The NES has a Master Clock, and all other Clocked components run at an integer fraction of its speed:
|Master||1:1 = 21.477272 Mhz|
The APU has components that act on two separate clock signals: One from the APU clock, and one from an internal
FrameCounter that sends a signal every half or quarter frame.
When a component wants to read or write a value at an address, it places the address on an Address Bus. There is a separate Data Bus that holds the value being read or written. Other components listen for specific addresses that they have been assigned.
An emulated Address Space is an assignment of addresses to actions that components take.
The NES has two different address spaces, primarily used by the CPU and PPU respectively:
The CPU mostly handles game logic, while the PPU’s address space stores sprites, tiles, and colors.
Cartridges listen on both address buses. They are not a simple ROM holding a list of bytes, since they can have arbitrary circuitry3. Games can choose to include extra RAM, coprocessors for specialized calculations, or control registers for changing the address space.
Super Mario Bros 3 animates its background by telling the cartridge to suddenly switch between two different background tile patterns. When the PPU reads from the relevant addresses, the cartridge suddenly starts returning different tile data. Its otherwise not possible to update the entire background in a single frame.
(Original image from this article)
The CPU handles the game logic: What happens when Mario jumps, stomps on a Goomba, or falls in a pit?
It repeatedly fetches, decodes, and executes an instruction at the current Program Counter - a pointer in CPU address space - which is then incremented to the next instruction.
The CPU can only perform three primitive actions:
- Request a read(in CPU address space)
- Request a write
- Change its internal registers
All instructions are combinations of these. The Arithmetic Shift Left, with Absolute,X addressing mode instruction(
0x1E) takes 7 clock cycles and causes the CPU to perform the following 7 actions4:
- (1) Fetch the opcode byte
0x1Eat the current program counter
- (2,3) Fetch a two-byte value
Vimmediately after the opcode byte containing a 16-bit address
- (4) Adds the
- (5) Fetch a one-byte value
alocated at address
- (6) Calculate the value
b = a << 1, updating several status flags
- (7) Write the value
It’s a complex operation, but the individual steps are simple: 4 memory fetches, 2 calculations using the ALU, and 1 memory write.
I implemented each instruction in terms of the three primitive operations, generated an instruction-by-instruction log using the
nestest test ROM, and verified that it matched the reference log.
At 60 Frames per Second(FPS) the NES’ CPU can execute
29780 clock cycles per frame5, but there are
61440 different pixels to display. So the CPU is too slow to draw even a blank screen.
A Picture Processing Unit(PPU) runs at triple the CPU’s clock speed, emitting one pixel per cycle. It has more cycles available than pixels to emit, so the idle time(“Vertical Blank”) is used by the CPU to configure the PPU for the next frame.
The PPU draws two things: Background tiles and sprites. Up to 32 x 30 tiles and 64 sprites can be displayed at once, and these share a pool of 512 different 8x8 4-color patterns.
There are 4 tables in PPU address space that configure these:
- Nametable: A table of 32x30 bytes that specify which 8x8 pattern to use
- Attribute Table: Specifies which 4-color palette is used for each 16x16 group of tiles
- Object Attribute Memory(OAM): Stores the position, palette, and status of 64 sprites
- Palette: There are 8 different 4-color palettes. The first color is always transparent, and the other 3 choose from 64 different System Colors.
The PPU and CPU have different address spaces, but they are not isolated. There are three communication channels between them:
- There are 8 PPU registers mapped in the CPU’s address space.
- The PPU triggers two CPU interrupts: End-of-scanline and Vertical Blank.
- The cartridge itself can choose to modify its assigned PPU address space based on reads or writes to the CPU address space.
There are 5 primitive actions that the PPU can take on each clock cycle:
- Request a read( in PPU address space)
- Request a write
- Change its internal registers
- Emit a single pixel
- Trigger a CPU interrupt
It is more difficult to test the PPU than the CPU: Much of the PPU must already be functioning in order to get any output at all. I recommend printing out the 4 tables separately - and once that data is confirmed correct, then test whether these are combined correctly for any particular pixel.
A common feature in emulators is to save and load a game at any time. Every component implements the
There are two properties that make this model useful:
- Once a ROM has been loaded, my emulator doesn’t allocate memory. So every object can be written in-place.
- Loading a ROM always creates the same types of components, so the values saved from a previous program execution can be assumed compatible.
The first property guarantees that pointers aren’t affected by a savestate restore. If a component is dynamically-allocated and a savestate from before or after its lifetime is loaded, then any pointers to it are no longer valid. This would require keeping track of which objects have pointers to which other objects.
If this second property failed(say, if someone unplugged a standard NES controller and plugged in a new SNES controller), then the savestate files would need to include type information so the correct
load method is called.
However, assuming these restrictions makes serialization straightforward. For example:
Every component except for the primitives like
u32 look like that. The primitives aren’t too difficult either:
Rust’s traits were useful for implementing savestates. The compiler infers which
load methods are needed for each type, so the code is uniform.
The video playback feature builds on top of savestates by storing 8 bits per frame - one for whether each controller button was pressed. When a savestate is restored, it also restores the active input list.
The previous section gave a high-level overview of how I designed my NES emulator. In this section, I’ll talk about the Rust language itself.
By default, Rust will throw an exception if any arithmetic operation overflows. This caught a fair number of bugs during testing, because the wrapping_* functions require explicit type information and it’s important whether a number wraps as a 16-bit value or an 8-bit value.
I used functions like
wrapping_add in my emulator, but there are also Wrapping types that imply wrapping arithmetic operations.
In Rust, mutable values must have a single owning variable. Other variables can borrow it, but only one mutable reference is allowed at a single time.
The CPU address space has several PPU registers mapped. So the CPU maintains a permanent mutable reference to the PPU. But the top-level
Nes object also owns the PPU.
I worked around this using Box to assign fixed memory addresses to values, and then “unsafe” pointer dereferences when needed.
This use of
unsafe means that my emulator is not thread-safe. If both the PPU and CPU ran in separate threads, they could both issue writes to the same address or read a value while it is being written.
A Mutex might be a solution here. The lock would be held for a small amount of time, so it shouldn’t cause much of a performance issue.
In a later article, I plan to train a Reinforcement Learning(RL) agent to play Mario. A faster emulator allows me to gather more sample data, so I spent a few hours ensuring my emulator was comparable to other NES emulators used for RL research.
My benchmark runs 10,000 frames of Mario in a “headless” mode with no video or sound, and then writes a screenshot.
Before I started optimizing it, my emulator got about 350 frames-per-second(FPS). I saw RL people getting between 200 and 450 FPS on various emulators, so I made a few small changes and ended up at about 430 FPS.
I found three optimizations, which each gave a 10% FPS increase.
Compilers try to avoid emitting division instructions, since they are slow. However, division by a non-constant defeats many of the optimizations it uses.
. Since the
FrameCounter is clocked every other CPU cycle, it executes about
div instructions per NES frame, or about 6.4 million per second at the accelerated rate of 430 FPS.
Given the following two conditions:
- The dividend never decreases
- The dividend never increases by more than the divisor between divisions
The above code can be replaced with a conditional subtraction:
The branch is only taken once every 15,000 clocks, so this bottleneck is completely removed and the emulator as a whole processes 10% more frames.
Consider the following two implementations of the same function:
foo is a fairly direct translation of the original source:
The variables are mapped as follows:
rcxis the remaining number of bytes in the slice
rdxis the variable
raxis the variable
rdipoints to the beginning of the data in
And the loop executes
imul instructions until
rcx is 0.
However, the function
bar unrolls its loop 4 times:
There are two loops:
- The loop beginning at
30:processes 4 elements from
- The loop beginning at
70:processes one element from
The variables are assigned as follows:
r8dcontains the remaining number of elements to process after the
30:loop has finished.
rdipoints to the beginning of the data for
idxvariable, updated once per loop iteration
rcxis also the
idxvariable, but updated more frequently in the unrolled
rsipoints to the first data that the
30:loop is unable to process.
The main difference seems to be that
foo counts bytes while
bar counts elements. Perhaps the tuple
(x, idx) has size 16 bytes which is larger than the maximum scale of
8 in addressing modes like
QWORD PTR [rdi+rdx*8+C].
rustc might choose its loop strategy before later optimizations make them equivalent. The difference goes away if
bar are changed to take
u8s rather than
In any case, changing one of my loops to use
bar’s style improved the emulator’s FPS by 10%.
The CPU Address Space consists of many components which each listen for reads and writes to specific addresses.
The Mapper takes an address and does a linear search through all of them looking for the responsible component.
When I first wrote the code, I mapped the address space from first-to-last:
However, this puts the
cartridge at the very end. The cartridge contains all of the game’s code, so every time the CPU fetched a new instruction it got a worst-case linear-search.
I experimented with an LRU cache, but found the best improvement by simply reordering those statements to put the most frequently-accessed components first.
This also gave a 10% improvement in FPS.
This article discussed an NES emulator I developed using the Rust programming language.
If you’re interested in learning more about NES development, I recommend the Nesdev Wiki for technical details needed to make emulators or games.
Future articles on this subject may cover:
- Training an agent to play Mario
- Improving the performance of the emulator
- Developing emulators for other systems
- Compatibility issues with less-common games
- How specific NES games worked internally
- Porting the emulator to run on a GPU6
- Writing an NES game in Rust
- Making new optimization passes for
- Accelerating an emulator with a JIT compiler
My emulator actually uses quite a few
Boxobjects. Mostly this isn’t to allocate memory, but to assign data a fixed memory location so I can safely take mutable pointers to it. I think all dynamic allocation is removable except for the choice of mapper when loading a cartridge. Since each game can have different components inside, dynamic allocation seems necessary there. ↩
There are digital logic simulators created from special photographs of the CPU. These are not more accurate than a correctly-implemented cycle-accurate emulator, but are useful debugging tools to determine what the correct behavior should be. There are rare cases where even this is not enough to explain the precise behavior of the CPU, but no published NES games make use of this detail. ↩
I haven’t confirmed with the digital logic simulator that the 7 clock cycles map precisely to the 7 actions I mentioned. It’s an educated guess that serves to illustrate the difference between instructions and clock cycles. ↩
These numbers are simplified and assume the NTSC video standard. NES consoles sold in regions that use the PAL standard have slightly different timings to generate video. Even/odd frames may take an extra clock cycle. Refer to the Nesdev Wiki for exact timing details. ↩