Building an Emulator with WebAssembly and Rust
Hello everyone. I am Andy O'Neill. I am from the Raleigh office and I am a part of the UI Engineering studio.
For fun, a couple of years ago I made I made a web-based game emulator for a game system called Gamebuino. Before we begin, let me show you the end result.
_Play and describe a few games._
My first version of this emulator was written in JavaScript. Recently I rewrote the emulator in Rust, compiled to WebAssembly. This was my first time using both of those technologies, so I want to share what I learned with you all. Hopefully there should be some content for everyone. I am going to break this up into roughly 4 sections.
1. First I'll give you a bit of background on my story up until the rewrite;
2. Then I'll give you an overview of the Gamebuino hardware and how the game emulation works;
3. After that we will dive into WebAssembly;
4. And finally we will talk about the Rust programming language.
If you can make it to the end, I'll share the link to the emulator so we can all play some games.
So, I was introduced to Gamebuino a few years ago when I purchased the MAKERbuino. The MAKERbuino is a kit where you build your own hand-held gaming system. It was my second time ever soldering anything, and I recommend building something electronic like this to anyone who hasn't. The MAKERbuino is based on the original Gamebuino Classic. The Gamebuino Classic has extremely limited hardware: an 8-bit microcontroller running at 16Mhz and a tiny black and white screen that ancient Nokia phones used to use. By flickering pixels on and off quickly, you can even simulate a gray color.
Part of the fun was putting together the hardware, but I really enjoyed writing games as well. Sure, you can download games other people in the community wrote, but you can also build your own games using the Arduino IDE and a Gamebuino library that handles interfacing with the hardware for you. One of the games I made was Solitaire. Anyone remember the bouncing cards when you won Solitaire on Windows back in the day? Definitely a required feature.
This may seem counter-intuitive, but the limited hardware was actually a plus for me. It was exciting and challenging to try and squeeze everything I could from the hardware.
And then Gamebuino announced they were coming out with a new version: the Gamebuino META. - color screen, higher resolution, faster, 32-bit processor - I was on board. Still limiting enough that you would have to get creative to write interesting games, but unlocking a bunch of new possibilities. I ported my Solitaire game. I ported a simple ray-caster. I tried creative ways of using the screen. Other people in the community came up with even more interesting games and demos.
VIDEO
But that's not what this talk is going to focus on. A couple of people made an emulator for the Gamebuino Classic, one of them even running in a web browser. I had never written an emulator, but I wondered if it would be possible for the Gamebuino META. I was lucky enough to get an early version of the Gamebuino META. I started tinkering with writing an emulator in JavaScript. It started with small steps to prove I could get close to the end performance I would need. Then I started implementing the instructions that the CPU supported. Eventually it could "run" simple programs. I started using a demo program from Adafruit that would draw lines and shapes to the screen. One of the most satisfying moments of this whole project was when I first saw this demo program working in my web browser!
That's not to say things went smoothly. There were plenty of bugs to squash. For example, one demo program was supposed to print the digits of pi. My emulator was printing 3 point plus a bunch of incorrect digits. (Interesting side note\: the processor does not have native support for floating point numbers, so all floating point logic is handled in software.) It was pretty obvious that some instruction was not implemented correctly, but how do you track something like that when the problem could be the side effect of a single instruction after potentially millions of successful instructions? I spent many hours decompiling programs to ARM Assembly and tracing through by hand until I found something my emulator was doing incorrectly.
Eventually it could run real Gamebuino games. The Gamebuino creator even put it on the official gamebuino.com website. All was good. Then I started wondering if I could port this from JavaScript to WebAssebmly. After a bit of background on what is involved in emulating the Gamebuino, I will take you through my journey of porting this to Rust compiled to WebAssebmly.
Microcontroller
32-bit ARM Cortex-M0+ @ 48MHz
256KB flash
32KB SRAM
6 serial communication modules (SERCOM)
Timers and interrupts
Direct memory access (DMA) controller
...Which takes us to the next section of this talk. To give this talk context, we need to discuss the hardware of the Gamebuino META.
Perhaps the most important part to creating the emulator, it is running a 32-bit ARM-based processor at 48MHz. Many instructions can be executed in a single clock cycle, so we are talking about theoretically simulating 48 million instructions per second. Keep in mind that each of these simulated instructions involves several steps, such as: parsing the instruction, reading and writing values to memory, updating condition flags, and incrementing the program counter.
256KB of Flash memory. This is where the game is stored when running. Additional games can be loaded from an SD card. Although assets could be loaded from the SD card on demand, most games include _everything_ within those 256KB. Compared to programs sizes on modern PCs or JavaScript files for a simple website, this is miniscule!
32KB SRAM. This is where the running program stores any value that can change while the program is running, such as variables. 32KB isn't even enough to have a buffer representing every pixel on the screen!
There is built-in support for some serial communication modules. I bring it up because we will need to emulate this behavior for talking to the screen and buttons.
The processor can also be interrupted for certain events. One example is an interrupt for a certain amount of time passing, which is used by the game loop for displaying content at a fixed number of frames per second.
And finally, I'll mention the direct memory access controller. It allows the microcontroller to copy blocks of memory from one location to another without blocking the CPU from doing other work. We need this working as well as the game library depends on it to send screen data.
Don't worry about this text being too small to read. This gives you a rough idea of how many instructions the CPU understands. There are dozens of them, but each one does one simple task. Once the emulator can handle each of these instructions, it can run any program that is thrown it's way.
Screen
160 × 128 pixels
65K colors
Communicates over SERCOM4 using SPI
The screen is color, 160 by 128 pixel. Interestingly, it doesn't use 8 bits for red, green, and blue that you may be familiar with. Instead it is 16 bits per pixel, with red and blue using 5 bits and green using 6 bits. As mentioned on the previous slide, communication uses SERCOM.
Buttons
Communicates over SERCOM4 using SPI
D-pad, A, B, Menu, and Home all encoded as a single byte
There are 8 buttons. The button state is sent to the processor as a single byte, with each button using a different bit. This makes it super easy to emulate.
Ignored
And there are many aspects of the hardware that this emulator just ignores. Perhaps they will be added one day.
listen for button input
initialize emulator with game ROM
call gameLoop()
function gameLoop() {
tell emulator which buttons are pressed
loop for some number of steps {
simulate next CPU instruction
}
draw screen to <canvas>
schedule gameLoop() to be called in the future
}
With that background, we can take a stab at some pseudo code for the emulator. We will need to:
1. listen for input from the keyboard, mouse, gamepad, or other controller inputs we want to support,
2. load the game ROM we want to run,
3. and repeatedly call the game loop.
Within the game loop, we need to:
1. simulate some number of CPU instructions,
2. draw the game screen to the canvas,
3. and schedule the next execution of the game loop.
The last step is worth particular note. JavaScript runs on the main thread. If the emulation stayed in a hard loop, the browser would never be able to render changes to the screen. We need to have moments in the logic where it sleeps, allowing other browser tasks to complete. The more we are not hogging the CPU, the more responsive the site will be.
Why?
My first implementation of this emulator was written in TypeScript. It was incredibly satisfying to see it work successfully. Which brings us to our next section about WebAssembly, starting with why rewrite this in WebAssembly?...
I had a hunch that I could get better performance by using WebAssembly. Many of the operations that the emulator simulates involve 8, 16, and 32-bit integer math. JavaScript numbers are represented as 64-bit floating point numbers. Don't get me wrong - JavaScript engines have become incredibly smart and fast. It just felt like it would be adventageous to have lower level control over memory for this particular workload. I don't want anyone to equate WebAssembly meaning faster than JavaScript. That definitely isn't always the case. This is a sample size of 1, and perhaps the JavaScript version could be faster than the WebAssebmly version with some tweaks. What I do know is that as someone who is very familiar with JavaScript and completely new to Rust, the end result ran several times faster!
The "Before" chart shows that the vast amount of time was spent by the browser running the emulator. The "After" chart shows that roughly a quarter of the time is spent running the emulator, leaving much more time available to the browser. Note that this was run on a beefy developer machine. You can imagine that running on lower end devices was very slow for the JavaScript version and acceptable for the WebAssembly version.
So depending on the workload, performance could be a reason for "Why." There are some other good reasons, including:
1. Existing legacy codebases in C or C++
2. Startup speed - JavaScript goes through many steps to execute. It has to download a rather verbose text format of the script. It needs to parse and execute the code. There is then usually a system that monitors hot paths and speculatively optimizes these paths to architecture-specific machine code. WebAssembly, on the other hand, is transferred as a binary format and can be converted to machine code more quickly.
3. I am all for using the right tool for the job. Since JavaScript has been the only language that runs in the browser for decades, we've seen numerous other languages that get transpiled to JavaScript. WebAssembly really is a compilation target. This opens up the opportunity for using many different languages in the browser.
(module
(func (export "add") (param $lhs i32) (param $rhs i32)
(result i32)
(i32.add
(local.get $lhs)
(local.get $rhs)
)
)
)
Chances are you'll never need to interact directly with the WebAssembly format. However, I'm one of those people that like to see how things work under the hood. I think some of you will find this interesting, but bear with me if not. We will be through this section before too long.
Just like various assembly languages that are compiled to machine code, WebAssembly has a text representation for the binary format. Here is a simple WebAssembly module. It:
* declares a function that has two 32-bit integers as input and a 32-bit integer as an output
* names this function `add` so that we will be able to call it from JavaScript
* adds the two integers together and returns the result
<!DOCTYPE html>
<html>
<head> … </head>
<body>
<script>
(async function() {
const obj = await WebAssembly.instantiateStreaming(
fetch("add.wasm")
);
console.log(obj.instance.exports.add(1, 2)); // 3
})();
</script>
</body>
</html>
Don't think JavaScript is out of the picture when using WebAssembly. The two need each other. WebAssembly has no direct access to the DOM. JavaScript loads the WebAssembly, and the two sides communicate via an interface.
Here is a bare minimum. We:
1. Use `WebAssembly.instanciateStreaming` to load the WebAssembly. Note that there are some other methods that perform similar functionality.
2. Call the `add` function that was exported from the previous slide.
That's it. Just keep in mind that JavaScript and WebAssembly agree upon an interface to talk to each other, and you probably want to minimize the number of times you cross that interface since it isn't free.
(module
(import "js" "mem" (memory 1))
(data (i32.const 0) "HAL")
(func (export "doSomething")
(local $i i32)
(loop $loop
(i32.store8
(local.get $i)
(i32.add
(i32.load8_u
(local.get $i)
)
(i32.const 1)
)
)
(local.set $i
(i32.add
(i32.const 1)
(local.get $i)
)
)
(br_if $loop
(i32.lt_u
(local.get $i)
(i32.const 3)
)
)
)
)
)
Now let's dive deeper. You have already seen an add operation on a 32-bit integer. How many value types are there? I found this quite surprising. There are only 4: 32 and 64-bit integers, and 32 and 64-bit floats. That is all. So how does WebAssembly handle something more complicated like strings or complex data structures? The WebAssembly has access to a piece of linear memory, addressable per byte. Think of it as a big array. It is up to the WebAssembly program to manage this memory however it sees fit. It seems reasonable that a string could just be consecutive letters stored in a format like ASCII. Time to look at a simplified example. Suppose we have a 3 letter word stored in this memory starting at address 0. Let's write a program that loops over the 3 letters and increments each by one. Line 3 is a new one that initializes this memory with "HAL", starting at address 0.
One more thing to mention before we trace through this code - Remember our add operation from the previous slide? This is syntactic sugar that is different from how the program is stored in binary format. Conceptually, WebAssembly is a stack-based machine. Values are pushed on and poped from a stack. *(Note that this is just conceptually; this may be optimized away when converted to machine code on the target computer.)* The order is different in the binary format. First, a constant of 1 is pushed onto the stack. Then, the value of `i` is pushed onto the stack. Finally, the add operation pops the top two values off the stack and replaces it with the result.
So I have rewritten the instructions in the order they appear in the binary format. We are going to trace through this algorithm.
Across the top is the value of the memory and our local variable `i`. Down the left side are the values on the stack. You can see that memory has been initialized with HAL. We will be working with the numeric representation of those letters.
The function starts with a loop instruction. All this is is a label we can jump to later.
Next we push the value of `i` onto the stack.
Then we push the same value onto the stack. I know this seems weird, but just go with me for a minute.
The `load8_u` instruction pops an address off the stack, and loads that value from memory as an unsigned 8-bit number.
1 is pushed onto the stack.
The top two values are popped and added together.
`store8` pops a value and index off the stack, and stores the value in memory. This is why we had that extra `i` on the stack from the start of the loop.
1 is pushed onto the stack.
`i` is pushed onto the stack.
Those two values are added together.
A value is popped off the stack and stored in `i`.
That value is pushed back onto the stack.
3 is pushed onto the stack.
Those two values are compared using "less than" as the comparison operator. The result of this operation is 1 for true or 0 for false.
And at the end of the loop we branch to the label if there is a truthy value on the stack, which is the case.
So we made it one iteration through this loop. The end result is that the value at memory location 0 was incremented by 1, and `i` was incremented by 1. Let's run through another iteration, but at a higher level.
First, we want to get the value of memory at index `i`.
Next, we want to increment that value by 1.
Then we want to store the value back to memory.
Then we want to increment `i` by 1.
Finally, we want to see if we've reached the end of our loop.
I'll trace through the final iteration without comments for anyone that wants to think about it.
The final branch does not trigger, and the function completes. And I'll display the memory as ASCII characters again.
Rust + WebAssembly = ❤
wasm-pack
Small runtime, no garbage collector
Low-level control
Focus on safety
Modern features
Error messages and documentation
Time to come back up a level for our final section of the talk. It doesn't make sense to write WebAssembly by hand. Instead, it is a compilation target for other languages. I chose to go with Rust, which I will explain why on this slide.
Rust has excellent support for compiling to WebAssembly. There is a `wasm-pack` that will compile your Rust source code to WebAssembly. As we looked at previously, there is some boilerplate code needed for JavaScript and WebAssembly to communicate with each other. `wasm-pack` will also generate JavaScript that makes this communication super simple.
So why Rust as opposed to some other language? One of the pros is that it has a very small runtime and no automatic garbage collector. If we were compiling a higher-level language that depends on a garbage collector, this logic would have to be included in the `.wasm` file because WebAssembly does not currently have any notion of a garbage collector.
Also I wanted fine-grained control over how execution and memory is used because my goal was to optimize for speed.
So far, languages like C or C++ would also meet these requirements. Since Rust is newer than those languages, it has many features that I found personally intriguing. Rust has a huge focus on safety. This may seem a bit weird if you haven't seen another language like this, but there is no `null`, so there aren't null pointer exceptions. The compiler is very strict. In many scenarios, if the compiler detects that you haven't handled all possible scenarios, it will produce a compile-time error. And the feature that was the largest learning curve for me was ownership. Each value in Rust has a variable that is the owner, and there can only be one owner at a time. When that owner goes out of scope, the value will be freed. References to values are also extremely useful when you want to pass a value into a function. There are also special rules for references: you can either have a single mutable reference, or any number of immutable references. These restrictions allow Rust to ensure your program never references invalid memory, but it does result in a different mental model when writing your program.
There are also modern features like closures and a rich package system called Crates.
With such strict compilation, it is nice that the compilation error messages are very detailed, and often suggest what needs to change.
I am sure I could have come to a similar result with my emulator using another langugae, but these are the reasons I went with Rust. And learning a new programming language is always a pro in my book!
fn fizz_buzz_1() {
for i in 1..=30 {
if i % 3 == 0 && i % 5 == 0 {
println!("FizzBuzz");
} else if i % 3 == 0 {
println!("Fizz");
} else if i % 5 == 0 {
println!("Buzz");
} else {
println!("{}", i);
}
}
}
Let's take a look at some Rust code. I'm going to take a look at FizzBuzz since I'm guessing a number of you already know about it. Basically you iterate over numbers. If the number is divisible by 3, print Fizz. If the number is divisible by 5, print Buzz. If it is divisible by both, print FizzBuzz. Otherwise just print the number.
Overall the syntax feels very similar to a lot of programming languages. Line 1 defines a function named `fizz_buzz_1`. Line 2 starts a `for` loop. Lines 3-11 are an `if`, `else if`, `else` block. Nothing too crazy here.
The one surprising bit for me is on line 2. You will notice that it is iterating over the values in a range. Don't worry about the range syntax for now. This is very similar to a JavaScript for..of loop. I was surprised to learn Rust does not have a traditional for loop that takes an initial value, condition, and incrementer. However, it makes sense when taken in the light of Rust's security focus. This for..in style for iterating over an array makes it harder to access invalid memory.
fn fizz_buzz_2() {
(1..=30).for_each(|i| match get_special_message(i) {
Some(msg) => println!("{}", msg),
None => println!("{}", i),
});
}
fn get_special_message(i: isize) -> Option<&'static str> {
if i % 3 == 0 && i % 5 == 0 {
Some("FizzBuzz")
} else if i % 3 == 0 {
Some("Fizz")
} else if i % 5 == 0 {
Some("Buzz")
} else {
None
}
}
Now I've rewritten FizzBuzz to show off some Rust features that I find interesting. It is purposely complicated in some ways to show off these features, so don't consider this a good program.
Let's start with the funtion starting on line 8. You can see some of the type annotations. This function takes an argument `i` that is of type `isize`. It then returns an `Option`. `Option` is used for many things, but one is an alternative to `null` values. It is an `enum`, which can take on one of two values: either a `Some` or a `None`. `enum`s can also hold additional data. For `Some`, it can hold a value, like you see on line 10, 12, and 14. So this function will return a `Some` value of FizzBuzz, Fizz, or Buzz if it is one of the special cases, or `None` if it is not a special case.
There is also a subtle twist in how Rust returns a value from this function. Rust is also an expression oriented language. Statements end in semicolons and expressions do not. A block will return the value of the last expression. Notice that line 10 does not end in a semicolon. So the result of that `if` block is `Some("FizzBuzz")`. if-else expressions actually result in the value of the selected block, so lines 9-17 evaluate to whichever condition block was true. You can actually use an if-else in Rust much like a ternary operator in languages like JavaScript. And then finally, the function will implicitly return the last expression in the function.
Now let's see how this function is used. On line 2 we call the `for_each` function on a range. This function takes a closure to call for each element, much like passing an arrow function to `forEach` in JavaScript. The `i` surrounded by pipes represents the arguments to the closure, and the `match` expression is the body. Had there been more than one expression, it could be wrapped in curly braces for a block expression. `match` is like a super-powered `switch` statement. In our case, we do something different based on whether there was some special message or there was none. This is yet another example of Rust's security. Had we not defined what to do in the `None` case, the program would not have compiled.
pub struct St7735 {
…
image_data: [u32; St7735::WIDTH * St7735::HEIGHT];
}
impl St7735 {
…
pub fn byte_received(&mut self, value: u8,
porta: &PortRegisters, portb: &PortRegisters) {
…
let r = (0b1111100000000000 & pixel_data) >> 8;
let g = (0b0000011111100000 & pixel_data) >> 3;
let b = (0b0000000000011111 & pixel_data) << 3;
let color = (255 << 24) | // alpha
(b << 16) | // blue
(g << 8) | // green
r; // red
…
self.image_data[base_index] = color;
}
}
I won't be going over much of the code for the emulator, but let's look at a couple of interesting snippets. This is part of the code that handles pixels being drawn to the screen. Line 1 defines a `struct`, which has similarities to classes in other languages. On line 3 we declare that this `struct` has an array of length `WIDTH` times `HEIGHT`. Line 8 defines a function on this `struct` that is called whenever a message is sent to the screen hardware. Let's ignore most of the logic handled by this function and focus in on when it needs to draw a pixel to the screen. Lines 11-17 do some bitwise manipulation to convert the rgb value into the correct format. This is something we can also do in JavaScript, but in Rust we have more control over the type of the numeric data. And finally on line 19 we put this data in our array at the correct location for this pixel location.
step(timestamp) {
… // Snip code to calculate number of iterations
// Run number of emulated cycles equal to `iterations`
this.gamebuino.run(iterations, this.buttonState);
// Draw to canvas
let buf8 = new Uint8ClampedArray(
memory.buffer, // buffer
this.gamebuino.image_pointer(), // byteOffset
160 * 128 * 4 // length
);
this.imageData.data.set(buf8);
this.ctx.putImageData(this.imageData, 0, 0);
this.ctx.drawImage(this.canvas, 0, 0);
// Request next step
this.requestId =
requestAnimationFrame(t => this.step(t));
}
And this is the JavaScript side where we will see this image data being used to draw to a canvas.
Line 5 is where we are calling into the WebAssembly code. Note that `wasm-pack` generated some JavaScript code so that we can easily call the exposed WebAssembly functions.
The format of the `image_data` from the previous slide was special. It is the same format JavaScript uses for images that can be drawn to a canvas. So rather than drawing to the canvas a single pixel at a time, we can just take that whole pixel array, store it in image data, and draw the whole image at once. That is what is going on in lines 8-15. Take special notice of the `Uint8ClampedArray`. We are using the shared memory buffer from WebAssembly. (Remember the slide where we manipulated the memory directly in WebAssembly?) We start at the index defined by the location of the start of the array, and continue for a length including the whole screen worth of data.
This wasn't too bad. However, an alternative solution would be to use the Rust `web_sys` crate. It includes bindings to interact with web APIs directly from the Rust code, including the DOM, and for our specific scenario, the canvas. At the end of the day, I believe this library would have done something similar to my solution to implement this functionality. Worth it for my own learning to do it by hand.
fn execute_instruction(&mut self,
instruction: Instruction) {
match instruction {
Instruction::LslImm { rs, rd, offset } => {
let original = self.read_register(rs);
let result = original << offset;
self.set_register(rd, result);
self.cond_reg.c =
original & (1 << offset) != 0;
self.set_nz(result);
}
…
Instruction::Beq { offset } => {
if self.cond_reg.z {
self.set_register(PC_INDEX,
self.read_register(PC_INDEX) + offset);
self.increment_pc();
}
}
…
}
}
There is nothing really new in this slide, but here is a snippet of the logic that performs the CPU instruction emulation.
To squeeze a bit more performance out of the emulator, all of the instructions in the emulated program are pre-parsed into an `Instruction` enum. (Theoretically, a game could overwrite program memory while running, defeating this pre-parsing, but in reality, I haven't run into a game that does this yet.) Each enum option represents a different type of CPU instruction.
We use a `match` instruction to determine which instruction to run on line 3.
Line 4 matches a logical shift left by a constant amount. You can see the shift on line 6, storing the result in line 7, and updating some condition flags on lines 8-10.
Line 13 matches a branch if equal instruction. So depending on if the zero condition flag is set, it will update the program counter register to where we want to jump to.
That's really it! We only looked at a small portion of the actual code for the emulator, but my primary goal of this talk was to share what I learned about using WebAssembly. I can say that I was plesantly surprised how quickly I was able to pick up Rust and compiling to WebAssembly when I had almost no experience with it prior to this project. It isn't a solution to everything, but I encourage you to check it out if you found this talk interesting.
Links
Here is the link to the emulator I promised. _Show how to use it correctly._
I highly recommend the Rust and WebAssembly "book" below. Also there is a link to my source repo. Be gentle; I'm sure more eyes will find plenty of things that can be improved.
I have found programming games on the Gamebuino to be very accessible, even if you have no programming experience. On the off chance you are interested in writing games for a system like Gamebuino, there are other options. Two that I'd recommend checking out include Pokitto and Arduboy.
Thanks to everyone that came to listen to me! Are there any questions?