
This is part of the series behind the scenes of RP2040 Doom:

See here for some nice videos of RP2040 Doom in action. The code is here.

Overview

Getting Doom to run in our 264K of RAM is hard work anyway, but there are trade-offs with speed to consider too:

Moving code to RAM speeds things up

The flash on the RP2040 is mapped into the address space via XIP (eXecute In Place), and bus reads within the flash window can be translated into (Q)SPI communication with the flash chip to fetch the data. This type of access is much slower than a regular RAM read. To improve things a little, the RP2040 has a 16K RAM cache in front of the XIP mechanism, which can avoid the slow read for addresses/data already in the cache. This does help tremendously for relatively small hot code or data, however given that RP2040 Doom is accessing large amounts of level data and graphics from flash every frame, there is likely a lot of churn.

I did consider fine control of which data may be cached or not, however I have found this to be hard in general, and often counter-productive, given that you often end up not caching code/data that might actually have helped.

Note that the RP2040 XIP cache can be turned off altogether, freeing another 16K of usable RAM. However, RP2040 Doom keeps all its level data, graphics, many lookup tables, and most of its code in flash, so there is no obvious 16K of code/data in flash that should be promoted to RAM at the cost of every other flash access being slow.

Given the above, I decided to leave the XIP cache to do its thing, and select a few small areas of hot code or data to promote to RAM manually. Basically my strategy here was:

Moving data to flash to save RAM slows things down

Whilst vanilla Doom loads level data for the current level into RAM, we do not have the space to do so. It isn’t just the level data though, as we have to move other lookup tables and game data structures into flash also to make enough space for definitely mutable data in our limited RAM.

RP2040 Doom necessarily takes a performance hit for this, which it must make up in other places. Fortunately the CPU clock frequency is high compared to the machines of the day, so the net effect does not seem to be too bad as long as these penalties are limited to code that does not run many thousands of times per frame.

Compressed data has a decoding overhead

As mentioned in the previous section, the level data and graphics are all compressed in flash. This again costs us extra instructions to decode them; the hope is that the reduced XIP cache churn and the fewer actual accesses to the flash, both due to the smaller working set size, offset this at least a little.

Note that there is not generally space in RAM for any caches of decompressed data, so it is super important that the compressed data be decompressible with minimal overhead.

Use of the two cores

The RP2040 has two symmetric Cortex-M0+ processors. RP2040 Doom splits the work as follows:

Core 0 - “Game loop”

The main thread of execution on core 0 runs the main game loop, very much as vanilla Doom would. This runs as fast as it can, and if a frame takes 12ms, 26ms, or 40ms to complete then so be it. The spread of other responsibilities between IRQs and the other core is designed so that this variability has no effect on the real-time activities such as VGA display generation, sound effects or music.

Each game loop consists of:

  1. Game logic, input handling, any changes to game/menu state etc.
  2. Rendering the frame.

Core 1 - “Game loop” synchronized worker

The main thread of execution on core 1 also runs a loop, coordinated by semaphores to be in sync with the loop on core 0. Core 1 is actually blocked during the first step (game logic) above, as this is a very short step, and blocking core 1 here removes the need to protect a whole slew of other state with mutexes.

As soon as core 0 starts rendering the frame however, core 1 is free to generate music/sound effects as needed. The audio signal generation consumes one buffer every 20ms or so, and there is a short queue. The code on core 1 needs to poll within every 20ms on average, to see if a buffer has been emptied and now needs filling with newly generated music/sound data.

This audio generation work, which takes anywhere from about 1.5ms to 4.5ms, is hopefully done while core 0 is busy with the scene traversal and other non-parallelizable parts of the rendering. Once the rendering has reached the point where core 1 can contribute in parallel, core 0 sets a semaphore, and from then on core 1 will also do rendering work, but still interspersed with sound generation as needed.

Note that it is possible during this parallel phase that core 0 finishes before core 1, in which case core 0 may steal any of the sound generation work before core 1 notices it.

Core 1 IRQ - VGA signal generation

The VGA signals themselves are generated by PIO state machines, however these state machines generate IRQs when CPU involvement is required. The most invasive of these fires once for every display line, i.e. 60000x per second, when the CPU must set up a new DMA transfer for the scanline. These IRQs happen on core 1.

Core 1 IRQ - Scanline-buffer filling

The concept of scanline buffers is discussed more in the section on rendering, but basically one scanline worth of data must be generated - for use in an upcoming DMA to the PIO state machines - 200 times per frame, i.e. 12000 times per second. These IRQs also happen on core 1, and top up a small circular queue of scanline buffers ready for signal generation; the small circular queue provides some elasticity, allowing these IRQs to run at a lower priority/larger jitter than the signal generation itself.
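
The elasticity that a small circular queue provides can be sketched in a few lines of C. This is only an illustration of the general technique, not the RP2040 Doom code: the queue depth of 4, the 320-byte scanline, and all of the names are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SCANLINE_WIDTH 320 /* assumed: one byte per pixel */
#define QUEUE_DEPTH    4   /* assumed: small, so it costs little RAM */

typedef struct {
    uint8_t pixels[QUEUE_DEPTH][SCANLINE_WIDTH];
    volatile uint32_t filled;   /* total scanlines produced so far */
    volatile uint32_t released; /* total scanlines handed back after display */
} scanline_queue_t;

/* Filling side (lower priority): buffer to fill next, or NULL if the queue is full. */
static uint8_t *queue_begin_fill(scanline_queue_t *q) {
    if (q->filled - q->released == QUEUE_DEPTH) return NULL;
    return q->pixels[q->filled % QUEUE_DEPTH];
}

/* Filling side: publish the buffer returned by queue_begin_fill(). */
static void queue_end_fill(scanline_queue_t *q) {
    q->filled++;
}

/* Signal side (higher priority): oldest filled buffer, or NULL on underrun. */
static const uint8_t *queue_next_ready(const scanline_queue_t *q) {
    if (q->filled == q->released) return NULL;
    return q->pixels[q->released % QUEUE_DEPTH];
}

/* Signal side: the oldest buffer has been sent and may be refilled. */
static void queue_release(scanline_queue_t *q) {
    q->released++;
}
```

As long as the filler keeps the queue non-empty on average, it can be delayed by up to a few scanline periods without the signal side ever seeing an underrun.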

Note: These two sets of IRQs take a non-trivial amount of core 1’s time. However, they are the only IRQs running on core 1 and there is actually enough time to service them all, so it is nice to be able to just “farm off” video generation to core 1 and not worry too much about it!

Core 0 IRQ - Audio signal generation

RP2040 Doom currently outputs sound as I2S using another PIO state machine. Again data is DMAed to the state machine. The data has been generated by the main thread of execution on core 0 or core 1 as described above; the IRQ is simply responsible for taking full audio buffers from a queue and starting a DMA for them to the state machine, and passing empty buffers back to the pool.

Core 0 IRQ - USB handling

RP2040 Doom uses TinyUSB for USB support. Support for host mode, where devices such as keyboards are connected to the RP2040, is still pretty nascent in TinyUSB. As a result these IRQs can be of slightly unpredictable durations, so are run at a lower priority on the non-time-sensitive core 0.

Core 0 IRQ - I2C Network Code

The I2C network is relatively slow (1Mbps), so important events are handled via IRQ on core 0. For the node hosting the game the IRQs are infrequent as DMA can be used. For other nodes that have joined the game, IRQs are potentially fired as often as once per byte, but the amount of data transferred per second is pretty small.

A note on core 0/1 stack consumption

As you’ll see in the section on RAM usage below, the stack sizes for the two cores are different and indeed have been finely tuned to the largest they can be, and yet still fit other things where I wanted them. Extreme care has been taken in avoiding stack overflows on either core, with careful:

Completely by accident as it happens, I created a useful canary between important code/data and the core 1 stack, which is the one that is most constrained. This canary is the cached palette used for the status bar. Any mild stomping off the end of the stack during development caused color corruption of the status bar, rather than a crash!
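
The same idea can be made deliberate with a small guard region that is filled with a known pattern and checked periodically. The sketch below is not taken from RP2040 Doom; the guard size, the pattern value, and the function names are all hypothetical, and on a real target the guard would be placed just beyond the limit of the stack being watched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CANARY_WORDS   8           /* hypothetical guard size */
#define CANARY_PATTERN 0xDEADBEEFu /* hypothetical fill value */

/* Fill the guard region with the known pattern (call once at start-up). */
static void canary_init(uint32_t *guard) {
    for (size_t i = 0; i < CANARY_WORDS; i++) {
        guard[i] = CANARY_PATTERN;
    }
}

/* Returns false if anything has written over the guard region. */
static bool canary_intact(const uint32_t *guard) {
    for (size_t i = 0; i < CANARY_WORDS; i++) {
        if (guard[i] != CANARY_PATTERN) return false;
    }
    return true;
}
```

Checking `canary_intact()` once per frame during development would flag a mild overflow soon after it happens, without relying on a visible side effect.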

General code optimization

Code is generated with -Os, i.e. optimized for size, as without that there just isn’t enough room in 2M of flash for everything to fit. Some key functions are explicitly optimized at a higher, more space-hungry level.
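
One way to raise the optimization level for individual functions in an otherwise -Os build is GCC's `optimize` function attribute. Whether RP2040 Doom uses this exact mechanism is an assumption; the `HOT_FUNC` macro and the `scale_column` routine below are hypothetical and only show the shape of the approach.

```c
#include <stdint.h>

/* GCC-only attribute; other compilers get an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_FUNC __attribute__((optimize("O2")))
#else
#define HOT_FUNC
#endif

/* Hypothetical inner loop: step through a source column in 16:16 fixed point. */
HOT_FUNC
static void scale_column(uint8_t *dest, const uint8_t *src, int count,
                         uint32_t frac, uint32_t step) {
    for (int i = 0; i < count; i++) {
        dest[i] = src[frac >> 16]; /* integer part selects the source texel */
        frac += step;
    }
}
```

Everything not marked with the macro continues to be compiled for size, so the flash cost is limited to the handful of functions where the speed matters.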

Assembly is rarely used, except for:

  1. A few key inner loops.

  2. OPL2 synthesis, which as well as containing some important inner loops, also sports some rather funky control flow to avoid significant Cortex-M0+ function call overheads within inner loops.

  3. An optimized fixed-point multiplication implementation. Doom uses a 16:16 fixed point format, and the simple way to multiply two such numbers in C is to pre-convert one of the numbers to an int64_t, perform a 64x32 multiplication and then downshift the result to preserve the correct significant bits. RP2040 Doom uses a much more compact inlined assembly multiplier for this fixed-point multiplication which avoids calculation of later discarded cross-terms.
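
For reference, the "simple way" from item 3 looks like `fixed_mul_simple` below. `fixed_mul_parts` is a C sketch of how the same result can be assembled from 16-bit halves using only 32-bit multiplies; it is not the assembly routine used in RP2040 Doom, and which terms that routine skips is not shown here. The function names are made up for the example, and the code assumes that right-shifting a negative integer is an arithmetic shift, as it is with GCC and Clang.

```c
#include <stdint.h>

typedef int32_t fixed_t; /* 16:16 fixed point, as in Doom */
#define FRACBITS 16

/* Widen to 64 bits, multiply, then shift the fraction back down. */
static inline fixed_t fixed_mul_simple(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a * b) >> FRACBITS);
}

/* Same result from partial products, with a = ah*65536 + al and b = bh*65536 + bl:
 * (a*b) >> 16 == (ah*bh << 16) + ah*bl + al*bh + ((al*bl) >> 16) */
static inline fixed_t fixed_mul_parts(fixed_t a, fixed_t b) {
    int32_t  ah = a >> FRACBITS;          /* signed integer part */
    uint32_t al = (uint32_t)a & 0xffffu;  /* unsigned fraction */
    int32_t  bh = b >> FRACBITS;
    uint32_t bl = (uint32_t)b & 0xffffu;

    /* Accumulate in uint32_t so that wrap-around is well defined. */
    uint32_t hi  = (uint32_t)(ah * bh) << FRACBITS;
    uint32_t mid = (uint32_t)(ah * (int32_t)bl) + (uint32_t)((int32_t)al * bh);
    uint32_t lo  = (al * bl) >> FRACBITS;
    return (fixed_t)(hi + mid + lo);
}
```

For example, 1.5 is `0x18000` and 2.0 is `0x20000`, and both functions return `0x30000` (3.0) for that pair.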

RP2040 Custom Hardware

RAM Usage

A straight RP2040 compile of the Chocolate Doom source code off which RP2040 Doom is based requires 300K of static mutable data and a minimum of about another 700K of Doom “zone memory” for dynamic allocation.

I have said that the RP2040 has 264K of RAM. This isn’t quite true; it has:

Data in different banks can be accessed by different bus masters concurrently. This is important when considering speed. Particularly, “Scratch X” and “Scratch Y” are little havens of peaceful isolation from other code, and I try to treat them this way.
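
With the Pico SDK, placement into a particular bank is usually expressed with section macros such as `__scratch_x`, `__scratch_y` and `__not_in_flash_func`. The sketch below shows the general shape. The table and function are hypothetical and not taken from RP2040 Doom, and the `PICO_ON_DEVICE` check with no-op fallbacks is an assumption added so that the same file also compiles on a host machine.

```c
#include <stdint.h>

#if PICO_ON_DEVICE
#include "pico.h" /* Pico SDK: __scratch_x, __scratch_y, __not_in_flash_func */
#else
/* Host build: the placement macros become no-ops. */
#define __scratch_x(group)
#define __scratch_y(group)
#define __not_in_flash_func(name) name
#endif

/* Hypothetical hot lookup table, kept in the Scratch Y bank on the device. */
static uint8_t __scratch_y("light_table") light_table[32];

/* Hypothetical hot routine, run from RAM rather than through XIP on the device. */
static uint8_t __not_in_flash_func(apply_light)(uint8_t colour, unsigned level) {
    return (uint8_t)((colour * light_table[level & 31u]) >> 8);
}
```

Because the table lives in its own bank, reads from it do not contend with accesses that other bus masters are making to the main RAM banks.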

Whilst the USB controller is actually used for keyboard input, it doesn’t need all 4K for that, so I steal 3K as RAM!

The following is a rough outline of how RAM is used in RP2040 Doom:

RAM usage

The majority, almost 180K, is related to the display, and the largest chunk of the rest is for dynamic memory allocation.

The Doom “zone memory” and the regular malloc heap are combined into one region, as maintaining too many heaps with separate smaller free spaces and fragmentation is wasteful. The heap is actually 58K, but I showed it as 52K above, because I separated out the audio/scanline buffers that are malloced into separate line items. The “zone memory” part which started at 700K for our Chocolate Doom ancestor now only consumes up to about 45K depending on the level being played.

There are thousands of code changes related to reducing the amount of memory used, so I’m just going to classify the main techniques used below.

Note that a Chocolate Doom executable and a RP2040 Doom executable can be built from the same codebase, so the changes are made with typedefs wherever possible and to a lesser extent #ifdefs. In many cases I may change the meaning of variables, or indeed the contents of structures between the two variants, so I also use new macros or inline methods to provide a common way to access fields, with just the implementation of those macros/methods changing between the builds.
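
To give a flavor of that pattern, here is a minimal sketch in which a pointer field becomes a 16-bit index in the small-RAM build, with callers going through accessors in both builds. Everything in it is hypothetical: `USE_COMPACT_LAYOUT`, `thing_t`, `thing_pool` and the accessor names are invented for the example and are not the actual RP2040 Doom types or macros.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build switch: 1 for the small-RAM build, 0 for the desktop build. */
#ifndef USE_COMPACT_LAYOUT
#define USE_COMPACT_LAYOUT 1
#endif

typedef struct thing thing_t;

#if USE_COMPACT_LAYOUT
typedef uint16_t thing_ref_t; /* pool index + 1, with 0 meaning "none" */
#else
typedef thing_t *thing_ref_t; /* plain pointer */
#endif

struct thing {
    int16_t     health;
    thing_ref_t target; /* only ever touched through the accessors below */
};

#define MAX_THINGS 64
static thing_t thing_pool[MAX_THINGS];

/* Common read accessor; only the implementation differs between builds. */
static inline thing_t *thing_get_target(const thing_t *t) {
#if USE_COMPACT_LAYOUT
    return t->target ? &thing_pool[t->target - 1] : NULL;
#else
    return t->target;
#endif
}

/* Common write accessor; target must be NULL or point into thing_pool. */
static inline void thing_set_target(thing_t *t, thing_t *target) {
#if USE_COMPACT_LAYOUT
    t->target = target ? (thing_ref_t)(target - thing_pool + 1) : 0;
#else
    t->target = target;
#endif
}
```

Game code that calls `thing_get_target()` and `thing_set_target()` compiles unchanged in either build, while the compact layout stores the reference in two bytes instead of a full pointer.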

General RAM saving techniques

There is no doubt much more RAM saving that could be done, however the initial goal was to get the DOOM1.WAD to run on the RP2040 at all skill levels, which it does. All levels in Ultimate Doom and Doom II seem playable on “Hurt Me Plenty”, however I have not played every level through to completion. Some of the levels in these later games can occasionally run out of space for the rendering data structures, causing some areas of the screen to become black, so it would be worth coming back later to reduce the RAM usage more, noting that there are several hundred static variables I haven’t really even scrutinized at all.


Read the next section Music And Sound, or go back to the Introduction.
