
This is part of the series behind the scenes of RP2040 Doom:

See here for some nice videos of RP2040 Doom in action, including the sound.

Sound Introduction

RP2040 Doom attempts to recreate the sound feel of the original Doom game in the ’90s.

One of the main sound cards of the day was the Sound Blaster, which had 8-bit sampled audio output along with a 9-channel Yamaha YM3812 (OPL2) synthesizer as used on the prior Ad Lib card. Another popular card at the time the original Doom was released was the Gravis UltraSound. The latter however used samples for MIDI synthesis, and RP2040 Doom does not have RAM or flash space for the samples, so OPL2 synthesis it is!

The OPL2 chip generates samples at the frequency of 49,716Hz. Other Doom ports may choose to perform down-sampling to other more common frequencies, or perhaps even generate OPL2 samples at a lower frequency, but as the RP2040 can output sound at any frequency I want, I chose to use 49,716Hz as the actual audio output frequency with no resampling for simplicity.

Audio is output in 16-bit stereo over I2S using one of the RP2040’s PIO state machines to encode the output based on raw signed 16-bit samples fed via DMA. Using 8-bit PWM would be equally trivial, but - perhaps incongruously given the 8-bit nature of the time - I figured that 16-bit sound is preferable, and in either case you need some sort of external circuitry. I2S->analogue chips are common and cheap, and whilst the RP2040 can certainly produce PWM sound much better than 8-bit, it starts to take a bit more effort!

Sound Effect Generation

Doom sound effects are 8-bit mono samples, generally, though not necessarily, at 11,025Hz. RP2040 Doom supports 8 channels (which is what vanilla Doom supports) and each channel has volume/pan settings to map onto the stereo output.

Samples are frequency-converted by simple fixed point 16:16 fractional stepping through the sample data, and mixed according to their volume/pan settings into the final 16-bit stereo values.
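The fractional stepping described above can be sketched roughly as follows. This is a minimal illustration, not the RP2040 Doom code: the function names are made up, and it assumes the source samples are unsigned 8-bit values centred on 128.

```c
#include <stdint.h>
#include <stddef.h>

/* 16:16 fixed-point step: how far to advance through the source data
 * for each output sample (65536 means "exactly one source sample"). */
static uint32_t resample_step(uint32_t src_rate_hz, uint32_t out_rate_hz) {
    return (uint32_t)(((uint64_t)src_rate_hz << 16) / out_rate_hz);
}

/* Resample by repeating (or skipping) source samples; no interpolation.
 * Returns the number of output samples written.
 * Assumption: samples are unsigned 8-bit, so 128 is silence.
 * The 16 integer bits of `pos` limit the source to fewer than 65536 samples. */
static size_t resample_u8(const uint8_t *src, size_t src_len, uint32_t step,
                          int16_t *out, size_t out_len) {
    uint32_t pos = 0; /* 16:16 position within src */
    size_t n = 0;
    while (n < out_len && (pos >> 16) < src_len) {
        out[n++] = (int16_t)((int)src[pos >> 16] - 128);
        pos += step;
    }
    return n;
}
```

For an 11,025Hz effect played at 49,716Hz the step works out to 14533/65536, or roughly 0.22 source samples per output sample.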

In order to make RP2040 Doom fit on a Raspberry Pi Pico, the sound effects needed to be compressed, and this is done with ADPCM, using the ADPCM-XQ library. This ends up with 4 bits per sample but largely imperceptible quality loss. Actually, I believe the Sound Blaster of the day may also have stored samples in 4 bits using ADPCM, so maybe this is somewhat historically accurate!

For the best speed on a Cortex M0+, it is very important to keep loops tight, keep everything in registers and avoid function calls within tight loops. The RP2040 has very fast RAM access, with loads/stores being 2 cycles, and bulk loads/stores shaving more off that. It therefore tends to be much more efficient to do everything for a single channel first, then everything for the next channel, even if doing so means writing samples to a temporary buffer, rather than attempting to do everything for a single sample across all channels before moving onto the next sample.

The sound code runs an optimized tight loop for each channel, updating the 16:16 fractional sample position, loading the next 8-bit sample, applying volume/pan multipliers, and adding the value, with clamping, into the final 16-bit stereo 1024 sample output buffer. Note that 1024 is a completely arbitrary buffer size, that I chose early on, and never found a reason to change! As a result a new sound buffer must be generated on average about every 20.6ms (1024 samples at 49,716Hz).
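A per-channel mixing loop of the kind described above might look something like this. It is only a sketch: `mix_channel`, `clamp16` and the 0..255 volume range are illustrative assumptions, and the real code is considerably more hand-tuned.

```c
#include <stdint.h>
#include <stddef.h>

/* Saturate a 32-bit intermediate value to the signed 16-bit output range. */
static int16_t clamp16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Mix one channel into an interleaved L/R buffer of `frames` stereo frames.
 * Assumptions for illustration: unsigned 8-bit source samples, and
 * vol_l / vol_r are hypothetical per-channel gains in the range 0..255
 * that already combine volume and pan. */
static void mix_channel(const uint8_t *src, size_t src_len, uint32_t step,
                        int vol_l, int vol_r, int16_t *stereo, size_t frames) {
    uint32_t pos = 0; /* 16:16 position within src */
    for (size_t i = 0; i < frames && (pos >> 16) < src_len; i++) {
        int s = (int)src[pos >> 16] - 128;
        stereo[2 * i]     = clamp16(stereo[2 * i] + s * vol_l);
        stereo[2 * i + 1] = clamp16(stereo[2 * i + 1] + s * vol_r);
        pos += step;
    }
}
```

The output buffer is cleared once, and then `mix_channel` is called for each active channel in turn, so that each call is one tight loop over the whole buffer.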

The 8-bit samples obviously have to come from the ADPCM compressed data. Again, for speed on a Cortex-M0+ it is better to not mix concerns, so a small 8-bit buffer of decompressed samples is kept for each of the 8 channels. The buffer size is 249, as this is the size of the ADPCM block being used. When a channel reaches the end of the decompressed buffer, which at 11,025Hz is after about 22.6ms, the tight loop pauses, and the ADPCM decoder is called to decompress another 249 samples of data.
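The pause-and-refill pattern can be sketched like this. Everything here is hypothetical scaffolding: `decode_block_fn` stands in for whatever actually decodes an ADPCM block (it is not the ADPCM-XQ API), and `fake_decode` just emits a ramp so that the sketch runs on its own.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SAMPLES 249 /* decoded samples per ADPCM block, per the text */

/* Hypothetical decoder hook: writes up to `max` samples, returns the count
 * (0 at the end of the sound). */
typedef size_t (*decode_block_fn)(void *ctx, uint8_t *out, size_t max);

typedef struct {
    uint8_t decoded[BLOCK_SAMPLES]; /* small per-channel decode buffer */
    size_t count;                   /* valid samples in decoded[] */
    uint32_t pos;                   /* 16:16 position within decoded[] */
    uint32_t step;                  /* 16:16 step per output sample */
    decode_block_fn decode;
    void *ctx;
} channel_t;

/* Produce up to `want` samples; returns fewer only when the sound ends. */
static size_t channel_read(channel_t *ch, int16_t *out, size_t want) {
    size_t n = 0;
    while (n < want) {
        if ((ch->pos >> 16) >= ch->count) {
            /* Buffer exhausted: keep the fractional position, then refill. */
            ch->pos -= (uint32_t)ch->count << 16;
            ch->count = ch->decode(ch->ctx, ch->decoded, BLOCK_SAMPLES);
            if (ch->count == 0) break;
            continue;
        }
        /* Tight loop: no decoder calls until the buffer runs out. */
        while (n < want && (ch->pos >> 16) < ch->count) {
            out[n++] = (int16_t)((int)ch->decoded[ch->pos >> 16] - 128);
            ch->pos += ch->step;
        }
    }
    return n;
}

/* Stand-in "decoder" for illustration only: emits a wrapping 8-bit ramp. */
typedef struct { size_t remaining; uint8_t next; } fake_source_t;

static size_t fake_decode(void *ctx, uint8_t *out, size_t max) {
    fake_source_t *src = ctx;
    size_t n = src->remaining < max ? src->remaining : max;
    for (size_t i = 0; i < n; i++) out[i] = src->next++;
    src->remaining -= n;
    return n;
}
```

Keeping the decoder call outside the inner loop is the point of the sketch: the mixing loop only ever sees a plain array of 8-bit samples.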

I did try some low-pass filtering of the samples after upsampling them from 11,025Hz to 49,716Hz, which is really not that expensive on the RP2040, but it didn’t produce any noticeable (to me) improvement.

OPL2

The OPL2 (Yamaha YM3812) is a music synthesizer supporting 9 normal melodic channels as well as noise-based percussion sounds. You can read more about it on Wikipedia here.

Synthesis proceeds as follows:

18 Oscillators

Each oscillator starts with a waveform at a particular frequency; the following waveforms are supported:

YM3812 operation waveforms

The wave amplitude is controlled with an ADSR envelope, which models the volume changes as a note is pressed, held and released. Each oscillator also supports tremolo and vibrato.
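As a rough picture of what an ADSR (attack, decay, sustain, release) envelope does, here is a deliberately simplified linear version. It is an assumption-laden sketch: the names and the linear 0..1024 level are invented for illustration, whereas the real chip works with attenuation levels and rate tables.

```c
#include <stdint.h>

#define ENV_MAX 1024 /* hypothetical full-volume level */

typedef enum { ENV_ATTACK, ENV_DECAY, ENV_SUSTAIN, ENV_RELEASE, ENV_OFF } env_stage_t;

typedef struct {
    env_stage_t stage;
    int level;         /* current level, 0..ENV_MAX */
    int attack_rate;   /* level gained per step while attacking */
    int decay_rate;    /* level lost per step while decaying */
    int sustain_level; /* level held while the note stays pressed */
    int release_rate;  /* level lost per step after the note is released */
} envelope_t;

static void env_key_on(envelope_t *e)  { e->stage = ENV_ATTACK; }
static void env_key_off(envelope_t *e) { if (e->stage != ENV_OFF) e->stage = ENV_RELEASE; }

/* Advance the envelope by one step and return the new level. */
static int env_step(envelope_t *e) {
    switch (e->stage) {
    case ENV_ATTACK:
        e->level += e->attack_rate;
        if (e->level >= ENV_MAX) { e->level = ENV_MAX; e->stage = ENV_DECAY; }
        break;
    case ENV_DECAY:
        e->level -= e->decay_rate;
        if (e->level <= e->sustain_level) { e->level = e->sustain_level; e->stage = ENV_SUSTAIN; }
        break;
    case ENV_RELEASE:
        e->level -= e->release_rate;
        if (e->level <= 0) { e->level = 0; e->stage = ENV_OFF; }
        break;
    case ENV_SUSTAIN:
    case ENV_OFF:
        break;
    }
    return e->level;
}
```

The oscillator output is scaled by the returned level each step, which is what shapes a note from its initial strike through to silence.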

9 Melodic Channels

Each of the 9 melodic channels is formed by combining up to two oscillators, as follows:

  1. A single oscillator output can be used.
  2. The oscillator outputs may be added together (AM).
  3. One oscillator output can be used to modulate the frequency of the other oscillator (FM).
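The three ways of combining the oscillators listed above can be sketched as follows. This is a toy model under stated assumptions: the 32-entry sine table, the phase format and the modulation depth are all chosen for readability rather than taken from the chip or from emu8950, and envelopes are left out entirely.

```c
#include <stdint.h>

/* Coarse sine table for illustration: round(127 * sin(2*pi*i/32)). */
static const int8_t SINE[32] = {
       0,   25,   49,   71,   90,  106,  117,  125,
     127,  125,  117,  106,   90,   71,   49,   25,
       0,  -25,  -49,  -71,  -90, -106, -117, -125,
    -127, -125, -117, -106,  -90,  -71,  -49,  -25
};

/* Phase is a 32-bit value where the full range is one cycle. */
static int sine32(uint32_t phase) { return SINE[(phase >> 27) & 31]; }

typedef enum { ALG_SINGLE, ALG_ADDITIVE, ALG_FM } algorithm_t;

/* One output sample for a two-oscillator channel ("modulator" + "carrier"). */
static int channel_sample(algorithm_t alg, uint32_t mod_phase, uint32_t car_phase) {
    int mod = sine32(mod_phase);
    switch (alg) {
    case ALG_SINGLE:   /* 1. just one oscillator */
        return sine32(car_phase);
    case ALG_ADDITIVE: /* 2. the two outputs summed */
        return mod + sine32(car_phase);
    case ALG_FM:       /* 3. modulator output offsets the carrier's phase */
        return sine32(car_phase + ((uint32_t)mod << 24));
    }
    return 0;
}
```

In the FM case the modulator's output is added to the carrier's phase before the sine lookup, so a silent modulator leaves the carrier unchanged.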

Percussion Channels

The YM3812 supports a percussion mode where digital bits from 3 of the oscillators are combined to produce 5 percussion channels. This is actually quite complex, and I was very pleased to discover after a while, that Doom does not use them, and so they can be ignored completely! I do not implement them, saving valuable CPU time.

OPL2 emulation

I tried a variety of different OPL2 emulators, both hooked into Chocolate Doom on my PC and playing individual songs on the RP2040, where I didn’t yet have a full working Doom build.

Picking a codebase to work from

Chocolate Doom uses the Nuked OPL3 Emulator. This was the first one I tried, however even running at the final 270MHz overclock used for RP2040 Doom, this takes 300% of the CPU time on one core. I took a look at the code, decided I had no idea what was going on, and decided to go in search of another OPL2 emulator. In retrospect, looking at the code now, this code might have been as good a starting point as my final choice, but I didn’t realize that at the time.

I next looked at the Woody OPL2 emulator that originated in DOSBox. This looked a little more comprehensible, and seemed to do work more linearly per oscillator, rather than doing everything across the board one sample at a time like Nuked OPL3 does. Sadly, the Woody emulator uses floating point, and RP2040 has no hardware floating point support. I spent a while converting the floating point to fixed point, which seemed to be promising for a while, but soon turned into a quagmire.

Around that time, I read some more about the OPL2 internals, and realized that the real chip produced its output using simple integer math with no multiplies via canny use of logarithms and lookup tables. This made my attempts to force a natively floating point “Woody” emulator into fixed point look particularly stupid. Therefore, I searched again, and found emu8950, which seemed a lot more comprehensible given my new understanding, and decided to try working from there.
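The logarithm trick can be written down in one line. The symbols here are my own notation rather than anything from the chip documentation: $\phi$ is the oscillator phase, $A$ with $0 < A \le 1$ is the envelope amplitude, and $L_{\sin}$ and $L_A$ are the corresponding attenuations in the log domain.

```latex
\[
y(\phi) \;=\; A \sin\phi
\;=\; \operatorname{sgn}(\sin\phi)\cdot 2^{-\left(L_{\sin}(\phi) + L_{A}\right)},
\qquad
L_{\sin}(\phi) = -\log_2\lvert\sin\phi\rvert,\quad
L_{A} = -\log_2 A .
\]
```

With $L_{\sin}$ read from one lookup table and the power of two read from another, applying the envelope to the waveform becomes an integer addition between two table lookups, with no multiply needed. Table sizes and scaling are omitted here.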

Optimizing the code

I had my choice of OPL2 emulator, but emu8950 was still too slow to run on the RP2040, using considerably more than 100% of one overclocked core. I wanted to be able to treat the OPL2 generation almost as background noise (sorry) in terms of CPU usage, so I needed to speed things up a lot.

The original emu8950 code has a function call to generate a sample that loops over all the channels, calling functions to generate the oscillator outputs for that time point. These functions then call more functions to generate the sine wave, apply the envelope, advance the envelope etc. All in all, it is a whole tree of function calls per sample, which is fine on a fast PC, but not good on a Cortex M0+ where the overhead of a function call is generally 30-40 cycles, with the corresponding loss of most of your precious “in-register” state.

Therefore, RP2040 Doom turns the code inside out. The innermost loops are for generating the sample values for a single oscillator into a buffer, and then buffers for the different oscillators are combined and so on. This is a fundamental shift in the implementation of emu8950, so was implemented as an alternative via #ifdef and then thoroughly tested by diffing textual output of every oscillator for every sample between the two versions of the code.
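The shape of that transformation, and the way the two versions can be checked against each other, is roughly as below. The oscillator is a toy sawtooth and all names are invented for the example. Nothing here is emu8950 code.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_OSC 4
#define BUF_LEN 64

typedef struct { uint32_t phase, step; int level; } osc_t;

/* Toy oscillator: a sawtooth scaled by a fixed level. */
static int osc_next(osc_t *o) {
    int s = (int)(o->phase >> 24) - 128; /* -128..127 */
    o->phase += o->step;
    return s * o->level;
}

/* Original shape: for every sample, call into every oscillator. */
static void render_sample_major(osc_t *osc, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t sum = 0;
        for (int k = 0; k < NUM_OSC; k++) sum += osc_next(&osc[k]);
        out[i] = sum;
    }
}

/* Inside-out shape: for every oscillator, one tight loop over the whole
 * buffer with the oscillator state held in locals (ideally registers). */
static void render_osc_major(osc_t *osc, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = 0;
    for (int k = 0; k < NUM_OSC; k++) {
        uint32_t phase = osc[k].phase;
        const uint32_t step = osc[k].step;
        const int level = osc[k].level;
        for (size_t i = 0; i < n; i++) {
            out[i] += ((int)(phase >> 24) - 128) * level;
            phase += step;
        }
        osc[k].phase = phase; /* write state back once per buffer */
    }
}
```

Rendering the same starting state through both functions and comparing the buffers value by value is a small-scale version of the diffing approach described above.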

Further optimization work was then performed, still not yet moving to the RP2040 itself.

At this point, with everything sample accurate, it was time to really dive in with optimizations, so that OPL2 generation could be something that could be forgotten about even on the RP2040 device.

The result

The result of all the hard optimization work is that our OPL2 generation of 1024 samples for up to 9 channels takes anywhere between about 1.5ms and 4.5ms, averaging around 2ms.

In other words, we use just 5%-20% of one overclocked core, which is really great, and about a 30x speed improvement from where it all started!


Read the next section Network Games, or go back to the Introduction.

Find me on twitter.