And here is something being done by John Carmack, where the issue is ensuring that the image on a head-mounted display closely tracks head movement. The immersive feel is totally due to getting low latency, so that when you move your head around, you don't detect lag in the image.
Carmack recently lamented in a tweet that typically, you can send a transatlantic ping faster than you can get a pixel to the screen.
So, there are some analogies here that are relevant.
- Render an Image Frame (width*height pixels) at 60fps, which is the Screen Refresh Rate. Frames are presented on VSync, the interval in which you can draw to the screen without image tearing artifacts. If you miss this deadline, you get choppy video. Color depth is generally 8 bits per channel, or 32 bits total for red, green, blue, and alpha.
- Touches come in from the OS at some unknown rate that seems to be at around 200fps. I will call this the Control Rate. It is the rate at which we can detect that some control has changed. It is important because people are very sensitive to audio latency when turning touch changes into audible sound changes.
- Like the Image Frame rendered in a Screen Refresh, an Audio Frame is made of samples (an analog to pixels). At 44.1 kHz, 256 samples last about 5.8ms, which works out to roughly 170 to 200 frames per second when you think of sending a finished frame off to the audio API like an audio VSync. If you miss this deadline, you will get audible popping noises, so you really cannot ever miss it. The bit depth is going to be either 16 or 32 bits, and the internal processing before final rendition is generally done in 32-bit floating point arithmetic. And analogous to color channels, there will generally be a stereo rendition, which doubles the data output.
In the same way, audio frames could run at close to the screen refresh rate (1024 samples, which is about 43 frames per second at 44.1 kHz), and the "smoothness" of the audio would not be an issue. But taking 23ms to respond to control changes feels like an eternity for a musical instrument. The control period really needs to be somewhere around 5ms. Note that the amount of data emitted per Audio Frame is roughly the same as 1 line of the display frame in most cases. But we need to render Audio Frames 4 to 8x as often.
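The frame arithmetic above is easy to check by hand. The following is a minimal sketch, assuming stereo 32-bit float output and a 1024-pixel-wide display with 4 bytes per pixel; the function names and the display width are illustrative assumptions, not values from any particular device.

```python
def audio_frame_stats(frame_samples, sample_rate_hz=44100,
                      channels=2, bytes_per_sample=4):
    """Return (duration in ms, frames per second, bytes per frame)."""
    duration_ms = 1000.0 * frame_samples / sample_rate_hz
    frames_per_second = sample_rate_hz / frame_samples
    frame_bytes = frame_samples * channels * bytes_per_sample
    return duration_ms, frames_per_second, frame_bytes


def display_line_bytes(width_pixels=1024, bytes_per_pixel=4):
    """Bytes in one scanline (the width is an assumed example value)."""
    return width_pixels * bytes_per_pixel


if __name__ == "__main__":
    for n in (64, 256, 1024):
        ms, fps, size = audio_frame_stats(n)
        print(f"{n:5d} samples: {ms:6.2f} ms, {fps:7.1f} frames/s, {size} bytes")
    print("one display line:", display_line_bytes(), "bytes")
```

Under these assumptions a 256-sample stereo frame is 2048 bytes, against 4096 bytes for one 1024-pixel line, so the two are within a small factor of each other.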
CSound is a very old and well-known computer synthesizer package, and one of the first things you put into the file when you are working with it is the control rate versus the sample rate. In the setup described here, the sample rate (e.g. 44100 Hz) is on the order of 200x the control rate (220 Hz), and the control rate is 4 to 8x the screen refresh rate (60 Hz).
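The split between the two rates can be sketched as a pair of nested loops: control values are read once per block, and samples are computed inside the block. This is only an illustration of the idea, not CSound's implementation; `KSMPS` borrows CSound's name for samples per control period, while `render` and `read_control` are hypothetical names.

```python
import math

SAMPLE_RATE = 44100                  # samples per second
KSMPS = 200                          # samples per control period (assumed value)
CONTROL_RATE = SAMPLE_RATE / KSMPS   # 220.5 control updates per second


def render(num_blocks, freq_hz, read_control):
    """Render a sine tone whose amplitude is updated once per block."""
    out = []
    phase = 0.0
    step = 2.0 * math.pi * freq_hz / SAMPLE_RATE
    for block in range(num_blocks):
        amp = read_control(block)    # control rate: once per block
        for _ in range(KSMPS):       # sample rate: once per sample
            out.append(amp * math.sin(phase))
            phase += step
    return out


if __name__ == "__main__":
    # A hypothetical control source: quiet for the first block, then louder.
    samples = render(3, 440.0, lambda b: 0.5 if b == 0 else 1.0)
    print(len(samples), "samples at control rate", CONTROL_RATE, "Hz")
```

A touch that arrives mid-block is not heard until the next block starts, which is why the block length bounds the control latency.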
GPUs and Rates
So, with this hierarchy of rates in use, it is fortunate that the higher rates have less data to output. A GPU is designed to do massive amounts of work during the period between VSync events. Video textures are generally loaded into the card and used across many Image Frames, with the main thing being that at 60fps geometry must be fed in, and in response (width*height) pixels need to be extracted.
Using the GPU for audio means running it at only 4 to 8x the video frame rate (to match the control rate). And because the sample rate is on the order of 200x the control rate, we expect to return merely hundreds of samples at the end of each audio frame. This way of working makes sense if there is a lot of per-sample parallelism in the synthesis (as there is with wavetable synthesis).
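To illustrate what per-sample parallelism means for wavetable synthesis, here is a minimal sketch in which each output sample is computed from its own index alone, the way a GPU kernel would compute one pixel. The table size, the linear interpolation, and the function names are all assumptions made for the example; it runs serially in plain Python and only shows that the samples do not depend on each other.

```python
import math

TABLE_SIZE = 1024
# One cycle of a sine wave; loaded once, like a texture.
WAVETABLE = [math.sin(2.0 * math.pi * k / TABLE_SIZE) for k in range(TABLE_SIZE)]


def wavetable_sample(i, start_phase, phase_inc):
    """Compute sample i from its index alone (no dependence on sample i-1)."""
    pos = (start_phase + i * phase_inc) % TABLE_SIZE
    k = int(pos)
    frac = pos - k
    a = WAVETABLE[k]
    b = WAVETABLE[(k + 1) % TABLE_SIZE]
    return a + frac * (b - a)        # linear interpolation between entries


def render_frame(frame_samples, freq_hz, sample_rate_hz=44100, start_phase=0.0):
    """Render one audio frame; every list element could be its own thread."""
    inc = freq_hz * TABLE_SIZE / sample_rate_hz
    frame = [wavetable_sample(i, start_phase, inc) for i in range(frame_samples)]
    next_phase = (start_phase + frame_samples * inc) % TABLE_SIZE
    return frame, next_phase


if __name__ == "__main__":
    frame, phase = render_frame(256, 440.0)
    print(len(frame), "samples, next frame starts at table position", phase)
```

The only state carried from one frame to the next is the starting phase, which is a single number per voice.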
Similar to video textures, the wavetables could be loaded into the GPU early on during initialization. From that point on, control changes going in and new audio buffers coming out would account for all the traffic between the GPU and the rest of the system. If there were an audio equivalent to a GPU, then the idea would be to set the sample rate for the Audio GPU, and simply let the card emit the samples to audio without a memory copy at the end of its operation. Currently, I use vDSP, which provides some SIMD parallelism, but it certainly doesn't guarantee that it will run its vector operations across a sample buffer as large as 256 samples in one operation. I believe that it will generally do 128 bits at a time and pipeline the instructions.
In fact, the CPU is optimized for single-thread performance: latency is minimized for a single thread. For GPUs, latency is kept low enough to guarantee that the frame will reliably be ready at 60fps, with as much processing as possible allowed in that time (i.e. throughput on large batches). An audio-optimized "GPU" may have a slightly different set of requirements that is somewhere in the middle. It doesn't emit millions of pixels when it completes. But it could have large internal buffers for FFTs and echoes, and emit smaller buffers.
In fact, the faster it runs, the smaller the buffers it can afford to output (to increase the control rate). In the ideal case, the buffer is size 1, and it does 44100 of these per second. But realistically, the buffers will be from 64 to 1024 samples, depending on the application. Musical instruments really can't go above 256 samples without suffering.
At a small number of samples per buffer, the magic of this GPU would be in doing vector arithmetic on the buffers that it keeps for internal state, so that it can quickly do a lot of arithmetic on the small amount of data coming in from control changes. This would be for averaging together elements from wavetables. The FFTs are not completely parallel, but they do benefit from vector arithmetic. It's also the case that for wavetable synthesis, there is a lot of parallelism at the beginning of the audio processing chain, and the non-parallel things come last in the chain, at the point where the number of voices is irrelevant and it's generally running effects on the final raw signal.
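The shape of that processing chain can be sketched as follows: a mixing stage where every output sample is an independent sum across voices, followed by an effect whose output depends on its own previous output. The one-pole low-pass filter stands in for "an effect on the final signal" and is my choice for the example, as are the names `mix_voices` and `OnePoleLowPass`.

```python
def mix_voices(voices):
    """Sum voices sample by sample; each sum is independent of the others."""
    return [sum(column) for column in zip(*voices)]


class OnePoleLowPass:
    """Serial effect: y[n] = y[n-1] + a * (x[n] - y[n-1])."""

    def __init__(self, coefficient):
        self.coefficient = coefficient
        self.last = 0.0              # state carried across audio frames

    def process(self, frame):
        out = []
        for x in frame:              # must run in order: each y needs the previous y
            self.last += self.coefficient * (x - self.last)
            out.append(self.last)
        return out


if __name__ == "__main__":
    voices = [[0.25, 0.5, 0.25, 0.0],    # made-up sample data for two voices
              [0.25, 0.0, 0.25, 0.5]]
    mixed = mix_voices(voices)           # parallel stage, cost grows with voices
    smooth = OnePoleLowPass(0.5)
    print(smooth.process(mixed))         # serial stage, cost ignores voice count
```

Adding voices only widens the first stage. The filter runs once per output sample no matter how many voices fed the mix.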