Cantor's sound engine is the first of my three apps where I am experimenting with parallelism and the complexities of handling audio aliasing (rather than taking a "don't do that" approach). I never got into instrument making thinking I would get sucked into actual audio synthesis, but it's unavoidable because MIDI has far too many problems with respect to setup and consistency. The fact that I insist upon fully working microtonality almost makes MIDI untenable. I don't have the electrical engineering background to compete with the great sound engines in the store; I got into this because the playability of iOS instruments is totally laughable at the moment. I think anybody who can write a simple iOS app can absolutely run circles around the current crop of piano skeuomorphisms that rule the tablet synthesizer world today, if you judge by the playability of the resulting instrument. So on that basis, I'm not giving up just because the Moogs and Korgs of the world are putting out apps of their own. :-) But for now, because MIDI still sucks for this use case, I still need an internal engine; most users will never get past the setup barrier that MIDI imposes.
So, one of the more surprising revelations when I started into this was that you can't avoid multi-sampling, even if you aren't making a sample-based instrument, and even if you only have one sound. The idea that you can just speed up or slow down a digital sample to change its pitch is only half-true. The same sample played back faster than its original speed is not actually the same wave shape as the original. To go up an octave you not only drop half the samples, you also need to cut out the frequency content that the higher playback rate can no longer accurately represent (otherwise it aliases). Because this instrument doesn't have a discrete set of notes, and is inherently fretless, it also needs perfect continuity when bending from the lowest to the highest note. So the function that defines a wave cycle doesn't just take phase into account; the maximum frequency at which you might play it back also determines its shape.
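As a concrete (and heavily simplified) sketch of that last point, here is how a band-limited wave cycle could be built additively for a sawtooth, keeping only the harmonics that survive at the highest pitch the table will ever be played at. This is illustrative, not Cantor's actual code; the function name, the choice of sawtooth, and the normalization are all assumptions.

#include <math.h>

/* Fill one wave cycle, keeping only harmonics that stay below Nyquist at the
   highest pitch this table will ever be played back at. Hypothetical helper,
   not Cantor's engine code. */
static void buildBandlimitedSaw(float *table, int tableLen,
                                float maxPlaybackHz, float sampleRateHz)
{
    int maxHarmonic = (int)floorf((sampleRateHz * 0.5f) / maxPlaybackHz);
    for (int i = 0; i < tableLen; i++) {
        float phase = (float)i / (float)tableLen;   /* 0..1 over one cycle */
        float s = 0.0f;
        for (int h = 1; h <= maxHarmonic; h++) {
            s += sinf(2.0f * (float)M_PI * h * phase) / (float)h;
        }
        table[i] = s * (2.0f / (float)M_PI);        /* rough normalization */
    }
}

A table meant to be bent up to 2kHz at a 44.1kHz sample rate keeps only about 11 harmonics, while the same waveform built for a low octave keeps far more; that difference in shape is exactly why one table can't serve every pitch.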
And also, because I have a life outside of iOS instrument making, I am always looking for tie-ins to things that I need to know more generally to survive in the job market. I am currently very interested in the various forms of parallel computing; specifically vector computing. Ultimately, I am interested in OpenCL, SIMD, and GPU computing. This matches up well, because with wavetable synthesis, it is possible in theory to render every sample in the sample buffer in parallel for every voice (multiple voices per finger when chorusing is taken into account).
So with Cantor, I used Apple's Accelerate.framework to make this part of the code parallel. The results so far have been great. I have been watching to see if OpenCL will become public on iOS (no luck yet), but I am preparing for that day. The main question will be whether I can use the GPU to render sound buffers at a reasonable rate (ie: 10ms?). That would be like generating audio buffers at 100fps. The main thing to keep in mind is the control rate: the rate at which an audio parameter changes (like a voice turning on) and how long until that change is audible. If the GPU were locked to the screen refresh rate and that rate was 60fps, then audio latency would be too high. 200fps would be good, but we would only be outputting around 256 samples per frame at that rate. The main thing is that we target a control rate, and ensure that the audio is fully rendered to cover the time between control changes. It might even be advantageous to render ahead (ie: 1024 samples) and invalidate samples when a parameter changes, which would allow audio to be processed in larger batches most of the time without suffering latency.
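The render-ahead idea at the end of that paragraph could look something like the sketch below: render a few 256-sample chunks speculatively, and throw away anything not yet consumed whenever a control parameter changes. Only the 256-sample chunk size comes from the discussion above; the struct, the names, and the four-chunk lookahead are assumptions for illustration.

#define CHUNK     256            /* one control-rate block, per the text */
#define LOOKAHEAD 4              /* assumption: render up to 1024 samples ahead */

typedef struct {
    float buffer[CHUNK * LOOKAHEAD];
    int   validChunks;           /* chunks rendered but not yet consumed */
} RenderAhead;

/* Called when a finger moves (roughly at the control rate): the speculative
   audio no longer matches the new parameters, so invalidate it. */
static void onControlChange(RenderAhead *ra)
{
    ra->validChunks = 0;
}

/* Called from the audio side; renderChunk() stands in for the real synth. */
static void ensureAudio(RenderAhead *ra, void (*renderChunk)(float *out))
{
    while (ra->validChunks < LOOKAHEAD) {
        renderChunk(&ra->buffer[ra->validChunks * CHUNK]);
        ra->validChunks++;
    }
}

At 44.1kHz, four chunks is roughly 23ms of speculative audio, so the batching win only materializes when parameters hold still for several control periods.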
Todo:
One thing I am not doing explicitly so far is handling interpolation between samples in these sample buffers. I compensate by simply using large buffers for the single-cycle wave (1024 samples). The single-cycle wave in the buffer is circular, so linear interpolation or a cubic spline between adjacent samples is plausible. There is also the fact that as you go up an octave, the wave is effectively oversampled by 2x because you skip half the samples. Because of this, a single wave represented at every octave could fit in a space that's about 2x the original size (n + n/2 + n/4 + n/8 + ...).
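A minimal sketch of what that could look like, assuming a per-octave set of band-limited tables laid out in the n + n/2 + n/4 + ... scheme and plain linear interpolation with circular wrap-around; the struct and names are hypothetical:

typedef struct {
    const float *tables[8];    /* tables[k] covers pitches up to 2^k times the base */
    int          lengths[8];   /* 1024, 512, 256, ... per the halving scheme */
} WaveMip;

static float readWave(const WaveMip *w, int octave, float phase /* 0..1 */)
{
    const float *t = w->tables[octave];
    int   n     = w->lengths[octave];
    float pos   = phase * (float)n;
    int   whole = (int)pos;
    float frac  = pos - (float)whole;
    int   i0    = whole % n;
    int   i1    = (i0 + 1) % n;              /* wrap: the single cycle is circular */
    return t[i0] + frac * (t[i1] - t[i0]);   /* linear interpolation */
}

Choosing the octave from the current bend-adjusted pitch keeps the lookup band-limited even while a finger slides continuously.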
And finally, there is the issue of the control splines themselves. Currently an ad-hoc ramp is applied to change the volume, pitch, and timbre for a voice. It's probably better to think of it as a spline per voice. The spline could be simple linear or possibly cubic, or it could limit how fast parameters change to prevent impulsing (ie: more explicit volume ramping).
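The simplest version of that per-voice idea is a linear slew toward a target value, spread over one control period so the parameter never jumps and produces an audible impulse. The struct and names here are illustrative rather than Cantor's:

typedef struct {
    float current;
    float target;
    float step;          /* per-sample increment */
} ControlRamp;

static void rampSetTarget(ControlRamp *r, float target, int samplesToReach)
{
    r->target = target;
    r->step   = (target - r->current) / (float)samplesToReach;
}

static float rampTick(ControlRamp *r)
{
    if ((r->step >= 0.0f && r->current >= r->target) ||
        (r->step <  0.0f && r->current <= r->target)) {
        r->current = r->target;      /* arrived: hold the value */
    } else {
        r->current += r->step;
    }
    return r->current;
}

A cubic segment would smooth the corners further, but even the linear version removes the worst of the zipper noise.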
Post Processing:
The part that generates the voices is easily done in parallel, which is why I am thinking about GPU computation. In an ideal world, I load all of the samples into the card, and when fingers move (at roughly the control rate), the kernel is re-invoked with new parameters. These kernels always complete by returning enough audio data to cover the time until the next possible control change (ie: 256 samples), or possibly more than that (ie: 2 or 3 chunks into the future that might get invalidated by later control changes) if it doesn't increase the latency of running the kernel.
I am not sure if final steps like distortion and convolution reverb can be worked into this scheme, but it seems plausible. If that can be done, then the entire audio engine can essentially run on the GPU, and one of the good side effects would be that patches could be written as a bundle of shader code and audio samples. That would allow for third-party contributions, or an app that generates the patches, because they wouldn't need to be compiled into the app when it ships. This is very much how Geo Synthesizer and SampleWiz work together right now, except all we can do is replace samples. But the main question is whether I can get a satisfactory framerate out of the GPU. I have been told that I won't beat what I am doing any time soon, but I will investigate it anyway because I have an interest in what is going on with CUDA and OpenCL.
OSC, MIDI, Control:
And of course, it would be a waste to make such an engine and only embed it into the synth. It might be best to put an OSC interface on it to allow the controller, synth, and patch editor to evolve separately. But if OSC were used to eliminate the complexity of using MIDI (the protocol is the problem... the pipe is nice), then whether to use a mach port or a TCP port becomes a question, along with what kinds of setup hassles and latency issues then become unavoidable.
Aren't there security concerns involved with running 3rd party shader code? Will Apple even allow it?
Fascinating otherwise. I wonder how you could become even more parallel than just a voice per shader core.
GLSLStudio got past this concern, and apparently he had to arm wrestle with Apple to make it happen. Anyways, it's not the per-voice parallelism that's most interesting. It's the per-sample parallelism. My current vDSP code is written as if all 256 samples for the current voice are written in parallel. Of course there are probably only 4 SIMD lanes (ie: 4 floats at a time), but it has good locality already. And the code is ready for an actual 256 samples in parallel, and it could be extended to go across voices as well. (Well, all parallel for 3 chorus voices times number of fingers times number of samples, then they are added into fingers times voices, and then the final effects like reverb probably require an FFT or some other processing that is not completely independent.) The major question, which I can only answer by finally getting an engine written, is what framerate I can get. For 5ms response, that's 200fps. That's roughly 256-sample buffers being emitted about every 5ms; the control rate.
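To make the "as if all 256 samples are written in parallel" point concrete, here is a rough sketch of one voice's chunk written with whole-array vDSP calls instead of a per-sample loop. The function and its parameters are hypothetical, not Cantor's code; only vDSP_vramp, vDSP_vmul, and vDSP_vadd are real Accelerate calls.

#include <Accelerate/Accelerate.h>

#define FRAMES 256   /* one control-rate chunk */

static void renderVoiceChunk(const float *voiceWave,   /* FRAMES of raw oscillator output */
                             float startGain, float endGain,
                             float *mixBus)             /* FRAMES accumulator */
{
    float gainRamp[FRAMES];
    float shaped[FRAMES];
    float increment = (endGain - startGain) / (float)FRAMES;

    /* gainRamp[i] = startGain + i * increment, all FRAMES values in one call */
    vDSP_vramp(&startGain, &increment, gainRamp, 1, FRAMES);

    /* shaped = voiceWave * gainRamp, element-wise */
    vDSP_vmul(voiceWave, 1, gainRamp, 1, shaped, 1, FRAMES);

    /* mixBus += shaped */
    vDSP_vadd(mixBus, 1, shaped, 1, mixBus, 1, FRAMES);
}

The same structure extends across voices by lengthening the arrays (voices times FRAMES) rather than by adding threads.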
Ah, okay. That makes a lot of sense. I wish I had more to contribute as a followup, but I'm not familiar with the iPhone graphics hardware or anything. Thanks for taking the time to explain!
The main thing, which doesn't require any detailed knowledge of how it will get done in graphics cards, is that there is going to be one instruction at a time operating on huge arrays of data, rather than threads. ArrayC = ArrayA + ArrayB, etc. The loop goes inside the instruction rather than looping over single-element instructions. vDSP (not running on the GPU, but on the CPU with batches of size 4 or 8) has a huge number of operations that work on entire arrays (add, mult, negate, mod, etc). This is how graphics cards are fundamentally different, because the whole processor is based around this concept. In theory, you can manipulate each pixel in parallel under the restriction that all pixels are executing the same instruction. This (vector computing, ie: SIMD) is why shaders have their own compiler. The code needs to be written such that the compiler can generate these data-parallel loops.
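For example, the ArrayC = ArrayA + ArrayB case maps directly onto a single Accelerate call; the wrapper function below is just for illustration:

#include <Accelerate/Accelerate.h>

static void addArrays(const float *a, const float *b, float *c, int n)
{
    /* Scalar version: one element per instruction.
       for (int i = 0; i < n; i++) c[i] = a[i] + b[i]; */

    /* Vector version: the loop lives inside the call, and the library
       (or, on a GPU, the hardware itself) spreads it across SIMD lanes. */
    vDSP_vadd(a, 1, b, 1, c, 1, (vDSP_Length)n);
}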
Hey Rob, you planning on open sourcing your sound engine?
No; at least not until I get to the point where I have totally given up on the project. Cantor's audio engine is really minimal anyhow; it's more of an experiment in how to get some parallelism out of it, and a demonstration of how Cantor's MIDI output should be interpreted.
I open-sourced Mugician because I gave up on it as a viable commercial project. It was written to do exactly what it does and nothing more, and its sound engine is too volatile for me to keep making updates to it. I'm really amazed that the only thing that came out of open-sourcing Mugician was blind knock-offs from all over Asia (and one dickhead that says "buy my app to support the free version of Mugician!").
Cantor is where I fixed a lot of software-engineering sins, and I have something that can stand up to many more changes. I only open-sourced Cantor's MIDI code (ie: DSPCompiler) because other synths need to be able to interpret my special form of MIDI before there's any point in me emitting it(!).
Cantor has parts that can easily be re-used in other projects. I need to be more careful about what I put out in the open, or I risk alienating potential collaborators. I might work with the guy that does Arctic Synth, or something with Wizdom could come up in the future (less likely, but possible).
btw... Cantor (started off as AlephOne) was primarily an experiment in having *no* sound engine at all. The code in DSPCompiler basically is the heart and soul of what Geo/Cantor does, presuming that you are interested only in using it as a MIDI device. ThumbJam is probably the best third-party synth, followed by SampleWiz for this purpose.
But between Geo and Cantor, I think I have proven beyond all doubt that MIDI support is a white elephant. The fact that these kinds of apps absolutely require good pitch handling, and have no real concept of "notes", makes MIDI too hard to set up in the few cases where it will work, makes it rare to find a synth that can handle it reasonably, and makes it too hard to explain because the setup happens in the synth instead of the controller. It basically exposes MIDI in its current state as a totally broken CF of a protocol; and it makes me look stupid. Cantor's internal audio engine is literally driven by the MIDI output, to prove that all these nuances don't have to be broken just because MIDI is involved.
I may start a new project where I ditch MIDI entirely, just focus on a great sound engine, and make the MIDI come second as I had done with Geo. The real trick in doing it this way is in avoiding a situation where you basically have multiple sound engines and inconsistent results within one synth.
I hear you with all the issues. Musix Pro has been sitting on the shelf for months simply because of all of the midi issues you have encountered. It's very hard to create an innovative instrument within the constraints of midi. I'm by no means a synth expert and don't wish to be. But I'm slowly being forced down the same path as you in having to create a better synth in order to realize the full potential of my instrument.
To put it bluntly: MIDI is really, really, really killing off innovation because it's broken in precisely the places where it has to excel on a tablet. MIDI HD might fix it, but if it isn't to be super-complex, then it has to be a new protocol. If it's a new protocol, then why not OSC? The only thing that stopped me from ditching MIDI already was the existence of a background pipe for MIDI. If apps could do low-latency OSC in the background over a standardized pipe, then almost all of the complexity and gotchas disappear.
If MIDI wasn't so F-ing broken, almost none of the apps in the store would bother writing their own sound engines(!!!). I would never have bothered to write a sound engine. There would be audio engines from a few companies with a deep electrical engineering background, like Moog, etc., and everything else would just be controllers that take advantage of the surface. It has been decades since anybody cared what a keyboard controller sounds like... you just plug it into a MIDI synth that might not even have its own controller on it. It causes developers to spend all their time where they are weakest (electrical engineering issues), at the expense of where they are strongest (interfaces).
And this leaves us with a gaggle of truly stupid synthesizers on iOS, where they are all knobs and sliders and almost no playing area... totally unplayable... so now you plug keyboard controllers into an iPad, which doesn't really compete with a typical hardware synth (the one that you are plugging into your iPad!), other than possibly offering some reconfigurable real-time controllers on the synth's surface. So the things that you plug into the iPad are often better than the iPad that you are plugging them into... It's epic idiocy.
The ability to negotiate the setup must be STANDARDIZED. OSC and MIDI actually share this fatal flaw. If a synth has these controls for its current patch "knob:A,knob:B,slider:C", then when you open up a controller to connect to it, the process of connecting should proxy these controls through the controller's UI automatically. Pitch handling should just work, as there is nothing more fundamental to a music instrument (MIDI...wtf!?). The apps on-board should not be fighting over sample rate and latency (one controller is generating sound, and all others are controlling). I could go on... It may not even be OSC that replaces MIDI. It may be something totally iOS specific that fixes all of these stupid problems, and quickly spreads out to hardware from there.