Audio Languages and iOS, WinRT, Android, etc.
I have been examining the various ways to simplify development of music instruments for touch screens. I wrote Mugician, Geo Synth (known as Pythagoras early in its development), and Cantor (known as AlephOne early on). Mugician and AlephOne are open source projects that I have posted to github. Geo Synth is a commercial project released under Wizdom Music (the company of Jordan Rudess, the Dream Theater keyboardist). I built these apps because I believe that the instrument layout they use on touchscreens will completely overtake the guitar and keyboard industry at some point, once latency and pressure-sensing issues are dealt with. I have a lot of videos posted about it, and there are a lot of pro musicians using Geo in particular (and some using Mugician as well):
http://www.youtube.com/rrr00bb
The main reason I believe this will happen is that this layout on a touch screen solves intonation and pitch problems that guitars and keyboards actually make worse. It is also a good ergonomic setup, because all 10 fingers sit on top of the playing surface, with a layout that is familiar to guitarists and easier than a keyboard (even for actual keyboardists). The regular layout makes it easier to play without extensive tactile feedback (ie: reaching the proper sharps and flats on a piano where you can't feel the white/black key boundaries). This allows you to play very fast; significantly faster than real guitar playing, even.
The Beginning and Background MIDI
After a few years of playing around with iOS controller code, about a year in I came to the conclusion that I was doing something architecturally very wrong (along with the rest of the iOS community). Before Audiobus came out, there was no way to contain the list of requirements for an iOS-based audio app, and to keep simple ideas from turning into time-burning monstrosities. Back then, you could run another app in the background, and even send it MIDI messages. But MIDI has so many problems that you still needed an internal audio engine to have any guarantee about the user experience, and to ease setup. (So, in my view, the existence of internal audio engines in any MIDI controller app means that MIDI doesn't actually work as it was intended; I will have more on OSC later. If MIDI did what it is supposed to do really well, then nobody would bother writing a synth to go with a controller, or vice versa; it would never make sense financially. But the combination over MIDI sucks, so we end up doing both sides of it.) Because audio was still isolated between apps, there was a lot of pressure to make every instrument have:
- A basic controller. Most people imitate pianos, because that's what the synthesis guys are used to looking at. On a tablet, though, it's always a horribly unplayable imitation of a piano. I don't understand why people keep doing this. We have had 2 years of this, where there's little point in having the controller because it's not playable - but it's included anyway. You can make a playable controller, but the first thing that has to happen is dropping the requirement that it resemble a piano, as that has been shown for many years not to work on a touch screen.
- The two main jobs of the controller should be to proxy out to controls for the synths/DAWs in the background when required (ie: volume knobs, timbre sliders, etc), and to simply play voices with the correct amplitudes and pitches. MIDI's very strong note orientation makes it an awful protocol for meeting these requirements. OSC can trivially do all of this correctly, but it's a very vague standard. It's so vague that if you make an OSC controller, you generally need to ship a script that is put into the front of the synth to unpack the OSC messages and do what it needs to do to the synth. That's where audio languages come in later.
- A synthesis engine. This is a pretty steep electrical engineering task to do really well, which is why Animoog is one of the best ones out there. Any instrument that simply plays back samples is really unexpressive, and misses the point about music instruments and expressivity. If you are making a MIDI controller, you can (and should) stop here if at all possible and just send MIDI messages. When background MIDI came out, it was a wonderful new thing to have, presuming you could generate MIDI to do what you wanted and the majority of synths interpreted it correctly. What should have happened as a result is that the app world split between people providing MIDI controllers and people providing synthesizers, with nobody wasting time on a mediocre duplicate of whatever their app primarily was not; in practice, that means mediocre controllers bolted onto apps designed to be synths.
- A recording capability, in the form of audio copy and paste. Actually, it's really a request to include some level of a DAW in every single music instrument. Because this is tied into the synth, it's generally mediocre DAW functionality. You can't really use these instruments naturally, because you have relatively short buffers to record audio into. AudioCopyPaste is quite useless if the primary use case is somebody playing each track non-stop for 20 minutes; and that's precisely the kind of use case I cared about.
The iOS audio system wasn't designed primarily for real-time control (at the audio sample rate). We are also dealing with ARM processors, because we run on battery power. Because of this, it has always been a struggle to get instrument-quality latency in any app; let alone in an instrument that can't stick to doing one thing well and throw everything else overboard to solidly meet its performance guarantees. Currently, audio buffers are between 5ms and 10ms for "good" apps, though iOS gives about 20ms as the default (under the assumption that you are just playing back sound files). It should really get down to about 1ms audio buffers to meet the standards of professional audio quality. Beyond even that, almost no instruments (including my own) will adjust latency to reduce jitter to 0ms (by making all latency larger, but constant), because that's usually an expensive thing to implement in an audio engine. Remember that there is no standard for generating audio, just a mix of professionals and amateurs doing the best they can while scribbling waveforms into raw audio buffers. This means we have a lot of crappy effects units, audio glitching, aliasing, etc.
For reference, there are these different kinds of latency that we have to deal with:
- Graphics frame rate, about 60fps ("real-time", but a single missed frame is not a disaster). The user interface can cause everything else to stall when it goes over budget. This is especially true if the interface is written in a high-level language like C# or Lua. Also, if you try to use the GPU to render audio frames, then you could be limited to the graphics frame rate, or at least locked out of rendering audio while graphics rendering is in progress.
- Audio frame output rate, 44.1kHz (44,100 "frames" per second, where a single missed buffer results in a horrible popping noise; it's hard real-time).
- Audio frame input rate, 44.1kHz or 22kHz. If you are making a guitar effects processor, then you have to add up the incoming latency, the outgoing latency, and the minimum time to compute a frame for output. So just because you can build a wickedly fast tablet instrument doesn't mean that same effects chain will work for a fast guitar effects pedal.
- Control rate, at least 200fps (256-frame audio buffers). I would like to quadruple that to 64-frame buffers at roughly 800fps, for about 1ms of latency and jitter (see the sketch after this list). This is somewhat tied to how fast the touchscreen can sense. If the touch timestamp is sample accurate, then we can send the timestamp to the synth, and the synth can even out the latency to reduce jitter to zero.
- MIDI/OSC packet latency and jitter. It's very network dependent, and can be either negligible or fatal depending on how much of it there is.
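To make the control-rate numbers concrete, here is a minimal C sketch (my own illustration, not code from any of the apps above) of the buffer-size arithmetic: the buffer period is also the worst-case jitter when control changes can only take effect at buffer boundaries.

#include <stdio.h>

/* Buffer period in milliseconds: how often the audio callback fires.
   If control changes are only applied at the start of a buffer, this
   is also the worst-case jitter. */
static double buffer_period_ms(double sample_rate, int frames_per_buffer) {
    return 1000.0 * frames_per_buffer / sample_rate;
}

int main(void) {
    const double rate = 44100.0;
    const int sizes[] = { 1024, 512, 256, 64 };
    for (int i = 0; i < 4; i++) {
        double ms = buffer_period_ms(rate, sizes[i]);
        /* Applying changes at buffer boundaries: average latency ms/2, jitter up to ms.
           Delaying every change by one full buffer instead: latency ms, jitter near 0. */
        printf("%4d frames: %5.2f ms period, ~%.0f callbacks/sec\n",
               sizes[i], ms, 1000.0 / ms);
    }
    return 0;
}

At 44.1kHz, 256 frames works out to about 5.8ms per buffer and 64 frames to about 1.5ms, which is the ballpark behind the control-rate figures above.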
Latency issues are still a bit of a mess, and are ignored or not understood by a lot of people. This is especially true on Android, where latency is far beyond fatal for real-time music apps.
Audiobus
Then Audiobus came out. Audiobus wonderfully fixes the huge problem that iOS had with audio isolation between apps. Its existence is very necessary.
Audiobus lets every app (input app, effects unit app, output app) read and write to the Audiobus callbacks so that separate apps can be chained together. So, about 200 times a second, Audiobus is in the background calling every participating app (an input app, an effects app, and an output app) to run its callbacks in order. This has the effect of putting 4 apps under hard real-time requirements at audio rates(!). Audiobus + controllerApp + fxApp + dawApp ... that is 4 apps, and at roughly 200 callbacks a second each, that's on the order of 800 callbacks getting filled in every second. Also, that's 4 apps with a lot of memory allocated to wrangling audio in real-time. The controller app is going to have a user interface on it as well. The user interface can hog resources and cause everything else to stall; it can easily be the thing that causes the whole pile of audio-generating apps to miss their real-time deadlines. It's hard to overemphasize the problem this creates. If there were only one hard real-time process responsible for generating audio, with everything else being a controller of some kind, then glitching and performance issues would mostly go away.
Audiobus also creates a somewhat political issue with app development. The Audiobus app lists apps that are input, effects, or output capable. It has nothing to say about controllers. If your app is not a synth or a DAW, then realistically it should not be generating audio; if your app is a controller, then it should implement MIDI or OSC. But such a controller is technically not in the "Audiobus enabled" category, which means that your app essentially doesn't exist for a lot of users. So what do we do? We pointlessly add Audiobus support to controllers for no reason, just so we can get into that category. If you unnecessarily generate audio, you just eat up memory and CPU; resources that actual Audiobus apps really need. :-) Controllers are essential to a chain of Audiobus apps, but controllers don't generate or deal with audio. Controller to synth is one protocol, and synth to effects to DAW is another protocol. Note that if Audiobus gained the much-requested feature of saving setups, it would probably have to include the MIDI and OSC connections as well.
Controller -> (MIDI/OSC) -> Synth -> (Audiobus) -> Effects -> (Audiobus) -> DAW
It should be like that.
MIDI versus OSC
I have posted at length about all the technical problems that MIDI has on touch screens; a very long and technical read:
http://rrr00bb.blogspot.com/2012/04/ideal-midi-compatible-protocol.html
The problems are deep issues with what can and cannot be said in the MIDI protocol, especially in the subset that all synths are required to understand. The main problem with MIDI is that it is oriented around notes rather than around frequency/pitch. MIDI's note bending is a hack with all kinds of obvious corner cases that it can't represent. These corner cases don't show up in piano controllers (which is why they never got fixed), but they are essential for string instruments (which is why MIDI guitars are not standard, and have been deficient in various ways whenever people have tried). Real oscillators don't know anything about notes; they are frequency oriented.
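To make the note orientation concrete: the only way a stock MIDI synth reaches an arbitrary pitch is a note number plus a channel-wide pitch bend. A minimal sketch of that math in C (the standard equal-temperament formula; the +/-2 semitone bend range is only the common default, not something the protocol guarantees):

#include <math.h>

/* Convert a MIDI note number plus a 14-bit pitch bend value into Hz.
   bend14 runs 0..16383 with 8192 meaning "no bend"; bend_range is the
   synth-side bend range in semitones (commonly +/-2). */
double midi_to_hz(int note, int bend14, double bend_range) {
    double bend = ((double)bend14 - 8192.0) / 8192.0 * bend_range;
    return 440.0 * pow(2.0, ((double)note - 69.0 + bend) / 12.0);
}

The bend applies to the whole channel rather than to one finger, which is exactly why chords of independently sliding voices don't fit the model without one-channel-per-voice tricks.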
OSC can be made to handle all of this stuff very nicely. OSC is just a remote procedure call syntax. It's literally just named functions with parameters going back and forth, like:
/guitar/string0,fff 1.0, 440.0, 0.5 #amplitude, frequency, timbre
/guitar/string0,fff 0.9, 442.0, 0.53 #amplitude, frequency, timbre
...
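For reference, this is roughly what one of those messages looks like on the wire. A minimal C sketch that hand-packs the OSC binary format (padded address string, padded type-tag string, big-endian floats) and sends it over UDP; the address, port, and parameter meanings are just the illustration above, not any fixed standard:

#include <arpa/inet.h>   /* htonl, htons, inet_addr */
#include <netinet/in.h>  /* sockaddr_in */
#include <string.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Append a null-terminated string, padded to a 4-byte boundary (OSC rule). */
static int pack_str(uint8_t *buf, int at, const char *s) {
    int n = (int)strlen(s) + 1;
    memcpy(buf + at, s, n);
    while (n % 4) buf[at + n++] = 0;
    return at + n;
}

/* Append a 32-bit float in big-endian byte order. */
static int pack_f32(uint8_t *buf, int at, float f) {
    uint32_t bits;
    memcpy(&bits, &f, 4);
    bits = htonl(bits);
    memcpy(buf + at, &bits, 4);
    return at + 4;
}

int main(void) {
    uint8_t msg[64];
    int len = 0;
    len = pack_str(msg, len, "/guitar/string0"); /* address */
    len = pack_str(msg, len, ",fff");            /* type tags: three floats */
    len = pack_f32(msg, len, 1.0f);              /* amplitude */
    len = pack_f32(msg, len, 440.0f);            /* frequency */
    len = pack_f32(msg, len, 0.5f);              /* timbre */

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in to = { 0 };
    to.sin_family = AF_INET;
    to.sin_port = htons(9000);                   /* whatever port the synth listens on */
    to.sin_addr.s_addr = inet_addr("127.0.0.1");
    sendto(fd, msg, len, 0, (struct sockaddr *)&to, sizeof(to));
    close(fd);
    return 0;
}

(Wrapping messages in an OSC bundle adds a timestamp, which is what you would use for the delay-everything-to-kill-jitter trick mentioned earlier.)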
The problem with OSC, of course, is that every controller ends up being an essentially custom creation. The messages going to the synth, and from the synth back to the controller, could be anything at all. If you defined what a keyboard looks like, you could standardize it.
Audio Languages
So, now this brings me to the heart of the problem I am facing. I want to completely separate the audio language from the controller. I don't want the protocol that the controller speaks to make assumptions that place unintended limits on the controller. And I don't want the controller's user interface to hurt the real-time audio engine. So I have an experiment here on Windows 8 (a 27 inch screen), where a C# program sends UDP packets to an audio language, ChucK.
A lot of audio language aficionados are fond of SuperCollider and Max/MSP/Pd; CSound is an older language that is still used. There are a few more audio languages, but those are the popular ones. These languages have common characteristics:
- open up a listening port for incoming OSC messages, and get them into a script that does something with the messages
- because OSC is just a standard for sending messages, the synth front-end must have a script pushed into it to actually control oscillators and effects parameters, etc.
- they all let the sound patch be defined completely in a script.
- the script can be pushed into the synthesizer from the controller. This means that the real-time synthesis engine is one app (ie: scsynth, csound, chuck), and the patch comes from the controller
- in some of these languages, ChucK in particular, you set up the network of effects and advance time explicitly. As an example, you create an oscillator at amplitude 1 and 440hz, then tell the engine to move forward 30 milliseconds; when that happens, 30 milliseconds of audio is generated (see the sketch after this list). This is a very hard-real-time notion of how a language should work, and it is the main thing our usual environments are missing when we try to write synthesizers. It matters most when you try to do things like increasing the latency of events to provide zero jitter; ie: when you want sample-accurate timing and would rather delay every change by 5ms than apply it at the beginning of the next audio buffer, which guarantees up to 5ms of jitter (latency of 2.5ms with 5ms jitter vs 5ms latency with 0ms jitter)
- you can inject an entire sequencer into the sound engine, and only send it control changes after that.
- you can define effects units like reverbs and distortion units - in scripts that run on the tablet - and install them into the audio engine at runtime. At this point, the mentality could not be any more different from MIDI (and Audiobus). This is where environments like Max/MSP make a lot of sense on tablet computers.
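To illustrate the "advance time explicitly" point from the list above, here is a toy C sketch of that model (my own illustration, not ChucK's implementation): you set oscillator parameters, ask the engine to advance some number of milliseconds, and exactly that much audio gets rendered.

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define SAMPLE_RATE 44100.0

typedef struct { double amp, hz, phase; } Osc;

/* "Advance time": render exactly ms milliseconds of this oscillator
   into out, and return the number of frames produced. */
static int advance_ms(Osc *o, double ms, float *out) {
    int frames = (int)(ms * SAMPLE_RATE / 1000.0);
    for (int i = 0; i < frames; i++) {
        out[i] = (float)(o->amp * sin(o->phase));
        o->phase += 2.0 * M_PI * o->hz / SAMPLE_RATE;
    }
    return frames;
}

int main(void) {
    static float buf[1 << 16];
    Osc o = { 1.0, 440.0, 0.0 };           /* amplitude 1, 440hz */
    int n = advance_ms(&o, 30.0, buf);     /* "move forward 30 milliseconds" */
    o.hz = 442.0;                          /* a control change... */
    n += advance_ms(&o, 30.0, buf + n);    /* ...takes effect exactly here */
    printf("rendered %d frames\n", n);
    return 0;
}

The point of the model is that control changes land at exact sample positions, instead of whenever the next audio callback happens to start.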
Audio Language Present
The current group of audio languages isn't ideal for what I am trying to do. The problem is that most of them are oriented around live coding, or in the case of CSound, around offline score rendering. Both are a different perspective from the goal of creating hard real-time OSC synthesizers that are driven primarily by controllers.
CSound is a pretty fast environment. It is well known in academic circles, where offline rendering of music is a plausible thing to do. MIDI and OSC support are a horrible after-the-fact hack, however. The language is really low level, and will not appeal to a lot of people who would otherwise write patches for it. It's designed mostly around building up a static graph of oscillators and filters. It builds for all the primary desktop environments. CSound also has some pretty bizarre limitations that forced me to change my OSC messaging, breaking code that could previously talk to both SuperCollider and ChucK with the same messages.
SuperCollider is very actively developed, but it's currently under a GPL3 license, though work is being done to detangle things to allow for GPL2-compliant builds. Because of this, it almost doesn't matter what else is good about it; the licensing situation is currently fatal for all of the tablet environments. The $0.99 app model depends on Digital Rights Management (locking down hardware and preventing you from doing whatever you want on your own device), so DRM will never go away. End users have spoken: they will never give up the $0.99 app model, where they get the interface polish of commercial apps at close to the prices of free apps. The DRM conflicts with GPL licensing, and the SuperCollider devs seem pretty adamant about not simply ignoring the issue and letting SuperCollider be embedded anyway. GPL2-compliant builds may have issues as well, because user-facing projects with a short tail often just cannot be open source projects (rock stars don't work for free for very long, and it's not just developers that are required to make apps happen, etc). But ignoring that huge issue, SuperCollider is very mature and has a pretty healthy user and developer base. It is based on a Smalltalk-like language. The only major technical downside seems to be that a lot of development is directed towards areas that are irrelevant to the main task of creating realtime OSC synthesizers driven by controllers: much work goes into user interface features that won't be useful on the tablet platforms, and into things related to the live-coding use cases.
Pd (Max/MSP) is an interesting contender. The underlying language is hidden behind a simple and standardized visual interface. In some ways this is good, in that it's easy to do simple things. In other ways it's really terrible: when faced with real work, you can easily end up with a ball of unmaintainable synthesis code that would be simple with the tried-and-true abstractions available in standard programming languages. Its BSD licensing is very compatible with commercial products, and some people have contributed ARM-specific improvements.
ChucK is the closest thing to a modern environment in this group. It is a bit less capable than SuperCollider in some respects, but the language really makes a lot of sense, and because the project is run by the Smule folks, these guys understand the tablet world pretty thoroughly. Performance issues aside, it is the most interesting option, and the one least burdened by irrelevant features. It isn't freely licensed on iOS, however (though unlike GPL3, its license is not an impossible one for iOS). ChucK also seems to have a lot of applicability outside of audio synthesis; it's an interesting lesson in how real-time languages should look in general.
Audio Languages In an Ideal World
Dependencies: One of the more problematic things I encounter with these languages is the kinds of dependencies they have. The current languages were very much designed with a desktop world in mind. When you build for a new platform they did not envision, you find that they were not designed as pure buffer generators that simply get hooked up to callbacks (ie: Audiobus, CoreAudio callbacks, WASAPI callbacks). Ideally, the core project should not have per-platform branches; platform support should be separate projects that build around the core. Any audio language project started from scratch should build without pulling in a lot of dependencies, and should ultimately just fill in audio buffers handed to it by callbacks. This is how Audiobus works, and presumably how The Amazing Audio Engine will work (though those projects are very iOS specific). A real source of heartburn is that even "standard" frameworks pose a problem: OpenGL, OpenCL, OpenAL, etc, are the usual route to portability, and then Microsoft uses WinRT and insists on DirectX and WASAPI, etc. Using straight C code with a minimum of dependencies is generally the only way to avoid this problem.
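Concretely, the platform-neutral core I have in mind exposes little more than a render callback. The interface below is hypothetical (the names are made up by me) and only shows the shape of it; every platform wrapper (CoreAudio, WASAPI, Audiobus, ...) would call these functions from its own callback:

#include <stdint.h>

/* Hypothetical platform-neutral audio-language core. */
typedef struct Engine Engine;

Engine *engine_create(double sample_rate);
void    engine_destroy(Engine *e);

/* Called from the platform audio callback: fill `frames` stereo frames.
   Must be hard real-time safe: no allocation, no locks, no file IO. */
void    engine_render(Engine *e, float *left, float *right, int frames);

/* Called from the network/control thread: hand a raw OSC packet to the
   engine, to be applied at the next buffer (or at its timestamp). */
void    engine_handle_osc(Engine *e, const uint8_t *packet, int bytes);

Everything platform specific (drivers, frameworks, UI) stays outside that boundary, which is what keeps the core buildable everywhere with plain C.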
SIMD: Few of these languages take advantage of SIMD in their implementations (single-thread lockstep parallelism, the kind that you need for fast short convolutions, filtering, or just rendering an entire audio buffer in parallel). They are all in C or C++, and there is no good standard for doing this yet. But typically, per platform, there need to be SIMD-optimized builds for the engine to be feasible on ARM processors; examples are vDSP and ARM NEON intrinsics. OpenCL addresses these issues in theory, but it's unclear whether GPUs can be used in practice for this kind of audio processing. The SIMD work might be tied into the audio language VM, rather than compiling elemental functions to use SIMD at runtime.
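As a small example of what a per-platform SIMD build looks like, here is a sketch of a gain multiply with an ARM NEON path and a plain C fallback (illustrative only; a real engine would do this for filters, mixes, and envelopes rather than a bare gain):

#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Multiply n samples by a gain, four at a time where NEON is available. */
void apply_gain(float *buf, size_t n, float gain) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(buf + i); /* load 4 samples */
        v = vmulq_n_f32(v, gain);           /* scale all 4 in lockstep */
        vst1q_f32(buf + i, v);              /* store them back */
    }
#endif
    for (; i < n; i++)                      /* scalar tail / non-NEON fallback */
        buf[i] *= gain;
}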
The Language: Because these environments have hard real-time requirements, there is a real conflict with also being dynamic environments. These languages run in custom virtual machines and do real-time garbage collection. Because of this, the language cannot get overly baroque without undermining its own goals. The language should work well in a hard real-time environment, which generally means that memory allocations and de-allocations are much more conservative, and algorithms run in consistent time.
Language Simplicity: A variant of LISP that deals with arrays and SIMD directly seems like the most obvious candidate to start with; there are existing audio languages that use LISP as their basis. A virtual machine for running an audio language should at least start out very simple, and grow as needed. The main issue with using LISP in this capacity would be supporting actual arrays from the outset, and allowing array operations to be SIMD parallelized (ie: avoiding a high garbage collection rate, locality issues, etc).
The OSC Protocol: The most wonderful thing about SuperCollider is how the language environment (sclang) talks to the real-time engine (scsynth) over OSC. It is an effective split between hard real-time and soft real-time, and it allows the compiler to simply be removed from environments where it isn't needed. The controller should play a role similar to sclang's, and use OSC as the protocol over which to inject patch definitions into the synthesizer.
The VM: The virtual machine could be a traditional one written by hand. It could also consume LLVM output and turn it into native code. LLVM is designed to run on many systems, but again I run into issues with standards possibly not being usable in certain places (WinRT? How about generating CLR as a backend instead?). OpenGL drivers on OSX already work like this: they take shader code and generate the card-specific assembly language, and that is a pretty performance-critical part of the system.
Patch Generation
When I was working on Cantor (AlephOne), I had started to write a Python compiler to generate SIMD C code (vDSP; partially realized in my github project DSPCompiler and in AlephOne) for the audio engine from LISP input. I ran into this because OpenCL wasn't available and I had a similar problem to solve. When you try to generate wide SIMD code, your code turns completely "inside out": the outer loop that existed in the serial version goes into every instruction of the "assembly language expansion" of the original C code. For example:
//the serial version - this loop can't be parallelized as written
for(int i=0; i<N; i++){ x[i] = a[i] + b[i]*c[i]; }
Becomes something like:
for(i : 0..N) mul b, c, x    # x = b*c over the whole array
for(i : 0..N) add x, a, x    # x = x + a over the whole array
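On iOS/OSX those two whole-array steps map almost one-to-one onto vDSP calls from the Accelerate framework; a minimal sketch of the same x = a + b*c computation:

#include <Accelerate/Accelerate.h>  /* Apple-only */

/* x[i] = a[i] + b[i]*c[i], written as whole-array operations
   instead of a per-sample loop. */
void vma_example(const float *a, const float *b, const float *c,
                 float *x, vDSP_Length n) {
    vDSP_vmul(b, 1, c, 1, x, 1, n);  /* x = b * c  (elementwise) */
    vDSP_vadd(a, 1, x, 1, x, 1, n);  /* x = a + x  (elementwise, in place) */
    /* or fused in one call: vDSP_vma(b, 1, c, 1, a, 1, x, 1, n); */
}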
But that doesn't support dynamic use cases; the patches would need to be compiled into releases. A VM that supports SIMD instructions appropriately, however, could provide that kind of speedup even when the originating language is high level.