Tuesday, December 27, 2011

MIDI and Over-Abstraction

Let me know specifically where this is wrong. I have actual code that works, so even if it's wrong, it's not theoretical - it works very well.

I am getting my thoughts together on the rationale for doing MIDI the way that I have. The feedback I get on what I am trying to do shows a divide between people who worry about keeping hardware simple and those trying to keep software simple. It is instructive to first talk not about MIDI and sound rendering, but about OpenGL and drawing APIs.

Cooked Standards

The OpenGL that I am used to is the desktop 1.x versions of it. It provides a state machine that feels a lot like an assembly language for some real hardware device. But it is not painfully low level, because it provides the things that you are almost guaranteed to need in every app. The basic idea is that it will do buffering and matrix transforms for you, and you just feed it what you want. For example, in pseudo code:

setCameraPointLookingAt(x,y,z, atX,atY,atZ);
turnOnLightSource(lightX,lightY,lightZ, lightR,lightG,lightB);
vertex(x,y,z, r,g,b);
vertex(x+dx,y,z, r,g,b);
vertex(x+dx,y+dy,z, r,g,b);

What is really cool about doing things this way is that you don't have to supply all of the equations for moving points around in a 3D projection. You don't have to build any kind of API to buffer data. These things are provided for you, because you will need this functionality in every app. But newer versions of OpenGL threw all of this away. The main reason is that the number of APIs needed to cover every scenario for lighting and coordinate transforms keeps growing. Their existence makes the spec unstable. These 'fundamental' things such as lighting, buffering, and coordinate transforms are really conveniences that should be out of the standard.

Raw Standards

The problem with this is that something like OpenGL needs to make the implementation as low-level as possible, while still allowing broad compatibility. So, the requirement for programmable shaders does not mean that it should just be shoehorned into the existing specification. What they did was to introduce an even lower-level API as the new standard, where old code is essentially emulated on top of it. The new APIs are kind of like this:


So if you want the old-code look, then you use an API that looks like the old code, but ultimately turns into this underneath. If you need the new features, then you write a framework that lives inside of your framework that approximates the old APIs where you can, and goes its own way where it must do so. This keeps the specification from being larded with everything that the designers didn't think of. The hardware has a smaller burden of what it must recognize, as does the OpenGL graphics driver. But the burden moved to the programmer, who must use (or write) a library on top of it to get going. In short, they solved the stability problem by making the primitives more general.

So newer versions of OpenGL recognize that in the end, all of this code configures shaders in the graphics pipeline and builds buffers of different kinds of vertices. So new versions of OpenGL don't even bother providing 'fundamental' things like lighting, buffering, and coordinate transforms. The burden is on the developer to create these things. This is not a hassle if OpenGL developers use a standard software library above the raw library. All that has happened is that complexity moved out of the hardware (or at least the API that hides the hardware) into the client software.

VRML and other SceneGraph APIs went nowhere. They provided really easy abstractions, at the expense of creating objects that were too high level. But in the end, the ability to simply make shaded/textured triangles won. Making the API too high level simplifies things for somebody who insists on writing directly to the hardware, at the expense of imposing limitations. The limitations need to be worked around, so then garbage creeps into the spec because the spec is too high level. For OpenGL, staying very low level is the strategy for having the spec withstand change.

It makes the applications more complex to build, but in exchange the spec sheds the parts that are subject to much variation and includes only primitives.

MIDI Is Too High Level, Making The Spec Too (F-ing) Large

MIDI seriously suffers from this over-abstraction problem. Because hardware devices speaking raw MIDI with very little computational power on-board are the primary focus, it is understandable that the model is of turning notes on and off and bending them. In this view, it isn't the controller's job to: 1) determine exact pitch, 2) manipulate note on/off to create solo mode, or 3) manipulate the envelope of a note. This view is tightly tied to the assumption of discrete keyboard keys that send a MIDI message when the key goes down. The idea of completely continuous controllers could be shoehorned in as an incompatible addition to the spec, but that's not a good thing because it takes an already complex spec and makes it more complicated without even being backwards compatible. What would be better is to make the primitive elements more general, and make controllers handle it themselves in a backwards compatible way.

MIDI isn't primitive enough. The abuses of the spec that are currently possible, but dicey, need to be standardized so that they are totally expected and legitimate. Note on/off are clearly fundamental. Bends are fundamental, but their semantics are underspecified. You should be able to compute a frequency given a note and a bend value, and work in actual frequencies. This is because frequencies are fundamental, even though MIDI doesn't quite get this concept. Envelopes on the sound are fundamental as well.
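As a rough sketch of what "compute a frequency given a note and a bend value" could look like: the function below assumes equal temperament with note 69 tuned to 440 Hz, a 14-bit bend value centered at 8192, and a bend range of plus or minus 2 semitones unless told otherwise. The function name and the default range are illustrative choices, not anything the MIDI spec mandates.

```python
def note_bend_to_frequency(note, bend=8192, bend_range_semitones=2.0):
    """Return the frequency in Hz for a MIDI note number plus a pitch bend.

    note: MIDI note number (69 is assumed to be A at 440 Hz).
    bend: 14-bit pitch bend value, 0..16383, where 8192 means no bend.
    bend_range_semitones: how far a full bend moves the pitch (assumed setting).
    """
    # Scale the bend so that -8192..+8192 maps to -range..+range semitones.
    semitones_from_bend = (bend - 8192) / 8192.0 * bend_range_semitones
    # Twelve equal-tempered semitones per octave, one octave doubles frequency.
    return 440.0 * 2.0 ** ((note - 69 + semitones_from_bend) / 12.0)
```

The point of the sketch is that the answer depends on the bend range, which is exactly the piece of semantics that both ends have to agree on before note plus bend means a definite frequency.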

Bend And Legato

Because almost all MIDI devices only support the on/off/bend messages, and ONLY that reliably, it's foolish to implement huge parts of the MIDI spec and simply demand that synths and controllers understand all of it. This is especially true on iOS, where such demands will simply result in incompatibility on the part of developers who will simply implement what they need to ship. I am sticking to what is known to work everywhere, and then using the NRPN to putty in the cracks in such a way that it is not a disaster if the synth doesn't understand. This is analogous to rendering 3D graphics as simple triangles if a special shader is unusable. This principle is also why web pages are not usually completely broken when different browsers are used against non-portable pages. A wrong pitch due to an unrecognized message is far more debilitating than getting a right pitch with a note re-trigger at an inappropriate time, especially because many patches don't have a noticeable attack anyway.

Fundamental Elements

So, these are the fundamental elements to build up a correctly functioning MIDI synth that has control over pitch, polyphony, and legato:

0x90 - note on (and note off, when sent with velocity 0)
0x80 - note off (I send 0x90, but a synth should recognize 0x80 as equivalent)
0xe0 - bend (pitch bend setting still applies)
0xbxxxxx - a special 'note tie' NRPN that states that the next note off and note on pair are actually tied together.
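To make the byte layout of these messages concrete, here is a minimal sketch of building them by hand. The helper names are made up for illustration. The NRPN helper uses the standard controller numbers (99 and 98 to select the parameter, 6 and 38 for the data), but the parameter number and value that the note tie actually uses are not given above, so both are left as arguments rather than guessed.

```python
def note_on(channel, note, velocity):
    # Status 0x90 plus the channel in the low nibble.
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

def note_off(channel, note):
    # Sent as a note on with velocity 0, the form this post says it sends.
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, 0])

def pitch_bend(channel, bend):
    # 14-bit value split into two 7-bit data bytes, least significant first.
    return bytes([0xE0 | (channel & 0x0F), bend & 0x7F, (bend >> 7) & 0x7F])

def nrpn(channel, parameter, value):
    # Generic NRPN: select the 14-bit parameter, then send 14 bits of data.
    # Which parameter means 'note tie' is an assumption left to the caller.
    status = 0xB0 | (channel & 0x0F)
    return bytes([
        status, 99, (parameter >> 7) & 0x7F,
        status, 98, parameter & 0x7F,
        status, 6, (value >> 7) & 0x7F,
        status, 38, value & 0x7F,
    ])

def is_note_off(message):
    # A receiver should treat 0x80 and 0x90-with-velocity-0 the same way.
    status = message[0] & 0xF0
    return status == 0x80 or (status == 0x90 and message[2] == 0)
```

A synth that ignores the NRPN still sees an ordinary note off followed by an ordinary note on, which is what keeps the fallback harmless.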

The first rule is that there are no real 'notes', only frequencies that we make by a combination of midiNote number and bend. We don't try to re-tune the notes, but use pitch bend in combination with note to get the exact pitch. We also place all current notes into unique channels, and try to behave reasonably when this is not possible (i.e. channel bend reflects last note down). This is the only really reasonable way to do this because the note on that we choose is what we *name* the note. This note name is what a piano that doesn't understand bends will play if it's asked to play our note.
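Going the other direction, a controller that thinks in frequencies has to pick a note name and a bend for each one. The sketch below is one plausible way to do that, assuming equal temperament with note 69 at 440 Hz, a 14-bit bend centered at 8192, and a bend range of plus or minus 2 semitones; the function names are illustrative. It picks the nearest note, so a synth that ignores bends plays the closest thing it can.

```python
import math

def frequency_to_pitch(frequency_hz):
    # Fractional MIDI note number, e.g. 60.5 is a quarter tone above note 60.
    return 69.0 + 12.0 * math.log2(frequency_hz / 440.0)

def pitch_to_note_and_bend(pitch, bend_range_semitones=2.0):
    # Name the note after the nearest semitone (halves round upward).
    note = int(math.floor(pitch + 0.5))
    note = max(0, min(127, note))
    # Whatever is left over is expressed as a bend around the center value.
    offset = pitch - note
    bend = 8192 + int(round(offset / bend_range_semitones * 8192))
    bend = max(0, min(16383, bend))
    return note, bend
```

For example, a pitch of 69.25 comes out as note 69 with a bend of 9216, and a pitch of 60.5 comes out as note 61 bent downward to 6144.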

Because we can always exceed the bend width, note tie says to 'continue note in the same state from the note turning off to the next one turning on'. This note can, and usually does, change channels, because of the requirement for every note going down to go down into its own channel. You have to hold off on reusing a channel for as long as possible, because when a note is turned off, it will still respond to bends while it is releasing.
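One simple way to hold off on reusing a channel for as long as possible is to hand channels out from a queue and put released channels at the back of it. This is a minimal sketch of that idea, not a description of any particular app's allocator; the class name is hypothetical, and what to do when every channel is busy (for instance sharing a channel and letting its bend follow the last note down) is left to the caller.

```python
from collections import deque

class ChannelPool:
    """Hands out MIDI channels, reusing the one released longest ago."""

    def __init__(self, channels=range(16)):
        self._free = deque(channels)  # front = idle the longest
        self._busy = set()

    def acquire(self):
        # Returns None when every channel is in use; the caller must then
        # fall back to sharing a channel.
        if not self._free:
            return None
        channel = self._free.popleft()
        self._busy.add(channel)
        return channel

    def release(self, channel):
        # A just-released channel goes to the back of the queue, so its
        # releasing note has the most time before new bends reach it.
        if channel in self._busy:
            self._busy.remove(channel)
            self._free.append(channel)
```

With four channels, acquiring 0 and 1, releasing 0, and acquiring again yields 2 rather than 0, so the tail of the note that was on channel 0 is left alone for as long as the pool allows.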

Keeping It Simple In Software, Hardware Isn't Our Problem

We are not worried about requiring a full-blown brain in the client, which hardware vendors might object to. Moving complexity out of the synth and into the controller makes an incredible amount of sense on iOS. This is because the controller will need some of the low level details in order to render itself on the screen. We have the pitch implied by where the finger touches, the pitch that we are drifting to due to fretting rules, and the actual pitch being played. We need to know all of this information in the controller. The standard MIDI equivalent would simply have the controller knowing about where the fingers are, and being more-or-less ignorant of what the synth is doing with this information. So in our case, the controller manipulates the pitch wheel to create the fretting system, and the synth has no idea what intonation we are using. It's not the synth's business to know this.

Similarly with polyphony rules, the synth can't just have a 'solo mode' setting. AlephOne and Geo both have a per-string polyphony that essentially adds the concept of 'polyphony groups'. The polyphony groups act similarly to channels in that the controller will turn notes on and off to get the right polyphony behavior. This way we can chord and do legato at the same time. It's a controller-dependent thing, and it's not the synth's business to know any of this.
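A toy model of controller-side polyphony groups might look like the following. It is only a sketch under stated assumptions: each group (think of one string) sounds at most one note, a new note in a group turns the previous one off, and lifting the sounding note brings back the most recent note still held in that group. That last rule, and all of the names, are assumptions made for illustration rather than a description of how AlephOne or Geo behave.

```python
class PolyphonyGroups:
    """Tracks held notes per group and reports which on/off events to send."""

    def __init__(self):
        self._held = {}  # group -> held notes, most recent last

    def note_down(self, group, note):
        stack = self._held.setdefault(group, [])
        events = []
        if stack:
            # Only one note sounds per group, so silence the current one.
            events.append(("off", stack[-1]))
        stack.append(note)
        events.append(("on", note))
        return events

    def note_up(self, group, note):
        stack = self._held.get(group, [])
        if note not in stack:
            return []
        was_sounding = stack[-1] == note
        stack.remove(note)
        if not was_sounding:
            # The note was already silenced by a later one in its group.
            return []
        events = [("off", note)]
        if stack:
            # Assumed behavior: fall back to the previous held note.
            events.append(("on", stack[-1]))
        return events
```

Because the bookkeeping is per group, two groups can each hold a note at once (a chord) while notes within one group replace each other, and the synth only ever sees plain on and off messages.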

Similarly with legato. Legato *usually* tracks polyphony by playing attack on the first note down in the polyphony group. But in reality, on a string instrument, whether to pick or legato a note is decided on a per-note basis. It's not a mode that is enabled or disabled for the whole controller.

Because almost nothing recognizes more than note on/off/bend, anything else that the MIDI spec states is quite irrelevant in practice. The note tie addresses something that nothing in the spec does, and doubles as the legato, and it's not a major problem if it's not implemented. To somebody implementing a synth, a small number of primitives (only one thing beyond the standard elements) gives good overall behavior.

There is also the issue of the same note being played multiple times. AlephOne does chorusing. It doesn't do this with any post-processing effects. It works by playing the same note, microtonally displaced, twice everywhere. This is one of the reasons why simply picking note numbers and bending them around is a bad idea. On a guitar, the high E note is played simultaneously from 2 or 3 positions all the time. The assumption that you bend a key is rooted in the idea of a keyboard with one key per 'note'.


So, yeah, this is a hack with current MIDI. OSC is too general (in the sense that OSC messages have no inherent semantics, just syntax - it's XML hell all over again). And what I have read of MIDI proposals that aren't made of on/off/bend seems unworkable in practice. If we are all on iOS and MIDI becomes too complex and not compatible with existing hardware anyway, we will simply throw the protocol away entirely and use on-board synths.

Real World Implementation

This is actually implemented in Geo Synthesizer and AlephOne on the client end, and in SampleWiz, ThumbJam (where it is especially good, specifically on the JR Zendrix patch), and Arctic Synth on the server end (I believe - I don't have a new enough version of the OS to run it). But anyways, it's not a theoretical idea. It has been implemented multiple times, and is demoed here (switching intonations while playing and doing polyphonic bends - and single polyphonic playing of this works fine against a lot of synths that have weak MIDI implementations):