Tuesday, December 27, 2011

MIDI and Over-Abstraction

Let me know specifically where this is wrong. I have actual code that works, so even if it's wrong, it's not theoretical - it works very well.

I am getting my thoughts together on the rationale for doing MIDI the way that I have. The feedback I get on what I am trying to do shows a divide between people that worry about keeping hardware simple versus those trying to keep software simple. It is instructive to first talk not about MIDI and sound rendering, but about OpenGL and drawing APIs.

Cooked Standards

The OpenGL that I am used to is the desktop 1.x versions of it. It provides a state machine that feels a lot like an assembly language for some real hardware device. But it is not painfully low level, because it provides the things that you are almost guaranteed to need in every app. The basic idea is that it will do buffering and matrix transforms for you, and you just feed it what you want. For example pseudo code:

setCameraPointLookingAt(x,y,z, atX,atY,atZ);
turnOnLightSource(lightX,lightY,lightZ, lightR,lightG,lightB);
vertex(x,y,z, r,g,b);
vertex(x+dx,y,z, r,g,b);
vertex(x+dx,y+dy,z, r,g,b);

What is really cool about doing things this way is that you don't have to supply all of the equations for moving points around in a 3D projection. You don't have to build any kind of API to buffer data. These things are provided for you, because you will need this functionality in every app. But newer versions of OpenGL threw all of this away. The main reason for it is that the huge number of APIs needed to cover every scenario for lighting and coordinate transforms keeps growing. Their existence makes the spec unstable. These 'fundamental' things such as lighting, buffering, and coordinate transforms are really conveniences that should be out of the standard.

Raw Standards

The problem with this is that something like OpenGL needs to make the implementation as low-level as possible, while still allowing broad compatibility. So, the requirement for programmable shaders does not mean that it should just be shoehorned into the existing specification. What they did was to introduce an even lower-level API as the new standard, where old code is essentially emulated on top of it. The new APIs are kind of like this:
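A rough sketch of that shape, written in plain Python rather than real OpenGL calls. Every name here (`mat_vec`, `translate`, `pack_vertices`) is hypothetical. The point is only that the client builds its own matrices and packs its own vertex buffer, and the raw API just receives bytes. In real OpenGL the matrix would be handed to a shader instead of being applied on the CPU as it is here.

```python
import struct

def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translate(tx, ty, tz):
    """Build a translation matrix; the client owns this math now."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def pack_vertices(vertices):
    """Pack (x, y, z, r, g, b) tuples into one flat float32 buffer."""
    return b"".join(struct.pack("<6f", *v) for v in vertices)

# Hypothetical triangle, transformed and buffered entirely on the client side.
model = translate(1.0, 0.0, 0.0)
triangle = [(0, 0, 0, 1, 0, 0), (1, 0, 0, 0, 1, 0), (1, 1, 0, 0, 0, 1)]
moved = []
for x, y, z, r, g, b in triangle:
    nx, ny, nz, _ = mat_vec(model, [x, y, z, 1])
    moved.append((nx, ny, nz, r, g, b))

# This flat byte string is the kind of thing a raw API would be handed.
vertex_buffer = pack_vertices(moved)
```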


So if you want the old-code look, then you use an API that looks like the old code, but ultimately turns into this underneath. If you need the new features, then you write a framework that lives inside of your framework that approximates the old APIs where you can, and goes its own way where it must do so. This keeps the specification from being larded with everything that the designers didn't think of. The hardware has a smaller burden of what it must recognize, as does the OpenGL graphics driver. But the burden moved to the programmer to use (or write) a library over top of it to get it going. In short, they solved the stability problem by making the primitives more general.

So newer versions of OpenGL recognize that in the end, all of this code configures shaders in the graphics pipeline and builds buffers of different kinds of vertices. So new versions of OpenGL don't even bother providing 'fundamental' things like lighting, buffering, and coordinate transforms. The burden is on the developer to create these things. This is not a hassle if OpenGL developers use a standard software library above the raw library. All that has happened is that complexity moved out of the hardware (or at least the API that hides the hardware) into the client software.

VRML and other SceneGraph APIs went nowhere. They provided really easy abstractions, at the expense of creating objects that were too high level. But in the end, the ability to simply make shaded/textured triangles won. Making the API too high level simplifies things for somebody who insists on writing directly to the hardware, at the expense of imposing limitations. The limitations need to be worked around, so then garbage creeps into the spec because the spec is too high level. For OpenGL, staying very low level is the strategy for having the spec withstand change.

It makes the applications more complex to build, in exchange for taking out the parts of the spec that are subject to much variation, by only including primitives.

MIDI Is Too High Level, Making The Spec Too (F-ing) Large

MIDI seriously suffers from this over-abstraction problem. Because hardware devices speaking raw MIDI with very little computational power on-board are the primary focus, it is understandable that the model is of turning notes on and off and bending them. In this view, it isn't the controller's job to: 1) determine exact pitch, 2) manipulate note on/off to create solo mode, or 3) manipulate the envelope of a note. This view is tightly tied to the assumption of discrete keyboard keys that send a MIDI message when the key goes down. The idea of completely continuous controllers could be shoe-horned in as an incompatible addition to the spec, but that's not a good thing because it takes an already complex spec and makes it more complicated without even being backwards compatible. What would be better is to make the primitive elements more general, and make controllers handle it themselves in a backwards compatible way.

MIDI isn't primitive enough. The abuses of the spec that are currently possible, but dicey, need to be standardized so that they are totally expected and legitimate. Note on/off are clearly fundamental. Bends are fundamental, but their semantics are underspecified. You should be able to compute a frequency given a note and a bend value, and work in actual frequencies. This is because frequencies are fundamental, even though MIDI doesn't quite get this concept. Envelopes on the sound are fundamental as well.
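As a minimal sketch of what "compute a frequency given a note and a bend value" could mean, under assumptions that MIDI itself does not guarantee: twelve-tone equal temperament, A4 (note 69) at 440 Hz, a bend range of 2 semitones either way, and a 14-bit bend value centred on 8192. The function names are made up for illustration.

```python
def bend_to_semitones(bend14, bend_range=2.0):
    """Map a 14-bit bend value (0..16383, centre 8192) to semitones.

    bend_range is an assumption; a synth may be configured differently.
    """
    return (bend14 - 8192) / 8192.0 * bend_range

def frequency(note, bend14=8192, bend_range=2.0, a4=440.0):
    """Frequency in Hz for a MIDI note number plus pitch bend."""
    semitones = note - 69 + bend_to_semitones(bend14, bend_range)
    return a4 * 2.0 ** (semitones / 12.0)
```

With these assumptions, note 69 bent to 12288 (one semitone up) lands on the same frequency as note 70 with no bend, which is exactly the kind of unambiguous answer the controller needs from the synth.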

Bend And Legato

Because almost all MIDI devices only support the on/off/bend messages, and ONLY that reliably, it's foolish to implement huge parts of the MIDI spec and simply demand that synths and controllers understand all of it. This is especially true on iOS, where such demands will simply result in incompatibility on the part of developers who will simply implement what they need to ship. I am sticking to what is known to work everywhere; and then using the NRPN to putty in the cracks in such a way that it is not a disaster if the synth doesn't understand. This is analogous to rendering 3D graphics as simple triangles if a special shader is unusable. This principle is also why web pages are not usually completely broken when different browsers are used against non-portable pages. A wrong pitch due to an unrecognized message is far more debilitating than getting a right pitch with a note re-trigger at an inappropriate time, especially because many patches don't have a noticeable attack anyway.

Fundamental Elements

So, these are the fundamental elements to build up a correctly functioning MIDI synth that has control over pitch, polyphony, and legato:

0x90 - note on/off (a note on with velocity 0 acts as a note off)
0x80 - note off (I send 0x90, but a synth should recognize 0x80 as equivalent)
0xe0 - bend (pitch bend setting still applies)
0xbxxxxx - a special 'note tie' NRPN that states that the next note off and note on pair are actually tied together.

The first rule is that there are no real 'notes', only frequencies that we make by a combination of midiNote number and bend. We don't try to re-tune the notes, but use pitch bend in combination with note to get the exact pitch. We also place all current notes into unique channels, and try to behave reasonably when this is not possible (ie: channel bend reflects last note down). This is the only really reasonable way to do this because the note on that we choose is what we *name* the note. This note name is what a piano that doesn't understand bends will play if it's asked to play our note.
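The note-plus-bend rule can be sketched as follows. This is an illustration, not the code that ships in Geo or AlephOne: it assumes A4 = 440 Hz, a bend range of 2 semitones, and hypothetical helper names. The nearest note number becomes the *name* of the note, and the remainder goes into the 14-bit bend, which is sent as two 7-bit data bytes (LSB first).

```python
import math

def note_and_bend(freq_hz, bend_range=2.0, a4=440.0):
    """Split a frequency into the nearest note number and a 14-bit bend."""
    exact = 69 + 12 * math.log2(freq_hz / a4)    # fractional note number
    note = max(0, min(127, int(round(exact))))   # the *name* of the note
    bend = 8192 + int(round((exact - note) / bend_range * 8192))
    return note, max(0, min(16383, bend))

def bend_message(channel, bend14):
    """Pitch bend: status 0xE0|channel, then 7-bit LSB and 7-bit MSB."""
    return bytes([0xE0 | channel, bend14 & 0x7F, (bend14 >> 7) & 0x7F])

def note_on_message(channel, note, velocity):
    """Note on: status 0x90|channel; velocity 0 is treated as note off."""
    return bytes([0x90 | channel, note, velocity])
```

A synth that ignores the bend message still plays `note`, which is the closest a piano can get to the requested pitch.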

Because we can always exceed the bend width, note tie says to 'continue note in the same state from the note turning off to the next one turning on'. This note can, and usually does change channels, because of the requirement for every note going down to go down into its own channel. You have to hold off on reusing a channel for as long as possible, because when a note is turned off, it will still respond to bends while it is releasing.
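One way to sketch the channel cycling and the cross-channel tie, with loud caveats: the class and function names are hypothetical, the events are symbolic tuples rather than real MIDI bytes, and `"note_tie"` is only a placeholder because the actual NRPN bytes are not spelled out here. Channel stealing is the "behave reasonably" fallback, and the bookkeeping for the stolen note is left to the caller.

```python
from collections import deque

class ChannelRotor:
    """Hand out channels so the most recently released one is reused last."""

    def __init__(self, channels=range(16)):
        self.free = deque(channels)   # front = released longest ago
        self.active = deque()         # front = oldest sounding note

    def acquire(self):
        if self.free:
            ch = self.free.popleft()
        else:
            # No free channel: steal the oldest one. The caller should
            # treat whatever note was sounding there as ended.
            ch = self.active.popleft()
        self.active.append(ch)
        return ch

    def release(self, ch):
        if ch in self.active:
            self.active.remove(ch)
            self.free.append(ch)      # goes to the back of the queue

def tie_across_channels(rotor, old_ch, old_note, new_note, bend14, velocity):
    """Symbolic events for continuing a sounding pitch under a new note name."""
    new_ch = rotor.acquire()
    if new_ch != old_ch:
        rotor.release(old_ch)
    return [("note_tie", old_ch, old_note),    # placeholder for the NRPN
            ("note_off", old_ch, old_note),
            ("bend", new_ch, bend14),          # set pitch before the note on
            ("note_on", new_ch, new_note, velocity)]
```

Because released channels go to the back of the queue, a note that is still ringing out in its release phase keeps its own pitch wheel for as long as other channels are available.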

Keeping It Simple In Software, Hardware Isn't Our Problem

We are not worried about requiring a full-blown brain in the client, as hardware vendors might object to. Moving complexity out of the synth and into the controller makes an incredible amount of sense on iOS. This is because the controller will need some of the low level details in order to render itself on the screen. We have the pitch implied by where the finger touches, the pitch that we are drifting to due to fretting rules, and the actual pitch being played. We need to know all of this information in the controller. The standard MIDI equivalent would simply have the controller knowing about where the fingers are, and being more-or-less ignorant of what the synth is doing with this information. So in our case, the controller manipulates the pitch wheel to create the fretting system, and the synth has no idea what intonation we are using. It's not the synth's business to know this.

Similarly with polyphony rules, the synth can't just have a 'solo mode' setting. AlephOne and Geo both have a per-string polyphony that essentially adds the concept of 'polyphony groups'. The polyphony groups act similar to channels in that the controller will turn notes on and off to get the right polyphony behavior. This way we can chord and do legato at the same time. It's a controller-dependent thing, and it's not the synth's business to know any of this.
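A toy sketch of a polyphony group, kept entirely in the controller. The rule used here (the most recent note in a group sounds, and lifting it falls back to the note still held underneath) is an assumption for illustration, not a description of what AlephOne or Geo actually do, and `PolyGroups` is a made-up name.

```python
class PolyGroups:
    """One sounding note per group (e.g. per string); groups sound together."""

    def __init__(self):
        self.held = {}   # group -> list of held notes, last one is sounding

    def press(self, group, note):
        stack = self.held.setdefault(group, [])
        if note in stack:
            return []                                  # already held, ignore
        events = []
        if stack:
            events.append(("off", group, stack[-1]))   # silence previous note
        stack.append(note)
        events.append(("on", group, note))
        return events

    def lift(self, group, note):
        stack = self.held.get(group, [])
        if note not in stack:
            return []
        was_sounding = stack[-1] == note
        stack.remove(note)
        if not was_sounding:
            return []                                  # it was already silent
        events = [("off", group, note)]
        if stack:
            events.append(("on", group, stack[-1]))    # fall back to held note
        return events
```

Two notes on different strings both sound (a chord), while two notes on the same string replace each other, and the synth only ever sees plain on and off events.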

Similarly with legato. Legato *usually* tracks polyphony by playing attack on the first note down in the polyphony group. But in reality, on a string instrument, whether to pick or legato a note is decided on a per-note basis. It's not a mode that is enabled or disabled for the whole controller.

Because almost nothing recognizes more than note on/off/bend, anything else that the MIDI spec states is quite irrelevant in practice. The note tie addresses something that nothing in the spec does, and doubles as the legato, and it's not a major problem if it's not implemented. To somebody implementing a synth, a small number of primitives (only one thing beyond the standard elements) gives good overall behavior.

There is also the issue of the same note being played multiple times. AlephOne does chorusing. It doesn't do this with any post-processing effects. It works by playing the same note, microtonally displaced, twice everywhere. This is one of the reasons why simply picking note numbers and bending them around is a bad idea. On a guitar, the high E note is played simultaneously from 2 or 3 positions all the time. The assumption that you bend a key is rooted in the idea of a keyboard with one key per 'note'.
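The doubled note can be sketched like this, assuming a bend range of 2 semitones and a detune of 7 cents either side (an arbitrary value picked for the example; the function names are also hypothetical). The same note number goes out on two channels, each with its own bend.

```python
def cents_to_bend(cents, bend_range=2.0):
    """14-bit bend value for an offset in cents, clamped to 0..16383."""
    bend = 8192 + int(round(cents / 100.0 / bend_range * 8192))
    return max(0, min(16383, bend))

def chorus_note(note, velocity, channels=(0, 1), detune_cents=7.0):
    """Raw MIDI bytes for one note sounded twice, detuned either side."""
    low, high = channels
    msgs = []
    for ch, cents in ((low, -detune_cents), (high, +detune_cents)):
        b = cents_to_bend(cents)
        msgs.append(bytes([0xE0 | ch, b & 0x7F, (b >> 7) & 0x7F]))  # bend first
        msgs.append(bytes([0x90 | ch, note, velocity]))              # then note on
    return msgs
```

If both copies shared one channel, the second bend would overwrite the first and the chorusing would collapse into a single pitch.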


So, yeah, this is a hack with current MIDI. OSC is too general (in the sense that OSC messages have no inherent semantics, just syntax - it's XML hell all over again). And what I have read of MIDI proposals that aren't made of on/off/bend seems unworkable in practice. If we are all on iOS and MIDI becomes too complex and not compatible with existing hardware anyway, we will simply throw the protocol away entirely and use on-board synths.

Real World Implementation

This is actually implemented in Geo Synthesizer and AlephOne on the client end, and in SampleWiz, ThumbJam (where it is especially good, specifically on the JR Zendrix patch), and Arctic Synth on the server end (I believe - I don't have a new enough version of the OS to run it). But anyways, it's not a theoretical idea. It has been implemented multiple times, and is demoed here (switching intonations while playing and doing polyphonic bends - and single polyphonic playing of this works fine against a lot of synths that have weak MIDI implementations):


  1. I don't think that MIDI is too high level, I think the opposite is true, that it's necessarily primitive due to its heritage. MIDI predates OpenGL by a full decade, and while OpenGL has a well-defined scope (rendering graphics on a monitor), MIDI was designed to accommodate everything from electronic pianos to drum triggering, to lighting rigs, to raw (SysEx) data transfer, and did it all in an era of 8-bit 64k computing, when 38400bps was an impressively fast data transfer rate (MIDI got popular even before ethernet was really around).

    It's been comparatively easier for specifications like OpenGL to keep up with the times, partly because video hardware has a pretty short life cycle -- once a video board is 3-5 years old, it's not reasonable to expect it to be supported by OS and app developers. With DirectX (Microsoft's equivalent to OpenGL), any hardware older than a couple of years simply will not run the current full-featured version of the video spec.

    That's harder to do with MIDI, because there are still plenty of 25+ year old synths in beloved operation, which are often irreplaceable by current technology (especially when there's analog circuitry involved). Plus the MIDI user base is much smaller than the entirety of people who want to play games or watch videos on their computer, so progress has been understandably slower.

    But there's been progress! There are plenty of more modern specifications for working with music and audio, like VST (and other plugin APIs like AU and RTAS), ReWire, OSC, Max/MSP, etc. And if you want to get past assumptions of 12-tone music, heck, you can do that with analog CV gear from the 1960s (CV typically uses a scheme of 1 volt per octave to encode pitch).

    I see your point about OSC being too general, but I think there's a necessary tradeoff between flexibility and easy interoperability -- for you, accommodating virtual string bending and per-string polyphony may be natural inclusions, but there's somebody else annoyed at the idea that they should be expected to describe their music in terms of "fundamental frequencies". And the semantics of MIDI are somewhat up for grabs too; we can agree that the number #36 means C1, but it also means "kick drum" and "pulse the light purple (or whatever)" too.

    I would highly encourage you to take a look at Ableton Live + Max4Live sometime; I think there's a free 2-week trial you can download, and it lets you do easily things like control a microtonal synth with your facial expressions; my mind boggles at the thought of what somebody like yourself could do with it. ;-)

  2. With all that said, you know what pisses me off? About MIDI. The simplest, most basic stuff - pitch control - doesn't work. It doesn't matter what does work in light of that:

    1) #36 and bend set at value 10000 doesn't have a perfectly well defined frequency 100% of the time. I play Maqam Bayati (quartertones) in GarageBand and it's kind of sort of, but not quite, right as long as I never chord anything. Forget microtones, simply tracking the pitch that the controller wants (per-finger bends) is too underspecified to work everywhere reliably. WTH... It's just specifying frequencies. It's 2012, and somebody brings up some device from 1984 to explain why this simple thing doesn't work.

    2) Starting at #0 and bending up 1 cent per second up to #127 is a perfectly well defined thing, yet I can't represent it in MIDI correctly.

    3) Everybody who has ever touched a touch screen wants to drop his fingers on E+G and bend it to F+A and have it just work correctly in spite of the fact that these are different intervals.

  3. 4) Because I make an instrument where 3) is totally obvious, I get bad reviews when 3) doesn't "just work". It can't "just work" without violating MIDI either as practiced (on/off/per-channel-14-bit-bend are the only things that work) or as MIDI is 'specified' (who cares what is specified... it has to work!, and work with what it's plugged into!). Everybody wants to just assign 1 channel for the whole thing and have it "just work". People have been dicking around with "looks like a piano" synths for so long that they are blind to how incredibly broken it all is. I am so tired of explaining this to users. My standard answer is that these scenarios will "just work" when we stop supporting MIDI.

    5) The MIDI spec doesn't really say what happens in OMNI mode, other than "responding to all channels". But I will tell you what happens: it behaves as if every channel is replaced with a zero! It produces super-wrong behavior in this instance! If I play two pitches in different channels and bend them in different directions and run them through omni, the pitches are just wrong. If I play the same note #36 and interleave the note up, I get a silence when the first note comes up, rather than after they both come up. Omni changes what 'correct order' for notes is.

    6) When a note is turned off, you can still hear it during release. So you can't touch its pitch wheel until it's done releasing. When is that? So I just cycle the channels to pick the last currently released one. But this massively complicates everything else. Because of this one thing, note ties have to be able to cross channels.

    7) Send out 1 MIDI stream to something that doesn't know bends and one that does. You really should rename E to F as you bend to F# for the sake of the piano. This is why note ties are a critical oversight in the spec. There is no notion of degrading gracefully in the spec.
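    Points 3) and 5) above can be put into numbers. This is a sketch that assumes a bend range of 2 semitones and models omni the way it is described in 5), as every channel collapsing into one so that the last bend received applies to every note. The helper names are made up.

```python
def bend_for(semitones, bend_range=2.0):
    """14-bit bend for an offset in semitones, clamped to the legal range."""
    return max(0, min(16383, 8192 + int(round(semitones / bend_range * 8192))))

def sounding_pitch(note, bend14, bend_range=2.0):
    """Fractional note number that a note plus bend should produce."""
    return note + (bend14 - 8192) / 8192.0 * bend_range

E4, G4 = 64, 67
# One channel per finger: each note gets its own bend.
per_channel = {0: (E4, bend_for(1.0)),    # E up a semitone to F
               1: (G4, bend_for(2.0))}    # G up a whole tone to A
separate = [sounding_pitch(n, b) for n, b in per_channel.values()]

# Omni (or a single shared channel): the last bend received applies to both.
last_bend = per_channel[1][1]
collapsed = [sounding_pitch(n, last_bend) for n, _ in per_channel.values()]
```

    With separate channels the result is F and (within a fraction of a cent) A. Collapsed onto one channel, the E is dragged up a whole tone along with the G and lands on F# instead of F.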

    The MIDI spec also says that it's up to the implementer to either layer duplicate notes or to treat on/off as state changes. If you assume they picked the second option, but the synth behaves as the first, then you get stuck notes! (ie: for note #33 ... on on off off should produce note/note/silence... but on on off ... depends on the synth. etc.)
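    The two readings can be modelled in a few lines. Both functions are hypothetical stand-ins for a synth's voice handling on a single note number, and they only count how many voices are left sounding after a sequence of events.

```python
def layered_voices(events):
    """Synth that layers duplicates: every 'on' adds a voice."""
    count = 0
    for e in events:
        count = count + 1 if e == "on" else max(0, count - 1)
    return count

def stateful_voices(events):
    """Synth that treats on/off as state: the note is either on or off."""
    sounding = False
    for e in events:
        sounding = (e == "on")
    return 1 if sounding else 0
```

    For "on on off", the layering synth still has one voice sounding while the state-change synth is silent. A controller that assumed state changes has just left a stuck note on the layering synth.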

    So, in light of this, I really don't care about the other esoteric stuff that MIDI does support. It's an inefficient generalized byte tunnel where the client and server have to agree ahead of time if you try to fix it and be incompatible with what exists, or you can have it not work as it should if you stick to the spec. I'm not even doing anything complex with expression parameters or Timbre... FFS, I am just trying to get totally correct pitch handling, and that is all!

    TouchScreens are the breaking point. The MIDI standard will fork, be bypassed, or the hacks that actually work will be standardized behind a library so that it will work consistently across devices.

  4. In any case... my point is, the fundamental messages that can be used to correct pitch behavior in a 100% way need to be clarified in the standard so that it's not producing wrong behavior (ie: common sense behavior... not slavishly producing wrong pitches according to spec!). This means that note ties need to be added to the spec, and pitch bend needs to be defined as meaning that it's definitely 1 whole tone up or down until you send a message to change it... the pitch implied by note+bend must be unambiguous at all times.

    I brought up OpenGL as an example of this principle of clarifying the primitives and throwing out all the shit on top of the primitives that don't need to be in the standard. The reality is that developers will implement some amount of the standard and ship when they reach diminishing returns. So, the standard that they need to implement must be 100% correct (with a few features) after only a few primitives are implemented.

  5. I agree with you about the shortcomings of MIDI, but I also understand why things are the way that they are, both because of MIDI's history, and the fact that what's there is perfectly suitable for most people, a great majority of the time.

    The "fork or bypass" of MIDI has already been happening in the desktop/laptop world for several years now. Plugins within a host interoperate via VST (etc.), apps communicate with other apps with ReWire, and if you want to do something experimental with hardware, it's pretty simple to roll your own solution with OSC, Max/MSP and so forth (there is already a bunch of Arduino code that makes it easy peasy for homebrew hardware to talk to Max, Pd, MIDI, etc.).

    Really the only piece that isn't there yet is being able to do it all on a specific $500 device that runs a highly proprietary operating system.

    Besides, the big hurdle isn't mature audio APIs for iOS, it's the fact that current touchscreen technology isn't really suited for a precise, expressive input device. The lack of non-hackish pressure sensitivity (velocity AND aftertouch please), and total lack of haptic feedback, is a real barrier to connecting with the device in the same way that I can with a traditional instrument or mechanical controller. Touchscreens are tremendously versatile and convenient, but versatility and convenience can be dangerous distractions to making music -- what makes music/art GOOD is very often a function of its limitations.

  6. What already happened is that because a MIDI byte pipe is already sanctioned on iOS, and there isn't any other reasonable IPC mechanism, people are tunneling audio buffers through sysex. It will be like audio marked with a form of MIDI to describe the audio content. OSC doesn't seem like it will succeed because of loose semantics, and on iOS, there is not a fast pipe to pass it over.

    So MIDI will get abused into a protocol that meets the requirements. It won't really be all that compatible with MIDI by the time everything works as it should.

    The pressure and velocity are my other pet peeve issues, but first things first... A continuous fretboard with dynamic fretting is what is going to change things. It may become a main motive to fix the pressure and velocity problem.