Tuesday, September 29, 2015

Correspondence Chess, Latency, and Offline Progress

Offline Applications


In server and desktop systems, it is normal to simply assume that connectivity works when both client and server are up and running.  In this style of application, errors are shown to the user whenever an attempt is made to contact the other party while it is unreachable.  That is somewhat tolerable in desktop and server applications, because connectivity is generally wired and highly reliable.  But it is completely intolerable for just about every other kind of situation.  TCP, for example, masks packet loss at the byte-stream level and hides errors that don't result in missed deadlines; yet most TCP apps report errors as soon as the underlying connection resets, in spite of the fact that there is still plenty of time to try again.

In mobile applications, bad connectivity is frequent.  For phones, it is caused by subway tunnels, low batteries, and simply moving out of range.  It is a terrible user experience to be in the middle of web browsing, try to bring up something that was already loaded, and have it get lost because the network is down.  This is simply bad application design.  HTML5 supports offline apps, but it is still unusual to find even mobile apps that behave correctly in the face of being mostly offline.

For servers, outages are rare, but they can be caused by maintenance mishaps, or outages of dependent services (such as the ISP).  If a client can specify a real-time deadline for an answer, then it doesn't matter what connectivity looks like if answers always come back on time.  Especially when deadlines are long (on the order of hours or minutes), machines can simply wake up and connect periodically to stay up to date.  This allows battery powered machines experiencing low demand to spend most of their lives hibernating.  Note that the latency of such long deadlines says nothing about throughput of requests, which may actually be required to be high in some circumstances.

One thing that a TCP stream cannot do is hibernate a session for a truly long period of time.  TCP streams cannot take out leases of hours or days, and they do not persist across process or operating system restarts.

Queued Transaction Processing


What is necessary is for machines that are executing a protocol (having a session) to be able to simply queue up outbound requests while there is nothing ready to receive them.  If all requests have deadlines on them, we can get errors from our local queue when a deadline passes without a response.  If a deadline passes while an unacknowledged request is being serviced, we must assume that the other party may have given up.  And for any unacknowledged tasks in progress, we must be able to send a cancellation request and get back an answer (with a definite deadline of its own) as to whether the cancellation succeeded.
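As a rough illustration, here is a minimal sketch (hypothetical names, not a real library) of such a local outbound queue: requests wait locally while the other party is unreachable, the queue itself raises errors when deadlines pass, and everything drains when connectivity returns.  Cancellation requests would flow through the same queue with deadlines of their own.

import heapq, time

class DeadlineExpired(Exception):
    pass

class OutboundQueue:
    def __init__(self):
        self.pending = []   # (deadline, sequence, request), ordered by deadline
        self.sequence = 0   # tie-breaker so request objects never get compared

    def submit(self, request, deadline):
        # queue locally; nothing is sent until the other party is reachable
        self.sequence += 1
        heapq.heappush(self.pending, (deadline, self.sequence, request))

    def expire(self, now=None):
        # the local queue reports deadline errors; no connectivity required
        now = time.time() if now is None else now
        while self.pending and self.pending[0][0] < now:
            deadline, _, request = heapq.heappop(self.pending)
            request.fail(DeadlineExpired())   # request.fail() is assumed here, not standard

    def drain(self, send):
        # called whenever the other party becomes reachable
        self.expire()
        while self.pending:
            deadline, _, request = heapq.heappop(self.pending)
            send(request, deadline)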

Even the notion of whether the other party is "online" is expressed merely in terms of deadlines.  If the other party doesn't respond to a request to exchange packets in a timely fashion, it is offline for all practical purposes.  A server that exchanges packets with us in a timely fashion can be considered online if it meets the deadline for that criterion, though it may actually have been asleep when the first attempt to contact it was made.

Correspondence Chess


Masking errors when the network is down is a rather obvious thing to do.  Something that is far less obvious is to handle "Phone Tag" situations to dramatically reduce latency.  Consider a game of correspondence chess:

  1. e4 e5  2. Nf3 Nc6  3. Bc4 Bc5



Now assume that each player can only make one move per day, at 8:30am in their own timezone on opposite ends of the world, while having breakfast before going into work. There is never a time when both parties are actually online.  Even with moves forwarded into a third party queue that is always up, a 30 move game is going to take two weeks. This is intolerable, particularly because almost all of the moves are predictable. Action should only stop when surprises arise.

So, let's model correspondence chess as message passing state machines.  The state machine is the combination of the board position and who moves next.  We have moves, and contingencies.  Contingencies are moves that we foresaw the other side making.  The first player, Alice, plays a game that starts in state A.

A -> B
B --> C
C -> D
B --> E
E -> F

The single arrow is a move being committed to (the actual move is not specified; we are only tracking state changes to the game).  From any state, there is only one single arrow that moves to a new state.  But states that have double arrows can have multiple possibilities.  These are all the possibilities that we foresaw, and after each of them we will generally include our response.  Bob, who is playing the other side of the game, cannot see Alice's thought process.  But Bob thinks and makes predictions of his own:

A --> B
B -> C
C --> D
D -> G
A --> H
H -> I

Bob is expecting Alice to either move the game into state B or H.  If Alice gets Bob's messages, she can simply append them:

A -> B
B --> C
C -> D
B --> E
E -> F
A --> B
B -> C
C --> D
D -> G
A --> H
H -> I

Alice can use a naive algorithm that is similar to computing group permutations.  Knowing that we are in state A, we see that it will transition to B.  If there were no prediction of this, then the answer to this merge would simply be "B", in which case the game resumes at state "B", and Alice assumes that Bob will come to the same conclusion.  But if we walk through the list (looping around to the top if we have to), we see "A --> B".  This means that it was predicted.  So we search for "B -> " to figure out what the next state is.  The answer is C.  We walk forward through the list and wrap around to find "B --> C", so we use one more move to find D.  The transition to D was predicted, due to "C --> D".  That gives us one more transition to G.  "D --> G" is not in the list, so the transitions stop moving forward.  The final state is now G, which Bob put the game into.  Alice now plays from G to J, and doesn't bother making a prediction yet.

G -> J

When Bob gets Alice's data, he appends Alice's initial exchange to his list as well.  The lists are in a different order, but the answer comes out the same.  They both agree that the game has progressed to state G.  In doing this, they agreed upon "A -> B -> C -> D -> G".  If they can continue to make 5 moves per exchange, then the game completes 5 times faster.  Alice has decided to move from G to J, but hasn't communicated that to Bob yet.  Alice insists that the game is at J, while Bob insists that the game is at G.  But this is consistent, since Alice passed the game through G as well.  When they exchange again, they will both pass through J.
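Here is a minimal sketch of that merge walk (the tuple representation and function names are just illustrative assumptions).  Commits are the single arrows, predictions are the double arrows, and each side merges by appending the other's transitions and walking forward until it hits a committed move that was not predicted:

def merge(transitions, start):
    # transitions: (state, next_state, kind), where kind is "commit" (->) or "predict" (-->)
    commits = {(a, b) for (a, b, kind) in transitions if kind == "commit"}
    predictions = {(a, b) for (a, b, kind) in transitions if kind == "predict"}
    state = start
    while True:
        committed = [b for (a, b) in commits if a == state]
        if not committed:
            return state            # nobody has committed a move from here yet
        nxt = committed[0]          # only one committed move can leave any state
        if (state, nxt) not in predictions:
            return nxt              # a surprise move: apply it, then stop walking
        state = nxt                 # predicted move: keep walking forward

alice = [("A","B","commit"), ("B","C","predict"), ("C","D","commit"),
         ("B","E","predict"), ("E","F","commit")]
bob   = [("A","B","predict"), ("B","C","commit"), ("C","D","predict"),
         ("D","G","commit"), ("A","H","predict"), ("H","I","commit")]

print(merge(alice + bob, "A"))   # G
print(merge(bob + alice, "A"))   # G - the order of the exchange doesn't matter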

Now imagine that when Alice and Bob play, they are offline almost all the time.  When they realize that the other side is offline, they just keep making more predictions.  Within the trade-offs of sending huge predictions (and the power consumption of simply making the predictions) versus minimizing pointless round trips, they make as much progress per exchange as possible.  Other games, like Go, can be much longer (300 moves!), with more boring sequences in between the surprises that cause prediction to fail.  That length is what makes offline correspondence Go take unbearably long.

Notice that this is an O(n^2) algorithm.  It can be made to handle large exchanges by sorting all transitions so that lookups become binary searches, which reduces the algorithm to O(n lg n).  The exchanged move predictions can be garbage collected to start at the last state that both sides agree on.  However, for actual games, keeping the full record around is useful for analyzing how predictions failed.

Correspondence Protocols


Now imagine that we substitute networking protocols in general for games.  Paxos, the distributed protocol family, uses the distribution of a change log to keep multiple copies of a state machine synchronized.  When the responses within a protocol can only take a few values (ie: booleans, enumerations), both sides can specify their predictions as far out into the future as possible.  This is conceptually similar to TCP windows moving the stream forward on both sides.

Like Paxos, there isn't a reason to limit the exchange to only 2 participants.  If a protocol has 3 or more parties taking turns, exchanging messages for correspondence gets all parties up to date.

Note that, in general, network protocols with a lot of chatty back and forth are often badly designed.  Relational database protocols (ie: MySQL, Postgres) often have this problem.  When client and server are co-located, there doesn't seem to be an obvious issue.  But put them over a high bandwidth-delay link such as a satellite, and performance can become terrible even when throughput should be very high.  Doing correspondence handles foreseeable back-and-forth exchanges to minimize latency.  We actually observed this phenomenon when a move to an application server happened to batch up queries to minimize round trips, creating extremely dramatic performance improvements.

When working with people in other timezones, it may be advantageous to handle requirements ambiguities by coding all of the possible solutions into git branches.  When there is a back and forth later on about what the requirements actually are, we find out which branch we should be merging.

Delay/Disruption Tolerant Networking

Some of these ideas are not new, because they are exactly what happens in the extreme case of space travel.  As nodes orbit and move around, connectivity is only possible during contact opportunities.  So DTN protocols handle this at the level of routing.  The main idea is to persistently queue data until there is a send opportunity.  If the destination node is never reachable directly, then sending to some intermediate nodes that queue up the data (and take custody over it) will have to do.  The RFCs mention designing application protocols to minimize round trips, without ever detailing such a scenario.  This implies that even though the problem is generic, it is treated as an application protocol concern.  It is also not clear whether timeouts for messages are always put in at the DTN layer.  This is important, because if connectivity is out 95% of the time, then during the 5% window when communication is working you need to forward all of the data that built up in queues during the other 95%.

For terrestrial apps, DTN may be overkill in a mobile setting, because the unreliability isn't so much in the routing as in the clients' connections to the internet.  For offline correspondence apps, it may be perfectly fine to treat "The Internet" as up or down as a whole, because the point is for clients to queue up outgoing requests and speculate to minimize round trips (ie: playing out all expected responses in game trees).  When there is connectivity, it will probably be reliable and have enough bandwidth to drain the client queue.  The intermediate queue will almost always be up because it is in a replicated data center.

Generalizations


Having an actual state machine might be too stringent.  In many cases, it may be better to think not in terms of state machines, but preconditions.  A precondition that we are in an exact state matches exactly one state.  A precondition that is a bit weaker could match a large number of states.  But we may be certain of how we need to respond to it.  For instance, if our king goes into check during Chess while there is exactly one way out of it, we might as well commit to making that response ahead of time.  In Go endgames, the game may break down into independent subgames.  We may want to commit to our response to moves in the independent subgames, rather than computing all of the combinations of states exactly. 
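As a rough sketch of that idea (the position object and its methods here are hypothetical), predictions become precondition/response pairs rather than an enumeration of exact states:

prediction_rules = [
    # (precondition over positions, committed reply for any position matching it)
    (lambda pos: pos.in_check() and len(pos.legal_moves()) == 1,
     lambda pos: pos.legal_moves()[0]),
]

def committed_reply(pos):
    for precondition, reply in prediction_rules:
        if precondition(pos):
            return reply(pos)       # we already committed to this response
    return None                     # a surprise: stop and wait for a real decision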



Thursday, February 12, 2015

Signed, Unsigned, and Circular

Signed, Unsigned, Circular


In doing networking work and related kernel work, we have seen every new developer run into a variation of a simple problem with the use of counters that wrap around.  In the case of TCP, it is sequence numbers.  In the case of kernel work, it is jiffies.  It also happens when you are waiting for your two digit number to be called in a fast food place.

Suppose that two events happen, named a and b.  The event a happened earlier in real time than b happened.  Records of these events are stamped with a limited resolution timer, and upper bits that tell you the true ordering are discarded.  To make the problem easier to reason about, we will use an 8-bit timer.
struct Event { uint8 sequenceNumber; .... }
We are assuming a two's complement representation of integers, the standard computer integer representation.  So, should these timestamps really be signed or unsigned values?  They are usually implemented using unsigned values, but they are actually neither.  People get bugs in their code by asking whether one event happened before another by comparing sequenceNumber with the less-than operator, which is wrong.

Geometry Of Circular Numbers


Firstly, two's complement numbers are circular in some sense just by virtue of their limited number of bits.  For all of them, addition and multiplication work in the same way.  But whether a number is signed, unsigned, or circular depends on: the definition of bit extension when assigning to a larger number of bits, the definition of the comparators, and the definition of whether some arithmetic overflowed.  Beware of ignoring these distinctions!

Assume that events are stamped around a clock going clockwise.  Event a is defined to be less than b when the diameter (a,a') contains b in the arc clockwise from it.  In this graphic, the diameter (b,b') contains a only in the counterclockwise arc, so b is greater than a.  This is the same thing as saying that the distance from a to b is positive (as a signed integer).

The comparison is only valid if the real distance (including the lost upper bits in the sequence number) between a and b is less than half of the total distance around the circle.  Also, circular numbers cannot really be extended (ie: assigned to a value with more bits) without invalidating ordering relationships of the original values.

Because they can't really be sign extended (without changing their values), and have their own comparison operators, they really should have their own type.  (Don't get me started about network versus host ordered ints not being native!)
circular int8 sequenceNumber;
In the Linux kernel headers, jiffies.h explicitly initializes the jiffies counter on bootup to five minutes before wrap-around (negative five minutes, in effect), so that the kernel will break five minutes after startup if you treat jiffies like plain unsigned integers.

Circular numbers are in use when you see macros that redefine comparison like:  "((signed)b - (signed)a) > 0", where b and a are usually unsigned storage types.
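As a small illustration of that comparison (a sketch, assuming 8-bit sequence numbers held in ordinary Python ints), the circular "happened before" test masks the difference back down to the counter's width and checks that it lands in the positive half of the circle:

def seq_before(a, b, bits=8):
    mask = (1 << bits) - 1
    half = 1 << (bits - 1)
    d = (b - a) & mask        # circular distance from a to b
    return 0 < d < half       # positive when read as a signed number of that width

print(seq_before(250, 3))     # True: 3 happened after 250, across the wrap
print(seq_before(3, 250))     # False
print(250 < 3)                # False: the plain unsigned < gets this wrong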

Geometry of Signed and Unsigned Numbers


The point a is considered (signed) less than b if, clockwise along the arc from the maximum negative number m, we first encounter a.  Signed values are extended by copying the highest bit (the sign bit) into the new high-order bits; ie: extend negative numbers with 1s and positives with 0s.

The point b is considered (unsigned) less than a if, clockwise along the arc from zero (z), we first encounter b.  Unsigned values are extended by padding the new high-order bits with 0s.



Tuesday, August 20, 2013

Types and Calling Orders


Assuming that a blob of code gets re-factored into smaller functions once it reaches some length, having no type system means that you can't cut off the search for what could go wrong at the inputs, outputs, and very limited side-effects.  Being able to cut the search off there is like being able to prove that something is right with O(n) code review, rather than anywhere from O(n lg n) code review up to O(n^2) for code full of side-effects.

If an API has M methods that only have a small number of valid calling orders, then there might be O(M!) wrong orders that work just well enough to be ticking time bombs.  Types can be used to document/enforce calling orders almost up to a regular expression, but not up to a grammar, ie: provably balanced alloc/free, push/pop, etc.  (todo: Enforcing up to a grammar may be possible with a linear type system, ie: Rust.)  Documenting calling orders is possibly more important than input/output type enforcement for this reason; which is why people read examples rather than auto-generated docs.
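As a loose sketch of documenting calling order with types (a made-up connection API, purely for illustration), each state gets its own type, so the only call orders that type-check are the valid ones, roughly the regular expression connect send* close:

class ClosedConnection:
    def connect(self) -> "OpenConnection":
        # ...open the underlying socket here...
        return OpenConnection()

class OpenConnection:
    def send(self, data: bytes) -> "OpenConnection":
        # ...write the data here...
        return self

    def close(self) -> ClosedConnection:
        # ...tear the connection down here...
        return ClosedConnection()

# Valid order: connect, send any number of times, close.
conn = ClosedConnection().connect().send(b"hello").send(b"world").close()
# ClosedConnection().send(b"x") is rejected by a type checker: no such method.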

This extra sensitivity to ordering is what makes shared-memory multi-threaded code so much harder than share-nothing concurrent code, in a measurable, exhaustive bug-search sense.  It almost always has to do with two things going on that don't commute.  It's a common phenomenon (commutators) in mathematics, as the launching point where something that's trivial suddenly becomes hard:

A * inverse(A) = 1, and B * inverse(B) = 1

Yet:

A * B * inverse(A) * inverse(B) != 1

Because A and B don't completely commute.  It's why a Rubik's cube is trivial to solve no matter how many times you scramble opposite faces, but suddenly becomes a real problem once you start mixing faces that share an edge.  A lot of important mathematical problems are rooted in this.  It's exactly this that causes two tasks that share resources to be more complicated than the two tasks in isolation.
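A quick numeric illustration of that commutator idea (using 2x2 matrices, since matrix multiplication famously doesn't commute):

import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [1.0, 1.0]])
I = np.eye(2)

print(np.allclose(A @ np.linalg.inv(A), I))                          # True
print(np.allclose(B @ np.linalg.inv(B), I))                          # True
print(np.allclose(A @ B @ np.linalg.inv(A) @ np.linalg.inv(B), I))   # False: A and B don't commute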


In addition to the actual ordering, a lot of type systems are lacking in that they don't support subset types.  The most important one is making every pointer a non-null value by default, where casting a nullable X ptr to a non-null X ptr throws an exception at cast time when it is null, rather than ever allowing a nullable X ptr de-reference attempt (at some random location!) later in the code.  Other examples would be simple ranges like [0..12] as types, so that there need not be runtime checks (ie: casting 13 to one of these throws an exception; it's not possible to find an out-of-range value assigned to one).
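A tiny sketch of that subset-type idea (hypothetical class, checked once at construction rather than at every use site):

class Range0to12:
    # the "cast" happens here; an out-of-range value can never exist afterwards
    def __init__(self, v: int):
        if not 0 <= v <= 12:
            raise ValueError(f"{v} is outside [0..12]")
        self.value = v

def schedule(hour: Range0to12) -> None:
    print(f"scheduled at {hour.value}")   # no runtime range check needed here

schedule(Range0to12(11))   # fine
schedule(Range0to12(13))   # fails at the cast, not at some later use site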


Not all type systems are perfect, but I gotta shake my head when I hear that not having a type system is somehow more productive.  It's procrastination.  You end up revisiting whipped-up code later as it breaks during major re-factoring efforts and writing and maintaining unit tests that essentially check types, assertions, and calling orders.

Friday, April 19, 2013

Upsert Counters In Postgres

Upsert 


When faced with a situation in which you are inserting huge volumes of data ('log volumes of data', think append-only or circular queue tables) and need to keep some statistics, you would like to have counters.  In this situation, I am working with Postgres, but upsert concepts also exist in MySQL apparently.  This is related to a common situation in which you know the primary key and some initial values for a datapoint (to describe the object that the primary key refers to), along with data that will be updated constantly.  You don't want to look these up to see if they exist in order to do an update on them, because you want to do a huge number of inserts in a batch.  The simplest upsert is like this:

sql = "with upsert as
(update ourtable
  set mutable1 = ?
  where id=?
  returning id)
insert into ourtable
(id,constant1,mutable1)
select ?,?,?
where not exists (select 1 from upsert)"

#pseudocode usage like this...
def OurTableUpsert(statement,id,constant1,mutable1):
  #move on to the next item
  statement.item++
  statement.row[statement.item].args[0] = mutable1   #update: new value
  statement.row[statement.item].args[1] = id         #update: which row
  statement.row[statement.item].args[2] = id         #insert: id
  statement.row[statement.item].args[3] = constant1  #insert: constant field
  statement.row[statement.item].args[4] = mutable1   #insert: initial value

#Make a giant batch of updates against empty table
statement = prepare(sql)
OurTableUpsert(statement,8,'Ovechkin Goals',1)
....
OurTableUpsert(statement,8,'Ovechkin Goals',2)
...
OurTableUpsert(statement,1003,'Planets',9)
...
OurTableUpsert(statement,1003,'Planets',8)
...
execute(statement)

Which lets us set mutable1 of ourtable to a value given an id, whether this row already exists or not.  This is important because we want to make huge batch updates, and don't have time to go figure out what already exists.  It prevents us from having to do a select on the items to generate inserts on the ones that don't exist and updates on the ones that do exist, and lets us do everything as one batch statement.  The constant fields like constant1 are never updated, only supplied with initial values, and the mutable fields mutable1 are constantly updated.


Counters


A more complicated example is a collection of counters.  The counters should be sparse, such that values of zero don't have a row in the database, and you only write back to the database to increment their values.

Updates alone don't let us do counter-like behavior.  The upsert is general enough that we can supply incremental values.  We change the statement to make the arguments supply increments, and partition the increments by day:

sql = "with upsert as
(update ourcounters
  set counter = (? + counter)
  where id=?
  and starttime <= now() and now() < stoptime
  returning id)
insert into ourcounters
(id,counterdescription,counter,starttime,stoptime)
select ?,?,?,
  date_trunc('day',now()),
  date_trunc('day',now() + interval '1 day')
where not exists (select 1 from upsert)"

#run something like this
def OurCounters(id,description,increment):
  #move on to the next item
  statement.item++
  statement.row[statement.item].args[0] = increment    #update: the increment
  statement.row[statement.item].args[1] = id           #update: which row
  statement.row[statement.item].args[2] = id           #insert: id
  statement.row[statement.item].args[3] = description  #insert: description
  statement.row[statement.item].args[4] = increment    #insert: initial counter value

#Increment a whole bunch of counters
statement = prepare(sql)
OurCounters(1,'United States',235)
OurCounters(44,'UK',53)

OurCounters(44,'UK',99)
OurCounters(1,'United States',23523)
...
execute(statement)

You can turn this into a stored procedure and supply it with a huge list of counter values to increment.  These are sparse counters, so there won't be a row in the database for them if the value is zero.  It increments the counter for the current day so that you can retrieve total counters over a window of days.  The trick here is that the raw counts get bucketed by day.  That way, when you start the application up from the beginning, you can retrieve a total count:

select id,counterdescription,sum(counter) from ourcounters group by id,counterdescription

And then in memory, keep increments to the totals that last came from (or were written to) the database.  Periodically you write them out, but as increments into the current day's bucket.  This way you can keep incremental statistics on a large amount of data, with only a lightweight query on first startup, an efficient way to write updates, and the ability to filter on an initial time window.
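A minimal sketch of that in-memory pattern (the names here are hypothetical; write_increment stands in for issuing one row of the upsert batch above, e.g. via OurCounters): keep only the increments accumulated since the last flush, and periodically write them into the current day's bucket.

pending = {}    # id -> [description, increment accumulated since the last flush]

def bump(counter_id, description, amount=1):
    entry = pending.setdefault(counter_id, [description, 0])
    entry[1] += amount

def flush(write_increment):
    # write_increment(id, description, increment) issues one upsert row
    for counter_id, (description, increment) in pending.items():
        write_increment(counter_id, description, increment)
    pending.clear()

bump(1, 'United States', 235)
bump(1, 'United States', 23523)
bump(44, 'UK', 53)
flush(lambda i, d, inc: print(i, d, inc))   # stand-in for the real upsert batch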

Summary


I have worked on relational database systems that generally deal with a fast ingest of events (where a mismatch between the insert and delete rate becomes a severe problem when you reach the maximum number of rows you would like to maintain in the database).  With excursions off into Hadoop, Cassandra, Solr, etc, I keep running into situations where relational databases are being used as ring buffers or as append-only immutable data, while the rest boils down to having a great system for counters (including nested counters).  Upsert was something that I had not seen until recently.  It looks like it will come in handy in a lot of places.



Monday, March 25, 2013

Git Merge Versus Rebase

Git Merge Versus Rebase


You will see a lot of strange advice in Google searches on merge versus rebase, and how to deal with conflict resolution.    The advice conflicts, so it obviously isn't all good advice.  This link here has it right:

http://gitguru.com/2009/02/03/rebase-v-merge-in-git/

The biggest usability goof in the git interface is simply that "git pull" isn't a rebase by default.  Rebase causes the commits to appear such that incoming changes land in between the old origin state and our changes, and our edits are applied on top of the incoming changes.  (Read gitguru's article until it makes sense!)

The fact that rebase rewrites history is feared to be a problem by some developers, but it's not a problem as long as the pre-rebased (and pre-squashed) commits were never pushed up into the origin's master branch.  In other words, you can rebase that which is still local to your machine, or only cloned into a throw-away branch that somebody reviewing your code is using.

The Common Case


Your pull (from your master branch) should always be "git pull --rebase", or similarly "git fetch; git rebase", except for a few special cases.  Doing fetch, rebase separately lets you deal with rebase conflicts in the master (which you should not have when you follow this discipline).  It would also be helpful if git branches were forced to be named like file paths to reflect the history of how branches were created.  Consider this slightly verbose way of working:


#Alice has been asked to add Xmpp and Gopher protocol support
#so we make the tree of branches
git clone git://origina/ManhattanProject
cd ManhattanProject
git checkout master
git checkout -b masterProtocolSupport
git checkout -b masterProtocolSupportXmpp
git checkout masterProtocolSupport
git checkout -b masterProtocolSupportGopher

#Alice is asked to fix an SSL bug
git checkout master
git checkout -b masterFixSSLBug

Each bug is in its own branch, because only a subset of these branches will ultimately be approved to be pushed into the repository.  The clone creates a leaf off of the origin master branch.  Doing a checkout -b creates a leaf off of the currently checked out branch and changes over to that branch.  Alice gets to work, making commits into these branches:

masterProtocolSupportGopher
masterProtocolSupportXmpp
masterFixSSLBug

Alice is told that there were bug fixes checked into master and pushed to the origin.  We always follow the rule that we begin at the master and rebase before merge, so that all merges will fast-forward.  If there are any conflicts, the conflicts are always resolved in rebases - not in merges.  (This rule is only violated in scenarios where we have to merge trees across different origin servers, where the common ancestor commit is too far in the past.)  This is the default scenario:

git checkout master
git pull --rebase
#we should only get conflicts when rebasing leaf branches - if you get a conflict here, you screwed up
#stop checking into your local copy of master after you fix this mess.
git checkout masterProtocolSupport
git rebase master
git checkout masterProtocolSupportXmpp
#may get conflicts on this rebase - fix them, "git add" the fixes,
#then "git rebase --continue"
git rebase masterProtocolSupport

git checkout masterProtocolSupportGopher
#fix conflicts here
git rebase masterProtocolSupport

Assuming that the conflicts you get in the Xmpp versus Gopher branches are distinct, this is straightforward.  You then need to figure out what subset of your work will be approved to be merged into the repository.  So, somebody pulls Alice's masterProtocolSupportGopher and masterProtocolSupportXmpp branches.  The reviewer decides that the Gopher support isn't worth the trouble and tells Alice to discard the Gopher work and push the Xmpp branch.  The SSL fixes are also approved without controversy.  So we need to make a test branch that includes all that is approved:

#start from the root of the tree and just ensure that we rebase before we push in all cases.
#if fastforward merge fails, then fix it in the leaf and merge it up rather than messing with non-leaf
git checkout master
git pull --rebase
git checkout masterProtocolSupport
git rebase master
git checkout masterProtocolSupportXmpp 
git rebase masterProtocolSupport #squash commits after this
git checkout masterFixSSLBug 
git rebase master #squash commits after this
#we are rebased

#make a disposable testmaster branch that merges all approved work
#that we can run tests on.  merge into test rather than master
git checkout master
git checkout -b testmaster
#merge everything down into test
git checkout masterProtocolSupport
git merge masterProtocolSupportXmpp
git checkout testmaster
git merge masterProtocolSupport #merge approved branches
git merge masterFixSSLBug #merge approved branches

#this proves that a push to origin master will 
#fast-forward if you do it now.
#run tests before you do that!
#if it works, then merge this into master, push it, 
#and discard this branch when done.


So we merge the protocol support and SSL fixes into our test branch after we squash the commits for our leaf branches, to include commentary from the reviewer (who it was, etc) and to move irrelevant noise out of the detailed original commits.  Because we always rebase before merge, all the merges should fast-forward.  This will give us a very clean and easy to read git log.

Testing


We run our tests and see that they pass.  We can do this test before we ask for a code review of our leaf branches as well, but ultimately we need to run tests on the subset of branches that was approved, to ensure that we don't push breakage into the repository when an unapproved branch gets left out.  After the tests pass, push the work up:

#the merge into testmaster proved that the push would fast-forward if we did it.
git checkout master
git merge testmaster #merge what we tested, equiv to merging approved branches
git push origin master
#we don't need this branch now
git branch -d testmaster

Exceptions


If you work like this, then you won't be dealing with merge problems in git.  You won't be pushing broken commits up into the master branch for others to figure out (because you can actually integration test before you even push!).  You will still have to resolve conflicts when you rebase, but that is exactly where the resolution needs to be done.  There will still be unusual scenarios where you have to merge in both directions, when you are not dealing with a straightforward tree of branches around the master.  It is possible to use merges for everything, and there is no permanent damage to the repository if you make a mistake.

There will also be scenarios where you want to commit into a non-leaf node like masterProtocolSupport for the cases where the same merge conflicts happen in both leaf branches.  In that case, Xmpp and Gopher would treat masterProtocolSupport like the remote.  The same rebase before merge rule still applies.

There are also cases where you might not rebase from a remote, violating the rebase from root towards leaf rule.  You may have to merge when you have multiple origins (origina,originb) that are evolving independently (because you want to rebase on one consistent parent branch).



Saturday, January 19, 2013

A Hypothetical Modern Audio Language

A Hypothetical Modern Audio Language

I have been thinking generally about the progression of languages with respect to supporting parallelism.  When C and Unix started to take over the world, parallelism meant processes.  This meant that the languages were very much single threaded, and used interprocess communication to get things done between them.  Then languages like Java and C# came along and popularized the usage of threads with shared memory.  Recently, languages like Erlang and Go (and Lua to some degree) came along to take large scale concurrency seriously; making coroutines mainstream.  Coroutines are like generators or iterators that go both ways between caller and called.  It is a mechanism to allow code to be structured in a truly concurrent way.  What these languages still are missing is what is becoming important for numerically intensive applications and mobile devices: SIMD parallelism.  SIMD (and SIMT) parallelism is what we get from GPUs.  GPUs have very high latency, but the current CPU chips have SIMD instructions that do SIMD on a smaller scale (NEON, SSE, etc).  Like coroutines, the parallelism it provides isn't based on multiple independent thread executions.  Instead, it's based on very wide registers which contain large arrays of values with instructions that can operate on all elements of the array at once.  For branchless computations, this can give a fast speedup, and provide deterministic timing when doing so.

I have been mainly mulling over an ideal language for creating OSC controller patches.  All existing languages are a bit of a monster to build because of the amount of stuff they depend on.  Ideally, it's a pure embeddable C library where callbacks are plugged into the target OS, rather than having the dependencies in the language runtime.  ChucK is almost an ideal fit, but that project seems to not be going anywhere.  SuperCollider is interesting, but it has a very wide scope.  The vast majority of what it does (and therefore the vast majority of dependencies it has) is irrelevant to what I want to do.  I originally thought that a LISP variant would be an ideal language to start such a project with, because it dramatically simplifies the language aspect of it.  But the more I think of it, I want the capabilities of ChucK, combined with the message passing concurrency of Erlang, combined with a way to hide the intense arithmetic so that it is easy to SIMD parallelize.

Specifically for audio apps, you need good soft real-time behavior.  One of the problems that needs to be handled is sample-accurate jitter handling.  Presume that when the client talks to the server, the client gets sample-accurate timestamps for when the gesture was done, and sends them to the server.  If the audio buffer size is 5ms, then an ideal zero-jitter audio representation will match client and server clocks to schedule incoming oscillator changes to always have exactly 5ms of latency (no more and no less; if there is less, then there must be non-zero jitter as a trade-off, and jitter is worse than latency).
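Here is a minimal sketch of that idea (all names are hypothetical; the latencycompute(timediff,timestamp) call in the pseudocode below is presumably doing something similar): learn the clock offset between controller and synth once, then schedule every incoming change at its timestamp plus a fixed latency, so the latency is constant and the jitter is zero.

SAMPLE_RATE = 44100
FIXED_LATENCY = int(0.005 * SAMPLE_RATE)   # always exactly 5ms worth of samples

clock_offset = None   # server_time - client_time, in samples, learned from a sync message

def on_sync(client_now, server_now):
    global clock_offset
    clock_offset = server_now - client_now

def schedule_change(event_client_time, server_now):
    # convert the controller's sample-accurate timestamp into server time, then
    # delay it so that every change lands exactly FIXED_LATENCY samples late
    play_at = event_client_time + clock_offset + FIXED_LATENCY
    return max(play_at, server_now)    # if we arrive too late, jitter reappears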

So I started to fake up some pseudocode, and it looks like this:

//Load up a sample with rules about how it responds to phase changes
soundfile:table{file="chime.wav",attack=0,sustain=256,loop=1024,end=2000}

//This is the graph of oscillators
//No sound is made.  It's just data setup.
a:oscil{table=soundfile}
b:chorus{in=a}
c:reverb{in=b}
d:dac{in=c}

//tell the dac to start running, ultimately pulling data from oscillator through the chorus and reverb
dac.start();

//This is the osc listener
spawn fun osclistener
  o:osclistener{port=9999}  
  recv o
    /rjf/t0,t = {controllernow}:
      timediff = controllernow-now
    /rjf/p0,iifft = {voice,phase,freq,amp,timestamp,phase=attack}:
      at latencycompute(timediff,timestamp) a{freq=freq,amp=amp,phase=attack}
    /rjf/p0,iifft = {voice,phase,freq,amp,timestamp,phase=sustain}:
      at latencycompute(timediff,timestamp) a{freq=freq,amp=amp,phase=sustain}
    /rjf/p0,iifft = {voice,phase,freq,amp,timestamp,phase=release}:
      at latencycompute(timediff,timestamp) a{freq=freq,amp=amp,phase=release}
    after 100ms:
      shutdownAllVoices()

x:oscil(expr= fun(t)
  t & (t%256)
)

Erlang and Go do message passing concurrency in a pretty straightforward way.  Erlang claims to do soft real-time well.  What Erlang does not do well is strings (normal string usage is like 8 bytes per char, because it represents them as a literal linked list of integers) and high intensity numerical computation.  But OSC messaging conceptually fits masterfully into Erlang's way of thinking.  Some things would need to be added; for example, branchless code needs to be rendered to SIMD instructions where possible.  What would make sense is to use either Erlang or Go as the actual language for this purpose.  But it's also ideal that the language is a pure C library without a ton of dependencies (Lua actually fits this criterion very well).

BitWiz is an astonishingly simple example of what you can do with creative use of branchless code.  It's not completely clear to me yet how to apply all of those 8-bit lessons to 16-bit, but if you look carefully at BitWiz code, you can see that the entire audio buffer fill can be generated in parallel (per sample).  Those kinds of simple, branchless expressions (where values change only in between audio frames) should be part of any audio language.

But if you want to run this code on iOS and WinRT simultaneously, there is a huge problem with dependencies.  You cannot depend on LLVM at runtime as an implementation strategy.  If you could do it on iOS, you can't do it on WinRT.  Lua has always been a problem for latency and jitter (I used it in Xstrument at one point).

But this language could just have a non-general purpose VM as its basis.  The VM would be a domain specific CISC instruction set (ie: lots of domain specific builtin primitives), and the language directly talks to the server in terms of those primitives.  Like SuperCollider, that would put the compiler into the client, and it sends raw instructions over OSC to the server.

Otherwise, this is a thought experiment in how good SuperCollider would really be for this task (other than the huge number of dependencies, GPL licensing issues, etc).  As good as SuperCollider is, I still think it's not so great as a language.  It could learn some lessons from Go, Erlang, Lua, and especially ChucK.

Sunday, January 6, 2013

Audio Languages

Audio Languages and iOS, WinRT, Android, etc.

I have been examining the various ways to simplify development of music instruments for touch screens.  I wrote Mugician, Geo Synth (also known as Pythagoras in its early development), and Cantor (known as AlephOne early on).  Mugician and AlephOne are open source projects that I have posted into github.  Geo Synth is a commercial project that is released under Wizdom Music (Jordan Rudess, the Dream Theater keyboardist's company).  I built these apps because I believe that the instrument layout that they have on touchscreens will completely overtake the guitar and keyboard industry at some point when latency and pressure sense issues are dealt with.  I have a lot of videos posted about it, and there are a lot of pro musicians that use Geo in particular (and some using Mugician as well):

http://www.youtube.com/rrr00bb

The main reason I believe that this will happen is that this layout on a touch screen solves intonation and pitch problems that guitars and keyboards actually make worse.  It is even a good ergonomic setup because there are 10 fingers on top of the playing surface, with a layout that's familiar to guitarists, and easier than keyboards (even for actual keyboardists).  The regular layout makes it easier to play without having to have extensive feel feedback (ie: reaching proper sharps and flats on a piano where you can't feel where white/black key boundaries are).  This allows you to play very fast; significantly faster than real guitar playing even.

The Beginning and Background MIDI



After a few years of playing around with iOS controller code, I came to the conclusion that I was doing something architecturally very wrong (along with the rest of the iOS community).  Before Audiobus came out, there was no way to contain the list of requirements for an iOS based audio app, to keep simple ideas from turning into time-burning monstrosities.  Back then, you could run another app in the background, and even send it MIDI messages.  But MIDI has so many problems that you still need to have an internal audio engine to have any guarantee about the user experience, and to ease setup.  (So, in my view, the existence of internal audio engines in any MIDI controller app means that MIDI doesn't actually work as it was intended.  I will have more on OSC later.  If MIDI did what it is supposed to do really well, then nobody would bother writing a synth to go with a controller, or vice versa.  It would not make sense financially to ever do it.  But the combination over MIDI sucks, so we end up doing both sides of it.)  Because audio was still isolated between apps, there was a lot of pressure to make every instrument have:
  • A basic controller.  Most people imitate pianos, because that's what synthesis guys are used to looking at.  Though on the tablet, it's always a horribly unplayable imitation of the piano.  I don't understand why people keep doing this.  We have had 2 years of this, where there's little point in having the controller because it's not playable - but it's included anyway.  You can make a playable controller, but the first thing that has to happen is to drop the requirement that it resemble a piano; as this has been proven for many years to not work on a touch screen.
    • The two main jobs of the controller should be to proxy out to controls for the synths/DAWs in the background when required (ie: volume knobs, timbre sliders, etc), and to simply play voices with the correct amplitudes and pitches.  MIDI's very strong note orientation makes it an awful protocol for meeting these requirements.  OSC can trivially do all of this correctly, but it's a very vague standard.  It's so vague that if you make an OSC controller, you generally need to ship a script that is put into the front of the synth to unpack the OSC messages and do what it needs to do to the synth.  That's where audio languages come in later.
  • A synthesis engine.  This is a pretty steep electrical engineering task to do really well.  This is why Animoog is one of the best ones out there.  Any instrument that simply plays back samples is really unexpressive, and misses the point about music instruments and expressivity.  If you are making a MIDI controller, you can (and should) stop here if at all possible and just send MIDI messages.  When background MIDI came out, it was such a wonderful new thing to have, presuming that you could generate MIDI to do what you want and the majority of synths interpreted this correctly.  What should have happened as a result was that the app world should have split between people providing MIDI controllers and those providing synthesizers, and nobody wasting their time duplicating a mediocre version of whatever the app primarily was not.  This typically means mediocre controllers on apps designed to be synths.
  • A recording functionality, or audio copy and paste.  Actually, it's really a request to include some level of a DAW in every single music instrument.  Because this is tied into the synth, it's generally a mediocre DAW functionality.  You can't really use these instruments naturally, because you have relatively short buffers to record audio into.  AudioCopyPaste is quite useless if the primary use case is somebody playing each track non-stop for 20 minutes.  That's precisely the kind of use case I cared about.
The iOS audio system wasn't designed primarily for real-time control (at the audio sample rate).  We are also dealing with ARM processors, because we run on battery power.  Because of this, it has always been a struggle to get instrument-quality latency on any instrument; let alone an instrument that can't stick to doing one thing well and throwing everything else overboard to solidly meet the performance guarantees.  Currently, audio buffers are between 5ms and 10ms for "good" apps, though iOS gives about 20ms as the default (under the assumption that you are just playing back sound files).  It should really get down to about 1ms audio buffers to meet the standards of professional audio quality.  Beyond even that, almost no instruments (including my own) will adjust the latency to reduce jitter to 0ms (by making all latency larger, but constant); because that's usually an expensive thing to implement in an audio engine.  Remember that there is no standard for generating audio, and there is a mix of professionals and amateurs doing the best that they can do while scribbling waveforms into raw audio buffers.  This means that we have a lot of crappy effects units, audio glitching, aliasing, etc.

For reference, there are these different kinds of latency that we have to deal with:
  • Graphics frame rate, about 60fps ("real-time", but a single frame miss will not result in a disaster).  The user interface can cause other stuff to stall when it's over budget.  This is especially true if the interface is in a high level language like C# or Lua.  Also, if you try to use the GPU to render audio frames, then you could be limited to the graphics frame-rate, or at least locked out from rendering audio while graphics rendering is in progress.
  • Audio frame out rate, 44.1khz (44000 "frames" per second, where a single frame miss results in a horrible popping noise.  It's hard-real time).
  • Audio frame in rate, 44.1khz or 22khz.  If you are making a guitar effects processor, then you have to add up the latency of incoming, and latency of outgoing, and latency of minimum time to compute a frame for output.  Because of this, just because you can build a wicked fast tablet instrument, doesn't mean that you can use that same effects chain for a fast guitar effects pedal.
  • Control rate, at least 200fps (256 audio frames).  But would like to quadruple performance to use 64 audio frames at 800fps, for 1ms latency and jitter.  This is somewhat tied into how fast the touchscreen can sense.  If the touch timestamp is sample accurate, then we can send the timestamp to the synth and synth can make the latency even to reduce jitter to zero.
  • MIDI/OSC packet latency/jitter.  It's very network dependent, and can either be negligible or fatal depending on how much of it is.
Latency issues are still a bit of a mess, and are ignored or not understood by a lot of people.  This is especially true on Android, where latency is very far beyond fatal for creating real-time music apps.

Audiobus



Then Audiobus came out.  Audiobus wonderfully fixes the huge problem that iOS had with audio isolation between apps.  Its existence is very necessary.

Audiobus lets every app (input app, effects unit app, output app) read and write to the Audiobus callbacks so that separate apps can be chained together.  So at about 200 times a second, Audiobus is in the background calling every participating app (an input app, an effects app, and an output app) to run all of their callbacks in order.  This has the effect of now having 4 apps with hard real-time requirements at audio rates(!).  AudioBus + controllerApp + fxApp + dawApp ... that is 4 apps.  At 200 times a second, that's on the order of 1000 callbacks getting filled out per second.  Also, that's 4 apps with a lot of memory allocated to wrangling audio in real-time.  ControllerApp is going to have a user interface on it as well.  The user interface can hog resources and cause everything else to stall.  The user interface can easily be the thing that causes the whole pile of audio generating apps to not run properly in real-time.  It's hard to overemphasize the problem that this creates.  If there were only one hard real-time process responsible for generating audio, with everything else being a controller of some kind, then glitching and performance issues would mostly go away.

Audiobus also creates a somewhat political issue with app development.  The Audiobus app will list apps that are input, effects, or output capable.  It has nothing to say about controllers.  If your app is not a synth or a DAW, then realistically it should not be generating audio.  If your app is a controller, then it should implement MIDI or OSC.  But if you are such a controller, you are technically not in the "Audiobus" enabled category; which means that your app essentially doesn't exist for a lot of users.  So what do we do?  We pointlessly add Audiobus support into controllers for no reason, just so we can get into that category.  If you unnecessarily generate audio, you just eat up memory and CPU; resources that actual Audiobus apps really need.  :-)  Controllers are essential to a chain of Audiobus apps, but controllers don't generate or deal with audio.  Controller to synth is one protocol, and synth to effects to DAW is another protocol.  Note that if Audiobus had the much requested feature of saving setups, then it would probably have to include MIDI and OSC connections as well.

Controller -> MIDI/OSC -> Synth -> Audiobus -> Effects -> Audiobus -> DAW

It should be like that.

MIDI versus OSC



I have posted at length about all the technical problems that MIDI has on touch screens; a very long and technical read:

http://rrr00bb.blogspot.com/2012/04/ideal-midi-compatible-protocol.html

The problems are deep issues with what can and cannot be said in the MIDI protocol, especially in the subset that all synths are required to understand.  The main problem with MIDI is that it is oriented around notes, rather than around frequency/pitch.  MIDI's note bending is a hack with all kinds of obvious corner cases that it can't represent; these corner cases don't show up in piano controllers (which is why they never got fixed), but they are essential cases for string instruments (which is why MIDI guitars are not standard, and are deficient in various ways whenever people have tried).  Real oscillators don't know anything about notes; they are frequency oriented.

OSC can be made to handle all of this stuff very nicely.  OSC is just a remote procedure call syntax.  It's literally just named functions with parameters going back and forth, like:

  /guitar/string0,fff 1.0, 440.0, 0.5      #amplitude, frequency, timbre
  /guitar/string0,fff 0.9, 442.0, 0.53     #amplitude, frequency, timbre
  ...

The problem with it of course is that all controllers are essentially custom creations.  The messages going to the synth, and from synth to controller could be anything at all.  If you defined what a keyboard looks like, you could standardize it. 

Audio Languages



So, now this brings me to the heart of the problem I am facing.  I want to completely separate the audio language from the controller.  I don't want the protocol that the controller speaks to make assumptions that place unintended limits on the controller.  And I don't want the controller user interface to hurt the real-time audio engine.  So, I have an experiment here on Windows 8 (a 27 inch screen), where I have a C# program that sends UDP packets to an audio language, ChucK:



A lot of audio language aficionados are fond of SuperCollider and Max/MSP/Pd, and CSound is an older language that is still used.  There are a few more audio languages, but those are the popular ones.  These languages have common characteristics:
  • open up a listening port for incoming OSC messages, and get them into a script that does something with the messages
  • because OSC is just a standard for sending messages, the synth front-end must have a script pushed into it to actually control oscillators and effects parameters, etc.
  • they all let the sound patch be defined completely in a script.
  • the script can be pushed into the synthesizer from the controller.  this means that the real-time synthesis engine is one app (ie: scsynth, csound, chuck), and the patch comes from the controller
  • in some of these languages, ChucK in particular, you set up the network of effects and advance time explicitly.  As an example, you create an oscillator at amplitude 1 and 440hz.  You then tell the engine to move forward 30 milliseconds.  When that happens, 30 milliseconds of audio is generated.  This is a very hard-real-time notion of how a language should work.  It is the main thing that the usual environments are missing when we try to write synthesizers.  This kind of environment is most critical when you try to do things like increasing the latency of events to provide zero jitter; for when you want sample-accurate timing and want to delay every change by a fixed 5ms, rather than applying it at the beginning of the next audio buffer, which guarantees up to 5ms of jitter (ie: an average latency of 2.5ms with 5ms of jitter vs a fixed 5ms of latency with 0ms of jitter).
  • you can inject an entire sequencer into the sound engine, and only send it control changes after that.
  • you can define effects units like reverbs and distortion units - in scripts that run on the tablet - and install them into the audio engine at runtime.  at this point, the mentality could not be any more different from MIDI (and Audiobus) than this.  This is where environments like Max/MSP make a lot of sense on tablet computers.

Audio Language Present



The current group of audio languages have features that don't make them ideal for what I am trying to do.  The current problem with them is that most of them are oriented around live coding, or in the case of CSound, around offline score rendering.  These are both a different perspective from the use of creating hard real-time OSC protocol synthesizers that are driven primarily by controllers.

CSound is a pretty fast environment.  It is well known in academic circles, where offline rendering of music is a plausible thing to do.  MIDI and OSC support are a horrible after-the-fact hack, however.  The language is really low level, and will not appeal to a lot of people who would otherwise write patches for it.  It's designed mostly around building up a static graph of oscillators and filters.  It builds for all the primary desktop environments.  CSound also has some pretty bizarre limitations that forced me to change my OSC messaging, breaking code that could previously talk to SuperCollider and ChucK simultaneously.

SuperCollider is very actively developed.  But currently, it's under a GPL3 license, though work is being done to detangle things to allow for GPL2 compliant builds.  Because of this, it almost doesn't matter what else is good about it, because this situation is currently fatal for all of the tablet environments.  The $0.99 app model depends on the use of Digital Rights Management (locking down hardware and preventing you from doing whatever you want on your own device), so DRM will never go away.  End users have spoken; they will never give up the $0.99 app model, where they can have the interface polish of commercial apps at close to the prices of free apps.  The DRM conflicts with GPL licensing, and the SuperCollider devs seem pretty adamant about not just ignoring this issue and letting SuperCollider be embedded anyway.  GPL2 compliant builds may also have issues for user-facing projects with a short tail, which often just cannot be open source projects (rock stars don't work for free for very long, it's not just developers required to make apps happen, etc).  But ignoring that huge issue, SuperCollider is very mature, and has a pretty healthy user and developer base.  It is based on a Smalltalk-like language.  The only major technical downside seems to be that a lot of development is directed towards areas that are irrelevant to the main task of creating realtime OSC synthesizers driven by controllers.  Much work goes into user interface things that won't be useful on the tablet platforms, and on things that seem to be related to the live-coding use cases.

Pd (Max/MSP) is an interesting contender.  The underlying language is hidden behind a simple and standardized visual interface.  In some ways this is good, in that it's easy to do simple things.  In other ways, it's really terrible, in that when faced with doing real work, you can easily end up with a ball of unmaintainable synthesis code that would be simple with the tried-and-true abstractions available in standard programming languages.  Its BSD licensing is very compatible with commercial products.  Some people have contributed ARM specific improvements.

ChucK is the closest thing to a modern environment in this group.  It is a bit less capable than SuperCollider in some respects, but the language really makes a lot of sense, and because the project is run by Smule, these guys understand the tablet world pretty thoroughly.  Performance issues aside, it is the most interesting option, and the one least burdened by irrelevant features.  It isn't freely licensed on iOS, however (though its license is not an impossible one for iOS like GPL3).  ChucK also seems to have a lot of applicability outside of just audio synthesis.  It's an interesting lesson in how real-time languages should look in general.

Audio Languages In an Ideal World



Dependencies: One of the more problematic things I encounter with these languages is the kinds of dependencies that they have.  The current languages were very much designed with a desktop world in mind.  When building for a new platform that they did not envision, these are not going to be designed as pure buffer generators that get hooked up into callbacks (ie: Audiobus, CoreAudio callbacks, WASAPI callbacks).  Ideally, the core project should not have branches per platform; but as projects that build around the core.  Any audio language project started from scratch should build without pulling in a lot of dependencies, and should ultimately just invoke callbacks that the language has filled in with audio data.  This is how Audiobus works, and presumably how The Amazing Audio Engine will work (though these projects are very iOS specific).  A real source of heartburn is that even "standard" frameworks will pose a problem.  OpenGL, OpenCL, OpenAL, etc, are the usual route to portability; then Microsoft uses WinRT and insists on DirectX and WASAPI, etc.  Using straight C code with a minimum of dependencies is generally the only way to avoid this problem.

SIMD: Few of these languages take advantage of SIMD in their implementations (single thread lockstep parallelism, the kind that you need for fast and short convolutions, filtering, for just rendering entire audio buffer in parallel, etc).  These are all in C or C++, and there is no good standard for doing this yet.   But it is typically necessary that per-platform, there needs to be SIMD optimized builds for the engine to be feasible on ARM processors.  Examples are vDSP, and ARM intrinsics.  OpenCL addresses these issues in theory, but it's unclear if GPUs can be used in practice for this kind of audio composting.  The SIMD work might be tied into the audio language VM, rather than compiling elemental functions to use SIMD at runtime.

The Language: Because these environments have hard-real-time requirements, there is a real conflict with having a dynamic environment as well.  These languages run in custom virtual machines.  They do real-time garbage collection.  Because of this, the language cannot get overly baroque without messing up the goals of the language.  The language should work well in a hard-real-time environment.  This generally means that memory allocation and de-allocations are much more conservative, and algorithms run in consistent times.

Language Simplicity: A variant of LISP that deals with arrays and SIMD directly seems to be the most obvious candidate to get started with.  There are existing audio languages that use LISP as their basis.  A virtual machine for running an audio language should at least start out very simple, and grow as needed.  The main issue with using LISP in this capacity would be to support actual arrays from the outset, and allow for array operations to be SIMD parallelized (ie: avoid a high garbage collection rate, locality issues, etc).

The OSC Protocol:  The most wonderful thing about SuperCollider is how the language environment (sclang) talks to the real-time engine (scsynth) over OSC.  It is an effective split between hard real-time and soft real-time.  It allows the compiler to simply be removed from environments where it isn't needed.  The controller should do something similar to sclang's equivalent, and use OSC as the protocol over which to inject patch definitions into the synthesizer. 

The VM: The virtual machine could be a traditional one written by hand.  It could also be LLVM output that is consumed and turned into native code.  LLVM is designed to run on many systems, but again I run into issues with standards possibly not being usable in certain places (WinRT?  How about generating CLR as a backend instead?).  OpenGL drivers on OSX already work like this.  They take shader code and generate the card specific assembly language; and this is for a pretty performance area critical part of the system.

Patch Generation

When I was working on Cantor (AlephOne), I had started to write a Python compiler to generate SIMD C code (vDSP - partially realized in my github project DSPCompiler and in AlephOne) for the audio engine from LISP input.  I went down this path because OpenCL wasn't available, and I faced a similar problem.  When you try to generate wide SIMD code, your code turns completely "inside out": the outer loop that existed in the serial version goes into every instruction of the "assembly language expansion" of the original C code.  For example:

//the serial version - can't parallelize it in this form
for(int i=0; i<N; i++){ x[i] = a[i] + b[i]*c[i]; }
Becomes like:

for(i : 0..N) mul b, c, x
for(i : 0..N) add x, a, x

But that doesn't support dynamic use cases.  The patches would need to be compiled into releases.  But a VM supporting SIMD instructions appropriately could provide a speedup even when the originating language is high level.