Friday, April 19, 2013

Upsert Counters In Postgres

Upsert 


When you are inserting huge volumes of data ('log' volumes of data: think append-only or circular-queue tables) and need to keep some statistics, you would like to have counters.  I am working with Postgres here, but the upsert concept also exists in MySQL.  This is related to a common situation in which you know the primary key and some initial values for a datapoint (describing the object that the primary key refers to), along with data that will be updated constantly.  You don't want to look these rows up to see whether they exist before choosing between an insert and an update, because you want to do a huge number of writes in a single batch.  The simplest upsert looks like this:

sql = "with upsert as
(update ourtable
  set mutable1 = ?
  where id=?
  returning id)
insert into ourtable
(id,constant1,mutable1)
select ?,?,?
where not exists (select 1 from upsert)"

#pseudocode usage like this...
def OurTableUpsert(statement,id,constant1,mutable1):
  #move on to the next item in the batch
  statement.item += 1
  item = statement.item
  #args match the placeholders: the update's first, then the insert's
  statement.row[item].args[0] = mutable1   #update: set mutable1 = ?
  statement.row[item].args[1] = id         #update: where id = ?
  statement.row[item].args[2] = id         #insert: id
  statement.row[item].args[3] = constant1  #insert: constant1
  statement.row[item].args[4] = mutable1   #insert: mutable1

#Make a giant batch of upserts against an initially empty table
statement = prepare(sql)
OurTableUpsert(statement,8,'Ovechkin Goals',1)
....
OurTableUpsert(statement,8,'Ovechkin Goals',2)
...
OurTableUpsert(statement,1003,'Planets',9)
...
OurTableUpsert(statement,1003,'Planets',8)
...
execute(statement)

This lets us set mutable1 of ourtable to a value for a given id, whether or not that row already exists.  This is important because we want to make huge batch updates and don't have time to go figure out what already exists.  It saves us from doing a select on the items to generate inserts for the ones that don't exist and updates for the ones that do, and lets us do everything as one batch statement.  The constant fields like constant1 are only supplied with initial values and never updated, while the mutable fields like mutable1 are updated constantly.
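
For concreteness, here is what the batch might look like as real code.  This is a minimal sketch using psycopg2 (my choice of driver, not from the pseudocode above); psycopg2 uses %s placeholders rather than ?, and the ourtable schema in the comment is an assumption:

import psycopg2

# assumed schema:
#   create table ourtable
#     (id bigint primary key, constant1 text, mutable1 bigint);
sql = """with upsert as
(update ourtable set mutable1 = %s where id = %s returning id)
insert into ourtable (id, constant1, mutable1)
select %s, %s, %s
where not exists (select 1 from upsert)"""

conn = psycopg2.connect("dbname=test")  # hypothetical connection string
batch = [
    # (mutable1, id, id, constant1, mutable1)
    (1, 8, 8, 'Ovechkin Goals', 1),
    (9, 1003, 1003, 'Planets', 9),
]
with conn, conn.cursor() as cur:
    cur.executemany(sql, batch)  # runs the upsert once per tuple

executemany simply runs the statement once per tuple inside one transaction; the point is the argument order, which must match the placeholders in the update followed by those in the insert.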


Counters


A more complicated example is a collection of counters.  The counters should be sparse, so that values of zero don't have a row in the database, and you only write back to the database to increment their values.

Updates alone don't give us counter-like behavior, but the upsert is general enough that the arguments can supply increments.  We change the statement so that the arguments are increments, and partition the counts by day:

sql = "with upsert as
(update ourcounters 
  set counter = (? + counter)
  where id=? 
  and starttime <= now() and now() < stoptime
    returning id)
insert into ourcounters 
(id,counterdescription,counter,starttime,stoptime)
select ?,?,?,?,
  date_trunc('day',now()),
  date_trunc('day',now() + interval '1 day')
  where not exists (select 1 from upsert)"

#run something like this (statement passed in as in the first example)
def OurCounters(statement,id,description,increment):
  statement.item += 1
  item = statement.item
  statement.row[item].args[0] = increment    #update: the increment
  statement.row[item].args[1] = id           #update: where id = ?
  statement.row[item].args[2] = id           #insert: id
  statement.row[item].args[3] = description  #insert: counterdescription
  statement.row[item].args[4] = increment    #insert: initial counter value

#Increment a whole bunch of counters
statement = prepare(sql)
OurCounters(statement,1,'United States',235)
OurCounters(statement,44,'UK',53)
OurCounters(statement,44,'UK',99)
OurCounters(statement,1,'United States',23523)
...
execute(statement)
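
If you want to ship the whole list to the database in a single round trip, you can wrap the same update-then-insert logic in a stored procedure that takes arrays.  A hypothetical PL/pgSQL sketch (the function name and array-based signature are my own assumptions, not from the statement above):

create or replace function bump_counters
  (ids bigint[], descriptions text[], increments bigint[])
returns void language plpgsql as $$
declare
  i int;
begin
  for i in 1 .. coalesce(array_length(ids, 1), 0) loop
    -- try the increment against today's bucket first
    update ourcounters
       set counter = counter + increments[i]
     where id = ids[i]
       and starttime <= now() and now() < stoptime;
    -- no row for this counter today: create the day bucket
    if not found then
      insert into ourcounters
        (id, counterdescription, counter, starttime, stoptime)
      values (ids[i], descriptions[i], increments[i],
              date_trunc('day', now()),
              date_trunc('day', now() + interval '1 day'));
    end if;
  end loop;
end $$;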

You can turn this into a stored procedure like the sketch above and supply it with a huge list of counter values to increment.  These are sparse counters, so there won't be a row in the database for a counter whose value is zero.  Each write increments the counter for the current day, so you can retrieve totals over a window of days; the trick is that the raw counts get bucketed by day.  That way, when you start the application up from the beginning, you can retrieve a total count:

select id,counterdescription,sum(counter)
from ourcounters
group by id,counterdescription

Then, in memory, keep increments relative to the totals that last came from (or were last written to) the database, and periodically write them out.  Each write lands in the increment bucket for the current day.  This way you can keep incremental statistics on a large amount of data, with only a lightweight query at first startup, an efficient way to write updates, and an easy way to filter on an initial time window.
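
Filtering to a recent window is then just a predicate on the day buckets; for example (the 30-day window here is an arbitrary illustration):

select id,counterdescription,sum(counter)
from ourcounters
where starttime >= date_trunc('day', now() - interval '30 days')
group by id,counterdescription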

Summary


I have worked on relational database systems that deal with fast ingest of events, where a mismatch between the insert and delete rates becomes a severe problem once you reach the maximum number of rows you want to maintain in the database.  After excursions into Hadoop, Cassandra, Solr, etc., I keep running into situations where relational databases are used as ring buffers or as append-only immutable data, while the rest boils down to having a great system for counters (including nested counters).  Upsert was something I had not seen until recently, and it looks like it will come in handy in a lot of places.