Blog of Christian Felde
Technology, computers and quant finance


The economic scalability of DynamoDB

I've been using DynamoDB for a few months now, after re-architecting a system which was becoming painful to scale on a traditional RDBMS. The problem wasn't necessarily read/write performance but rather the total storage space needed, as a lot of "unstructured" blobs were stored in the DB.

DynamoDB gives me a carefree setup with individually scalable reads, writes, and storage, fully managed by AWS, making my life easier. I'd rather spend my time focusing on the business use case than on managing infrastructure. Prior to picking DynamoDB I considered Apache Cassandra. Both share the same origins in the form of the Dynamo paper, and the hash + range key data model fits my use case perfectly.
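To make the hash + range key model concrete, here's a small conceptual sketch of my own (not DynamoDB's or Cassandra's actual storage engine or API): items live in a partition selected by the hash key and are kept sorted by the range key within that partition, which is what makes range queries against a single hash key cheap.

```java
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Conceptual hash + range key store: hash key picks a partition,
// range key orders items within it.
class HashRangeStore<H, R extends Comparable<R>, V> {
    private final ConcurrentHashMap<H, NavigableMap<R, V>> partitions =
            new ConcurrentHashMap<>();

    void put(H hashKey, R rangeKey, V value) {
        partitions.computeIfAbsent(hashKey, k -> new ConcurrentSkipListMap<>())
                  .put(rangeKey, value);
    }

    // All items for hashKey with a range key in [from, to], in range-key order.
    NavigableMap<R, V> range(H hashKey, R from, R to) {
        NavigableMap<R, V> partition = partitions.get(hashKey);
        if (partition == null) return new ConcurrentSkipListMap<>();
        return partition.subMap(from, true, to, true);
    }
}
```

For example, storing time-series events per user would use the user id as hash key and the timestamp as range key, so "all events for user X between t1 and t2" hits a single sorted partition.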

Here's why I didn't pick Cassandra: For a decent minimum setup I'd argue you need at least three servers. Paying on-demand or reserved prices for three servers is one thing, but they also need someone to manage them (= time/money), including but not limited to OS, security, updates, backups, monitoring and other basic management tasks. You might find it enjoyable to tinker with these things, but I'd rather not. Sure, I'm fully capable of either knowing what to do or figuring it out if needed, but I'd rather not spend the time on that right now (unless you pay me for it).

But it's all about scale. At some point it's more economical to manage software yourself, or manage your own hardware, or even build your own data centers. I'm far from most of these scale points, and will probably never need to bother with most of them. But when does it become more economical to run your own Cassandra cluster versus using DynamoDB?

It's not a straightforward calculation, but Stackdriver recently published an interesting performance test comparing the new AWS C3 instances against Google and Rackspace cloud servers. Using the most cost-efficient setup from their study, could we calculate when you should switch from a managed NoSQL solution to a self-managed Cassandra installation?
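The shape of that calculation can at least be sketched. Every number below is a hypothetical placeholder, not a quoted AWS, Google, or Rackspace price; the point is only the structure: the self-managed cluster has a roughly fixed monthly cost (instances plus admin time), while DynamoDB's cost scales with provisioned throughput, so there is some throughput level where the lines cross.

```java
// Back-of-the-envelope break-even sketch. ALL numbers are hypothetical
// placeholders; plug in current pricing and your own ops cost.
class BreakEven {
    // Fixed monthly cost of a minimal 3-node self-managed cluster.
    static double cassandraMonthly(double nodeCostPerMonth,
                                   double adminHoursPerMonth,
                                   double hourlyRate) {
        return 3 * nodeCostPerMonth + adminHoursPerMonth * hourlyRate;
    }

    // Provisioned write units at which DynamoDB's monthly cost matches
    // the cluster's fixed cost (ignoring reads and storage for simplicity).
    static double breakEvenWriteUnits(double cassandraMonthly,
                                      double dollarsPerWriteUnitMonth) {
        return cassandraMonthly / dollarsPerWriteUnitMonth;
    }

    public static void main(String[] args) {
        double monthly = cassandraMonthly(150.0, 10.0, 100.0); // hypothetical
        System.out.printf("Cluster: $%.0f/month%n", monthly);
        System.out.printf("Break-even at ~%.0f write units%n",
                breakEvenWriteUnits(monthly, 0.47)); // hypothetical unit price
    }
}
```

Below the break-even throughput, the managed service wins on total cost; above it, running your own cluster starts to pay off, assuming you value the admin time honestly.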


Space efficient Bloom filter index, 2x performance gain

To quote Wikipedia: A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

This is often used by systems in order to avoid accessing slow media, like a disk. Take HBase or Cassandra for instance: Instead of reading their data files in order to figure out if a particular key is present in a file, a file which might be huge, they can first consult their Bloom filters and get a true/false answer on whether the key is likely to be in there. I say likely because a Bloom filter isn't guaranteed to be correct. More formally: False positives are possible, but false negatives are not.

Given that false negatives can’t happen, the worst we’ll do is read a file we didn’t have to read. But we’re also guaranteed to read the file if the key is present, which is the type of guarantee we need.
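A minimal Bloom filter looks something like the sketch below (this is my illustration of the general technique, not the Coherence index implementation discussed next): k bit positions are derived from each key, set on add, and checked on lookup. If any checked bit is clear the key was definitely never added; if all are set the key is merely likely present.

```java
import java.util.BitSet;

// Minimal Bloom filter: false positives possible, false negatives impossible.
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position via simple double hashing of hashCode();
    // production implementations use stronger hash functions.
    private int position(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd second hash
        return Math.abs((h1 + i * h2) % size);
    }

    void add(Object key) {
        for (int i = 0; i < hashes; i++) bits.set(position(key, i));
    }

    // true = "maybe present", false = "definitely absent"
    boolean mightContain(Object key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(position(key, i))) return false;
        return true;
    }
}
```

Note the space efficiency: a BitSet of a few thousand bits can summarize membership for a large key set, at the cost of a tunable false positive rate.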

Using a similar technique I managed to get almost a 2x performance gain on Oracle Coherence with a custom Bloom filter based index implementation (versus no index), using just a fraction of the memory a standard index instance would use (a few kB at most). But before we conclude on that, let's present a use case and why indexing in an in-memory key-value store makes sense to begin with.


Resource aware queue

For the TLDRs:

This blog post presents the reasoning behind a project called ResourcePriorityBlockingQueue, which is a blocking queue implementation that:

  1. Allows you to assign a priority to tasks, just as with a PriorityBlockingQueue.
  2. Tasks may belong to task groups where each group may have a different priority. In that case, tasks are prioritized by group first, then task priority second.
  3. For a particular instance of a ResourcePriorityBlockingQueue you give it implementations of a ResourcePrioritizer and a ResourceAllocator, which together define which resources a particular task is made available to.
  4. Focus has been put on making the code highly concurrent with as little synchronized code as possible. Lock-free code is used where applicable, with efficient use of internal data structures. This gives ResourcePriorityBlockingQueue a performance characteristic comparable to that of a pure PriorityBlockingQueue when used in a similar fashion.
  5. It's available on GitHub, so feel free to poke around. Any constructive feedback is always welcome.
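The group-first, task-second ordering from point 2 can be illustrated with a plain PriorityBlockingQueue and a composed Comparator. The Task record here is a hypothetical shape of my own for illustration; it is not the ResourcePriorityBlockingQueue API, which additionally routes tasks to resources via the prioritizer and allocator.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical task shape: lower numeric value = higher priority,
// matching PriorityQueue's natural "least element first" ordering.
record Task(String name, int groupPriority, int taskPriority) {}

class GroupOrderingDemo {
    // Order by group priority first, then by task priority within the group.
    static final Comparator<Task> BY_GROUP_THEN_TASK =
            Comparator.comparingInt(Task::groupPriority)
                      .thenComparingInt(Task::taskPriority);

    public static void main(String[] args) {
        PriorityBlockingQueue<Task> q =
                new PriorityBlockingQueue<>(16, BY_GROUP_THEN_TASK);
        q.add(new Task("a", 2, 1)); // group 2
        q.add(new Task("b", 1, 5)); // group 1, lower task priority
        q.add(new Task("c", 1, 2)); // group 1, higher task priority

        // Group 1 drains first (c before b), then group 2:
        System.out.println(q.poll().name()); // c
        System.out.println(q.poll().name()); // b
        System.out.println(q.poll().name()); // a
    }
}
```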