Building a better DynamoDB throughput scaling tool

June 24, 2014

I use DynamoDB, Amazon Web Services’ managed NoSQL data store. It’s a fantastic tool that places essentially no management burden on me, with two exceptions:

  • Backups
  • Throughput scaling

Backups are fine: I built a small tool that fetches the latest changes and stores them elsewhere. Throughput scaling is more involved, so I was hoping to use someone else’s tool for the job. I tried Dynamic DynamoDB, but it didn’t really work out for me. Here’s an example:

[Graph: three days of provisioned vs. consumed DynamoDB throughput under Dynamic DynamoDB]

This shows three days of throughput, reads or writes (I can’t remember which), spanning May 15th to 17th. The blue line is the actual usage, averaged over 5 minutes; the red line is the provisioned throughput. I configured Dynamic DynamoDB to double the throughput when needed, which is why the red line jumps the way it does.

The first thing to note: the provisioned throughput stays high almost all day before dropping back down at midnight. Then there’s a frantic period of up- and down-scaling just after midnight (easiest to see in the middle of the graph) before the throughput eventually climbs back up and stays there until midnight the next day.

InvisibleHand sums up the fiddly bits of DynamoDB quite well:

  1. You can scale up by at most 100% per request, but you can make as many scale-up requests as you wish
  2. You can scale down all the way to the minimum of 1 read or write operation per second in one go, but you can only do this 4 times within a 24-hour period (resetting at midnight)

That explains why the throughput stayed stuck far too high until midnight: Dynamic DynamoDB had already used all 4 daily downscaling opportunities, so even if it wanted to scale the throughput down, it had no way of doing so. That isn’t good.
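Those two rules can be expressed as a small clamping function. This is just my reading of the constraints as a sketch; the function name, the `decreases_used_today` bookkeeping, and the constants are my own, not any official API:

```python
# Hypothetical helper illustrating DynamoDB's scaling constraints.
# The limits encoded here are the ones described above; nothing in
# this sketch talks to AWS.

MAX_DECREASES_PER_DAY = 4  # resets at midnight
MIN_THROUGHPUT = 1         # 1 read or write operation per second

def clamp_throughput_change(current, requested, decreases_used_today):
    """Clamp a requested throughput value to what DynamoDB will accept."""
    if requested > current:
        # Scaling up: at most double per request, unlimited requests.
        return min(requested, current * 2)
    if requested < current:
        # Scaling down: any amount in one go, but only 4 times per day.
        if decreases_used_today >= MAX_DECREASES_PER_DAY:
            return current  # no decreases left until midnight
        return max(requested, MIN_THROUGHPUT)
    return current
```

So asking for 500 units when you have 100 only gets you 200, and once the 4 daily decreases are spent, any downscale request is a no-op until midnight.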

Now, I’m not saying Dynamic DynamoDB is a bad tool; it was even mentioned on the AWS blog. It simply doesn’t fit my usage patterns, at least not as far as I could tell from the configuration options.

Other things to consider:

If you have a lot of data, a single table gets split into multiple partitions, and your provisioned throughput is divided evenly between them. Also, and this is something it would be nice of AWS to improve, there’s no easy way of finding out how many partitions you have. You sort of have to guess.

So if your use case hammers a single key or a handful of keys, your actual throughput will fall short of the provisioned throughput whenever those keys make one partition hotter than the rest. We will have to rely on other statistics to figure out whether that’s a problem for us.
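For the guessing part, figures AWS has published suggest a rough rule of thumb: about 10 GB of data per partition, and roughly 3,000 read units or 1,000 write units of throughput per partition. Treat all the constants below as assumptions; AWS does not expose the real partition count:

```python
import math

# Rough partition-count estimate. The 10 GB / 3000 RCU / 1000 WCU
# figures are assumptions based on numbers AWS has published, not
# anything the API reports.

def estimate_partitions(table_size_gb, read_units, write_units):
    by_size = math.ceil(table_size_gb / 10.0)
    by_throughput = math.ceil(read_units / 3000.0 + write_units / 1000.0)
    return max(by_size, by_throughput, 1)
```

A 25 GB table would land on 3 partitions by size, meaning each partition only gets a third of whatever throughput you provision.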

So here’s the wishlist of features needed:

  1. Before downscaling, make sure you should. This decision needs to be based on more than just the current average throughput usage: we also have to budget the 4 daily downscaling opportunities so we don’t spend them all straight away. One option is to only allow downscaling at 4 fixed times throughout the 24-hour period, but I’m not sure I like that; it sounds too simple.
  2. We should only scale throughput up if we have throttled operations. With multiple partitions on a table we can NOT rely on comparing consumed throughput to allocated throughput. Also, for eventually consistent reads we actually want to over-consume allocated throughput, because that’s what they allow us to do. So only throttled operations should decide when to scale up.
  3. By looking at throttled operations, consumed throughput and allocated throughput, we should try to calculate how many partitions we have. This doesn’t matter much if the access pattern is spread evenly across all partitions, but it does if it isn’t.
  4. When scaling down, we need to take the assumed number of partitions from step 3, combined with the current access patterns, into account, so we don’t scale down too far. Since we’re always allowed to scale back up, a mistake here can be fixed, but we should still avoid it, because in the meantime it hurts the performance of applications accessing DynamoDB. Basically: we don’t want to overpay for unused capacity, but not at the cost of performance.
  5. As a bonus feature: when scaling up, consider the number of throttled operations and the assumed number of partitions, and scale up only as much as seems necessary. If we get it wrong, we can always scale up more. Again, it’s a balance between cost and performance.
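To make item 2 (and the bonus item 5) concrete, here is a sketch of a throttle-driven scale-up rule. The function name, the `headroom` factor and the sizing heuristic are all illustrative assumptions, not a worked-out design:

```python
import math

# Sketch of the wishlist's scale-up rule: react only to throttled
# operations, and scale up just enough to absorb them. All thresholds
# here are illustrative assumptions.

def suggest_scale_up(provisioned, consumed_avg, throttled_per_sec,
                     headroom=1.2):
    """Return a new throughput value, or None if no change is needed.

    Scale up only when operations are actually being throttled;
    comparing consumed vs. provisioned throughput is unreliable once
    a table has multiple partitions.
    """
    if throttled_per_sec <= 0:
        return None
    # Aim to cover current consumption plus the throttled overflow,
    # with some headroom, but never more than double per request
    # (DynamoDB's limit).
    target = (consumed_avg + throttled_per_sec) * headroom
    return min(math.ceil(target), provisioned * 2)
```

If the target exceeds double the current provisioning, the result is capped at 2x and a follow-up scale-up would be needed, which is fine since scale-ups are unlimited.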

I am considering writing this tool myself, scratching my own itch. However, I’d prefer not to spend the time on it. If anyone knows of an existing tool that covers these feature requests, please leave a comment. And if there’s a crucial feature I’ve missed, let me know!