Building a better DynamoDB throughput scaling tool, part 2

By | July 24, 2014

A month back I blogged about wanting a better throughput scaling tool for DynamoDB. Not having been able to find an existing tool that ticked all my boxes, I ended up scratching my own itch and developed a small Java tool that runs in the background, monitoring a set of DynamoDB tables. The tool is meant to satisfy these requirements:

  1. Before downscaling, make sure you should. This needs to be based on more than just the current average throughput usage. We also need to consider the 4 downscaling opportunities we have, making sure we don’t spend them all straight away. This could potentially be solved by only allowing downscaling at 4 specific times throughout the 24-hour period, but I’m not sure I like it; it sounds too simple.
  2. We should only scale throughput up if we have throttled operations. If we have multiple partitions on a table we can NOT rely on comparing consumed throughput vs allocated throughput. Also, for eventually consistent reads, we want to over-consume allocated throughput because that’s what they allow us to do. So only throttled operations should matter for deciding when to scale up.
  3. We should, by looking at throttled operations, consumed throughput and allocated throughput, try to calculate how many partitions we have. This is not a big problem if the access patterns are spread out well across all partitions, but it is a problem if they are not.
  4. When scaling down, we need to take into account the assumed number of partitions combined with the current access patterns from step 3 above, making sure we don’t scale down too much. Since we are always allowed to scale back up, we can fix a mistake if we scale down too much. But we should try to avoid it, as it will impact the performance of applications accessing DynamoDB. Basically, we don’t want to overpay for unused capacity, but not at the cost of performance.
  5. As a bonus feature: When scaling up, take into account the number of throttled operations and the number of assumed partitions, and scale up only as much as you think you need. If we get it wrong we can always scale up more. Again it’s a balance between cost and performance.

I ended up ditching requirement 5 and instead always scale up by the number of throttled requests multiplied by a simple factor. If the tool gets it wrong, it can always scale up again later, since there is no limit on the number of scale-ups.
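The scale-up rule above can be sketched roughly as follows. This is a minimal illustration, not the tool’s actual code: the class and method names, the conversion of throttled requests into a per-second rate, and the `maxUnits` cap are all assumptions made for the example.

```java
/** Hypothetical helper; names, parameters and the cap are illustrative assumptions. */
final class ScaleUpSketch {
    private ScaleUpSketch() {}

    /**
     * Returns the new provisioned capacity (units per second).
     *
     * @param provisionedUnits  currently provisioned units per second
     * @param throttledRequests throttled requests seen in the sampling window
     * @param windowSeconds     length of the sampling window, in seconds
     * @param factor            the "simple factor" used as a tuning multiplier
     * @param maxUnits          upper bound we never scale past (assumed safety cap)
     */
    static long newProvisionedUnits(long provisionedUnits, long throttledRequests,
                                    long windowSeconds, double factor, long maxUnits) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        // Only throttled operations trigger a scale-up.
        if (throttledRequests <= 0) {
            return provisionedUnits;
        }
        // Assumption: turn the throttled count into a per-second rate first.
        double throttledPerSecond = (double) throttledRequests / windowSeconds;
        // Always add at least one unit so a throttled table makes progress.
        long increase = Math.max(1L, (long) Math.ceil(throttledPerSecond * factor));
        return Math.min(maxUnits, provisionedUnits + increase);
    }
}
```

For example, with 100 provisioned units, 600 throttled requests over a 300-second window and a factor of 1.5, this sketch would raise the table to 103 units. If that turns out to be too little, the next sampling window simply triggers another increase.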

So what’s the result? Here’s a table read example, first for three days, then just the last day:

DynamoDB throughput scaling 3 days

DynamoDB throughput scaling 1 day

As the graphs show, in this read case the provisioned throughput sits below the consumed throughput, because we’re exploiting eventually consistent reads and their additional “free” throughput.

We also see delayed downscaling, as the tool needs to be careful not to use up the 4 downscaling opportunities too early.
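One way to guard the downscaling budget could look like the sketch below. It is not how the tool actually decides; it is an assumed policy that combines two ideas: usage must have stayed low for a sustained period, and the 4 decreases are released gradually over the day rather than all being available at midnight. The class name, the parameters and the use of the UTC hour are assumptions for illustration.

```java
/** Hypothetical downscale guard; the policy and all names are illustrative assumptions. */
final class DownscaleBudgetSketch {
    private DownscaleBudgetSketch() {}

    /** The 4 downscaling opportunities per 24-hour period mentioned in the post. */
    static final int MAX_DECREASES_PER_DAY = 4;

    /**
     * @param decreasesUsedToday    decreases already spent in the current day
     * @param hourOfDayUtc          current hour, 0-23 (assumes the budget resets at 00:00 UTC)
     * @param minutesBelowThreshold how long usage has stayed below the scale-down threshold
     * @param requiredQuietMinutes  how long usage must stay low before we scale down
     */
    static boolean mayScaleDown(int decreasesUsedToday, int hourOfDayUtc,
                                long minutesBelowThreshold, long requiredQuietMinutes) {
        if (hourOfDayUtc < 0 || hourOfDayUtc > 23) {
            throw new IllegalArgumentException("hourOfDayUtc must be in 0..23");
        }
        // Budget exhausted: nothing left to spend today.
        if (decreasesUsedToday >= MAX_DECREASES_PER_DAY) {
            return false;
        }
        // A short dip is not enough; require sustained low usage.
        if (minutesBelowThreshold < requiredQuietMinutes) {
            return false;
        }
        // Release one decrease per 6-hour slot so the budget lasts the whole day.
        int hoursPerDecrease = 24 / MAX_DECREASES_PER_DAY;
        int releasedSoFar = hourOfDayUtc / hoursPerDecrease + 1;
        return decreasesUsedToday < releasedSoFar;
    }
}
```

With this policy a table that has already been scaled down once cannot be scaled down again before 06:00, which produces the kind of delayed downscaling visible in the graphs. Unused decreases carry over to later slots, so a quiet evening can still spend whatever is left.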

One thing to note here: The system that is reading from the table will back off progressively if it gets a ProvisionedThroughputExceededException. Because of this, the throughput scaling tool doesn’t see the maximum required throughput rate immediately. This creates a dynamic relationship between the consumer and the scaling tool, resulting in a gradual increase from one level to the eventual peak rate, at which point the consumer is limited by some other factor, like CPU or other external resources. The cross-table relationship is also interesting: when a consumer reads from or writes to multiple tables, scaling one table also affects the scaling of a second table.
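To make the consumer side concrete, here is a minimal sketch of a progressive backoff delay. The post does not say which backoff strategy the consumer uses; capped exponential backoff is assumed here, and the class name and parameters are hypothetical.

```java
/** Hypothetical backoff helper; capped exponential backoff is an assumed strategy. */
final class BackoffSketch {
    private BackoffSketch() {}

    /**
     * Delay before the next retry after a throttled request.
     *
     * @param attempt    number of consecutive throttled attempts so far (0 = first retry)
     * @param baseMillis delay used for the first retry
     * @param maxMillis  upper bound on the delay
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMillis;
        // Double the delay per attempt, stopping at the cap without overflowing.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

Because each throttled request makes the consumer slow down, the demand the scaling tool observes is lower than the real demand. Every scale-up lets the consumer shorten its delays again, which is why the provisioned throughput climbs in steps instead of jumping straight to the peak.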

In the end I’ve got myself a simple tool that does the job. This has both increased the performance of the DynamoDB table consumers as the tables get scaled up when needed, and decreased my cost because I can scale them down automatically when there’s little or no activity. A clear win-win!

Update 2016-06-23: After receiving many emails asking whether I would open source this tool, I’ve finally decided to do just that: