For the last month or so I’ve been rather busy developing a trading platform for algorithmic trading, connected to LMAX (you should also check out their very interesting Disruptor framework). It’s a rather comprehensive solution with both risk management, position management, back testing, data management on tick level, etc built in. Everything that connects to and communicates with the exchange is event based, basically on a topic based setup. So one event producer may have zero or more event consumers.
I’ve really enjoyed finally having a proper concurrent project to work on, poking around the Java concurrency source code in order to find a few tips and tricks as to how I can squeeze out a bit more performance. It’s not that the trading I’ll be doing is that high frequent per say, it’s more about eliminating latency. Now, of course, most of the computational power will be needed in the actual business logic as that’s where the heavy lifting happens. But since the event system that glues everything together is so omnipresent, having it perform well is also very important. Analyzing the spacing between each tick for presumably one of their busiest instruments, EUR/USD, indicates that the most frequent observation is between 1 and 10 milliseconds, so that gives me a rough estimate as to how quick the consumers need to be. If they can’t keep up on average the events will begin queuing up, which is very bad.
For the event system itself, delivery of one event from a producer to a consumer is on average 170 microseconds, so that basically leaves the consumer with almost 100% of the time (given that the exchange connected producers are very-to-fairly lightweight). I’m not really sure where to do further optimizations along my critical paths, so I might need to go crazy and do manual memory management or something (with the Unsafe class), but I doubt I’ll need it. I think most of the quick win gains currently is at the network/IO level.
Anyway, I digress. Because the point is that while performance testing various parts I got some conflicting evidence as to which concurrent queue implementation to use. So I decided to do a somewhat more comprehensive test while still running it in an isolated environment. There are three factors at play here: Do we have a slow or quick producer, slow or quick consumer, and finally are we using busy wait loops when waiting for events? A slow producer or consumer would typically be slow if it needs to wait for IO, so that’s simulated here with a
From JavaDoc we know:
ConcurrentLinkedQueueis an appropriate choice when many threads will share access to a common collection. This queue does not permit null elements.
ArrayBlockingQueueis a classic “bounded buffer”, in which a fixed-sized array holds elements inserted by producers and extracted by consumers. This class supports an optional fairness policy for ordering waiting producer and consumer threads
LinkedBlockingQueuetypically have higher throughput than array-based queues but less predictable performance in most concurrent applications.
I don’t care about the fairness policy as that isn’t an issue; we don’t have multiple reader nor writer threads per queue, only one of each. I’m not focusing on the test code, only the results. But you’re welcome to download it and have a look. It is quite straight forward where we have a producer producing 100,000 events, with the queues capped at 1000. We then vary the combinations of the three factors and run each scenario 100 times. As I had cores to spare I ran them three times, so the results below are based on 300 result values.
So we expect the LinkedBlockingQueue to perform better than the array based. Let’s look at the two scenarios where both the producer and consumer are fast (ie no sleeping/waiting in any of them), using both busy wait polling and poll with a timeout.
Without busy wait the average run time for ArrayBlockingQueue is 512.73 ms, while LinkedBlockingQueue clocks in at 363.95 ms. Using busy wait, this changes to 473.92 ms and 372.18 ms. Without putting too much weight on the changes in values between the two scenarios we can simply conclude that a LinkedBlockingQueue is, as stated by the documentation, faster than the ArrayBlockingQueue in this use case. It’s interesting to see however that there are quite a few more observations among the lower run time numbers for the ArrayBlockingQueue when using busy wait than when not using it. So if you’re stuck with an array based queue maybe that’s something to take away.
Moving on to a more realistic example, what happens when the consumer is occupied doing something with each event. The run time of course increases, but of particular interest to us we see that now the array based queue outperforms the linked based queue. The average runtimes when not using busy waits are 109,381.77 ms (array based) and 109,415.30 ms (linked based). The “theoretical” fastest possible run time would be 100,000 ms, and linked based queues show a slightly higher average overhead than the array based. Switching on busy waits we now get 109,387.41 ms and 109,433.22 ms. These results are virtually identical to not using busy waits, which isn’t surprising given that the producer is quicker than the the consumer.
The next two pictures are interesting. Here we use a slow producer but a quick consumer, only altering between using and not using busy wait. If we were to look at the CPU consumption between these two settings they would be like night and day, busy wait consuming 100% without much work being done (as there are few new events) and close to 0% when not using busy wait. This is likely also why they both differ as they do and why we get the dual peak distribution on the first of them. As the application isn’t constantly busy the OS likely moves on to other things before it comes back to our app (ie more context switching, lower app priority).
When looking at average run time values we find the linked based queue with a slightly higher overhead when we’re not using busy wait: 110,109.07 ms vs 110,079.01 ms for our array based queue. This changes when using busy wait: 109,002.13 ms for LinkedBlockingQueue and 109,067.95 ms for ArrayBlockingQueue.
Finally, what happens when both the producer and consumer are slow. Without any busy waiting I got these average values:
ArrayBlockingQueue: 109,098.06 ms
LinkedBlockingQueue: 109,112.33 ms
With busy waits:
ArrayBlockingQueue: 109,218.44 ms
LinkedBlockingQueue: 109,209.08 ms
So there you go, hopefully that makes your choice between an array based or a link based queue a bit simpler the next time you need one.
PS: I should mention that these where run on Java 1.7.0_1. I’ve seen significant performance improvements on Java 7 compared to Java 6, so if you’re running these on Java 6 that might affect your results. Let me know if it does..