Performance-related posts and articles seem to be rather popular, with my Java vs C++ post being one of the big traffic drivers on this blog. Now it’s time for another one, with a very specific use case in mind.
I’m into the third and final term of my master’s in quantitative finance. As part of this term I’m doing a dissertation, and the topic is basically high-frequency trading strategies. So what I’m doing is looking at a relatively large amount of data and applying different trading strategies to it. The data consists of one-minute OHLC bars plus volume for all stocks in the S&P 100, going back about 10 years.
Uncompressed and stored as comma-separated text files this comes to about 6 GB of data, and when imported into a database it is just under 118 million rows, each row being a one-minute bar for one ticker. I don’t feel like torturing my laptop with all the analysis I’ll be running over the coming months, and it’s also only got two cores, so it would be beneficial to “outsource” this to external servers.
Luckily, in these “cloud computing” times, gaining short-term access to any number of servers is easy, and also reasonably priced. But which provider should I choose? There are two dominant players I’m going to evaluate: Rackspace Cloud and Amazon Web Services.
When running the tests I used Rackspace’s UK data center and Amazon’s Irish data center, simply because I’m based in London and those data centers are probably a bit less crowded than some of their US-based ones. Both providers offer a range of server configurations, with Amazon being the more flexible of the two. I need persistent disk storage, so on Amazon I’m using EBS for the disks. For Rackspace I’m using the 8 GB server, and for Amazon the Large Instance (7.5 GB). This is more memory than I need; 2 GB will probably be enough, and 4 GB certainly would be, but Amazon doesn’t have a 4 GB server. More importantly, the two are similar in price: the Amazon server is a bit cheaper, but you have to pay extra for the EBS disk and I/O operations.
And this is where the first big difference shows. From my understanding, Rackspace uses RAID 10 on local hard drives. This of course removes some flexibility, but it’s not flexibility I need anyway. EBS, on the other hand, is network-attached storage. So the first test is raw write speed. I ran the dd command below three times in a row on each server:
dd if=/dev/zero of=./dd.test bs=10M count=500
That’s a massive difference: Rackspace averaged a write speed of 290 MB/s and Amazon only 81 MB/s, so Rackspace was roughly 3.6 times faster. Both servers are running Ubuntu 10.04 LTS with all patches applied. No other tuning has been done, so it’s a stock OS. Still, I don’t think any tuning on the Amazon server would close a gap this large.
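For anyone who wants to repeat the write test from a script rather than with dd, here is a minimal Python sketch of the same idea. It is not what I ran: the function name `write_throughput_mb_s` is made up for illustration, and unlike the plain dd command it calls `fsync` before stopping the clock, so the numbers it gives are not directly comparable with the ones above.

```python
import os
import time


def write_throughput_mb_s(path, block_size=10 * 1024 * 1024, count=500):
    """Write `count` zero-filled blocks of `block_size` bytes to `path`
    and return the throughput in MB/s (1 MB = 10**6 bytes, as GNU dd reports).

    Defaults mirror `bs=10M count=500`. Assumption: we fsync before
    stopping the timer, which the plain dd command does not do.
    """
    block = bytes(block_size)  # zero-filled, like reading from /dev/zero
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data has actually reached the disk
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed / 1e6


# Example (writes about 5 GB to the current directory):
#   print(write_throughput_mb_s("./dd.test"))
```

Running it three times per server and averaging would match the procedure used for the dd test.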
The gap also showed clearly when I imported a subset (about 14.5 million rows) of the data from files into a MySQL database. Both servers were running MySQL 5.1.41-3ubuntu12.10 with no configuration changes. The table used the InnoDB engine in both cases, and a simple Java program read the files and inserted the rows into the DB. The Java version info on both servers:
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
In this case Rackspace performed the import in 6146 seconds while the Amazon server took 15026 seconds for the same job, in other words more than twice as long. Not too surprising given the difference in write speed.
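To put those import times in perspective, the short Python sketch below turns them into rows per second and projects them onto the full data set. The projection assumes the import time scales linearly with the number of rows, which is my assumption and not something I measured; the row counts are the rounded figures quoted in this post.

```python
ROWS_IMPORTED = 14_500_000  # "about 14.5 million rows" in the test import
TOTAL_ROWS = 118_000_000    # "just under 118 million rows" in the full data set
IMPORT_SECONDS = {"Rackspace": 6146, "Amazon": 15026}


def rows_per_second(seconds):
    """Average insert rate achieved during the test import."""
    return ROWS_IMPORTED / seconds


def projected_full_import_hours(seconds):
    """Hours for the full data set, assuming time scales linearly with rows."""
    return seconds * TOTAL_ROWS / ROWS_IMPORTED / 3600


for provider, seconds in IMPORT_SECONDS.items():
    print(f"{provider}: {rows_per_second(seconds):,.0f} rows/s, "
          f"~{projected_full_import_hours(seconds):.1f} h for the full data set")
```

That works out to roughly 2,360 rows per second on Rackspace against about 965 on Amazon, or around 14 hours versus 34 hours for all 118 million rows if the linear assumption holds.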
While the imports were running I monitored the servers and could see that the Amazon server lost a lot of performance to I/O wait. That shouldn’t matter as much for my next test, which is CPU bound. It’s single threaded, so what I’m looking at here is single-core performance. The test loads each day of intraday data for each ticker, one at a time, and then, with that data in memory, runs a specific trading strategy with a set of parameters on that day. As there’s bootstrapping involved, each set of data is also regenerated 1000 times (all in memory, so no disk or DB activity).
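To give a feel for the shape of this workload, here is a rough Python sketch of one ticker-day. It is not my actual test program: the resampling scheme (drawing the day's one-minute returns with replacement) and the `toy_strategy_pnl` rule are stand-ins I made up for illustration, as are all the function names.

```python
import random


def bootstrap_paths(closes, n_paths=1000, seed=None):
    """Yield `n_paths` synthetic close series for one ticker-day.

    Assumption: a plain i.i.d. bootstrap that resamples the day's
    one-minute gross returns with replacement and rebuilds a price path.
    """
    rng = random.Random(seed)
    returns = [b / a for a, b in zip(closes, closes[1:])]
    for _ in range(n_paths):
        path = [closes[0]]
        for r in rng.choices(returns, k=len(returns)):
            path.append(path[-1] * r)
        yield path


def toy_strategy_pnl(path, lookback=5):
    """Hypothetical placeholder strategy: hold one share for the next minute
    whenever the close is above the mean of the previous `lookback` closes."""
    pnl = 0.0
    for i in range(lookback, len(path) - 1):
        mean = sum(path[i - lookback:i]) / lookback
        if path[i] > mean:
            pnl += path[i + 1] - path[i]
    return pnl


def run_day(closes, n_paths=1000, seed=None):
    """Apply the strategy to every regenerated path, entirely in memory."""
    return [toy_strategy_pnl(p) for p in bootstrap_paths(closes, n_paths, seed)]
```

Everything after the initial load happens in memory in a single thread, which is why a test like this mostly measures single-core speed and not disk performance.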
Here the results are more even: the Rackspace server completed the test in 6639 seconds and the Amazon server in 7025 seconds, about 6% slower. I considered running the same test on a “High-CPU Extra Large Instance” on Amazon as well. The write speed should be about the same according to Amazon’s own details, but it has 0.5 more EC2 Compute Units per core. I didn’t bother, though, as it would probably only have brought it on par with the Rackspace server while still suffering in write speed.
So my choice is rather simple: Rackspace Cloud.