Only the Good Die Young (or Move Off-heap?)

Having some spare time over the Easter weekend I thought I’d spend a little time trying out a few things, including:

  • Java 8
  • Direct memory allocation with sun.misc.Unsafe

So I made a hash-map implementation that stores map entries on memory allocated outside the typical JVM heap: BinaryOffheapHashMap
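To set the stage, here's a minimal sketch (not from the BOHMap source itself) of the off-heap primitives it builds on: allocating, writing, reading, and freeing memory outside the JVM heap via sun.misc.Unsafe. The reflection hack is needed because Unsafe.getUnsafe() rejects application classes.

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class UnsafeDemo {
    static final Unsafe UNSAFE;
    static {
        try {
            // Grab the singleton instance via reflection; Unsafe.getUnsafe()
            // throws SecurityException for non-bootstrap classes.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Writes a long off-heap, reads it back, and frees the memory.
    static long roundTrip(long value) {
        long address = UNSAFE.allocateMemory(8); // 8 bytes, not GC-managed
        try {
            UNSAFE.putLong(address, value);
            return UNSAFE.getLong(address);
        } finally {
            UNSAFE.freeMemory(address); // freeing is our job now, not the GC's
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42L));
    }
}
```

Note that memory obtained this way is invisible to the garbage collector: forget the freeMemory call and you have a native memory leak.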

First of all: The GC process in Java is a lifesaver, allowing us all to focus on writing functionality rather than on the mundane task of allocating and freeing memory. Yet that doesn't mean you shouldn't be memory-aware. Failing to understand the memory impact of your code can lead to horrible performance, simply because you're creating too much garbage. You can try to avoid an ill-timed GC pause by having a bigger heap, but you're just postponing the inevitable.

Then there’s the case when you want to store huge amounts of data inside your JVM app. The lifecycle of this data is well understood and not part of any fancy distributed/parallel algorithm. You push your data in, then get it back out, potentially deleting it at some point. Isn’t it a bit wasteful to allocate a 200G heap, use it mainly for this purpose, and have the GC process scan through all that? For GC, the good die young, and it doesn’t scale that well with increasing heap space.

It’s for these reasons we’re beginning to see more and more focus on off-heap storage structures in Java. Off-heap here relates to memory not managed as part of the standard JVM heap, thereby not being part of what the garbage collector cares to look at. Products like Hazelcast or Oracle Coherence both provide this option for their distributed maps.

Several of these off-heap maps still keep their keys on-heap, for various reasons. BOHMap, as I call it (short for Binary Off-heap HashMap), does not. This allows it to keep only a few private fields, 2 ints + 2 longs + 2 object references (Unsafe and a hash function), for the duration of the instance. The rest are short-lived or managed off-heap.
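As a rough illustration of that on-heap footprint, here's a hypothetical shell matching the description above (2 ints + 2 longs + 2 object references). Field names are illustrative only, not taken from the actual BOHMap source.

```java
import sun.misc.Unsafe;
import java.util.function.ToIntFunction;

// Hypothetical sketch of a map whose only long-lived on-heap state is
// a handful of primitive fields plus two object references.
public class OffheapMapShell {
    private final int partitionCount;          // int #1: number of off-heap partitions
    private int size;                          // int #2: current entry count
    private final long partitionTableAddress;  // long #1: base address of the partition table
    private long allocatedBytes;               // long #2: off-heap allocation bookkeeping
    private final Unsafe unsafe;               // object ref #1
    private final ToIntFunction<byte[]> hashFunction; // object ref #2

    public OffheapMapShell(int partitionCount, Unsafe unsafe,
                           ToIntFunction<byte[]> hashFunction) {
        this.partitionCount = partitionCount;
        this.unsafe = unsafe;
        this.hashFunction = hashFunction;
        this.partitionTableAddress = 0L; // a real map would allocate this off-heap
    }

    public int partitionCount() { return partitionCount; }
}
```

Everything else (keys, values, bucket chains) would live behind that single off-heap address, out of the garbage collector's sight.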

Performance testing BOHMap against the standard HashMap is a little tricky. I could simply allocate just enough heap to store the test data, thereby forcing the JVM GC to spend a lot of time doing its job. That would surely make BOHMap look very impressive, but one could just as well argue the heap space wasn't correctly configured for the test case. One thing is for sure: BOHMap must do more work than the standard HashMap, since it needs to copy data between on-heap and off-heap memory.

If we start by looking at the fantasy case, where the JVM GC doesn’t need to go mental, here’s what I see:

On an AWS r3.8xlarge instance with 244 GiB of system memory, I give the JVM a 200G heap using -Xms200G -Xmx200G when testing the traditional HashMap. This is then able to put/update about 1 million entries per second. Had I continued the test longer, the GC would likely have impacted this significantly.

However, running with standard heap settings on this machine, forcing the GC to kick in earlier while the test is running, this drops to around 200k-300k per second.

In the BOHMap case I see around 500k+ puts/updates per second, with little to no impact on the heap and GC. I set the partition count to the expected map size divided by 5 for those numbers, and saw similar throughput when dividing by 25 instead, up to about 10 million map entries.

For the record: I was using 64-bit Java 8u5 on Ubuntu 13.10, with each key and value being a 64-byte Binary object filled with random content.

So there you go. I hope some of you might find this useful, especially if you need to store a lot of data in memory. I sure enjoyed creating it and playing around with sun.misc.Unsafe 🙂

8 comments

  1. Nice play with sun.misc.Unsafe, I would like to try it myself. A similar thought crossed my mind before, when I was dealing with a huge analytic process that required aggregating a million+ records in the heap. I used ByteBuffer.allocateDirect (Java 6), which proved quite tricky at the time since I was still limited by the process's native memory.

    I am curious about the benchmark results: neglecting the GC pauses, I would have thought HashMap put/update would be faster than BOHMap, considering the overhead of malloc (and potentially serialization)?

    1. Without GC, HashMap is faster at 1 million puts per second, about double BOHMap's rate. With GC impact, HashMap drops to about half of BOHMap. It's naturally a different story then, but there are always plenty of GC tuning options I didn't really explore for these micro-benchmarks.

      I think the point is that we all know that a large heap (8G, 16G, 32G, 64G+ ?) is going to have a GC performance impact, something one could argue is a waste when the data lifecycle is as simple as it is for a hashmap.

      How much of that is down to malloc versus the on-heap to off-heap copy, I'm not sure. For the serialization cost: yes, there's a cost when using OHMap (the serialization wrapper for BOHMap). Serialization is again something that could likely be optimized by not using the default Object[Input|Output]Stream classes, opting instead for something more specialized. I was considering looking into whether something like SBE (https://github.com/real-logic/simple-binary-encoding) could be used, but basically ran out of time.
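To make the "something more specialized" idea concrete, here's a sketch of a hand-rolled, length-prefixed encoding in place of ObjectOutputStream's generic format. The names are illustrative, not from OHMap or BOHMap.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class RawCodec {
    // Encode: a 4-byte length prefix followed by the payload bytes.
    static byte[] encode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Decode: read the length prefix, then copy out the payload.
    static byte[] decode(byte[] encoded) {
        ByteBuffer buf = ByteBuffer.wrap(encoded);
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        byte[] in = {1, 2, 3};
        System.out.println(Arrays.equals(in, decode(encode(in))));
    }
}
```

Unlike ObjectOutputStream, this writes no class descriptors or stream headers, which is exactly the kind of per-entry overhead a specialized codec (or SBE) avoids.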

  2. My bad: since we are on the topic of "Unsafe", my brain tricked me into missing the distinction between BOHMap and HashMap.

    malloc is known to be expensive compared to Java object allocation, which has in fact made Java perform better than C++ in some benchmarks.

    Thanks for sharing the code, it will be fun to play with it at some point.

    1. But I'd assume Unsafe.allocateMemory (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/misc/Unsafe.java#Unsafe.allocateMemory%28long%29) is pretty much a direct call to malloc. It is, as far as I can tell: http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/9b0ca45cd756/src/share/vm/prims/unsafe.cpp.

      So your point about malloc would be valid in isolation no matter what. But then again, the on-heap to off-heap copy would be the first place I'd look to optimize if possible, for example here: https://github.com/cfelde/BinaryOffheapHashMap/blob/master/src/com/cfelde/bohmap/BOHMap.java#L446

      I could envision Binary (https://github.com/cfelde/BinaryOffheapHashMap/blob/master/src/com/cfelde/bohmap/Binary.java) potentially playing a part in this, but it depends on the source data, i.e. is the source data already on-heap or can it come from off-heap?

      If it can come from off-heap, some memory copies could potentially be avoided. Example: network data is already located off-heap. It then gets copied on-heap, and if it's then placed in BOHMap, it goes back off-heap. That's 2 copy operations. However, if Binary was changed to allow, say, an off-heap direct buffer pointer as part of the constructor, you could potentially reduce this to a single copy operation.
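As a sketch of that single-copy idea (assuming the source address is already known, e.g. from a native network buffer), Unsafe.copyMemory can move bytes between two off-heap regions directly, never touching an intermediate on-heap byte[]:

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class OffheapCopyDemo {
    static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static long copyOnce(long value) {
        long src = UNSAFE.allocateMemory(8); // stands in for an off-heap network buffer
        long dst = UNSAFE.allocateMemory(8); // stands in for the map's off-heap storage
        try {
            UNSAFE.putLong(src, value);
            UNSAFE.copyMemory(src, dst, 8);  // one copy, address to address
            return UNSAFE.getLong(dst);
        } finally {
            UNSAFE.freeMemory(src);
            UNSAFE.freeMemory(dst);
        }
    }

    public static void main(String[] args) {
        System.out.println(copyOnce(7L));
    }
}
```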

  3. It's understood that the point of BOHMap is not performance gain, but rather to extend into big memory outside the JVM heap boundary.

    If the source data can be transferred directly, that will benefit performance, which I believe is what DirectByteBuffer in Java was intended for: to help interactions with native I/O.

    With that in mind, the guys at OpenHFT have discussed their off-heap HashMap in detail (http://www.infoq.com/articles/Open-JDK-and-HashMap-Off-Heap); it's worth checking out.
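For reference, the public API route mentioned above looks like this: ByteBuffer.allocateDirect hands back a buffer backed by native memory outside the normal heap, though its capacity still counts against the process's native memory limits.

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    // Round-trips a long through a direct (off-heap) buffer.
    static long roundTrip(long value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(8); // native allocation
        buf.putLong(0, value);  // absolute put, no position bookkeeping
        return buf.getLong(0);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(123L));
    }
}
```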

    1. Feel free to compare both for your use case and let me know. I think README.md should stay focused, however.

  4. Nice experiment.

    Check out the off-heap maps of https://github.com/RuedigerMoeller/fast-serialization (version >2.0). As my serialization is faster, you'll get a better put rate. There is also support for memory-mapped file based persistence.

    It should also be possible to replace the default serialization with fst in your implementation to speed it up (just contact me in case you have trouble doing that efficiently).

Comments are closed.