Space-efficient Bloom filter index, 2x performance gain

August 7, 2013

To quote Wikipedia: "A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set."

This is often used by systems in order to avoid accessing slow media, like a disk. Take HBase or Cassandra, for instance: instead of reading their data files, which might be huge, to figure out if a particular key is present, they can first consult their Bloom filters and get a true/false answer on whether the key is likely to be in there. I say likely because a Bloom filter isn't guaranteed to be correct. More formally: false positives are possible, but false negatives are not.

Given that false negatives can’t happen, the worst we’ll do is read a file we didn’t have to read. But we’re also guaranteed to read the file if the key is present, which is the type of guarantee we need.
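
To make that guarantee concrete, here is a minimal sketch using Magnus Skjegstad's BloomFilter (the same implementation used later in this post); the package name and constructor arguments are my assumptions about that library:

    // Assuming com.skjegstad.utils.BloomFilter(falsePositiveProbability, expectedElements)
    BloomFilter<String> filter = new BloomFilter<String>(0.1, 40000);
    filter.add("key-present-in-file");

    // Always true: no false negatives for elements we have added
    boolean mustReadFile = filter.contains("key-present-in-file");

    // Usually false, but true roughly 10% of the time (a false positive),
    // which only costs us an unnecessary read, never a missed one
    boolean mayReadFile = filter.contains("key-not-in-file");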

Using a similar technique I managed to get almost a 2x performance gain on Oracle Coherence with a custom Bloom-filter-based index implementation (versus no index), while using just a fraction of the memory a standard index instance would use (a few kB at most). But before we get to that, let's present a use case and explain why indexing in an in-memory key-value store makes sense to begin with.

Assume you have a large Coherence cluster storing your internal business objects. They are keyed on some internal ID which makes sense for your organization, so looking up entries by this key is efficient, given Coherence's hash-map-like behaviour. But let's also assume that this internal business object maintains some connection with the outside world. The outside world couldn't care less about your internal key structure, so whenever they talk about the same business objects in their context, it is in terms of their key. We'll call this key the external key, and sometimes you need to look up entries on this external key.
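
In code, the internal-key lookup is just a plain get; the cache name below is made up, and CacheObject is the value class used later in the example:

    NamedCache cache = CacheFactory.getCache("business-objects"); // cache name assumed

    // Fast path: the cache is keyed on the internal ID, so this is a hash lookup
    CacheObject obj = (CacheObject) cache.get(internalId);

    // There is no equivalent get(...) for the external key: it lives inside the
    // (serialized) value, which is exactly the problem discussed next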

Looking up entries on their external key is a costly operation. It basically forces you to do a "full table scan" using a filter to find the particular entry you're after. That shouldn't be a very heavy operation for an in-memory system, right? Well, it is. It is costly because we store our entries in serialized form. Since Oracle Coherence allows us to form clusters with several nodes spread across machines, it must be able to send data between the nodes. To do that everything must be serializable, so objects are also internally stored in serialized (or binary) form. So whenever we want to read a value in object form, it must first be deserialized. There are certain technical things we can do to make this more efficient, like using the POF serialization framework. That way we can make sure we only work with binary data if/when possible, but that's outside the scope of this post. Besides, it's still going to be a full table scan, even with POF.
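
Concretely, such a filter-based lookup looks something like the sketch below; the getExternalKey accessor is an assumption about the value class:

    // A "full table scan": the filter is shipped to each node, and every entry is
    // deserialized so its external key can be compared against the one we want
    Set results = cache.entrySet(
            new EqualsFilter(new ReflectionExtractor("getExternalKey"), externalKey));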

So besides POF, what's the other way of making this "full table scan" more efficient? We can simply add a standard index, naturally. The index creates a reversed view of the map content, allowing us to ask questions like: which keys have entries that contain the value X? This is fast because the view is reversed: X becomes the key to a set of values, where the values are the keys pointing to the entries. So with an index we can quickly eliminate all keys that don't contain X.

As an example: on my machine I had three nodes that together contained 100,000 entries, spread out evenly. Then I asked the cache, using an EqualsFilter, to find the entry with value X. Without an index this "full table scan" took close to 140 seconds; with an index it took just over 1 second. 140 seconds is horrible performance given the simple thing we're trying to do, so 1 second is a pretty big improvement.
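
For reference, the standard index from that test is added with a single call; the same EqualsFilter query then consults the reversed view instead of scanning:

    // Build a standard (reversed) index over the external key; false = unordered, no comparator
    cache.addIndex(new ReflectionExtractor("getExternalKey"), false, null);

    // The same entrySet(new EqualsFilter(...)) query as above now hits the index:
    // roughly 1 second instead of close to 140 seconds in my test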

But back to our external key use case: assume we can, with more than sufficient probability, guarantee that each external key is unique (but different from our internal key). An index on external keys will then contain one entry for each entry in our cache, and each reversed-view entry will only contain one key in its value set, pointing to the one cache entry with this external key. This might be fine if we are prepared to pay the price and have enough memory available. Memory in itself doesn't cost much, but we also have GC overhead, etc. So assume you don't want to pay that price, but still would like to speed up external-key-based lookups. Is there anything you can do?

What we can do is apply the same sort of pattern used by HBase/Cassandra: use a Bloom filter to quickly exclude entries we know don't contain the external key. As mentioned, doing this I managed to get almost a 2x speed-up, but at a fraction of the memory required by a full standard index implementation.

I’ve dumped the code I wrote to test this on GitHub, so feel free to play around. Besides requiring the Coherence jar (I used 3.7.1.9) it also uses Magnus Skjegstad’s BloomFilter implementation. Given its license I included it for you, but check out his repository for the latest and greatest.

I'm not going to cover all of it, only highlight the key parts, specifically the triangular interplay between the index, the extractor and the filter.

Starting with the BloomIndex map index implementation, inside the insert method we find:

    public void insert(Entry entry) {
        BinaryEntry be = (BinaryEntry) entry;

        // Use the binary (still serialized) key directly, so we never pay for
        // deserializing it; the value is extracted via the supplied extractor
        String key = be.getBinaryKey().toString();
        String value = (String) InvocableMapHelper.extractFromEntry(extractor, entry);

        // Store the concatenated key+value representation in the Bloom filter
        bloomFilter.add(getBloomValue(key, value));
    }

All we're doing is adding the concatenation of the key (which we avoid deserializing) and the value (which we in this example assume to be a string, but could have been another type if needed). Note that for the example the delete and update methods are not implemented. Removing from a Bloom filter is not that easy, so this is not a general-purpose index in that sense. Various strategies/algorithms can be used to deal with this in a full implementation.
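
getBloomValue isn't shown above; a minimal version could look like the sketch below, though the exact concatenation scheme in the repository may differ:

    // Sketch of the key+value representation fed to the Bloom filter; the actual
    // format/separator used in the repository may differ
    String getBloomValue(String binaryKey, String extractedValue) {
        return binaryKey + extractedValue;
    }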

Next up is the extractor. Besides working with the particular CacheObject used as cache entry values, it is also an index-aware extractor, responsible for creating the BloomIndex instance when the index is added. Doing this is straightforward:

    public MapIndex createIndex(boolean ordered, Comparator comparator, Map mapIndex, BackingMapContext context) {
        MapIndex index = (MapIndex) mapIndex.get(this);

        // If an index already exists for this extractor, reuse it. The Bloom
        // filter index has no ordering, so the ordered flag is ignored.
        if (index != null) {
            return index;
        }

        index = new BloomIndex(this);
        mapIndex.put(this, index);

        return index;
    }
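
With the extractor in place, registering the Bloom filter index looks just like adding any other index; the extractor class name and constructor below are assumptions for illustration:

    // Adding the index-aware extractor triggers createIndex(...) above on each
    // storage node; false = unordered, no comparator needed
    ValueExtractor bloomExtractor = new BloomExtractor(); // class name/constructor assumed
    cache.addIndex(bloomExtractor, false, null);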

Finally we have a special equals filter called BloomEqualsFilter that knows how to utilize the index:

    public boolean evaluateEntry(Entry entry) {
        return evaluateExtracted((entry instanceof QueryMap.Entry)
                ? ((QueryMap.Entry) entry).extract(extractor)
                : InvocableMapHelper.extractFromEntry(extractor, entry));
    }

    private boolean evaluateExtracted(Object target) {
        return value.equals(target);
    }

    @Override
    public boolean evaluate(Object target) {
        return evaluateExtracted(target);
    }

    @SuppressWarnings({"rawtypes", "unchecked"})
    @Override
    public Filter applyIndex(Map indexMap, Set keySet) {
        BloomIndex index = (BloomIndex) indexMap.get(extractor);

        // No Bloom index available: fall back to plain entry-by-entry evaluation
        if (index == null) {
            return this;
        }

        // Remove every key whose key+value concatenation is definitely not in the
        // Bloom filter (no false negatives, so we never drop a real match)
        Set removeKeys = new HashSet();
        for (Object oKey : keySet) {
            String key = ((Binary) oKey).toString();
            if (!index.getBloomFilter().contains(index.getBloomValue(key, value))) {
                removeKeys.add(oKey);
            }
        }
        keySet.removeAll(removeKeys);

        // Nothing left: no entry can match, so no further evaluation is needed
        if (keySet.isEmpty()) {
            return null;
        }

        // The surviving keys may still be false positives; let evaluateEntry decide
        return this;
    }

There's nothing special about the evaluateEntry/evaluateExtracted methods; they behave just as we would expect an "is value A equal to value B" filter to do (minus a few null checks). The logic for applying the index is within the applyIndex method. All we're doing here is recreating the expected concatenation between the binary key and the value we're after and checking if it's part of the Bloom filter. If it's not, we remove that particular key, hoping to eliminate as many keys as possible. Any remaining entries are then tested by the filter using the evaluateEntry method.
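
Putting it together on the client side, an external-key lookup then looks roughly like this (the BloomEqualsFilter constructor arguments are assumptions):

    // applyIndex(...) first prunes keys via the Bloom filter on each storage node,
    // evaluateEntry(...) then confirms the surviving candidates
    Filter filter = new BloomEqualsFilter(bloomExtractor, externalKey); // constructor assumed
    Set entries = cache.entrySet(filter);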

During my testing I had three nodes (StartNode.java), one proxy (StartProxy.java), and ran one of the Run*Client.java classes to test the various scenarios (no index, standard index, Bloom-filter-based index). Remember to start with a fresh cache between each client run, given the incomplete index implementation (no delete/update support).

I ran this on an 8-core, 8 GB machine using JDK 1.7.0_07 64-bit with Coherence 3.7.1.9 (Build 44754, Grid Edition: Development mode). I didn't experiment much with tuning the Bloom filter, sticking to a false positive probability of 0.1 and an expected number of elements of 40,000. Given three nodes, 40,000 seemed like a safe value.

The run time for each of the three cases is shown below. This is the time it took the client to run the get loop for 1,000 random entries.

Standard, no index: 138 seconds
Bloom filter based index: 72 seconds
Standard full index: 1046 milliseconds

To recap: the Bloom-filter-based index is clearly a lot slower than the standard full index, but the standard full index also uses a lot more memory. So it's all about trade-offs. If you have some external ID you query occasionally and want to speed up, but don't want to spend all that memory on it, then this might be an interesting technique. Feel free to leave me a message/comment if you have feedback, thanks.