To quote Wikipedia: "A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set."
This is often used by systems to avoid accessing slow media, such as disk. Take HBase or Cassandra, for instance: instead of reading their data files, which might be huge, to figure out whether a particular key is present, they can first consult their Bloom filters and get a true/false answer on whether the key is likely to be in there. I say likely because a Bloom filter isn't guaranteed to be correct. More formally: false positives are possible, but false negatives are not.
Given that false negatives can’t happen, the worst we’ll do is read a file we didn’t have to read. But we’re also guaranteed to read the file if the key is present, which is the type of guarantee we need.
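To make the pattern concrete, here's a minimal sketch using Guava's BloomFilter as a stand-in (HBase and Cassandra ship their own internal implementations); the key name and sizing numbers are illustrative:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterSketch {
    public static void main(String[] args) {
        // Sized for ~1,000,000 expected keys with a ~1% false-positive rate.
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        filter.put("row-42");

        // mightContain() may return a false positive, but never a false
        // negative: if the key was added, this is guaranteed to be true.
        if (filter.mightContain("row-42")) {
            // Only now pay for the expensive read (disk, network, ...).
        }
    }
}
```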
Using a similar technique I managed to get almost a 2x performance gain on Oracle Coherence using a custom Bloom filter based index implementation (versus no index), using just a fraction of the memory a standard index instance would use (a few kB at most). But before we get to that, let's present a use case and explain why indexing in an in-memory key-value store makes sense to begin with.
For the TLDRs:
This blog post presents the reasoning behind a project called ResourcePriorityBlockingQueue, which is a blocking queue implementation that:
- Allows you to assign a priority to tasks, just as with a PriorityBlockingQueue.
- Tasks may belong to task groups, where each group may have a different priority. In that case, tasks are prioritized by group first, then by task priority (see the ordering sketch after this list).
- For a particular instance of a ResourcePriorityBlockingQueue you supply implementations of a ResourcePrioritizer and a ResourceAllocator, which together define which resources a particular task is made available to.
- Focus has been put on making the code highly concurrent with as little synchronized code as possible. Lock-free code is used where applicable, with efficient use of internal data structures. This gives ResourcePriorityBlockingQueue performance characteristics comparable to those of a pure PriorityBlockingQueue when used in a similar fashion.
- It's available on GitHub, so feel free to poke around. Any constructive feedback is always welcome.
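To make the group-then-task ordering concrete, here is a minimal sketch built on the standard JDK PriorityBlockingQueue with a composite comparator. The Task record and its fields are illustrative assumptions, not the project's actual API (that lives on GitHub); the real queue layers resource awareness on top of this idea:

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class GroupPrioritySketch {
    // Illustrative task type; lower numbers mean higher priority.
    record Task(String name, int groupPriority, int taskPriority) {}

    public static void main(String[] args) throws InterruptedException {
        // Order by group priority first, then by task priority within the group.
        PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>(11,
                Comparator.comparingInt(Task::groupPriority)
                          .thenComparingInt(Task::taskPriority));

        queue.put(new Task("urgent task in low-priority group", 2, 0));
        queue.put(new Task("relaxed task in high-priority group", 1, 9));

        // The higher-priority group wins despite the lower task priority.
        System.out.println(queue.take().name()); // relaxed task in high-priority group
    }
}
```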
I spend a lot of my time working with Oracle Coherence. If you've never heard of Coherence, it can briefly be described as a linearly scalable in-memory HashMap. By linearly scalable I mean a distributed HashMap, where each cluster member is responsible for storing a portion of the complete map. As everything is in memory, you maintain data availability by ensuring there are nodes in the cluster which also hold a backup of each primary partition. So should a node die, one or more backups are readily available to become primary storage locations.
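As a minimal sketch of that map-like surface, using the classic Coherence API (the cache name and values are illustrative):

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class DistributedMapSketch {
    public static void main(String[] args) {
        // Looks like a plain Map, but entries are partitioned across the
        // cluster, with backup copies held on other members for availability.
        NamedCache cache = CacheFactory.getCache("example");
        cache.put("key-1", "value-1");
        System.out.println(cache.get("key-1")); // value-1
        CacheFactory.shutdown();
    }
}
```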
So as you add nodes you increase the total available storage space, with each node serving a smaller fraction of the total data content, thereby ensuring linear and horizontal scalability for getting and putting data. But getting and putting data is just a simple little use case for Coherence. Instead I find it more interesting to focus on its parallel aggregation capabilities. As each node owns a fraction of the total data population, you're able to write aggregation logic which, if fanned out, allows each node to perform a partial aggregation on only its data fraction.
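For instance, here's a sketch of a fanned-out sum using the classic Coherence aggregator API; the cache name and the getAmount() accessor on the cached value are hypothetical:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.aggregator.LongSum;
import com.tangosol.util.filter.AlwaysFilter;

public class PartialAggregationSketch {
    public static void main(String[] args) {
        NamedCache trades = CacheFactory.getCache("trades");

        // The aggregator is shipped to every storage node; each node sums
        // only the entries it owns and sends back just its partial result,
        // which Coherence combines into the final answer on the caller.
        Object total = trades.aggregate(
                new AlwaysFilter(),          // visit every entry
                new LongSum("getAmount"));   // hypothetical accessor on the cached value

        System.out.println("Total: " + total);
        CacheFactory.shutdown();
    }
}
```

Only the partial results cross the network, so the work scales with the number of nodes rather than with the size of the full data set.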