3 ws of Bloomfilter

May 4, 2018

—

Hey guys, today I wanna talk about a new data structure used in hbase for fast random scan. I categorized this topics into the hbase subdirectory because it caught my attention when I try to split regions in hbase manually. In this blog, I will be explaining what is bloomfilter, why it is useful for hbase for fast random access and most importantly, when to use it for our own application. After this post, you will learn a new hashing technique and hopefully how to implement it with some codes.

WHAT

In a nutshell, bloomfilter is a binary array with initial value of all 0s. For a certain info, we use different hashing functions to generate different index that this info should be reside and then mark the position value with 1. The following pictures shows the process. Quoted from wikipedia which is really like a bloom
As you can see, for different info x, y, z, applying different hash function will generate different position that the target info should be reside. x is hashed to store in the position with index of 1, 5, 13, y is hashed to store in the position with index of 4, 11, 16, z is hashed to store in the position of 3, 5, 11. From my point of view, the goal of this technology is to prevent unnecessary visit or query.

At position 5 and 11 are both hit twice but both by different piece of data. That is to say, if you get a new piece of data that has hash value of 1, 3, 4, 5, 11, 13, 16, doesn’t mean the new piece of data already reside in the database. On the contrary, you can say the new piece of data is definitely not in the database. This technique will avoid actual query to the database in many cases depending on the size of your bloomfilter. That is why bloomfilter is also called a probabilistic data structure.

Think of a scenario where you want to know which person is living in which district in a city that contains only 4 cities. You don’t know the exact location where it is but you do know that he is not in which 2 cities. Even though you don’t know exactly which city he lives in with this information. However, you can narrow down the searching scope from 4 cities to only 2 cities which. Sounds like a piece of data structure storing metadata right? Indeed, hbase use this tech for boosting the random access for sure.

WHY

First of all, you must understand that hbase has rowkey and column family. When performing a query action in hbase, a user will provide rowkey only or more fine grained info with column family. In order to reduce the I/O operation which is literally the goal of boosting the performence of hbase, some kind of extra meta info should be recorded in somewhere for faster lookup. Think about the use case that I talk about in part WHAT. HBase uses Bloomfilter for letting readers know if a cell(which is the smallest unit of data consisting rowkey and column family) is absent in the block(the smallest unit of read and write data into HFile). Below is the relationship between different terms that are used in hbase:

Table
| Region
|---- Store
|---- MemStore
|-------- StoreFile
|------------ Block

QUOTED FROM http://hbase.apache.org/0.94/book/regions.arch.html

Everytime before searching for specified data with given cell, BlockCache will be queried by user to make sure if the data exists in the block or not using the bloomfilter. Below is the seudo code of querying a piece of data.

if hash value in the bloomfilter are all 1:
    required rowkey could be in this region server
else:
    required rowkey is not in this region server

You can see this if...else... statement can only provide a certain probability that this rowkey data is reside in certain region server but definetely not in some region server. Obviously, hhbase provide some parameters to tune so that the probability is adjustable.
Quoted from wikipedia

WHEN

You might be thinking in which scenario will this technique be used. Take some time to think about some creative way for using it. Below are some use cases that others suggest:

password check
ip check
visted records check ……
In a word, bloomfilter could be used as a gate way checker to reduce the io operation to the real databases. I think scenario like this worth a thought of using bloomfilter.

3 ws of Bloomfilter

WHAT

WHY

WHEN

Reference