# Bloom Filters

*by doku*

## What is a Bloom Filter?

A Bloom Filter is a probabilistic data structure for checking membership of an item within a set. It answers the question: *Is element $x$ in my set $S$?* To this, a Bloom Filter can answer either No (definitely) or Yes (probabilistically). What does this mean? If a Bloom Filter states that $x \in S$, where $x$ is an element and $S$ is a set, then $x$ *may* be in $S$. However, if the Bloom Filter states that $x \notin S$, then $x$ is *definitely not* in $S$.

A Bloom Filter keeps track of which elements are in the set through a bit array. This bit array, in combination with hash functions (as described below), maintains the data structure.

A key benefit of a Bloom Filter is that its runtime does not depend on the input size or the set size. We will see this in more detail later, but a Bloom Filter checks membership by putting element $x$ through $k$ hash functions $h_1, \ldots, h_k$ (each with runtime $O(1)$) and taking each result modulo $m$, the size of the bit array. This gives a total runtime of $O(k)$, irrespective of the size of the input or the number of items in the set (i.e. the number of previously inserted items). This makes Bloom Filters very attractive for testing whether items are *not* in large sets.

## How do Bloom Filters Work?

In this section we describe how Bloom Filters work. There are four main things to discuss: insertion, queries, the probability of a false positive (an element incorrectly reported as in the set), and why Bloom Filters deterministically state that items are not in the set. Bloom Filters are quite clever, and it is no surprise that this is such a classic data structure.

Assume that you have an element $x$ you would like to insert, along with $k$ chosen hash functions $h_1, \ldots, h_k$ and a bit array $B$ of length $m$. $B$ is initialized to all 0s.
### Bloom Filter Insertion Algorithm

The Bloom Filter insertion algorithm is as follows:

1. Compute $k$ hash values: $h_1(x), \ldots, h_k(x)$.
2. For each hash value, take the modulo with respect to $m$: $bit_i = h_i(x) \bmod m$.
3. Set each corresponding bit in $B$ to 1: $B[bit_i] = 1$.

The runtime of this algorithm is $O(k)$. Each hash function $h_i(x)$ takes $O(1)$ to compute, and there are $k$ of them, giving Steps 1–2 a runtime of $O(k \cdot 1) = O(k)$. Each of the $k$ bit accesses in $B$ is $O(1)$, giving Step 3 a runtime of $O(k \cdot 1) = O(k)$ as well. Since the steps run sequentially, the total is $O(k) + O(k) = O(2k) = O(k)$.

### Bloom Filter Query Algorithm (Membership Test)

To check if element $x$ is in the set, we use the following algorithm:

1. Compute $k$ hash values: $h_1(x), \ldots, h_k(x)$.
2. For each hash value, take the modulo with respect to $m$: $bit_i = h_i(x) \bmod m$.
3. Check whether $B[bit_i] = 1$ for all $i \in \{1, \ldots, k\}$.
4. If $B[bit_i] = 1$ for all $i$, then $x$ *may* be in the set. Otherwise, $x$ is *definitely not* in the set.

The runtime analysis is identical to insertion: $O(k)$ for hashing plus $O(k)$ for bit accesses, for a total of $O(k)$.

### Probability of a False Positive

Here we derive the probability of a false positive: the probability that, for an input $x$ with $x \notin S$, the query algorithm reports $x \in S$. This is the reason we can only say that $x$ *may* be in the set.
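The insertion and query algorithms above can be sketched in Python. This is a minimal illustration, not a production implementation: deriving the $k$ hash functions by salting SHA-256 with the hash index $i$ is an assumption of this sketch, not something the algorithm prescribes.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of size m with k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m  # B, initialized to all 0s

    def _bit_indices(self, x: str):
        # Derive k hash values by salting SHA-256 with the index i
        # (one simple way to get k "independent" hashes).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.m  # bit_i = h_i(x) mod m

    def insert(self, x: str) -> None:
        # Insertion: set all k corresponding bits to 1. Runtime O(k).
        for bit in self._bit_indices(x):
            self.bits[bit] = 1

    def query(self, x: str) -> bool:
        # Membership test: True means x *may* be in the set,
        # False means x is *definitely not* in the set. Runtime O(k).
        return all(self.bits[bit] == 1 for bit in self._bit_indices(x))


bf = BloomFilter(m=1024, k=3)
bf.insert("apple")
print(bf.query("apple"))   # True: an inserted element always passes the test
print(bf.query("banana"))  # almost certainly False here, but a false positive is possible
```

Note that an inserted element can never fail the query, since its bits were all set and bits are never cleared; only the Yes answer is probabilistic.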
To see intuitively why different elements can collide by activating the same bits in the bit array, consider a thought experiment. Suppose we have a bit array $B$ of size 1. Any hash satisfies $h_i(x) \bmod 1 = 0$, so any insertion sets the bit at index 0 to 1. Since $B$ has only one bit, every query also maps to index 0, and no matter what our element $x$ is, the query algorithm will report that $x$ may be in the set. This is the extreme edge case, but it illustrates why, given a finite bit array, we can only guarantee that $x$ *may* be in the set.

Let's assume that each hash output, taken modulo $m$, is uniform over $[0, m-1]$ and independent across hash functions and insertions. We now compute the probability that a specific bit $b \in [0, m-1]$ of the bit array $B$ remains 0. From the uniformity assumption, the probability that hash $i$ lands on $b$ is $\frac{1}{m}$, so the probability that hash $i$ does not land on $b$ is $1 - \frac{1}{m}$. Since each insertion applies $k$ independent hashes, the probability that none of the $k$ hashes lands on $b$ when inserting a single element $x$ is $\left(1 - \frac{1}{m}\right)^k$. Extending this reasoning: after inserting $n$ elements, each insertion independent, we have $n$ such independent events, i.e.
$$P(\text{all } k \text{ hashes of all } n \text{ inserted elements miss } b) = \left(1 - \frac{1}{m}\right)^{kn}$$

This is quite a mouthful, so from here on I'll just write (with all the same assumptions):

$$P(b=0) = \left(1 - \frac{1}{m}\right)^{kn}$$

We can simplify this using the standard limit: as $m$ goes to infinity,

$$\left(1 - \frac{1}{m}\right)^m \approx e^{-1}$$

With some algebra we get an approximation for $P(b=0)$:

$$\left(1 - \frac{1}{m}\right)^{kn} = \left(\left(1 - \frac{1}{m}\right)^m\right)^{kn/m} \approx \left(e^{-1}\right)^{kn/m} = e^{-kn/m}$$

giving the result:

$$P(b=0) \approx e^{-kn/m}$$

What does this mean? This is the probability that a randomly chosen bit in the array is still 0; it measures bit occupancy. As you increase $m$, more bits stay 0 and the bit array becomes sparser. If you increase $k$, the bits become less sparse. We have no control over $n$, but increasing it likewise fills in more bits. For intuition, try plotting this on [Desmos](https://www.desmos.com/calculator).

We are not finished yet. A false positive occurs when we query an element $x$ that was never inserted and all $k$ queried bits read 1. Thus we need $P(b=1)$, the probability that a random bit is 1 after all insertions. Knowing $P(b=0)$, we have:

$$P(b=1) = 1 - P(b=0) = 1 - \left(1 - \frac{1}{m}\right)^{kn} \approx 1 - e^{-kn/m}$$

Given $P(b=1)$, we need the probability that all $k$ queried bits are 1. Treating the $k$ queried bits as independent (strictly an approximation, but the standard one), this is $P(b=1)^k$:

$$P(\text{False Positive}) = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k$$

What does this mean? As the number of inserted elements $n$ increases, the false positive rate increases. As $m$ increases, the false positive rate decreases.
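To make the formula concrete, here is a small Python sketch that evaluates the approximation $(1 - e^{-kn/m})^k$ while sweeping $k$; the values of $n$ and $m$ are arbitrary example numbers.

```python
import math


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate P(False Positive) = (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Arbitrary example: n = 1,000 inserted elements, m = 10,000 bits.
n, m = 1_000, 10_000
for k in (1, 2, 4, 7, 10, 14):
    print(f"k={k:2d}  P(FP) ~ {false_positive_rate(n, m, k):.4f}")
```

Running the sweep shows the rate falling and then rising again as $k$ grows, consistent with the U-shaped behavior in $k$ noted in the text; the classical result for the optimal $k$ is $\frac{m}{n}\ln 2$, about 6.9 for these example values.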
As $k$ increases, we see a U-shaped function: the false positive rate first decreases, then increases, signaling that there is an optimal $k$.

### Why Bloom Filters Always Answer No Correctly

Here we show by proof by contradiction why Bloom Filters can always say deterministically that an element $x$ is **not** in set $S$. Suppose, for the sake of contradiction, that $x \in S$ despite $x$ failing the membership test of the query algorithm. If $x \in S$, then the insertion algorithm set $B[h_i(x) \bmod m] = 1$ for all $i \in \{1, \ldots, k\}$, and bits are never cleared afterward. However, because $x$ failed the membership test, there exists some $i$ such that $B[h_i(x) \bmod m] = 0$. This is a contradiction: the same bit cannot be both 1 and 0. Thus a failed membership test proves that $x \notin S$, and Bloom Filters always answer No correctly.

## Who uses Bloom Filters?

Bloom Filters are great for determining that an object is not in a set. With configurable parameters such as the number of hashes $k$ and the size of the bit array $m$, they are a great choice for big-data problems. Additionally, the $O(k)$ runtime does not depend on input size, and $m$ is tunable by users for their setup. Some products that use Bloom Filters include Google Bigtable, Apache Cassandra, [LSM-Tree](https://doku88-github-io.onrender.com/blog/lsm-trees) based stores, Google Chrome (filtering malicious websites), email spam filtering systems, and more.

To recap, the primary benefits are:

- Fast, constant-time lookups
- Low memory footprint

The downside is a false positive rate that, luckily, is controllable.