B-Trees
by doku · Doku system design databases

🎥 Video

### What are B-Trees? (Overview)

B-Trees are balanced search trees designed to work well on disks and other direct-access secondary storage devices. They are optimized to minimize disk I/O operations.

B-Tree nodes can have many children, ranging from a few to a thousand. A B-Tree with m keys has a **height of O(log(m))**, where the base of the log is the branching factor (the number of children per node). Each node with n keys has n+1 children. Node sizes are typically chosen so that a node fills an entire page when it is read in from disk, which makes loading the data structure efficient.

We should also note that a B+ Tree stores all the satellite information in the leaves and keeps only keys and child pointers in the internal nodes. This is a good design: all of the record data sits at the leaves, internal nodes pack more keys per page, the branching factor goes up, and the tree stays as shallow as possible, since traversal only needs keys for comparison. These choices will become clearer when we discuss B-Tree height and search along with hardware factors. Note that the video included on this post also covers B-Tree insertions and deletions.

### B-Tree Search Algorithm

This section describes how to find an element in a B-Tree. We will prove the height of the B-Tree in the next section, so do not worry about runtime guarantees yet; just know that since the height of the tree is O(log(m)) (with the branching factor as the base of the log), this is also the runtime of B-Tree search. The algorithm is as follows:

0. Determine which value v we want to find.
1. Start at the root node of the tree.
2. Given the n keys in the current node, find the gap between two adjacent keys where v falls, and follow the pointer associated with that gap. Note: between every two keys there is a pointer to a node in the next layer of the tree; the first and last keys also have pointers on their outer sides.
3. Repeat this process until we either encounter v among a node's keys or reach a leaf node.

Because we stop at a leaf node at the latest, and the height of the tree is O(log(m)), this algorithm visits O(log(m)) nodes. Strictly speaking, Step 2 compares v against up to n keys and therefore costs O(n) per node. However, as the Hardware Factors section will show, that in-memory work is negligible compared to the cost of moving from one level of the tree to the next, so the effective runtime is O(log(m)), the height of the tree we traverse.
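To make the search procedure concrete, here is a minimal Python sketch. The `BTreeNode` class and its fields are illustrative assumptions rather than any particular database's implementation; in a real engine, following a child pointer means reading another page from disk.

```python
from bisect import bisect_left

class BTreeNode:
    """Illustrative in-memory node: `keys` is sorted, and an internal
    node with k keys has k+1 children (assumed layout, for this sketch)."""
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted list of keys
        self.children = children or []    # empty list for a leaf
        self.is_leaf = not self.children

def btree_search(node, v):
    """Return the node whose keys contain v, or None if v is absent."""
    while True:
        i = bisect_left(node.keys, v)     # Step 2: find the gap for v
        if i < len(node.keys) and node.keys[i] == v:
            return node                   # found v among this node's keys
        if node.is_leaf:
            return None                   # Step 3: hit a leaf without finding v
        node = node.children[i]           # descend; in a real engine this is a page read

# Tiny example: a root with keys [10, 20] and three leaf children.
root = BTreeNode([10, 20], [
    BTreeNode([2, 5]),
    BTreeNode([12, 17]),
    BTreeNode([25, 30]),
])
print(btree_search(root, 17) is not None)  # True
print(btree_search(root, 5) is not None)   # True
print(btree_search(root, 21) is not None)  # False
```

The loop body is exactly Steps 2 and 3: pick the gap for v, descend into the corresponding child, and stop at a match or at a leaf.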
### Height of B-Tree

As stated previously, the **height** of a B-Tree is **O(log(m))** (the base of the log is the branching factor), meaning that as we increase the number of nodes in the B-Tree, the depth of the tree grows only logarithmically. We will derive this with a simple proof by induction.

For the proof, we show that a tree with branching factor n has $n^d$ nodes at depth d. We assume each level is fully populated, so this is an upper bound on the number of nodes.

**Base Case**: root node only, depth d = 0. With only the root there is, by definition, exactly one node, and since $n^0 = 1$, the claim holds for the base case.

**Inductive Step**: given that depth d has $n^d$ nodes, show that depth d+1 has $n^{d+1}$ nodes. The branching factor is n, so every node at depth d has n children. With $n^d$ nodes at depth d, the next level therefore has $n^d \cdot n = n^{d+1}$ nodes, which proves the inductive case.

We have thus shown that a fully populated tree of depth d has $n^d$ nodes at depth d; in particular, a tree of height d has $n^d$ leaves. To get the height of a tree with m leaf nodes, set $n^d = m$ and take the logarithm of both sides with base n:

$\log_n(n^d) = \log_n(m) \rightarrow d \cdot \log_n(n) = \log_n(m) \rightarrow d = \log_n(m)$

This proves that the maximum depth (height) of the tree is $O(\log_n(m))$, exactly as stated.

### Hardware Factors

This section on hardware factors exists purely to motivate one point: **fetching a page of information from disk takes far longer than examining everything that was read**. The practical effect is that for a tree with m nodes, branching factor n, and depth d, i.e. $n^d = m$, we want the *highest* possible branching factor n so that the depth d is as *low* as possible for a *fixed* number of nodes m.

Disk access times are much slower than CPU computing time. To access information on disk we need to do a disk seek (on a hard drive, physically moving the read head and waiting for the platter to rotate), fetch the page, and load it into RAM so the CPU can work on it. This takes *much* longer than operating on data already in memory. To break down the costs (rough figures from Grok, 1/17/2026):

1. Disk Seek (Hard Disk Drive) + Load Page into RAM
   - Seek time: 5-10 ms
   - Rotational latency: 3-5 ms
   - Transfer time (4-16 KB page): 0.05-0.2 ms
   - Total: ~10 ms per access
2. Scan & Search Keys in RAM
   - Node size: 4 KB
   - Keys: 8 bytes each
   - Pointers: 8 bytes each
   - Linear scan:
     - ~200 comparisons
     - Each comparison: 1-3 ns
     - Total: 200-600 ns
   - Binary search:
     - 8-9 comparisons
     - Total: 10-30 ns

As we can see, we want to optimize for disk seeks, especially on a Hard Disk Drive. Even with SSDs (Solid State Drives), we still want to minimize the number of accesses:

3. SSD Seek
   - Access latency: 50-100 $\mu$s
   - Transfer time: negligible relative to latency
   - Total: 0.05-0.15 ms per access
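To see why a high branching factor matters in practice, here is a rough back-of-envelope calculation using the figures above. The inputs (one billion keys, roughly 500 children per page, ~10 ms per HDD access, ~0.1 ms per SSD access) are illustrative assumptions, not measurements.

```python
import math

total_keys = 1_000_000_000    # assumed: one billion keys in the index
branching  = 500              # assumed: ~500 children fit in one page
hdd_ms     = 10               # rough HDD cost per page access, from the list above
ssd_ms     = 0.1              # rough SSD cost per page access, from the list above

# Height d satisfies n^d = m, so d = log_n(m); round up to whole levels.
btree_levels  = math.ceil(math.log(total_keys, branching))   # 4
binary_levels = math.ceil(math.log2(total_keys))              # 30

for name, levels in [("B-Tree (n=500)", btree_levels),
                     ("binary tree (n=2)", binary_levels)]:
    print(f"{name}: {levels} page reads, "
          f"~{levels * hdd_ms} ms on HDD, ~{levels * ssd_ms:.1f} ms on SSD")
# B-Tree (n=500): 4 page reads, ~40 ms on HDD, ~0.4 ms on SSD
# binary tree (n=2): 30 page reads, ~300 ms on HDD, ~3.0 ms on SSD
```

Even though each B-Tree node requires more comparisons, the hundreds of nanoseconds spent scanning keys in RAM are invisible next to the milliseconds spent fetching each page, which is why we trade extra per-node work for fewer levels.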
### How are B-Trees Used

B-Trees are used mostly in SQL databases, [DynamoDB](https://doku88-github-io.onrender.com/blog/dynamodb-for-system-design), and MongoDB. They are primarily used to search a database on indexed columns (SQL) or sort keys (DynamoDB). The data structure shines when **read** speed matters more than **write** speed, because of the benefits described above: it locates data with very few disk reads. For instance, in DynamoDB we first locate the right partition using the partition key, and then, within that partition, use a B-Tree to find the item we want.

However, a weakness of B-Trees is that they are not write-optimized. These problems typically show up in write-heavy systems (Discord, for example) because of the maintenance required to keep the B-Tree balanced. As you can see in the YouTube video on this post, every insertion or deletion of an entry can touch (within the B-Tree):

- the leaf node
- the parent node
- any nodes that must be split

This translates into extra work on every update to the B-Tree. On disk and SSD it also means that poorly placed writes on the storage device have their negative effects amplified. The problems are further exacerbated by concurrent writes and random input/output across replicas, which require mechanisms such as transactions and locks; SQL databases implement these to provide ACID guarantees.
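To make the write-amplification point concrete, here is a deliberately simplified Python sketch of a single insert that triggers a leaf split. The `MAX_KEYS` threshold and the node layout are assumptions for illustration; real B-Tree insertion also splits internal nodes recursively and rebalances on deletion, which this sketch omits.

```python
from bisect import insort

MAX_KEYS = 4   # assumed page capacity; real nodes hold hundreds of keys

def insert_into_leaf(leaf_keys, parent_keys, v):
    """Simplified single-level insert: add v to a leaf and, if the leaf
    overflows, split it and copy the split key up into the parent.
    A full B-Tree insert would also handle the parent overflowing,
    recursively up to the root; that part is omitted here."""
    insort(leaf_keys, v)                   # write #1: the leaf page changes
    if len(leaf_keys) <= MAX_KEYS:
        return leaf_keys, None
    mid = len(leaf_keys) // 2
    left, right = leaf_keys[:mid], leaf_keys[mid:]
    insort(parent_keys, right[0])          # write #2: the parent page changes
    return left, right                     # write #3: a brand-new sibling page

parent = [50]
leaf = [10, 20, 30, 40]
left, right = insert_into_leaf(leaf, parent, 25)
print(left, right, parent)   # [10, 20] [25, 30, 40] [25, 50]
```

One logical insert here dirties three pages (the leaf, its new sibling, and the parent), which is exactly the kind of extra write traffic that hurts write-heavy workloads.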