LSM Trees by doku

### What is an LSM Tree?

An LSM (Log-Structured Merge) Tree is a data structure optimized for fast, high-throughput writes at the expense of read speed. It achieves this by cleverly and intuitively combining RAM and disk. It is a storage engine / data structure used inside many NoSQL (and some other) databases, optimized for primary-key access patterns; other access patterns require you to scan through all the data. These two aspects are why LSM Trees are *query access pattern* driven rather than entity-relationship driven (like SQL).

The key insight behind fast LSM writes is that they turn many *small random writes* into *large sequential appends* on disk. LSM Trees can do this because they hold recent writes in RAM and, when ready, flush those entries to disk as an SSTable (Sorted String Table). Additionally, as we will see, we only ever append to the Write-Ahead Log (WAL, the Commit Log in the image), and appends are a fast way to write data.

Because LSM Trees are an append-only data structure, every create, update, and delete is a new entry in the table. Deletes are recorded as tombstone entries.

### How do LSM Trees Work?

LSM Trees are fast because writes initially go to memory (RAM), into the Memtable. Writing to RAM is much faster than writing to disk; for latency breakdowns of RAM vs. disk/SSD, I suggest you read the [B-Trees Note Blog post](https://doku88-github-io.onrender.com/blog/b-trees). Once writes land in RAM, another process can flush them to disk while the original process keeps writing to RAM. This also makes LSM Trees easier to use in a distributed fashion: you could theoretically have many machines writing to their own RAM and flushing out to disk, which scales well (SSTables can be compacted after the first write to disk). Alas, I have gotten ahead of myself. Let us first go over the LSM Tree write and read algorithms for more context.

**LSM Write Algorithm** (minimal code sketches follow the read algorithm below)

1. Write issued
2. Write appended to the WAL (Write-Ahead Log)
3. Write written to the Memtable
4. Periodically, the Memtable flushes to disk as an immutable SSTable
5. When the Memtable is flushed, the WAL is cleared

Over time, SSTables are compacted: compaction merges files, removing duplicate keys and deleted entries. So in the diagram you may notice SSTable1, SSTable2, etc.; these are in fact different levels of compaction, with 1 being the most recent and 4 the least. As time goes on, the SSTables become more compacted and dense. As we will see, the level of an SSTable affects our LSM Tree read algorithm. Note that the Write-Ahead Log is an append-only log.

**LSM Read Algorithm**

1. Look in the Memtable; if found, return the entry
2. Determine candidate SSTables
   - Skip by [Bloom filter](https://doku88-github-io.onrender.com/blog/bloom-filters)
   - Skip by key-range metadata (the min and max key stored in that SSTable)
3. Search the remaining SSTables from newest to oldest

Note that for each level of SSTables, the nature of the search changes based on the compaction it has been through. In step 3 we search from the newest updates to the oldest because, in an append-only store, the most recent update is the authoritative one (and can change the state set by previous updates, such as a delete). Furthermore, as we can see from the write algorithm, the newest data is also the most likely to be in RAM, or the least compacted on disk, making it easier to find.
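To make the write path concrete, here is a minimal, hypothetical Python sketch of steps 1 through 5 above. The names (`LSMTree`, `flush_threshold`) and the JSON-lines on-disk format are illustrative assumptions, not any particular database's implementation.

```python
import json
import os

# Tombstone marker: in an append-only store, a delete is just another append.
TOMBSTONE = "__tombstone__"

class LSMTree:
    """Minimal write-path sketch: WAL append -> Memtable -> SSTable flush."""

    def __init__(self, data_dir, flush_threshold=1000):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.flush_threshold = flush_threshold   # entries held before a flush
        self.memtable = {}                       # recent writes, in RAM
        self.sstable_count = 0
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")

    def put(self, key, value):
        # Steps 1-2: append to the WAL first so the write survives a crash.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # Step 3: apply the write to the in-memory Memtable.
        self.memtable[key] = value
        # Step 4: flush to disk once the Memtable is large enough.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        # Append-only: a delete is a write of a tombstone entry.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Step 4: write the Memtable to disk as an immutable, *sorted* SSTable.
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_count}.json")
        with open(path, "w") as f:
            for key in sorted(self.memtable):
                f.write(json.dumps([key, self.memtable[key]]) + "\n")
        self.sstable_count += 1
        # Step 5: the data is durable on disk, so clear the Memtable and WAL.
        self.memtable.clear()
        self.wal.truncate(0)
```

A real engine keeps the Memtable in an ordered structure such as a skip list or red-black tree, so the flush is a straight sequential write rather than a `sorted()` call; the plain dict here just keeps the sketch short.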
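Compaction itself can be pictured as a merge of sorted files in which the newer table wins on duplicate keys and tombstoned keys are dropped. A hedged, in-memory sketch, reusing the same illustrative file format and `TOMBSTONE` marker as above:

```python
import json

TOMBSTONE = "__tombstone__"  # same illustrative marker as the write sketch

def compact(older_path, newer_path, out_path):
    """Merge two sorted SSTable files into one: on duplicate keys the
    newer table wins, and tombstoned keys are dropped from the output."""
    def load(path):
        with open(path) as f:
            return dict(json.loads(line) for line in f)

    merged = load(older_path)
    merged.update(load(newer_path))       # newer entries shadow older ones
    with open(out_path, "w") as f:
        for key in sorted(merged):
            if merged[key] != TOMBSTONE:  # physically remove deleted keys
                f.write(json.dumps([key, merged[key]]) + "\n")
```

Because the input files are already sorted, a production engine streams the merge instead of loading both tables into memory; the in-memory version above is only to keep the sketch short.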
This recency bias is great for use cases like chat messages, since we are more interested in recent messages than in old ones.

Note that in step 3 we can also use the SSTable index to jump to a specific block and then search within that block (see the sketch below). The SSTable index is a sparse, in-memory index that maps key boundaries to the offsets of data blocks on disk. This is great, since it narrows down where on disk we need to look. The index is kept sparse to hold memory usage down, and block granularity is usually enough to resolve a lookup in a single disk read. The index is stored on disk as part of the SSTable but is loaded into memory when the SSTable is opened.

Note that SSTables are of configurable size, typically in the range of **64 MB to 256 MB**, while most systems read data into memory in **4 KB pages**.
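To tie the read algorithm together with the sparse SSTable index, here is a hedged Python sketch. The per-table layout (`bloom`, `min_key`, `max_key`, `index`, `path`) is an illustrative assumption, and the `set` standing in for the Bloom filter is exact where a real Bloom filter is probabilistic (it can answer "maybe present", but never a false "absent").

```python
import bisect
import json

TOMBSTONE = "__tombstone__"  # same illustrative marker as the write sketch

def get(key, memtable, sstables):
    """Read-path sketch. `sstables` is a newest-first list of dicts with
    fields: 'bloom' (a set standing in for a Bloom filter), 'min_key' and
    'max_key' range metadata, a sparse 'index' of (first_key_in_block,
    byte_offset) pairs sorted by key, and the SSTable's file 'path'."""
    # Step 1: check the Memtable first; newest data, and no disk I/O.
    if key in memtable:
        value = memtable[key]
        return None if value == TOMBSTONE else value

    # Step 3: search candidate SSTables from newest to oldest.
    for table in sstables:
        # Step 2a: Bloom filter says "definitely not here" -> skip the table.
        if key not in table["bloom"]:
            continue
        # Step 2b: key outside this table's [min, max] range -> skip it.
        if not (table["min_key"] <= key <= table["max_key"]):
            continue
        # Sparse index: binary-search for the one block that could hold the
        # key. The range check above guarantees block >= 0 here.
        first_keys = [k for k, _ in table["index"]]
        block = bisect.bisect_right(first_keys, key) - 1
        start = table["index"][block][1]
        end = (table["index"][block + 1][1]
               if block + 1 < len(table["index"]) else None)
        # Read and scan just that block: ideally a single disk read.
        with open(table["path"], "rb") as f:
            f.seek(start)
            blob = f.read() if end is None else f.read(end - start)
        for line in blob.decode().splitlines():
            k, v = json.loads(line)
            if k == key:
                return None if v == TOMBSTONE else v
    return None  # not in the Memtable or any SSTable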
### Who uses LSM Trees?

Several databases use LSM Trees to quickly ingest data at high throughput, including Cassandra, RocksDB, InfluxDB, and LevelDB. LSM Trees are chosen for their high write throughput, sequential disk writes, and efficient scaling for distributed systems. They are preferred over B-Trees for writes (but not reads!) due to the simplicity and efficiency of their write path. They are often adopted by companies that need to keep track of many writes, including Discord, gaming companies, and social media companies.