# Cassandra Database

*by dokuDoku · system design*

## What is a Cassandra Database?

[Cassandra](https://cassandra.apache.org/_/index.html) was originally developed at Facebook for its inbox search feature and was inspired by Amazon's Dynamo (the precursor to [DynamoDB](https://doku88-github-io.onrender.com/blog/dynamodb-for-system-design)) and Google's Bigtable. It was eventually open sourced as an Apache project.

Cassandra is a Wide Column, NoSQL database that is a great fit for distributed systems that need to prioritize writes over reads. To handle heavy writes, Cassandra is fundamentally built on the [LSM (Log Structured Merge) Tree](https://doku88-github-io.onrender.com/blog/lsm-trees) data structure. It is also a distributed database: data lives on several nodes, and quorum-style voting among replicas determines the most up-to-date value. The number of nodes required to acknowledge reads and writes is configurable, which lets you tune Cassandra between Consistency and Availability.

The fundamental query access pattern for Cassandra is primary key look-ups. Similar to DynamoDB, a primary key is composed of a *partition key* (which determines the node that stores the data) and a *clustering key* that sorts the data within a partition (the sort key in DynamoDB).

## How does a Cassandra Database Work?

At a high level, a Cassandra cluster consists of many nodes, where each node is a different machine. Every node is identical in its roles and responsibilities. You can contact any node in the ring (consistent hashing), and that coordinator node will write/read data to/from a configured number of replica nodes. Which nodes hold the data is determined by a consistent hashing algorithm. Each node maps to a machine, and each machine uses an LSM Tree to store and retrieve writes.

### What is the Cassandra Data Model?

The Cassandra data model is defined by the following pieces. The **keyspace** is the top-level organizational unit (i.e. the database); it is also where replication strategies for data redundancy and availability are defined. A **table** lives inside a keyspace and organizes data into rows; it has a schema that defines its columns and primary key structure. A **row** is a single record in a table (similar to SQL) with a unique primary key. **Columns** are the actual storage units: they are where the bytes actually live on disk, not just labels in a table schema. A column is the atomic unit that holds data, and it has a name, a type, and a value.

Note that **not** all columns need to be populated for every row in a Cassandra table. In fact, it is a core feature of a **Wide Column Database** that the populated columns can vary per row, making it much more flexible than SQL. Every column value also carries a write timestamp, which makes sense given that LSM Trees are used for storage. Since Cassandra is a NoSQL database, relations and joins are not features of this database.

**Primary Keys** are the main data access pattern for Cassandra and uniquely identify entries. A primary key is composed of two components, a **partition key** and a **clustering key**, extremely similar to [DynamoDB](https://doku88-github-io.onrender.com/blog/dynamodb-for-system-design). The partition key denotes which partition the row is in, and the clustering key is zero or more columns that determine the sorted order of rows within a partition.
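To make the data model concrete, here is a minimal sketch using the DataStax `cassandra-driver` Python package against a locally running cluster. The keyspace, table, and column names are hypothetical examples, not anything from the article.

```python
from cassandra.cluster import Cluster

# Connect to any node; that node coordinates our requests.
# (Assumes a Cassandra node is listening on 127.0.0.1:9042.)
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keyspace: the top-level unit, where the replication strategy is declared.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS chat_app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Table: channel_id is the partition key (which node owns the row),
# sent_at is the clustering key (sort order inside the partition).
session.execute("""
    CREATE TABLE IF NOT EXISTS chat_app.messages (
        channel_id uuid,
        sent_at    timeuuid,
        author     text,
        body       text,
        PRIMARY KEY ((channel_id), sent_at)
    ) WITH CLUSTERING ORDER BY (sent_at DESC)
""")
```

Because rows are clustered by `sent_at DESC`, fetching the latest messages for one `channel_id` is a single-partition, already-sorted read, which is exactly the access pattern described above.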
Other components of the data model include **partitioning** and **replication**.

Partitioning is the mechanism that divides (shards) data across the nodes of a distributed cluster. It is the foundation of how Cassandra achieves massive *horizontal scalability, data distribution, and high availability*. Partitioning splits the rows of a table into discrete chunks: all rows with the same partition key belong to the same partition, and a single partition is stored entirely on one node, with replicas on other nodes. Note that partitions are the atomic unit for many core operations, including data distribution, replication, reads, writes, compaction, repair, and indexing.

How does partitioning actually distribute the keys? Through consistent hashing: the partition key is hashed, and the hash maps to a node on a hash ring. Consistent hashing is a great technique when you need to dynamically add or remove nodes, since the number of key reassignments is minimal compared to a standard modulo assignment. Furthermore, virtual nodes can be placed on the ring so that larger servers own proportionally more of it. There are also strategies to assign partitions to additional backup nodes for redundancy. Partition keys should ideally have many well-distributed, distinct values to spread the data evenly across the nodes.

Replication is where we choose which nodes to replicate data to. One simple strategy (SimpleStrategy) starts from the node the key hashed to and walks clockwise along the ring until it has found the specified number of replicas, skipping positions that land on a machine already chosen. Its simplicity makes it useful for smaller deployments and testing. Another strategy, NetworkTopologyStrategy, is recommended for production and is data-center and rack aware: the goal is to establish enough physical separation between replicas that a single incident cannot take out many of them at once. We will not go deeper into it here, but I recommend you check it out.
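Below is a toy sketch of the ring idea: keys are hashed onto a ring and replicas are chosen by walking clockwise, as in the simple strategy above. The hash function, node names, and virtual-node counts are invented for illustration; Cassandra's real partitioner (Murmur3 by default) differs in the details.

```python
import hashlib
from bisect import bisect_right

class ToyRing:
    """Illustrative consistent-hash ring; not Cassandra's actual partitioner."""

    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node owns several virtual positions (tokens) on the ring.
        self.tokens = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, partition_key, replication_factor=3):
        """Walk clockwise from the key's ring position, collecting distinct nodes."""
        start = bisect_right(self.tokens, (self._hash(partition_key),))
        chosen = []
        for i in range(len(self.tokens)):
            _, node = self.tokens[(start + i) % len(self.tokens)]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == replication_factor:
                break
        return chosen

ring = ToyRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.replicas("channel-42"))  # three distinct nodes, clockwise from the key's position
```

Adding or removing a node only moves the tokens that node owned, which is why consistent hashing needs far fewer reassignments than a modulo scheme.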
### Write Model per Node

Before we discuss how multiple nodes work together, we should first answer "How does a single node work?". A **node** is one running instance of the Cassandra process, typically one per physical or virtual machine. Every node is completely identical in its capabilities; there is no notion of hierarchy. In fact, any node can accept client connections, coordinate reads & writes, store data, replicate data, gossip, and handle failures. This is a *strength* of Cassandra, since there is no single point of failure.

For writes, when a node receives a request from a coordinator node (more on that later), the receiving node appends the mutation to an on-disk log (the Write Ahead Log, WAL) and to the in-memory memtable (please reference [LSM-Trees](https://doku88-github-io.onrender.com/blog/lsm-trees)). Following the LSM-Tree pattern, when the memtable reaches a size or time threshold, it is flushed to an SSTable (Sorted String Table) on disk and the corresponding WAL entries are cleared. Subsequently, the node performs compaction on its SSTables, merging them and removing duplicate and obsolete entries. An LSM-Tree is used here because we prioritize writes over reads, so each node can take in a write almost immediately. And because each node performs compaction on its own schedule, nodes remain independent of one another.

Similarly, we can read from a single node using the LSM-Tree pattern: the memtable is checked first, then the SSTables from most to least recent. Please reference [LSM-Trees](https://doku88-github-io.onrender.com/blog/lsm-trees) for more details on how this data structure works.
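As a rough mental model (not Cassandra's actual storage engine), a single node's write and read path can be sketched as follows; the file layout, flush threshold, and JSON format are invented purely for illustration.

```python
import json
import os
import time

class ToyNode:
    """Toy LSM-style storage: commit log + memtable + SSTables. Illustration only."""

    def __init__(self, data_dir, memtable_limit=4):
        self.memtable = {}   # key -> (write_timestamp, value), kept in memory
        self.sstables = []   # paths of flushed SSTables, oldest first
        self.memtable_limit = memtable_limit
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commitlog"), "a")

    def write(self, key, value):
        record = (time.time(), value)
        # 1) Append to the commit log for durability, 2) update the memtable.
        self.commit_log.write(json.dumps([key, record]) + "\n")
        self.commit_log.flush()
        self.memtable[key] = record
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as a sorted, immutable SSTable, then clear it.
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstables.append(path)
        self.memtable.clear()

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key][1]
        for path in reversed(self.sstables):
            with open(path) as f:
                table = json.load(f)
            if key in table:
                return table[key][1]
        return None
```

A real node also compacts its SSTables in the background, merging overlapping files and dropping superseded versions of a key, which keeps the "newest to oldest" read scan short.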
### Multiple Nodes

Now that we know how a single node works, we naturally want to know "How do we use multiple nodes?". Let's start with the query algorithm.

**Query Algorithm**

1. The client connects to any node and sends a query statement.
2. The contacted node (the coordinator) determines where the data lives using the partition key.
3. The coordinator sends requests to the nodes holding the data (fan-out phase).
   - This is where the single-node operations discussed above happen.
4. The coordinator applies the consistency settings.
   - Wait for the specified number of nodes to reply.
   - Compare the responses from those nodes.
5. The result is returned to the client.

The **Write Algorithm** follows the same steps, except that each node performs a write instead of a read; the individual node writes use the LSM Tree pattern described earlier. Note that a **coordinator** node is simply whichever node the client contacted; it manages the read or write and coordinates with the necessary replica nodes. A toy sketch of this coordination appears below.
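Here is a toy sketch of steps 2–5 for a read. The replicas are represented as plain dictionaries mapping keys to `(write_timestamp, value)` pairs, and the reconciliation rule (newest timestamp wins) is a simplification of last-write-wins resolution; none of the names here come from Cassandra itself.

```python
import time

def coordinator_read(replicas, key, required_acks=2):
    """Toy coordinator read: fan out, wait for `required_acks` replies, reconcile by timestamp."""
    responses = []
    for replica in replicas:                 # step 3: fan out to the replica nodes
        responses.append(replica.get(key))   # each reply is (timestamp, value) or None
        if len(responses) >= required_acks:  # step 4: stop once enough nodes have answered
            break
    present = [r for r in responses if r is not None]
    # Step 4 continued: compare responses; the newest write timestamp wins.
    return max(present)[1] if present else None

# Three replicas of one partition: one stale, one current, one that missed the write.
now = time.time()
replicas = [
    {"user:42": (now - 10, "old name")},
    {"user:42": (now, "new name")},
    {},
]
print(coordinator_read(replicas, "user:42", required_acks=2))  # -> "new name"
```

With `required_acks=1` the coordinator would have returned the stale value from the first replica, which previews the consistency/availability trade-off discussed in the CAP section below.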
The **Gossip** protocol is how Cassandra nodes share cluster state with each other without quadratic communication costs: if there are $n$ nodes and every node talked to every other node, the number of connections would be $O(n^2)$. Gossip is a way for a distributed system to spread information quickly and scalably without relying on a central coordinator. Each node periodically talks to a small, random subset of other nodes, and each exchange passes state information along; typically a few seed nodes are used to make sure every node eventually gets updates.

Each node carries a generation number and a version number. The *generation* number is a timestamp of when that node was bootstrapped, and the *version* number is a logical clock value that increments roughly every second. Together they form a versioning system that is used to compare which node has the latest information.

**Hinted Handoffs** cover the case where a coordinator considers a target node offline while attempting to write to it: the coordinator temporarily stores the write data itself. The stored data is called a *hint*, and when the target node comes back online the hint is *handed off* to it. This is a short-term remedy for nodes that fail their heartbeat checks, and it keeps the system stable amidst minor fluctuations. Note that a node that stops responding is said to be *convicted*.

Hinted handoffs are part of fault tolerance within Cassandra. Another part is the Phi Accrual Failure Detector, a probabilistic, adaptive failure detector designed for unreliable networks and variable latency. It outputs a suspicion level from a model that we will not go deeper into here; the important point is that once a certain threshold is reached, the node is marked as down, i.e. convicted. If a node is merely down we can do a *hinted handoff*; if the node is "truly down" (decommissioned), we have to reassign its data using the consistent hashing ring.
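Here is a toy gossip round, under the simplifying assumption that each node's state is just its own (generation, version) heartbeat plus whatever it has heard about its peers; node names, timestamps, and fan-out are invented.

```python
import random

# Each node keeps a view: node_name -> (generation, version).
# A larger (generation, version) tuple is newer information about that node.
views = {
    "node-a": {"node-a": (1700000000, 12)},
    "node-b": {"node-b": (1700000005, 9)},
    "node-c": {"node-c": (1699999990, 40)},
}

def gossip_round(views, fanout=1):
    """Every node bumps its heartbeat and pushes its view to a few random peers."""
    for name, view in views.items():
        gen, ver = view[name]
        view[name] = (gen, ver + 1)          # increment our own heartbeat version
        peers = random.sample([n for n in views if n != name], fanout)
        for peer in peers:
            for node, stamp in view.items():
                # Merge rule: keep whichever (generation, version) is newer.
                if stamp > views[peer].get(node, (0, 0)):
                    views[peer][node] = stamp

# After a few rounds, every node converges on the freshest state for every peer.
for _ in range(5):
    gossip_round(views)
print(views["node-a"])
```

This is the core of why gossip scales: each round costs each node only `fanout` messages, yet information still spreads across the whole cluster in roughly a logarithmic number of rounds.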
### CAP Theorem

As you may know, the CAP Theorem states that you can only have two out of three of Consistency, Availability, and Partition Tolerance. I will not go into it here, but you can read this [resource](https://www.hellointerview.com/learn/system-design/core-concepts/cap-theorem).

The balance between Consistency and Availability is parametrizable in Cassandra. This is done by choosing how many replicas must respond to the coordinator for writes and for reads: the higher each number, the more we favor consistency, while requiring only a small number of acknowledgements for reads and writes keeps the system highly available at the cost of potentially stale reads. One common strategy is *Quorum*, which requires a majority ($\lfloor \frac{n}{2} \rfloor + 1$ of $n$ replicas) to respond. Applied to both reads and writes, quorum guarantees that writes are visible to subsequent reads, since at least one replica is guaranteed to participate in both the write and the read. Note that reads may still involve read repair and reconciliation using timestamps. In general, Cassandra's eventual consistency comes from replication, asynchronous writes, and read repair/hinted handoff.
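In driver code this trade-off is a per-request knob. Here is a minimal sketch with the DataStax `cassandra-driver` package, reusing the hypothetical `chat_app.messages` table from the earlier example:

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("chat_app")
channel_id = uuid.uuid4()

# QUORUM write + QUORUM read: with replication factor 3 this means W = 2 and R = 2,
# so W + R > RF and every read overlaps at least one replica that saw the write.
write = SimpleStatement(
    "INSERT INTO messages (channel_id, sent_at, author, body) VALUES (%s, now(), %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (channel_id, "doku", "hello"))

# Dropping the read to ONE favors availability and latency over consistency:
# only a single replica needs to answer, so the result may be slightly stale.
read = SimpleStatement(
    "SELECT body FROM messages WHERE channel_id = %s LIMIT 10",
    consistency_level=ConsistencyLevel.ONE,
)
rows = session.execute(read, (channel_id,))
```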
## Who uses Cassandra Database?

Cassandra is an open source, distributed NoSQL database trusted for its scalability and high availability. It is great for write-heavy systems and has proven fault tolerance. There are no master nodes, so there is no single point of failure. Cassandra is not a good fit if you need relational data or joins across your tables.

Discord used Cassandra as its primary message store because of its ability to handle a high volume of writes. Netflix uses Cassandra for personalization, viewing history, and more. In general, when lots of different sources are writing data that needs to be persisted quickly, Cassandra is a great choice. And if the data (like a chat) needs to be read from most to least recent, it is a great fit thanks to the LSM Tree architecture.