What Is the CAP Theorem?

Today's distributed and cloud-native applications naturally have more complex architectures than older systems. For developers and system architects, building the framework for these systems can seem like a delicate balancing act, especially when it comes to designing data storage. While the goal is to ensure that data is consistent for all users, always available, and can withstand network disruptions, the CAP theorem states that any distributed data system can, at most, guarantee only two of these three conditions simultaneously.

So, whether you're configuring a Redis Cluster or managing distributed application state across multiple data centers, the CAP theorem says that, among consistency, availability, and partition tolerance, you must pick the two that are most important to you. However, the theorem also provides a framework for understanding how your system will respond under stress, network failures, and high-traffic conditions. In that way, the CAP theorem is more than a set of rules; it provides a framework for making informed architectural trade-offs in your system design.

Consistency, Availability, and Partition Tolerance

The theorem is named after the acronym CAP, which stands for consistency, availability, and partition tolerance. Here's what all three of those mean in the context of a distributed data store:

  • Consistency (C): In a consistent system, every read request receives the most recent write or an error. When multiple nodes exist in a distributed cluster, a piece of data written to one node must be instantly and fully replicated across all other nodes before the write is considered successful. This means that no matter which node a client connects to, they will always see the same data. There is no concept of "stale" data in a strictly consistent system.

  • Availability (A): An available system ensures that every non-failing node returns a (non-error) response to a read or write request in a reasonable amount of time. This response is not guaranteed to contain the most recent write. As long as the node is up and running, it will serve the request, promoting high uptime and an uninterrupted user experience.

  • Partition tolerance (P): A partition occurs when a network failure prevents some nodes in a distributed system from communicating with others. The term "partition tolerance" means the system continues to operate and uphold its consistency or availability guarantees despite some messages being dropped or delayed.

From Conjecture to Proven Theorem

The CAP concept originated as a hypothesis in a 1999 paper co-authored by computer scientist Eric Brewer. At the time, he was making a practical observation about the operational limitations faced by early web applications, search engines, and distributed databases. However, it wouldn't be long before Brewer's observation was proven to be a truism.

By 2002, researchers Seth Gilbert and Nancy Lynch of MIT had published a formal mathematical proof of Brewer's conjecture, elevating it to the status of a proven theorem. Gilbert and Lynch definitively demonstrated that in an asynchronous network model in which messages can be lost or delayed, it is impossible to implement a read/write data object that simultaneously guarantees both absolute availability and atomic consistency.

CP vs. AP Systems

Because network partitions are an unavoidable reality in modern distributed networks — routers fail, cables get severed, cloud zones experience outages, latency spikes occur, etc. — system architects can't realistically choose to forgo partition tolerance. Therefore, the CAP theorem effectively dictates that when a network partition occurs, you must choose between canceling the operation to ensure data consistency and accepting the risk of serving inconsistent, stale data.

This reality of the CAP theorem has influenced how we categorize database and infrastructure technologies. Depending on whether they prioritize consistency or availability, they will be categorized as CP or AP systems. Here's a look at the differences:

  • CP (Consistency and Partition Tolerance): When a network partition occurs, a CP system will proactively shut down non-synchronized nodes (or the entire system) to prevent inconsistent data from being served. It explicitly chooses consistency over availability. This model is commonly used in financial systems, banking ledgers, and inventory management, where a bad read is much more dangerous than an error screen. Examples of CP systems include MongoDB, HBase, and traditional relational databases configured for strict multi-node synchronous replication.

  • AP (Availability and Partition Tolerance): An AP system continues to serve requests across all available nodes, even if those nodes cannot communicate with one another. This results in the system serving potentially stale data to end users, leaning heavily on a concept known as "eventual consistency." Social media feeds, e-commerce product catalogs, and real-time gaming architectures often rely on AP systems, where uptime is king. Popular examples of AP systems include Cassandra, CouchDB, and DynamoDB.

The PACELC Extension

Over the years, other computer professionals have adapted the CAP theorem to various use cases. These offshoots are known as extensions to the theorem. One of the most notable is the PACELC extension, first proposed by computer scientist Daniel Abadi in 2010. Whereas the CAP theorem describes system behavior during a network partition, the PACELC extension addresses how systems behave under normal, non-partitioned conditions.

PACELC stands for: If Partition, trade off between Availability and Consistency. Else (when the system is running normally without partitions), trade off between Latency and Consistency.

The value of PACELC lies in the fuller picture it provides of database architectures. It acknowledges that even when a distributed network is perfectly healthy, software architects must still make a conscious choice between delivering lightning-fast responses to users (Latency) and forcing users to wait until all nodes have fully synchronized on the exact same data (Consistency).

Valkey/Redis, Redisson, and the CAP Theorem

When you apply the CAP theorem and the PACELC extension to Valkey or Redis, it becomes clear that these in-memory data stores are AP-leaning systems.

Because Valkey and Redis emphasize extreme speed and low latency, they default to asynchronous replication. If a client writes data to a master node, the master acknowledges the write immediately without waiting for its replica nodes to synchronize. If a network partition occurs and the master node crashes before replication completes, a replica node might be promoted to master without having received the latest data. Therefore, Valkey and Redis prioritize availability and partition tolerance (AP) while providing eventual consistency.

Concurrency control can be challenging in AP systems. For example, consider an app that uses a Redis lock. In distributed environments, locks ensure that only one client can modify a shared resource at a time.

If a client acquires a distributed lock and experiences an extended network delay, the lock might time out, allowing another client to acquire the lock. To solve this issue, developers can utilize a robust Java distributed lock that enforces strict consistency and correctness (a CP approach to locking) atop an otherwise AP-leaning database. Redisson implements this through its RFencedLock object. Here's how to use it:

RFencedLock lock = redisson.getFencedLock("myLock");

// acquire the lock and obtain the current fencing token
Long token = lock.lockAndGetToken();

// or wait up to 100 seconds to acquire it,
// then auto-unlock after 10 seconds
token = lock.tryLockAndGetToken(100, 10, TimeUnit.SECONDS);
if (token != null) {
    try {
        // pass the token to the protected resource, which
        // rejects any operation carrying an older (lower) token
    } finally {
        lock.unlock();
    }
}

Similar articles