DBMS Musings: NoSQL

Showing posts with label NoSQL. Show all posts

Wednesday, October 28, 2015

Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item

(This post is co-authored by Daniel Abadi and Jose Faleiro and cross-posted on Jose's blog)

NoSQL systems such as MongoDB, Cassandra, HBase, DynamoDB, and Riak have made many things easier for application developers. They generally have extremely flexible data models, that reduce the burden of advance prediction of how an application will change over time. They support a wide variety of data types, allow nesting of data, and dynamic addition of new attributes. Furthermore, on the whole, they are relatively easy to install, with far fewer configuration parameters and dependencies than many traditional database systems.

On the other hand, their lack of support for traditional atomic transactions is a major step backwards in terms of ease-of-use for application developers. An atomic transaction enables a group of writes (to different items in the database) to occur in an all-or-nothing fashion --- either they will all succeed and be reflected in the database state, or none of them will. Moreover, in combination with appropriate concurrency control mechanisms, atomicity guarantees that concurrent and subsequent transactions either observe all of the completed writes of an atomic transaction or none of them. Without atomic transactions, application developers have to write corner-case code to account for cases in which a group of writes (that are supposed to occur together) have only partially succeeded or only partially observed by concurrent processes. This code is error-prone, and requires complex understanding of the semantics of an application.

At first it may seem odd that these NoSQL systems, that are so well-known for their developer-friendly features, should lack such a basic ease-of-use tool as an atomic transaction. One might have thought that this missing feature is a simple matter of maturity --- these systems are relatively new and perhaps they simply haven't yet gotten around to implementing support for atomic transactions. Indeed, Cassandra's "batch update" feature could be viewed as a mini-step in this direction (despite the severe constraints on what types of updates can be placed in a "batch update"). However, as we start to approach a decade since these systems were introduced, it is clear that there is a more fundamental reason for the lack of transactional support in these systems.

Indeed, there is a deeper reason for their lack of transactional support, and it stems from their focus on scalability. Most NoSQL systems are designed to scale horizontally across many different machines, where the data in a database is partitioned across these machines. The writes in a (general) transaction may access data in several different partitions (on several different machines). Such transactions are called "distributed transactions". Guaranteeing atomicity in distributed transactions requires that the machines that participate in the transaction coordinate with each other. Each machine must establish that the transaction can successfully commit on every other machine involved in the transaction. Furthermore, a protocol is used to ensure that no machine involved in the transaction will fail before the writes that it was involved in for that transactions are present in stable storage. This avoids scenarios where one set of nodes commit a transaction's writes, while another set of nodes abort or fail before the transaction is complete (which violates the all-or-nothing guarantee of atomicity).

This coordination process is expensive, both, in terms of resources, and in terms of adding latency to database requests. However, the bigger issue is that other operations are not allowed to read the writes of a transaction until this coordination is complete, since the all-or-nothing nature of transaction execution implies that these writes may need to be rolled-back if the coordination process determines that some of the writes cannot complete and the transaction must be aborted. The delay of concurrent transactions can cause further delay of other transactions that have overlapping read- and write-sets with the delayed transactions, resulting in overall "cloggage" of the system. The distributed coordination that is required for distributed transactions thus has significant drawbacks for overall database system performance --- both in terms of the throughput of transactions per unit time that the system can process, and in terms of the latency of transactions as they get caught up in the cloggage (this cloggage latency often dominates the latency of the transaction coordination protocol itself). Therefore, most NoSQL systems have chosen to disallow general transactions altogether rather than become susceptible to the performance pitfalls that distributed transactions can entail.

MongoDB, Riak, HBase, and Cassandra all provide support for transactions on a single key. This is because all information associated with a single key is stored on a single machine (aside from replicas stored elsewhere). Therefore, transactions on a single key are guaranteed not to involve the types of complicated distributed coordination described above.

Given that distributed transactions necessitate distributed coordination, it would seem that there is a fundamental tradeoff between scalable performance and support for distributed transactions. Indeed, many practitioners assume that this is the case. When they set out to build a scalable system, they immediately assume that they will not be able to support distributed atomic transactions without severe performance degradation.

This is in fact completely false. It is very much possible for a scalable system to support performant distributed atomic transactions.

In a recent paper, we published a new representation of the tradeoffs involved in supporting atomic transactions in scalable systems. In particular, there exists a three-way tradeoff between fairness, isolation, and throughput (FIT). A scalable database system which supports atomic distributed transactions can achieve at most two out of these three properties. Fairness corresponds to the intuitive notion that the execution of any given transaction is not deliberately delayed in order to benefit other transactions. Isolation provides each transaction with the illusion that it has the entire database system to itself. In doing so, isolation guarantees that if any pair of transactions conflict, then one transaction in the pair will always observe the writes of the other. As a consequence, it alleviates application developers from the burden of reasoning about complex interleavings of conflicting transactions' reads and writes. Throughput refers to the ability of the database to process many concurrent transactions per unit time (without hiccups in performance due to clogging).

The FIT tradeoff dictates that there exist three classes of systems that support atomic distributed transactions:

Those that guarantee fairness and isolation, but sacrifice throughput,
Those that guarantee fairness and throughput, but sacrifice isolation, and
Those that guarantee isolation and throughput, but sacrifice fairness.

In other words, not only is it possible to build scalable systems with high throughput distributed transactions, but there actually exist two classes of systems that can do so: those that sacrifice isolation, and those that sacrifice fairness. We discuss each of these two alternatives below.

(Latency is not explicitly mentioned in the tradeoff, but systems that give up throughput also give up latency due to cloggage, and systems that give up fairness yield increased latency for those transactions treated unfairly.)

Give up on isolation

As described above, the root source of the database system cloggage isn't the distributed coordination itself. Rather, it is the fact that other transactions that want to access the data that a particular transaction wrote have to wait until after the distributed coordination is complete before reading or writing the shared data. This waiting occurs due to strong isolation, which guarantees that one transaction in a pair of conflicting must observe the writes of the other. Since a transaction's writes are not guaranteed to commit until after the distributed coordination process is complete, concurrent conflicting transactions cannot make progress for the duration of this coordination.

However, all of this assumes that it is unacceptable for transactions writes to not be immediately observable by concurrent conflicting transactions If this "isolation" requirement is dropped, there is no need for other transactions to wait until the distributed coordination is complete before executing and committing.

While giving up on strong isolation seemingly implies that distributed databases cannot guarantee correctness (because transactions execute against potentially stale database state), it turns out that there exists a class of database constraints that can be guaranteed to hold despite the use of weak isolation among transactions. For more details on the kinds of guarantees that can hold on constraints despite weak isolation, Peter Bailis's work on Read Atomic Multi-Partition (RAMP) transactions provides some great intuition.

Give up on fairness

The underlying motivation for giving up isolation in systems is that distributed coordination extends the duration for which transactions with overlapping data accesses are unable to make progress. Intuitively, distributed coordination and isolation mechanisms overlap in time. This suggests that another way to circumvent the interaction between isolation techniques and distributed coordination is to re-order distributed coordination such that its overlap with any isolation mechanism is minimized. This intuition forms the basis of Isolation-Throughput systems (which give up fairness). In giving up fairness, database systems gain the flexibility to pick the most opportune time to pay the cost of distributed coordination. For instance, it is possible to perform coordination outside of transaction boundaries so that the additional time required to do the coordination does not increase the time that conflicting transactions cannot run. In general, when the system does not need to guarantee fairness, it can deliberately prioritize or delay specific transactions in order to benefit overall throughput.

G-Store is a good example of an Isolation-Throughput system (which gives up fairness). G-Store extends a (non-transactional) distributed key-value store with support for multi-key transactions. G-Store restricts the scope of transactions to an application defined set of keys called a KeyGroup. An application defines KeyGroups dynamically based on the set of keys it anticipates will be accessed together over the course of some period of time. Note that the only restriction on transactions is that the keys involved in the transaction be part of a single KeyGroup. G-Store allows KeyGroups to be created and disbanded when needed, and therefore effectively provides arbitrary transactions over any set of keys.

When an application defines a KeyGroup, G-Store moves the constituent keys from their nodes to a single leader node. The leader node copies the corresponding key-value pairs, and all transactions on the KeyGroup are executed on the leader. Since all the key-value pairs involved in a transaction are stored on a single node (the leader node), G-Store transactions do not need to execute a distributed commit protocol during transaction execution.

G-Store pays the cost of distributed coordination prior to executing transactions. In order to create a KeyGroup, G-Store executes an expensive distributed protocol to allow a leader node to take ownership of a KeyGroup, and then move the KeyGroup's constituent keys to the leader node. The KeyGroup creation protocol involves expensive distributed coordination, the cost of which is amortized across the transactions which execute on the KeyGroup.

The key point is that while G-Store still must perform distributed coordination, this coordination is done prior to transaction execution --- before the need to be concerned with isolation from other transactions. Once the distributed coordination is complete (all the relevant data has been moved to a single master node), the transaction completes quickly on a single node without forcing concurrent transactions with overlapping data accesses to wait for distributed coordination. Hence, G-Store achieves both high throughput and strong isolation.

However, the requirement that transactions restrict their scope to a single KeyGroup favors transactions that execute on keys which have already been grouped. This is "unfair" to transactions that need to execute on a set of as yet ungrouped keys. Before such transactions can begin executing, G-Store must first disband existing KeyGroups to which some keys may belong, and then create the appropriate KeyGroup --- a process with much higher latency than if the desired KeyGroup already existed.

Conclusions

The fundamental reason for the poor performance of conventional distributed transactions is the fact that the mechanisms for guaranteeing atomicity (distributed coordination), and isolation overlap in time. The key to enabling high throughput distributed transactions is to separate these two concerns. This insight leads to two ways of separating atomicity and isolation mechanisms. The first option is to weaken isolation such that conflicting transactions can execute and commit in parallel. The second option is to re-order atomicity and isolation mechanisms so that they do not overlap in time, and in doing so, give up fairness during transaction execution.

(Edit: MongoDB and HBase both have (or will soon have) limited support for multi-key transactions as long as those keys are within the same partition. However, hopefully it is clear to the reader that this post is discussing the difficulties of implementing distributed --- cross-partition --- transactions).

Monday, October 29, 2012

IEEE Computer issue on the CAP Theorem

Due to Hurricane Sandy, Yale gave me a day off from teaching today and I have finally been able to get to a few things on my "to-do" list. One of them is to write a blog post about the IEEE Computer CAP Retrospective edition and make my paper that appeared inside of it publicly available.

Earlier this year, the IEEE Computer magazine came out with an issue largely devoted to a 12-year retrospective of the CAP theorem and contains several articles from distributed systems researchers that contribute various opinions and thoughts about CAP. The first article is from Eric Brewer, who coined the CAP theorem 12 years ago (though he points out in his article that it was actually 14 years ago). A PDF of Brewer’s article is available for free from: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed. The second article is from Seth Gilbert and Nancy Lynch (the same Gilbert and Lynch that proved the CAP theorem 10 years ago).

The third article is from me, and contains my criticisms of CAP that long-time readers of my blog will be familiar with. In particular, I point out that many people assume that modern NoSQL systems relax consistency guarantees in order to gain availability due to the constraints of the CAP theorem, when the reality is that these systems give up on consistency even in the absence of network partitions, which is not required according to the CAP theorem. The reason why they give up on consistency is because of a desire to improve system latency, an increasingly important requirement in the modern impatient world. I then describe the latency-consistency tradeoff in more detail, and end the article with the PACELC reformulation of CAP that debuted on my blog over two years ago. With the permission of the IEEE, I am making a free version of this article available today. This article is the first time that the PACELC formulation and my thoughts on CAP appear in a scholarly article, which gives people a venue to refer to (bibtex code available here) when citing this work (you can stop citing a blog post!)

The fourth article is from Raghu Ramakrishnan, entitled “CAP and Cloud Data Management” and describes the PNUTS system that I have mentioned in the past as a good example of a system for which the consistency-latency tradeoff has had a more direct impact on the system design than the consistency-availability tradeoff of CAP. The fifth article is from Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell of Cornell University on overcoming CAP with soft-state replication. Unfortunately, I cannot find a free link to Raghu’s article, but if you have an IEEE account, you can access it at at: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6122007&tag=1. The Birman et. al. article can be found for free at: http://www.cs.cornell.edu/Projects/mrc/CAP.pdf.

If you have enjoyed my thoughts on CAP on this blog, I highly recommend you read each of these five articles.

The Brewer article in particular acknowledges my past criticism of CAP not actually being about picking two of three out of C (consistency), A (availability), and P (partition tolerance) due to the fact that it does not make sense to reason about a system that is ‘CA’. (If there is no partition, any system can be both consistent and available --- the only question is what happens when there is a partition --- does consistency or availability get sacrificed?) Brewer uses this observation to lead into a nice generalization of consistency-availability tradeoff. In particular, when a partition occurs, the system does three things: (1) detect that the partition occurred, (2) enter a partition mode that may or may not limit some operations, and (3) initiate some sort of reconciliation algorithm when the partition is fixed. Depending on how these three things are implemented, it is possible to obtain much of the spectrum between CP systems and AP systems. The article also contains a nice reference to the CRDT work by Shapiro et. al. at INRIA. Overall, I strongly support Brewer’s approach to navigating this tradeoff. It also fits nicely with Mehul Shah’s talk at HPTS in the way that the spectrum between consistency and availability is explicitly considered at system design time, rather than trying to bolt consistency on top of an AP (eventually consistent) system after the fact (a wildly suboptimal endeavor).

While most of Brewer’s article focused on the consistency-availability tradeoff, Brewer also briefly acknowledges that “in its classic interpretation, the CAP theorem ignores latency”, and that some systems reduce consistency for latency (he even refers to the PNUTS example I used in my original blog post). I remain convinced that PACELC is the best way to reason about both of these tradeoffs in a single formulation: if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?

Wednesday, May 16, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website with the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because that they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describes research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically we argue that a major class of new scalable technologies fall under the category of “NoSQL” which achieves scalability by dropping ACID guarantees, thereby allowing them to eschew two phase locking, two phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition since ACID guarantees are one of TPC-C’s requirements. That’s why you don’t see NoSQL databases in the TPC-C top 10---they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called “NewSQL” databases that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single “partition” or “shard” of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards to be fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), then these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called “distributed transactions” or “multi-partition transactions”) are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a “remote” warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That’s why the NewSQL databases do not enter TPC-C results --- even just 10% of multi-shard transactions causes their performance to degrade rapidly.

(3) Trade cost for scalability. If you use high end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don’t have shared-nothing horizontally scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human scale applications. The only downside is cost---the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that’s why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it’s hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).

Calvin, our prototype system designed and built by a large team of researchers at Yale that include Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly improves the scalability over our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin’s strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.

As a result, each transaction’s user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction’s logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.

Strongly consistent global replication

Calvin’s deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin’s transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin can run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin’s end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we’ve hooked Calvin up to Google’s LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB’s efficient sorted-string storage.

From an application developer’s point of view, Calvin’s primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.

If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).

Tuesday, October 4, 2011

Overview of the Oracle NoSQL Database

Oracle is the clear market leader in the commercial database community, and therefore it is critical for any member of the database community to pay close attention to the new product announcements coming out of Oracle’s annual Open World conference. The sheer size of Oracle’s sales force, entrenched customer base, and third-party ecosystem instantly gives any new Oracle product the potential for very high impact. Oracle’s new products require significant attention simply because they’re made by Oracle.

I was particularly eager for this year’s Oracle Open World conference, because there were rumors of two separate new Oracle products involving Hadoop and NoSQL --- two of the central research focuses of my database group at Yale --- one of them (Hadoop) also being the focus of my recent startup (Hadapt). Oracle’s Hadoop announcements, while very interesting from a business perspective (everyone is talking about how this “validates” Hadoop), are not so interesting from a technical perspective (the announcements seem to revolve around (1) creating a “connector” between Hadoop and Oracle, where Hadoop is used for ETL tasks, and the output of these tasks are then loaded over this connector to the Oracle DBMS and (2) packaging the whole thing into an appliance, which again is very important from a business perspective since there is certainly a market for anything that makes Hadoop easier to use, but does not seem to be introducing any technically interesting new contributions).

In contrast, the Oracle NoSQL database is actually a brand new system built by the Oracle BerkeleyDB team, and is therefore very interesting from a technical perspective. I therefore spent way too much time trying to find out as much as I could about this new system from a variety of sources. There is not yet a lot of publicly available information about the system; however there is a useful whitepaper written by the illustrious Harvard professor Margo Seltzer, who has been working with Oracle since they acquired her start-up in 2006 (the aforementioned BerkeleyDB).

Due to the dearth of available information on the system, I thought that it would be helpful to the readers of my blog if I provided an overview of what I’ve learned about it so far. Some of the facts I state below have been directly made by Oracle; other facts are inferences that I’ve made, based on my understanding of the system architecture and implementation. As always, if I have made any mistakes in my inferences, please let me know, and I will fix them as soon as possible.

The coolest thing about the Oracle NoSQL database is that it is not a simple copy of a currently existing NoSQL system. It is not Dynamo or SimpleDB. It is not Bigtable or HBase. It is not Cassandra or Riak. It is not MongoDB or CouchDB. It is a new system that has a chosen a different point (actually --- several different points) in the system-design tradeoff space than any of the above mentioned systems. Since it makes a different set of tradeoffs, it is entirely inappropriate to call it “better” or “worse” than any of these systems. There will be situations where the Oracle solution will be more appropriate, and there will be situations where other systems will be more appropriate.

Overview of the system:
Oracle NoSQL database is a distributed, replicated key-value store. Given a cluster of machines (in a shared-nothing architecture, with each machine having its own storage, CPU, and memory), each key-value pair is placed on several of these machines depending on the result of a hash function on the key. In particular, the key-value pair will be placed on a single master node, and a configurable number of replica nodes. All write and update operations for a key-value pair go to the master node for that pair first, and then all replica nodes afterwards. This replication is typically done asynchronously, but it is possible to request that it be done synchronously if one is willing to tolerate the higher latency costs. Read operations can go to any node if the user doesn’t mind incomplete consistency guarantees (i.e. reads might not see the most recent data), but they must be served from the master node if the user requires the most recent value for a data item (unless replication is done synchronously). There is no SQL interface (it is a NoSQL system after all!) --- rather it supports simple insert, update, and delete operations of key-value pairs.

The following is where the Oracle NoSQL Database falls in various key dimensions:

CAP
Like many NoSQL databases, the Oracle NoSQL Database is configurable to be either C/P or A/P in CAP. In particular, if writes are configured to be performed synchronously to all replicas, it is C/P in CAP --- a partition or node failure causes the system to be unavailable for writes. If replication is performed asynchronously, and reads are configured to be served from any replica, it is A/P in CAP --- the system is always available, but there is no guarantee of consistency. [Edit: Actually this configuration is really just P of CAP --- minority partitions become unavailable for writes (see comments about eventual consistency below). This violates the technical definition of "availability" in CAP. However, it is obviously the case that the system still has more availability in this case than the synchronous write configuration.]

Eventual consistency
Unlike Dynamo, SimpleDB, Cassandra, or Riak, the Oracle NoSQL Database does not support eventual consistency. I found this to be extremely amusing, since Oracle’s marketing material associates NoSQL with the BASE acronym. But the E in BASE stands for eventual consistency! So by Oracle’s own definition, their lack of support of eventual consistency means that their NoSQL Database is not actually a NoSQL Database! (In my opinion, their database is really NoSQL --- they just need to fix their marketing literature that associates NoSQL with BASE). My proof for why the Oracle NoSQL Database does not support eventual consistency is the following: Let’s say the master node for a particular key-value pair fails, or a network partition separates the master node from its replica nodes. The key-value pair becomes unavailable for writes for a short time until the system elects a new master node from the replicas. Writes can then continue at the new master node. However, any writes that had been submitted to the old master node, but had not yet been sent to the replicas before the master node failure (or partition) are lost. In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired. Of course, if replication had been configured to be done synchronously (at a cost of latency), there will not be data loss during network partitions or node failures. Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to tradeoff consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades of durability for latency and availability. To be clear, this difference is only for inserts and updates --- the Oracle NoSQL database is able to trade-off consistency for latency on read requests --- it supports similar types of timeline consistency tradeoffs as the Yahoo PNUTs/Sherpa system.

[Two of the members of the Oracle NoSQL Database team have commented below. There is a little bit of a debate about my statement that the Oracle NoSQL Database lacks eventual consistency, but I stand by the text I wrote above. For more, see the comments.]

Joins
Like most NoSQL systems, the Oracle NoSQL database does not support joins. It only supports simple read, write, update, and delete operations on key-value pairs.

Data Model
The Oracle NoSQL database actually has a more subtle data model than simple key-value pairs. In particular, the key is broken down into a “major key path” and “minor key path” where all keys with the same “major key path” are guaranteed to be stored on the same physical node. I expect that the way minor keys will be used in the Oracle NoSQL database will map directly to the way column families are used in Bigtable, HBase and Cassandra. Rather than trying to gather together every possible attribute about a key in a giant “value” for the single key-value pair, you can separate them into separate key-value pairs where the “major key path” is the same for all the keys in the set of key-value pairs, but the “minor key path” will be different. This is similar to how column families for the same key in Bigtable, HBase, and Cassandra can also be stored separately. Personally, I find the major and minor key path model to be more elegant than the column family model (I have ranted against column-families in the past).

ACID compliance
Like most NoSQL systems, the Oracle NoSQL database is not ACID compliant. Besides the durability and consistency tradeoffs mentioned above, the Oracle NoSQL database also does not support arbitrary atomic transactions (the A in ACID). However, it does support atomic operations on the same key, and even allows atomic transactions on sets of keys that share the same major key path (since keys that share the same major key path are guaranteed to be stored on the same node, atomic operations can be performed without having to worry about distributed commit protocols across multiple machines).

Summary
The sweet spot for the Oracle NoSQL database seems to be in single-rack deployments (e.g. the Oracle Big Data appliance) with a low-latency network, so that the system can be set up to use synchronous replication while keeping latency costs of this type of replication small (and the probability of network partitions are small). Another sweet spot is for wider area deployments, but the application is able to work around reduced durability guarantees. It therefore seems to present the largest amount of competition for NoSQL databases like MongoDB which have similar sweet spots. However, the Oracle NoSQL database will need to add additional “developer-friendly” features if it wants to compete head-to-head with MongoDB. Either way, there are clearly situations where the Oracle NoSQL database will be a great fit, and I love that Oracle (in particular, the Oracle BerkeleyDB team) built this system from scratch as an interesting and technically distinct alternative to currently available NoSQL systems. I hope Oracle continues to invest in the system and the team behind it.

Friday, April 23, 2010

Problems with CAP, and Yahoo’s little known NoSQL system

Over the past few weeks, in my advanced database system implementation class I teach at Yale, I’ve been covering the CAP theorem, its implications, and various scalable NoSQL systems that would appear to be influenced in their design by the constraints of CAP. Over the course of my coverage of this topic, I am convinced that CAP falls far short of giving a complete picture of the engineering tradeoffs behind building scalable, distributed systems.

My problems with CAP

CAP is generally described as the following: when you build a distributed system, of three desirable properties you want in your system: consistency, availability, and tolerance of network partitions, you can only choose two.

Already there is a problem, since this implies that there are three types of distributed systems one can build: CA (consistent and available, but not tolerant of partitions), CP (consistent and tolerant of network partitions, but not available), and AP (available and tolerant of network partitions, but not consistent). The definition of CP looks a little strange --- “consistent and tolerant of network partitions, but not available” --- the way that this is written makes it look like such as system is never available --- a clearly useless system. Of course, this is not really the case; rather, availability is only sacrificed when there is a network partition. In practice, this means that the roles of the A and C in CAP are asymmetric. Systems that sacrifice consistency (AP systems) tend to do so all the time, not just when there is a network partition (the reason for this will become clear by the end of this post). The potential confusion caused by the asymmetry of A and C is my first problem.

My second problem is that, as far as I can tell, there is no practical difference between CA systems and CP systems. As noted above, CP systems give up availability only when there is a network partition. CA systems are “not tolerant of network partitions”. But what if there is a network partition? What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical. So in reality, there are only two types of systems: CP/CA and AP. I.e., if there is a partition, does the system give up availability or consistency? Having three letters in CAP and saying you can pick any two does nothing but confuse this point.

But my main problem with CAP is that it focuses everyone on a consistency/availability tradeoff, resulting in a perception that the reason why NoSQL systems give up consistency is to get availability. But this is far from the case. A good example of this is Yahoo’s little known NoSQL system called PNUTS (in the academic community) or Sherpa (to everyone else).

(Note, readers from the academic community might wonder why I’m calling PNUTS “little known”. It turns out, however, that outside the academic community, PNUTS/Sherpa is almost never mentioned in the NoSQL discussion --- in fact, as of April 2010, it’s not even categorized in the list of 35+ NoSQL systems at the nosql-database.org Website).

PNUTS and CAP

If you examine PNUTS through the lens of CAP, it would seem that the designers have no idea what they are doing (I assure you this is not the case). Rather than giving up just one of consistency or availability, the system gives up both! It relaxes consistency by only guaranteeing “timeline consistency” where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas. However, they also give up availability --- if the master replica for a particular data item is unreachable, that item becomes unavailable for updates (note, there are other configurations of the system with availability guarantees similar to Dynamo/Cassandra, I’m focusing in this post on the default system described in the original PNUTS paper). Why would anyone want to give up both consistency and availability? CAP says you only have to give up just one!

The reason is that CAP is missing a very important letter: L. PNUTS gives up consistency not for the goal of improving availability. Instead, it is to lower latency. Keeping replicas consistent over a wide area network requires at least one message to be sent over the WAN in the critical path to perform the write (some think that 2PC is necessary, but my student Alex Thomson has some research showing that this is not the case --- more on this in a future post). Unfortunately, a message over a WAN significantly increases the latency of a transaction (on the order of hundreds of milliseconds), a cost too large for many Web applications that businesses like Amazon and Yahoo need to implement. Consequently, in order to reduce latency, replication must be performed asynchronously. This reduces consistency (by definition). In Yahoo’s case, their method of reducing consistency (timeline consistency) enables an application developer to rely on some guarantees when reasoning about how this consistency is reduced. But consistency is nonetheless reduced.

Conclusion: Replace CAP with PACELC

In thinking about CAP the past few weeks, I feel that it has become overrated as a tool for explaining the design of modern scalable, distributed systems. Not only is the asymmetry of the contributions of C, A, and P confusing, but the lack of latency considerations in CAP significantly reduces its utility.

To me, CAP should really be PACELC --- if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?

Systems that tend to give up consistency for availability when there is a partition also tend to give up consistency for latency when there is no partition. This is the source of the asymmetry of the C and A in CAP. However, this confusion is not present in PACELC.

For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC --- upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler --- once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.

Fully ACID systems are PC/EC in PACELC. They refuse to give up consistency, and will pay the availability and latency costs to achieve it.

However, there are some interesting counterexamples where the C’s of PACELC are not correlated. One such example is PNUTS, which is PC/EL in PACELC. In normal operation they give up consistency for latency; however, upon a partition they don’t give up any additional consistency (rather they give up availability).

In conclusion, rewriting CAP as PACELC removes some confusing asymmetry in CAP, and, in my opinion, comes closer to explaining the design of NoSQL systems.

(A quick plug to conclude this post: the PNUTS guys are presenting a new benchmark for cloud data serving which compares PNUTS vs. other NoSQL systems at the first annual ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010) in Indianapolis on June 10th and 11th. SOCC 2010 is held in conjunction with SIGMOD 2010 and the recently released program looks amazing.)

DBMS Musings