September 10, 2019
Gregory Chockler
Transaction commit protocols play a pivotal role in supporting scalability and availability guarantees of today’s large-scale transactional databases. Their theoretical properties are traditionally captured and analysed through Atomic Commitment Problem (ACP), introduced by Gray in the early 70s. Roughly, ACP is concerned with ensuring all-or-nothing (atomicity) semantics for transactions (the ‘A’ in the famous ACID acronym). It is formulated as a one-shot agreement problem in which a single COMMIT or ABORT decision must be output for a given transaction depending on the COMMIT or ABORT votes provided by a collection of fail-prone sites holding the data objects involved in the transaction. We argue that ACP is too simplistic to adequately capture complexities of transaction commit in modern transactional data stores. In particular, its one-shot nature ignores the fact that a decision to commit or abort an individual transaction is not taken in isolation, but rather influenced by how conflicts with other simultaneously executing transactions are handled by the concurrency control mechanism, which ensures the isolation of transactions - ‘I’ in ACID. The lack of a formal framework capturing such intricate interdependencies between mechanisms for atomicity and isolation in real-world databases makes it difficult to understand them and prove correct. We address this by proposing a new problem, called Transaction Certification Service (TCS), that formalises how the classical Two-Phase Commit (2PC) protocol interacts with concurrency control in many practical transaction processing systems in a way parametric in the isolation level provided. In contrast to ACP, TCS is a multi-shot problem in the sense that the outcome of a transaction must be justified by the entire history of the past transactions rather than a single set of input votes. We then use TCS to identify core algorithmic insights underlying transaction commit in several widely used transactional data stores (such as Google Spanner), and leverages these to develop a new fault-tolerant multi-shot transaction commit protocol. This protocol is theoretically optimal in terms of its time complexity, and can serve as a reference implementation for future systems. Joint work with Alexey Gotsman, IMDEA Software Institute, Spain.