
February 14, 2020

Review of “Database Internals”

Filed under: Design,Technology — admin @ 8:06 pm

Database Internals is an excellent resource for a deep dive into storage engines and distributed systems. The first chapter introduces OLTP, OLAP and HTAP databases. It reviews database architecture and components including transport, query processor, storage engine, transaction manager, lock manager, access methods, buffer manager and recovery manager. The storage engine may use an in-memory store or a disk store, and some in-memory databases use disk for backup, which is updated asynchronously. The chapter reviews row-oriented and column-oriented databases along with data files and index files.

The second chapter covers B-Trees, which are often used by disk-based storage engines. The chapter introduces binary search trees (BST) and balanced trees. However, BSTs are not optimized for disk storage: keeping them balanced requires adding elements in random order or frequent rebalancing, parent and child nodes can end up in different regions of memory, and their low fanout means a lookup still takes O(log2 N) node visits, each of which may be a disk seek. The chapter reviews the architecture of hard disk drives and SSDs, and then B-Tree data structures, where each node can hold up to N keys and N + 1 pointers to child nodes. The nodes are grouped into a root node, leaf nodes and internal nodes, and each node occupies a fixed-size page. Keys in B-Tree nodes are called index entries, separator keys or divider cells, and they split the tree into subtrees holding key ranges. B-Trees have a high fanout K, so there are K times more nodes on each new level, and during a lookup at most log_K(M) pages (where M is the total number of items) are fetched to find a searched key. To insert a value, the algorithm finds the target leaf and appends the key/value to it; the node may need to be split if there isn't enough room. Similarly, a deletion finds the target leaf and removes the key/value; the deletion may result in node merges if the occupancy of neighboring nodes falls too low.
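To make the lookup path concrete, here is a minimal Python sketch (my own illustration, not code from the book) of descending a B-Tree by comparing the search key against separator keys; the Node layout and field names are hypothetical:

import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted separator keys (internal) or record keys (leaf)
        self.children = children    # internal nodes: len(children) == len(keys) + 1
        self.values = values        # leaf nodes only

def search(node, key):
    # Descend from the root, picking the child whose key range covers `key`.
    while node.children is not None:
        node = node.children[bisect.bisect_right(node.keys, key)]
    # In the leaf, binary-search for the exact key.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None                     # key not present

Each iteration of the loop corresponds to fetching one page, which is where the log_K(M) bound comes from.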

Chapter three covers on-disk file formats for B-Trees. It reviews binary encoding of primitive types and strings, and general principles of file format such as header, page data and trailer. Pages can be fixed or variable size, but variable-size pages may incur more overhead: they must reclaim space when records are removed and reference records in the page without regard to their exact locations. Variable-size pages generally use a slotted page structure consisting of a header, a list of pointers (slots) and a list of variable-size cells, where each cell stores flags, key/data sizes, a page ID and byte data. Removing an item may just mark the cell as deleted so the space can be reclaimed later. Insertions may use a first-fit or best-fit strategy to find free blocks. Headers may also store a version and checksum for data validation.
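As a rough illustration of a slotted page (field sizes, ordering and the magic value are my own assumptions, not the book's format): a small header at the front, a slot directory growing forward, and cells growing backward from the end of the page.

import struct

PAGE_SIZE = 4096
HEADER = struct.Struct("<4sHH")        # magic, cell count, free-space offset
SLOT = struct.Struct("<HH")            # (cell offset, cell length)

def build_page(records):
    page = bytearray(PAGE_SIZE)
    free_end = PAGE_SIZE
    slots = []
    for key, value in records:
        cell = struct.pack("<H", len(key)) + key + value   # length-prefixed key, then value
        free_end -= len(cell)
        page[free_end:free_end + len(cell)] = cell         # cells grow from the back
        slots.append((free_end, len(cell)))
    HEADER.pack_into(page, 0, b"BTPG", len(slots), free_end)
    for i, (off, length) in enumerate(slots):              # slot directory after the header
        SLOT.pack_into(page, HEADER.size + i * SLOT.size, off, length)
    return bytes(page)

page = build_page([(b"apple", b"1"), (b"pear", b"2")])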

Chapter four shows how to implement B-Trees; for example, a page header may store flags, the number of cells, a magic number, etc. Some implementations store sibling pointers (forward/backward) to locate neighboring nodes, but this adds complexity to splits/merges. B-Trees also store one more child-page pointer than the number of keys:

            +----------+----------+----------+
            |    K1    |    K2    |    K3    |   (separator keys)
            +----------+----------+----------+
           /           |          |           \
     Ks < K1    K1 <= Ks < K2   K2 <= Ks < K3   Ks >= K3

Alternatively, an implementation can store the rightmost pointer in the cell along with a high key. Each node in a B-Tree is designed to hold a specific number of items, and resizing may require copying data; to avoid copying, implementations can use an extension/overflow page and link it from the original page. B-Trees keep keys in sorted order so that binary search can be used, and the insertion point is the index of the first element that is greater than the given key. Some implementations store parent pointers in nodes or use breadcrumbs to remember the path to a leaf node in case they need to split/merge. B-Tree implementations may postpone splits/merges, create a new rightmost node, or use other algorithms to improve rebalancing. B-Trees may also apply compression at various granularity levels and perform maintenance to defragment data or garbage collect non-addressable data (vacuum).
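A small sketch of the insertion point and an overflow-triggered split in a leaf (the branching factor and split policy here are arbitrary choices for illustration):

import bisect

MAX_KEYS = 4   # toy branching factor, chosen arbitrarily for the example

def leaf_insert(keys, values, key, value):
    """Insert into a sorted leaf; return two halves if the node overflows and must split."""
    i = bisect.bisect_right(keys, key)          # index of the first element greater than key
    if i > 0 and keys[i - 1] == key:
        values[i - 1] = value                   # key already present: overwrite the value
    else:
        keys.insert(i, key)
        values.insert(i, value)
    if len(keys) <= MAX_KEYS:
        return None                             # no split needed
    mid = len(keys) // 2                        # keys[mid] becomes the separator in the parent
    return (keys[:mid], values[:mid]), (keys[mid:], values[mid:])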

Chapter five reviews transaction processing, introducing ACID and page caching so that modifications can be done in memory. Pages are brought in if they are not in memory and evicted/flushed to disk when there isn't enough memory (the O_DIRECT flag can be used to bypass the kernel page cache). After a page is modified, it is marked as dirty so that it can be flushed for durability. These modifications are coordinated with the write-ahead log (WAL), so that data can be recovered if the server crashes; the process of synchronizing the cached pages with durable storage is referred to as a checkpoint. Since splits/merges may require multiple writes, a B-Tree can lock pages that have a high probability of being used, called pinning, and pinned pages are kept in memory. I/O operations can be buffered to reduce disk I/O. Based on available memory, the B-Tree may need to evict old pages when new pages cannot fit, and there are a variety of eviction (page replacement) policies such as FIFO, LRU, CLOCK (references in a circular buffer), LFU, etc. B-Trees use the write-ahead log (WAL) to buffer changes to page contents. Changes to the WAL are flushed with fsync, but due to certain error-handling behaviors fsync may not report errors if they were already cleared, which can result in data loss. B-Tree implementations may use steal/no-steal and force/no-force policies to determine when changes are flushed to disk, and these impact undo/redo behavior: a steal policy allows flushing a page without committing the transaction, and a force policy requires all pages modified by a transaction to be flushed before the transaction commits. The chapter explains the ARIES algorithm, a steal/no-force recovery algorithm that uses physical redo to improve performance and logical undo to improve concurrency, and uses WAL records to implement repeating history; ARIES uses LSNs (log sequence numbers) to identify log records and tracks pages in a dirty page table.

The chapter then reviews concurrency control schemes: optimistic concurrency control, multi-version concurrency control, and pessimistic concurrency control (with and without locks). It reviews transaction isolation and read/write anomalies such as dirty reads (reading uncommitted updates), non-repeatable reads (querying the same data again yields different results), phantom reads (range queries return different rows), lost updates (last writer wins), dirty writes (writes based on uncommitted values) and write skew (e.g. double spending). The isolation levels include read uncommitted, which allows dirty, phantom and non-repeatable reads; read committed, which prevents dirty reads; repeatable read, which prevents non-repeatable reads but allows phantom reads; and serializable, which executes transactions as if serially and prevents phantom reads. Serializable isolation is difficult to implement, and some databases use snapshot isolation instead, where a transaction observes all transactions committed before its start time. Snapshot isolation prevents lost updates but is still susceptible to write skew. Optimistic concurrency control validates a transaction before writing and works well when retries are rare, but it still needs to manage a critical section. Multi-version concurrency control uses monotonically incremented transaction IDs or timestamps and prevents access to uncommitted values. Pessimistic concurrency control can use locks, or simple timestamps that are checked to ensure that no other transaction with a higher timestamp has committed.
The database maintains max_read_timestamp and max_write_timestamp: read operations with a timestamp older than max_write_timestamp are aborted, write operations with a timestamp lower than max_read_timestamp conflict and are aborted, but write operations older than max_write_timestamp are simply ignored rather than aborting the transaction (the Thomas Write Rule). Lock-based concurrency control uses protocols such as two-phase locking, with a growing phase in which all locks are acquired and a shrinking phase in which locks are released after the transaction. Locks can lead to deadlocks, so timeouts are needed to abort long-running transactions. The chapter describes the distinction between locks and latches: locks are used to isolate and schedule overlapping transactions, while latches guard the physical B-Tree contents (leaf and non-leaf pages). Latches can be implemented with reader-writer locks (busy-wait/CAS), and latch crabbing minimizes how long latches are held.
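A toy sketch of these timestamp checks, including the Thomas Write Rule (the record structure and names are mine, not from the book):

class Abort(Exception):
    pass

class Record:
    def __init__(self, value, ts=0):
        self.value = value
        self.max_read_ts = ts       # highest timestamp that has read this record
        self.max_write_ts = ts      # highest timestamp that has written this record

def read(record, tx_ts):
    if tx_ts < record.max_write_ts:
        raise Abort("a transaction with a newer timestamp already wrote this record")
    record.max_read_ts = max(record.max_read_ts, tx_ts)
    return record.value

def write(record, tx_ts, value):
    if tx_ts < record.max_read_ts:
        raise Abort("a transaction with a newer timestamp already read this record")
    if tx_ts < record.max_write_ts:
        return                      # Thomas Write Rule: silently drop the outdated write
    record.value = value
    record.max_write_ts = tx_ts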

Chapter six goes over different B-Tree designs and implementations. For example, some B-Trees use copy-on-write: instead of synchronization and latches, updated contents are copied into a new shadow tree, and the pointer to the topmost page is atomically updated afterwards (as in LMDB). To update a page on disk, the in-memory representation is updated first, using a cached version, native pointers (in unmanaged languages), language-specific structures, or wrapper objects that update the disk representation as soon as the B-Tree is updated. Lazy B-Trees reduce the cost of updates; for example, WiredTiger uses different formats for in-memory and on-disk pages and saves changes in per-page update buffers first to reduce I/O. Lazy-Adaptive Trees group nodes into subtrees and attach a buffer for batched operations to each subtree. FD-Trees append all changes to a small mutable head tree plus multiple immutable sorted runs, using logarithmically sized sorted runs and fractional cascading to maintain pointers between levels. To reduce write amplification, the Buzzword-Tree (Bw-Tree) batches updates using append-only storage and uses compare-and-swap operations instead of synchronization. Cache-oblivious B-Trees use cache-oblivious structures that give asymptotically optimal performance regardless of the underlying memory hierarchy: they optimize for two levels of the hierarchy (page cache and disk) and partition the disk into blocks, so that transfers between the page cache and disk stay within a constant factor of optimal without relying on platform-specific parameters.
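A minimal copy-on-write sketch in the spirit of shadow paging (reusing the toy Node from the earlier lookup sketch; splits are ignored): only the update path is copied, and publishing the change is a single root-pointer swap, so concurrent readers keep seeing a consistent tree.

import bisect
import copy

def cow_insert(node, key, value):
    """Return a copy of `node` with `key` inserted; the original tree is never modified."""
    clone = copy.copy(node)
    if clone.children is None:                       # leaf: copy its key/value lists
        i = bisect.bisect_right(clone.keys, key)
        clone.keys = clone.keys[:i] + [key] + clone.keys[i:]
        clone.values = clone.values[:i] + [value] + clone.values[i:]
    else:                                            # internal: copy only the path to one child
        i = bisect.bisect_right(clone.keys, key)
        clone.children = list(clone.children)
        clone.children[i] = cow_insert(clone.children[i], key, value)
    return clone

# Readers follow tree.root; publishing the update is one atomic pointer swap:
# tree.root = cow_insert(tree.root, key, value)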

Chapter seven discusses log-structured storage, namely immutable LSM Trees that use append-only storage and merge their components over time. Since B-Trees have high write amplification, LSM Trees provide an alternative based on buffering and append-only storage: they write immutable files and merge them together over time. An LSM Tree uses a smaller in-memory buffer (the memtable) and larger disk-resident tables. A separate write-ahead log is appended to and committed before the in-memory operation is acknowledged to the client. After a disk flush, the memory and flushed disk contents are discarded and replaced with the result of their merge. In LSM Trees, redundant records are reconciled during reads, and tombstones are used to mark deleted records; some implementations use predicate deletes to remove a range of keys. LSM Trees may use compaction to optimize access, such as the leveled compaction used by RocksDB, where level-0 tables are created by flushing memtable contents and their contents are later merged to create level 1. Other LSM Trees use size-tiered compaction that groups disk tables based on size, or time-window compaction (used by Cassandra). As opposed to B-Trees, which are read-optimized, LSM Trees do not require locating the record on disk during a write, but reads are more expensive with the default configuration. The chapter then reviews sorted string tables (SSTables), often used to implement disk-resident tables. An SSTable consists of an index file and a data file, where the index uses B-Trees or hash tables for lookups and the data file holds records in key order. During compaction, data files can be read sequentially, and since merge iteration is order-preserving, the merged table can be created in a single run (see the sketch below). The chapter also introduces Bloom filters, which test whether an element might be a member of a set, and skip lists, which keep data sorted using probabilistic balancing: a skip list builds a hierarchy of linked lists at different heights, where each node has more than one successor pointing to nodes at lower levels.
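A sketch of that single-pass merge (the TOMBSTONE marker and in-memory lists are stand-ins; a real implementation streams runs from disk): runs are merged in key order, the newest version of each key wins, and tombstones are dropped from the output.

import heapq

TOMBSTONE = object()   # marker for deleted keys (real stores encode this on disk)

def compact(*runs):
    """Merge sorted runs of (key, value) pairs, newest run first, keeping only the
    most recent version of each key and dropping tombstones."""
    tagged = [[((key, age), value) for key, value in run]
              for age, run in enumerate(runs)]          # lower age == newer run
    last_key = object()
    for (key, _age), value in heapq.merge(*tagged):
        if key == last_key:
            continue                                    # shadowed by a newer version
        last_key = key
        if value is not TOMBSTONE:
            yield key, value

memtable_flush = [("a", 1), ("c", TOMBSTONE)]
older_run = [("a", 0), ("b", 7), ("c", 9)]
print(list(compact(memtable_flush, older_run)))         # [('a', 1), ('b', 7)]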

Chapter eight opens the second half of the book, which focuses on distributed systems. It introduces the concepts of concurrency and parallelism, where concurrent executions can interleave and shared state must be protected, whereas parallel operations are executed by multiple processors. The chapter defines system reliability in terms of fault tolerance and discusses the fallacies of distributed computing (attributed to Peter Deutsch). In real applications, processing and latency are not instantaneous and queue capacity is not infinite, which also requires back-pressure; queue sizes should be chosen by measuring task processing time and the average time each task spends in the queue. Distributed systems also have to deal with clock and time differences across machines and with state consistency, for example via read-time data repair or eventually consistent designs. Detecting failures in distributed systems is hard and requires heartbeat protocols, and network partitions can result in partial failures. The chapter explains cascading failures that can propagate from one part of the system to another; an exponential backoff strategy with jitter helps avoid amplifying problems. Messages can get lost, delayed or reordered in a distributed system, and a sender may retry without knowing whether the message was already delivered: a fair-loss link eventually delivers a message if the sender keeps retrying, finite duplication guarantees messages are not duplicated infinitely, and a no-creation link never delivers a message that was never sent. Distributed systems use acknowledgments with sequence numbers to notify the sender, and the sender may retransmit in the absence of an ack (a stubborn link resends messages indefinitely). To prevent duplicate processing as a result of retransmission, you can use idempotent operations. Messages can also arrive out of order, and the recipient can use sequence numbers to detect this and hold a message in a buffer until the earlier messages arrive. A perfect link guarantees reliable delivery, no duplication and no creation (it delivers only messages that were actually sent). Exactly-once delivery is very hard in distributed systems, and most real applications use at-least-once delivery (at-most-once is not reliable). The chapter describes the Two Generals' Problem to show that, with asynchronous communication and link failures, participants cannot be certain of agreement even with perfect delivery, since parties may not be alive or connected: no matter how many acknowledgments are exchanged, you can never be sure the last message was delivered to both parties. The FLP impossibility result goes further, showing that consensus cannot be guaranteed within bounded time in an asynchronous system. The chapter finally discusses failure models such as crash faults, omission faults (skipping execution of certain steps) and arbitrary (Byzantine) faults, and notes that process groups and redundancy can mask these failures from the user.
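A small sketch of that retransmission pattern: a stubborn sender keeps resending unacknowledged messages, and the receiver uses sequence numbers to acknowledge and deduplicate. The transmit/ack/deliver callables are my own stand-ins for the network, chosen to keep the sketch self-contained.

class StubbornSender:
    def __init__(self, transmit):
        self.transmit = transmit      # callable that puts (seq, msg) on an unreliable link
        self.next_seq = 0
        self.unacked = {}             # seq -> message, resent until acknowledged

    def send(self, msg):
        self.unacked[self.next_seq] = msg
        self.next_seq += 1

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def retransmit_tick(self):
        # "Stubborn": keep resending everything not yet acknowledged.
        for seq, msg in list(self.unacked.items()):
            self.transmit(seq, msg)

class DedupReceiver:
    def __init__(self, send_ack, deliver):
        self.send_ack = send_ack
        self.deliver = deliver
        self.seen = set()

    def on_message(self, seq, msg):
        self.send_ack(seq)            # always re-ack; the previous ack may have been lost
        if seq not in self.seen:      # drop duplicates caused by retransmission
            self.seen.add(seq)
            self.deliver(msg)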

Chapter nine discusses failure detection, where a failure detector identifies failed or unreachable processes to exclude them from the algorithm, guaranteeing liveness while preserving safety. Most distributed systems use heartbeats to detect failures, where each process notifies its peers of its status in response to a heartbeat. Each process maintains a list of other processes and updates it with the last response time. Some systems use a deadline failure detector, which uses heartbeats to detect whether a process has failed to respond within a fixed time interval. Alternatively, other systems use outsourced heartbeats to improve reliability by using information from an external perspective. The phi-accrual failure detector calculates the probability of a process's crash from a sampled distribution of heartbeat arrival times (see the sketch below). Other approaches gossip by maintaining a heartbeat counter and periodically sending it to a random neighbor. Yet another approach arranges active processes into groups, where a process failure is detected by the group's participants and propagated as a group failure.
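A simplified phi-accrual sketch, under my own assumptions: heartbeat inter-arrival times are modeled with a normal distribution, and suspicion is reported as phi = -log10 of the probability that a heartbeat would arrive this late.

import math
import statistics
import time

class PhiAccrualDetector:
    def __init__(self, threshold=8.0, window=100):
        self.threshold = threshold
        self.window = window
        self.intervals = []           # sampled inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = now if now is not None else time.monotonic()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]
        self.last_heartbeat = now

    def phi(self, now=None):
        now = now if now is not None else time.monotonic()
        if len(self.intervals) < 2:
            return 0.0                # not enough samples to judge
        dist = statistics.NormalDist(statistics.mean(self.intervals),
                                     max(statistics.stdev(self.intervals), 1e-3))
        p_later = max(1.0 - dist.cdf(now - self.last_heartbeat), 1e-12)
        return -math.log10(p_later)

    def suspect(self, now=None):
        return self.phi(now) > self.threshold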

Chapter ten goes over leader election while maintaining liveness, stability and safety. It starts with the bully algorithm, which uses process rank (e.g. the highest IP address) to identify the new leader; however, it is susceptible to split brain and can cause problems if the highest-ranked node is unstable. Next-in-line failover is an alternative in which the leader provides a list of failover nodes and the next highest-ranked node is selected when the leader fails. The candidate/ordinary algorithm splits nodes into candidate and ordinary groups, where one of the candidate nodes becomes the leader (the highest-ranked node that is alive). The invitation algorithm allows processes to invite other processes to join their groups, with smaller groups merging into bigger ones. The ring algorithm uses a ring topology where each process contacts its successor, passing along a set of live nodes, until one of the nodes responds; the highest-ranked node from the live set is chosen as the leader. Lastly, a consensus algorithm together with a failure detector can be used to elect a leader.
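A toy simulation of the bully election (the in-process cluster dictionary stands in for real message passing, an assumption to keep the sketch self-contained):

class Node:
    def __init__(self, rank, cluster):
        self.rank = rank
        self.cluster = cluster        # rank -> Node; stands in for the network
        self.alive = True
        self.leader = None

    def start_election(self):
        higher = [n for r, n in self.cluster.items() if r > self.rank and n.alive]
        if not higher:
            # Nobody with a higher rank answered: announce self as the leader.
            for n in self.cluster.values():
                if n.alive:
                    n.leader = self.rank
        else:
            # A higher-ranked node is alive; it "bullies" us and runs its own election.
            max(higher, key=lambda n: n.rank).start_election()

cluster = {}
for rank in (1, 2, 3):
    cluster[rank] = Node(rank, cluster)
cluster[3].alive = False              # the old leader has crashed
cluster[1].start_election()
print(cluster[1].leader)              # 2, the highest-ranked live node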

Chapter eleven examines replication and consistency, and properties such as availability, fault tolerance and redundancy. It reviews the CAP theorem, where availability requires non-failing nodes to deliver results and linearizable consistency preserves the original operation order. In an asynchronous system you cannot guarantee both consistency and availability in the presence of a network partition, so you have to choose best-effort availability or best-effort consistency (or sacrifice latency). Also, the CAP theorem is concerned with network partitions, where a node may serve an incorrect response, rather than node crashes, where a node doesn't respond at all. The chapter reviews the concepts of harvest and yield in the context of the CAP conjecture: harvest allows returning partial results, and yield compares the number of requests that succeeded against the number attempted; these properties focus on trade-offs as opposed to absolute guarantees. Distributed systems may abstract message passing and represent state as shared memory, where each unit of storage is called a register. Each operation is tracked with invocation and completion events, and an operation is considered failed if the process crashes before completing it; operations that overlap with other operations are called concurrent operations. Registers can be categorized into safe (a read concurrent with a write may return an arbitrary value), regular (a read returns either the old or the new value), and atomic (linearizable).

Consistency models provide different semantics and guarantees from the perspective of state and operations in distributed/concurrent systems. For example, strict consistency provides complete replication transparency, as if protected by a global lock, but it is impractical in real life. Linearizability guarantees visibility of writes to all readers exactly once without exposing partial state: if two operations overlap, any read that occurs after the write completes can observe its effect. It provides a total order of operations running concurrently, so that every read of a shared value returns the latest value written to it, and the linearization point is where the effect of an operation atomically becomes visible. Linearizability is expensive to implement as it requires coordination and ordering, but compare-and-swap can help: first prepare the result, then use CAS to swap pointers and publish the new state. Sequential consistency is a step below linearizability; it executes operations in some sequential order in which the operations of each individual process appear in their program order. In causal consistency, all processes see causally related operations in the same order; this can be achieved by attaching logical clocks to messages and processing an operation only after the operations it depends on have completed. The chapter defines the vector clock as a structure for establishing a partial order between events: processes maintain vectors of logical clocks, one clock per process, incremented every time a new event arrives. To resolve conflicts, a replica that sees concurrent values for the same key appends a new version to the version vector, which establishes the causal relationships. The chapter then discusses session models, which evaluate consistency from the perspective of a client and assume the client's operations are sequential: the read-own-writes model; the monotonic reads model, which guarantees that you cannot read an old value once you have seen a newer one; and the monotonic writes model, which guarantees that a write of v2 issued after a write of v1 is observed in that order.
The writes-follow-reads model ensures that writes are ordered after the writes whose effects were observed by previous read operations. Eventual consistency propagates updates asynchronously, and the latest value is resolved using last-write-wins or vector clocks. Eventually consistent systems provide parameters to tune availability against consistency, such as the replication factor (N), write consistency (W) and read consistency (R); you are guaranteed to read the most recent value when R + W > N. Replication can be optimized by grouping nodes into copy and witness subsets, where witness replicas may store updates if copy replicas are falling behind. The chapter ends with a discussion of strong eventual consistency and CRDTs, specialized data structures that converge regardless of the order in which operations are applied; the allowed operations have to be side-effect free, commutative and causally ordered.
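A minimal vector clock sketch (not tied to any particular database) showing how causal ordering and concurrent, conflicting versions are detected:

def increment(clock, node):
    """Bump this node's entry before recording a local event or sending a message."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Pairwise maximum, applied when a message's clock is received."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happened_before(a, b):
    """True if the event with clock `a` causally precedes the event with clock `b`."""
    return all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b)) and a != b

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)

a = increment({}, "n1")            # {'n1': 1}
b = increment(a, "n2")             # {'n1': 1, 'n2': 1}
c = increment(a, "n3")             # {'n1': 1, 'n3': 1}
print(happened_before(a, b))       # True
print(concurrent(b, c))            # True: conflicting siblings that need reconciliation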

Chapter twelve discusses anti-entropy and dissemination of updates via broadcast, peer-to-peer and cooperative broadcast. Broadcasting to all processes is expensive with a large number of nodes and unreliable when done from a single process. Anti-entropy brings nodes back in sync after failures: entropy measures disorder in the system, and anti-entropy brings nodes back up to date when delivery fails. Read repair detects and eliminates inconsistencies and can be implemented as a blocking or asynchronous operation; blocking read repair ensures read monotonicity for quorum reads. Instead of issuing a full read request to each node, the coordinator can issue one full read and send digest requests to the other replicas, then repair in case of inconsistencies. Another alternative is hinted handoff, a write-side repair mechanism where the write coordinator stores a hint record and replays it to the target node when it comes back; some databases use sloppy quorums along with hinted handoff, where write operations use additional nodes that update the crashed node when it returns. Merkle trees provide a compact hash representation of the local data: replicas compare root-level hashes to check for inconsistency and only exchange the ranges whose hashes differ (see the sketch below). Bitmap version vectors can also be used to resolve data conflicts based on recency, where logs of operations kept on each node are compared with other nodes and missing data is replicated to the target node. Gossip dissemination uses gossip protocols, probabilistic communication procedures modeled on how rumors or epidemics spread: an infective node randomly updates susceptible neighboring processes, which in turn propagate the information further. The message redundancy metric captures the overhead of repeated delivery, and the amount of time to reach convergence is called latency. Push/lazy-push multicast trees make a trade-off between epidemic and tree-based broadcast primitives by creating a spanning-tree overlay of nodes to distribute messages with the least overhead: a node sends the full message to a small subset of peers and just the message ID to the rest, and a peer can fetch the payload if it doesn't already have it.
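A small Merkle tree sketch: each replica hashes its key ranges bottom-up, roots are compared first, and only differing ranges need to be exchanged (hashing whole key=value strings here is a simplification of my own).

import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_merkle(leaves):
    """Build the tree bottom-up; each level is a list of hashes, tree[-1][0] is the root."""
    level = [h(leaf) for leaf in leaves]
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def ranges_to_repair(tree_a, tree_b):
    """Compare roots first; only when they differ, find the leaf ranges that disagree."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return []                                  # replicas agree, nothing to exchange
    return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]

replica_a = build_merkle([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
replica_b = build_merkle([b"k1=v1", b"k2=OLD", b"k3=v3", b"k4=v4"])
print(ranges_to_repair(replica_a, replica_b))      # [1] -> only that range needs repair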

Chapter thirteen reviews distributed transactions. To make multi-partition operations appear atomic you can use an atomic commitment algorithm, which provides prepare, commit and rollback operations along with a transaction manager. For example, two-phase commit executes in two phases, prepare and commit/abort, where a coordinator collects votes and the rest of the nodes, called cohorts, operate over disjoint datasets. In case of a cohort failure, the coordinator can replay the decision from its log; in case of a coordinator failure, cohorts may be unable to learn the final decision. To make atomic commitment more robust against coordinator failure, three-phase commit adds an extra step (propose, prepare and commit/abort), and the transaction is aborted on coordinator failure or operation timeout. Next, the chapter reviews the Calvin approach, which uses a deterministic transaction order to remove the need for coordination (as opposed to the non-deterministic transaction ordering in most databases that use two-phase or optimistic locking): Calvin uses a sequencer that determines the order of transactions and establishes a global transaction input sequence, and it may split time into epochs to minimize contention. The chapter discusses data partitioning and consistent hashing, which maps hashes onto a ring where each node gets its own position and becomes responsible for a range of values. If serializability is not required, you may use snapshot isolation, which guarantees that all reads made within a transaction are consistent with a snapshot of the database, and on a write-write conflict only the first committer wins. Lastly, the chapter discusses mechanisms to avoid coordination while preserving data integrity constraints.
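A toy two-phase commit coordinator (in-memory stand-ins for the cohorts and the coordinator log; a real implementation would make the prepare records and the decision durable before acting on them):

class Cohort:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self, txn):
        # Phase 1: a cohort votes yes only if it can make the transaction durable locally.
        self.state = "prepared" if self.can_commit else "aborting"
        return self.can_commit

    def finish(self, txn, decision):
        # Phase 2: every cohort applies the coordinator's decision.
        self.state = "committed" if decision == "commit" else "aborted"

def two_phase_commit(coordinator_log, cohorts, txn):
    votes = [c.prepare(txn) for c in cohorts]       # collect votes from all cohorts
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((txn, decision))         # the decision must be logged first
    for c in cohorts:
        c.finish(txn, decision)
    return decision

log = []
print(two_phase_commit(log, [Cohort("a"), Cohort("b")], "txn-1"))                    # commit
print(two_phase_commit(log, [Cohort("a"), Cohort("b", can_commit=False)], "txn-2"))  # abort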

Chapter fourteen discusses consensus, which focuses on agreement, validity and termination (reaching a decision). The chapter introduces broadcast communication; however, it may result in an inconsistent state if the coordinator crashes in the middle of a broadcast. Atomic broadcast guarantees reliable delivery (atomicity) and total order. For example, the virtual synchrony framework organizes processes into groups, and messages to all of a group's members are delivered in the same order. In ZooKeeper atomic broadcast (ZAB), a process takes the role of leader or follower, and the protocol splits the timeline into epochs identified by a monotonically increasing sequence number. Atomic broadcast is equivalent to consensus in asynchronous systems with crash failures. Paxos is a commonly used consensus algorithm that defines three roles: proposers, acceptors and learners. It is split into two phases: voting (proposers compete for leadership) and replication (the proposer distributes values to acceptors). When an acceptor receives a prepare request, it can accept the proposal, respond with a previously accepted message, or notify the proposer that its local sequence number is higher. During the replication phase, the proposer starts replication by sending an Accept message to all acceptors. Paxos uses quorums to ensure that some participants can fail while the protocol still proceeds, as long as the minimum number of votes required for the operation is available: given 2f + 1 processes, f processes can fail and the remaining f + 1 can proceed. The Multi-Paxos algorithm introduces the role of a leader, a distinguished proposer, to improve efficiency; the leader periodically contacts the participants to notify them that it is still alive, with a lease timeout so that participants won't select another leader until the lease expires. Fast Paxos reduces the number of messages by letting any proposer contact acceptors directly rather than going through the leader, at the cost of requiring 3f + 1 processes in total. Egalitarian Paxos partitions responsibility across replicas, using a per-command leader for the commit of a specific command. Flexible Paxos relaxes the quorum requirement to the intersection of the nodes used in the propose and accept phases: given N participants, Q1 nodes for the propose phase and Q2 nodes for the accept phase, it is enough to ensure Q1 + Q2 > N, so Q2 can contain N/2 acceptors and Q1 = N − Q2 + 1. Next, the chapter discusses the Raft algorithm, which makes the concept of a leader a first-class citizen: the leader coordinates state machine manipulation and replication, similar to atomic broadcast and Multi-Paxos, replicating multiple values instead of just one (a single leader makes atomic decisions and establishes the message order). Each participant in Raft takes the role of candidate, leader (for a term) or follower (similar to acceptor/learner). Raft divides time into terms/epochs to guarantee a global partial order without relying on clock synchronization; terms are monotonically increasing, and each command is uniquely identified by its term number. During leader election, candidates send a RequestVote message to other processes, including the candidate's term and the index and term of the last log entry it has observed. After collecting a majority of votes, the candidate becomes the leader for the term. Raft uses periodic heartbeats to ensure the liveness of participants, and a new election is started after an election timeout.
The leader repeatedly appends new values to the replicated log by sending AppendEntries messages that include the leader's term and the index and term of the preceding log entry. A candidate can be elected only if its term is at least as high as the follower's and its log is at least as up to date. In case of a split vote, Raft uses randomized election timers to reduce the probability of subsequent elections also ending in a split vote. The leader sends heartbeats to the followers to detect failures, and a new election is initiated if the leader is down. The leader never removes or reorders its log contents; it only appends new messages to it.
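A sketch of the follower-side RequestVote check implied by these rules (the state object and its fields are hypothetical names of my own):

def handle_request_vote(state, term, candidate_id, last_log_index, last_log_term):
    """Grant at most one vote per term, and only to a candidate whose log is
    at least as up to date as ours (a sketch, not a full Raft implementation)."""
    if term < state.current_term:
        return False                                # stale candidate
    if term > state.current_term:
        state.current_term = term                   # step down into the newer term
        state.voted_for = None
    log_up_to_date = (last_log_term, last_log_index) >= (state.last_log_term,
                                                         state.last_log_index)
    if log_up_to_date and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id
        return True
    return False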

The chapter then reviews Byzantine consensus for distributed systems deployed in adversarial environments prone to Byzantine failures such as ill intent, bugs, misconfiguration and data corruption. Most Byzantine consensus algorithms require N^2 messages to complete a single algorithm step, where N is the size of the quorum. The chapter discusses Practical Byzantine Fault Tolerance (PBFT), which assumes independent node failures, i.e. the entire system cannot be taken over at once. All communication is encrypted, and replicas know one another's public keys to verify identities. PBFT guarantees both safety and liveness provided no more than (n − 1) / 3 replicas are faulty; to sustain f compromised nodes, the system must have at least n = 3f + 1 nodes. To distinguish between cluster configurations, PBFT uses views: in each view one of the replicas is the primary and the rest are backups. All nodes are numbered consecutively, and the index of the primary node is v mod N, where v is the view ID and N is the number of nodes; the view changes when the primary fails. Clients execute their operations against the primary, which broadcasts the request to the backups; they execute the request and send responses back to the client. The client waits for f + 1 replicas to respond with the same result for the operation to succeed. Replicas save accepted messages in a stable log, where a message is kept until it has been executed by at least f + 1 nodes; this log can be used for recovery in case of network partition, but it is verified to prevent it becoming an attack vector. After every N requests, the primary makes a stable checkpoint, where it broadcasts the latest sequence number and waits for 2f + 1 replicas to respond, which constitutes proof for this checkpoint.
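The sizing arithmetic from this chapter, restated in one small sketch (just the formulas quoted above):

def pbft_parameters(f):
    """n = 3f + 1 replicas tolerate f Byzantine faults."""
    n = 3 * f + 1
    return {
        "replicas": n,
        "checkpoint quorum": 2 * f + 1,   # responses needed for a stable checkpoint
        "client reply quorum": f + 1,     # matching replies the client waits for
    }

def primary_for_view(view, n):
    return view % n                       # index of the primary replica in this view

print(pbft_parameters(1))                 # {'replicas': 4, ...}
print(primary_for_view(5, 4))             # 1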
