Facebook Engineering recently published an article describing how it built its general-purpose key-value store, ZippyDB. ZippyDB is Facebook’s biggest key-value store and has been in production for more than six years. It offers applications tunable durability, consistency, availability, and latency guarantees. Its use cases include metadata for distributed filesystems, event counting for internal and external purposes, and product data used for various app features.
Sarang Masti, a software engineer at Facebook, provides insight into the motivation for creating ZippyDB:
ZippyDB uses RocksDB as the underlying storage engine. Before ZippyDB, various teams across Facebook used RocksDB directly to manage their data. This resulted, however, in a duplication of efforts in terms of each team solving similar challenges such as consistency, fault tolerance, failure recovery, replication, and capacity management. To address the needs of these various teams, we built ZippyDB to provide a highly durable and consistent key-value data store that allowed products to move a lot faster by offloading all the data and the challenges associated with managing this data at scale to ZippyDB.
A ZippyDB deployment (called a “tier”) consists of compute and storage resources spread across several regions worldwide. Each deployment hosts multiple use cases in a multi-tenant fashion. ZippyDB splits the data belonging to a use case into shards. Depending on the configuration, it replicates each shard across multiple regions for fault tolerance, using either Paxos or asynchronous replication.
A subset of replicas per shard is part of a quorum group, where data is synchronously replicated to provide high durability and availability in case of failures. The remaining replicas, if any, are configured as followers using asynchronous replication. Followers allow applications to have many in-region replicas to support low-latency reads with relaxed consistency while keeping the quorum size small for lower write latency. This flexibility in replica role configuration within a shard allows applications to balance durability, write performance, and read performance depending on their needs.
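The trade-off described above can be sketched with a small model. This is illustrative only (the `ShardConfig` type and its fields are assumptions, not ZippyDB's actual configuration schema): the write quorum is a majority of the synchronous group, so adding asynchronous followers increases read capacity without growing the quorum.

```python
# Hypothetical sketch of a shard's replica-role configuration.
# A subset of replicas forms the synchronous quorum group; the rest
# are asynchronous followers used for low-latency in-region reads.
from dataclasses import dataclass

@dataclass
class ShardConfig:
    quorum_replicas: int   # synchronously replicated (e.g., via Paxos)
    followers: int         # asynchronously replicated followers

    @property
    def write_quorum(self) -> int:
        # A write must be acknowledged by a majority of the quorum group;
        # followers never count toward it.
        return self.quorum_replicas // 2 + 1

# Five replicas in total, but only two acknowledgments gate each write.
cfg = ShardConfig(quorum_replicas=3, followers=2)
print(cfg.write_quorum)  # 2
```

This is why applications can add in-region followers for read performance without paying a write-latency penalty: the quorum stays small.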
ZippyDB provides configurable consistency and durability levels to applications, specified as options in its read and write APIs. For writes, ZippyDB persists the data on a majority of replicas by default. As a result, a read on the primary will always see the most recent write. In addition, it supports a lower-consistency fast-acknowledge mode, in which the primary acknowledges writes as soon as it queues them for replication.
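The two write paths can be contrasted with a toy simulation. Everything here is a hypothetical sketch (the `WriteMode` enum and replica dictionaries are invented for illustration, not ZippyDB's API): fast-acknowledge returns after queueing on the primary, while the default path waits for a majority to persist.

```python
# Illustrative model of the two durability modes described above.
from enum import Enum, auto

class WriteMode(Enum):
    QUORUM = auto()     # default: ack after a majority persist the write
    FAST_ACK = auto()   # ack once the write is queued on the primary

def write(key: bytes, value: bytes, replicas: list, mode: WriteMode) -> bool:
    primary = replicas[0]
    primary["queue"].append((key, value))
    if mode is WriteMode.FAST_ACK:
        # Lower durability: replication continues asynchronously.
        return True
    # Quorum mode: stop once a majority has persisted the write.
    acks = 0
    for replica in replicas:
        replica["store"][key] = value  # stand-in for replicate + persist
        acks += 1
        if acks > len(replicas) // 2:
            return True
    return False

replicas = [{"queue": [], "store": {}} for _ in range(3)]
ok = write(b"k", b"v", replicas, WriteMode.QUORUM)
```

With quorum mode on three replicas, the write returns after two of them persist it, which is what makes a subsequent read on the primary always see the latest write.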
For reads, ZippyDB supports eventually-consistent, read-your-writes, and strong read modes. “For read-your-writes, the clients cache the latest sequence number returned by the server for writes and use the version to run at-or-later queries while reading.” ZippyDB implements strong reads by routing the reads to the primary to avoid the need to speak to a quorum. “In certain outlier cases, where the primary hasn’t heard about the lease renewal, strong reads on primary turn into a quorum check and read.”
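The quoted read-your-writes mechanism can be sketched as follows. The class and method names are hypothetical: the point is only that the client caches the sequence number of its last write and sends it back as a floor for subsequent reads.

```python
# Sketch of the read-your-writes pattern: the client remembers the
# highest write version it has seen and issues "at-or-later" reads.
class Server:
    def __init__(self):
        self.seq = 0
        self.data = {}

    def write(self, key, value):
        self.seq += 1
        self.data[key] = value
        return self.seq  # sequence number handed back to the client

    def read_at_or_later(self, key, min_seq):
        if self.seq < min_seq:
            # A lagging replica has not yet applied the client's write.
            raise RuntimeError("replica behind requested version; retry")
        return self.data[key]

class Client:
    def __init__(self, server):
        self.server = server
        self.last_seq = 0  # cached latest write sequence number

    def write(self, key, value):
        self.last_seq = self.server.write(key, value)

    def read(self, key):
        # Guarantees this client sees its own writes.
        return self.server.read_at_or_later(key, self.last_seq)

server = Server()
client = Client(server)
client.write("k", "v1")
print(client.read("k"))  # v1
```

A replica that has not caught up to the cached version must either wait, retry, or redirect the read, rather than return stale data to the writing client.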
ZippyDB supports transactions and conditional writes for use cases that need atomic read-modify-write operations on a set of keys. Masti elaborates on ZippyDB’s implementation:
All transactions are serializable by default on a shard, and we don’t support lower isolation levels. This simplifies the server-side implementation and the reasoning about the correctness of concurrently executing transactions on the client-side. Transactions use optimistic concurrency control to detect and resolve conflicts, which works as shown in the figure above.
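Since the referenced figure is not reproduced here, a minimal sketch of optimistic concurrency control may help. This is a generic OCC model, not ZippyDB's implementation: each transaction records its read set against a snapshot version, and the commit is rejected if any key it read was modified after that snapshot.

```python
# Minimal optimistic-concurrency-control sketch: detect read/write
# conflicts at commit time instead of locking up front.
class Store:
    def __init__(self):
        self.data = {}       # key -> value
        self.versions = {}   # key -> version of last modification
        self.commit_seq = 0

    def begin(self):
        return {"snapshot": self.commit_seq, "reads": set(), "writes": {}}

    def read(self, txn, key):
        txn["reads"].add(key)
        return self.data.get(key)

    def commit(self, txn):
        # Conflict check: did any key we read change after our snapshot?
        for key in txn["reads"]:
            if self.versions.get(key, 0) > txn["snapshot"]:
                return False  # conflict detected; caller retries
        self.commit_seq += 1
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.versions[key] = self.commit_seq
        return True

store = Store()
t1, t2 = store.begin(), store.begin()
store.read(t1, "counter"); t1["writes"]["counter"] = 1
store.read(t2, "counter"); t2["writes"]["counter"] = 1
print(store.commit(t1))  # True
print(store.commit(t2))  # False: t2 read a key t1 modified after t2's snapshot
```

Rejecting the second transaction and forcing a retry is what yields a serializable outcome for concurrent read-modify-write operations.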
A shard in ZippyDB, often referred to as a physical shard or p-shard, is the unit of data management on the server side. Applications partition their keyspace into μshards (micro-shards). Each p-shard typically hosts several tens of thousands of μshards. According to Masti, “this additional layer of abstraction allows ZippyDB to reshard the data transparently without any changes on the client.”
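The two-level scheme can be illustrated with a toy mapping. The hashing function, shard counts, and map below are assumptions for illustration only: the key property is that clients key into μshards, while the server-side μshard→p-shard map can change during resharding without client involvement.

```python
# Hypothetical two-level sharding sketch: keys hash to μshards, and a
# server-side map assigns many μshards to each physical shard.
import hashlib

NUM_USHARDS = 40_000  # fixed, client-visible partitioning
NUM_PSHARDS = 4       # server-side; changes during resharding

def ushard_of(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_USHARDS

# Server-side assignment; resharding rewrites entries in this map
# without any change to how clients compute ushard_of().
ushard_to_pshard = {u: u % NUM_PSHARDS for u in range(NUM_USHARDS)}

def pshard_of(key: str) -> int:
    return ushard_to_pshard[ushard_of(key)]
```

Because the client only ever computes the μshard, moving μshards between p-shards is invisible to applications, which is what makes transparent resharding possible.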
ZippyDB achieves further optimization using Akkio mapping between p-shards and μshards. Akkio places μshards in geographical regions where the information is typically accessed. By doing so, Akkio helps reduce data set duplication and provides a more efficient solution for low latency access than placing data in every region.