Intro

KV stores are essentially distributed hash tables. The key is hashed to determine where in the structure it is stored, the value can be of any data type, and the key does not depend on the contents of the value.

Examples of key-value store use cases: shopping carts, best-seller lists, session management, etc.

Functional requirements

  • service is configurable: the consistency/availability trade-off can be tuned per use case
  • ability to always write: users should be able to write even when faults exist
  • hardware heterogeneity: servers of any specification can be added to the cluster and it should just work, with load balanced correctly and no distinguished nodes

Nonfunctional requirements

  • Scalability: 1000s of servers distributed across the world
  • Fault tolerance: uninterrupted service despite failures

Design API

  • get: takes a key and returns the object from the store. If the data is replicated, the replicas are consulted and a result is returned according to the configured consistency model
  • put: takes a key and a value and stores (or updates) the pair in the store
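
A minimal single-node sketch of that API surface (illustrative Python, not any particular store's interface):

    class KeyValueStore:
        """Illustrative single-node stand-in for the distributed store."""

        def __init__(self):
            self._data = {}

        def get(self, key):
            # Return the object stored under key, or None if absent.
            # In a replicated setup this would consult replicas according
            # to the configured consistency model.
            return self._data.get(key)

        def put(self, key, value):
            # Store the (key, value) pair; the value can be any object.
            self._data[key] = value

    store = KeyValueStore()
    store.put("cart:42", {"items": ["book", "pen"]})
    print(store.get("cart:42"))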

The key is often a hash; a common hashing algorithm is MD5, which produces a 128-bit digest, about 3.4 × 10^38 possible values, a 39-digit decimal number!
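
A quick sanity check of that figure, treating a 128-bit MD5 digest as an integer:

    import hashlib

    digest = hashlib.md5(b"session:alice").digest()   # 16 bytes = 128 bits
    key_id = int.from_bytes(digest, "big")            # integer in [0, 2**128)
    print(2 ** 128)            # 340282366920938463463374607431768211456
    print(len(str(2 ** 128)))  # 39 decimal digits, ~3.4 x 10^38 values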

Scalability

A single-node key-value store is not very useful, as it is a single point of failure. We can add more nodes to form a cluster, but then we need to ensure that load is balanced across them.

Consistent hashing comes to the rescue. Visualise a ring on which the hashed identifiers of the nodes are placed, and onto which hashed requests are placed as they come in. Each request is served by the closest node in the clockwise direction.

This alone can lead to an uneven distribution of load. The solution is to create virtual nodes on the ring by hashing the same node's identifier with multiple hash functions, giving each node a number of virtual nodes proportional to its capacity to serve requests.

That evens out the load distribution.
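
A minimal sketch of a consistent-hash ring with virtual nodes (MD5 is just one possible hash here, and the vnodes-proportional-to-capacity rule is an assumption for illustration):

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self):
            self.hashes = []   # sorted hashes marking positions on the ring
            self.points = []   # parallel list of (hash, physical node) points

        @staticmethod
        def hash_key(key):
            return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

        def add_node(self, node, vnodes=100):
            # Give higher-capacity servers more virtual nodes to even out load.
            for i in range(vnodes):
                h = self.hash_key(f"{node}#{i}")
                idx = bisect.bisect(self.hashes, h)
                self.hashes.insert(idx, h)
                self.points.insert(idx, (h, node))

        def get_node(self, key):
            # Walk clockwise: the first point at or after the key's hash owns it.
            idx = bisect.bisect(self.hashes, self.hash_key(key)) % len(self.hashes)
            return self.points[idx][1]

    ring = ConsistentHashRing()
    ring.add_node("node-a", vnodes=100)
    ring.add_node("node-b", vnodes=200)   # twice the capacity, twice the vnodes
    print(ring.get_node("cart:42"))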

Replication

With N nodes, every node could act as a coordinator for writes, or only some nodes could coordinate writes while the rest serve as read replicas. The latter is the primary-secondary approach, as in traditional RDBMS setups, and it leads to situations where, if the coordinator has a problem, there is no one to serve write requests.

Peer-to-peer replication

Every node can be a coordinator. Each key has a preference list of nodes to which its write requests go. Every node replicates to n neighbours, generally keeping n at 3 or 5 instead of replicating the data to all the nodes in the cluster: the more nodes the data has to be replicated to, the higher the write latency. This trade-off is configurable by the user.

Replication to the first replica is done synchronously, while the other replicas are updated asynchronously.
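
Continuing the ring sketch above, a preference list can be built by walking clockwise from the key's position and collecting the next n distinct physical nodes; the coordinator then writes to the first replica synchronously and to the rest asynchronously. The helper and the write_sync/write_async stubs below are hypothetical, not part of any specific system:

    import bisect

    def preference_list(ring, key, n=3):
        # The next n distinct physical nodes clockwise from the key's position.
        start = bisect.bisect(ring.hashes, ring.hash_key(key))
        nodes = []
        for i in range(len(ring.points)):
            _, node = ring.points[(start + i) % len(ring.points)]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

    def write_sync(node, key, value):
        # Placeholder: block until the node acknowledges the write.
        print(f"sync write of {key} to {node}")

    def write_async(node, key, value):
        # Placeholder: fire-and-forget, e.g. handed to a background worker.
        print(f"async write of {key} to {node}")

    def replicate_write(ring, key, value, n=3):
        first, *rest = preference_list(ring, key, n)
        write_sync(first, key, value)       # first replica: synchronous
        for node in rest:
            write_async(node, key, value)   # remaining replicas: asynchronous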

Now we have to deal with the situation where, in a peer-to-peer replication setup, a node loses its connection. The remaining nodes can continue serving reads and accepting writes independently, but how do we restore order when the disconnected node comes back up?

Handling inconsistency

Versioning the data is an option: data written to the cluster carries version information. This is, however, not very helpful on its own for resolving consistency issues after a failure or a network partition. In that case the system can hold multiple copies of the same data with different values, so reads need some form of conflict resolution to avoid divergent histories.

A slightly better approach is to store metadata that gives the conflict resolution a hint. Each copy of an object carries a metadata record holding causality information, and the metadata is updated every time an event happens to that copy. It is generally implemented using some sort of vector clock.

A vector clock is a data structure holding a list of (node, counter) pairs. Each node keeps a counter of the operations it has performed on the key, and comparing two clocks tells us whether one version descends from the other or the two are concurrent.
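
A minimal vector-clock sketch, assuming each node increments its own counter for every write it coordinates and that clocks are compared counter by counter:

    class VectorClock:
        def __init__(self, counters=None):
            self.counters = dict(counters or {})   # node id -> operation count

        def increment(self, node):
            # Called by the coordinating node on every write it handles.
            self.counters[node] = self.counters.get(node, 0) + 1

        def descends_from(self, other):
            # True if this clock is causally equal to or newer than `other`.
            return all(self.counters.get(n, 0) >= c
                       for n, c in other.counters.items())

        def concurrent_with(self, other):
            # Neither descends from the other: a real conflict that read-time
            # reconciliation (or the application) has to resolve.
            return not self.descends_from(other) and not other.descends_from(self)

    a = VectorClock({"node-a": 2, "node-b": 1})
    b = VectorClock({"node-a": 1, "node-b": 1})
    c = VectorClock({"node-a": 1, "node-b": 2})
    print(a.descends_from(b))     # True  -> a supersedes b, keep a
    print(a.concurrent_with(c))   # True  -> divergent copies, must reconcile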

Merkle Trees for resolving data differences

Merkle trees help synchronise data between the nodes of a cluster while minimising how much data moves across nodes. Each replica builds a hash tree over a key range: leaves hash individual keys (or buckets of keys) and every parent hashes its children, so two replicas can compare root hashes first and recurse only into the subtrees that differ.
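
A toy sketch of the idea: each replica hashes its keys into a tree, and two replicas compare hashes top-down, descending only where they differ (the flat-list bucket layout here is an assumption for illustration):

    import hashlib

    def h(data):
        return hashlib.md5(data).hexdigest()

    def build_merkle(items):
        # items: (key, value) pairs for one key range, in a fixed order.
        level = [h(f"{k}:{v}".encode()) for k, v in items]
        tree = [level]
        while len(level) > 1:
            # Pair up hashes; an odd leftover is carried up unchanged.
            level = [h((level[i] + level[i + 1]).encode())
                     if i + 1 < len(level) else level[i]
                     for i in range(0, len(level), 2)]
            tree.append(level)
        return tree   # tree[-1][0] is the root hash

    replica_a = build_merkle([("k1", "v1"), ("k2", "v2"), ("k3", "v3")])
    replica_b = build_merkle([("k1", "v1"), ("k2", "changed"), ("k3", "v3")])

    # Equal roots mean the ranges are already in sync; differing roots mean
    # we walk down the differing subtrees and ship only the keys that diverged.
    print(replica_a[-1][0] == replica_b[-1][0])   # False -> descend to find k2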