What is a database partition?
Once you start serving data and your business has grown past a certain point, you’ll have to figure out ways to scale to meet the read/write demands on your database.
Assuming it is a relational database, you could start by optimising it: add indices to the most popular table based on the queries that are served most frequently. Then, after exhausting those options, you could look into splitting your data from one table into many, assuming your query patterns tell you that most queries do not need all of the columns of the widest tables and could do with just a few. Oh! That is data partitioning - although you might still be serving data through a single-node database.
There are limits to how much you can optimise a single-node database. You could then start thinking: what could be done to split the data across multiple nodes? But this will cost you the things you love about a relational database - like your favourite ACID properties. The scale demands it, yet your application code has to change accordingly, as NoSQL databases have different APIs and model data differently. So we stick to all the ways we can tackle specific problems to gain the most efficiency, instead of doing something expensive yet general purpose.
Let’s explore this data partitioning in further detail.
Types of partitioning
Partitioning is also called sharding - it is an approach in which a large dataset is split into smaller chunks and stored on different nodes in your database cluster.
As the purpose of partitioning is to spread load and scale, it is important that the data is split across the cluster in a balanced way, so that no node holds an unusually high amount of data. The partitioning strategy plays an important role here. Balanced partitions are key to a highly scalable cluster.
There are broadly two common ways in which data is sharded: Vertical and horizontal.
Vertical Sharding
Splitting data vertically means attempting to split a table by its columns - columns being the vertical aspect of the table. In the case of a relational database on a single node, this can still help slow queries on a popular table, if the table is broken down correctly. If you think about it, this is partly what database normalisation does - designing a database to reduce redundancy and increase data integrity - although normalisation is not specifically about partitioning data.
This technique is most commonly used when some columns are BLOBs - binary large objects - that aren’t accessed as frequently as the other columns in the table. Thus it makes sense to store them in a separate table and fetch them only when they are needed.
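To make this concrete, here is a minimal sketch in Python (the column names and the choice of which ones are “hot” are entirely hypothetical) of splitting one wide employee record into a frequently-read part and a rarely-read BLOB part that share the same primary key:

```python
# A minimal sketch of vertical partitioning: one wide "employee" record is
# split into hot columns (read on almost every request) and cold columns
# (large, rarely-read blobs). Table/column names here are hypothetical.

wide_row = {
    "employee_id": 42,
    "name": "Asha",
    "email": "asha@example.com",
    "profile_photo": b"...megabytes of JPEG...",   # BLOB, rarely read
    "resume_pdf": b"...megabytes of PDF...",       # BLOB, rarely read
}

HOT_COLUMNS = {"employee_id", "name", "email"}

def split_vertically(row: dict) -> tuple[dict, dict]:
    """Split a row into (hot, cold) parts that share the same primary key."""
    hot = {k: v for k, v in row.items() if k in HOT_COLUMNS}
    cold = {k: v for k, v in row.items() if k not in HOT_COLUMNS}
    cold["employee_id"] = row["employee_id"]  # keep the key to join back later
    return hot, cold

hot_part, cold_part = split_vertically(wide_row)
print(hot_part.keys())   # most queries only ever touch these columns
print(cold_part.keys())  # fetched only when the blobs are actually needed
```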
Horizontal sharding
Where the previous section dealt with the vertical aspect - column based sharding - this one is about dividing data by rows: deciding which rows go in which shard based on some value.
Horizontal sharding is commonly done in two different ways: key based and hash based sharding.
Key based sharding
This technique uses something called a partition key - a key that you already use in your database to identify an object, like an EmployeeId. All data related to Employees 1, 2, 3 and 4 could be partitioned into one shard. Most people call this range based sharding, i.e. a range of keys in one shard - I just seem to remember it as key based.
In a multi-node database, you’d split the data into different database shards, and each shard would hold the rows for a particular range of key values.
Where a foreign key based relationship is involved, all data related to the partition key is generally grouped into the same database shard - making it a multi-table shard.
Some common patterns used in multi-table sharding:
- Primary keys of tables are still unique - all the shards are still the same universe. So if a primary key uniquely identifies an object, then that object is only in a specific database shard
- The partition key is stored in a mapping table that is duplicated on every database shard in the cluster. Applications can query this table to work out which shard holds the data for an object uniquely identified by that key.
- The partition key is duplicated in all the tables to make fetching data more efficient. This needs more storage, but the gains in query efficiency almost always outweigh the cost of storage. Data routing logic in applications tends to use the partition key to map queries to the right database shard.
| Advantages | Disadvantages |
|---|---|
| Querying a range of data based on the partition key is easy to implement, as it is evident which shard holds which keys. | Querying by any key other than the partition key is a nightmare. |
| Partition keys can be stored in a specific sort order based on the needs of the application. | The partition key determines how balanced the shards are. Bad keys result in unbalanced shards, creating bottlenecks. |
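As a rough illustration of key (range) based routing - not any particular database’s API, and with made-up shard names and range boundaries - the application-side routing logic could look something like this:

```python
# A minimal sketch of key (range) based sharding: each shard owns a
# contiguous range of partition keys. The boundaries below are made up.
import bisect

# Upper bounds (exclusive) of the employee_id range each shard owns.
RANGE_UPPER_BOUNDS = [1_000, 10_000, 100_000]
SHARDS = ["shard-1", "shard-2", "shard-3"]

def shard_for(employee_id: int) -> str:
    """Pick the shard whose range contains this partition key."""
    idx = bisect.bisect_right(RANGE_UPPER_BOUNDS, employee_id)
    if idx >= len(SHARDS):
        raise ValueError("key outside all configured ranges")
    return SHARDS[idx]

print(shard_for(4))        # shard-1: ids 0..999 live together
print(shard_for(2_500))    # shard-2
print(shard_for(50_000))   # shard-3

# Range queries are cheap to route: ids 900..1_200 clearly touch only
# shard-1 and shard-2. But a query by, say, email gives no hint at all
# about which shard to ask.
```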
Hash based sharding
As the name says, this sharding technique uses a hash function. Here is the example almost any article about hash based sharding would give:
hash(employee_id) % 3 = 0 => Shard 1
hash(employee_id) % 3 = 1 => Shard 2
hash(employee_id) % 3 = 2 => Shard 3
Hash value mod n is an easy way to grasp how this could be used, where n is the number of shards.
You have probably noticed that this might be a bad idea, especially if you want to scale. When n changes, the shards have to be rebalanced, and that means data has to move from pretty much every shard!
Please don’t shard based on a plain hash and modulo. As n changes, the shard a given record maps to also changes, so adding or removing nodes would result in an expensive rebalancing operation, moving data across many shards.
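A small sketch of why this hurts: going from 3 to 4 shards with a plain hash-and-modulo scheme remaps the majority of keys (the key names here are just placeholders).

```python
# A small sketch of the rebalancing problem with hash-modulo sharding:
# when the number of shards changes, most keys map to a different shard.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"employee-{i}" for i in range(10_000)]

before = {k: shard_for(k, 3) for k in keys}   # 3 shards
after = {k: shard_for(k, 4) for k in keys}    # one shard added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed shard")  # typically ~75%
```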
Hashing to distribute data evenly
Consistent hashing is a technique used to distribute data across multiple servers (or storage nodes) in a way that minimises data movement when the number of servers changes. Minimising rebalancing ensures that scaling the cluster is not an expensive operation.
In fact, this is one of the foundational pieces of distributed systems.
How does it work?
- Map the data and the servers onto a circle - called a hash ring. A hash function is used to assign each piece of data and each server a position on this ring.
- Assign each piece of data to the nearest server found by moving clockwise around the ring from the data’s position.
- When a new server is added, only the data that now falls nearest to it (clockwise) is reassigned to it, leaving the remaining data and servers intact.
- When a server is removed, its data is reassigned to the next server in the circle. This might seem like a simple solution, but it can skew load on the system, especially when there is a large amount of data and only a few (albeit beefy) servers on the ring, one of which may suddenly have to take on up to 50% more load.
In such cases, instead of adding actual servers to the system, the same server ids are placed at multiple positions on the ring (often called virtual nodes). This gives us more opportunities to spread data evenly across the servers on the ring.
Visualising this as a circle really helps. Check out this video for a visual explanation.
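For those who prefer code to circles, here is a minimal consistent-hash ring sketch in Python, with virtual nodes as described above; it is illustrative only, not a production implementation, and the node names are made up.

```python
# A minimal consistent hashing ring with virtual nodes.
# Servers and keys are hashed onto the same ring; a key is owned by the
# first server found moving clockwise from the key's position.
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes_per_server=100):
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            for v in range(vnodes_per_server):
                self._ring.append((self._hash(f"{server}#{v}"), server))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next server."""
        pos = self._hash(key)
        idx = bisect.bisect_right(self._positions, pos) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
bigger_ring = HashRing(["node-a", "node-b", "node-c", "node-d"])

keys = [f"employee-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if ring.server_for(k) != bigger_ring.server_for(k))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly a quarter, not ~75%
```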
Fixed number of partitions
This may sound odd, but in this method the number of partitions to be created in the cluster is fixed when the database is set up. The number of partitions is always higher than the number of nodes in the cluster.
The problem here is that in the early stages, when the data is small, the partitions are also small and the data is spread thinly across them, so the same query might need data from many partitions at once.
Similarly, when the data grows, the partitions get large, and hence rebalancing and recovering from node failures become expensive.
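A minimal sketch of the idea, assuming a made-up partition count and node names: the key-to-partition mapping is fixed forever, and rebalancing only changes which node serves which partition.

```python
# A sketch of a fixed number of partitions: keys always hash into the same
# partition; rebalancing only reassigns whole partitions to nodes.
import hashlib

NUM_PARTITIONS = 64  # fixed when the cluster is set up, never changes

def partition_for(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def assign_partitions(nodes: list[str]) -> dict[int, str]:
    """Spread the fixed partitions round-robin over the current nodes.
    (A real system would move as few partitions as possible; round-robin
    here is only for brevity.)"""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

assignment = assign_partitions(["node-a", "node-b", "node-c"])

key = "employee-42"
p = partition_for(key)
print(f"{key} -> partition {p} -> {assignment[p]}")

# Adding a node changes which node serves some partitions, but never which
# partition a key belongs to - so only whole partitions move, not every key.
assignment = assign_partitions(["node-a", "node-b", "node-c", "node-d"])
print(f"{key} -> partition {p} -> {assignment[p]}")
```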
Geographic sharding
This is a form of attribute based sharding, where the attribute is the geographic location relevant to the business domain. E.g. an e-commerce application may serve customers in different parts of the world; not all regions share the same database and products, and products in one region would be local to that region.
| Advantages | Disadvantages |
|---|---|
| Reduced latency, as data is stored closer to customers in that region | Some regions may have more data or customers, so the distribution can be uneven and is dependent on the business |
| Easier to comply with region specific laws or requirements | Additional complexity involved in handling global customers making purchases across regions |
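A tiny sketch of the routing side of geographic sharding - the region codes and shard names are invented for illustration:

```python
# A tiny sketch of geographic (attribute based) sharding: the customer's
# region decides which regional shard the request is routed to.
REGION_TO_SHARD = {
    "eu": "orders-db-eu-west",
    "us": "orders-db-us-east",
    "in": "orders-db-ap-south",
}

def shard_for_customer(customer: dict) -> str:
    """Route by the region attribute; fail explicitly if it is unknown."""
    try:
        return REGION_TO_SHARD[customer["region"]]
    except KeyError:
        raise ValueError(f"no shard configured for region {customer.get('region')!r}")

print(shard_for_customer({"id": 7, "region": "eu"}))   # orders-db-eu-west
# A customer buying from another region needs extra, cross-shard handling.
```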
Dynamic sharding
This is a technique in which the number of shards in a database or distributed system can increase or decrease based on the workload.
| Advantages | Disadvantages |
|---|---|
| Number of partitions adapts to the overall data size | Rebalancing while there is active traffic to the cluster will have a negative impact on the performance of the cluster |
To elaborate on the negative impact: rebalancing means moving data between nodes across the cluster, which can cause additional latency and even conflicts. Hotspots could also be created, where some shards are overloaded while others remain underutilised.
The complexity involved in managing the shards means it is often necessary to have something like a shard controller/manager to track which shard has what data and to update routing logic in real time.
However complex rebalancing is, it is necessary to distribute data evenly, and hence it must be done. Some approaches to dealing with this complexity are to use a combination of the following: consistent hashing, replication, planning and creating additional shards ahead of time, gradual rebalancing to avoid moving data across the cluster too quickly, and a shard manager to route requests and orchestrate rebalancing.
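As a rough sketch of the shard manager idea (the class, ranges and shard names here are invented), the application asks a small routing component which shard owns a key, and a gradual rebalance is just an update to that mapping followed by the data migration:

```python
# A rough sketch of a shard manager: a single place that knows which shard
# owns which key range, so rebalancing only needs to update this mapping
# (and then migrate the affected data).
class ShardManager:
    def __init__(self):
        # key-range upper bound (exclusive) -> shard name; hypothetical values
        self._routes = [(1_000, "shard-1"), (10_000, "shard-2")]

    def shard_for(self, key: int) -> str:
        for upper, shard in self._routes:
            if key < upper:
                return shard
        raise ValueError("key not covered by any shard")

    def split_shard(self, upper: int, new_shard: str) -> None:
        """Gradual rebalancing: carve a new range out without touching others."""
        self._routes.append((upper, new_shard))
        self._routes.sort()

manager = ShardManager()
print(manager.shard_for(4_200))          # shard-2
manager.split_shard(5_000, "shard-3")    # hot range 1_000..4_999 gets its own shard
print(manager.shard_for(4_200))          # shard-3 after rebalancing
```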
Advanced Partitioning using a secondary index
So far we have been looking at partitioning using a key that is the primary identifier of a record. There are cases where that is not enough, as our queries might need an index on another attribute to be fast enough to be usable.
Some example scenarios:
- A database of users partitioned by `user_id` but also queried by `email` or some other attribute.
- Complex queries - multiple attributes involved in the query, so a primary key based index isn’t sufficient. Based on the type of joins required, an appropriate index would be used by the DBMS.
Secondary index by document - Local Index
A secondary index is built within the same partition as the primary key. Within a partition, you have local secondary indexes that allow some query optimisation. The idea is to ensure that all the data required to respond to a query is in one partition, so that there is no need to query another partition or get into complex joins between partitions. Hence the alternate name: Local Secondary Index.
| Advantages | Disadvantages |
|---|---|
| Boosts performance for queries on attributes related to the same partition key, as the data is available locally in the same partition, avoiding lookups on other partitions and data merge overheads. | Scalability limitations, as all data for a primary key is stored within the same partition. Could be mitigated with the right level of replication |
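As a rough sketch of a local (document-partitioned) secondary index - made-up structures, not any particular database’s API - each partition indexes only its own rows, so a lookup that lands on a partition is served entirely from that partition:

```python
# A rough sketch of a document-partitioned (local) secondary index:
# each partition indexes only its own rows, so a lookup that lands on a
# partition is served entirely from data stored in that partition.
class Partition:
    def __init__(self):
        self.rows = {}            # user_id -> row
        self.email_index = {}     # local secondary index: email -> user_id

    def insert(self, user_id: int, row: dict) -> None:
        self.rows[user_id] = row
        self.email_index[row["email"]] = user_id  # index and row written together

    def find_by_email(self, email: str) -> dict | None:
        user_id = self.email_index.get(email)
        return self.rows.get(user_id) if user_id is not None else None

partitions = [Partition() for _ in range(3)]

def partition_for(user_id: int) -> Partition:
    return partitions[user_id % 3]   # data is partitioned by the primary key

partition_for(42).insert(42, {"user_id": 42, "email": "a@example.com"})

# Within the owning partition the email lookup is purely local; a query that
# does not know the partition would have to ask every partition and merge.
print(partition_for(42).find_by_email("a@example.com"))
```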
Secondary index by term - Global Index
The secondary index in this case, as the name suggests, is created globally - independent of the primary partition key.
| Advantages | Disadvantages |
|---|---|
| Better optimisation for a wider range of queries - there is a choice between the primary and the secondary index | The secondary index is stored and replicated separately, demanding additional storage |
| Independent secondary index partitions ensure better distribution of load | Eventual consistency - writes also have to update the separately stored index, often asynchronously |
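And a matching sketch of a global (term-partitioned) secondary index, where the index itself is partitioned by the indexed value and maintained separately from the primary data; again the names and structures are hypothetical:

```python
# A rough sketch of a term-partitioned (global) secondary index: the index
# is partitioned by the indexed term (email), independently of how the
# primary rows are partitioned by user_id.
import hashlib

INDEX_PARTITIONS = 3
global_email_index = [dict() for _ in range(INDEX_PARTITIONS)]  # email -> user_id

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def index_partition_for(email: str) -> dict:
    return global_email_index[_hash(email) % INDEX_PARTITIONS]

def index_user(email: str, user_id: int) -> None:
    # In a real system this write may happen asynchronously, which is the
    # eventual-consistency trade-off mentioned in the table above.
    index_partition_for(email)[email] = user_id

def lookup_user_id(email: str) -> int | None:
    # Exactly one index partition needs to be consulted for a given email.
    return index_partition_for(email).get(email)

index_user("a@example.com", 42)
print(lookup_user_id("a@example.com"))   # 42 -> then fetch the row by user_id
```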