Fault Tolerance

Fault tolerance is the ability of a system to continue performing its intended function even when one or more of its components fail.

These components could be software or hardware. It is practically impossible to build a 100% fault tolerant system.

The attributes of a fault-tolerant system

  1. Continuity of service: the system doesn’t crash just because a file wasn’t found or a disk I/O failed
  2. Graceful degradation: if the system cannot perform at 100%, it drops non-essential features and keeps doing the most essential ones
  3. No single point of failure: a fault-tolerant system is architected so that no individual component is so critical that its demise ends the entire system’s operation

Does fault tolerance make a system highly available?

This is nuanced, but you could generalise and say that fault tolerance improves availability.

High availability of a system as a whole is about aiming for maximum uptime. It is usually measured as percentage availability, expressed as a number of nines: 99.9%, 99.99%, 99.999% and so on. High availability is generally accomplished using two strategies: redundancy and rapid failover.

Fault tolerance is about ensuring that partial failures of a system remain transparent to the user - i.e. the user doesn’t even realise something is broken.

What about resilience?

It is easy to generalise and claim that fault tolerance makes a system resilient. In conversation that’s acceptable and understandable. However, in high-level software architecture discussions, they represent two different philosophies of handling issues.

Where fault tolerance focuses on avoiding the impact of a failure, resilience focuses on recovering from it. It is like two sides of the same coin.

How to achieve fault tolerance

Let us take a look at some strategies to achieve fault tolerance; which ones apply depends on your existing system’s architecture.

Isolation and Containment

The idea here is to prevent a single failure from cascading like a domino effect. A bit like quarantining during the pandemic to avoid spreading the virus.

Circuit Breakers

You are familiar with this at home - you probably have circuit breakers near your electrical main switchboard. The idea is to prevent the system from repeatedly trying to execute an operation that is known to fail.

A certain error threshold is configured into the system, so the circuit knows when to trip. It trips when the system encounters more errors than the threshold allows. Once tripped, the system stops trying to execute the failing operation, bypasses it entirely, and gets back to what it was doing without wasting any further resources.
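A minimal sketch of the idea in Python (the class and parameter names here are illustrative, not a standard API): after a configured number of consecutive failures the circuit "opens" and calls go straight to a fallback, until a reset timeout elapses.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips OPEN after `threshold` consecutive
    failures, then skips the operation until `reset_timeout` seconds pass."""

    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is CLOSED

    def call(self, operation, fallback):
        # While OPEN, bypass the failing operation entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # timeout elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            return fallback()
        self.failures = 0  # a success resets the error count
        return result
```

Production-grade implementations add a half-open state with limited trial traffic, but the trip-and-bypass mechanics are the same.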

Bulkheading

The term comes from shipping: a bulkhead is an upright wall (a partition) within the hull of a ship - the hull being the watertight body of the ship. These upright walls partition the hull into smaller, room-like spaces.

Now apply that to our system architecture. Partition parts of the service or resources, like thread pools or databases, into isolated groups such that if one group fails, the others aren’t impacted.
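One common way to build a bulkhead in code is to cap concurrency per dependency with a semaphore; here is a small sketch (the class name and error message are illustrative):

```python
import threading

class Bulkhead:
    """Caps concurrent calls into one dependency so a slow or failing
    group can't exhaust the threads shared by everything else."""

    def __init__(self, max_concurrent):
        self.sem = threading.Semaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: if this compartment is full, fail fast
        # instead of queueing up and starving the other compartments.
        if not self.sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full - rejecting call")
        try:
            return operation()
        finally:
            self.sem.release()
```

Each downstream dependency would get its own `Bulkhead` instance, so a flood of slow calls to one service can only ever occupy that service’s compartment.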

Sidecars

Another imported term. Know sidecars on motorcycles? A sidecar isn’t the main motorcycle but a thing attached to it - close enough to be part of it, yet still separated by some metal, and it moves along with the motorcycle. This is exactly what the sidecar pattern is.

Applications often require related functionality like logging, monitoring, configuration and networking services. These peripheral services are excellent candidates to be closely associated with your application yet still somewhat isolated. That’s what the sidecar pattern allows you to do: decoupled just enough that a failure in the sidecar doesn’t cascade into the main application.

Request Management

This strategy is about handling noisy neighbours.

Load shedding

When a system is near capacity, it is better to reject further requests than to try handling more and crash the nodes the application is running on.
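A sketch of the idea, assuming a simple in-flight request counter (the class name and the "503" response shape are illustrative):

```python
class LoadShedder:
    """Rejects requests outright once the number of requests
    in flight reaches a configured capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0

    def handle(self, request_handler):
        if self.in_flight >= self.capacity:
            return ("503", "shed")  # reject immediately; the node stays alive
        self.in_flight += 1
        try:
            return ("200", request_handler())
        finally:
            self.in_flight -= 1
```

A fast "503, try later" keeps the node healthy; accepting the request and timing out under load would take the whole node down.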

Rate limiting and throttling

Restricting the number of requests a client can make to a service within a certain time window, to ensure resources are utilised fairly. You are probably familiar with your broadband provider throttling your broadband speeds.
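One classic way to implement this is a token bucket; here is a minimal sketch (the parameter names are illustrative): a client may burst up to `capacity` requests, and tokens refill at `rate` per second.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`
    requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token on this request
            return True
        return False
```

A real service would keep one bucket per client (per API key or IP), typically in a shared store like Redis.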

Retries with exponential backoff and Jitter

Retrying blindly is a guaranteed way to crash the entire system, so retries must be implemented cleverly. With exponential backoff, each retry waits progressively longer than the last. On top of that, a randomised delay called jitter prevents all clients from hitting the failing server at the same time. These are simple concepts, but they help a lot to stop failures escalating.
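The combination can be sketched in a few lines (the function and parameter names are illustrative); this uses the "full jitter" variant, where the actual sleep is a random value between zero and the exponential backoff:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5,
                       base_delay=0.5, max_delay=30.0):
    """Retries `operation`, doubling the backoff each attempt and
    sleeping a random fraction of it so clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts - surface the failure
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))  # full jitter
```

In practice you would retry only transient errors (timeouts, 503s), not permanent ones like a 404 or a validation failure.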

Graceful degradation

The strategy here is to fail with grace. Like a ballet dancer falling, even in moments of fall, it looks like a dance!

Fallback to alternative

If the primary service fails, respond with a cached response instead of erroring out. This won’t work in every scenario, but it does in some.

Asynchronous Decoupling

This is event-driven architecture. What I mean is: introduce asynchrony into the mix - think queues, message buses, etc. Consumers consume messages when they can; producers keep producing when they can. No waiting on each other.
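In-process, the same decoupling can be sketched with a thread-safe queue (a stand-in here for a real message broker; the event names are made up): the producer never blocks on the consumer, and a `None` sentinel signals shutdown.

```python
import queue
import threading

# The queue decouples the two sides: the producer fires and forgets,
# and a slow consumer only delays itself, not the producer.
events = queue.Queue()
processed = []

def consumer():
    while True:
        msg = events.get()
        if msg is None:  # sentinel: shut down cleanly
            break
        processed.append(msg.upper())  # stand-in for real work

worker = threading.Thread(target=consumer)
worker.start()

# Producer keeps producing at its own pace.
for msg in ("order-placed", "payment-received"):
    events.put(msg)
events.put(None)
worker.join()
```

With a real broker (RabbitMQ, Kafka, SQS) the queue also survives a consumer crash, so messages produced during the outage are processed once the consumer recovers.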

Redundancy and recovery

Failover to the passive instance

Three things I should mention here for your preparation:

  1. Cold standby
  2. Warm standby
  3. Hot standby

Remember all these are redundancy strategies.

Cold standby is a strategy that keeps a secondary system offline or dormant until it is needed. It isn’t operational until the primary fails. It is a cost effective redundancy strategy in non critical scenarios.

Warm standby is a strategy that keeps the standby system partially active and updated with critical data from the primary, but it doesn’t serve live traffic until the primary fails. Failover is faster than with the cold standby strategy.

Hot standby is a strategy that keeps both the primary and the standby fully operational, constantly in sync with each other and fully prepared to take over in case of a failure. This is an expensive strategy, best suited to mission-critical systems.

When your system has redundancy, you can configure it for failover, wherein one system is always the primary, serving requests, while the other waits on standby. You could also have a setup where all replicas are active, aka the Active-Active configuration. In this setup, if a node fails, the load balancer redistributes traffic among the remaining nodes.

However, all these strategies introduce some level of complexity to the system. Redundancy introduces the problem of keeping replicas up to date and consistent with each other. Replication and consistency lead to a conversation much like the CAP theorem: which two of the three guarantees does your system need?

Redundancy also involves having every node able to signal to its neighbours, or to a central orchestrator, that it is alive, ready, or dead. These heartbeats and health-checks help keep the system highly available.
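The heartbeat side of this can be sketched as a monitor that marks any node dead if it hasn’t reported in within a timeout (the class and node names are illustrative):

```python
import time

class HeartbeatMonitor:
    """Considers a node dead if no heartbeat has arrived
    within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, node):
        # Each node calls this periodically to say "I'm alive".
        self.last_seen[node] = time.monotonic()

    def alive_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t < self.timeout]
```

An orchestrator or load balancer would poll `alive_nodes()` and route traffic, or trigger a failover, accordingly.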

Checkpointing and recovery

If you have used Windows as an operating system, you have probably seen the concept of creating system restore checkpoints. Restoring from a checkpoint is a really quick way to return your system to a state before the fault occurred, and many systems offer a similar feature: automatic backups from which the system’s state can be restored to what it was before the fault.
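The core mechanic is small enough to sketch (the class name is illustrative): snapshot the state before risky work, and roll back to the last snapshot when something goes wrong.

```python
import copy

class Checkpointed:
    """Keeps snapshots of state so processing can roll back
    to the last good checkpoint after a fault."""

    def __init__(self, state):
        self.state = state
        self.checkpoints = []

    def checkpoint(self):
        # Deep copy so later mutations can't corrupt the snapshot.
        self.checkpoints.append(copy.deepcopy(self.state))

    def restore(self):
        # Roll back to the most recent checkpoint.
        self.state = self.checkpoints.pop()
```

Real systems persist checkpoints to durable storage so they survive a crash of the process itself, but the snapshot-and-restore cycle is the same.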

You can read more about this online on 1Library - Fault Tolerance Techniques