Fault Tolerance
Fault tolerance is the ability of a system to keep operating even when one or more of its components fail.
These components could be software or hardware. It is practically impossible to build a 100 percent fault tolerant system.
A highly available system is generally highly fault tolerant too.
How to achieve fault tolerance
There are several popular ways to achieve fault tolerance, and which ones fit depends on your existing system’s architecture. Two common ones are described here.
Replication
In this technique, the key is to replicate both the service and its data. Whenever a node becomes faulty, swap it with a healthy one, or replace the data source with a replica, possibly one restored from a backup.
But this goes back to the same old problem of keeping replicas up to date, and how you do that depends on your requirements, that is, on where your system needs to sit on the consistency spectrum. Replication and consistency lead to the same conversation as the CAP theorem (consistency, availability, partition tolerance) and which two of the three you need for your system.
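To make the failover idea and the stale-replica problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `Node` and `ReplicatedStore` are hypothetical in-memory classes, not a real library, and a real system would also need failure detection, re-synchronization of recovered nodes, and a defined consistency policy.

```python
class Node:
    """Hypothetical in-memory storage node that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True

    def put(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Writes go to every healthy node; reads fail over to the next node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def put(self, key, value):
        # Try every replica and count how many accepted the write.
        written = 0
        for node in self.nodes:
            try:
                node.put(key, value)
                written += 1
            except ConnectionError:
                continue  # skip the faulty node, keep serving
        if written == 0:
            raise RuntimeError("no healthy replica accepted the write")
        return written

    def get(self, key):
        # Return the answer from the first node that responds.
        for node in self.nodes:
            try:
                return node.get(key)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("no healthy replica available")


a, b, c = Node("a"), Node("b"), Node("c")
store = ReplicatedStore([a, b, c])
store.put("k", 1)      # all three replicas hold k=1
a.healthy = False      # node a fails
store.put("k", 2)      # only b and c receive the update
a.healthy = True       # node a comes back without being re-synced
print(store.get("k"))  # node a answers first with the old value
```

The system keeps working while node `a` is down, which is the fault tolerance part. The last line shows the cost: the recovered node serves an outdated value until something brings it back in sync, which is exactly the consistency question raised above.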
Checkpointing
If you have used Windows as an operating system, you have probably seen the concept of system restore checkpoints. A checkpoint is a quick way to return your system to the state it was in before the fault occurred. Most systems that support automatic restore from a backup offer something similar, because restoring brings the system’s state back to what it was before the fault.
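The same idea applies to application code. The sketch below is a simplified illustration, not how Windows implements restore points: a long-running job periodically saves its progress to a JSON file and, after a crash, resumes from the last saved state instead of starting over. The function names, the `fail_at` parameter used to simulate a crash, and the file name are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint,
    so a crash mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process(items, path, checkpoint_every=2, fail_at=None):
    """Sum items, saving progress every checkpoint_every items.
    fail_at simulates a crash just before processing that index."""
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.ckpt.json")
    try:
        process([1, 2, 3, 4, 5], ckpt, fail_at=3)  # crashes partway through
    except RuntimeError:
        pass
    print(process([1, 2, 3, 4, 5], ckpt))  # resumes from the checkpoint
```

Work done after the last checkpoint is lost and repeated on restart, so the checkpoint interval is a trade-off between the overhead of saving state and the amount of work redone after a fault.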
You can read more about this online on 1Library - Fault Tolerance Techniques