Failure Models in Distributed Systems

What are the fallacies of Distributed Systems

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn’t change - the set of nodes in the network and the links between them stay the same
There is one administrator
Transport cost is zero
The network is homogeneous - all systems in the network use a single network architecture and OS

Failures are inevitable

The reason I stated the fallacies is to highlight the fact that in distributed systems, failures are inevitable. There are plenty of factors at play, so it is important to ensure your system can handle faults and failures. Yes, I said fault and failure; they mean different things. Faults lead to failures. Keep this cause-and-effect relationship in mind when talking through your system design interview. For example, a node crash is a fault, whereas the service becoming unavailable as a result is a failure. I mention this here just to give some background before we delve into failure models.

The spectrum of failures

Failures are classified along a spectrum, from the easiest to the hardest to deal with. In order of increasing difficulty:

fail stop
crash
omission
temporal
byzantine
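
As a rough illustration of the ordering above, here is a minimal Python sketch. The names `FailureModel` and `covers` are my own, not from any library, and the sketch assumes the usual reading of the spectrum: a system designed for a harder class also copes with the easier ones.

```python
from enum import IntEnum


class FailureModel(IntEnum):
    """Failure classes, numbered in order of increasing difficulty."""
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TEMPORAL = 4
    BYZANTINE = 5


def covers(designed_for: FailureModel, observed: FailureModel) -> bool:
    """True if a design targeting `designed_for` also covers `observed`."""
    return observed <= designed_for


print(covers(FailureModel.BYZANTINE, FailureModel.CRASH))  # True
print(covers(FailureModel.CRASH, FailureModel.OMISSION))   # False
```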

Fail stop

In this scenario, one node in the distributed system has halted permanently, and the other nodes can reliably detect that it has stopped. Because the failure is detectable, they can stop sending it work and route around it.

Crash

In this scenario, a node has crashed but the other nodes in the system cannot reliably detect the crash. They may keep sending data to it in the belief that the work will continue, when in fact it will not.
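
Since a crash cannot be observed directly, a common workaround is to suspect a node once it has been silent for too long. The sketch below is a minimal heartbeat-based detector. The class `HeartbeatDetector`, its methods, and the 3-second timeout are assumptions made for illustration, and time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` was heard from at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> list[str]:
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("node-a", now=0.0)
detector.heartbeat("node-b", now=0.0)
detector.heartbeat("node-a", now=2.0)   # node-b stays silent
print(detector.suspected(now=4.0))      # ['node-b']
```

Note that this only produces a suspicion: a slow node or a slow network looks exactly like a crashed node from the outside.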

Omission failures

In this scenario, a node fails to send or receive a particular message, or a whole range of messages. Failing to send a response to an incoming request is called a send omission failure, whereas failing to receive an incoming request is a receive omission failure. A node crash is one possible cause of both.
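
One common way to mask omission failures is to resend a message until it is acknowledged. Here is a minimal sketch, assuming a hypothetical `send` callable that returns `None` when either the message or its reply is lost. `make_lossy_channel` is a stand-in I wrote for the example, not a real network API.

```python
from typing import Callable, Optional

Send = Callable[[str], Optional[str]]


def send_with_retry(send: Send, message: str, max_attempts: int = 3) -> Optional[str]:
    """Resend `message` until an acknowledgement arrives or attempts run out."""
    for _ in range(max_attempts):
        ack = send(message)  # None models an omitted message or reply
        if ack is not None:
            return ack
    return None


def make_lossy_channel(drops: int) -> Send:
    """Hypothetical channel that silently drops the first `drops` messages."""
    remaining = drops

    def send(message: str) -> Optional[str]:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return None
        return f"ack:{message}"

    return send


print(send_with_retry(make_lossy_channel(drops=2), "write x=1"))  # ack:write x=1
print(send_with_retry(make_lossy_channel(drops=5), "write x=1"))  # None
```

Retrying means the receiver may see the same message more than once, so in practice the operation needs to be safe to repeat.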

Temporal failures

In temporal (timing) failures, a node in the system responds correctly but too late for the response to be useful. This could be due to inefficient algorithms, a poor design strategy, or a loss of clock synchronization between nodes in the distributed system.
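
To make "correct but too late" concrete, here is a small sketch that checks a response against a deadline. The `Response` type, the `classify` function, and the 0.5-second deadline are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    value: str
    requested_at: float  # when the request was sent, in seconds
    received_at: float   # when the response arrived, in seconds


def classify(response: Response, deadline: float) -> str:
    """A correct value still counts as a failure if it misses the deadline."""
    elapsed = response.received_at - response.requested_at
    return "ok" if elapsed <= deadline else "timing failure"


print(classify(Response("42", 0.0, 0.2), deadline=0.5))  # ok
print(classify(Response("42", 0.0, 1.5), deadline=0.5))  # timing failure
```

Both responses carry the same correct value; only the second one is useless to a caller that stopped waiting after half a second.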

Byzantine failures

In this scenario, a node in the system exhibits arbitrary behaviour, such as sending random messages at random times, responding with the wrong results, or ignoring requests. Because of this, it is the hardest failure to deal with, and it is often the result of a malicious attack.
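
One building block for coping with wrong answers is to ask several replicas and only trust a value that enough of them agree on. The sketch below assumes at most `max_faulty` replicas misbehave and that correct replicas all return the same value. Under those assumptions, any value reported by at least `max_faulty + 1` replicas must have come from at least one correct replica. The function name `accept_reply` is hypothetical, and this is not a full Byzantine agreement protocol.

```python
from collections import Counter
from typing import Optional


def accept_reply(replies: list[str], max_faulty: int) -> Optional[str]:
    """Return a value only if at least max_faulty + 1 replicas reported it."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= max_faulty + 1 else None


print(accept_reply(["42", "42", "17", "42"], max_faulty=1))  # 42
print(accept_reply(["42", "17", "99", "13"], max_faulty=1))  # None
```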

Summary

Why know this? I am trying to figure that out too.
