What are the fallacies of Distributed Systems
- Network is reliable
- Latency is zero
- Bandwidth is infinite
- network is secure
- Topology doesn’t change - network topology - nodes in the network remain the same
- There is one administrator
- transport cost is zero
- The network is homogenous - all systems in the network uses a single network architecture and OS
Failures are inevitable
The reason I stated the fallacies is to highlight the fact that in distributed systems, failures are inevitable. There are plenty of factors at play and hence, it is importance to ensure your system can handle faults and failures. Yes, I said fault and failure, they mean different things. Faults lead to failures. It is important to understand this cause and effect relationship in mind when talking through your system design interview. While a fault is a node crash, a failure is a service being unavailable. I mention this here just to give some background before we delve into failure models.
The spectrum of failures
Failures are classified from easiest to difficult to deal with spectrum. Listing it out in order of increasing difficulty:
- fail stop
- crash
- omission
- temporal
- byzantine
Fail stop
This is a scenario where in a distributed system, one node has crashed permanently while others nodes can still detect that node and communicate with it.
Crash
In this scenario, a node has crashed and other nodes in the system cannot detect this crash either, thus they might think that sending data to this node would continue the work but in fact it isn’t the case.
Omission failures
In this scenario a node either fails to send or receive a particular message or a whole range of messages. A failure to respond to an incoming request is called send omission failure whereas a failure to receive a request, is receive omission failure. Now this might be very much caused by a node crash.
Temporal failures
In temporal failures, a node in the system responds correctly but too late for the response to be useful. This could be due to bad algorithms or design strategy or a loss of clock sync between nodes in the distributed system.
Byzantine failures
In this scenario a node in the system exhibits random behaviour like sending random messages at random times, responding with the wrong results, or ignore requests. Due to this nature, it is the hardest to deal with and if often the result of a malicious attack.
Summary
Why know this? I am trying to figure that out too.