Introduction

A distributed system is inherently complex, so when something goes wrong it is difficult to pinpoint what happened unless the system has a good level of observability.

Observability is the ability to understand what is happening inside a distributed system by looking at its external outputs: in this context, logs, metrics, traces, etc.

Failures in distributed systems can be attributed to infrastructure (server, network, electrical, etc.) or to the application that we care about. If the system is not resilient, one service going down can take down multiple systems across the board.

Cost of downtime

As discussed earlier, failures can cascade easily in a distributed system. Consider the tragic story of a video upload that is lost in transit:

  • The user uploads a video from the UI.
  • The UI pushes the data to server B.
  • B writes the data to a blob store and saves the metadata in a database.
  • C syncs data from X to its replica Y.
  • Suddenly C fails, so the sync fails.
  • B still updates X, so Y does not have the data that is in X.
  • The user fetches the video from the UI.
  • The UI fetches the data from B.
  • B attempts to read the video metadata from X.
  • X is down, so the data cannot be read.
  • B fails over to Y, which also does not have the data.
  • B responds to the user saying the video was not found!
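The story can be replayed as a small simulation. This is only a sketch under assumptions: X is taken to be the primary metadata database and Y its replica, C is modelled as a plain sync function, and the names Database, sync, and read_metadata are hypothetical.

```python
class Database:
    """A toy metadata store that can be marked as down."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.up = True


def sync(primary, replica):
    """Stand-in for service C: copy every row from X to Y."""
    replica.rows.update(primary.rows)


def read_metadata(video_id, primary, replica):
    """Stand-in for B's read path: use X, fail over to Y if X is down."""
    source = primary if primary.up else replica
    return source.rows.get(video_id)  # None means "video not found"


x, y = Database("X"), Database("Y")
sync_running = False                             # C has failed

x.rows["video-1"] = {"blob": "blob://video-1"}   # B still updates X
if sync_running:
    sync(x, y)                                   # never runs, so Y stays empty

x.up = False                                     # X goes down
result = read_metadata("video-1", x, y)          # B fails over to Y
print(result)                                    # None: the video looks lost
```

Nothing in the read path reports that C stopped syncing. The only visible symptom is the missing video, which is the gap that monitoring is meant to close.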

Unplanned failures can cost a lot.

There have been many big failures:

  • 10/2021: Meta services were down for 9 hours, at a cost of $13 million per hour!
  • 12/2021: An automated AWS update triggered unexpected behaviour in clients on the internal network, resulting in a connection surge. This overwhelmed the networking devices in the internal network and the main AWS network and caused communication delays, which triggered many connection retries and led to even more congestion and performance issues. Cost estimate: $66,240 per minute.

Types of monitoring

There are two broad categories of monitoring:

  • server-side errors: 5xx errors
  • client-side errors: 4xx errors

Prereqs of monitoring

Monitoring involves measuring something, and that something should be a meaningful metric. Once you determine the metric, you need to set thresholds so that you can set up alerts that fire when the metric’s value goes beyond a threshold.
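A minimal sketch of the idea, with illustrative assumptions: the metric names, the limits, and the breached helper are all hypothetical, and a real system would choose thresholds from its own baselines.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str    # name of the metric being watched (hypothetical names below)
    limit: float   # values above this limit count as a breach


def breached(samples, thresholds):
    """Return the names of metrics whose current value exceeds its limit."""
    return [t.metric for t in thresholds
            if samples.get(t.metric, 0.0) > t.limit]


# Assumed example limits: 90% CPU and a 5% rate of 5xx responses.
thresholds = [Threshold("cpu_percent", 90.0), Threshold("error_rate_5xx", 0.05)]
current = {"cpu_percent": 97.2, "error_rate_5xx": 0.01}

print(breached(current, thresholds))  # ['cpu_percent']
```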

Conventional approach to handling failures

There are two approaches: reactive and proactive.

Reactive is when action is taken after the problem has happened. There has probably already been some downtime, and the problem may have grown worse by the time it is handled.

Proactive is when mitigation or resolution actions are taken before the problem has happened. The problem might never manifest as downtime. This sounds great in principle, but there will be misses and failures will still happen. The goal should be to find problems as early as possible so that corrective action can be taken.

Metrics

Metrics are objective measures of what is to be observed. They should provide some insight into the system. For example:

  • network performance measured using throughput (Mbps), latency, and round-trip time

Gathering metrics should not be expensive: if it is a performance-intensive operation, your application could struggle.

Some metrics are physical properties of the system:

  • CPU stats
  • cache hits and misses
  • RAM usage
  • page faults
  • disk space
  • read rates
  • write rates
  • swap space usage

Populate the metrics

Should the servers push metrics, or should the monitoring system pull metrics from the distributed servers? Push and pull here are described from the monitoring system’s perspective.
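The two options differ only in who initiates the transfer. The sketch below models both in one process, assuming hypothetical Server and Monitor classes; in practice the calls would travel over the network.

```python
import time


class Server:
    """A toy application server that counts the requests it handles."""

    def __init__(self, name):
        self.name = name
        self.requests = 0

    def handle_request(self):
        self.requests += 1

    def read_metrics(self):
        return {"server": self.name, "requests": self.requests, "ts": time.time()}


class Monitor:
    """A toy monitoring system that collects samples."""

    def __init__(self):
        self.samples = []

    def receive(self, sample):
        # Push model: the server decides when to send and calls the monitor.
        self.samples.append(sample)

    def scrape(self, servers):
        # Pull model: the monitor decides when to ask and calls each server.
        for server in servers:
            self.samples.append(server.read_metrics())


web, db = Server("web-1"), Server("db-1")
monitor = Monitor()
web.handle_request()
web.handle_request()
db.handle_request()

monitor.receive(web.read_metrics())  # push: initiated by the server
monitor.scrape([web, db])            # pull: initiated by the monitor
print(len(monitor.samples))          # 3
```

With pull, the monitor has to know which servers exist. With push, every server has to know where the monitor is.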

What about logging?

Logging is when the app servers write information to a file: any metric, event, or other property that we deem useful. The metrics that we measure can be derived from the values in the logs. Log processing is time-consuming. Logs can be kept temporarily on the server to cushion data spikes or to decouple the data-generating systems from the monitoring systems.
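As an illustration of deriving a metric from logs, the sketch below counts responses by status class. The log format (timestamp, method, path, status code) and the sample lines are made up for the example.

```python
from collections import Counter

# Hypothetical access-log lines: "<timestamp> <method> <path> <status>".
LOG_LINES = [
    "2021-10-04T15:40:01Z GET /videos/1 200",
    "2021-10-04T15:40:02Z GET /videos/2 404",
    "2021-10-04T15:40:03Z POST /videos 503",
    "2021-10-04T15:40:04Z GET /videos/1 500",
    "malformed line",
]


def status_class_counts(lines):
    """Count responses per status class (2xx, 4xx, 5xx, ...)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 4 or not parts[-1].isdigit():
            continue  # skip lines that do not match the assumed format
        counts[f"{int(parts[-1]) // 100}xx"] += 1
    return counts


counts = status_class_counts(LOG_LINES)
server_error_rate = counts["5xx"] / sum(counts.values())
print(dict(counts), server_error_rate)
```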

Persistence

The large volume of data generated by multiple systems would benefit from being stored in a time series database. Such a database stores sample data from the servers with a timestamp, in chronological sequence, creating an event log.
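To make the storage model concrete, here is a toy in-memory version. It is a sketch, not how any particular time series database works: TimeSeriesStore and its methods are hypothetical, and timestamps are plain numbers.

```python
import bisect
from collections import defaultdict


class TimeSeriesStore:
    """Keeps (timestamp, value) samples per metric in chronological order."""

    def __init__(self):
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def append(self, metric, timestamp, value):
        # Insert at the sorted position so late-arriving samples stay ordered.
        timestamps = self._timestamps[metric]
        index = bisect.bisect_right(timestamps, timestamp)
        timestamps.insert(index, timestamp)
        self._values[metric].insert(index, value)

    def query(self, metric, start, end):
        """Return samples with start <= timestamp <= end, oldest first."""
        timestamps = self._timestamps[metric]
        lo = bisect.bisect_left(timestamps, start)
        hi = bisect.bisect_right(timestamps, end)
        return list(zip(timestamps[lo:hi], self._values[metric][lo:hi]))


store = TimeSeriesStore()
store.append("cpu", 100, 0.4)
store.append("cpu", 160, 0.9)
store.append("cpu", 130, 0.6)          # arrives late, still stored in order
print(store.query("cpu", 120, 160))    # [(130, 0.6), (160, 0.9)]
```

Keeping samples sorted by timestamp makes a range query a pair of binary searches, which matches how monitoring data is usually read: by metric and time window.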

Application metrics

Instrumentation is embedding logging or monitoring code in our applications.
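One common way to instrument code is to wrap functions so that each call records a measurement. The sketch below assumes a hypothetical in-process METRICS registry and a timed decorator; a real application would hand the samples to its monitoring client instead.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registry: metric name -> list of observed durations.
METRICS = defaultdict(list)


def timed(name):
    """Record how long each call to the wrapped function takes, in seconds."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are measured too.
                METRICS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("upload_video_seconds")
def upload_video(size_bytes):
    return f"stored {size_bytes} bytes"


upload_video(1024)
print(len(METRICS["upload_video_seconds"]))  # 1
```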

Alerting

When metrics cross a threshold, alarm bells start ringing: ideally, this pages the person on call.
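A sketch of such an alert follows. Two details are assumptions added for illustration, not something the text prescribes: the alert waits for several consecutive breaches before paging, and it pages only once per incident. The Alert class and the page callback are hypothetical.

```python
class Alert:
    """Pages once when a metric stays above its limit for N consecutive samples."""

    def __init__(self, limit, consecutive, page):
        self.limit = limit
        self.consecutive = consecutive   # assumed rule to avoid paging on a blip
        self.page = page                 # callback that notifies the on-call person
        self._breaches = 0
        self._firing = False

    def observe(self, value):
        if value > self.limit:
            self._breaches += 1
        else:
            # Back under the limit: reset so a later incident can page again.
            self._breaches = 0
            self._firing = False
        if self._breaches >= self.consecutive and not self._firing:
            self._firing = True
            self.page(f"value {value} above {self.limit}")


pages = []
alert = Alert(limit=90, consecutive=3, page=pages.append)
for cpu_percent in [95, 96, 50, 91, 92, 93, 94]:
    alert.observe(cpu_percent)

print(pages)  # ['value 93 above 90']
```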