Once your system is distributed across multiple servers, data centres, or even continents, a new challenge emerges: how do you know what’s happening? Is everything running smoothly? Where is the bottleneck? Which component just failed?

Distributed monitoring is the art and science of observing complex systems. It’s your eyes and ears in production, helping you detect problems before users notice them and debug issues when they inevitably occur.

What You’ll Learn

In this section, we’ll explore:

  • Metrics collection: What to measure and how to measure it
  • Logging: Capturing and aggregating logs from distributed services
  • Tracing: Following requests as they flow through your system
  • Alerting: Knowing when something goes wrong (without crying wolf)
  • Dashboards: Visualising system health at a glance
  • Distributed monitoring architecture: Building scalable monitoring systems
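To preview the alerting topic above, here is a minimal sketch (all names hypothetical) of a flap-resistant alert rule: it fires only when several consecutive samples breach the threshold, so a single noisy reading doesn't page anyone — the "crying wolf" problem.

```python
from collections import deque

# Hypothetical flap-resistant alert rule: fire only when the last N
# samples ALL breach the threshold, so a single noisy reading is ignored.
class ThresholdAlert:
    def __init__(self, threshold: float, consecutive: int = 3):
        self.threshold = threshold
        self.window = deque(maxlen=consecutive)  # sliding window of samples

    def observe(self, value: float) -> bool:
        """Record a sample; return True when the alert should fire."""
        self.window.append(value)
        return (len(self.window) == self.window.maxlen
                and all(v > self.threshold for v in self.window))

# One brief CPU spike, then sustained high load: only the latter fires.
alert = ThresholdAlert(threshold=0.9, consecutive=3)
for cpu in [0.95, 0.40, 0.95, 0.96, 0.97]:
    if alert.observe(cpu):
        print("ALERT: sustained high CPU")
```

Real alerting systems add more machinery (durations, severities, routing, silences), but the core idea of requiring sustained breaches is the same.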

Why Monitoring Matters

In a distributed system, things fail all the time. Servers crash, networks hiccup, databases slow down. Without proper monitoring:

  • You won’t know about problems until users complain
  • You’ll waste hours debugging without the right information
  • You can’t optimise what you can’t measure
  • You have no idea if your “fix” actually fixed anything

Good monitoring is the difference between a 5-minute recovery and a 5-hour outage.

Real-World Applications

Every reliable system includes comprehensive monitoring:

  • Netflix monitors thousands of metrics to ensure smooth streaming
  • Google tracks page load times to optimise search results
  • Amazon monitors checkout flow to detect and fix issues instantly

Let’s learn how to build monitoring systems that actually help!


Introduction

A distributed system is inherently complex, so when something goes wrong it is difficult to pinpoint what happened unless there is a good level of observability. Observability is the ability to understand what is happening inside a distributed system by examining its external outputs: in this context, logs, metrics, and traces. Failures in distributed systems can stem from the infrastructure (servers, networks, power, and so on) or from the application itself. If the system is not resilient, a single failing service can cascade and take down multiple services across the board. ...
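To make "external outputs" concrete, here is a minimal sketch (all names hypothetical) of emitting a structured log line that carries a trace ID, so that log entries from different services handling the same request can be correlated later:

```python
import json
import logging
import time
import uuid

# Hypothetical helper: emit a structured (JSON) log line tagged with a
# trace ID, so entries from different services can be correlated.
def log_event(service: str, trace_id: str, message: str, **fields) -> str:
    entry = {
        "ts": time.time(),     # when it happened
        "service": service,    # which service emitted it
        "trace_id": trace_id,  # trace: follows the request across services
        "message": message,    # log: what happened
        **fields,              # extra context (paths, amounts, ...)
    }
    line = json.dumps(entry)
    logging.getLogger(service).info(line)
    return line

# Simulate one request flowing through two services.
trace_id = uuid.uuid4().hex
log_event("api-gateway", trace_id, "request received", path="/checkout")
log_event("payment-service", trace_id, "charge attempted", amount_cents=4200)
```

A log aggregator can then filter on `trace_id` to reconstruct the full path of a single request through the system, which is the essence of distributed tracing.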

Requirements

  • All critical processes on the servers must be monitored for crashes
  • Anomalies and overall values in CPU, memory, disk, and network bandwidth, average load, and so on
  • Hardware component faults: memory failures, slow disks, CPU overheating, etc.
  • Network access: server-to-server communication
  • Networking components: switches, load balancers, routers, etc.
  • Power consumption at the server, rack, and data-centre level
…

Building Blocks

  • Blob storage: to store information about our metrics!
...
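As a small illustration of per-host collection, here is a standard-library-only sketch that samples two of the values listed above (disk usage and average load); a real monitoring agent would also sample CPU, memory, and network counters and ship everything to central storage such as the blob store mentioned above. Note that `os.getloadavg` is POSIX-only.

```python
import os
import shutil

# Sketch of a per-host metrics collector (names hypothetical).
# Samples disk usage and load average using only the standard library.
def collect_host_metrics(path: str = "/") -> dict:
    usage = shutil.disk_usage(path)          # total/used/free bytes
    load1, load5, load15 = os.getloadavg()   # 1/5/15-minute load (POSIX only)
    return {
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "disk_free_bytes": usage.free,
        "load_avg_1m": load1,
        "load_avg_5m": load5,
        "load_avg_15m": load15,
    }

print(collect_host_metrics())
```

In practice such a collector runs on an interval (e.g. every 10 seconds) on every host, and the resulting time series feed the dashboards and alerting rules discussed in this section.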

Client-Side Errors

Consider the case where an ISP accidentally released different internet routes, and Google customers could not reach Google. This was the result of a BGP leak. The Border Gateway Protocol (BGP) is the routing protocol that connects the entire internet. Within a LAN, routing is easy, but as the network grows it gets complicated, so routing across the internet requires special methods. Large organisations and ISPs manage internet connectivity for multiple network sites and locations; these are often called Autonomous Systems. ...