Introduction
A distributed system is inherently complex, so when something goes wrong it is difficult to pinpoint what happened unless the system has a good level of observability. Observability is the ability to understand what is happening inside a distributed system by looking at its external outputs: in this context, logs, metrics, and traces. Failures in distributed systems can originate in the infrastructure (servers, network, power, and so on) or in the application that we care about. If the system is not resilient, one service going down can take down multiple others across the board. ...

Requirements
The following must be monitored:
- all critical processes on the servers, for crashes
- anomalies and overall values in CPU, memory, disk, and network bandwidth usage, average load, and so on
- hardware component faults: memory failure, slow disk, CPU overheating, etc.
- network access: server-to-server communication
- networking components: switches, load balancers, routers, etc.
- power consumption at server, rack, and data centre level
…

Building blocks
Blob storage - to store info about our metrics! ...
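To make the resource-usage requirement above a little more concrete, here is a minimal sketch of a threshold check over collected samples. The `Sample` type, the metric names, and the threshold values are all hypothetical and chosen only for illustration; a real monitoring system would collect these values from agents and would likely use more than a fixed threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    host: str
    metric: str   # hypothetical metric name, e.g. "cpu_percent"
    value: float


# Hypothetical alerting thresholds, in percent of capacity.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "disk_percent": 95.0,
}


def find_breaches(samples, thresholds=THRESHOLDS):
    """Return the samples whose value exceeds the threshold for their metric."""
    return [
        s for s in samples
        if s.metric in thresholds and s.value > thresholds[s.metric]
    ]


samples = [
    Sample("web-1", "cpu_percent", 97.5),
    Sample("web-1", "memory_percent", 40.0),
    Sample("db-1", "disk_percent", 96.0),
]

breaches = find_breaches(samples)
for b in breaches:
    print(f"ALERT {b.host}: {b.metric}={b.value}")
```

Keeping the thresholds in a plain dictionary makes it easy to add further metrics (network bandwidth, average load) without touching the check itself.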

Client side errors
One example is a case where an ISP accidentally announced incorrect internet routes, and as a result Google customers could not reach Google. This was caused by a BGP leak. The Border Gateway Protocol (BGP) is the routing protocol that connects networks across the entire internet. Within a LAN, routing is easy, but as the network grows it gets complicated, so routing across the internet relies on dedicated protocols such as BGP. Large organisations and ISPs manage internet connectivity for multiple network sites and locations; each such independently managed network is often called an Autonomous System. ...
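As a rough illustration of why a leaked route can make a service unreachable, the sketch below models route selection by longest-prefix match, which is how routers choose between overlapping routes. It assumes the leaked announcement is a more specific prefix than the legitimate one, which is only one of the ways a leak can attract traffic. The prefixes and AS numbers are taken from the ranges reserved for documentation, and `best_route` is a hypothetical helper, not part of any BGP implementation.

```python
import ipaddress


def best_route(dest, routes):
    """Return the (network, asn) route covering dest with the longest prefix, or None."""
    addr = ipaddress.ip_address(dest)
    candidates = [(net, asn) for net, asn in routes if addr in net]
    if not candidates:
        return None
    return max(candidates, key=lambda route: route[0].prefixlen)


# Legitimate announcement: AS 64500 originates 203.0.113.0/24.
legit = [(ipaddress.ip_network("203.0.113.0/24"), 64500)]

# Hypothetical leak: AS 64511 announces a more specific /25 inside that range.
leaked = legit + [(ipaddress.ip_network("203.0.113.0/25"), 64511)]

before = best_route("203.0.113.10", legit)
after = best_route("203.0.113.10", leaked)

print("before leak, traffic goes to AS", before[1])
print("after leak, traffic goes to AS", after[1])
```

Once the more specific route is present, traffic for the affected addresses follows it, even though the legitimate route is still being announced.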