Requirements
- all critical processes on the servers must be monitored for crashes
- anomalies and overall utilization of CPU/memory/disk/network bandwidth, load average, and so on
- hardware component faults - memory failures, slow disks, CPU overheating, etc.
- network access - server-to-server communication
- networking components - switches, load balancers, routers, etc
- power consumption at the server, rack, and data center level
- …
Building blocks
Blob storage - to store our metric data
High level design
- Data store - a time-series database
- Data collector service - fetches data from each service and saves it in the data store
- Query service - an API that queries the time-series database and returns the relevant information
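As a rough sketch of what flows between these components, one metric sample could be modelled as a timestamped name/value pair plus labels that identify its source (the field names here are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
import time

@dataclass
class MetricSample:
    """One data point flowing from the collector into the time-series DB and out via the query service."""
    name: str                                             # e.g. "cpu_utilization"
    value: float                                          # the measured value
    timestamp: float = field(default_factory=time.time)   # epoch seconds
    labels: dict = field(default_factory=dict)            # e.g. {"host": "rack7-srv12", "dc": "eu-west"}

# Example: a CPU reading from one server
sample = MetricSample(name="cpu_utilization", value=0.82,
                      labels={"host": "rack7-srv12", "dc": "eu-west"})
```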
Storage
Use a time-series database to save data locally on the server where the data collector service runs, then periodically send that data to blob storage. Where there is monitoring, there is also alerting, and to support it we need to store the alerting rules somewhere - let's call this the rules and actions database. It also lives on the same node.
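A record in the rules and actions database might then look roughly like this (the condition/threshold/action fields are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """One row in the hypothetical rules and actions database."""
    metric_name: str   # which metric to watch, e.g. "disk_used_percent"
    condition: str     # comparison operator: ">", "<", ">=", "<="
    threshold: float   # value that triggers the alert
    duration_s: int    # how long the condition must hold before firing
    action: str        # e.g. "email:oncall@example.com" or "page:sre-team"

rule = AlertRule(metric_name="disk_used_percent", condition=">",
                 threshold=90.0, duration_s=300, action="page:sre-team")
```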
Data collector
There are several servers in the data center, and all of them have to be monitored. We use a pull strategy here - the collector fetches metric data from the logs on these servers.
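A minimal pull loop, assuming each monitored server exposes a snapshot of its current metrics as JSON over HTTP at a /metrics endpoint (the endpoint, port, and storage interface are assumptions):

```python
import json
import time
import urllib.request

SERVERS = ["http://10.0.0.11:9100", "http://10.0.0.12:9100"]   # hosts to scrape (illustrative)

def scrape(base_url: str) -> dict:
    """Pull the current metrics snapshot from one server."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return json.load(resp)

def collect_once(store) -> None:
    """One collection round: pull from every server and write the results to local storage."""
    for url in SERVERS:
        try:
            snapshot = scrape(url)
            store.write(url, snapshot, time.time())
        except OSError:
            # An unreachable server is itself a signal worth recording: it may have crashed.
            store.write(url, {"up": 0}, time.time())
```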
Push strategy pitfalls
If every server and service pushed metrics to a central metric collection platform, it could overload the network and create bottlenecks. A push approach also requires a daemon on every server to send the metric data, and near-real-time pushes at that scale could bring down a data center's network.
Service discoverer
As we collect data from different services, the monitoring system needs metadata to identify each service uniquely. This information is stored in a component we call the service discoverer. The suggested solution integrates with tools such as EC2, Kubernetes, and Consul for discovery.
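As one hedged example, if Consul were used, a collector could refresh its scrape targets from Consul's catalog HTTP API (the service name and agent address are illustrative):

```python
import json
import urllib.request

CONSUL = "http://localhost:8500"   # local Consul agent (assumption)

def discover_targets(service_name: str) -> list[str]:
    """Ask Consul which instances of a service exist and where they live."""
    url = f"{CONSUL}/v1/catalog/service/{service_name}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        instances = json.load(resp)
    # Each catalog entry carries the node address and the port the service registered with.
    return [f"{i['ServiceAddress'] or i['Address']}:{i['ServicePort']}" for i in instances]

# The collector would refresh its scrape list periodically, e.g.:
# targets = discover_targets("payments-api")
```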
Querying service
We have to access the db and fetch relevant query results. Alerting needs this, and so will any dashboards.
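A minimal sketch of that query layer, assuming the time-series database client exposes a range read by metric name and time window (the method and parameter names are placeholders):

```python
from dataclasses import dataclass

@dataclass
class QueryService:
    """Thin API layer over the time-series database, used by alerting and dashboards."""
    tsdb: object   # any client exposing range_read(metric, start, end) -> list of samples

    def latest(self, metric: str, host: str, window_s: int, now: float):
        """Return the most recent value of a metric for one host, or None if absent."""
        samples = self.tsdb.range_read(metric, start=now - window_s, end=now)
        mine = [s for s in samples if s.labels.get("host") == host]
        return mine[-1].value if mine else None
```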
We now have the alert manager and dashboards as consumers of the query service.
Alerts require an alert manager - a component that evaluates the metrics against the rules stored in the alert rules and actions database.
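A sketch of one evaluation pass, reusing the illustrative AlertRule and QueryService shapes from above (the notification hand-off is left abstract):

```python
import operator
import time

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def evaluate_rules(rules, query_service, hosts, notify) -> None:
    """One evaluation pass: check every rule against every host and fire actions on breaches."""
    now = time.time()
    for rule in rules:
        for host in hosts:
            value = query_service.latest(rule.metric_name, host, rule.duration_s, now)
            if value is not None and OPS[rule.condition](value, rule.threshold):
                notify(rule.action, host, rule.metric_name, value)

# notify() would hand the alert to email/paging integrations; it is left abstract here.
```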
Dashboards are used to visually depict the health of a system.
| Advantages | Disadvantages |
| --- | --- |
| Avoids overloading the network by pulling metrics | Scalability is questionable: the server that runs the monitoring service could be a single point of failure. We could set up a failover, but maintaining synchronization and consistency is a challenge. |
We are also missing a service to periodically purge or archive data from the time-series database.
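Such a purge job could be a simple periodic task that moves samples older than the local retention window to blob storage and then deletes them from the node-local time-series database (the retention period and client methods are assumptions):

```python
import json
import time

RETENTION_S = 7 * 24 * 3600   # keep one week of data locally (assumption)

def purge_and_archive(tsdb, blob_store) -> None:
    """Move expired samples to blob storage, then delete them from the local time-series DB."""
    cutoff = time.time() - RETENTION_S
    expired = tsdb.read_before(cutoff)   # everything older than the cutoff
    if expired:
        blob_store.put(f"archive/{int(cutoff)}.json", json.dumps(expired, default=vars))
        tsdb.delete_before(cutoff)
```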
Design improvement
How can we scale better? We could use a combination of the push and pull approaches. In a push approach, the application pushes its data to the monitoring system; earlier we chose pull to minimize load on the network.
Use the pull approach for servers within a data center. Assign several monitoring servers to the hundreds or thousands of servers within a data center - say, one monitoring server for every 5,000 servers - and call them secondary monitoring servers. The secondary servers push their monitoring data to a primary server, and the primary server pushes its data to a global monitoring service responsible for all the data centers spread globally.
This hierarchical, tree-like approach enables us to scale better.
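A rough sketch of the secondary tier: each secondary monitoring server pulls from its assigned hosts, pre-aggregates, and pushes a compact summary up to the primary server (the ingest endpoint and payload shape are assumptions):

```python
import json
import statistics
import urllib.request

PRIMARY_URL = "http://primary-monitor.internal:8080/ingest"   # illustrative endpoint

def push_summary(dc: str, secondary_id: str, samples: list) -> None:
    """Aggregate one batch on a secondary monitoring server and push it to the primary."""
    by_metric = {}
    for s in samples:
        by_metric.setdefault(s.name, []).append(s.value)
    summary = {
        "dc": dc,
        "secondary": secondary_id,
        "metrics": {name: {"avg": statistics.fmean(vals), "max": max(vals)}
                    for name, vals in by_metric.items()},
    }
    request = urllib.request.Request(PRIMARY_URL, data=json.dumps(summary).encode(),
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request, timeout=5)
```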
Visualizing data in a dashboard
Monitoring several thousand servers results in a huge amount of data, so showing a big-picture view on a dashboard can be daunting. One way to simplify this is to display the overall health of every server in a data center as a heatmap.
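One way to build such a heatmap is to reduce each server's latest readings to a single health colour and lay the cells out by rack and slot (the thresholds and layout are illustrative):

```python
def health_colour(cpu: float, disk: float, up: bool) -> str:
    """Map one server's latest readings to a heatmap cell colour."""
    if not up:
        return "red"
    if cpu > 0.9 or disk > 0.9:
        return "orange"
    return "green"

def build_heatmap(servers: dict) -> list:
    """servers maps (rack, slot) -> {"cpu": 0.4, "disk": 0.7, "up": True}; returns rows of colours."""
    racks = sorted({rack for rack, _ in servers})
    slots = sorted({slot for _, slot in servers})
    return [[health_colour(**servers.get((rack, slot), {"cpu": 0.0, "disk": 0.0, "up": False}))
             for slot in slots] for rack in racks]
```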