Requirements
- all critical processes on the servers must be monitored for crashes
- anomalies and overall utilization of CPU/memory/disk/network bandwidth, load average, and so on
- hardware component faults - memory failures, slow disks, CPU overheating, etc.
- network access - server-to-server communication
- networking components - switches, load balancers, routers, etc
- power consumption at the server, rack, and data center level
- …
Building blocks
Blob storage - to store our metric data
High level design
- Data store - a time-series database
- Data collector service - fetches data from each service and saves it in the data store
- Query service - an API that queries the time-series database and returns the relevant information
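As a rough sketch of what flows between these components, one metric sample could be modelled as a timestamped name/value pair plus labels that identify its source (the field names here are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
import time

@dataclass
class MetricSample:
    """One data point flowing from the collector into the time-series DB and out via the query service."""
    name: str                                             # e.g. "cpu_utilization"
    value: float                                          # the measured value
    timestamp: float = field(default_factory=time.time)   # epoch seconds
    labels: dict = field(default_factory=dict)            # e.g. {"host": "rack7-srv12", "dc": "eu-west"}

# Example: a CPU reading from one server
sample = MetricSample(name="cpu_utilization", value=0.82,
                      labels={"host": "rack7-srv12", "dc": "eu-west"})
```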
Storage
Use a time-series database to save data locally on the server where the data collector service runs, then periodically send that data to blob storage. Where there is monitoring, there is also alerting, and to support it we need to store the alerting rules somewhere - let's call this the rules and actions database. It also lives on the same node.
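A record in the rules and actions database might then look roughly like this (the condition/threshold/action fields are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """One row in the hypothetical rules and actions database."""
    metric_name: str   # which metric to watch, e.g. "disk_used_percent"
    condition: str     # comparison operator: ">", "<", ">=", "<="
    threshold: float   # value that triggers the alert
    duration_s: int    # how long the condition must hold before firing
    action: str        # e.g. "email:oncall@example.com" or "page:sre-team"

rule = AlertRule(metric_name="disk_used_percent", condition=">",
                 threshold=90.0, duration_s=300, action="page:sre-team")
```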
Data collector
There are several servers in the data center, and all of them have to be monitored. We use a pull strategy here - the collector fetches metric data from the logs on these servers.
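A minimal pull loop, assuming each monitored server exposes a snapshot of its current metrics as JSON over HTTP at a /metrics endpoint (the endpoint, port, and storage interface are assumptions):

```python
import json
import time
import urllib.request

SERVERS = ["http://10.0.0.11:9100", "http://10.0.0.12:9100"]   # hosts to scrape (illustrative)

def scrape(base_url: str) -> dict:
    """Pull the current metrics snapshot from one server."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return json.load(resp)

def collect_once(store) -> None:
    """One collection round: pull from every server and write the results to local storage."""
    for url in SERVERS:
        try:
            snapshot = scrape(url)
            store.write(url, snapshot, time.time())
        except OSError:
            # An unreachable server is itself a signal worth recording: it may have crashed.
            store.write(url, {"up": 0}, time.time())
```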
Push strategy pitfalls
If every server and service pushed metrics to a central metric collection platform, it could overload the network and create bottlenecks. A push approach also requires a daemon on every server to send the metric data, and near-real-time pushes at that scale could bring down a data center's network.
Service discoverer
As we collect data from different services, the monitoring system needs metadata to identify each service uniquely. This information is stored in a component we call the service discoverer. The suggested solution integrates with tools such as EC2, Kubernetes, and Consul for discovery.
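As one hedged example, if Consul were used, a collector could refresh its scrape targets from Consul's catalog HTTP API (the service name and agent address are illustrative):

```python
import json
import urllib.request

CONSUL = "http://localhost:8500"   # local Consul agent (assumption)

def discover_targets(service_name: str) -> list[str]:
    """Ask Consul which instances of a service exist and where they live."""
    url = f"{CONSUL}/v1/catalog/service/{service_name}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        instances = json.load(resp)
    # Each catalog entry carries the node address and the port the service registered with.
    return [f"{i['ServiceAddress'] or i['Address']}:{i['ServicePort']}" for i in instances]

# The collector would refresh its scrape list periodically, e.g.:
# targets = discover_targets("payments-api")
```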
Querying service
We have to access the db and fetch relevant query results. Alerting needs this, and so will any dashboards.
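A minimal sketch of that query layer, assuming the time-series database client exposes a range read by metric name and time window (the method and parameter names are placeholders):

```python
from dataclasses import dataclass

@dataclass
class QueryService:
    """Thin API layer over the time-series database, used by alerting and dashboards."""
    tsdb: object   # any client exposing range_read(metric, start, end) -> list of samples

    def latest(self, metric: str, host: str, window_s: int, now: float):
        """Return the most recent value of a metric for one host, or None if absent."""
        samples = self.tsdb.range_read(metric, start=now - window_s, end=now)
        mine = [s for s in samples if s.labels.get("host") == host]
        return mine[-1].value if mine else None
```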
We now have the alert manager and dashboards as consumers of the query service.
Alerts require an alert manager - a component that evaluates the metrics against the rules stored in the alert rules and actions database.
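A sketch of one evaluation pass, reusing the illustrative AlertRule and QueryService shapes from above (the notification hand-off is left abstract):

```python
import operator
import time

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def evaluate_rules(rules, query_service, hosts, notify) -> None:
    """One evaluation pass: check every rule against every host and fire actions on breaches."""
    now = time.time()
    for rule in rules:
        for host in hosts:
            value = query_service.latest(rule.metric_name, host, rule.duration_s, now)
            if value is not None and OPS[rule.condition](value, rule.threshold):
                notify(rule.action, host, rule.metric_name, value)

# notify() would hand the alert to email/paging integrations; it is left abstract here.
```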
Dashboards are used to visually depict the health of a system.
| Advantages | Disadvantages |
| --- | --- |
| Avoids overloading the network by pulling metrics | Scalability is questionable: the server that runs the monitoring service could be a single point of failure. We could set up a failover, but maintaining synchronization and consistency is a challenge. |
We are also missing a service to periodically purge or archive data from the time-series database.
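Such a purge job could be a simple periodic task that moves samples older than the local retention window to blob storage and then deletes them from the node-local time-series database (the retention period and client methods are assumptions):

```python
import json
import time

RETENTION_S = 7 * 24 * 3600   # keep one week of data locally (assumption)

def purge_and_archive(tsdb, blob_store) -> None:
    """Move expired samples to blob storage, then delete them from the local time-series DB."""
    cutoff = time.time() - RETENTION_S
    expired = tsdb.read_before(cutoff)   # everything older than the cutoff
    if expired:
        blob_store.put(f"archive/{int(cutoff)}.json", json.dumps(expired, default=vars))
        tsdb.delete_before(cutoff)
```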
Design improvement
How can we scale better? We could use a combination of the push and pull approaches. In a push approach, the application pushes its data to the monitoring system; earlier we chose pull to minimize load on the network.
Use the pull approach for servers within a data center. Assign several monitoring servers to the hundreds or thousands of servers within a data center - say, one monitoring server for every 5,000 servers - and call them secondary monitoring servers. The secondary servers push their monitoring data to a primary server, and the primary server pushes its data to a global monitoring service responsible for all the data centers spread globally.
This hierarchical, tree-like approach enables us to scale better.
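A rough sketch of the secondary tier: each secondary monitoring server pulls from its assigned hosts, pre-aggregates, and pushes a compact summary up to the primary server (the ingest endpoint and payload shape are assumptions):

```python
import json
import statistics
import urllib.request

PRIMARY_URL = "http://primary-monitor.internal:8080/ingest"   # illustrative endpoint

def push_summary(dc: str, secondary_id: str, samples: list) -> None:
    """Aggregate one batch on a secondary monitoring server and push it to the primary."""
    by_metric = {}
    for s in samples:
        by_metric.setdefault(s.name, []).append(s.value)
    summary = {
        "dc": dc,
        "secondary": secondary_id,
        "metrics": {name: {"avg": statistics.fmean(vals), "max": max(vals)}
                    for name, vals in by_metric.items()},
    }
    request = urllib.request.Request(PRIMARY_URL, data=json.dumps(summary).encode(),
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request, timeout=5)
```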
Visualizing data in a dashboard
Monitoring several thousand servers results in a huge amount of data, so showing a big-picture view on a dashboard can be daunting. One way to simplify this is to display the overall health of every server in a data center as a heatmap.
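One way to build such a heatmap is to reduce each server's latest readings to a single health colour and lay the cells out by rack and slot (the thresholds and layout are illustrative):

```python
def health_colour(cpu: float, disk: float, up: bool) -> str:
    """Map one server's latest readings to a heatmap cell colour."""
    if not up:
        return "red"
    if cpu > 0.9 or disk > 0.9:
        return "orange"
    return "green"

def build_heatmap(servers: dict) -> list:
    """servers maps (rack, slot) -> {"cpu": 0.4, "disk": 0.7, "up": True}; returns rows of colours."""
    racks = sorted({rack for rack, _ in servers})
    slots = sorted({slot for _, slot in servers})
    return [[health_colour(**servers.get((rack, slot), {"cpu": 0.0, "disk": 0.0, "up": False}))
             for slot in slots] for rack in racks]
```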