Intelligent monitoring with a new general-purpose metrics monitor
In the database team at CERN, we have developed a general-purpose metrics monitor, a missing part in our next generation monitoring infrastructure.
In the implemented metrics monitor, metrics can come from several sources like Apache Kafka, new metrics can be defined combining other metrics, different analysis can be applied, notifications, configuration can be updated without restarting, it can detect missing metrics, ...
This tool was introduced at Spark Summit 2017 conference, you can watch the talk here.
The tool has been open-sourced and can be found at GitHub.
The monitoring team at CERN provides a flexible infrastructure to collect and access metrics coming from all machines and services deployed in the organization.
Metrics are collected on the machines using Lemon and Collectd. Metrics are sent to Kafka making use of Apache Flume. From Kafka, data is sunk to different services such as HDFS (archive), Elastic and InfluxDB (visualization and querying). This infrastructure can be observed in the following image.
If provided monitoring/alerting mechanisms are not enough, or metrics are collected with Collectd, users needs to develop a solution which consumes metrics from any of the provided interfaces. This is what we have done, develop a metrics monitor tool that consumes metrics from Kafka.
What about existing tools?
Monitoring tools like Prometheus and Graphite offers an entire monitoring system: collection, transport, visualization, alerting, etc. However, we are just missing a “small” piece, a monitor tool that can make use of the current infrastructure and services provided.
What features the metrics monitor tool should have?
- Notifications/alerts need to be raised when abnormal behaviour is observed, so it needs to offer real-time notifications.
- The tool should be able to aggregate metrics coming from different systems/machines.
- It should be able to consume metrics from different systems. Specially Kafka if real-time is required.
- It should make use of current services.
- Configuration should be refreshed without restarting. Monitor configuration could be frequently updated.
- Ideally, it should have machine-learning capabilities. So that it does not require much configuration.
The answer to these requirements: a new open-source general-purpose metrics monitor
The developed tool offers:
- Several metric sources can be declared.
- Analysis results and notifications can be sunk/sent to external services.
- New metrics can be defined. Mathematical operations can be applied. Value of defined metrics can be computed by aggregating different incoming metrics.
- Multiple monitors can be declared, each monitor has a metric filter, a metric analysis and notificators.
- Modularity. The tool has the following components: properties source, metrics source, analysis, analysis results sink, notificator and notification sink. They can be easily replaced and new can be easily implemented.
- Some built-in components: Kafka source, different analysis, Elastic sinks, notificators, ...
- Monitors and defined metrics configuration can be updated while running. Configuration could come from an external source.
- Detection of missing metrics.
We have based our implementation on Apache Spark, a data analysis engine which have shown a good performance, reliability and scalability. It offers real-time analysis with its Streaming API. Since the implementation is based on based on Spark, it can be deployed as stand-alone or in a distributed fashion on any of the Hadoop (YARN) clusters that the IT department provides.
Thanks to the modularity of the implementation, we can make use of the current services or different ones in the future. A metric source has been implemented to consume from Kafka. Analysis results can be sunk to Elastic and later be visualized with Kibana and Grafana (services provided by IT). Notifications can be sunk to Elastic or a new sink can be implemented to send emails or trigger actions.
In the following image can be observed the data-flow in the implemented Spark job.
Several metric sources can be configured. Metrics collected by these sources joint with new defined metrics are sent to all the configured monitors.
A monitor defines the metric or group of metrics to which it monitors by using a filter. This filter acts on the different attributes the metrics have. An analysis is configured, it is the mechanism that compute thresholds and determines the status of each of the incoming metrics.
In order to understand why a notification has been raised, you may want to look at the results of the analysis. For that, analysis results can be sunk to an external service like Elastic to later visualize the results.
Finally, each monitor has notificators. Notificators implement the logic that determines when a notification is raised. These logic is based in the status of the metric computed by the previous analysis.
Notifications raised by notificators are sent to a notifications sink. This sink can send the notifications to Elastic, send emails, trigger actions, etc.
The aggregation of metrics is a key feature in the tool. You can aggregate different metrics or aggregate historical values of a single metric. This can be done in the tool by defining new metrics.
To introduce this feature, I will use some examples where I show the corresponding configuration. For the following examples, we consider that incoming metrics has the following attributes:
- CLUSTER_NAME: if the machine belongs to a cluster, this attribute contains the cluster name.
- HOSTNAME: host from which the metric is coming.
- METRIC_NAME: type of metric, describes what is being measured.
Compute the ratio read/write for all machines:
metrics.define.ratio_read_write.value = readbytes / writebytes
metrics.define.ratio_read_write.metrics.groupby = CLUSTER_NAME HOSTNAME
metrics.define.ratio_read_write.variables.readbytes.filter.attribute.METRIC_NAME = Read Bytes Per Sec
metrics.define.ratio_read_write.variables.writebytes.filter.attribute.METRIC_NAME = Write Bytes Per Sec
Count running machines per cluster (machines not sending metric expire in 5 minutes):
metrics.define.cluster-machines-running.metrics.groupby = CLUSTER_NAME
metrics.define.cluster-machines-running.when = BATCH
metrics.define.cluster-machines-running.variables.value.filter.attribute.METRIC_NAME = "running"
metrics.define.cluster-machines-running.variables.value.aggregate = count
metrics.define.cluster-machines-running.variables.value.expire = 5m
Average metrics coming from the all the machines of each cluster:
metrics.define.avg-metric-per-cluster.metrics.groupby = CLUSTER_NAME, METRIC_NAME
metrics.define.avg-metric-per-cluster.variables.average-value.aggregate = avg
Count values received during the last 10 minutes (a monitor can be configured to detect if 0, so metric is missing):
metrics.define.missing-metric.metrics.groupby = ALL
metrics.define.missing-metric.when = BATCH
metrics.define.missing-metric.variables.value.aggregate = count
metrics.define.missing-metric.variables.value.expire = 10m
Compute weighted average (latest values higher weight) over the values collected during the last 5 minutes for all the metrics:
metrics.define.avg-5m.metrics.groupby = ALL
metrics.define.avg-5m.variables.value.aggregate = weighted-avg
metrics.define.avg-5m.variables.value.expire = 5m
This example would produce what you can see in the following image:
In each monitor, the analysis determines the status of the incoming metrics, either metrics that come from any source or metrics that have been defined. Thresholds are computed for each metric, if metric value exceeds any threshold, status of the metric becomes warning or error.
Even though new analysis can be developed and integrated, the tool has some built-in ones that can be configured. You can configure a basic analysis where you establish the thresholds or go for other analysis that learn from the previous behaviour of the metric. One of them, recent analysis, compute the average and variance over the recent activity and compute the thresholds from that.
This can be configured with:
monitor.<monitor-id>.analysis.type = recent monitor.<monitor-id>.analysis.period = 1h
Other analysis, seasonal, expect metrics that repeatedly do the same, every hour, day or week.
This can be configured with:
monitor.<monitor-id>.analysis.type = seasonal monitor.<monitor-id>.analysis.season = hour
As you can observe, at the top of the plots, metric status changes when value exceeds the computed thresholds.
Different notificators are also offered with the tool, they provide the logic to raise notifications.
One of them is named "percentage", where you configure a period of time, percentage and unexpected statuses. A notification is raised if the metric has been in configured statuses for the percentage of the configured period.
An example of a raised notification can be observed in the following image.
The final product is a metrics monitor which can interact with different services. Its modularity allows a wide range of possibilities.
Some machine-learning capabilities make the configuration easy while maintaining an intelligent mechanism of alerting for your systems.
The tool is publicly available in GitHub. Feel free to make use of it, report bugs or suggest changes.
Links of interest
GitHub repo: https://github.com/cerndb/metrics-monitor
Presentation at Spark Summit 2017: www.youtube.com/watch?v=1IsMMmug5q0&t=11m17s