Real-time visualisation of Hadoop resources

At CERN we run multiple Hadoop clusters to satisfy demanding requirements from our experiments and accelerator communities. The usage and criticality of the clusters are increasing dramatically as more users look to Hadoop to process and archive the vast amounts of data coming out of the LHC.

As Hadoop administrators, we are sometimes faced with questions like: 1) Do we have enough capacity (CPU and memory) to add new workloads/users? 2) Why are some user applications getting killed? 3) When does preemption kick in? 4) Do we have sufficient resources to satisfy a particular user community's SLA under peak conditions? Having this information available, even after the fact, would greatly help with right-sizing the cluster, capacity planning, and understanding workload patterns and resource consumption.

Hadoop Metrics

YARN, the resource manager for Hadoop 2.0, has REST APIs that allow us to get information about the cluster: its overall status and metrics, information about the nodes, scheduler information, and information about the applications running on it. If we collect this information continuously over time, it can answer most of our questions and possibly provide further insight into how the Hadoop clusters are used. Since we want the monitoring solution to be flexible, extensible, and zero-configuration for the Hadoop cluster, we decided to use logstash to collect, parse, and transform the metrics, elasticsearch to store them, and kibana for data visualisation.
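For illustration, the endpoints we poll look like the following (the hostname is a placeholder; the resource manager web UI is assumed to be on its default port 8088):

# overall cluster metrics (memory, vcores, containers, applications)
curl http://resourcemanager:8088/ws/v1/cluster/metrics
# information about the nodes in the cluster
curl http://resourcemanager:8088/ws/v1/cluster/nodes
# applications and their resource usage
curl http://resourcemanager:8088/ws/v1/cluster/apps
# scheduler and queue information
curl http://resourcemanager:8088/ws/v1/cluster/scheduler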

The rest of the post explains how we built the solution and should enable you to implement it yourself if you are faced with similar challenges.

Collect, Parse and Transform

We use logstash to collect, process, and forward events. We collect the metrics from the cluster, node, application, and scheduler REST endpoints; the polling interval can be adjusted to your needs.

[Image: logstash input configuration]
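A minimal sketch of such an input, assuming logstash with the http_poller plugin (the hostname, endpoints, and polling interval are placeholders to adapt to your cluster):

input {
  http_poller {
    # YARN resource manager REST endpoints to poll
    urls => {
      cluster_metrics => "http://resourcemanager:8088/ws/v1/cluster/metrics"
      apps            => "http://resourcemanager:8088/ws/v1/cluster/apps"
    }
    # poll once a minute; amend as per your needs
    schedule => { "every" => "60s" }
    # the endpoints return JSON
    codec => "json"
  }
}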

The following transformations and enrichment are applied to the events that come out of the REST API:

- split the JSON array
- remove the outer elements of JSON
- data type conversion
- data transformation (MB -> bytes) so that kibana can format the number as bytes

[Image: logstash filter configuration]
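As an illustrative sketch, a filter along these lines performs the steps above for the applications endpoint (the field names follow the /cluster/apps JSON; the ruby filter assumes the logstash 5+ event API):

filter {
  # /cluster/apps returns {"apps": {"app": [...]}} - split the array so
  # that each application becomes a separate event
  split { field => "[apps][app]" }
  # data type conversion
  mutate {
    convert => { "[apps][app][allocatedMB]" => "integer" }
  }
  # data transformation: MB -> bytes so that kibana can format the number as bytes
  ruby {
    code => "event.set('[apps][app][allocatedBytes]', event.get('[apps][app][allocatedMB]').to_i * 1024 * 1024)"
  }
}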

Finally, we forward the events to elasticsearch for indexing.
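A corresponding output block could look like this (the elasticsearch host and index name are placeholders, matching the examples below):

output {
  elasticsearch {
    hosts => ["myelastic:9200"]
    # one index per endpoint, e.g. the application metrics go to apps-d
    index => "apps-d"
  }
}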

Store

The events are stored in elasticsearch, with the events from each endpoint stored in a separate index. Each index has a mapping type that defines the format, the data types, and the string fields that should be not_analyzed. The following is an example of an index mapping:

curl -XPUT myelastic:9200/apps-d/_mapping/apps -d '
{
  "properties": {
    "[apps][app][startedTime]": {
      "type": "date",
      "format": "strict_date_optional_time||epoch_millis"
    },
    "[apps][app][finishedTime]": {
      "type": "date",
      "format": "strict_date_optional_time||epoch_millis"
    },
    "[apps][app][queue]": {
      "type": "string"
    },
    "[apps][app][amContainerLogs]": {
      "type": "string",
      "index": "not_analyzed"
    },
    "[apps][app][name]": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}'

Visualise

This is the interesting part, where you can build visualisations on top of the data you index into elasticsearch. You can build your own or load the dashboards and visualisations from the GitHub repository, as shown below:

curl -XPOST 'http://myelastic:9200/.kibana/visualization/MyFirstVisual' -d @filename

The resulting dashboards and visualisations look like the following:

[Images: kibana dashboards with visualisations of the collected Hadoop metrics]

and they make it easy to spot interesting patterns in cluster usage:

[Image: kibana visualisation]

All the code is available in the GitHub repository. By changing the YARN resource manager and elasticsearch endpoints, you should be able to get this up and running in your organisation. If you are interested, please create a pull request to contribute and collaborate on this work.

Summary

The metrics exposed through the YARN and Spark REST APIs can be successfully exploited to give Hadoop administrators and management more information to aid in capacity planning and right-sizing the Hadoop cluster.
