At CERN we run multiple Hadoop clusters to satisfy the demanding requirements of our experiments and accelerator communities. The usage and criticality of the clusters are increasing dramatically as more users look to Hadoop to process and archive the vast amounts of data coming out of the LHC.
As Hadoop administrators, we are sometimes faced with questions such as: 1) Do we have enough capacity (CPU and memory) to add new workloads or users? 2) Why are some user applications getting killed? 3) When is preemption kicking in? 4) Do we have sufficient resources to satisfy a particular user community's SLA under peak conditions? Having this information available for retrospective analysis would greatly help with right-sizing the cluster, capacity planning, and understanding workload patterns and resource consumption.
YARN, the resource manager for Hadoop 2.0, has REST APIs that allow us to get information about the cluster, such as its status, its nodes, its metrics, the scheduler and the applications running on it. Collected continuously over time, this information would answer most of our questions and possibly provide more insight into how the Hadoop clusters are used. Since we want this monitoring solution to be flexible, extensible and to require no configuration changes on the Hadoop cluster, we decided to use logstash to collect, parse and transform the metrics, elasticsearch to store them and kibana to visualise them.
The rest of this post explains how we built the solution, so that you can implement the same if you face similar challenges.
Collect, Parse and Transform
We use logstash to collect, process and forward events. We collect metrics from the cluster, nodes, applications and scheduler APIs; the polling interval can be adjusted as needed.
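To make the collection step concrete, here is a minimal Python sketch of the same idea (our pipeline does this in logstash, not Python). The host name `myresourcemanager`, port 8088, the 60-second interval and the use of `/ws/v1/cluster/metrics` for the cluster source are assumptions made for the example.

```python
import json
import time
from urllib.request import urlopen

# Assumed ResourceManager address; 8088 is the usual default web port.
RM_URL = "http://myresourcemanager:8088"

# YARN ResourceManager REST paths for the four sources we poll.
ENDPOINTS = {
    "cluster": "/ws/v1/cluster/metrics",
    "nodes": "/ws/v1/cluster/nodes",
    "apps": "/ws/v1/cluster/apps",
    "scheduler": "/ws/v1/cluster/scheduler",
}


def endpoint_url(name, base=RM_URL):
    """Return the full URL for one of the named endpoints."""
    return base.rstrip("/") + ENDPOINTS[name]


def fetch(name, base=RM_URL, opener=urlopen):
    """Fetch one endpoint and decode its JSON body."""
    with opener(endpoint_url(name, base), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def poll(handle, interval=60, rounds=None, fetcher=fetch, sleep=time.sleep):
    """Call handle(name, payload) for every endpoint, once per interval.

    rounds=None polls forever; a number limits the iterations.
    """
    done = 0
    while rounds is None or done < rounds:
        for name in ENDPOINTS:
            handle(name, fetcher(name))
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
```

Calling `poll(print, interval=300)` would print every payload every five minutes; in practice `handle` would pass the payload on to the transformation step.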
The following transformations and enrichments are applied to the events that come out of the REST API:
- split the JSON array
- remove the outer elements of JSON
- data type conversion
- data transformation (MB -> bytes) so that kibana can format the number as bytes
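The four steps listed above can be sketched in Python for the applications endpoint, whose response has the shape `{"apps": {"app": [...]}}`. This is an illustration, not our logstash filter: the choice of fields and the decision to store the byte value under a new `...Bytes` key are assumptions.

```python
MB = 1024 * 1024

# Hypothetical choice of fields; adjust to the metrics you chart.
INT_FIELDS = ("allocatedVCores", "runningContainers")
MB_FIELDS = ("allocatedMB",)


def transform_apps(payload):
    """Turn one /ws/v1/cluster/apps response into a list of flat events."""
    # Remove the outer elements: {"apps": {"app": [...]}} -> [...]
    # ("apps" is null when no application matches, hence the `or`).
    apps = (payload.get("apps") or {}).get("app") or []
    events = []
    # Split the JSON array: one event per application.
    for app in apps:
        event = dict(app)
        # Data type conversion: make sure counters are integers.
        for field in INT_FIELDS:
            if field in event:
                event[field] = int(event[field])
        # MB -> bytes, stored under a new key (allocatedMB -> allocatedBytes)
        # so that kibana can format the number as bytes.
        for field in MB_FIELDS:
            if field in event:
                event[field[:-2] + "Bytes"] = int(event[field]) * MB
        events.append(event)
    return events
```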
Finally we forward the events to elasticsearch for indexing.
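In our setup logstash takes care of the forwarding. For readers who want to see what indexing a batch of events involves, the sketch below builds a request body for the elasticsearch bulk API. The index `apps-d` and type `apps` are the ones used in the mapping command in this post; the `_type` field assumes an elasticsearch version that still has mapping types.

```python
import json


def bulk_body(events, index, doc_type):
    """Build an elasticsearch _bulk request body (newline-delimited JSON)."""
    lines = []
    for event in events:
        # Action line first, then the document itself.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(event))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The resulting string would be sent with a POST to the `_bulk` endpoint of the elasticsearch host.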
The events are stored in elasticsearch, and the events from each endpoint are stored in a separate index. Each index has a mapping type that defines the format and datatype of the fields, and which string fields should be not_analyzed. The following is an example of an index mapping:
curl -XPUT myelastic:9200/apps-d/_mapping/apps -d '
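As an illustration of the kind of mapping described, and not the mapping from our repository, the Python sketch below builds a mapping body with not_analyzed strings, numeric fields and a date field. The field names are hypothetical picks from the YARN applications API, and the `string` / `not_analyzed` syntax assumes an elasticsearch version older than 5.x.

```python
import json


def build_mapping(doc_type, string_fields, long_fields, date_fields):
    """Return a mapping body with exact-match strings, longs and dates."""
    properties = {}
    for name in string_fields:
        # not_analyzed keeps the whole value as a single term.
        properties[name] = {"type": "string", "index": "not_analyzed"}
    for name in long_fields:
        properties[name] = {"type": "long"}
    for name in date_fields:
        properties[name] = {"type": "date"}
    return {doc_type: {"properties": properties}}


# Hypothetical field selection for the applications index.
mapping = build_mapping(
    "apps",
    string_fields=["user", "queue", "state"],
    long_fields=["allocatedMB", "allocatedVCores"],
    date_fields=["startedTime"],
)
print(json.dumps(mapping, indent=2))
```

Keeping fields such as `user` and `queue` not_analyzed means kibana aggregates on the full value instead of on individual tokens.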
This is the interesting part, where you build visualisations on top of the data you have indexed in elasticsearch. You can build your own, or load the dashboards and visualisations from the GitHub repository as below:
curl -XPOST 'http://myelastic:9200/.kibana/visualization/MyFirstVisual' -d @filename
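If you have many visualisation files to load, the curl call above can be repeated in a loop. The Python sketch below prepares one POST request per file; it assumes, hypothetically, that each visualisation is a separate `.json` file in one directory and that the file name (without extension) is used as the visualisation id.

```python
import json
from pathlib import Path
from urllib.request import Request


def visualisation_requests(directory, es_url="http://myelastic:9200"):
    """Build one POST request per exported visualisation JSON file."""
    requests = []
    for path in sorted(Path(directory).glob("*.json")):
        body = path.read_bytes()
        json.loads(body.decode("utf-8"))  # fail early on malformed JSON
        url = "{}/.kibana/visualization/{}".format(es_url, path.stem)
        requests.append(Request(url, data=body, method="POST",
                                headers={"Content-Type": "application/json"}))
    return requests
```

Each returned request can then be sent with `urllib.request.urlopen(request)`.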
They look as shown below.
From them you can identify some interesting patterns.
All the code is available in the GitHub repository. By changing the Hadoop namenode and elasticsearch endpoints you should be able to get this up and running at your organisation. If you are interested, please create a pull request to contribute and collaborate on this work.
The metrics exposed through the YARN and Spark REST APIs can be used to give Hadoop administrators and management more information to aid in capacity planning and right-sizing the Hadoop cluster.