Distributed systems always bring new challenges for administrators and users. This is the case with HDFS, the default distributed file system that Hadoop uses for storing data.
To face these challenges, tools that facilitate the administration and usage of these systems are developed. At CERN we provide a Hadoop service, and we have developed and deployed several such tools on our clusters. Today we present one of them.
This tool aims at showing a detailed view of the metadata about data blocks and replicas stored in HDFS. It also builds a heat map showing the distribution of replicas across hosts and disks. The tool we are introducing today is open source and can be found on GitHub at https://github.com/cerndb/hdfs-metadata.
Organisation of data in HDFS
HDFS is meant to store very big files. Files bigger than the configured block size are split into blocks; smaller files are contained in a single block. Since HDFS is a distributed file system, each of these blocks is replicated (the default replication factor is 3) and stored on different machines. From the hardware point of view, data stored in HDFS is distributed across the machines where a DataNode process is running. Each of these machines can have one or more disks attached where the actual data is stored. The chunks of data stored on the disks are called replicas.
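The arithmetic behind this splitting can be sketched as follows. This is illustrative only: the class and method names are ours, and the 128 MB block size is an assumption (the real value comes from the cluster's block size setting).

```java
public class BlockSplit {
    // Example block size; the actual value is a cluster configuration setting
    // (128 MB is the Hadoop 2 default).
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // A file of the given length is split into ceil(length / blockSize) blocks;
    // a file smaller than one block occupies a single block.
    static long numBlocks(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Each block is stored as replicationFactor replicas (3 by default),
    // placed on different DataNodes.
    static long numReplicas(long fileLength, int replicationFactor) {
        return numBlocks(fileLength) * replicationFactor;
    }
}
```

For example, a 1 GiB file with the default settings is split into 8 blocks, stored as 24 replicas in total.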
When could this tool be useful?
We have developed this tool because, on several occasions, we have missed such a tool when dealing with HDFS. Some of these cases, and other scenarios where this tool can help, are explained below.
From the user's point of view, we find this tool useful for understanding how work will be distributed. Hadoop's data processing layer (MapReduce) runs tasks on the nodes where the data needed by those tasks is located; data can also be processed remotely, but MapReduce tries to avoid that. With this tool you can see where tasks will run: if a MapReduce job's input directory is passed as an argument, a heat map of the data distribution across hosts and disks for that directory is shown, making it easy to observe how the work will be distributed.
There are tools and frameworks that work on top of HDFS. Some of them make use of data locality (as MapReduce does) to move processing tasks to the machines where the data is located. This is the case for Apache Impala, a distributed SQL database developed by Cloudera. Impala makes use of data locality, but it requires HDFS to be configured in such a way that Impala can gather the required metadata. If HDFS is not configured properly, you will see a warning when you run your queries: "Unknown disk id. This will negatively affect performance. Check your HDFS settings to enable block location metadata.". We faced this problem even though HDFS was configured properly. How did we solve it? Using this tool! When we ran it against the Impala table directory, we could easily observe that one of the DataNodes was not reporting disk ids for its replicas. After restarting the problematic DataNode and invalidating metadata from the Impala shell, the problem disappeared.
From an administration point of view the tool can also be very useful. From the heat map it builds, you can spot disks or hosts that are under-utilized or over-utilized. Disks or hosts that are not utilized at all are also shown, so a failed disk or host can be quickly identified and repaired. You can also notice misconfigurations, such as DataNodes that are not utilized at all but are still included in the configuration (dfs.includes file).
Not only is the overall distribution of replicas shown; the tool also prints detailed metadata per block and replica. You can find out where every single replica is located (computer centre, rack, host and disk), its length and offset, whether it is cached, etc.
How does it work?
First of all, the list of all files contained (recursively) in the specified directory (or a single file) is collected, and block/replica metadata is gathered for every file. An incredibly long list of blocks can be collected here (imagine running this tool against the root directory /), so the list is limited to 20000 block locations. That amount is representative enough to build a heat map, and we do not expect to print detailed metadata for such a long list of blocks.
Every replica carries information about the disk where it is stored (VolumeId/DiskId), but only if the "dfs.datanode.hdfs-blocks-metadata.enabled" configuration parameter is set to true on every DataNode. Replicas located on DataNodes where this is not properly configured are reported with an unknown disk id; for these replicas the host is known, but not the disk.
Once block/replica metadata is collected, the list of data directories (disks) is obtained from the configuration. We read it from the "dfs.datanode.data.dir" parameter, falling back to "dfs.data.dir" for compatibility with older versions. The list of data directories is split by commas, and each directory gets its DiskId assigned sequentially. It is worth mentioning that this configuration is read on the machine where the tool runs, and the parameter can differ between machines. In our case, and in most deployments, the configuration is the same on all machines, so the data directories correspond to the proper directories.
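The sequential DiskId assignment described above can be sketched like this (class and method names are illustrative, not the tool's actual code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DiskIds {
    // Split the comma-separated value of dfs.datanode.data.dir
    // (or dfs.data.dir on older versions) and assign each data
    // directory a sequential DiskId, in the order they appear.
    static Map<String, Integer> assignDiskIds(String dataDirsProperty) {
        Map<String, Integer> diskIds = new LinkedHashMap<>();
        int id = 0;
        for (String dir : dataDirsProperty.split(",")) {
            diskIds.put(dir.trim(), id++);
        }
        return diskIds;
    }
}
```

This is why the ids are only meaningful if all DataNodes list their data directories in the same order, which is the usual case when the same configuration is deployed everywhere.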
With the information about blocks and replicas, the number of replicas per host and disk is computed. This data structure only contains information for the specified directory, so not all hosts and disks may be represented. To avoid that, the disk information is filled in using the previous list of data directories, and the host list is filled in with the list of DataNode hosts. The list of DataNode hosts is first collected from the API that returns the active DataNodes; note that this operation can only be performed with admin privileges. If the user running the tool does not have such privileges, the list of hosts is read from the dfs.includes configuration file. Notice that this file contains both active and inactive DataNodes, so the heat map may include inactive nodes. In cases where such deeper analysis needs to be performed, it is most likely done by a Hadoop administrator, who will have the required privileges.
The computed heat map is then printed. It shows how replicas are distributed across hosts and disks (notice that if a DiskId is greater than 9, the numbers should be read vertically, for formatting reasons). The average number of replicas per disk is calculated (Average column), as well as the total count per host (Count). If a disk does not contain any replica, it is represented by 0. If it contains more than 20 percent above the average, it is represented by +; if it contains more than 20 percent below the average, by -; otherwise (close to the average), by =. These symbols are coloured using ANSI escape sequences, so an ANSI-compatible console should be used to display the heat map properly.
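The symbol logic can be sketched as follows. This is a minimal sketch, assuming the thresholds are 20 percent above and below the per-host average; the names are illustrative, not the tool's actual code.

```java
public class HeatMapSymbol {
    // Map a disk's replica count to the heat map symbol described above:
    //   '0' = no replicas on this disk
    //   '+' = more than 20% above the average
    //   '-' = more than 20% below the average
    //   '=' = close to the average
    static char symbol(long replicasOnDisk, double average) {
        if (replicasOnDisk == 0) return '0';
        if (replicasOnDisk > average * 1.2) return '+';
        if (replicasOnDisk < average * 0.8) return '-';
        return '=';
    }
}
```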
Finally, detailed information per block and replica is printed. The number of blocks to print is limited by an optional second argument; by default, 20 are printed.
An output example is shown below.
How to use it
CERN Hadoop users can use this tool just by typing the "/afs/cern.ch/project/hadoop/etc/hdfs-metadata/bin/hdfs-blkd" command when logged in to any CERN Hadoop cluster. External users, or CERN users who want to use this tool from an external machine, will need to build the project and make the required HDFS libraries available to the tool.
To build the project, you should follow a few steps. Once the Git repository is cloned (git clone https://github.com/cerndb/hdfs-metadata.git) or downloaded and decompressed, run the provided script (bin/compile) to build the project. It will use Maven if available, or javac otherwise, to compile the tool (cat bin/compile for details).
Once the project is compiled, you can run it with the bin/hdfs-blkd script, which runs the tool using the "hadoop jar" command. Optionally, you can specify where the required HDFS libraries are located (cat bin/hdfs-blkd for details); in that case, Hadoop does not need to be installed.
bin/hdfs-blkd expects a path as its first argument; block metadata will be collected from this path. Optionally, you can give a number as a second argument to limit the number of blocks that are printed (20 by default). Notice that this tool implements the Hadoop Tool interface, so generic options can be used for configuration.
If you are using this tool from an external machine, you may need to configure the target HDFS cluster. You can achieve that with the "-D fs.defaultFS=hdfs://hdfs-cluster-10.cern.ch:9000/" argument.