Offline analysis of HDFS metadata
HDFS is part of the core Hadoop ecosystem and serves as a storage layer for the Hadoop computational frameworks like Spark, MapReduce. Like other distributed file systems, HDFS is based on an architecture where namespace is decoupled from the data. The namespace contains the file system metadata which is maintained by dedicated server called namenode and the data itself resides on other servers called datanodes.
This blogpost is about dumping HDFS metadata into Impala/Hive table for examination and offline analysis using SQL semantics
The namenode keeps the complete namespace in memory for faster metadata operations and persistently stores the namespace image and edits log (WAL) on disk. The namespace image contains HDFS directories, files, their attributes like permissions, quotas, access time and modification time and block IDs for files. Although its out of scope for this blogpost, it's worth mentioning that it is the role of secondary name node to periodically merge on disk fsimage with edits log
Loading HDFS metadata into Impala table
This can be achieved by downloading the latest fsimage file, formatting it and loading into impala table in parquet format as you will see below
Fetch and copy the fsimage file into HDFS
#connect to any hadoop cluster node as hdfs user #downloads the fsimage file from namenode hdfs dfsadmin -fetchImage /tmp #converts the fsimage file into tab delimited file hdfs oiv -i /tmp/fsimage_0000000000450297390 -o /tmp/fsimage.csv -p Delimited #remove the header and copy to HDFS sed -i -e "1d" fsimage.csv hdfs dfs -mkdir /tmp/fsimage hdfs dfs -copyFromLocal /tmp/fsimage.csv /tmp/fsimage
Load the data into impala table in parquet format
#connect to any hadoop cluster node #create the intermediate external table in impala CREATE EXTERNAL TABLE HDFS_META_D ( PATH STRING , REPL INT , MODIFICATION_TIME STRING , ACCESSTIME STRING , PREFERREDBLOCKSIZE INT , BLOCKCOUNT DOUBLE, FILESIZE DOUBLE , NSQUOTA INT , DSQUOTA INT , PERMISSION STRING , USERNAME STRING , GROUPNAME STRING) row format delimited fields terminated by '\t' LOCATION '/tmp/fsimage'; #create the final table in parquet storage format CREATE TABLE HDFS_META ( PATH STRING , REPL INT , MODIFICATION_TIME TIMESTAMP , ACCESSTIME TIMESTAMP , PREFERREDBLOCKSIZE INT , BLOCKCOUNT DOUBLE, FILESIZE DOUBLE , NSQUOTA INT , DSQUOTA INT , PERMISSION STRING , USERNAME STRING , GROUPNAME STRING) STORED AS PARQUETFILE; #load the data into the final table INSERT INTO HDFS_META SELECT PATH, REPL, cast(concat(MODIFICATION_TIME,':00') as timestamp), cast(concat(ACCESSTIME,':00') as timestamp), PREFERREDBLOCKSIZE, BLOCKCOUNT, FILESIZE, NSQUOTA, DSQUOTA, PERMISSION, USERNAME, GROUPNAME FROM HDFS_META_D;
Now that you have the HDFS metadata in an Impala table, it's time to slice and dice the metadata using SQL semantics
I will demonstrate few use cases of this data, there can be many more depending on the problems you are solving
# to get list of files modified between 2 dates for wlcg select path from HDFS_META where path like '/user/wlcg%' and modification_time between '2016-06-19' and '2016-06-20'; # average number of blocks per file for each user, HDFS works well with large files select username,sum(blockcount)/count(1) "avg blks per file" from HDFS_META where permission like '-%' group by username; # number of files created by day select trunc(modification_time,'DD') day,count(1) num_of_files, round(sum(filesize)/1024/1024/1024,2) Total_Size_GB from hdfs_meta group by trunc(modification_time,'DD') order by trunc(modification_time,'DD'); # files that have never been accessed select username, count(1) num_of_files,round(sum(filesize)/1024/1024/1024,2) Total_Size_GB from hdfs_meta where permission like '-%' and accesstime < modification_time group by username order by Total_Size_GB; # The following query works ONLY in hive, as you might know impala tables be accessed using hive and viceversa # histogram showing distribution of HDFS files by blockcount SELECT CONCAT(CONCAT(CAST(hdfs.x as int),","),CAST(hdfs.y as bigint)) FROM (select histogram_numeric(blockcount,40) as hdfs_meta from hdfs_meta) a LATERAL VIEW explode(hdfs_meta) exploded_table as hdfs;
My hdfs_meta table consists of 4 million records and all the above queries complete in less than 200 ms. Even though most of these analyses can be performed using hdfs java api, it can be tedious, time consuming and slow. Further more it may lead to namenode instability by scanning the whole namespace multiple times.
As hadoop service providers from time to time we need answers to these questions during service operation - day to day activities or service design - provision of backup service for HDFS. To this end, this blog post demonstrates how to perform blazingly fast analysis on hdfs metadata in the most non-disruptive way possible using SQL semantics.