Offline analysis of HDFS metadata

Introduction

HDFS is part of the core Hadoop ecosystem and serves as the storage layer for Hadoop computational frameworks such as Spark and MapReduce. Like other distributed file systems, HDFS is based on an architecture in which the namespace is decoupled from the data. The namespace contains the file system metadata, which is maintained by a dedicated server called the namenode, while the data itself resides on other servers called datanodes.

This blog post is about dumping HDFS metadata into an Impala/Hive table for examination and offline analysis using SQL semantics.

HDFS Metadata

The namenode keeps the complete namespace in memory for faster metadata operations and persistently stores the namespace image (fsimage) and the edits log (a write-ahead log) on disk. The namespace image contains the HDFS directories and files together with their attributes such as permissions, quotas, access and modification times, and the block IDs of files. Although it is out of scope for this blog post, it is worth mentioning that it is the role of the secondary namenode to periodically merge the on-disk fsimage with the edits log.
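
If you are curious where these files live on the namenode host, you can check the configured metadata directory. A minimal sketch, where <name.dir> stands in for whatever path your cluster is configured with:


#print the configured namenode metadata directory (cluster-specific)
hdfs getconf -confKey dfs.namenode.name.dir

#the current/ subdirectory under that path holds the fsimage_* checkpoints
#and edits_* segments (replace <name.dir> with the value printed above)
ls <name.dir>/current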

Loading HDFS metadata into an Impala table

This can be achieved by downloading the latest fsimage file, converting it to a delimited format and loading it into an Impala table in Parquet format, as shown below.

Fetch and copy the fsimage file into HDFS


#connect to any hadoop cluster node as the hdfs user

#download the fsimage file from the namenode
hdfs dfsadmin -fetchImage /tmp

#convert the fsimage file into a tab-delimited file
hdfs oiv -i /tmp/fsimage_0000000000450297390 -o /tmp/fsimage.csv -p Delimited
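
#(optional) inspect the header line to confirm the column layout matches
#the external table definition further below, before the header is removed
head -1 /tmp/fsimage.csv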

#remove the header and copy to HDFS
sed -i -e "1d" /tmp/fsimage.csv
hdfs dfs -mkdir /tmp/fsimage
hdfs dfs -copyFromLocal /tmp/fsimage.csv /tmp/fsimage

Load the data into an Impala table in Parquet format


--connect to any hadoop cluster node

--create the intermediate external table in Impala
CREATE EXTERNAL TABLE HDFS_META_D ( 
 PATH STRING , 
 REPL INT , 
 MODIFICATION_TIME STRING , 
 ACCESSTIME STRING , 
 PREFERREDBLOCKSIZE INT , 
 BLOCKCOUNT DOUBLE, 
 FILESIZE DOUBLE , 
 NSQUOTA INT , 
 DSQUOTA INT , 
 PERMISSION STRING , 
 USERNAME STRING , 
 GROUPNAME STRING) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/tmp/fsimage';

--create the final table in Parquet storage format
CREATE TABLE HDFS_META ( 
 PATH STRING , 
 REPL INT , 
 MODIFICATION_TIME TIMESTAMP , 
 ACCESSTIME TIMESTAMP , 
 PREFERREDBLOCKSIZE INT ,
 BLOCKCOUNT DOUBLE, 
 FILESIZE DOUBLE , 
 NSQUOTA INT , 
 DSQUOTA INT , 
 PERMISSION STRING , 
 USERNAME STRING , 
 GROUPNAME STRING)
STORED AS PARQUETFILE;

--load the data into the final table
INSERT INTO HDFS_META
SELECT
 PATH, 
 REPL, 
 cast(concat(MODIFICATION_TIME,':00') as timestamp),
 cast(concat(ACCESSTIME,':00') as timestamp),
 PREFERREDBLOCKSIZE,
 BLOCKCOUNT, 
 FILESIZE, 
 NSQUOTA, 
 DSQUOTA, 
 PERMISSION, 
 USERNAME, 
 GROUPNAME
FROM
 HDFS_META_D;
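
At this point it is worth gathering table statistics so Impala can plan queries over the new table efficiently, and running a quick sanity check on the row count. A minimal sketch:


--compute table and column statistics for the query planner
COMPUTE STATS HDFS_META;

--quick sanity check: the row count should match the number of lines
--in the delimited fsimage dump
SELECT count(*) FROM HDFS_META;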

 

Now that you have the HDFS metadata in an Impala table, it's time to slice and dice the metadata using SQL semantics.

Offline analysis

I will demonstrate a few use cases for this data; there can be many more, depending on the problems you are solving.


--get the list of files under /user/wlcg modified between two dates
select path from HDFS_META where path like '/user/wlcg%' and modification_time between '2016-06-19' and '2016-06-20';

--average number of blocks per file for each user (HDFS works best with large files)
select username, sum(blockcount)/count(1) as avg_blks_per_file from HDFS_META where permission like '-%' group by username;

--number of files created and total size written per day
select trunc(modification_time,'DD') day, count(1) num_of_files, round(sum(filesize)/1024/1024/1024,2) Total_Size_GB from hdfs_meta group by trunc(modification_time,'DD') order by trunc(modification_time,'DD');

--files that have not been read since they were last written, per user
select username, count(1) num_of_files, round(sum(filesize)/1024/1024/1024,2) Total_Size_GB from hdfs_meta where permission like '-%' and accesstime < modification_time group by username order by Total_Size_GB;

--the following query works ONLY in Hive; as you might know, Impala tables can be accessed from Hive and vice versa
--histogram showing the distribution of HDFS files by block count
SELECT
 CONCAT(CONCAT(CAST(hdfs.x AS INT),","),CAST(hdfs.y AS BIGINT))
FROM (
 SELECT histogram_numeric(blockcount,40) AS hdfs_meta
 FROM hdfs_meta
) a
LATERAL VIEW explode(hdfs_meta) exploded_table AS hdfs;
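
Another use case, sketched here as an illustration (it assumes the default 128 MB block size; your cluster's dfs.blocksize may differ), is finding the users who own the most small files, since large numbers of small files inflate the namenode's memory footprint:


--users owning the most small files (the 128 MB threshold is an assumption)
select username, count(1) num_small_files, round(sum(filesize)/1024/1024/1024,2) Total_Size_GB
from hdfs_meta
where permission like '-%' and filesize < 128*1024*1024
group by username
order by num_small_files desc
limit 10;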

 

[Figure: histogram showing the distribution of HDFS files by block count]

My hdfs_meta table consists of 4 million records, and all the above queries complete in less than 200 ms. Even though most of these analyses can be performed using the HDFS Java API, that approach can be tedious, time-consuming and slow. Furthermore, it may lead to namenode instability by scanning the whole namespace multiple times.

Conclusion

As Hadoop service providers, from time to time we need answers to these questions during service operation (day-to-day activities) or service design (for example, provisioning a backup service for HDFS). To this end, this blog post demonstrates how to perform blazingly fast analysis of HDFS metadata in the most non-disruptive way possible, using SQL semantics.
