hadoop-summit
Analyzing Small Files in HDFS Cluster
Presenters: Rohit Jangid, Raman Goyal
HDFS Analysis for Small Files
Outline
▪ What are small files and their problems?
▪ Small Files Analysis
  ▪ Architecture
  ▪ FsImage Processing and Aggregation
  ▪ Implementation and Tool
▪ Dashboards and Results
  ▪ Dashboards
  ▪ Results
▪ Future Work
▪ Conclusions
Expedia’s HDFS Cluster
HDFS Doesn’t Like Lots of Small Files…
Problem?
INEFFICIENT DATA ACCESS PATTERN
MAKES JOBS SLOW....
Trivial Solution?
Solution?
Compaction
BUT WHERE...?
SMALL FILES ANALYSIS
ARCHITECTURE
HDFS Cluster → Raw FsImage → Interpreted FsImage → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard
FsIMAGE PROCESSING
Fetched from NameNode
OIV to LSR Interpreter
Custom OIV Interpreter for Reduced Memory Usage
Processed 20 GB FsImage in ~20 Minutes
HDFS Cluster → Raw FsImage → Interpreted FsImage
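To make the "interpreted FsImage" step concrete, here is a minimal Python sketch of turning one interpreted row into a record. The slides do not show the output format of the custom interpreter, so the column layout below (the familiar recursive `ls` layout: permissions, replication, owner, group, size, date, time, path) is an assumption, and `FsEntry` and `parse_lsr_line` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class FsEntry(NamedTuple):
    path: str
    is_dir: bool
    replication: int  # 0 for directories, which show "-" in this column
    owner: str
    group: str
    size: int         # bytes
    modified: str     # "YYYY-MM-DD HH:MM"
    level: int        # depth below the root; "/" is level 0


def parse_lsr_line(line: str) -> Optional[FsEntry]:
    """Parse one ls -R style row (assumed layout); return None if malformed."""
    # Split into at most 8 fields so that spaces inside the path survive.
    parts = line.strip().split(None, 7)
    if len(parts) != 8:
        return None
    perms, repl, owner, group, size, date, time, path = parts
    try:
        return FsEntry(
            path=path,
            is_dir=perms.startswith("d"),
            replication=0 if repl == "-" else int(repl),
            owner=owner,
            group=group,
            size=int(size),
            modified=f"{date} {time}",
            level=len([p for p in path.split("/") if p]),
        )
    except ValueError:
        # Non-numeric replication or size: treat the row as malformed.
        return None
```

Parsing row by row like this keeps memory use flat, since no more than one row needs to be held at a time.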
Attribution and Aggregation
Attributes Found Directly
▪ Owner Name
▪ Group Name
▪ Size of File
▪ Replication Factor
▪ Number of Direct File objects
▪ Last Modified Date
▪ Level of File
▪ Is File or Is Directory?
Aggregated Attributes (If Directory)
▪ Number of Small File objects
▪ Number of Namespace objects
▪ Smallest, Largest, Avg File size
▪ Difference in Size since Last run
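The aggregated attributes for a single directory can be sketched in a few lines of Python. This is an illustration only: the slides do not state the small-file threshold used for this step, so the 10 MB value (taken here as 10 * 1024 * 1024 bytes) is an assumption, and `DirStats`, `summarize_sizes` and `size_delta` are hypothetical names.

```python
from typing import Iterable, NamedTuple, Optional

# Assumed threshold: files under 10 MB count as "small".
SMALL_FILE_BYTES = 10 * 1024 * 1024


class DirStats(NamedTuple):
    files: int
    small_files: int
    smallest: Optional[int]
    largest: Optional[int]
    average: Optional[float]


def summarize_sizes(sizes: Iterable[int],
                    threshold: int = SMALL_FILE_BYTES) -> DirStats:
    """Aggregate the file sizes (in bytes) found under one directory."""
    sizes = list(sizes)
    if not sizes:
        return DirStats(0, 0, None, None, None)
    return DirStats(
        files=len(sizes),
        small_files=sum(1 for s in sizes if s < threshold),
        smallest=min(sizes),
        largest=max(sizes),
        average=sum(sizes) / len(sizes),
    )


def size_delta(current_total: int, previous_total: int) -> int:
    """Difference in total size since the last run (negative if it shrank)."""
    return current_total - previous_total
```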
Attribution and Aggregation
Generate Small Files / Total Files Metrics
Roll-up Attributes to Parent Directories
Custom UDFs and Pig Scripts Using Sqoop
Stored in HDFS
Attributed Files and Directory information → Aggregated Files and Directory information → Storage
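The roll-up of attributes to parent directories can be illustrated with a small in-memory Python sketch. The talk implements this step with Pig scripts and custom UDFs; the plain-Python version below is a stand-in to show the idea, not the authors' code. The 10 MB threshold and the names `ancestors`, `roll_up` and `small_ratio` are assumptions.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

SMALL_FILE_BYTES = 10 * 1024 * 1024  # assumed small-file threshold


def ancestors(path: str) -> Iterator[str]:
    """Yield every ancestor directory of path, from its parent up to "/"."""
    current = posixpath.dirname(path)
    while True:
        yield current
        if current in ("/", ""):  # "" guards against relative paths
            break
        current = posixpath.dirname(current)


def roll_up(files: Iterable[Tuple[str, int]],
            threshold: int = SMALL_FILE_BYTES) -> Dict[str, Dict[str, int]]:
    """Add each file's counts and bytes to all of its ancestor directories."""
    totals = defaultdict(lambda: {"files": 0, "small_files": 0, "bytes": 0})
    for path, size in files:
        for directory in ancestors(path):
            entry = totals[directory]
            entry["files"] += 1
            entry["bytes"] += size
            if size < threshold:
                entry["small_files"] += 1
    return dict(totals)


def small_ratio(entry: Dict[str, int]) -> float:
    """Small files / total files for one directory entry."""
    return entry["small_files"] / entry["files"] if entry["files"] else 0.0
```

Because every file contributes to each of its ancestors, a directory's totals cover its whole subtree, which is what lets a dashboard drill down from the root to any level.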
STORAGE AND REPORTING
Storage → REST API → Dashboard
Relational Database and REST API Dashboards
Different Dashboards Showing User Level and Overall Level
Powered by Cyclotron: http://cyclotron.io
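As a rough illustration of the storage layer behind the user-level and overall dashboards, the sketch below keeps per-directory statistics in a relational table and answers two dashboard-style queries. The slides do not name the database or its schema, so SQLite, the `dir_stats` table, its columns and all function names are assumptions. The table stores direct (not rolled-up) counts so that summing across directories does not count a file twice.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS dir_stats (
    cluster      TEXT NOT NULL,
    path         TEXT NOT NULL,
    owner        TEXT NOT NULL,
    direct_files INTEGER NOT NULL,
    direct_small INTEGER NOT NULL,
    PRIMARY KEY (cluster, path)
)
"""


def open_store(db_path=":memory:"):
    """Open (or create) the statistics store."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def save_stats(conn, rows):
    """Insert or refresh (cluster, path, owner, files, small files) rows."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO dir_stats VALUES (?, ?, ?, ?, ?)", rows
        )


def small_files_by_owner(conn, cluster):
    """Small-file count per owner for one cluster, largest first."""
    return conn.execute(
        "SELECT owner, SUM(direct_small) AS n FROM dir_stats "
        "WHERE cluster = ? GROUP BY owner ORDER BY n DESC, owner",
        (cluster,),
    ).fetchall()


def top_directories(conn, cluster, limit=10):
    """Directories holding the most small files directly."""
    return conn.execute(
        "SELECT path, direct_small FROM dir_stats "
        "WHERE cluster = ? ORDER BY direct_small DESC, path LIMIT ?",
        (cluster, limit),
    ).fetchall()
```

Keying rows by cluster means a new cluster only adds rows, not tables, and a REST layer could expose these two queries as read-only endpoints.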
Implementation and Tool
▪ Download and Interpret HDFS NameNode
  Using OIV Interpreter
▪ Files and Directories Attributed
  By splitting FsImage rows
▪ Small file & Directory information
  At Directory level
  Statistics like Smallest File calculated
▪ Storage, REST API and Dashboards
  Can easily add new Clusters in Tool
DASHBOARDS AND RESULTS
Dashboards Information
3 possible bucketing models
▪ For file size less than 10 MB
▪ For file size between 10 MB and 70 MB
▪ For file size between 70 MB and ~100 MB
Goes up to all levels in HDFS
Distribution of owners of small files
Top 10
Directories to be investigated for deletion, re-partition, compaction
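The size ranges above can be expressed as a small bucketing function. This is a sketch under assumptions: the slides do not say whether the boundaries are inclusive, whether MB means 10^6 or 2^20 bytes, or what "~100 MB" is exactly, so the code uses 2^20 bytes, half-open ranges and an upper limit of exactly 100 MB. The names `size_bucket` and `bucket_counts` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

MB = 1024 * 1024  # assumed meaning of "MB"

# (upper bound in bytes, label); ranges are half-open: lower <= size < upper.
BUCKETS = [
    (10 * MB, "<10 MB"),
    (70 * MB, "10-70 MB"),
    (100 * MB, "70-100 MB"),
]


def size_bucket(size_bytes: int) -> Optional[str]:
    """Return the bucket label for a file, or None if it is not small."""
    for upper, label in BUCKETS:
        if size_bytes < upper:
            return label
    return None


def bucket_counts(sizes: Iterable[int]) -> Counter:
    """Count files per bucket, ignoring files at or above the last bound."""
    return Counter(b for b in map(size_bucket, sizes) if b is not None)
```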
Overall Dashboard containing all Information
Distribution of Owners of Small Files
Sample Directories Containing Small Files
Top 10: Files vs Small Files
Daily Small Files per Directory
Future Work
1. Real-time analysis with alerting
   The tool doesn’t have real-time analysis with alerting yet. The cluster has 200+ million namespace objects, which we get as a memory dump from the Hadoop server, and translating and attributing each directory and file is a time-consuming process.
2. Developing a customisable compaction utility
Conclusions