Analyzing Small Files in HDFS Cluster

HDFS Analysis for Small Files

Presenters: Rohit Jangid, Raman Goyal


Outline
▪ What are small files and their problems?
▪ Small Files Analysis
  ▪ Architecture
  ▪ FsImage Processing and Aggregation
  ▪ Implementation and Tool
▪ Dashboards and Results
  ▪ Dashboards
  ▪ Results
▪ Future Work
▪ Conclusions


Expedia’s HDFS Cluster


Problem?

HDFS Doesn’t Like Lots of Small Files…

INEFFICIENT DATA ACCESS PATTERN


MAKES JOBS SLOW....


Trivial Solution?


Solution?

Compaction

BUT WHERE...?


SMALL FILES ANALYSIS


ARCHITECTURE

HDFS Cluster → RAW FsImage → Interpreted FsImage (LSR) → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard

FsIMAGE PROCESSING

HDFS Cluster → RAW FsImage → Interpreted FsImage (LSR)

▪ RAW FsImage fetched from the NameNode
▪ OIV to LSR interpreter turns the RAW FsImage into the Interpreted FsImage
▪ Custom OIV interpreter for reduced memory usage
▪ Processed a 20 GB FsImage in ~20 minutes
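A rough illustration of why a row-at-a-time interpreter keeps memory low: the sketch below parses interpreted FsImage rows lazily, one line at a time. It assumes an `ls -lR`-style row layout (permissions, replication, owner, group, size, date, time, path); the slides do not show the exact layout the custom interpreter emits, so the field order and the names `FsEntry` and `parse_rows` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FsEntry:
    """One file or directory row (hypothetical record type)."""
    path: str
    is_dir: bool
    replication: int
    owner: str
    group: str
    size: int
    modified: str  # "YYYY-MM-DD HH:MM", kept as text


def parse_rows(lines: Iterable[str]) -> Iterator[FsEntry]:
    """Yield one FsEntry per row without loading the whole file."""
    for line in lines:
        # Assumed layout: perms repl owner group size date time path
        # maxsplit=7 keeps paths that contain spaces in one piece.
        parts = line.split(None, 7)
        if len(parts) < 8:
            continue  # skip blank or malformed rows
        perms, repl, owner, group, size, date, time, path = parts
        yield FsEntry(
            path=path.rstrip("\n"),
            is_dir=perms.startswith("d"),
            # ls-style listings print "-" as the replication of a directory
            replication=0 if repl == "-" else int(repl),
            owner=owner,
            group=group,
            size=int(size),
            modified=f"{date} {time}",
        )
```

Because `parse_rows` is a generator, it can be fed an open file handle directly, so only one row is held in memory at any moment.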


Attribution and Aggregation

Attributes Found Directly
▪ Owner Name
▪ Group Name
▪ Size of File
▪ Replication Factor
▪ Number of Direct File objects
▪ Last Modified Date
▪ Level of File
▪ Is File or Is Directory?

Aggregated Attributes (If Directory)
▪ Number of Small File objects
▪ Number of Namespace objects
▪ Smallest, Largest, Avg File size
▪ Difference in Size since Last run


Attribution and Aggregation

Attributed Files and Directory information → Aggregated Files and Directory information → Storage

▪ Custom UDFs and Pig scripts
▪ Roll-up attributes to parent directories
▪ Generate small files / total files metrics
▪ Attributed and aggregated information stored in HDFS
▪ Loaded into Storage using Sqoop
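The roll-up to parent directories can be sketched in a few lines. This is plain Python, not the Pig scripts and UDFs used in the talk, and it only illustrates the idea: every file contributes its size to each ancestor directory, from which the small files / total files metric follows. The 10 MB threshold, the dictionary layout, and the names `roll_up` and `small_ratio` are assumptions made for the example.

```python
from collections import defaultdict
from posixpath import dirname

SMALL_FILE_BYTES = 10 * 1024 * 1024  # assumed "small file" threshold


def _new_stats():
    return {"files": 0, "small_files": 0, "bytes": 0,
            "smallest": None, "largest": None}


def roll_up(files, small_bytes=SMALL_FILE_BYTES):
    """Aggregate (path, size) pairs into every ancestor directory."""
    stats = defaultdict(_new_stats)
    for path, size in files:
        parent = dirname(path)
        while True:
            s = stats[parent]
            s["files"] += 1
            if size < small_bytes:
                s["small_files"] += 1
            s["bytes"] += size
            s["smallest"] = size if s["smallest"] is None else min(s["smallest"], size)
            s["largest"] = size if s["largest"] is None else max(s["largest"], size)
            if parent in ("/", ""):
                break  # reached the root (or a relative path's top)
            parent = dirname(parent)
    return dict(stats)


def small_ratio(dir_stats):
    """Small files / total files for one directory's aggregated stats."""
    return dir_stats["small_files"] / dir_stats["files"]
```

The average file size per directory is `bytes / files`, and the difference in size since the last run can be taken by subtracting the `bytes` value of the previous run's result for the same directory.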


STORAGE AND REPORTING

Storage → REST API → Dashboard

▪ Storage: relational database and REST API
▪ Dashboards: different dashboards showing user level and overall level
▪ Powered by Cyclotron: http://cyclotron.io


Implementation and Tool

▪ Download and interpret the HDFS NameNode FsImage, using the OIV interpreter
▪ Files and directories attributed by splitting FsImage rows
▪ Small file and directory information at directory level; statistics like smallest file calculated
▪ Storage, REST API and dashboards; can easily add new clusters in the tool


DASHBOARDS AND RESULTS


Dashboards Information

▪ 3 possible bucketing models
  ▪ For file size less than 10 MB
  ▪ For file size between 10 MB and 70 MB
  ▪ For file size between 70 MB and ~100 MB
▪ Goes up to all levels in HDFS
▪ Distribution of owners of small files
▪ Top 10 directories to be investigated for deletion, re-partition, compaction
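The three size ranges can be expressed as a small lookup, sketched below. The slides give only the ranges, so the boundary handling (lower bound inclusive, upper bound exclusive), treating "~100 MB" as exactly 100 MB, counting 1 MB as 1024 × 1024 bytes, and the names `bucket_for` and `bucket_counts` are all assumptions.

```python
from collections import Counter

MB = 1024 * 1024  # assumed: binary megabytes

# (exclusive upper bound in bytes, label), in ascending order
BUCKETS = [
    (10 * MB, "< 10 MB"),
    (70 * MB, "10-70 MB"),
    (100 * MB, "70-100 MB"),
]


def bucket_for(size):
    """Return the bucket label for a file size, or None if it is not small."""
    for upper, label in BUCKETS:
        if size < upper:
            return label
    return None


def bucket_counts(sizes):
    """Count how many files fall into each small-file bucket."""
    labels = (bucket_for(size) for size in sizes)
    return Counter(label for label in labels if label is not None)
```

Applying `bucket_counts` to the file sizes under each directory gives the per-bucket numbers a dashboard could chart at any level of the tree.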


Overall Dashboard containing all Information


Distribution of Owners of Small Files


Sample Directories Containing Small Files


Top 10: Files vs Small Files


Daily Small Files per Directory


Future Work

1. Doesn’t have real-time analysis with alerting!
   ▪ Cluster has 200+ million namespace objects that we get as a memory dump from the Hadoop server.
   ▪ Translating and attributing each directory and file is a time-consuming process.
2. Developing a customisable compaction utility



Conclusions