REMOVING THE NAMENODE'S MEMORY LIMITATION
Lin Xiao
Intern @ Hortonworks
PhD student @ Carnegie Mellon University

About Me: Lin Xiao
• PhD student at CMU
  • Advisor: Garth Gibson
  • Thesis area: scalable distributed file systems
• Intern at Hortonworks
  • Intern project: removing the Namenode memory limitation
• Email: [email protected]

Big Data
• We create 2.5×10^18 bytes of data per day [IBM]
  • Sloan Digital Sky Survey: 200 GB/night
  • Facebook: 240 billion photos as of Jan 2013
    • 250 million photos uploaded daily
• Cloud storage
  • Amazon: 2 trillion objects, peak of 1.1 million ops/sec
• Need scalable storage systems
  • Scalable metadata <- focus of this presentation
  • Scalable storage
  • Scalable IO

Scalable Storage Systems
• Separate data and metadata servers
• More data nodes for higher throughput & capacity
• Bulk of the work (the IO path) is done by the data servers
• Not much work added to the metadata servers?

Federated HDFS
• Namenodes (MDS) each see their own namespace (NS)
• Each datanode can serve all namenodes

Single Namenode
• Stores all metadata in memory
  • Design is simple
  • Provides low-latency, high-throughput metadata operations
  • Supports up to 3K data servers
• Hadoop clusters make it affordable to store old data
  • Cold data is stored in the cluster for a long time
  • It takes up memory space but is rarely used
  • Data size can grow faster than throughput demands
• Goal: remove the space limit while maintaining similar performance

Metadata in the Namenode
• Namespace
  • Stored as a linked tree structure of inodes
  • Every operation walks the tree from the top
• Block map: block_id to location mapping
  • Handled separately because of the huge number of blocks
• Datanode status
  • IP address, capacity, load, heartbeat status, block report status
• Leases
• The namespace and block map use the majority of memory
  • This talk will focus on the namespace

Problem and Proposed Solution
• Problem:
  • Remove the namespace limit while maintaining similar performance when the working set fits in memory
• Solution:
  • Retain the same namespace tree structure
  • Store the namespace in a persistent store using an LSM tree (LevelDB)
    • No separate edit logs or checkpoints
    • All inodes and their updates are persisted via LevelDB
  • Fast startup, at the cost of slow initial operations
    • Inodes could be prefetched to mitigate this
  • Do not expect customers to drastically reduce the actual heap size
    • A larger heap eases the transition between different working sets as applications and workloads change
    • A customer may occasionally run queries against cold data

New Namenode Architecture
• Namespace
  • Same as before, but only part of the tree is in memory
  • On a cache miss, read from LevelDB
• Edit logs and checkpoints are replaced by LevelDB
  • LevelDB is updated for every inode change
  • Key: <parent_inode_number + name> (see the sketch below)
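
One possible encoding of this key, as a minimal sketch: it assumes 8-byte big-endian inode ids (the slides do not specify the exact on-disk layout), so that all children of a directory sort next to each other in LevelDB's ordered key space and can be fetched with a single range scan.

    // Hypothetical key layout: fixed-width parent inode id followed by the child name.
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    final class InodeKey {
        /** Builds a LevelDB key from <parent inode number + name>. */
        static byte[] of(long parentInodeId, String childName) {
            byte[] name = childName.getBytes(StandardCharsets.UTF_8);
            return ByteBuffer.allocate(Long.BYTES + name.length)
                    .putLong(parentInodeId)   // big-endian prefix keeps siblings contiguous
                    .put(name)                // child name distinguishes entries within a directory
                    .array();
        }
    }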

Comparison with Traditional File Systems
• Traditional file systems
  • The VFS layer keeps inode and directory entry caches
  • Goal is to support the workload of a single machine
    • Relatively large number of files
    • Serve applications from a single machine, or in the case of NFS from a larger number of client machines
    • Much, much smaller workload and size compared to Hadoop use cases
• LevelDB-based Namenode
  • Supports the very large traffic of a Hadoop cluster
  • Keeps a much larger number of inodes in memory
  • Cache replacement policies to suit the Hadoop workload
  • Data is on the Datanodes

LevelDB
• A fast key-value storage library written at Google
• Basic operations: get, put, delete
• Concurrency: single process with multiple threads
• By default, writes are asynchronous
  • As long as the machine doesn't crash, data is safe
• Supports synchronous writes
  • No separate sync() operation; it can be emulated with a synchronous write/delete
• Supports batch updates
• Data is automatically compressed using Snappy
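
A minimal sketch of these operations using the pure-Java iq80 LevelDB port (the prototype itself may use a different Java binding of LevelDB; keys and values here are placeholders):

    import org.iq80.leveldb.*;
    import static org.iq80.leveldb.impl.Iq80DBFactory.*;

    import java.io.File;

    public class LevelDbDemo {
        public static void main(String[] args) throws Exception {
            Options options = new Options().createIfMissing(true);
            try (DB db = factory.open(new File("namespace-db"), options)) {
                // Asynchronous write (the default): lost only if the machine crashes.
                db.put(bytes("key-a"), bytes("inode-bytes"));

                // Synchronous write: flushed to durable storage before returning.
                db.put(bytes("key-b"), bytes("inode-bytes"), new WriteOptions().sync(true));

                byte[] value = db.get(bytes("key-a"));   // point lookup
                System.out.println(asString(value));
                db.delete(bytes("key-a"));               // remove a key

                // Batch update: several mutations applied atomically.
                try (WriteBatch batch = db.createWriteBatch()) {
                    batch.put(bytes("key-c"), bytes("inode-bytes"));
                    batch.delete(bytes("key-b"));
                    db.write(batch);
                }
            }
        }
    }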

Cache Replacement Policy
• Only whole directories are brought into or evicted from the cache
  • Hot directories are entirely in cache; others require a LevelDB scan
  • Future: don't cache very large directories?
  • No need to read from disk to check file existence
• LRU replacement policy
  • Approximated with CLOCK to reduce cost (see the sketch below)
• Separate thread for cache replacement
  • Starts replacement when a threshold is exceeded
  • Moves eviction out of the lock-holding request sessions
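
An illustrative sketch of CLOCK approximating LRU over cached directories; the class and field names below are hypothetical, not taken from the prototype:

    import java.util.ArrayList;
    import java.util.List;

    class ClockDirCache {
        static final class CachedDir {
            final long inodeId;
            volatile boolean referenced;   // set on access, cleared by the clock hand
            CachedDir(long inodeId) { this.inodeId = inodeId; }
        }

        private final List<CachedDir> dirs = new ArrayList<>();
        private int hand = 0;              // current position of the clock hand

        void add(CachedDir dir) { dirs.add(dir); }

        /** Called on every namespace access that touches this directory. */
        void touch(CachedDir dir) { dir.referenced = true; }

        /** Picks a victim: clears the bit of recently used dirs, evicts the first untouched one. */
        CachedDir selectVictim() {
            while (true) {                 // assumes the cache is non-empty
                CachedDir candidate = dirs.get(hand);
                hand = (hand + 1) % dirs.size();
                if (candidate.referenced) {
                    candidate.referenced = false;   // second chance
                } else {
                    return candidate;               // cold since the last sweep: evict it
                }
            }
        }
    }

The background replacement thread would call selectVictim() repeatedly until the cache drops back below the threshold, so eviction work stays off the request path.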

Benchmark description
• NNThroughputBenchmark
  • No RPC cost; calls FileSystem methods directly
  • All operations are generated in BFS order
  • Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
  • Normal HDFS client calls
  • Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
  • Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
  • E.g. GridMix; expect little degradation when most of the work is data transfer

Categories of tests
• Everything fits in memory
  • Goal: should be almost the same as the current NN
• Working set does not fit in memory, or changes over time
  • Study various cache replacement policies
  • Need to get good traces from real clusters to see patterns of hot, warm, and cold data

Experiment Setup
• Hardware description (Susitna)
  • CPU: AMD Opteron 6272, 64-bit, 16 MB L2, 16 cores @ 2.1 GHz
  • SSD: Crucial M4-CT064M4SSD2, 64 GB, SATA 6.0 Gb/s
  • (In progress) use disks in future experiments
• Heap size is set to 1 GB
• NNThroughputBenchmark
  • No RPC cost; calls FileSystem methods directly
  • All operations are generated in BFS order
  • Multiple threads, but each thread gets one portion of the work
  • Each directory contains 100 subdirectories and 100 files
    • Named sequentially: ThroughputBenchDir1, ThroughputBench1
• LevelDB NN
  • Cache monitor thread starts replacement when the cache is 90% full

Create & close 2.4M files (all fit in cache)
• Note: the files are not accessed, but their parent directories clearly are
• Note: the old NN and the LevelDB NN peak at different numbers of threads
• Degradation in peak throughput is 13.5%

Create 9.6M files: 1% fits in cache
• Old NN with 8 threads and LevelDB NN with 16 threads
• Performance remains about the same using LevelDB
• The original Namenode's throughput drops to zero when memory is exhausted

GetFileInfo
• ListStatus of the first 600K of 2.4M files (client-side equivalents of these calls are sketched below the chart)
• Each thread works on a different part of the tree
• Original NN: everything fits in memory (of course)
• LevelDB NN, 2 cases: (1) all fit in cache, (2) half fit
  • Half fit: 10%-20% degradation; the cache is constantly being replaced
[Chart: throughput (ops/sec, 0 to 140,000) vs. number of threads (2, 4, 8, 16, 32) for Original, FitCache, and HalfInCache]
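
A hedged sketch of the client-side calls these operations correspond to (the benchmark itself skips RPC and invokes the Namenode methods directly; the paths below are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GetFileInfoExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // FileSystem.getFileStatus() is backed by the Namenode's getFileInfo operation.
            FileStatus status = fs.getFileStatus(new Path("/ThroughputBenchDir1/ThroughputBench1"));
            System.out.println(status.getPath() + " len=" + status.getLen());

            // listStatus returns the status of every child of a directory, which is
            // why caching whole directories matters for this workload.
            for (FileStatus child : fs.listStatus(new Path("/ThroughputBenchDir1"))) {
                System.out.println(child.getPath());
            }
        }
    }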

Benchmarks that remain
• NNThroughputBenchmark
  • No RPC cost; calls FileSystem methods directly
  • All operations are generated in BFS order
  • Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
  • Normal HDFS client calls
  • Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
  • Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
  • E.g. GridMix; expect little degradation when most of the work is data transfer

Summary
• Now that the NN is HA, removing the namespace memory limitation is one of the most important problems to solve
• LSM (LevelDB) has worked out quite well
  • Initial experiments have shown good results
  • Need further benchmarks, especially on how effective caching is for different workloads and patterns
  • Other LSM implementations? (e.g. HBase's Java LSM)
• Work was done on branch 0.23
  • Graduate-student-quality prototype (a very good graduate student)
  • But worked closely with the HDFS experts at Hortonworks
• The goal of the internship was to see how well the idea worked
  • Hortonworks plans to take this to the next stage once more experiments are completed

Q&A
• Contact: [email protected]
• We'd love to get trace stats from your cluster
  • A simple Java program to run against your audit logs (sketched below)
  • Can also run as MapReduce jobs
  • Extracts metadata operation stats without exposing sensitive info
• Please contact me if you can help!
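
The kind of summarizer the last bullets describe might look like the following minimal sketch; it only counts the cmd= field of HDFS audit log lines and never emits paths or user names. The tab-separated key=value layout is assumed and can differ between Hadoop versions, and this is not the actual program being offered.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;

    public class AuditOpCounter {
        public static void main(String[] args) throws IOException {
            Map<String, Long> opCounts = new TreeMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    for (String field : line.split("\t")) {
                        if (field.startsWith("cmd=")) {
                            // Keep only the operation name (open, create, listStatus, ...).
                            opCounts.merge(field.substring("cmd=".length()), 1L, Long::sum);
                        }
                    }
                }
            }
            // Print one line per metadata operation with its count; no file names are included.
            opCounts.forEach((op, count) -> System.out.println(op + "\t" + count));
        }
    }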