
August 2013 HUG: Removing the NameNode's memory limitation


DESCRIPTION

The current HDFS Namenode stores all of its metadata in RAM. This has allowed Hadoop clusters to scale to 100K concurrent tasks. However, the memory limits the total number of files that a single Namenode can store. While Federation allows one to create multiple volumes with additional Namenodes, there is a need to scale a single namespace and also to store multiple namespaces in a single Namenode. This talk describes a project that removes the space limits while maintaining similar performance by caching only the working set, or hot metadata, in Namenode memory. We believe this approach will be very effective because the subset of files that is frequently accessed is much smaller than the full set of files stored in HDFS. In this talk we will describe our overall approach and give details of our implementation along with some early performance numbers. Speaker: Lin Xiao, PhD student at Carnegie Mellon University, intern at Hortonworks.


Page 1: August 2013 HUG: Removing the NameNode's memory limitation


REMOVING THE NAMENODE'S MEMORY LIMITATION Lin Xiao

Intern@Hortonworks

PhD student @ Carnegie Mellon University

Page 2: August 2013 HUG: Removing the NameNode's memory limitation


About Me: Lin Xiao
• PhD student at CMU
  • Advisor: Garth Gibson
  • Thesis area: scalable distributed file systems
• Intern at Hortonworks
  • Intern project: removing the Namenode memory limitation

• Email: [email protected]

Page 3: August 2013 HUG: Removing the NameNode's memory limitation


Big Data
• We create 2.5×10^18 bytes of data per day [IBM]
• Sloan Digital Sky Survey: 200 GB/night
• Facebook: 240 billion photos as of Jan 2013
  • 250 million photos uploaded daily
• Cloud storage
  • Amazon: 2 trillion objects, peak of 1.1 million ops/sec
• Need scalable storage systems
  • Scalable metadata <- focus of this presentation
  • Scalable storage
  • Scalable IO

Page 4: August 2013 HUG: Removing the NameNode's memory limitation


Scalable Storage Systems
• Separate data and metadata servers
• More data nodes for higher throughput & capacity
• The bulk of the work, the IO path, is done by the data servers
• Not much work is added to the metadata servers?

Page 5: August 2013 HUG: Removing the NameNode's memory limitation


Federated HDFS
• Namenodes (MDS) each see their own namespace (NS)
• Each datanode can serve all namenodes

Page 6: August 2013 HUG: Removing the NameNode's memory limitation


Single Namenode
• Stores all metadata in memory
  • The design is simple
  • Provides low-latency, high-throughput metadata operations
  • Supports up to 3K data servers
• Hadoop clusters make it affordable to store old data
  • Cold data is stored in the cluster for a long time
  • It takes up memory space but is rarely used
  • Data size can grow faster than the required throughput
• Goal: remove the space limits while maintaining similar performance

Page 7: August 2013 HUG: Removing the NameNode's memory limitation


Metadata in Namenode
• Namespace
  • Stored as a linked tree structure of inodes
  • Every operation traverses the tree from the top (see the sketch below)
• Block map: block_id to location mapping
  • Handled separately because of the huge number of blocks
• Datanode status
  • IP address, capacity, load, heartbeat status, block report status
• Leases
• The namespace and the block map use the majority of the memory
• This talk will focus on the namespace
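For concreteness, here is a minimal sketch of the "linked tree of inodes" idea: a path such as /a/b/c is resolved by walking from the root inode one component at a time. The SimpleInode class and its fields are illustrative assumptions, not HDFS's actual INode classes.

```java
// Illustrative sketch only: the namespace as a linked tree of inodes,
// resolved from the root for every operation.
import java.util.HashMap;
import java.util.Map;

class SimpleInode {
  final long id;
  final String name;
  final Map<String, SimpleInode> children = new HashMap<>(); // empty for files

  SimpleInode(long id, String name) { this.id = id; this.name = name; }

  // Resolve "/a/b/c" by walking component-by-component from this (root) inode.
  SimpleInode resolve(String path) {
    SimpleInode current = this;
    for (String component : path.split("/")) {
      if (component.isEmpty()) continue;        // skip the leading "/"
      current = current.children.get(component);
      if (current == null) return null;         // path does not exist
    }
    return current;
  }
}
```

With the whole tree in memory this traversal is cheap; the memory limit comes from having to hold every inode, hot or cold, at once.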

Page 8: August 2013 HUG: Removing the NameNode's memory limitation


Problem and Proposed Solution
• Problem:
  • Remove the namespace limit while maintaining similar performance when the working set fits in memory
• Solution:
  • Retain the same namespace tree structure
  • Store the namespace in a persistent store using LSM (LevelDB)
  • No separate edit logs or checkpoints
  • All inodes and their updates are persisted via LevelDB
  • Fast startup, at the cost of slow initial operations
  • Could prefetch inodes in the background
• Do not expect customers to drastically reduce the actual heap size
  • A larger heap benefits transitions between working sets as applications and workloads change
  • A customer may occasionally run queries against cold data

Page 9: August 2013 HUG: Removing the NameNode's memory limitation


New Namenode Architecture
• Namespace
  • Same as before, but only part of the tree is in memory
  • On a cache miss, read from LevelDB
• Edit logs and checkpoints are replaced by LevelDB
  • LevelDB is updated for every inode change
  • Key: <parent_inode_number + name> (see the key-encoding sketch below)
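As a rough illustration of the <parent_inode_number + name> key, the hedged Java sketch below encodes a child's key as a fixed-width parent inode number followed by the child's name, which keeps all children of a directory contiguous in LevelDB's sorted key space. The class and method names are assumptions for illustration; the cache-miss helper uses the org.iq80.leveldb DB interface, not the prototype's actual code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.iq80.leveldb.DB;

class NamespaceKeys {
  // Fixed-width parent-id prefix keeps a directory's children adjacent in
  // LevelDB's sorted key order, so a whole directory can be read in one scan.
  static byte[] childKey(long parentInodeNumber, String name) {
    byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(Long.BYTES + nameBytes.length)
        .putLong(parentInodeNumber)
        .put(nameBytes)
        .array();
  }

  // On a cache miss the serialized inode is fetched from LevelDB instead of
  // having to already sit in the in-memory tree; returns null if absent.
  static byte[] loadInodeOnMiss(DB db, long parentInodeNumber, String name) {
    return db.get(childKey(parentInodeNumber, name));
  }
}
```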

Page 10: August 2013 HUG: Removing the NameNode's memory limitation


Comparison w/Traditional FileSystem
• Traditional file systems
  • The VFS layer keeps an inode and directory entry cache
  • The goal is to support the workload of a single machine
  • Relatively large number of files
  • Supports applications from a single machine, or in the case of NFS from a larger number of client machines
  • Much, much smaller workload and size compared to Hadoop use cases
• LevelDB-based Namenode
  • Supports the very large traffic of a Hadoop cluster
  • Keeps a much larger number of inodes in memory
  • Cache replacement policies to suit the Hadoop workload
  • Data is in the Datanodes

Page 11: August 2013 HUG: Removing the NameNode's memory limitation


LevelDB
• A fast key-value storage library written at Google
• Basic operations: get, put, delete (see the usage sketch below)
• Concurrency: single process with multiple threads
• By default, writes are asynchronous
  • As long as the machine doesn't crash, data is safe
• Supports synchronous writes
  • No separate sync() operation
  • Can be implemented with synchronous write/delete
• Supports batch updates
• Data is automatically compressed using Snappy
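The sketch below exercises the operations listed on this slide through the Java leveldbjni binding (org.iq80.leveldb API): default asynchronous puts, an explicit synchronous write, a get, a delete, an atomic batch, and Snappy compression enabled in the options. It is a standalone illustration under the assumption that leveldbjni is on the classpath, not code from the prototype itself.

```java
import static org.fusesource.leveldbjni.JniDBFactory.asString;
import static org.fusesource.leveldbjni.JniDBFactory.bytes;
import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import org.iq80.leveldb.CompressionType;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.WriteBatch;
import org.iq80.leveldb.WriteOptions;

public class LevelDbBasics {
  public static void main(String[] args) throws Exception {
    Options options = new Options()
        .createIfMissing(true)
        .compressionType(CompressionType.SNAPPY);   // values compressed with Snappy
    DB db = factory.open(new File("namespace-db"), options);
    try {
      db.put(bytes("key"), bytes("v1"));                        // asynchronous by default
      db.put(bytes("key"), bytes("v2"),
             new WriteOptions().sync(true));                    // synchronous (durable) write
      System.out.println(asString(db.get(bytes("key"))));       // get
      db.delete(bytes("key"));                                  // delete

      WriteBatch batch = db.createWriteBatch();                 // atomic batch update
      try {
        batch.put(bytes("a"), bytes("1"));
        batch.delete(bytes("b"));
        db.write(batch, new WriteOptions().sync(true));
      } finally {
        batch.close();
      }
    } finally {
      db.close();
    }
  }
}
```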

Page 12: August 2013 HUG: Removing the NameNode's memory limitation


Cache Replacement Policy
• Only whole directories are brought in or evicted
  • Hot dirs are entirely in cache; others require a LevelDB scan
  • Future: don't cache very large dirs?
  • No need to read from disk to check file existence
• LRU replacement policy
  • Uses CLOCK as an approximation to reduce cost (sketched below)
• Separate thread for cache replacement
  • Starts replacement when a threshold is exceeded
  • Moves eviction out of the lock-holding sessions
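Below is an illustrative sketch of the CLOCK approximation of LRU over cached directories: each access sets a reference bit, and the clock hand gives referenced entries a second chance before evicting one whose bit is already clear. The class name, the per-directory granularity shown, and the simple locking are assumptions for illustration, not the prototype's actual code.

```java
import java.util.ArrayList;
import java.util.List;

class ClockCache {
  static final class Entry {
    final long dirInodeId;
    volatile boolean referenced;   // set on access, cleared by the clock hand
    Entry(long id) { this.dirInodeId = id; }
  }

  private final List<Entry> entries = new ArrayList<>();
  private int hand = 0;

  synchronized void add(long dirInodeId) { entries.add(new Entry(dirInodeId)); }

  synchronized void recordAccess(Entry e) { e.referenced = true; }

  // Called by the replacement thread once the cache passes its threshold:
  // sweep the hand, clear reference bits (second chance), and evict the
  // first directory whose bit is already clear.
  synchronized long evictOne() {
    if (entries.isEmpty()) return -1;
    while (true) {
      Entry e = entries.get(hand);
      if (e.referenced) {
        e.referenced = false;                       // second chance
        hand = (hand + 1) % entries.size();
      } else {
        entries.remove(hand);
        if (hand >= entries.size()) hand = 0;
        return e.dirInodeId;                        // caller drops it from memory
      }
    }
  }
}
```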

Page 13: August 2013 HUG: Removing the NameNode's memory limitation


Benchmark description
• NNThroughputBenchmark
  • No RPC cost; FileSystem methods are called directly
  • All operations are generated in BFS order
  • Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
  • Normal HDFS client calls
  • Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
  • Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
  • E.g., GridMix; expect little degradation when most of the work is data transfer

Page 14: August 2013 HUG: Removing the NameNode's memory limitation


Categories of tests
• Everything fits in memory
  • Goal: should be almost the same as the current NN
• Working set does not fit in memory or changes over time
  • Study various cache replacement policies
  • Need good traces from real clusters to see patterns of hot, warm, and cold data

Page 15: August 2013 HUG: Removing the NameNode's memory limitation


Experiment Setup
• Hardware description (Susitna)
  • CPU: AMD Opteron 6272, 64-bit, 16 MB L2, 16-core 2.1 GHz
  • SSD: Crucial M4-CT064M4SSD2, 64 GB, SATA 6.0 Gb/s
  • (In progress) Use disks in future experiments
  • Heap size is set to 1 GB
• NNThroughputBenchmark
  • No RPC cost; FileSystem methods are called directly
  • All operations are generated in BFS order
  • Multiple threads, but each thread gets one portion of the work
  • Each directory contains 100 subdirs and 100 files
  • Named sequentially: ThroughputBenchDir1, ThroughputBench1
• LevelDB NN
  • Cache monitor thread starts replacement when the cache is 90% full

Page 16: August 2013 HUG: Removing the NameNode's memory limitation


Create & close 2.4M files – all fit in cache

• Note: files are not accessed, but their parent dirs clearly are
• Note: the old NN and the LevelDB NN peak at different numbers of threads
• Degradation in peak throughput is 13.5%

Page 17: August 2013 HUG: Removing the NameNode's memory limitation


Create 9.6M files: 1% fits in cache
• Old NN with 8 threads and LevelDB NN with 16 threads
• Performance remains about the same using LevelDB
• The old Namenode's throughput drops to zero when memory is exhausted

Page 18: August 2013 HUG: Removing the NameNode's memory limitation


GetFileInfo
• ListStatus of the first 600K of 2.4M files
• Each thread works on a different part of the tree
• Original NN: all fit in memory (of course)
• LevelDB NN, 2 cases: (1) all fit in cache, (2) half fit in cache
• Half fit: 10%-20% degradation, since the cache is constantly replaced

[Figure: GetFileInfo throughput (ops/sec, 0-140,000) vs. number of threads (2-32) for the original NN, the all-in-cache case, and the half-in-cache case]

Page 19: August 2013 HUG: Removing the NameNode's memory limitation


Benchmarks that remain
• NNThroughputBenchmark
  • No RPC cost; FileSystem methods are called directly
  • All operations are generated in BFS order
  • Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
  • Normal HDFS client calls
  • Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
  • Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
  • E.g., GridMix; expect little degradation when most of the work is data transfer

Page 20: August 2013 HUG: Removing the NameNode's memory limitation


Summary
• Now that the NN is HA, removing the namespace memory limitation is one of the most important problems to solve
• LSM (LevelDB) has worked out quite well
  • Initial experiments have shown good results
  • Need further benchmarks, especially on how effective caching is for different workloads and patterns
  • Other LSM implementations? (e.g. HBase's Java LSM)
• Work is done on branch 0.23
  • Graduate-student-quality prototype (a very good graduate student)
  • But worked closely with the HDFS experts at Hortonworks
• The goal of the internship was to see how well the idea worked
  • Hortonworks plans to take this to the next stage once more experiments are completed.

Page 21: August 2013 HUG: Removing the NameNode's memory limitation


Q&A
• Contact: [email protected]
• We'd love to get trace stats from your cluster
  • Simple Java program to run against your audit logs
  • Can also run as MapReduce jobs
  • Extracts metadata operation stats without exposing sensitive info
• Please contact me if you can help!