REMOVING THE NAMENODE'S MEMORY LIMITATION
Lin Xiao
Intern @ Hortonworks
PhD student @ Carnegie Mellon University

About Me: Lin Xiao
• PhD student at CMU
  • Advisor: Garth Gibson
  • Thesis area: scalable distributed file systems
• Intern at Hortonworks
  • Intern project: removing the Namenode memory limitation
• Email: [email protected]

Big Data
• We create 2.5×10^18 bytes of data per day [IBM]
  • Sloan Digital Sky Survey: 200 GB/night
  • Facebook: 240 billion photos as of Jan 2013
    • 250 million photos uploaded daily
• Cloud storage
  • Amazon: 2 trillion objects, peak of 1.1 million ops/sec
• Need scalable storage systems
  • Scalable metadata <- focus of this presentation
  • Scalable storage
  • Scalable IO

Scalable Storage Systems
• Separate data and metadata servers
• More data nodes for higher throughput & capacity
• Bulk of the work (the IO path) is done by the data servers
• Not much work added to the metadata servers?

Federated HDFS
• Namenodes (MDS) each see their own namespace (NS)
• Each datanode can serve all namenodes

Single Namenode
• Stores all metadata in memory
  • Design is simple
  • Provides low-latency, high-throughput metadata operations
  • Supports up to 3K data servers
• Hadoop clusters make it affordable to store old data
  • Cold data is stored in the cluster for a long time
  • It takes up memory space but is rarely used
  • Data size can grow faster than throughput demands
• Goal: remove the space limit while maintaining similar performance

Metadata in the Namenode
• Namespace
  • Stored as a linked tree structure of inodes
  • Every operation walks the tree from the top
• Block map: block_id to location mapping
  • Handled separately because of the huge number of blocks
• Datanode status
  • IP address, capacity, load, heartbeat status, block report status
• Leases
• The namespace and block map use the majority of memory
  • This talk will focus on the namespace

Problem and Proposed Solution
• Problem:
  • Remove the namespace limit while maintaining similar performance when the working set fits in memory
• Solution:
  • Retain the same namespace tree structure
  • Store the namespace in a persistent store using an LSM tree (LevelDB)
    • No separate edit logs or checkpoints
    • All inodes and their updates are persisted via LevelDB
  • Fast startup, at the cost of slow initial operations
    • Inodes could be prefetched to mitigate this
  • Do not expect customers to drastically reduce the actual heap size
    • A larger heap eases the transition between different working sets as applications and workloads change
    • A customer may occasionally run queries against cold data

New Namenode Architecture
• Namespace
  • Same as before, but only part of the tree is in memory
  • On a cache miss, read from LevelDB
• Edit logs and checkpoints are replaced by LevelDB
  • LevelDB is updated for every inode change
  • Key: <parent_inode_number + name> (see the sketch below)
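
One possible encoding of this key, as a minimal sketch: it assumes 8-byte big-endian inode ids (the slides do not specify the exact on-disk layout), so that all children of a directory sort next to each other in LevelDB's ordered key space and can be fetched with a single range scan.

    // Hypothetical key layout: fixed-width parent inode id followed by the child name.
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    final class InodeKey {
        /** Builds a LevelDB key from <parent inode number + name>. */
        static byte[] of(long parentInodeId, String childName) {
            byte[] name = childName.getBytes(StandardCharsets.UTF_8);
            return ByteBuffer.allocate(Long.BYTES + name.length)
                    .putLong(parentInodeId)   // big-endian prefix keeps siblings contiguous
                    .put(name)                // child name distinguishes entries within a directory
                    .array();
        }
    }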

Comparison with Traditional File Systems
• Traditional file systems
  • The VFS layer keeps inode and directory entry caches
  • Goal is to support the workload of a single machine
    • Relatively large number of files
    • Serve applications from a single machine, or in the case of NFS from a larger number of client machines
    • Much, much smaller workload and size compared to Hadoop use cases
• LevelDB-based Namenode
  • Supports the very large traffic of a Hadoop cluster
  • Keeps a much larger number of inodes in memory
  • Cache replacement policies to suit the Hadoop workload
  • Data is on the Datanodes

LevelDB
• A fast key-value storage library written at Google
• Basic operations: get, put, delete
• Concurrency: single process with multiple threads
• By default, writes are asynchronous
  • As long as the machine doesn't crash, data is safe
• Supports synchronous writes
  • No separate sync() operation; it can be emulated with a synchronous write/delete
• Supports batch updates
• Data is automatically compressed using Snappy
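
A minimal sketch of these operations using the pure-Java iq80 LevelDB port (the prototype itself may use a different Java binding of LevelDB; keys and values here are placeholders):

    import org.iq80.leveldb.*;
    import static org.iq80.leveldb.impl.Iq80DBFactory.*;

    import java.io.File;

    public class LevelDbDemo {
        public static void main(String[] args) throws Exception {
            Options options = new Options().createIfMissing(true);
            try (DB db = factory.open(new File("namespace-db"), options)) {
                // Asynchronous write (the default): lost only if the machine crashes.
                db.put(bytes("key-a"), bytes("inode-bytes"));

                // Synchronous write: flushed to durable storage before returning.
                db.put(bytes("key-b"), bytes("inode-bytes"), new WriteOptions().sync(true));

                byte[] value = db.get(bytes("key-a"));   // point lookup
                System.out.println(asString(value));
                db.delete(bytes("key-a"));               // remove a key

                // Batch update: several mutations applied atomically.
                try (WriteBatch batch = db.createWriteBatch()) {
                    batch.put(bytes("key-c"), bytes("inode-bytes"));
                    batch.delete(bytes("key-b"));
                    db.write(batch);
                }
            }
        }
    }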

Cache Replacement Policy
• Only whole directories are brought into or evicted from the cache
  • Hot directories are entirely in cache; others require a LevelDB scan
  • Future: don't cache very large directories?
  • No need to read from disk to check file existence
• LRU replacement policy
  • Approximated with CLOCK to reduce cost (see the sketch below)
• Separate thread for cache replacement
  • Starts replacement when a threshold is exceeded
  • Moves eviction out of the lock-holding request sessions
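
An illustrative sketch of CLOCK approximating LRU over cached directories; the class and field names below are hypothetical, not taken from the prototype:

    import java.util.ArrayList;
    import java.util.List;

    class ClockDirCache {
        static final class CachedDir {
            final long inodeId;
            volatile boolean referenced;   // set on access, cleared by the clock hand
            CachedDir(long inodeId) { this.inodeId = inodeId; }
        }

        private final List<CachedDir> dirs = new ArrayList<>();
        private int hand = 0;              // current position of the clock hand

        void add(CachedDir dir) { dirs.add(dir); }

        /** Called on every namespace access that touches this directory. */
        void touch(CachedDir dir) { dir.referenced = true; }

        /** Picks a victim: clears the bit of recently used dirs, evicts the first untouched one. */
        CachedDir selectVictim() {
            while (true) {                 // assumes the cache is non-empty
                CachedDir candidate = dirs.get(hand);
                hand = (hand + 1) % dirs.size();
                if (candidate.referenced) {
                    candidate.referenced = false;   // second chance
                } else {
                    return candidate;               // cold since the last sweep: evict it
                }
            }
        }
    }

The background replacement thread would call selectVictim() repeatedly until the cache drops back below the threshold, so eviction work stays off the request path.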

Benchmark description
• NNThroughputBenchmark
  • No RPC cost; calls FileSystem methods directly
  • All operations are generated in BFS order
  • Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
  • Normal HDFS client calls
  • Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
  • Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
  • E.g. GridMix; expect little degradation when most of the work is data transfer

Categories of tests
• Everything fits in memory
  • Goal: should be almost the same as the current NN
• Working set does not fit in memory, or changes over time
  • Study various cache replacement policies
  • Need to get good traces from real clusters to see patterns of hot, warm, and cold data

Experiment Setup
• Hardware description (Susitna)
  • CPU: AMD Opteron 6272, 64-bit, 16 MB L2, 16 cores @ 2.1 GHz
  • SSD: Crucial M4-CT064M4SSD2, 64 GB, SATA 6.0 Gb/s
  • (In progress) use disks in future experiments
• Heap size is set to 1 GB
• NNThroughputBenchmark
  • No RPC cost; calls FileSystem methods directly
  • All operations are generated in BFS order
  • Multiple threads, but each thread gets one portion of the work
  • Each directory contains 100 subdirectories and 100 files
    • Named sequentially: ThroughputBenchDir1, ThroughputBench1
• LevelDB NN
  • Cache monitor thread starts replacement when the cache is 90% full

Create & close 2.4M files (all fit in cache)
• Note: the files are not accessed, but their parent directories clearly are
• Note: the old NN and the LevelDB NN peak at different numbers of threads
• Degradation in peak throughput is 13.5%

Create 9.6M files: 1% fits in cache
• Old NN with 8 threads and LevelDB NN with 16 threads
• Performance remains about the same using LevelDB
• The original Namenode's throughput drops to zero when memory is exhausted

GetFileInfo
• ListStatus of the first 600K of 2.4M files (client-side equivalents of these calls are sketched below the chart)
• Each thread works on a different part of the tree
• Original NN: everything fits in memory (of course)
• LevelDB NN, 2 cases: (1) all fit in cache, (2) half fit
  • Half fit: 10%-20% degradation; the cache is constantly being replaced
[Chart: throughput (ops/sec, 0 to 140,000) vs. number of threads (2, 4, 8, 16, 32) for Original, FitCache, and HalfInCache]
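
A hedged sketch of the client-side calls these operations correspond to (the benchmark itself skips RPC and invokes the Namenode methods directly; the paths below are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GetFileInfoExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // FileSystem.getFileStatus() is backed by the Namenode's getFileInfo operation.
            FileStatus status = fs.getFileStatus(new Path("/ThroughputBenchDir1/ThroughputBench1"));
            System.out.println(status.getPath() + " len=" + status.getLen());

            // listStatus returns the status of every child of a directory, which is
            // why caching whole directories matters for this workload.
            for (FileStatus child : fs.listStatus(new Path("/ThroughputBenchDir1"))) {
                System.out.println(child.getPath());
            }
        }
    }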

Benchmarks that remain
• NNThroughputBenchmark
  • No RPC cost; calls FileSystem methods directly
  • All operations are generated in BFS order
  • Each thread gets one portion of the work
• NN load generator using the YCSB++ framework (in progress)
  • Normal HDFS client calls
  • Each thread either works in its own namespace or chooses randomly
• Load generator based on real cluster traces (in progress)
  • Can you help me get traces from your cluster?
• Traditional Hadoop benchmarks (in progress)
  • E.g. GridMix; expect little degradation when most of the work is data transfer

Summary
• Now that the NN is HA, removing the namespace memory limitation is one of the most important problems to solve
• LSM (LevelDB) has worked out quite well
  • Initial experiments have shown good results
  • Need further benchmarks, especially on how effective caching is for different workloads and patterns
  • Other LSM implementations? (e.g. HBase's Java LSM)
• Work was done on branch 0.23
  • Graduate-student-quality prototype (a very good graduate student)
  • But worked closely with the HDFS experts at Hortonworks
• The goal of the internship was to see how well the idea worked
  • Hortonworks plans to take this to the next stage once more experiments are completed

Q&A
• Contact: [email protected]
• We'd love to get trace stats from your cluster
  • A simple Java program to run against your audit logs (sketched below)
  • Can also run as MapReduce jobs
  • Extracts metadata operation stats without exposing sensitive info
• Please contact me if you can help!
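
The kind of summarizer the last bullets describe might look like the following minimal sketch; it only counts the cmd= field of HDFS audit log lines and never emits paths or user names. The tab-separated key=value layout is assumed and can differ between Hadoop versions, and this is not the actual program being offered.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;

    public class AuditOpCounter {
        public static void main(String[] args) throws IOException {
            Map<String, Long> opCounts = new TreeMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    for (String field : line.split("\t")) {
                        if (field.startsWith("cmd=")) {
                            // Keep only the operation name (open, create, listStatus, ...).
                            opCounts.merge(field.substring("cmd=".length()), 1L, Long::sum);
                        }
                    }
                }
            }
            // Print one line per metadata operation with its count; no file names are included.
            opCounts.forEach((op, count) -> System.out.println(op + "\t" + count));
        }
    }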