
참여기관_발표자료-국민대학교 201301 정기회의 (Participating Institution Presentation: Kookmin University, January 2013 Regular Meeting)


Page 1: 참여기관_발표자료-국민대학교 201301 정기회의

Development of Core Source Technologies for a General-Purpose Operating System Capable of Reducing Energy Consumption by 30% or More. January Regular Meeting, Kookmin University. Young-Man Kim, Jae-Il Han

Page 2: 참여기관_발표자료-국민대학교 201301 정기회의

Contents

Research Activities

Part 1: Assigned Papers (3 papers)
1. A new model for the system and devices latency
2. Cross-Layer Frameworks for Constrained Power and Resources Management of Embedded Systems
3. Automatic Run-Time Selection of Power Policies for Operating Systems

Part 2: Kookmin University Research
4. Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power-Aware Paging Method
5. PPFS: A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash
6. Address Translation Technique for Large NAND Flash Memory using Page Level Mapping
7. Performance Optimization Techniques for Legacy File Systems on Flash Memory

Page 3: 참여기관_발표자료-국민대학교 201301 정기회의

1. Research Activities

Research Contents
- Reading papers related to energy saving
- Simulator analysis

Papers
- A new model for the system and devices latency
- Automatic Run-Time Selection of Power Policies for Operating Systems
- Cross-Layer Frameworks for Constrained Power and Resources Management of Embedded Systems
- Address Translation Technique for Large NAND Flash Memory
- Performance Optimization Techniques for Legacy File Systems on Flash Memory
- A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash
- An Energy Efficient Cache Design Using Spin Torque Transfer (STT) RAM
- Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power-Aware Paging Method

Page 4: 참여기관_발표자료-국민대학교 201301 정기회의

Part 1: Assigned Papers (3 papers)

Page 5: 참여기관_발표자료-국민대학교 201301 정기회의

1. A new model for the system and devices latency

Page 6: 참여기관_발표자료-국민대학교 201301 정기회의

WHAT IS LATENCY?
• "In a computer system, latency is often used to mean any delay or waiting that increases real or perceived response time beyond the response time desired."
• "Specific contributors to computer latency include mismatches in data speed between the microprocessor and input/output devices and inadequate data buffers."
• "Within a computer, latency can be removed or 'hidden' by such techniques as prefetching (anticipating the need for data input requests) and multithreading, or using parallelism across multiple execution threads."
• Source: http://searchciomidmarket.techtarget.com/definition/latency

Page 7: 참여기관_발표자료-국민대학교 201301 정기회의

TERMINOLOGY (Texas Instruments).
• Latency: time to react to an external event, e.g. time spent executing the handler code after an IRQ, or time spent executing driver code after an external wake-up event.
• HW latency: latency introduced by the HW to transition between power states.
• SW latency: time for the SW to execute low-power transition code, e.g. IP block save & restore, cache flush/invalidate, etc.
• System: 'everything needed to execute the kernel code', e.g. on OMAP3, system = CPU0 + CORE (main memory, caches, IRQ controller...).
• Per-device latency: latency of a device (or peripheral). The per-device PM QoS framework controls device states based on the allowed per-device latency.
• Cpuidle: framework that controls the CPU low-power states (C-states) based on the allowed system latency. Note: it is currently being abused to control the system state.
• PM runtime: framework that allows dynamic switching of resources.

Page 8: 참여기관_발표자료-국민대학교 201301 정기회의

HOW TO SPECIFY THE ALLOWED LATENCY.
• The PM QoS framework allows the kernel and user space to specify the allowed latency.
• The framework calculates the aggregated constraint value and calls the registered platform-specific handlers to apply the constraints at the lower level.
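As a concrete illustration (not from the slides): from user space, the documented PM QoS interface is a misc device, and a process registers a CPU DMA latency constraint by writing a 32-bit microsecond value to /dev/cpu_dma_latency and holding the file descriptor open. A minimal sketch:

```c
/* Minimal sketch of a user-space PM QoS latency constraint: write a 32-bit
 * microsecond bound to /dev/cpu_dma_latency and keep the descriptor open.
 * The constraint is released when the descriptor is closed. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t max_latency_us = 20;  /* example bound: 20 microseconds */
    int fd = open("/dev/cpu_dma_latency", O_RDWR);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) != sizeof(max_latency_us)) {
        perror("write");
        close(fd);
        return 1;
    }
    /* While fd stays open, cpuidle must not enter C-states whose exit
     * latency exceeds 20 us. Do latency-sensitive work here... */
    pause();   /* placeholder for real work */
    close(fd); /* releases the constraint */
    return 0;
}
```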

Page 9: 참여기관_발표자료-국민대학교 201301 정기회의

PM QoS FRAMEWORK.
• PM QoS is a framework developed by Intel.
• It allows kernel code and applications to set their requirements in terms of:
  - CPU DMA latency.
  - Network latency.
• According to these requirements, PM QoS allows kernel drivers to adjust their power management.
• See Documentation/power/pm_qos_interface.txt.
  - http://free-electrons.com/kerneldoc/latest/power/pm_qos_interface.txt
• Still in very early deployment (only 4 drivers in 2.6.36).

Page 10: 참여기관_발표자료-국민대학교 201301 정기회의

What is the key point of controlling the latency?
• The point is to dynamically optimize the power consumption of all system components.
• Knowing the allowed latency (from the constraints) and the expected worst-case latency makes it possible to choose the optimum power state (sketched below).
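A minimal sketch of that decision, with an invented state table (the names, latencies, and power numbers are illustrative, not OMAP data): pick the lowest-power state whose worst-case exit latency still fits the aggregated constraint.

```c
/* Hypothetical sketch: given the aggregated allowed latency (from PM QoS
 * constraints), pick the deepest power state whose worst-case exit latency
 * still fits. The state table values are invented for illustration. */
#include <stdio.h>

struct power_state {
    const char *name;
    int exit_latency_us;   /* worst-case HW + SW exit latency */
    int power_mw;          /* power drawn while in this state */
};

static const struct power_state states[] = {
    { "C0 (run)",       0,    500 },
    { "C1 (idle)",      2,    200 },
    { "C2 (retention)", 100,  50  },
    { "C3 (off)",       1500, 5   },
};

static const struct power_state *pick_state(int allowed_latency_us)
{
    const struct power_state *best = &states[0];
    for (size_t i = 1; i < sizeof(states) / sizeof(states[0]); i++)
        if (states[i].exit_latency_us <= allowed_latency_us &&
            states[i].power_mw < best->power_mw)
            best = &states[i];   /* deeper state that still meets the budget */
    return best;
}

int main(void)
{
    printf("allowed 150 us -> %s\n", pick_state(150)->name); /* C2 (retention) */
    return 0;
}
```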

Page 11: 참여기관_발표자료-국민대학교 201301 정기회의

OMAP.
• OMAP (Open Multimedia Applications Platform), developed by Texas Instruments, is a category of proprietary systems on chips (SoCs) for portable and mobile multimedia applications.
• OMAP devices generally include a general-purpose ARM architecture processor core plus one or more specialized co-processors.
• Earlier OMAP variants commonly featured a variant of the Texas Instruments TMS320 series digital signal processor.
• The OMAP family consists of three product groups classified by performance and intended application:
  - High-performance applications processors
  - Basic multimedia applications processors
  - Integrated modem and applications processors

Page 12: 참여기관_발표자료-국민대학교 201301 정기회의

OMAP.

[Figure: TI OMAP3530 on the BeagleBoard]

Page 13: 참여기관_발표자료-국민대학교 201301 정기회의

OMAP.

[Figure: TI OMAP4430 on the PandaBoard]

Page 14: 참여기관_발표자료-국민대학교 201301 정기회의

CURRENT MODEL.

Page 15: 참여기관_발표자료-국민대학교 201301 정기회의

CURRENT MODEL.

Page 16: 참여기관_발표자료-국민대학교 201301 정기회의

LATENCY FIGURE.

Page 17: 참여기관_발표자료-국민대학교 201301 정기회의

LATENCY FIGURE.

Page 18: 참여기관_발표자료-국민대학교 201301 정기회의

PROBLEM.
• There is no concept of 'overall latency'.
• No interdependency between PM frameworks.
  - Ex. on OMAP3: cpuidle manages only a subset of the power domains (MPU, CORE).
  - Ex. on OMAP3: per-device PM QoS manages the other power domains.
  - There is no relation between the frameworks; each framework has its own latency numbers.
• Some system settings are not included in the model.
  - Mainly because of the (lack of) SW support at the time of the measurement session.
  - Ex. on OMAP3: voltage scaling in low power modes, sys_clkreq, sys_offmode, and the interaction with the power IC.
• Dynamic nature of the system settings.
  - The measured numbers are for a fixed setup, with predefined system settings; the measured numbers are constant.

Page 19: 참여기관_발표자료-국민대학교 201301 정기회의

SOLUTION PROPOSAL.
• Overall latency calculation.
• We need a model which breaks down the overall latency into the latencies from every contributor:

  Latency = Latency_SW + Latency_HW
  Latency = Latency_SW + Latency_SoC + Latency_ExternalHW

• Latency_SW: time for the SW to save/restore the context of an IP block.
• Latency_SoC: time for the SoC HW to change an IP block state.
• Latency_ExternalHW: time to stop/restart external HW (e.g. an external crystal oscillator, an external power supply, ...).
• Note: every latency factor may be divided into smaller factors, e.g. on OMAP a DPLL can feed multiple power domains.

Page 20: 참여기관_발표자료-국민대학교 201301 정기회의

NEW MODEL.

Page 21: 참여기관_발표자료-국민대학교 201301 정기회의

3. Automatic Run-Time Selection of Power Policies for Operating Systems

2013-01-08, Jae-Yeol Lee

Page 22: 참여기관_발표자료-국민대학교 201301 정기회의

Problems
• Existing studies on power management make an implicit assumption:
  - Only one policy can be used to save power.
• Hence, those studies focus on finding the best policies for unique request patterns.

Page 23: 참여기관_발표자료-국민대학교 201301 정기회의

HAPPI (Homogeneous Architecture for Power Policy Integration)

• HAPPI is currently capable of supporting power policies for disk, DVD-ROM, and network devices.
• But it can easily be extended to support other I/O devices.
• Each policy must provide (see the sketch after this slide):
  - A function that predicts idleness and controls a device's power state.
  - A function that accepts a trace of device accesses, determines the actions the control function would take, and returns the energy consumption and access delay resulting from those actions.

Page 24: 참여기관_발표자료-국민대학교 201301 정기회의

HAPPI (Homogeneous Architecture for Power Policy Integration)

• If a policy is selected by HAPPI to manage the power state of a specific device, it is considered active.
• Each device is assigned only one active policy at any time.
• Whenever the device is accessed, HAPPI captures the size and time of the access.
• It also records the energy and delay for each device.

Page 25: 참여기관_발표자료-국민대학교 201301 정기회의

HAPPI (Homogeneous Architecture for Power Policy Integration)

• Policy Selection

Page 26: 참여기관_발표자료-국민대학교 201301 정기회의

Implementation
• Linux 2.6.5.
• Policies and evaluators are implemented as kernel modules.
• The experimental hardware is not fully ACPI compliant.
• So they implement a function that returns the power, transition energy, and transition delay for each state of each device.
• Policies need these values to compute the power consumed in each state.

Page 27: 참여기관_발표자료-국민대학교 201301 정기회의

Experiments
• Fujitsu laptop hard disk (HDD)
• Samsung DVD drive (DVD)
• NetXtreme integrated wired network card (NIC)

[Table: power states for the devices]

Page 28: 참여기관_발표자료-국민대학교 201301 정기회의

Experiments
• Workload
  1. Web browsing + buffered media playback from DVD
  2. Download video and buffered media playback from disk
  3. CVS checkout from remote repository
  4. E-mail synchronization + unbuffered media playback from DVD
  5. Kernel compile

Page 29: 참여기관_발표자료-국민대학교 201301 정기회의

Experiments
• Policies
  - Null
  - 2-competitive timeout
  - Exponential prediction
  - Adaptive timeout

Page 30: 참여기관_발표자료-국민대학교 201301 정기회의

Exponential Prediction
• Formulation: I_(n+1) = a * i_n + (1 - a) * I_n
• I_n: the last predicted value
• i_n: the latest idle period
• a: a constant attenuation factor in the range between 0 and 1
• If a = 0, then I_(n+1) = I_n
• If a = 1, then I_(n+1) = i_n
• So, typically a = 1/2 (see the sketch below)
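A runnable sketch of the prediction rule. With a = 1/2 and an initial prediction of 10, it reproduces the worked example on the next slide (predictions 10, 8, 6, 6, 5, 9, 11, 12 for idle periods 6, 4, 6, 4, 13, 13, 13); the initial value 10 is taken from that example.

```c
/* Exponential prediction of the next idle period:
 * I_(n+1) = a * i_n + (1 - a) * I_n. */
#include <stdio.h>

static double predict_next(double a, double last_idle, double last_prediction)
{
    return a * last_idle + (1.0 - a) * last_prediction;
}

int main(void)
{
    double idle[] = { 6, 4, 6, 4, 13, 13, 13 };  /* observed idle periods i_n */
    double prediction = 10;                      /* initial prediction I_0 */
    for (int n = 0; n < 7; n++) {
        printf("I_%d = %g, actual i_%d = %g\n", n, prediction, n, idle[n]);
        prediction = predict_next(0.5, idle[n], prediction);  /* a = 1/2 */
    }
    printf("I_7 = %g\n", prediction);            /* 12, as in the example */
    return 0;
}
```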

Page 31: 참여기관_발표자료-국민대학교 201301 정기회의

Exponential Prediction

Worked example (a = 1/2, initial prediction I_0 = 10):

Actual idle (i_n):      6   4   6   4   13  13  13  ...
Prediction (I_n):   10  8   6   6   5   9   11  12  ...

Page 32: 참여기관_발표자료-국민대학교 201301 정기회의

Experiments
• Result

[Table: estimated energy consumption for each policy on the devices, for workloads 1-5]

[Table: selected policies for the devices at each evaluation, for workloads 1-5]

Page 33: 참여기관_발표자료-국민대학교 201301 정기회의

Conclusion
• The experiments indicate that policy selection is highly adaptive to workload and hardware types, supporting the authors' claim that automatic policy selection is necessary to achieve better energy savings.

Page 34: 참여기관_발표자료-국민대학교 201301 정기회의

Part 2: Kookmin University Research

Page 35: 참여기관_발표자료-국민대학교 201301 정기회의

1. Performance Evaluation of Parallel Applications on Next Generation Memory Architecture with Power-Aware Paging Method

Page 36: 참여기관_발표자료-국민대학교 201301 정기회의

Problems
• This paper proposes a solution (an architecture and a low-power paging algorithm) to reduce energy consumption in HPC (High Performance Computing) systems.
• The goal is to demonstrate that a low-power paging algorithm can improve HPC performance and reduce energy consumption.

Page 37: 참여기관_발표자료-국민대학교 201301 정기회의

SOLUTION
• Replace a part of DRAM with MRAM.
• Conduct simulations to evaluate the performance and energy consumption of several application benchmarks.
• Make a trace file of the memory accesses in each application benchmark using the Valgrind profiling tool.
• For each memory access that incurs a miss, collect the memory address and profiling results, i.e. the access counts on all the memory pages.
• With the trace files, they replay the behavior of the application with their event-driven simulator.

Page 38: 참여기관_발표자료-국민대학교 201301 정기회의

HOW CAN THEY SOLVE IT?
• They propose a hybrid memory architecture and power-aware swapping.
• Use MRAM as main memory alongside DRAM, due to its higher access speed and low power consumption.
• Use FLASH as a fast random-access swap device, due to its faster random-access read speed.
• Use the MRAM hit rate and a threshold in the low-power paging algorithm to manage the swapping interaction between DRAM/MRAM and FLASH. This improves performance and reduces energy.

Page 39: 참여기관_발표자료-국민대학교 201301 정기회의

Proposition: Hybrid Memory Architecture and Power-Aware Swapping

[Figure: overview of the proposed low-power memory architecture. CPUs with L1 and L2 caches; main memory composed of MRAM (hotter pages, larger number of accesses) and DRAM (colder pages); FLASH as the swap device.]

Page 40: 참여기관_발표자료-국민대학교 201301 정기회의

Low-Power Paging Algorithm

[Figure 2: algorithmic flow of the proposed paging algorithm. Hot pages are allocated on MRAM according to the profiling result. While the application runs, each memory access that misses the L2 cache increments Memory Access (and MRAM Hit if the access lands on MRAM); MRAM Hit Rate = MRAM Hit / Memory Access. On a page fault, if MRAM Hit Rate > Threshold, the least recently used page on DRAM is swapped out; otherwise, the least recently used page on DRAM or MRAM is swapped out.]

- A trace file also includes profiling results, which are the access counts on all the memory pages.
- Profiling: the per-page memory access frequency of a given application throughout its execution.
- Obtained by a pre-execution trial or by sampling with HW assist.
- With the trace file, they replay the behavior of the application with their event-driven simulator.
- Memory access → L2 cache miss → collect profiling data.
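A hypothetical sketch of the victim-domain decision in this flow (all names are invented): track the MRAM hit rate and, on a page fault, take the LRU victim from DRAM only when locality is good, or from all of main memory when it is poor.

```c
/* Sketch of the low-power paging algorithm's swap decision. */
#include <stdbool.h>
#include <stdio.h>

struct paging_stats {
    unsigned long mram_hits;      /* accesses that landed on MRAM */
    unsigned long mem_accesses;   /* all accesses, counted on L2 cache misses */
};

/* Threshold from the slides: Thr = alpha * MRAM_SIZE / TOTAL_SIZE, alpha ~= 1. */
static double threshold(double alpha, double mram_size, double total_size)
{
    return alpha * mram_size / total_size;
}

/* Decide where the LRU victim comes from on a page fault: DRAM only when
 * locality is good (MRAM pages stay pinned), DRAM or MRAM otherwise. */
static bool swap_from_dram_only(const struct paging_stats *s, double thr)
{
    double hit_rate = (double)s->mram_hits / (double)s->mem_accesses;
    return hit_rate > thr;
}

int main(void)
{
    struct paging_stats s = { .mram_hits = 95, .mem_accesses = 100 };
    /* Illustrative sizes chosen so Thr = 0.9, the value the slides report
     * to work for NAS and other HPC applications. */
    double thr = threshold(1.0, 0.9, 1.0);
    printf("swap from %s\n", swap_from_dram_only(&s, thr) ? "DRAM only"
                                                          : "DRAM or MRAM");
    return 0;
}
```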

Page 41: 참여기관_발표자료-국민대학교 201301 정기회의

Why do they need that algorithm?
• The first, simple algorithm works as follows:
  - Pin down the hottest pages on MRAM so that they are never swapped out.
  - The remaining pages are allocated onto DRAM and use LRU-based swapping with flash memory.
• This simple algorithm in some cases increased application execution time under LRU swapping.
  - Excessive swaps slow down the application considerably.

Page 42: 참여기관_발표자료-국민대학교 201301 정기회의

Why do they need that algorithm?
• To resolve this situation, they extend the algorithm by introducing a metric called the MRAM hit rate and its threshold, so that applications exhibiting lower locality may use both MRAM and DRAM as swappable main memory.
• Thr = α × MRAM_SIZE / TOTAL_SIZE
• α (≈ 1) is a configurable parameter used to determine the threshold.
• Several preliminary experiments have shown that a threshold value of 0.9 seems to work for the NAS and other HPC applications.

Page 43: 참여기관_발표자료-국민대학교 201301 정기회의

CORE IDEAS OF THE LOW-POWER PAGING ALGORITHM.
• The MRAM hit rate is a dynamic value that indicates the ratio of the access count onto MRAM to the accesses to all of memory, at each point in execution time.
• If the ratio is large, we can decide that accesses to MRAM have sufficient locality that the pages should be pinned down.
• On the other hand, if the ratio is small, the application lacks locality, and thus the entire main memory should be treated as swappable.

Page 44: 참여기관_발표자료-국민대학교 201301 정기회의

CONCLUSION.
• Reducing DRAM capacity aggressively can reduce energy consumption, even with swapping.
• The energy consumption can be reduced to 25% by reducing DRAM capacity.

Page 45: 참여기관_발표자료-국민대학교 201301 정기회의

2. PPFS : A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash

Page 46: 참여기관_발표자료-국민대학교 201301 정기회의

NAND Flash Memory
• NAND flash memory structure
  - Page (2KB): read and write unit
  - Block (64 pages = 128KB): erase unit
• NAND flash memory, the beauty:
  - Non-volatility
  - Fast access time (no seek latency)
  - Low power consumption
  - Relatively large capacity
  - Shock resistance
• NAND flash memory, the beast:
  - Erase before write: a page must be erased first in order to update data on that page
  - Slow write: supports only page-level writes, about 10x slower than reads
  - Limited lifetime: guarantees 100K ~ 1M erase cycles

Ref) K9F1G08X0A datasheet

Page 47: 참여기관_발표자료-국민대학교 201301 정기회의

Features of PRAM
• Random-access memory
• Non-volatile memory
• Low leakage energy
• High density: 4x denser than DRAM
• Limited endurance

Source: Motoyuki Ooishi, Nikkei Electronics Asia, Oct. 2007

Page 48: 참여기관_발표자료-국민대학교 201301 정기회의

NAND flash memory vs. PRAM

                    PRAM^1          NOR             SLC NAND        MLC NAND^2
Volatility          Non-volatile    Non-volatile    Non-volatile    Non-volatile
Random access       Yes             Yes             No              No
Unit of write       Word (2 byte)   Word (2 byte)   Page (2 Kbyte)  Page (2 Kbyte)
Read speed          50 ns/word      100 ns/word     25 us/page      60 us/page
Write speed         5 us/word       11.5 us/word    200 us/page     800 us/page
Erase speed         N/A             0.7 s/64KB      2 ms/128KB      1.5 ms/128KB
Program endurance   10^8            10^5            10^6            10^5
Size                32 MByte        32 MByte        ~1 GB           4 GB+
Others              -               -               Serial program  Serial program, paired page damage

1. KPS5615EZM data sheet, 2. K9G8G08U0M data sheet

Page 49: 참여기관_발표자료-국민대학교 201301 정기회의

JFFS2 (Journaling Flash File System)

• Developed by Red Hat in 2001
• Originally designed for NOR flash memory
• Supports data compression
  - Good for reducing total page writes
  - Additional computational overhead
• Log-structured file system
  - Any file system modification is appended to the log
• Scalability problem
  - Needs a full scan at mount time
  - Manages all metadata in main memory (directory structure, file indexing structure)

[Figure: JFFS scan area]

Ref. D. Woodhouse, "JFFS: The journaling flash file system," presented at the Ottawa Linux Symposium, 2001.

Page 50: 참여기관_발표자료-국민대학교 201301 정기회의

YAFFS2 (Yet Another Flash File System)

• Developed by Aleph One in 2003
• Designed specifically for NAND flash memory
  - Uses the spare region to store the file metadata
• Log-structured file system
  - Any file system modification is appended to the log
• Scalability problem
  - Needs to scan the entire spare region
• Reduced mounting time compared with JFFS2
  - Still manages all metadata in main memory (directory structure, file indexing structure)

[Figure: YAFFS scan area]

Ref. http://www.yaffs.net/

Page 51: 참여기관_발표자료-국민대학교 201301 정기회의

CFFS (Core Flash File System)
• Developed by CORE Lab in 2006
• Log-structured file system
  - Any file system modification is appended to the log
• Metadata separation
  - Metadata and data are written to different blocks in NAND flash
  - Scanning only the metadata blocks → reduced mounting time
• Stores the file indexing structure in NAND flash memory
  - Reduces main memory usage
  - Manages the directory structure in main memory
• CFFS limitations
  - Needs extra metadata write operations (updating the file index in NAND flash memory)
  - Wear-leveling problem (the metadata blocks are updated more frequently)

[Figure: CFFS scan area]

Ref. S. H. Lim and K. H. Park, "An efficient NAND flash file system for flash memory storage," IEEE Transactions on Computers, vol. 55, no. 7, pp. 906-912, 2006.

Page 52: 참여기관_발표자료-국민대학교 201301 정기회의

Previous flash file systems

JFFS2 [2001]
  - Features: LFS approach, data compression, node management
  - Pros: reliable
  - Cons: metadata update overhead, scalability problem, node management overhead

YAFFS2 [2003]
  - Features: LFS approach, use of the spare region
  - Pros: reduced mounting time
  - Cons: metadata update overhead, scalability problem

CFFS [2006]
  - Features: LFS approach, metadata separation, file indexing in NAND
  - Pros: reduced mounting time, reduced GC overhead
  - Cons: metadata update overhead, scalability problem remaining, extra write overhead, wear-leveling problem

Page 53: 참여기관_발표자료-국민대학교 201301 정기회의

Metadata update problems

[Figure: a small metadata update forces a write of 512 B or 2 KiB]

Page 54: 참여기관_발표자료-국민대학교 201301 정기회의

Scalability problems

1. Scan area comparison: (JFFS, YAFFS scan area) >> (CFFS scan area)
2. Use of main memory

Accessing a file '/dir/a.txt': Open("/dir/a.txt") → i-number → location of inode → location of data

Type of index                      JFFS, YAFFS            CFFS
1. Find i-number using path name   In-memory directory    In-memory directory
2. Find inode using i-number       In-memory inode map    In-memory inode map
3. Find file data                  In-memory file index   In-NAND file index

Page 55: 참여기관_발표자료-국민대학교 201301 정기회의

Solution for metadata updates

[Figure: only 2 bytes are written]

Page 56: 참여기관_발표자료-국민대학교 201301 정기회의

PFFS Scalability: Mounting time

• PFFS has a minimal, fixed mounting time
  - All metadata are reachable from the root directory in PRAM
  - PFFS does not need to scan the NAND flash memory

Scan area comparison: (JFFS, YAFFS scan area) >> (CFFS scan area) > (PFFS scan area)

Page 57: 참여기관_발표자료-국민대학교 201301 정기회의

PFFS Scalability: Memory use

• PFFS uses no DRAM main memory for its metadata structures
  - Most of the metadata structures of PFFS are contained in PRAM

Accessing a file '/dir/a.txt': Open("/dir/a.txt") → i-number → location of inode → location of data

Type of index                      JFFS, YAFFS            CFFS                   PFFS
1. Find i-number using path name   In-memory directory    In-memory directory    In-PRAM directory
2. Find inode using i-number       In-memory inode map    In-memory inode map    Simple calculation
3. Find file data                  In-memory file index   In-NAND file index     In-PRAM data pointers

Page 58: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• CPU: Samsung S3C2413 (ARM926EJ)
• Memory: 64MB DRAM
• 1GB MLC NAND, 32MB PRAM
• NAND flash memory characteristics
• Benchmark: PostMark
  - A benchmark for short-lived, small-file read/write performance
• Comparison with YAFFS2

Page 59: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation

Page 60: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation

Page 61: 참여기관_발표자료-국민대학교 201301 정기회의

Conclusion
• PFFS solves the scalability problems of previous flash file systems by using the hybrid architecture of PRAM and NAND flash memory.
• The mounting time and memory usage of PFFS are O(1).
• The performance of PFFS is 25% better than YAFFS2 for small file writes.

Page 62: 참여기관_발표자료-국민대학교 201301 정기회의

3. Address Translation Technique for Large NAND Flash Memory using Page Level Mapping

Page 63: 참여기관_발표자료-국민대학교 201301 정기회의

Problems
• In a page-level mapping scheme, data can be relocated at page granularity.
• But the disadvantage is the large size of the mapping table.
  - Ex) For a 64GB SSD:
    - With block-level mapping, the mapping table is 512KB.
    - With page-level mapping, the mapping table is 64MB.
• So most actual commercial SSDs use a hybrid scheme based on block-level mapping.

Page 64: 참여기관_발표자료-국민대학교 201301 정기회의

Page-level mapping address translation technique
• The entire mapping table is maintained in NAND.
• Frequently used parts of the mapping table are cached in DRAM.
• Uses an FTL-TLB and an FTL mapping directory structure.

Page 65: 참여기관_발표자료-국민대학교 201301 정기회의

Page table management in a demand-paging memory system

• Using page-level mapping in NAND flash memory is similar to using a demand-paging scheme in the memory system.

Page 66: 참여기관_발표자료-국민대학교 201301 정기회의

FTL-TLB
• Manages the mapping table in sections.
• A section stores the mapping table for one NAND flash block.
  - Ex) If a block has 128 pages, the size of a section is 128 × 4B = 512B.
• The number of sections equals the number of blocks in NAND.

Page 67: 참여기관_발표자료-국민대학교 201301 정기회의

FTL Mapping Directory
• The FTL mapping directory is allocated in DRAM.
• The FTL mapping directory has one entry per section, recording:
  - whether the section is cached in the FTL-TLB;
  - whether the copy in NAND is up to date.

A sketch of the resulting lookup path follows.
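A hypothetical, self-contained sketch of the lookup (all names, sizes, and the replacement policy are invented for illustration; the paper evaluates LRU-family policies, whereas this toy uses round-robin):

```c
/* Section-based address translation: the page mapping table is split into
 * per-block sections of 128 entries (128 * 4B = 512B each); the FTL mapping
 * directory says whether a section is cached in the DRAM-resident FTL-TLB. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGES_PER_BLOCK 128
#define NUM_SECTIONS    16   /* toy device: 16 blocks */
#define TLB_SLOTS       4    /* toy FTL-TLB: 4 cached sections */

static uint32_t nand_table[NUM_SECTIONS][PAGES_PER_BLOCK]; /* table "in NAND" */
static uint32_t tlb[TLB_SLOTS][PAGES_PER_BLOCK];           /* cached sections */
static int tlb_owner[TLB_SLOTS] = { -1, -1, -1, -1 };      /* section per slot */
static struct { bool cached; int slot; } dir[NUM_SECTIONS];/* mapping directory */
static int next_victim;                                    /* round-robin hand */

static uint32_t translate(uint32_t lpn)
{
    uint32_t section = lpn / PAGES_PER_BLOCK;
    uint32_t offset  = lpn % PAGES_PER_BLOCK;

    if (!dir[section].cached) {                  /* FTL-TLB miss */
        int slot = next_victim;
        next_victim = (next_victim + 1) % TLB_SLOTS;
        if (tlb_owner[slot] >= 0)                /* evict the previous section */
            dir[tlb_owner[slot]].cached = false;
        memcpy(tlb[slot], nand_table[section], sizeof(tlb[slot])); /* NAND read */
        tlb_owner[slot] = (int)section;
        dir[section].cached = true;
        dir[section].slot = slot;
    }
    return tlb[dir[section].slot][offset];       /* FTL-TLB hit path */
}

int main(void)
{
    /* Fill the toy mapping table with an arbitrary logical-to-physical map. */
    for (uint32_t s = 0; s < NUM_SECTIONS; s++)
        for (uint32_t p = 0; p < PAGES_PER_BLOCK; p++)
            nand_table[s][p] = s * PAGES_PER_BLOCK + p + 1000;

    printf("LPN 300 -> PPN %u\n", translate(300));  /* miss, then cached */
    printf("LPN 301 -> PPN %u\n", translate(301));  /* hit in the FTL-TLB */
    return 0;
}
```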

Page 68: 참여기관_발표자료-국민대학교 201301 정기회의

Architecture

[Figure: FTL-TLB and FTL mapping directory architecture]

Page 69: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• Workload
  - The target is a 64GB SSD.
  - Uses VirtualBox with a 64GB HDD and Windows XP.
  - Collects access traces.
  - Daily_usage and Multi_program are typical usage environments.
  - Install_update is a Windows update plus program installation.
  - Large_file is copying large files.

Trace            Requests    Data size [MB]
Daily_usage      545031      10270.78
Multi_program    309262      3070.669
Install_update   1022856     14072.22
Large_file       45593       2810.333

Page 70: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• Replacement algorithms
  - OPT (the optimal algorithm)
  - LRU
  - LRFU (Least Recently/Frequently Used)
  - LRU2 (LRU-K)
  - LIRS (Low Inter-reference Recency Set)
  - CFLRU (Clean-First LRU)
  - LRU-WSR (LRU-Write Sequence Reordering)

Page 71: 참여기관_발표자료-국민대학교 201301 정기회의

CFLRU (Clean-First LRU)
• If all page frames hold clean pages, or all hold dirty pages, CFLRU behaves the same as LRU.
• Otherwise, when the frames hold a mix of clean and dirty pages:
  - CFLRU divides the LRU list into two regions.
  - The working region consists of recently used pages; most cache hits are generated in this region.
  - The clean-first region consists of pages that are candidates for eviction.
  - CFLRU first selects a clean page to evict from the clean-first region.
  - If there is no clean page in this region, the dirty page at the end of the LRU list is evicted.

Page 72: 참여기관_발표자료-국민대학교 201301 정기회의

CFLRU (Clean-First LRU)

Worked example (C: clean page, D: dirty page; list shown MRU → LRU; the first half is the working region, the second half the clean-first region):

Initial:   P1:5(D)  P2:2(C)  P3:3(D)  P4:7(C) | P5:1(C)  P6:4(D)  P7:6(C)  P8:8(D)
Access 9:  evict P7 (the clean page nearest the LRU end in the clean-first region)
Access 10: evict P5
Access 11: evict P4
Access 12: evict P2

Page 73: 참여기관_발표자료-국민대학교 201301 정기회의

LRU-WSR (LRU-Write Sequence Reordering)

• Two concepts: the cold dirty page and the cold flag.
• If a page is dirty and its cold flag is set, the page is regarded as a cold dirty page.
• LRU-WSR uses a page list L and one additional flag per page, the cold flag; a worked example follows.

Page 74: 참여기관_발표자료-국민대학교 201301 정기회의

LRU-WSR (LRU-Write Sequence Reordering)

Worked example (C: clean, D: dirty, Cf: cold flag; list shown MRU → LRU):

Initial:   P1:5(D,Cf=0)  P2:2(C,Cf=0)  P3:3(D,Cf=0)  P4:7(C,Cf=0)  P5:1(C,Cf=0)  P6:4(D,Cf=1)  P7:6(C,Cf=0)  P8:8(D,Cf=0)
Access 9:  P8 is dirty with Cf=0, so its cold flag is set and it gets a second chance; the clean page P7 is evicted.
Access 10: evict P6 (cold dirty page at the LRU end)
Access 11: evict P5 (clean)
Access 12: evict P4 (clean)
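A minimal sketch of LRU-WSR victim selection (names invented, not the paper's code): a clean page or a cold dirty page at the LRU end is evicted, while a dirty page seen there for the first time gets its cold flag set and a second chance at the MRU end.

```c
/* LRU-WSR victim selection with second chances for dirty pages. */
#include <stdbool.h>
#include <stdio.h>

struct wsr_frame { int page; bool dirty; bool cold; };

/* list[0] = MRU ... list[n-1] = LRU; returns the index of the frame to evict
 * after rotating second-chance pages to the MRU end. */
static int pick_victim_wsr(struct wsr_frame *list, int n)
{
    for (;;) {
        struct wsr_frame tail = list[n - 1];
        if (!tail.dirty || tail.cold)
            return n - 1;                 /* clean or cold dirty: evict */
        tail.cold = true;                 /* first warning: set the cold flag */
        for (int i = n - 1; i > 0; i--)   /* move the page to the MRU end */
            list[i] = list[i - 1];
        list[0] = tail;
    }
}

int main(void)
{
    /* The slide's initial state: 5(D) 2(C) 3(D) 7(C) 1(C) 4(D,Cf=1) 6(C) 8(D). */
    struct wsr_frame list[8] = {
        {5, true, false}, {2, false, false}, {3, true, false}, {7, false, false},
        {1, false, false}, {4, true, true},  {6, false, false}, {8, true, false},
    };
    int v = pick_victim_wsr(list, 8);
    /* Prints 6: P8 gets a second chance, then P7 (page 6) is evicted,
     * matching "Access 9" in the trace above. */
    printf("evict page %d\n", list[v].page);
    return 0;
}
```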

Page 75: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• Cache hit ratio

[Figures: cache hit ratio for Daily_usage, Multi_program, Install_update, and Large_file]

The cache hit ratio is over 95% in all cases except Large_file.

Page 76: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• Overhead

Page 77: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• Overhead

[Figures: overhead for Daily_usage, Multi_program, Install_update, and Large_file]

In most workloads, when the cache size is more than 512KB, the overhead is less than 2%.

Page 78: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• Memory usage
  - A 64GB SSD has 131072 blocks of 512 KB each.
  - An entry in the FTL mapping directory uses 6B.
  - So the directory size is 131072 × 6B = 768KB.

             Page mapping table   512KB FTL-TLB    1024KB FTL-TLB
64 GB SSD    64MB                 1280KB (1.9%)    1792KB (2.7%)

(The FTL-TLB columns include the 768KB mapping directory: 512KB + 768KB = 1280KB, 1024KB + 768KB = 1792KB.)

Page 79: 참여기관_발표자료-국민대학교 201301 정기회의

Conclusion
• Although the FTL-TLB uses only 512KB, the cache hit ratio is over 90%.

• The cache overhead is under 2%.

• Memory usage is only 1.9% of the full mapping table.

Page 80: 참여기관_발표자료-국민대학교 201301 정기회의

4. Performance Optimization Techniques for Legacy File Systems on Flash Memory

Page 81: 참여기관_발표자료-국민대학교 201301 정기회의

Problems
• There is no prior research on file system optimization for flash.

• Legacy cluster allocation schemes designed for hard disks are not suitable.

• A hard disk can update data in place.

• But flash cannot.

Page 82: 참여기관_발표자료-국민대학교 201301 정기회의

Solutions
• AFCA (Anti-Fragmentation Cluster Allocation)
  - A new fragmentation model for flash.
• Data invalidation scheme
  - If data is no longer used, the file system notifies the FTL, reducing unnecessary overhead.

Page 83: 참여기관_발표자료-국민대학교 201301 정기회의

AFCA (Anti-Fragmentation Cluster Allocation)
• File fragmentation
  - N: the minimum number of logical blocks needed to store the file.
  - n: the number of logical blocks actually used.
  - If n > N, the file is fragmented.
• Free space fragmentation
  - M: the minimum number of logical blocks that could contain the free space.
  - m: the number of logical blocks over which the free space is actually spread.
  - If m > M, the free space is fragmented.
A small sketch of this test follows.
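A minimal sketch of the fragmentation test with an assumed logical-block geometry (the cluster count per block and the inputs are invented for illustration):

```c
/* Fragmentation test: a file needing N logical blocks but spread over n is
 * fragmented when n > N; the same test applies to free space (m > M). */
#include <stdbool.h>
#include <stdio.h>

#define CLUSTERS_PER_BLOCK 64   /* assumed logical block geometry */

/* Minimum blocks needed to hold `clusters` clusters (ceiling division). */
static int min_blocks(int clusters)
{
    return (clusters + CLUSTERS_PER_BLOCK - 1) / CLUSTERS_PER_BLOCK;
}

static bool is_fragmented(int clusters, int blocks_actually_used)
{
    return blocks_actually_used > min_blocks(clusters);
}

int main(void)
{
    /* A 100-cluster file fits in 2 blocks; spreading it over 3 blocks
     * means it is fragmented. */
    printf("file fragmented: %s\n", is_fragmented(100, 3) ? "yes" : "no");
    return 0;
}
```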

Page 84: 참여기관_발표자료-국민대학교 201301 정기회의

AFCA (Anti-Fragmentation Cluster Allocation)

[Figure: fragmentation examples, e.g. a file with N = 2, n = 3 is fragmented, while N = 2, n = 2 is not; free space with M = 4, m = 5 or M = 2, m = 3 is fragmented, while M = 2, m = 2 is not.]

Page 85: 참여기관_발표자료-국민대학교 201301 정기회의

AFCA (Anti-Fragmentation Cluster Allocation)

[Figure: cluster allocation under basic cluster allocation (BCA) vs. AFCA]

Page 86: 참여기관_발표자료-국민대학교 201301 정기회의

AFCA (Anti-Fragmentation Cluster Allocation)
• Considerations
  - If a file is larger than a logical block, allocate it whole logical blocks at a time; this reduces file fragmentation.
  - Only after all clusters in a block are allocated is the next logical block allocated; this reduces free space fragmentation.
  - A file is initially considered a small file; if it exceeds the threshold, it is considered a large file.

Page 87: 참여기관_발표자료-국민대학교 201301 정기회의

AFCA (Anti-Fragmentation Cluster Allocation)
• Free logical blocks (F-logical blocks)
  - All clusters in the logical block are in the unused state.
• Logical blocks for small files (S-logical blocks)
• Logical blocks for large files (L-logical blocks)

Page 88: 참여기관_발표자료-국민대학교 201301 정기회의

AFCA (Anti-Fragmentation Cluster Allocation)

[Figure: logical block state transitions. An F-logical block becomes an S-logical block when clusters are allocated for a small file, and an L-logical block when clusters are allocated for a large file. Returning the entire cluster set turns an S- or L-logical block back into an F-logical block; an L-logical block may also return some of its clusters.]

Page 89: 참여기관_발표자료-국민대학교 201301 정기회의

Data invalidation scheme
• If a sector is no longer used, the file system notifies the FTL.

• The FTL marks the sector as invalid data in the page mapping table.
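A hypothetical sketch of that hook (all names are invented): when the file system frees a sector, the FTL marks the corresponding mapping entry invalid so the garbage collector will not copy the stale data during a block erase.

```c
/* Data invalidation: drop the logical-to-physical mapping for a freed sector. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PAGES   4096
#define INVALID_PPN UINT32_MAX

static uint32_t page_map[NUM_PAGES];  /* toy logical-to-physical page table */

/* Called by the file system when a sector is no longer in use. */
static void ftl_invalidate(uint32_t lpn)
{
    page_map[lpn] = INVALID_PPN;      /* GC skips pages mapped as invalid */
}

int main(void)
{
    page_map[42] = 7;                 /* pretend LPN 42 maps to PPN 7 */
    ftl_invalidate(42);               /* file deleted: drop the mapping */
    printf("LPN 42 valid: %s\n", page_map[42] == INVALID_PPN ? "no" : "yes");
    return 0;
}
```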

Page 90: 참여기관_발표자료-국민대학교 201301 정기회의

Data invalidation scheme

Page 91: 참여기관_발표자료-국민대학교 201301 정기회의

Data invalidation scheme

Page 92: 참여기관_발표자료-국민대학교 201301 정기회의

Evaluation
• Uses Ext2, kernel 2.4, and a NAND flash emulator.

• Page size is 2KB.

• Block size is 128KB.

• The FTL is Z-FTL, based on block mapping.

Page 93: 참여기관_발표자료-국민대학교 201301 정기회의

Result

[Figure: results for BCA vs. AFCA]

Page 94: 참여기관_발표자료-국민대학교 201301 정기회의

Conclusion
• With AFCA:
  - Fragmentation is reduced by up to 53%.
  - Performance is improved by up to 46%.
• With data invalidation:
  - Write performance is improved by up to 22%.

Page 95: 참여기관_발표자료-국민대학교 201301 정기회의

Thanks

Q & A