
Industrial Technology Research Institute, Information and Communications Research Laboratories: Academic Subcontract / Industry-Academia Collaboration Research Report

*****************************************************
*                                                   *
*   Mobile Operating System Virtualization          *
*   Energy-Efficient System Virtualization          *
*   for Mobile and Embedded Systems                 *
*                                                   *
*****************************************************

Contract number:
Project period: January 1, 2013 to December 31, 2013 (ROC year 102)
Principal investigator: 吳真貞  Signature:
☐ Interim report  ☑ Final report
Executing institution: Academia Sinica  Seal:

December 9, 2013 (ROC 102/12/09)


Abstract

As their functionality keeps growing, mobile devices such as smartphones have become one of the important tools in our daily lives. Most smartphones run only a single operating system, but in some situations a user may need a different one, for example to run an iOS application or game on a Windows Phone. Mobile virtualization is a technology that lets a phone or wireless device run multiple operating systems or virtual machines at the same time.

Virtualization can be achieved in two ways: with a hypervisor or with a microkernel. A microkernel is a highly modular architecture that retains only the most basic functions needed to run an operating system. Because it keeps only these basic functions, its footprint is extremely small, which makes it well suited to smartphones and embedded systems. Compared with a hypervisor, a microkernel offers higher security and stability. The main difference between the two is that a microkernel schedules the applications inside a VM as independent processes, whereas a hypervisor schedules whole VMs. Moreover, since modern processors allow each core to be switched on and off individually and to use a different voltage, a microkernel can place processes on suitable cores more effectively.

The purpose of this research is to propose a microkernel-based energy-saving solution for embedded systems and mobile devices. To achieve this goal, three parts of the microkernel may need to be modified: (1) the process scheduling algorithm; (2) the system resource allocation scheme; (3) data transfer between processes. The research focus of this phase of the project is the design and implementation of an energy-efficient process scheduling algorithm.

Keywords: microkernel, energy saving, scheduling, virtualization, embedded systems, mobile devices


ABSTRACT

As their functionality keeps expanding, mobile devices such as smartphones have become an important part of our daily lives. Most smartphones run only one OS. However, there are occasions that call for a different OS on a smartphone; for example, a user may want to play an Apple (iOS) game on a Windows Phone. Mobile virtualization is a technology that enables multiple operating systems or VMs to run simultaneously on a mobile phone or connected wireless device.

Virtualization can be implemented with two alternative approaches: a hypervisor or a microkernel. A microkernel is the near-minimum amount of software that can provide the mechanisms needed to implement an OS; its small size makes it a better choice for installation in smartphones or embedded systems. Compared to a hypervisor, a microkernel has higher security and stability. The main difference between a microkernel and a hypervisor is the scheduling unit: a microkernel schedules each process in every VM, while a hypervisor takes a whole VM as its scheduling unit.

Since new CPUs can turn off individual cores or adjust their core voltages, each core may have a different computing power. A microkernel can therefore make better scheduling arrangements according to the characteristics of each process and the computing ability of each core.

In this project, we aim to provide an energy-efficient microkernel solution for mobile devices. We propose three main research thrusts to address the energy-efficiency issue: (1) energy-aware process scheduling and consolidation, (2) resource allocation optimization, and (3) inter-process communication (IPC) optimization.

Keywords: Microkernel, Energy-efficient, Scheduling, Virtualization, Embedded System, Mobile Device


Table of Contents

1. Background, Objectives, and Significance of the Project
2. Problems to Be Solved
3. Research Methods and Procedures (Solution)
4. Conclusion (including results of project months 1-12)


1. Background, Objectives, and Significance of the Project

With the development of wireless communication technology, the smartphone has become an essential piece of personal equipment. As phone capabilities keep improving, many tasks that could previously be done only on a computer can now be completed on a phone, thanks to the growing number and increasing speed of the processors in these devices. The rapid growth of smartphones has led many vendors to invest heavily in their own mobile operating systems, such as Google's Android, Apple's iOS, and Microsoft's Windows Phone. This multitude of operating systems, however, also brings many problems and inconveniences: developers must write dedicated applications for each system, which complicates development, and users must buy different phones to run the applications or games of a particular system. Mobile operating system virtualization is an ideal solution to this problem.

Figure 1 Comparison between a general OS architecture and a microkernel

Microkernels and hypervisors are the two common operating system virtualization technologies today. A hypervisor works by encapsulating the operating system and emulating the hardware, so that an existing OS can run unmodified in the hypervisor environment. The hypervisor itself amounts to a complete operating system, containing a file system, a scheduler, virtual memory management, inter-process communication (IPC), device drivers, and so on. A microkernel, by contrast, is a highly modular architecture that retains only the most basic functions needed to run an operating system, such as thread management, address-space management, and IPC. Other services, such as device drivers, run in separate user space; if one of them fails, simply restarting that service restores operation. Compared with a hypervisor, which runs all components in privileged mode, a microkernel provides better security and stability. The clearest example is the change from Windows XP to Windows 7: the former adopted a hypervisor-like architecture that ran all OS components in privileged mode, so an error in any component (most often a device driver, the most error-prone part) would halt the whole system; the latter adopted a microkernel-style architecture that moved the failure-prone device drivers out to run in user space, which greatly improved system stability.

The microkernel has gone through three generations of evolution and is by now quite stable and mature. Its modular architecture also greatly reduces the memory required to run an operating system, making it better suited than a hypervisor for embedded environments.


2. Problems to Be Solved

Higher performance implies a higher demand for energy, but a phone is a device with a limited energy supply. As performance keeps rising, saving energy to extend the usable time of the device has become an important research topic.

The main goal of this project is to exploit the differences between a microkernel and a traditional hypervisor to build a system virtualization technology whose primary objective is energy saving, letting users run multiple virtual machines on a smartphone or embedded system through virtualization.

As technology advances, the processors in phones keep gaining computing power and features. Besides containing multiple cores, new processors give each core an independent voltage supply, so each core's voltage and clock can be adjusted or switched on and off individually to reduce power consumption. We want to combine the microkernel's ability to directly schedule the individual processes inside each virtual machine with this new per-core voltage and frequency capability of mobile processors, and develop an energy-efficient system virtualization technology that also uses system resources effectively, including a core scheduling algorithm aimed at saving energy.

3. Research Methods and Procedures

Figure 2 System architecture diagram

Figure 2 above shows the system architecture of the whole project. Our system is built on a multi-core ARM development board. We deploy the L4 microkernel on the board so that multiple different operating systems, such as Linux, Android, and Ubuntu, can run on it simultaneously, and we add an energy management module on top of the L4 microkernel. The energy management module contains (1) a core frequency tuner, (2) an energy-efficient scheduler, and (3) a profiler.

The project proceeds in two parts, system and algorithms. The system part mainly provides the information needed by the upper-layer algorithms and the ability to adjust processor energy consumption, and is divided into (1) building the microkernel platform and (2) implementing the per-core frequency tuning mechanism.


(1) Microkernel platform: We plan to use the Fiasco microkernel (uKernel) developed by TU Dresden (Technische Universität Dresden), Germany, as our basis, and port it to an ARM development board for our experiments.

(2) Per-core frequency tuning mechanism: Dynamic voltage and frequency scaling (DVFS) is a mechanism that dynamically trades processor performance against energy consumption. Initially it could only be applied to the processor as a whole, meaning the performance of all cores was adjusted together. In 2012, the Exynos 4 Quad quad-core processor for mobile devices extended the mechanism so that individual cores can be tuned, and individual cores can be dynamically turned on or off. With this mechanism, the cores in a system are no longer identical: some are high-performance and others low-power, so existing scheduling algorithms must be adapted to the new hardware architecture to improve energy efficiency. This task adds the per-core DVFS capability of the new processors to the microkernel, so that the scheduling algorithm can dynamically adjust or switch individual cores as needed.

For the algorithm part, achieving energy savings requires appropriate modifications to the microkernel. These modifications may involve (1) the process scheduling algorithm, (2) the system resource allocation scheme, and (3) data transfer between processes. This project focuses mainly on (1): energy-efficient scheduling and execution of virtual machines and processes. The method and steps for energy-efficient scheduling are described below.

To better understand the power consumption of smartphones and embedded systems, and thus make suitable modifications to the microkernel, we must first build an energy model of the device. Take the processor as an example: the energy consumed per second differs under different core voltages V, and when each core of a multi-core processor uses a different voltage, the overall energy consumption becomes even more complicated. Besides energy consumption, the voltage also affects the core clock, commonly called the 'frequency', which is the number of cycles that can be processed per unit time (second) and is tied to the core's performance.

As for the scheduling algorithm, the Completely Fair Scheduler (CFS) used by the current Linux kernel does not take energy consumption into account. This project will propose a new scheduling algorithm to replace CFS. Unlike past scheduling on hypervisors, which treats the VM as the scheduling unit, in a microkernel every application running in any VM is treated as an individual process waiting to be scheduled onto a core. Moreover, unlike past single-core and multi-core scheduling, each core in a modern multi-core processor can have its voltage adjusted or be switched on and off independently, so cores can have different computing capabilities. Scheduling must therefore consider not only the type of each process but also the capability of each core.

After the implementation is complete, we will run simulations, apply the system to a real platform for measurements, compare the two sets of numbers, and identify and overcome the gaps between theory and practice.


4. Conclusion

Currently, we have successfully ported the Fiasco microkernel and L4Linux to a Samsung Exynos4 Cortex-A9 quad-core ARM board. Since the ARM board does not actually support per-core frequency tuning, contrary to the claim on the Samsung official web site, we also ported the Fiasco microkernel and L4Linux to a quad-core x86 PC, which does support per-core DVS/DFS, as our tentative experiment platform. We also developed a prototype energy-efficient scheduling framework in the microkernel. We address energy-efficient scheduling in two contexts: off-line scheduling for computation jobs that come in batches, and on-line scheduling for environments in which tasks may arrive at different times and may have different workload characteristics. We propose mathematical models to formulate these scheduling problems, provide theoretical analysis and new theoretical findings, design optimal algorithms for the off-line scheduling problem on both single-core and multi-core platforms, and design an effective approximation algorithm for the on-line scheduling problem. We also report our simulation results and experiment results on the quad-core PC.

4.1 System Design and Implementation

In this section, we first give an overview of the system architecture and the interaction between the system components. We then describe the implementation details of the system.

4.1.1 System Architecture

Figure 1 System Architecture

Figure 1 gives an overview of our system architecture. There are three software layers. The bottom layer is the microkernel, which executes in kernel space (privileged mode); the middle layer consists of the components we developed; the top layer includes the user applications and virtual machines that execute on top of the microkernel. To achieve energy efficiency on a mobile platform, we add three components to the microkernel: a profiler, an energy-efficient scheduler, and a core frequency tuner.

The profiler collects application runtime performance information, identifies workload characteristics (such as interactive tasks, computation tasks, and background tasks), analyzes application requirements (such as the tolerable response times for short and long requests), and collects the workload of each computation task (such as the number of instructions to execute in each scheduling interval).

The energy-efficient scheduler decides the new scheduling policy based on the system status and current task workload collected by the profiler. The policy includes CPU core assignment for each task and task execution arrangement within each core (i.e., the execution order of the tasks on a core).

The core frequency tuner is a device driver which contains a table of supported frequencies and can change the frequency of CPU cores based on the policy generated by the scheduler.

The workflow of these components is as follows. First, the profiler collects application workload-related data; then the scheduler uses these data to decide a new scheduling policy, including the task assignment and the CPU frequency for executing each task. If CPU core frequency scaling is needed, the scheduler requests the frequency tuner to scale the core frequency up or down accordingly.
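To make this workflow concrete, the following minimal C sketch ties the three components together in a per-interval control loop. All type and function names here are hypothetical: the report describes the flow, not a concrete API.

/* Hypothetical glue code for the profiler -> scheduler -> tuner workflow. */
#define NUM_CORES 4

struct profile { int dummy; };               /* per-task workload data      */
struct policy {
    int freq[NUM_CORES];                     /* target frequency per core   */
    int needs_rescale;                       /* nonzero if tuning required  */
};

extern struct profile profiler_collect(void);                  /* assumed */
extern struct policy  scheduler_decide(struct profile p);      /* assumed */
extern void           tuner_set_frequency(int core, int freq); /* assumed */

void scheduling_tick(void)
{
    struct profile p = profiler_collect();   /* 1. collect workload data    */
    struct policy  s = scheduler_decide(p);  /* 2. decide a new policy      */
    if (s.needs_rescale)                     /* 3. rescale cores on demand  */
        for (int core = 0; core < NUM_CORES; core++)
            tuner_set_frequency(core, s.freq[core]);
}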

4.1.2 System Implementation

We first give a high-level description of the procedure to bring up the microkernel on ARM, to bring up L4Linux on the microkernel, and to tune the CPU frequency. An installation guide is included in this report as an appendix; more detailed guides and scripts will be included in the final report and the CD. It took us more than two months to get the microkernel system working, mainly because no such installation guide exists anywhere; we had to build everything from scratch. We believe this experience/guide may be helpful to other research groups.

This section presents our implementation on the Hardkernel ODROID-X2 development board, which contains a Samsung Exynos4412 with four ARM Cortex-A9 cores and 2 GB of memory. Developing kernel-level software on the ARM platform is not easy because ARM does not have a standard platform interface like x86. To make the kernel run on the board, we had to implement many very low-level system components, such as the timer, the interrupt controller, and the UART console driver. This implementation process is called 'porting'.

Based on our experience, a system developer needs to become familiar with the following before porting the microkernel to a target board. First, ARM is a fully memory-mapped architecture in which all system components and peripheral devices are controlled and accessed with normal memory instructions; this is very different from the x86 platform, which uses special instructions for system control and device access. The second is the system booting process, such as where to put the bootloader and the system component initialization sequence. The third is the target board's specification manual, which describes the system memory map and the meaning of each bit of each system component control register; the memory map gives the starting address and size of each control register block.

4.1.2.1 System Booting Process

This section describes the booting process of the ARM platform. After power-up, the processor loads its first instruction from internal SROM and checks the boot monitor, which records where to load the next-stage bootloader. Next, the processor loads the first stage of the U-Boot bootloader, which initializes the UART debug console and the SD card device and loads the second stage of U-Boot. The second-stage U-Boot then loads the microkernel, and the microkernel initializes the other CPUs, the L2 cache, and the MMU, starts the system services, and finally loads the first guest virtual machine.

Since ARM has no standard interface, we needed to implement the low-level components for the microkernel to run on the ODROID board. First we define the memory map that translates the physical address of each system component to a kernel-accessible virtual address, and then implement the essential device drivers, such as the timer, the UART console, and the interrupt controller. The interrupt controller and timer are the most important system components. Many pieces of hardware send interrupt signals to the CPU. When the CPU notices such a signal, it triggers a hardware interrupt: the CPU performs a context switch, saving some information about what it was doing, then jumps to the 'interrupt handler' associated with that particular hardware interrupt. For example, when a user types on a keyboard, the keyboard sends a key interrupt; the CPU then executes the code for the key interrupt, which typically displays a character on the screen or performs some task. The most important interrupt for the kernel is the 'timer interrupt', which is emitted at regular intervals by a timer chip. The timer interrupt is mainly used by the OS to control the system execution flow, to prevent a user process from holding the CPU indefinitely, and to context-switch to other processes.

Context switching is the procedure of storing the state of the active process so that the CPU can start executing another one. For example, process A, with its address space and stack, is currently being executed, and a system call requests a jump to a higher-priority process B; the CPU stores the current state of process A so that it can suspend A, begin executing B, and, when done, return to the previously executing process A. The details of each implemented component will be included in the appendix.

4.1.2.2 Building the L4 Microkernel

Figure 2 Architecture of L4 Microkernel

Figure 2 shows the architecture of the L4 microkernel system, which contains the Fiasco microkernel, L4RE, and the guest virtual machines. A microkernel is a simplified kernel that contains only the essential parts of an OS kernel, such as the timer and interrupt capture, and provides only the mechanisms for scheduling and page-table manipulation; the microkernel itself does not know which task to schedule or which page entry to modify. To make the microkernel work as a normal kernel, policies are required. L4RE (the L4 Runtime Environment) provides the policies for scheduling, the page fault resolver, the interrupt handler, and the device drivers. The following sections describe the steps to build a runnable microkernel-based virtualization environment: first the tools required to build the kernel, and then the build process for the L4 Fiasco microkernel, L4RE, and L4Linux.

A. Preparing Required Tools

To speed up the build process, we use a cross-compiler tool-chain to compile the ARM source code on a faster x86 machine. 'Sourcery G++ Lite 2011.03-41 for ARM GNU/Linux' is a pre-built tool-chain that can be downloaded from Mentor Graphics' website. In addition to the tool-chain, we also need the GNU development tools, such as 'make' and 'autoconf', the u-boot tools to build the image loaded onto the ARM board, and the dialog library for displaying the configuration menu.

B. Fiasco Microkernel & L4RE

After implementing all the components needed for the target board, the kernel can be built by setting the correct options in the configuration menu. There are many options in the Fiasco microkernel menu; the five settings that must be chosen correctly are the CPU type, the platform, the UART console, the timer, and the interrupt controller. The CPU type is ARM, the platform is Samsung Exynos, the UART console number is 1, the timer is the multi-core timer, and the interrupt controller is ExtGIC. The L4RE part contains the policies for the microkernel as well as our profiler, energy-efficient scheduler, and core frequency tuner. To build L4RE, set the Platform option to Exynos4, which defines the starting address for the program loader.

Figure 3 Interaction between L4Linux and the L4 microkernel when creating a new process

C. L4Linux

L4Linux is a modified Linux that runs on the L4 microkernel. The privileged part of the original Linux kernel has been removed and replaced by L4 IPC, which allows the whole of L4Linux to execute in non-privileged mode. The originally privileged operations, such as device access, task scheduling, page fault handling, and interrupt handling, are mapped to the L4 API; these mappings are transparent to the user applications running in L4Linux. Figure 3 shows an example of creating a new process in L4Linux and the interaction between L4Linux and the L4 microkernel. When an application wants to create a new process, it invokes the Linux system call 'create_new_process()'. This system call is replaced by the L4 IPC 'L4_task_new()'. The task dispatcher then calls 'L4_thread_schedule()' into our scheduler module. After the scheduler decides on a new schedule policy, it calls 'L4_thread_switch()' to assign the first task in the new schedule to a CPU core and switch to it to start execution.

4.1.2.3 Energy-efficient Scheduler Modules

Figure 1 shows the system architecture. The middle layer contains the three modules we developed: the profiler, the scheduler, and the frequency tuner.

The profiler module uses the HPM (hardware performance monitor) to collect application execution information, such as the number of instructions per scheduling interval, and classifies the workload into interactive tasks, tasks with periodic deadlines, computation tasks, background tasks, etc. The development of the profiler is at an early stage; currently, most of the classification is done manually. We will put more effort into the profiler in the next few months.

The scheduler module uses the performance information collected by the profiler module, together with its internal data structures (the Task Queue, Core Structure, and Task Structure), to decide a new policy based on the scheduling algorithms we have developed. Figure 4 depicts the system diagram of the scheduler.

Figure 4 System diagram of the scheduler
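As a rough illustration of the internal data structures named above, the following C sketch shows one possible layout; the fields are our guesses, not the actual implementation.

/* Hypothetical layouts for the scheduler's internal structures. */
enum task_class { INTERACTIVE, PERIODIC, COMPUTATION, BACKGROUND };

struct task_entry {                 /* "Task Structure"                     */
    unsigned long      id;
    enum task_class    cls;         /* classification from the profiler     */
    unsigned long      insns;       /* instructions per scheduling interval */
    long               deadline;    /* <= 0 means no deadline               */
    struct task_entry *next;
};

struct core_entry {                 /* "Core Structure"                     */
    int                freq_khz;    /* current frequency of this core       */
    unsigned long      capacity;    /* instructions retirable per interval  */
    struct task_entry *run_queue;   /* tasks currently assigned to the core */
};

struct task_queue {                 /* "Task Queue": not-yet-assigned tasks */
    struct task_entry *head, *tail;
};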

If the policy decides to scale the core frequency up or down, the scheduler module calls 'L4_CoreTuner_set' to tune the frequency of the CPU cores.

4.1.2.4 Core Frequency Tuner Module

To save energy, we tune down the frequency of a CPU core when the current workload does not need that much computing power. In this section, we describe the implementation of the frequency tuner module on both the x86 and ARM platforms.

A. x86 Platform


Intel x86 processors have a set of special-purpose registers, the MSRs (model-specific registers), together with two instructions, rdmsr and wrmsr, for reading and writing them. MSR 0x198 is used to read the current frequency and MSR 0x199 is used to set the desired frequency. Table 1 shows the supported frequencies for the Intel i7-950 processor. The formula for the resulting frequency is as follows:

Frequency OUT = 133MHz * Multiplier

Multiplier   Frequency (GHz)
12           1.6
13           1.7
…            …
22           2.9
23           3.07

Table 1 Supported frequencies for the Intel i7-950 processor

The steps to read or write an MSR are as follows. To read, put the MSR number in the ECX register and issue 'rdmsr'; the result is returned in EAX (low 32 bits) and EDX (high 32 bits). To write, put the MSR number in ECX, put the desired value in EAX (and EDX), and issue 'wrmsr'. Figure 5 shows an example of reading and writing the MSR.

Figure 5 Example of tuning frequency by MSR on Intel processor
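The same sequence can be written as a ring-0 C sketch. The rdmsr/wrmsr inline assembly is standard x86; the bit layout written into MSR 0x199 (the multiplier in bits 15:8) is our assumption for Nehalem-class parts such as the i7-950 and should be checked against Intel's manual.

#include <stdint.h>

/* rdmsr/wrmsr are privileged; this code must run in kernel mode. */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;                       /* result comes back in EDX:EAX */
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile("wrmsr" : : "c"(msr),
                     "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

#define MSR_PERF_STATUS 0x198   /* read the current frequency   */
#define MSR_PERF_CTL    0x199   /* set the desired frequency    */

/* Request a target multiplier, e.g. 23 for ~3.06 GHz on the i7-950.
 * Assumption: the multiplier occupies bits 15:8 of MSR 0x199. */
static void set_multiplier(uint32_t mult)
{
    uint64_t ctl = rdmsr(MSR_PERF_CTL);
    ctl = (ctl & ~0xff00ULL) | ((uint64_t)(mult & 0xff) << 8);
    wrmsr(MSR_PERF_CTL, ctl);
}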

B. ARM platform

Tuning the core frequency on ARM is more complicated: it requires setting up the clock generator and the core frequency divider. The APLL is a high-frequency clock generator available on the ARM platform, and CPUDIV is the register that sets the core frequency divider. The equation for the resulting frequency is:

FOUT = MDIV x FIN / (PDIV x 2^SDIV)

Table 2 shows the PDIV/MDIV/SDIV values for each core frequency. The procedure to tune the core frequency is as follows (a C sketch follows the lists):

For the CPU clock divider (CPUDIV):
1. Write the clock divider value into the CLK_DIV_CPU register.
2. Wait until the clock is stable (CLK_DIV_STAT_CPU = 0x00000000).

For the APLL clock generator (APLL_CON0):


1. Set the CPU clock source to another generator, e.g. MPLL (CLK_SRC_CPU->MUX_CORE_SEL[16] = 1).
2. Wait until the clock source has changed (CLK_MUX_STAT_CPU[18:16] = 0x2).
3. Set the PDIV/MDIV/SDIV divider values in the APLL.
4. Wait until the clock is stable (APLL_CON0->LOCKED[29] = 1).
5. Set the CPU clock source back to the APLL (CLK_SRC_CPU->MUX_CORE_SEL[16] = 0).

Table 2 PDIV/MDIV/SDIV values for each core frequency

Scheduling for Multi-cores with Per-core DVFS

4.2 Energy-efficient Off-line Task Scheduling

This section presents our approach to energy-efficient off-line task scheduling. First we introduce our models, covering tasks, CPU processing rates, and energy consumption. Second, we give a formal definition of the energy-efficient off-line task scheduling problem, present our theoretical findings for this class of scheduling problems, and present our proposed algorithms for task scheduling and CPU frequency tuning in both single-core and multi-core environments. Finally, we report our simulation results as well as the actual experiment results.

4.2.1 Models

A. Task

A task consists of a sequence of instructions to be executed by a processor. Mathematically, we model a task Tk ∈ T as a tuple with the following parameters:

Tk = (Bk, Ak, Dk)

Bk is the number of instructions to be executed. Ak is the arrival time of Tk; since we consider only off-line task scheduling in this section, Ak is set to 0 for all Tk ∈ T. Dk is the deadline of Tk: if Tk has a deadline requirement, then Dk > Ak ≧ 0; otherwise Dk ≦ 0, which indicates that Tk has no deadline constraint.
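In code, this task model amounts to a small structure (a sketch; the names are ours):

#include <stdint.h>

/* Task model Tk = (Bk, Ak, Dk) from Section 4.2.1. */
typedef struct task {
    uint64_t B;   /* number of instructions to execute                  */
    int64_t  A;   /* arrival time; 0 for every task in the offline case */
    int64_t  D;   /* deadline; D <= 0 means no deadline constraint      */
} task_t;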


We make some assumptions about the tasks. First, we assume that tasks are non-preemptive: a running task cannot be interrupted by other tasks. This assumption avoids the overhead of context switches and task migration. Second, tasks can be scheduled in different orders: since we focus on off-line scheduling in this section, the arrival time Ak of each task is 0, so the scheduler has the information on all the tasks to be scheduled and can decide their execution order according to its scheduling policy.

B. Processing Rate

Since current processors support DVS/DFS, a processor can run at different processing rates, i.e., frequencies. We model the processing rate pi used for task Ti as the number of instructions executed per second by a processor. The value of pi is discrete, and different processor models provide different selections of processing rates. For example, on an x86 machine with an Intel i7-950, pi ∈ P = {1.6, 1.73, …, 3.06} (GHz); on an ARM board with the Exynos4412, pi ∈ P = {0.2, 0.3, …, 1.7} (GHz). In this work, we assume that the processing rate remains unchanged while a task executes; rate/frequency changes occur only when a processor/core starts to execute a task.

C. Energy Consumption

For a task Tk, let εk be the expected energy consumption of Tk in Joules and tk its execution time. We model them as follows:

εk = E(pk) × Bk
tk = Q(pk) × Bk

where Bk is the number of instructions to be executed in task Tk, pk is the processing rate used for Tk, E(pk) is the energy to process one instruction at processing rate pk, and Q(pk) is the time to process one instruction at rate pk. We assume that if the processing rate is fixed (i.e., it does not change during the execution of a task), both the energy consumption εk and the execution time tk are proportional to the number of instructions Bk to be executed.

4.2.2 Scheduling strategies

For off-line scheduling, we consider two kinds of tasks, tasks with deadlines and tasks without deadlines, and two environments, single-core and multi-core. The combinations of tasks and environments result in four scenarios. In the following sections, we give the problem definition for each scenario, then provide theoretical analysis and algorithm design.

A. Tasks with deadlines in a single-core environment

Given a set of tasks each with a deadline, the problem is to decide the execution order and the processing rate of each task, such that every task can be finished before its deadline and the overall energy consumption is minimized. In the following, we show that this problem is NP-Complete.


       Q    E
pL     2s   1J
pH     1s   4J

Table 3 Example

Assume that there are only two processing rates, pH and pL, where pH equals two times pL, Q(pL) = 2·Q(pH) = 2 s, and E(pH) = 4·E(pL) = 4 J. (Theoretically, E is proportional to the square of 1/Q.) Given n+1 tasks, let the number of instructions of the (n+1)-th task equal the sum of those of the first n tasks, denoted Bsum. Let the time and energy constraints be 2.5·Bsum (sec) and 6.5·Bsum (J), respectively.

In this case, if the (n+1)-th task uses pL, it takes 2·Bsum (sec), and the remaining n tasks can never finish in the remaining 0.5·Bsum (sec). Thus the (n+1)-th task must use the higher processing rate pH, which reduces its execution time to Bsum (sec) and consumes 4·Bsum (J) of energy. The constraints for the remaining n tasks become 1.5·Bsum (sec) in time and 2.5·Bsum (J) in energy. In other words, we must select a subset of tasks whose instruction counts sum to exactly Bsum/2 to use pH as their processing rate, while the others use pL. This is equivalent to the Partition problem, which is NP-Complete.

We employ dynamic programming to find an optimal solution. For simplicity of presentation, and without loss of generality, we assume that there are only two processing rates, pH and pL. Let G(i, t) be the minimum energy consumption for finishing the first i tasks within time t; it can be found recursively using the following equation (a small C sketch of this recurrence follows Section B below):

G(i, t) = min( G(i-1, t - Q(pH)·Bi) + E(pH)·Bi ,  G(i-1, t - Q(pL)·Bi) + E(pL)·Bi )

B. Tasks with deadlines in a multi-core environment

As shown in the previous section (4.2.2.A), deciding the processing rate of each task under time and energy constraints on a single core is NP-Complete. Extending from a single core to multiple cores only increases the complexity of the problem, so it remains NP-Complete. To solve it, we use an Earliest Deadline First Round-Robin heuristic to dispatch tasks to cores, and then apply the dynamic programming above to find the minimum energy consumption on each core.
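The following is a minimal C sketch of the single-core dynamic program above, assuming the two rates of Table 3, one common deadline, and time discretized into integer slots; the rolling two-row table and all names are ours.

#include <float.h>

/* Q[r]: seconds and E[r]: Joules per instruction at rate r
 * (0 = pL, 1 = pH), matching Table 3. */
static const double Q[2] = {2.0, 1.0};
static const double E[2] = {1.0, 4.0};

#define MAXT 4096
static double G[2][MAXT + 1];   /* rolling table for G(i, t) */

/* Minimum energy to finish all n tasks (instruction counts B[]) within
 * deadline T; returns DBL_MAX if infeasible. */
double min_energy(const int *B, int n, int T)
{
    for (int t = 0; t <= T; t++)
        G[0][t] = 0.0;                        /* zero tasks cost nothing   */

    for (int i = 1; i <= n; i++) {
        int cur = i & 1, prev = cur ^ 1;
        for (int t = 0; t <= T; t++) {
            double best = DBL_MAX;
            for (int r = 0; r < 2; r++) {     /* try both processing rates */
                int    dt = (int)(Q[r] * B[i - 1] + 0.5);
                double de = E[r] * B[i - 1];
                if (t >= dt && G[prev][t - dt] < DBL_MAX - de &&
                    G[prev][t - dt] + de < best)
                    best = G[prev][t - dt] + de;
            }
            G[cur][t] = best;
        }
    }
    return G[n & 1][T];
}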

C. Tasks without deadlines in a single-core environment

The results presented in Sections C and D for scheduling tasks without deadlines are useful for energy/performance testing and evaluation by mobile device manufacturers, as well as by users of a new mobile device who want to know its energy/performance behavior; typically a batch of jobs is loaded into the system and executed in batch mode. In Sections C and D, we show that an optimal solution (one that minimizes energy consumption without sacrificing performance) exists for both single-core and multi-core environments.

Given a set of tasks without deadlines, the problem is to find an execution order of the tasks and a processing rate for each task such that the overall cost is minimized. Note that we use cost instead of energy consumption, since one could always run every task at the lowest processing rate to minimize energy consumption, at great expense to performance. The cost is a metric that takes both energy consumption and the waiting time of the tasks into consideration. Since energy and time cannot be added directly, we introduce a coefficient H in our cost function.

We define the cost function as in Equation 1 (assuming the execution order is T1, T2, T3, ..., Tn):

Ci = H × Σ_{j=1}^{i-1} tj + εi = H × Σ_{j=1}^{i-1} Q(pj)·Bj + E(pi)·Bi    (1)

The cost Ci of task Ti depends on how long it has to wait before being processed and on the energy consumed by the task. tj is the execution time of task Tj; according to our model, tj equals Q(pj) × Bj, where Bj is the number of instructions to be executed, pj is the processing rate used for task Tj, and Q(pj) is the time to process one instruction at rate pj. εi is the energy consumption of Ti, which equals Bi × E(pi). H is the cost ratio between time and energy, and it must be non-negative. For example, given a set of tasks with B = {1, 2, 3, 4, 4, 6}, assume there are only two processing rates, P = {1, 2}, with Q(p) = 1/p and E(p) = p^2. Figure 6 shows three execution sequences with different costs: in the first sequence, every task uses processing rate 1; in the second, tasks T2 and T3 use the higher rate (p2 = p3 = 2); in the third, T1, T2, and T3 use the higher rate. Sequence 3 has the lowest cost of the three. Next, we present our theoretical results on energy-efficient off-line task scheduling for a single core.

Figure 6 Example of execution sequence
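As a worked illustration of Equation 1, the following C sketch evaluates the cost of a given execution order and rate assignment under the toy setting above (P = {1, 2}, Q(p) = 1/p, E(p) = p^2). The rate vectors follow the description of sequences 1 and 3; H = 1 is an arbitrary choice of ours, since the text does not fix its value.

#include <stdio.h>

/* Cost of one execution order under Equation 1. */
static double seq_cost(const int *B, const int *p, int n, double H)
{
    double wait = 0.0, cost = 0.0;
    for (int i = 0; i < n; i++) {
        double Qi = 1.0 / p[i];              /* seconds per instruction */
        double Ei = (double)p[i] * p[i];     /* Joules per instruction  */
        cost += H * wait + Ei * B[i];        /* Ci = H*waiting + energy */
        wait += Qi * B[i];                   /* later tasks wait for Ti */
    }
    return cost;
}

int main(void)
{
    int B[]  = {1, 2, 3, 4, 4, 6};
    int p1[] = {1, 1, 1, 1, 1, 1};   /* sequence 1: all at the low rate    */
    int p3[] = {2, 2, 2, 1, 1, 1};   /* sequence 3: T1-T3 at the high rate */
    printf("cost(seq1)=%.1f cost(seq3)=%.1f\n",
           seq_cost(B, p1, 6, 1.0), seq_cost(B, p3, 6, 1.0));
    return 0;
}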

Lemma 1. If the execution order of the n tasks is T1, T2, T3, ..., Tn, then the total cost can be computed as in Equation 2:

F(T) = Σ_{k=0}^{n-1} H × k × Q(pn-k) × Bn-k + Σ_{k=0}^{n-1} E(pn-k) × Bn-k    (2)

Proof of Lemma 1. The second summation in Equation 2 is the energy consumption, the same as in Equation 1. The first summation is the total waiting time. In Equation 1, the waiting time of task Ti is the sum of the execution times of the i-1 tasks before Ti, i.e., the delay imposed on Ti by the tasks before it. We can formulate the same waiting time from a different viewpoint: define the waiting time caused by task Tk as the delay it imposes on the tasks after it. Let Tn-k be the (k+1)-th task from the end, so there are k tasks after Tn-k. If the execution time of Tn-k is tn-k, then the waiting time contributed by Tn-k to the k tasks after it is k × tn-k, which gives the first summation in Equation 2.

Lemma 2. The decision on the processing rate for a task to minimize the total cost depends only on the number of tasks waiting after that task.

Proof of Lemma 2. The contribution of task Tn-k to the total cost is

F(Tn-k) = H × k × Q(pn-k) × Bn-k + E(pn-k) × Bn-k
        = (H × k × Q(pn-k) + E(pn-k)) × Bn-k
        = C(k, pn-k) × Bn-k

It is clear that the value of C(k, pn-k) depends only on k (the number of tasks waiting after task Tn-k) and pn-k (the processing rate for Tn-k), and is independent of Bn-k.

It is obvious that the value of C(k, pn-k) only depends on k (the number of tasks waiting after task Tn-k) and pn-k (the processing rate for Tn-k), and is independent of Bn-k. Lemma 3 For any four non-negative real numbers a, b, x, y, with a≧b and x≧y, we will have ay+bx≦ax+by. Proof of Lemma 3

bxaybyaxbybxayax

yxba

+≥+⇒≥+−−⇒

≥−−0

0))((

Theorem 1. There exists an optimal solution with the minimum cost in which the tasks are in non-decreasing order of the number of instructions.

Proof of optimality. Let C(k) = min_p C(k, p). Based on Lemma 1 and Lemma 2, when the task execution order is T1, T2, T3, ..., Tn, the minimum total cost is

Σ_{k=0}^{n-1} C(k) × Bn-k

We have C(k) ≦ C(k+1): using the best processing rate of C(k+1) (say p) as the processing rate at position k gives

C(k+1) - C(k) ≧ C(k+1, p) - C(k, p) = H × Q(p) ≧ 0  ⇒  C(k+1) ≧ C(k)

Assume that in a global optimal solution there are two tasks i and j with Bi > Bj and task i placed before task j (that is, task i executes before task j). By Lemma 3, swapping these two tasks does not increase the cost. Repeating such swaps until no more pairs can be swapped yields an optimal task sequence in non-decreasing order of the number of instructions.


D. Tasks without deadlines in a multi-core environment

Given a set of tasks without deadlines, deploy the tasks to a number of homogeneous cores and find the execution sequence on each core that minimizes the overall cost. As in the previous section, we use cost rather than energy consumption alone.

Theorem 2. There exists an optimal solution with the minimum cost in which the tasks are assigned, in non-decreasing order of the number of instructions, to the cores in a round-robin fashion.

Proof of optimality. From Theorem 1 we know that a task m with a larger Bm should be put at a position k with a smaller C(k) to obtain the optimal solution. Since the cores are homogeneous, the corresponding position k on every core has the same C(k). Thus we can deploy the tasks in a round-robin fashion and achieve the minimum cost.

As in Longest Task Last, we sort the tasks by Bi and deploy them to the cores in a round-robin fashion, starting from the largest task. After deploying all tasks, the tasks on each core are in descending order, so we reverse each sequence to make it ascending. The final step sets each task's processing rate to the pm that minimizes its C(k). The output is a scheduling plan for each core with the minimum overall cost.

4.2.3 Simulation and Experimental Results

A. Environment

Algorithm: Workload Based Round Robin
Input: n tasks, x cores
Output: execution sequence of each core, and processing rates of tasks
1. Sort the tasks by Bi in descending order.
2. From the first (largest) task to the last (smallest) do
3.   Assign the task to a core in a round-robin fashion.
4. end do
5. Reverse the execution order of the tasks in each core.
6. For each task in each core do
     Set its processing rate according to its position in the sequence.
7. end do

Algorithm: Longest Task Last
Input: n tasks
Output: execution sequence and processing rates of the n tasks
1. Sort the tasks by Bi in non-decreasing order.
2. For each task Tn-k do
3.   Set its processing rate pn-k to argmin_p C(k, p), p ∈ P.
4. end do
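As a concrete illustration of the Workload Based Round Robin box above, here is a hedged C sketch; rate_for_position(k), which stands for argmin_p C(k, p), and the caller-allocated output arrays are our assumptions.

#include <stdlib.h>

typedef struct { long B; int rate; } wtask_t;

static int cmp_B_desc(const void *a, const void *b)
{
    long d = ((const wtask_t *)b)->B - ((const wtask_t *)a)->B;
    return (d > 0) - (d < 0);
}

extern int rate_for_position(int k);  /* assumed: best rate given k waiters */

/* tasks[n] in; seq[c][slot] and nper[c] (queue length of core c) out. */
void wbrr(wtask_t *tasks, int n, int x, wtask_t **seq, int *nper)
{
    qsort(tasks, n, sizeof *tasks, cmp_B_desc);      /* 1. largest first    */
    for (int c = 0; c < x; c++)
        nper[c] = 0;
    for (int i = 0; i < n; i++)                      /* 2-4. round-robin    */
        seq[i % x][nper[i % x]++] = tasks[i];
    for (int c = 0; c < x; c++) {
        for (int l = 0, r = nper[c] - 1; l < r; l++, r--) {
            wtask_t tmp = seq[c][l];                 /* 5. reverse queue    */
            seq[c][l] = seq[c][r];
            seq[c][r] = tmp;
        }
        for (int k = 0; k < nper[c]; k++)            /* 6. rate by position */
            seq[c][k].rate = rate_for_position(nper[c] - 1 - k);
    }
}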


Since we could not find an ARM platform that supports per-core frequency tuning, we conduct our experiments on a quad-core x86 PC that does.

Our environment setting is as follows. We set up the Fiasco microkernel on a quad-core x86 machine. The frequency of each core can be adjusted separately; there are 12 frequency choices, ranging from 1.6 GHz to 3.06 GHz, of which we use only a subset in our experiments. The benchmark we use is MiBench. Since the execution time of each application in MiBench is short (less than 1 second), we synthesize 16 workloads from these applications, each looping for a different number of iterations. The execution times of these workloads range from 1 second to 17 minutes.

The power consumption is measured by a power meter (DW-6091). The energy consumption is the integral of the power reading over the execution period. Since other components also consume energy, we first measure the power consumption of an idle machine and subtract this idle power from our experimental results.

B. Simulation

We ran some simulations before conducting the experiments: given a set of tasks, we use the proposed Workload Based Round Robin to find the optimal schedule plan with the minimum cost. In the simulation, only two processing rates, 1.6 GHz and 3.0 GHz, are used.

C. Experiment 1

In this experiment, we implement the WBRR scheduling algorithm in our energy-efficient scheduler and measure the actual energy consumption on the x86 machine.

Table 4 shows the cost breakdown of the WBRR schedule plan. The discrepancy in overall cost is 4.75%, which means that our model and simulation are quite close to the real environment. The discrepancy in energy cost is 7.22%, while the discrepancy in time cost is 2.79%. The larger discrepancy in energy cost between simulation and experiment might be caused by the power meter: its readings are integers, which introduces rounding error.

              Cost        Cost (Energy)   Cost (Waiting Time)
Simulation    57842473    25527848        32314630
Experiment    55096215    23684531        31411684
Discrepancy   4.75%       7.22%           2.79%

Table 4 Cost decomposition of WBRR

Figure 7 shows the execution of WBRR. The x-axis is time in seconds; the y-axes are power in Watts and processing rate in GHz. The cyan-blue line is the reading from the power meter, while the other four lines are the processing rates of the four cores. The figure shows that the power change is more significant when reducing the processing rate of a core than when finishing tasks earlier and letting the cores idle.


Figure 7 Execution results of WBRR

D. Experiment 2

The second experiment compares the cost of WBRR against the baseline scheduling in the Fiasco microkernel. The baseline schedules incoming tasks to cores in a FIFO fashion; each core always runs at the highest frequency, and no dynamic frequency scaling is involved.

                 Cost        Cost (Energy)   Cost (Waiting Time)
WBRR (optimal)   55096215    23684531        31411684
Baseline         74955130    45858410        29096720
Improvement      26.49%      48.35%          -7.96%

Table 5 Cost comparison between WBRR and Baseline

Table 5 shows the costs of WBRR and the baseline. The last row shows the improvement from switching from the baseline to WBRR. The improvement in overall cost is about 26%, mostly from the energy cost. The result shows that although WBRR incurs a slightly longer waiting time than the baseline, it saves a significant amount of energy.

E. Experiment 3

In this experiment, we extend the number of processing rates a task can choose from, using the following five frequencies: 1.6 GHz, 2.0 GHz, 2.4 GHz, 2.8 GHz, and 3.0 GHz. First we use the simulation to generate a schedule plan, then run the experiment and measure the cost.


                Cost        Cost (Energy)   Cost (Waiting Time)
2 frequencies   55096215    23684531        31411684
5 frequencies   53371136    23680186        29690950
Improvement     3.13%       0.018%          5.48%

Table 6 Cost comparison between WBRR using 2 and 5 frequencies

Table 6 compares the costs using 2 and 5 processing rates. The last row shows the improvement from using 5 frequencies instead of 2. Most of the improvement comes from the waiting-time cost, while the improvement in energy cost is only 0.018%.

Figure 8 shows the execution of WBRR using 5 frequencies. The x-axis is time in seconds; the y-axes are power in mW and processing rate in GHz. The cyan-blue line is the reading from the power meter, while the other four lines are the processing rates of the four cores. From this figure we can observe that even though five frequencies are available, only three of them are used. The reason is that the number of tasks is not large enough to justify a higher frequency for any task, as explained by Lemma 2.

Figure 8 Execution results of WBRR with 5 frequencies

F. Overall Comparison

Figure 9 shows the power readings of the baseline, WBRR with two frequencies, and WBRR with five frequencies. The energy consumption of each scheduling method is the integral of its power reading. Clearly, even though the baseline finishes the tasks earlier, its energy consumption is significantly larger than WBRR's. The two WBRR variants, on the other hand, have similar task finishing times and energy consumption.


Figure 9 Power readings of Baseline, WBRR with 2 frequencies, and WBRR with 5 frequencies

Table 7 shows the actual energy consumption of the baseline, WBRR with two frequencies, and WBRR with five frequencies. The energy consumption is the integral of the power consumption over the execution time. The improvements of the two WBRR variants are about 48% compared to the baseline, which is consistent with the result in Table 5.

           Energy Consumption   Ratio against Baseline
WBRR(2)    31591.53             0.5167
WBRR(5)    31573.58             0.5164
Baseline   61144.55             1.0

Table 7 Actual energy consumption (Joules)

4.3 Energy-efficient On-line Task Scheduling

In this section, we focus on on-line task scheduling. First we introduce our task model. Second, we give a formal definition of the energy-efficient on-line task scheduling problem and present our solution. The last subsection gives some preliminary results.

4.3.1 Task Model

Tasks running on a mobile device can be roughly divided into three categories: interactive tasks, computation tasks, and background tasks. Interactive tasks are triggered by the user, and response time is crucial for them; for example, the transition between screens must be smooth while the user slides a finger across the screen. We assume that the workload of an interactive task is small and must be finished in a short time, i.e., each interactive task has a deadline. Computation tasks are generated by applications, for example a game's A.I. Computation tasks also have deadlines, but their deadlines are not as tight as those of interactive tasks, and they involve larger amounts of computation. Computation tasks could be further divided into CPU-bound and memory-bound, but we treat them equally for now.


Unlike the above two kinds of tasks, which have arrival times and deadlines, background tasks exist for as long as the system is up. Background tasks, such as checking for updates and new messages, are less urgent than the other tasks; we assume that a background task only requires a fixed number of cycles to "survive". Based on these characteristics, we model a task as:

Tk = (Bk, Ak, Dk)

where Bk is the number of instructions to be executed, Ak is the arrival time of Tk, and Dk is the deadline of Tk.

              Bk     Ak     Dk
Interactive   ∈ ℕ    ∈ ℕ    ∈ ℕ, Dk < U
Computation   ∈ ℕ    ∈ ℕ    ∈ ℕ, Dk ≧ U
Background    θ      0      -1

Table 8 Task parameters

Table 8 shows the constraints on each kind of task. U is the length of a fixed time interval. For interactive and computation tasks, the number of instructions to be executed, the arrival time, and the deadline are positive integers; an interactive task must finish within one time interval, while a computation task can take more than one interval. Background tasks have no arrival time or deadline; they only require θ cycles in each time interval U.

We further make some assumptions about the tasks. First, we know the Bk of each task: by profiling and other techniques, we can estimate the number of instructions a task must execute in order to finish. Second, all tasks are sequential. Third, there are few interactive tasks in the system, since a mobile device is used by a single user; moreover, the number of instructions to be executed in a task is within the computation capacity of a core during each interval U. Last but not least, tasks can migrate between cores, but the cost of migration is high due to the expensive IPC communication in a microkernel.

4.3.2 Scheduling strategies

A. Objective

For every time interval U, we need to decide the processing rate of each core and make a scheduling decision such that the power consumption is minimized. The scheduling plan must also satisfy the following objectives: every interactive task finishes before its deadline; every computation task executes at least x instructions so that it can finish before its deadline; and every background task executes θ instructions.

B. Current solution

We divide time into intervals. For each interval, we are aware of all the running tasks, since we use a microkernel.


According to the current tasks and their characteristics, we make a scheduling decision for each interval. The overhead of the solution must be small, since it is computed every time interval. Our current solution, called "Iterative Earliest Deadline First + Best Fit" (IEDF+BF), consists of three steps: first, estimate the number of high-frequency cores; second, schedule tasks to cores; third, further adjust core frequencies.

The first step estimates the number of high-frequency cores, aiming to use as few of them as possible. We use the deadlines of the interactive tasks to determine this number. The algorithm is as follows:

Algorithm: Iterative Earliest Deadline First
Input: interactive tasks, n cores
Output: the number of high-frequency cores
1. Set all the cores to the lowest frequency.
2. Sort the interactive tasks by Di in ascending order.
3. For i = 1, n do
4.   Set i cores to a high frequency.
5.   Deploy interactive tasks to these cores using EDF.
6.   if (every interactive task meets its deadline)
7.     break;
8. end do
9. return i

This algorithm iteratively finds the least number of high-frequency cores. Note that if there is no interactive task, or if the interactive tasks can meet their deadlines with all cores at the lowest frequency, we set i to 0.

The second step schedules tasks to cores. Before scheduling a task, we must make sure the current processing-rate setting provides enough computation resources. Step one already fixes some high-frequency cores and schedules the interactive tasks according to their deadlines. The remaining tasks are computation and background tasks, each requiring a number of instructions to be executed within the scheduling interval.

Figure 10 illustrates the computation resources available for computation and background tasks. In this example, the number of high-frequency cores is two. We schedule interactive tasks 1~3 first, since they have deadlines; the remaining computation resources are reserved for computation and background tasks. If the reserved resources are less than the total requirement of these two kinds of tasks, we must increase the processing rates to provide more resources in this interval. Even with sufficient resources, deploying tasks to cores is still an NP-Complete problem, so we apply a Best-Fit heuristic: sort all computation tasks by their instruction counts in descending order; starting from the task with the largest Bk, deploy it to the core with the least remaining resources that can still accommodate it. After deploying all computation tasks, we apply the same method to the background tasks. Figure 11 shows an example after scheduling.
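A hedged C sketch of the Best-Fit step follows; remaining[c], the number of instructions core c can still retire in this interval, and the sorted-order output array core_of are our assumptions.

#include <stdlib.h>

typedef struct { long B; } btask_t;

static int cmp_desc(const void *a, const void *b)
{
    long d = ((const btask_t *)b)->B - ((const btask_t *)a)->B;
    return (d > 0) - (d < 0);
}

/* Deploy tasks largest-first to the tightest core that still fits them.
 * Returns 0 on success, -1 if some task does not fit (processing rates
 * must then be raised to provide more resources). core_of[i] refers to
 * the i-th task in the sorted (descending) order. */
int best_fit(btask_t *tasks, int n, long *remaining, int ncores, int *core_of)
{
    qsort(tasks, n, sizeof *tasks, cmp_desc);
    for (int i = 0; i < n; i++) {
        int best = -1;
        for (int c = 0; c < ncores; c++)
            if (remaining[c] >= tasks[i].B &&
                (best < 0 || remaining[c] < remaining[best]))
                best = c;
        if (best < 0)
            return -1;
        remaining[best] -= tasks[i].B;
        core_of[i] = best;
    }
    return 0;
}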


Figure 10 Example of step 1

Figure 11 Example after scheduling

The third step further adjusts the core frequencies. For each core, if its tasks can be finished at a lower processing rate, we reduce its frequency; conversely, if the core cannot finish its assigned tasks in time, we increase its frequency.

4.3.3 Simulation

A. Environment

We run simulations and compare the power consumption of our solution against a baseline. Our simulation environment consists of a quad-core CPU. The frequency of each core can be adjusted separately, with six frequency choices ranging from 200 MHz to 1.2 GHz. The time interval U is set to 0.1 second.


B. Benchmark

We execute two popular games from Google Play, Candy Crush and Monkey, and record their workload traces as our input. We also record the workload trace of an app from a student project. The sampling period is one second, and the frequency of each core is 1.2 GHz. Figure 12, Figure 13, and Figure 14 show the results.

Figure 12 Workload changing in Candy Crush

Figure 13 Workload changing in Monkey


Figure 14 Workload changing in VR-project

In these figures, the x-axis is time in seconds and the y-axis is the loading. The number of instructions executed per second can be calculated by multiplying the loading by the core frequency. In Figure 12, the first 30 seconds are spent in the stage-selection menu, the following 160 seconds are game play, and the last 30 seconds show statistics. In Figure 14, the app crashes at the 18th second and creates a new thread at the 26th second. As we can observe, in each app there is only one thread that contributes most of the workload; these threads are the "computation threads" we expected. However, we cannot find the "interactive threads" in these figures. The reason might be that user inputs, such as touch events, are handled by the computation thread instead of a separate interactive thread; that is, the computation thread takes the x and y coordinates of the touch as input and calculates the corresponding reaction. Since these computation threads also include the behavior of interactive threads, we duplicate each such thread, designating one copy as the computation thread and the other as an interactive thread with a periodic deadline of 0.1 second.

Each computation thread is further divided into "computation tasks". The number of instructions for each computation task equals the number of instructions executed in 1 second, and the deadline of each computation task is set to 1 second.

The interactive thread is likewise divided into "interactive tasks", but with a deadline of 0.1 second instead of 1 second; the number of instructions for each interactive task is the number executed in 0.1 second. We also pick another six threads from Candy Crush, three threads from Monkey, and one thread from the VR project as background threads, and divide them into background tasks.

C. Simulation Method

Starting from time 0, we invoke an "app", which consists of different kinds of threads, each activated at some specific time. During each interval U, we must decide a frequency setting for each core that satisfies the deadline requirements of the tasks of each thread. The overall energy consumption is calculated at the end.


Figure 15 Workload of each thread in the simulation

Figure 15 shows the workload of each thread in our simulation. The app Candy Crush starts at time 0; Monkey starts one minute later; the app VR is executed twice, at time 15 and time 210. Threads with the suffix "-I" are interactive threads, those with "-C" are computation threads, and those with "-BO" are background threads.

D. Simulation Results

Table 9 shows the simulation results. We assume that the energy consumption of a 200 MHz core in one interval (0.1 second) equals 1 s, where s is the unit of energy consumption used in our cost model. The Baseline method always uses the highest frequency during execution. The Adaptive method has two thresholds, 30% and 80%: for each core, if its loading in the previous interval was over 80%, it raises its frequency one level for the current interval (for example, an 800 MHz core with 85% loading moves to 1.0 GHz in the next interval); if the loading was below 30%, the frequency drops one level in the next interval.

           Energy     After Normalization
Baseline   2194560s   1.0
Adaptive   1374015s   0.626
IEDF+BF    739080s    0.337

Table 9 Simulation results

As can be seen from Table 9, our method, IEDF+BF, significantly reduces the energy consumption. However, both Adaptive and IEDF+BF share the same drawback: in each interval there may be some instructions that cannot be executed within that interval. In this case our method produces about six times as many such deferred instructions as Adaptive. However, we can execute these instructions in subsequent intervals and still meet the deadline of each task.

4.3.4 Discussion

In our approach, we assume that the time interval is fixed. However, interactive tasks do not enter the system in such a regular way, and the deadline of an interactive task is short: it may be too late to schedule and process the task in the next interval. We therefore propose an alternative: when a new interactive task arrives, stop the current interval and start a new one. This ensures that every interactive task can be scheduled immediately. As for computation and background tasks, the instructions not executed in the interrupted interval can be compensated for by requesting more resources in the new interval.

Scheduling tasks to cores is an NP-Complete problem, but this holds under the assumption that tasks are non-preemptive and cannot migrate between cores. If we allow tasks to be preempted and to migrate to another core at most once per interval, there might be a simple solution for scheduling them; this will be one of our research directions in the next five months.

Scheduling for big-LITTLE Core Architectures

4.4 Background

big-LITTLE [1] is a heterogeneous computing architecture developed by ARM in 2011. A big-LITTLE design contains two kinds of processors: one relatively slow but with low power consumption, the other with higher processing power but also a higher power demand. The intention is to create a multi-core environment that balances performance and power efficiency. For example, CPU-intensive tasks such as gaming or web page rendering can run on the big cores for performance, while tasks such as texting and email run on the low-power LITTLE cores.

The two kinds of processors in the big-LITTLE core architecture must be architecturally compatible, since instructions must be migrated transparently from one to the other. The design proposed by ARM in 2011 used the Cortex-A7 for LITTLE cores and the Cortex-A15 for big cores. One year later, ARM announced another pair, the Cortex-A53 and Cortex-A57, both ARMv8 cores, for big-LITTLE core chips.

Currently there are three different models for arranging processor cores in a big-LITTLE core architecture design: cluster migration, CPU migration, and heterogeneous multi-processing. These three different arrangements affect the design of kernel scheduling.

Cluster migration is the simplest design. Cores are divided into two clusters: one contains only big cores, the other only LITTLE cores. The scheduler sees only one of the two clusters at a time. If the load of the LITTLE core cluster reaches a threshold, the system activates the big core cluster, transfers all relevant data through the L2 cache, and resumes the execution of all tasks on the big core cluster; the LITTLE core cluster is then shut down. The transition also happens in the other direction when the load of the big core cluster drops below a certain threshold. Figure 16 demonstrates the idea of cluster migration.
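A sketch of the cluster-switch decision follows; the two threshold values are assumptions, since the text only says "a threshold".

UP_THRESHOLD, DOWN_THRESHOLD = 0.85, 0.30   # assumed values

def select_cluster(current, little_load, big_load):
    # Only one cluster is ever powered at a time in the cluster
    # migration model.
    if current == 'LITTLE' and little_load >= UP_THRESHOLD:
        return 'big'     # power up the big cluster, move state via the
                         # L2 cache, then shut the LITTLE cluster down
    if current == 'big' and big_load <= DOWN_THRESHOLD:
        return 'LITTLE'
    return current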


Figure 16 Cluster migration

CPU migration pairs up one big core with one LITTLE core, and each pair is treated as a virtual core. Only one of the two cores in a pair is powered and processing tasks at any time. The in-kernel switcher (IKS) is responsible for switching tasks between the two cores: the big core is used when the demand, or load, is high; otherwise the LITTLE core is used. The scheduler sees only the virtual core instead of the two physical cores. Figure 17 shows the idea of CPU migration.

Figure 17 CPU migration

Figure 18 Heterogeneous multi-processing

The third kind of big-LITTLE core architecture is heterogeneous multi-processing (HMP). In this model, all cores are visible to the scheduler; in other words, all the big and LITTLE cores can be used at the same time. However, the scheduler needs to be aware of the different CPU processing powers while scheduling. For example, threads with higher priority or


computation-intensive threads should be assigned to big cores, while threads with lower priority or less computation-intensive ones are assigned to LITTLE cores. Figure 18 depicts the heterogeneous multi-processing model.

Task scheduling in a big-LITTLE core environment differs from scheduling on a traditional SMP environment. The scheduling goal of SMP is to distribute the workload evenly across all available cores in order to achieve maximum performance. In contrast, the goal of big-LITTLE-aware scheduling is to maximize power efficiency with only a modest performance sacrifice. In that case, tasks might be distributed unevenly. For example, if most tasks are not computation-intensive, they should be assigned to LITTLE cores even if the loads of the big cores are low.

In this work, our goal is to build an energy-efficient scheduler for the big-LITTLE core architecture that satisfies the resource requirement of each task and minimizes the energy consumption. Many prior works, such as Linaro [2], use CPU load as the lone metric in core frequency tuning and core migration decisions. In this work, we propose a resource-guided scheduling policy. We quantify the resource consumed by a task using the current CPU core frequency and the CPU load of the task as parameters. We also define the minimum resource requirement of each task, i.e., the resource needed to satisfy the QoS requirement of the task. With such a resource-based metric, our scheduler is able to estimate the performance and energy consumption of a task without complex profiling or offline analysis. Based on this metric, we propose a scheduling policy that decides the resource use of the tasks in a dynamic fashion, including the number of cores to be powered on, the assignment of tasks to the big/LITTLE core clusters, the proper core frequency for the big core cluster and the LITTLE core cluster, and the best timing to migrate task(s) between big and LITTLE cores.

We verify our proposed scheduling method with both simulation and actual experiments on an ODROID-XU ARM platform, which has an Exynos5 Octa with a Cortex™-A15 1.6 GHz quad-core and a Cortex™-A7 quad-core CPU and uses the cluster migration design. Our results demonstrate that in a real-world scenario, our resource-guided strategy consumes on average only 10% of the power of Linaro's strategy while still satisfying the QoS requirements of all applications.

4.5 Related Work

4.5.1 Linaro

There has been some research on big-LITTLE core aware scheduling from Linaro [2]. Linaro is a not-for-profit engineering organization that works on consolidating and optimizing open-source software for the ARM architecture, including the GCC toolchain, the Linux kernel, ARM power management, and graphics and multimedia interfaces. Linaro has published its strategies and some experimental results on big-LITTLE core aware scheduling for the different big-LITTLE models [3][4]. The main idea of their strategy is to monitor the "load" and make migration decisions according to these loads. However, "load" has different meanings in the different models.

In the first and second models, the "load" refers to the load of each CPU. In the second model, the in-kernel switcher monitors the load of each CPU and adjusts the core frequency using DVFS. The selectable frequencies of the big and LITTLE cores are combined to form one range of adjustable frequencies for a virtual core. Tasks are first assigned to virtual cores according to the Completely Fair Scheduler policy. If the loading of a (virtual) core is greater than some pre-defined threshold, e.g., 85%, the scheduler computes a new frequency for this core [4]. If the new frequency falls in the range of the big core, the tasks will be executed on the big core; otherwise they run on the LITTLE core. Unused processors are powered off.
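The virtual-core frequency range can be sketched as below, using the ODROID-XU frequency levels from Table 10; note that the real IKS also normalizes big and LITTLE frequencies against each other, which this simplified sketch omits.

LITTLE_FREQS = [250, 300, 350, 400, 450, 500, 550, 600]              # MHz
BIG_FREQS    = [800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600]  # MHz
VIRTUAL = ([('LITTLE', f) for f in LITTLE_FREQS] +
           [('big', f) for f in BIG_FREQS])

def pick_core(new_freq_mhz):
    # Map the frequency computed by the governor onto the combined range:
    # frequencies within the LITTLE range keep the tasks on the LITTLE
    # core, higher ones move them to the big core.
    for kind, f in VIRTUAL:
        if f >= new_freq_mhz:
            return kind, f
    return 'big', BIG_FREQS[-1]   # saturate at the big core's maximum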


Figure 19 illustrates the scheduling strategy.

Figure 19 IKS: CPU migration

As for the third model, the scheduler keeps track of the load of each task. Only tasks with a load above a fixed threshold and a priority higher than the default are migrated to big cores; the idea is to use big cores only when necessary. The load of each task is the time spent on a CPU divided by the time spent in the run queue, multiplied by the task priority. This strategy treats the big and LITTLE cores as separate scheduling domains, and load balancing is performed within each domain.
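In pseudocode form, this per-task metric reads as follows; the threshold value and the convention that a larger number means higher priority are assumptions made for illustration.

LOAD_THRESHOLD = 0.8      # the "fixed threshold" from the text; value assumed
DEFAULT_PRIORITY = 1.0    # default priority weight; value assumed

def task_load(cpu_time, runqueue_time, priority):
    # time actually on a CPU over time spent runnable, weighted by priority
    return (cpu_time / runqueue_time) * priority

def migrate_to_big(cpu_time, runqueue_time, priority):
    return (priority > DEFAULT_PRIORITY and
            task_load(cpu_time, runqueue_time, priority) > LOAD_THRESHOLD)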

Linaro's experiments compare their big-LITTLE scheduler with the Linux vanilla scheduler on two platforms, the Versatile Express platform and ARM TC2. Their results show that Linaro's strategy reduces the up time of the big cores while maintaining the average response time, and thus consumes less power than the vanilla scheduler [3][4]. Linaro also proposes some future directions for improving big-LITTLE core scheduling.

4.5.2 Research papers

There is also research on heterogeneous processor environments that can be applied to big-LITTLE cores. Koufaty et al. [5] propose bias scheduling, a technique that influences how an existing scheduler selects the core where a thread will run. First, they compute the bias of each thread according to its stall ratio. When the load of the cores is imbalanced, the scheduler tries to migrate the task with the highest bias from the busiest core to the most idle one. They also inspect big cores and swap threads that have a small-core bias with threads that have a big-core bias but are assigned to small cores. In short, bias scheduling only hooks into the existing scheduler during load balancing. For heterogeneous workloads with a clear bias differentiation, performance improves by an average of 11%.

Ren et al. [6] propose an online algorithm, Fast-Preempt-Slow (FPS), which improves


response quality subject to deadline and total power constraints. The idea is to run short requests on slow cores for energy efficiency and long requests on fast cores to meet the deadline and quality requirements. Each job is first assigned an urgency value; when a core becomes idle, it selects the task with the highest urgency from a slower core instead of taking a task directly from its run queue. Their results show that under the same power budget, the throughput is 60% higher than that of the corresponding homogeneous processor.

Petrucci et al. [7] state that scheduling tasks only according to their bias, i.e., big-core-bias tasks only on big cores and small-core-bias tasks only on small cores, can hurt performance. They therefore propose lucky scheduling, which is based on lottery scheduling. Their results show that lucky scheduling outperforms both a big-core-fair policy and bias scheduling in energy-delay product (EDP).

Thannirmalai et al. [8] propose a framework that adjusts the frequency of each core cluster and the computation resources of each task based on control theory. The framework tries to keep the QoS of each QoS task within an acceptable range while obeying a power budget constraint. They measure the performance of QoS tasks using the heart rate, which is the throughput of the critical kernel of the task, e.g., the number of frames per second for a video decoder.

4.5.3 Industrial Solutions

Other manufacturers, such as Qualcomm, Nvidia, and MediaTek (MTK), have proposed their own solutions for heterogeneous multi-core platforms. Qualcomm [9] proposes Asynchronous Symmetrical Multi-Processing (aSMP). Its Krait cores support per-core Dynamic Clock and Voltage Scaling (DCVS), and a core that is not being used can be collapsed completely and independently. They claim that aSMP reduces the need for hypervisors or more complex software management of disparate cores. Nvidia [10], on the other hand, proposes Variable Symmetric Multiprocessing (vSMP). Their CPU, the Tegra 4, consists of a quad-core Cortex-A15 CPU plus a fifth low-power Cortex-A15 companion core that is invisible to the OS and performs background tasks to save power, implementing an idea similar to ARM's big.LITTLE technology. As for MTK [11], their new CPU, the MT8135 "True Octa-Core", is in fact big.LITTLE HMP. According to their website, they also use loading as their task scheduling criterion. Unfortunately, without further information we do not know the details of how these manufacturers design their task scheduling algorithms.

What distinguishes our work from prior work on scheduling in big-LITTLE or heterogeneous multi-core environments is that we consider a more dynamic environment, where the number of active big or LITTLE cores can vary from one scheduling interval to another. Most previous studies assume that the numbers of big and LITTLE cores are fixed, so they can focus on determining whether a task is big-core-biased or LITTLE-core-biased; during scheduling, the scheduler assigns each task according to its bias. Some previous studies power off idle cores, but simply powering off an idle core is trivial. Our proposed scheduler adopts a more aggressive approach that consolidates tasks with low workload onto a small number of cores to minimize the number of powered-on cores.

Our goal is to design big-LITTLE core aware, energy-efficient scheduling methods that decide, at runtime, the number of powered-on cores in each core cluster, their core frequency, and which tasks should run in each cluster. Among the previous works we surveyed, only Linaro provides scheduling strategies with a goal similar to ours.

4.6 Power Model

Before designing our scheduling strategy, we conduct some experiments to verify the power consumption of our big-LITTLE core platform. The platform we use is the


ODROID-XU [12]. The CPUs on the ODROID-XU are an Exynos5 Octa with a Cortex™-A15 1.6 GHz quad-core and a Cortex™-A7 quad-core. It supports only the first model of the big-LITTLE core architecture (cluster migration). To verify the power consumption of the cores under different loadings on this platform, we design our experiment as follows. First, we develop a computation-intensive program that drives the loading of a core to 100%. We execute this program on a core with the frequency fixed to the lowest available level and record the average power consumption during execution; we then set the core frequency one level higher and measure the average power consumption again. This process is repeated until we have the average power consumption of the core at the highest frequency. The whole procedure is then repeated to measure the average power consumption at CPU loadings of 50% and 0%. The 0% case is the power consumption of an idle core (not running any task). The 50% case is achieved by running the same computation-intensive program but limiting its maximum loading to 50% using cpulimit [13], a program that limits the CPU usage of a process (expressed as a percentage, not in CPU time).

Figure 20 shows the power consumption of a big and a LITTLE core under different loadings. From this figure we can tell that at the same frequency, the power consumption at 50% loading is approximately half of that at 100%. The power consumption of an idle core, either big or LITTLE, is small compared to a non-idle core. In the big cores, the average power increases drastically with the frequency. The conclusion of our empirical study is that we should consider increasing the frequency of the LITTLE core(s) before powering up a big core and migrating tasks to it.
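The measurement loop can be sketched as follows. The cpufreq sysfs files shown are the standard Linux interface (they require root privileges and the 'userspace' governor), but the busy-loop binary ./busyloop and the read_power_mw() hook are placeholders for whatever load generator and power sensor the board provides.

import subprocess, time

SYSFS = "/sys/devices/system/cpu/cpu0/cpufreq"

def set_freq_khz(khz):
    open(SYSFS + "/scaling_setspeed", "w").write(str(khz))

def measure_curve(load_percent, read_power_mw, seconds=30):
    freqs = sorted(int(f) for f in
                   open(SYSFS + "/scaling_available_frequencies").read().split())
    curve = {}
    for khz in freqs:
        set_freq_khz(khz)
        # pin the busy loop to core 0 and cap its load with cpulimit [13]
        burn = subprocess.Popen(["taskset", "-c", "0", "./busyloop"])
        cap = subprocess.Popen(["cpulimit", "-p", str(burn.pid),
                                "-l", str(load_percent)])
        samples = []
        for _ in range(seconds):
            time.sleep(1)
            samples.append(read_power_mw())
        curve[khz] = sum(samples) / len(samples)
        cap.terminate()
        burn.terminate()
    return curve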

Figure 20 Average power consumption of a core under different loading

Based on our observation, we propose resource-guided task scheduling for the big-LITTLE core environment. Instead of using loading as the metric as Linaro does, we define the resource of a task Ti as follows:

resourcei = loadingi × CoreFrequencyi


loadingi is the percentage of time that task Ti runs on a core during a period of time, and CoreFrequencyi is the current frequency of the core cluster the task is running on. The reason we include the current frequency is that if a task has the same loading on two cores with different frequencies, the one on the core with the higher frequency actually does more work (e.g., executes more instructions). The resource is thus equivalent to the amount of CPU cycles a task can use in a scheduling interval.

We also define the minimum resource required (min_res_reqi) for each task Ti as follows:

min_res_reqi = QoSFreqi × QoSLoadingi

The minimum resource required is determined by two factors: QoSFreq and QoSLoading. QoSFreq is the minimum core frequency that satisfies the QoS requirement of a task, for example, the minimum core frequency required to make a game run "fluently". QoSLoading is the CPU load of the task at QoSFreq. The minimum resource required is thus the least amount of resources the scheduler has to provide to a task in order to satisfy its QoS requirement.
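Both metrics reduce to a few lines; the values are proportional to the usable CPU cycles per scheduling interval. A sketch of the definitions above, where task is any object carrying the four measured values:

def resource(loading, core_freq_mhz):
    # loading: fraction of the interval the task actually ran on a core
    return loading * core_freq_mhz

def min_res_req(qos_freq_mhz, qos_loading):
    # least resource that still satisfies the task's QoS requirement
    return qos_freq_mhz * qos_loading

def lacks_resource(task):
    return (resource(task.loading, task.core_freq) <
            min_res_req(task.qos_freq, task.qos_loading))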

Since there are two core clusters, big and LITTLE, there are two run queues in our design, each keeping the runnable tasks of the corresponding cluster. We adopt the Completely Fair Scheduler (CFS) to schedule tasks within each core cluster, since CFS works well in homogeneous environments. As for the power consumption of each core, we measure the average power consumption of a big or LITTLE core under different loadings, as shown in Figure 20; Table 10 shows the detailed numbers. These power consumption numbers are the parameters of our power model. For each scheduling interval, we compute the power consumption by adding up the power consumption of each core. The power consumption Pt of an interval t is:

Pt = Σi=1..nb Pi,t + Σj=1..nL Pj,t

where nb and nL are the numbers of big and LITTLE cores, respectively, Pi,t is the power consumption of big core i, and Pj,t is the power consumption of LITTLE core j during interval t.

Our objective is to find, for each interval t, a proper setting, i.e., nb, nL, Pi,t, and Pj,t, according to the task loadings, such that Pt is minimized.

LITTLE core (Loading 100%)        Big core (Loading 100%)
Freq (MHz)   Power (mW)           Freq (MHz)   Power (mW)
250          0.152221             800          1.631437
300          0.179003             900          1.885008
350          0.204294             1000         2.24107
400          0.233657             1100         2.69247
450          0.288032             1200         3.116443
500          0.35922              1300         3.676866
550          0.435527             1400         4.358485
600          0.527482             1500         5.150607
                                  1600         5.880247

Table 10 Power consumption data
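Using the Table 10 numbers, the interval power Pt can be computed as in the following sketch. Interpolating linearly between idle and 100% load is our simplification, justified by the observation in Figure 20 that 50% load draws roughly half the 100% power; idle power is treated as negligible here.

BIG_MW = {800: 1.631437, 900: 1.885008, 1000: 2.24107, 1100: 2.69247,
          1200: 3.116443, 1300: 3.676866, 1400: 4.358485,
          1500: 5.150607, 1600: 5.880247}
LITTLE_MW = {250: 0.152221, 300: 0.179003, 350: 0.204294, 400: 0.233657,
             450: 0.288032, 500: 0.35922, 550: 0.435527, 600: 0.527482}

def interval_power(big_cores, little_cores):
    # big_cores / little_cores: lists of (frequency in MHz, load in [0, 1])
    # for each powered-on core; returns Pt in mW
    return (sum(BIG_MW[f] * load for f, load in big_cores) +
            sum(LITTLE_MW[f] * load for f, load in little_cores))

For example, interval_power([(800, 0.5)], [(250, 1.0)]) gives roughly 0.97 mW for one half-loaded big core plus one fully loaded LITTLE core.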

4.7 Resource-Guided Scheduling


The control flow of our proposed scheduling method consists of three main phases: the scheduler collects loading information in the TaskInfo phase, then makes decisions in the LittleCore and BigCore phases. In every scheduling interval, the scheduler goes through these three phases and makes the scheduling decisions for the next interval.

In the TaskInfo phase, the scheduler gathers the loading information of each task running in both core clusters, as well as the loading and current frequency of each core. Since we consider per-cluster DVFS, i.e., all cores in the same cluster run at the same frequency, the core frequency equals the cluster frequency.

In the LittleCore phase, the scheduler deals with the tasks running on LITTLE cores. If the LITTLE core cluster is powered on, our scheduler compares the resourcei of each task Ti in the LITTLE core cluster with its min_res_reqi. If any task gets less than its minimum requirement, the scheduler first tries to resolve this lack of resources by scaling up the core frequency. If DVFS cannot solve the problem, the scheduler turns on another LITTLE core to provide more resources. If both DVFS and turning on a core cannot provide enough resources, the scheduler has to perform task migration: the tasks to be migrated to a big core are marked as "migration candidates". On the other hand, if every task in the LITTLE core cluster already gets at least its minimum requirement, the scheduler powers off a LITTLE core, or scales down the frequency, in order to save power. The LittleCore phase is summarized in the pseudocode below.

After the LittleCore phase, the tasks on the big cores are processed by the following steps in the BigCore phase. The scheduler first checks the power status of the big core cluster: if the big core cluster is powered off while there are candidate tasks waiting for migration, a big core is powered up.

Algorithm LittleCore phase
Input: loading Li and minimum resource requirement minRi of each task Ti on LITTLE cores, frequency Fc of the LITTLE cores, number of powered cores Nc
Output: migration candidates {Tm}, new frequency Fn of the LITTLE cores, new number of powered cores Nn

1.  if (Nc > 0) {                    // the LITTLE cluster is powered on
2.    lackResource = false
3.    for each task Ti do
4.      resourcei = Li × Fc
5.      if (resourcei < minRi)
6.        lackResource = true
7.    end for
8.    if (lackResource)
9.      if (scaling up the frequency can provide enough resources)
10.       Fn = DVFS(Fc)
11.     else if (opening a LITTLE core can provide enough resources)
12.       Nn = Nc + 1
13.     else
14.       mark migration candidates {Tm}
15.   else
16.     if (closing a LITTLE core still provides enough resources)
17.       Nn = Nc - 1
18.     else if (scaling down the frequency still provides enough resources)
19.       Fn = DVFS(Fc)
20. }


In the BigCore phase, the scheduler migrates the candidate tasks from LITTLE to big cores before measuring the resource requirements of the tasks already on the big cores. If no migration takes place, the scheduler instead checks whether any task on a big core can be migrated back to a LITTLE core.

If no task migration is needed (either into or out of the big cores), the scheduler compares the resourcej of each task Tj in the big core cluster with its minimum resource requirement. The rest is the same as in the LittleCore phase, with one difference: if opening a new core cannot satisfy the requirement of a task, the scheduler does not migrate it as the LITTLE core cluster does; instead, it scales the frequency up to the maximum in order to provide as much resource as possible. The BigCore phase is summarized in the pseudocode below.

After the BigCore phase, the scheduler checks whether either of the two clusters has no task running, and powers off such a cluster entirely to reduce power consumption.

To summarize, our proposed method finds the proper setting, including the core frequency, the number of cores to stay powered on, and the assignment of tasks to the core clusters, that satisfies the resource requirement of each task and minimizes the energy consumption in each scheduling interval.

Algorithm BigCore phase
Input: loading Li and minimum resource requirement minRi of each task Ti on big cores, frequency Fc of the big cores, number of powered cores Nc, migration candidates from the LITTLE cores {Tin}
Output: migration candidates {Tout}, new frequency Fn of the big cores, new number of powered cores Nn

1.  if (Nc == 0 && |Tin| > 0)        // big cluster off but tasks waiting
2.    Nc = 1                         // power up one big core
3.  if (Nc > 0) {
4.    if (|Tin| > 0)
5.      migrate {Tin} from LITTLE to big cores
6.    else
7.      if (tasks can be migrated back to LITTLE cores)
8.        mark migration candidates {Tout}
9.      else
10.       lackResource = false
11.       for each task Ti do
12.         resourcei = Li × Fc
13.         if (resourcei < minRi)
14.           lackResource = true
15.       end for
16.       if (lackResource)
17.         if (scaling up the frequency can provide enough resources)
18.           Fn = DVFS(Fc)
19.         else if (opening a big core can provide enough resources)
20.           Nn = Nc + 1
21.         else
22.           Fn = maximum frequency of the big cores
23.       else
24.         if (closing a big core still provides enough resources)
25.           Nn = Nc - 1
26.         else if (scaling down the frequency still provides enough resources)
27.           Fn = DVFS(Fc)
28. }


There are three phases, TaskInfo, LittleCore, and BigCore, in each scheduling interval. The flowchart of the proposed method is given in Appendix B.
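One scheduling interval can thus be driven as in the sketch below, where taskinfo(), little_core_phase(), and big_core_phase() stand for the three phases described above; their signatures and the cfg objects are illustrative, not our implementation.

def schedule_one_interval(state):
    info = taskinfo(state)             # per-task loads, per-cluster frequency
                                       # and number of powered-on cores
    candidates, little_cfg = little_core_phase(info)
    outgoing, big_cfg = big_core_phase(info, candidates)
    for cfg in (little_cfg, big_cfg):  # finally, power off a whole cluster
        if not cfg.tasks:              # if it has no runnable task left
            cfg.powered_cores = 0
    return little_cfg, big_cfg, outgoing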

One thing worth mentioning is that the proposed method can support all three big-LITTLE models by applying different candidate selection methods. For example, in the cluster migration model, all tasks are marked as candidates, since only one core cluster can be powered at a time. In the heterogeneous multi-processing model, the task that consumes the largest amount of resources is selected and marked as the migration candidate. The reason we choose the task with the largest resource consumption instead of a task with insufficient resources is that migrating a larger task to a big core releases more resources in the LITTLE core cluster, creating a larger opportunity to scale down the frequency or power off cores.

4.8 Evaluation

Since the ODROID-XU ARM platform supports only the cluster migration model, we implemented a simulator to evaluate the effectiveness of our resource-guided scheduling method in all three models (cluster migration, CPU migration, and heterogeneous multi-processing). We also conduct experiments on the ODROID-XU ARM platform.

We conduct two experiments to evaluate the proposed scheduling method. The first compares the average power consumption of running a single application under our resource-based strategy and under Linaro's. The second compares the power consumption in a real-world scenario.

4.8.1 Simulation Results

We conduct simulations to verify our scheduling method. The parameters, such as the power consumption at different frequencies, are measured on an ODROID-XU platform. The scheduling interval is set to 1 second. CFS is assumed to be the task scheduler that assigns the tasks in a run queue to cores for execution.

The benchmarks used in our simulation are as follows. TTpod [14] is a music player app on the Android platform. Candy Crush [15] is a match-three puzzle game and one of the top 20 games on Google Play. Chrome [16] is a web browser released by Google.

We measure the minimum resource requirement of these benchmarks by finding the minimum core frequency that meets the QoS requirement of each benchmark. Table 11 shows the QoS requirements of the three benchmarks used in this work.

Benchmark      QoS requirement
TTpod          Play music without interruption
Candy Crush    At least 24 FPS during gameplay
Chrome         Jump to the next page within one second after clicking a link

Table 11 QoS requirement of benchmarks

To determine the minimum resource requirements, we execute these benchmarks and observe their behavior. From our experiments, we found that for TTpod, the loading of its main thread is less than 20% while running on a LITTLE core at the minimum frequency. We therefore set the minimum resource requirement of TTpod to the minimum frequency (250 MHz) multiplied by 20%.

As for Candy Crush, the loading of its main thread is always high no matter which


frequency is used, while its other threads have loadings below 5%. Figure 21 shows a piece of the workload of Candy Crush. However, even with this high loading, the FPS of Candy Crush stays above 24 while running on a LITTLE core at the minimum frequency. Thus we set the minimum resource requirement of Candy Crush to the minimum frequency (250 MHz) multiplied by 100%.

Figure 21 A piece of workload from Candy Crush

The minimum resource requirement is slightly different for Chrome. A web browser is "best-effort": performance improves the more resources it gets. The thread behavior of Chrome is also very different from TTpod and Candy Crush. Instead of one high-loading main thread, the loadings vary a lot during execution: when the user clicks a link and enters a new web page, the loading increases drastically, but the high loading lasts only a short period of time and then decreases. Since each website has different content, the time to fully load a page varies; we therefore consider only the time between the user's click and the browser entering the new page, instead of the time to fully load an entire page. The QoS requirement can be satisfied by setting the minimum resource required to the minimum frequency of a big core multiplied by the CPU load of Chrome at that frequency. This means that when our scheduler encounters web events/tasks, it uses a big core at the lowest frequency to complete them.
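Plugging these choices into the min_res_req definition from Section 4.6 gives the following worked example; Chrome's measured load at 800 MHz is not given in the text, so it stays symbolic.

def min_res_req(qos_freq_mhz, qos_loading):
    return qos_freq_mhz * qos_loading

ttpod = min_res_req(250, 0.20)   # = 50:  LITTLE minimum frequency, 20% load
candy = min_res_req(250, 1.00)   # = 250: LITTLE minimum frequency, full load
# chrome = min_res_req(800, chrome_qos_loading)  # big-core minimum frequency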

Currently we compute the minimum resources required manually. In the future, we will design an online profiler that automatically computes the minimum resources required for each incoming task.

In the first experiment, we simulate the execution of the three applications separately and measure their average power consumption, applying the three different candidate selection methods, one per model. Table 12 shows the results. We can see from Table 12 that the average estimated power consumption of TTpod is the same in all cases. The reason is that TTpod requires only one LITTLE core at the lowest frequency in all three models. To further verify this result, we measure the actual power consumption of TTpod; the result is shown in Figure 22.


Figure 22 shows the loading of TTpod (3318-ttpod), our estimated power (estimate_w), and the actual power on a LITTLE core (a7_w). As can be seen from Figure 22, the average estimated power (0.01904) is close to the actual one (0.01929); however, the estimated power fluctuates less than the actual power.

Figure 22 Estimated power and actual power of TTpod

On the other hand, the average estimated power consumption of Candy Crush differs between our method and Linaro's. As mentioned above, the main thread of Candy Crush always produces a high loading. Since Linaro's scheduler considers only the loading, it keeps scaling up the frequency and eventually uses the highest frequency of a big core. Our method keeps Candy Crush on the LITTLE cores and still satisfies the QoS requirement of at least 24 FPS during gameplay. As for Chrome, both our method and Linaro's use big cores. Again, Linaro's scheduler considers only the loading and uses the highest frequency of the big cores, whereas our scheduler completes the tasks using only the lowest big-core frequency.

              Resource-Guided Scheduler        Linaro
Model         I       II      III              I       II      III
TTpod         0.019   0.019   0.019            0.019   0.019   0.019
Candy Crush   0.371   0.371   0.371            1.49    1.49    1.49
Chrome        0.916   0.916   0.916            1.88    1.73    1.73

Table 12 Comparison of average power consumptions using simulation

4.8.2 Experiments on the ODROID-XU ARM platform.

In the second experiment, we execute the three applications together and measure the average power consumption during execution. The applications start at different times. The scenario is that a user first starts TTpod at time 0 to play some music; a minute later, the user starts the game Candy Crush while keeping the music playing; after playing the game for three minutes, the user finishes the game and opens Chrome to search for a solution on how to conquer a certain stage of Candy Crush.

We generate the scheduling plan using our simulator. The hardware settings, i.e., the number of cores and the core frequency, change during execution according to the scheduling plan. The two executions, our resource-guided strategy and Linaro's, use different scheduling plans since the strategies differ.

Figure 23 and Figure 24 show the results: Figure 23 shows the loading and power consumption under Linaro's strategy, and Figure 24 the result of our resource-guided strategy. The average power consumptions are 0.071 mW and 0.0072 mW, respectively. This result shows that our resource-guided strategy is more power-efficient than


Linaro’s.

Figure 23 Loading and Power consumption of Linaro

Figure 24 Loading and Power consumption of resource-guided

4.9 Conclusion

In this work, we build an energy-efficient scheduler for big-LITTLE core architecture that satisfies the resource requirement of each task and minimizes the energy consumption. Based on the resource-based metric, we propose a scheduling policy which decides the resource use for the tasks in a dynamic fashion, including the number of cores to be powered on, the assignment of tasks to big/LITTLE core clusters, the proper core frequency for the big core cluster and the LITTLE core cluster, and the best timing to migrate task(s) between big/LITTLE cores.

We verify our proposed scheduling method with both simulation and actual experiments on the ODROID-XU ARM platform. The results demonstrate that, compared to Linaro's scheduling strategies, our resource-guided scheduling method is more power-efficient.


References

[1] ARM big.LITTLE. http://en.wikipedia.org/wiki/ARM_big.LITTLE
[2] Linaro. http://www.linaro.org/
[3] Research Update on big.LITTLE MP Scheduling. http://www.linaro.org/documents/download/77ee402977299bd4545bf7027e3175284fc3128188e4d
[4] Update on big.LITTLE scheduling experiments. http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-scheduler-task-placement-rasmussen.pdf
[5] David Koufaty, Dheeraj Reddy, and Scott Hahn. Bias scheduling in heterogeneous multi-core architectures. In Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), pages 125-138. ACM, New York, NY, USA, 2010.
[6] Shaolei Ren, Yuxiong He, and Sameh Elnikety. Energy-Efficient Scheduling of Interactive Services on Heterogeneous Multicore Processors. Microsoft, 2012.
[7] Vinicius Petrucci, Orlando Loques, and Daniel Mossé. Lucky scheduling for energy-efficient heterogeneous multi-core systems. In Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems (HotPower '12). USENIX Association, Berkeley, CA, USA, 2012.
[8] Thannirmalai Somu Muthukaruppan, Mihai Pricopi, Vanchinathan Venkataramani, Tulika Mitra, and Sanjay Vishin. Hierarchical power management for asymmetric multi-core in dark silicon era. In Proceedings of the 50th Annual Design Automation Conference (DAC '13). ACM, New York, NY, USA, 2013.
[9] Qualcomm. http://www.qualcomm.com/
[10] Nvidia. http://www.nvidia.com
[11] MTK True Octa-Core. http://www.mediatek.com/_tw/Event/201307_TrueOctaCore/tureOcta_.php
[12] ODROID-XU. http://archlinuxarm.org/platforms/armv7/samsung/odroid-xu
[13] Cpulimit. http://cpulimit.sourceforge.net/
[14] TTpod. http://www.ttpod.com/
[15] Candy Crush. http://en.wikipedia.org/wiki/Candy_Crush_Saga
[16] Chrome for Android. https://www.google.com/intl/zh-TW/chrome/browser/mobile/android.html


Appendix A: Procedure for installing the L4 microkernel and L4Linux on the ODROID-X ARM board

To run Fiasco on the ODROID-X board, you need the Fiasco kernel and the L4 Runtime Environment (L4Re).

Note: lines starting with '$' are commands.

1. Download l4re-snapshot-xxx.tar.xz from http://os.inf.tu-dresden.de/download/snapshots-oc/

2. Prepare the toolchain:

Download 'Sourcery G++ Lite 2011.03-41 for ARM GNU/Linux' from https://sourcery.mentor.com/GNUToolchain/release1803 and untar the file 'arm-2011.03-41-arm-none-linux-gnueabi-i686-pc-linux-gnu.tar.bz2':

$ tar jxf arm-2011.03-41-arm-none-linux-gnueabi-i686-pc-linux-gnu.tar.bz2

The untarred files will be located in the 'arm-2011.03' folder.

2.1 Set Fiasco to use the toolchain:

$ vi unpacked_fiasco_path/bin/.arm-linux-SCRIPT

Add the following line to .arm-linux-SCRIPT:

DEFAULT_GCC_PATH_PREFIX=/path/to/arm-2011.03/bin/arm-none-linux-gnueabi-

2.2 Add the toolchain path to the current environment:

$ export PATH=/path/to/unpacked_fiasco_path/bin:$PATH

3. Build the Fiasco kernel:

Go to the unpacked Fiasco kernel folder:

$ cd unpacked_path/src/kernel/fiasco

Create a new folder 'mybuild' to hold the build files:

$ make BUILDDIR=mybuild

Go to that folder and configure the kernel options:

$ cd mybuild
$ make config

In the config menu, choose the following items:

Target configuration:
  Architecture: ARM processor family
  Platform: Samsung Exynos
  UART: 1
  Timer: Multi-core timer
  Interrupter: ExtGiC

Build the Fiasco kernel:

$ make

The compiled image will be located at 'mybuild/fiasco'.


4. Build L4Re

Go to the unpacked L4Re folder:

$ cd /path/to/unpacked_fiasco_path/src/l4

Create a new folder 'mybuild' to hold the build files:

$ mkdir mybuild

Configure the L4Re options:

$ make O=mybuild config

In the config menu, choose the following items:

Target Architecture: ARM architecture
CPU variant: ARMv7A type CPU
Platform Selection: Samsung Exynos4

Add the Fiasco kernel image path to the L4Re config:

$ vi /path/to/unpacked_fiasco_path/src/l4/conf/Makeconf.boot

Add the following lines to 'Makeconf.boot':

P=/path/to_unpacked_fiasco_path/src
MODULE_SEARCH_PATH=$P/kernel/fiasco/mybuild:$P/l4/mybuild:$P/l4/conf/examples

Go to the 'mybuild' folder:

$ cd mybuild

Build L4Re:

$ make

Build a bootable 'helloworld' image. Note: the ODROID-X board supports only the rawimage format.

$ make rawimage E=hello

The image will be located at mybuild/images/bootstrap.rawimage.

Copy it to the boot partition of the SD card:

$ cp images/bootstrap.raw /path/to/sdcard/boot/partition/bootstrap.raw

5. Booting on the ODROID-X board

5.1 Stop auto-booting:

Hit any key when you see the bootloader information.

5.2 Load the Fiasco image:

$ fatload mmc 0:1 41000000 bootstrap.raw
$ bootm 41000000

On success, it will print 'Hello World!' on the console.


6. Build L4Linux

Go to the unpacked L4Linux folder:

$ cd src/l4linux

Create a new folder 'mybuild' to hold the build files:

$ mkdir mybuild

6.1 L4Linux configuration

Load the default configuration:

$ make O=mybuild arm-mp_defconfig

Go to the folder and configure the L4Linux options:

$ cd mybuild
$ make menuconfig

In the kernel configuration menu, set the following:

L4Linux configuration
  L4 tree build directory: /path/to/src/l4/mybuild

Unset the following:

General setup
  Kernel Performance Events And Counters
    [ ] Kernel performance events and counters
  [ ] Patch physical to virtual translations at runtime
Kernel Features
  [ ] High Memory Support

6.2 Build L4Linux:

$ CROSS_COMPILE=arm-linux- make

The resulting image will be located at 'mybuild/vmlinuz.arm'.

6.3 Create a bootable image:

$ cd /path/to/src/l4/mybuild
$ make rawimage

When prompted, choose "L4Linux ARM".

6.4 Copy it to the SD card:

$ cp images/bootstrap.raw /path/to/sdcard/

Boot the image (same as section 5.2).


Appendix B: Flowchart of resource-guided scheduling mechanism