HSA Kernel Mode Driver
Advisor: Prof. 徐慰中
Student: 黃昱儒
2015/4/20
Agenda
● Introduction to HSA
  o hUMA
  o User Level Queuing
● HSA Driver
  o Flow Overview
  o User & Hardware Queues
● IOMMU
  o GCR3
  o PPR
● HSA Virtualization
  o Background
  o Virtualization Idea & Architecture
hUMA
User Level Queuing - Before HSA
User Level Queuing
[Figure: four application queues, each a ring in the owning application's virtual address space (VA) with its own rptr and wptr: Application 1 Queue 1, Application 1 Queue 2, Application 2 Queue 1, Application 3 Queue 1. All of them are served by one HSA device.]
● The application kicks the doorbell
  1. AQL packet
  2. Ring (application queue)
  3. Doorbell
● The HSA device accesses the application's ring
● IOMMU address translation (VA->PA)
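The ring, rptr, wptr, and doorbell interplay can be sketched as a toy model in Python. This is only an illustration: the class name `UserQueue` and its methods are hypothetical, and a real AQL queue lives in shared memory with a memory-mapped doorbell rather than in Python objects.

```python
class UserQueue:
    """Toy model of one user-level queue: a ring plus rptr, wptr and a doorbell."""

    def __init__(self, size):
        self.ring = [None] * size   # ring buffer in the application's VA space
        self.size = size
        self.rptr = 0               # advanced by the device (consumer)
        self.wptr = 0               # advanced by the application (producer)
        self.doorbell = 0           # last wptr value signalled to the device

    def enqueue(self, packet):
        """Application side: write one packet and advance wptr."""
        if self.wptr - self.rptr == self.size:
            raise BufferError("ring is full")
        self.ring[self.wptr % self.size] = packet
        self.wptr += 1

    def kick_doorbell(self):
        """Application side: publish the new wptr to the device."""
        self.doorbell = self.wptr

    def device_consume(self):
        """Device side: process every packet up to the doorbell value."""
        done = []
        while self.rptr < self.doorbell:
            done.append(self.ring[self.rptr % self.size])
            self.rptr += 1
        return done


q = UserQueue(size=4)
q.enqueue("aql-0")
q.enqueue("aql-1")
before_kick = q.device_consume()   # nothing has been signalled yet
q.kick_doorbell()
after_kick = q.device_consume()    # the device now sees both packets
```

The point of the model is that no system call appears between `enqueue` and `device_consume`: the application only writes memory and the doorbell.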
HSA Software Stack
[Figure: Application -> Runtime Library (libhsakmt) -> HSA-aware Kernel (KFD, IOMMU Driver) -> IOMMU and HSA Device.]
Calls made by the Runtime Library:
● open("/dev/kfd")
● ioctl(KFD_IOC_SET_MEMORY_POLICY)
● ioctl(KFD_IOC_CREATE_QUEUE)
● ioctl(KFD_IOC_DESTROY_QUEUE)
Agenda
● Introduction to HSA
● HSA Driver
  o Concepts
    ■ Flow Overview
    ■ User & Hardware Queues
  o Source Code Detail
● IOMMU
● HSA Virtualization
Concepts - HSA Run Flow
● Initialization
  o User space: create user queues
  o Kernel space: create HW queue with user queue information
● Computation (user - HW interaction)
  o User space: enqueue AQL packets, kick the doorbell, and wait for the signal
  o Kernel space: nothing
● Finish
  o User space: application finishes and destroys its queues
  o Kernel space: release HW queue
Scheduling Policy (Queue Binding Policy)
● Hardware policy with oversubscription (more queues than HW slots)
● Hardware policy without oversubscription
  o create_queue requests fail when the HW slots run out
● Software policy: the driver manually assigns queues to HW slots by programming the HW configuration registers
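The difference between the two hardware policies can be sketched with a small model. The class `HwScheduler` and its fields are hypothetical names used only for illustration; the sketch covers admission of queues and does not model how the hardware time-multiplexes oversubscribed queues.

```python
class HwScheduler:
    """Toy admission model for the two hardware scheduling policies."""

    def __init__(self, hw_slots, oversubscription):
        self.hw_slots = hw_slots                  # number of hardware queue slots
        self.oversubscription = oversubscription  # True: more queues than slots allowed
        self.queues = []

    def create_queue(self, name):
        """Return True if the queue is accepted, False if the request fails."""
        out_of_slots = len(self.queues) >= self.hw_slots
        if out_of_slots and not self.oversubscription:
            return False   # no oversubscription: fail once the slots run out
        self.queues.append(name)
        return True


strict = HwScheduler(hw_slots=2, oversubscription=False)
strict_results = [strict.create_queue(n) for n in ("q0", "q1", "q2")]

relaxed = HwScheduler(hw_slots=2, oversubscription=True)
relaxed_results = [relaxed.create_queue(n) for n in ("q0", "q1", "q2")]
```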
Software Policy
[Figure: the driver takes a free (pipe, queue) slot from the free hardware queue_id bitmap, selects it through the queue acquire register, and writes the queue's doorbell, ring_base_address, write_pointer_address, and read_pointer_address into the HSA GPU's configuration registers (MMIO, physical address). Four queues are shown: (pasid=0, queue_id=0), (pasid=0, queue_id=1), (pasid=1, queue_id=0), (pasid=1, queue_id=1).]
Hardware Policy - No Oversubscription (1)
[Figure: only the kernel_queue at (pipe=4, queue=0) is programmed into the HSA GPU's configuration registers (doorbell, ring_base_address, queue acquire register; MMIO, physical address).]
[Figure: driver data structures, with a legend: Per Application, Per Device, Per Queue, Only for HW Policy.]
IOCTL Commands Provided by KFD
● KFD_IOC_CREATE_QUEUE
  o Create a hardware queue from the application's information (e.g. ring base address)
● KFD_IOC_DESTROY_QUEUE
  o Release the hardware queue
● KFD_IOC_UPDATE_QUEUE
● KFD_IOC_SET_MEMORY_POLICY
  o Set the cache coherence policy
● KFD_IOC_GET_CLOCK_COUNTERS
  o Get the GPU clock counters
● KFD_IOC_PMC_ACQUIRE_ACCESS
● KFD_IOC_PMC_RELEASE_ACCESS
  o Exclusive access to the performance counters
● KFD_IOC_GET_PROCESS_APERTURES
  o Get the GPU's aperture information
● KFD_IOC_CREATE_VIDMEM
● KFD_IOC_DESTROY_VIDMEM
  o Used for GPU local memory
HSA Driver Flow
● System initialization
  ○ module_init
  ○ device_init (called by radeon)
● Application opens the "/dev/kfd" device
● Application sends ioctl
  ○ KFD_IOC_SET_MEMORY_POLICY
  ○ KFD_IOC_CREATE_QUEUE
● Application sends ioctl
  ○ KFD_IOC_DESTROY_QUEUE
● Application termination
module_init(kfd_module_init)
● radeon_kfd_pasid_init
  o Initialize the PASID bitmap
  o PASID 0 is reserved
● radeon_kfd_chardev_init
  o register_chrdev: /dev/kfd
  o kfd_ops
    ■ Defines the open, ioctl, and mmap member functions
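A PASID bitmap with PASID 0 reserved can be sketched as follows. This is a minimal model, not the driver's code: `PasidAllocator`, `alloc`, and `free` are hypothetical names, and the bitmap is a plain Python integer.

```python
class PasidAllocator:
    """Toy PASID bitmap: bit n set means PASID n is in use."""

    def __init__(self, limit):
        self.limit = limit
        self.bitmap = 1          # bit 0 is set up front: PASID 0 is reserved

    def alloc(self):
        """Return the lowest free PASID, or None when the bitmap is full."""
        for pasid in range(self.limit):
            if not self.bitmap & (1 << pasid):
                self.bitmap |= 1 << pasid
                return pasid
        return None

    def free(self, pasid):
        """Release a PASID; the reserved PASID 0 can never be freed."""
        if pasid == 0:
            raise ValueError("PASID 0 is reserved")
        self.bitmap &= ~(1 << pasid)


pasids = PasidAllocator(limit=4)
first = pasids.alloc()
second = pasids.alloc()
pasids.free(first)
reused = pasids.alloc()      # the freed PASID is handed out again
third = pasids.alloc()
exhausted = pasids.alloc()   # all of 1..3 are taken now
```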
SW Policy
kgd2kfd_device_init
● radeon_kfd_doorbell_init(kfd);
● radeon_kfd_interrupt_init(kfd);
● device_iommu_pasid_init(kfd);
● kfd_topology_add_device(kfd);
● amd_iommu_set_invalidate_ctx_cb(kfd->pdev, iommu_pasid_shutdown_callback);
● device_queue_manager_init(kfd);
  o dqm->initialize
● dqm->start(kfd->dqm);
dqm->initialize
● Prepare pipe, queue bitmap
kfd_open
● radeon_kfd_create_process(current)
  o Create kfd_process
  o Assign PASID
KFD_IOC_SET_MEMORY_POLICY
● Two policies
  o cache_policy_coherent
  o cache_policy_noncoherent
● Aperture used for GPU local memory
radeon_kfd_bind_process_to_device
● Called when the user application sends an ioctl command
● amd_iommu_bind_pasid()
  o Registers this kfd_process with the IOMMU
  o Also hands the process page table to the IOMMU
KFD_IOC_CREATE_QUEUE
● Create a queue with information from user space
● Return queue_id and doorbell_address to user space
  o queue_id is per kfd_process
  o doorbell_address maps to a device mmio address
pqm_create_queue
● find_available_queue_slot
  o Assign qid (per kfd_process)
● dqm->register_process
  o Register the process with the dqm (device queue manager)
● create_cp_queue
  o Create the queue with the queue_properties obtained from the application
  o Map the doorbell mmio address to the application
● dqm->create_queue
● dqm->execute_queue
dqm->execute_queue
● Write the queue configuration to the device
● load_mqd
  o ring_base_addr
  o doorbell_offset
  o queue_priority
  o ...
[Figure: dqm->execute_queue writes each queue's doorbell, ring_base_address, write_pointer_address, and read_pointer_address into the (pipe, queue) slot selected through the queue acquire register of the HSA GPU's configuration registers (MMIO, physical address). The free hardware queue_id bitmap tracks unused slots. Four queues are shown: (pasid=0, queue_id=0), (pasid=0, queue_id=1), (pasid=1, queue_id=0), (pasid=1, queue_id=1).]
HW Policy
kgd2kfd_device_init
● radeon_kfd_doorbell_init(kfd);
● radeon_kfd_interrupt_init(kfd);
● device_iommu_pasid_init(kfd);
● kfd_topology_add_device(kfd);
● amd_iommu_set_invalidate_ctx_cb(kfd->pdev, iommu_pasid_shutdown_callback);
● device_queue_manager_init(kfd);
  o dqm->initialize
● dqm->start(kfd->dqm);
dqm->start
● pm_init (packet manager)
● kernel_queue_init
  o kernel_queue doorbell
  o kernel_queue ring address
  o load_mqd writes the kernel_queue configuration to the device
pqm_create_queue
● find_available_queue_slot
  o Assign qid (per kfd_process)
● dqm->register_process
  o Register the process with the dqm (device queue manager)
● create_cp_queue
  o Create the queue with the queue_properties obtained from the application
  o Map the doorbell mmio address to the application
● dqm->create_queue
● dqm->execute_queue
dqm->execute_queue
● pm_send_runlist
  o pm_create_runlist_ib
    ■ Construct pm4 packets of MAP_PROCESS and MAP_QUEUES type
    ■ The packets contain the application's ring address
  o pm->kernel_queue->acquire_packet_buffer
    ■ Get an unused entry of the kernel_queue
  o pm_create_runlist
    ■ Construct a pm4 packet of RUN_LIST type
  o pm->kernel_queue->submit_packet
    ■ Kick the kernel queue's doorbell
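The sequence above can be modelled roughly as follows. The packets are plain dictionaries, not the real PM4 binary encoding, and every name here (`build_runlist_ib`, `KernelQueue`, `send_runlist`, the dictionary keys) is a hypothetical stand-in chosen for illustration.

```python
def build_runlist_ib(processes):
    """Emit one MAP_PROCESS packet per process, followed by its MAP_QUEUES packets."""
    packets = []
    for proc in processes:
        packets.append({
            "type": "MAP_PROCESS",
            "pasid": proc["pasid"],
            "page_table_base": proc["page_table_base"],
        })
        for mqd_addr in proc["mqd_addrs"]:
            packets.append({"type": "MAP_QUEUES", "mqd_addr": mqd_addr})
    return packets


class KernelQueue:
    """Toy kernel queue: a packet list plus a doorbell counter."""

    def __init__(self):
        self.ring = []
        self.doorbell_kicks = 0

    def submit_packet(self, packet):
        self.ring.append(packet)
        self.doorbell_kicks += 1   # submitting kicks the kernel queue's doorbell


def send_runlist(kernel_queue, processes):
    """Wrap the run list in a single RUN_LIST packet and submit it."""
    run_list = build_runlist_ib(processes)
    kernel_queue.submit_packet({"type": "RUN_LIST", "run_list": run_list})


kq = KernelQueue()
send_runlist(kq, [
    {"pasid": 1, "page_table_base": 0x1000, "mqd_addrs": [0xA000, 0xB000]},
    {"pasid": 2, "page_table_base": 0x2000, "mqd_addrs": [0xC000]},
])
packet_types = [p["type"] for p in kq.ring[0]["run_list"]]
```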
Hardware Scheduler - No Oversubscription
[Figure: one IT_RUN_LIST PM4 packet (Type3) points to a run_list. For each of 3 processes, the run_list holds an IT_MAP_PROCESS PM4 packet (Type3; page_table_base, pasid, sh_mem_config) followed by IT_MAP_QUEUES PM4 packets (Type3; mqd_addr, the Memory Queue Descriptor address).]
Hardware Scheduler - Oversubscription
[Figure: an IT_RUN_LIST PM4 packet (Type3) points to a run_list of IT_MAP_PROCESS packets (Type3; page_table_base, pasid, sh_mem_config) and IT_MAP_QUEUES packets (Type3; mqd_addr, the Memory Queue Descriptor address). The run_list ends with another IT_RUN_LIST PM4 packet (Type3).]
SW Policy
● dqm->initialize
  o Prepare (pipe, queue) bitmap
● dqm->start
● kfd_open
  o Create kfd_process
  o Assign PASID
● ioctl(CREATE_QUEUE)
  o Get queue_id
  o Map doorbell to application
● dqm->create_queue
  o init_mqd
  o Find unused (pipe, queue) to assign HW queue_id
● dqm->execute_queue
  o Write queue configuration to device
HW Policy
● dqm->initialize
● dqm->start
  o pm_init
  o kernel_queue_init
● kfd_open
  o Create kfd_process
  o Assign PASID
● dqm->create_queue
  o init_mqd
● dqm->execute_queue
  o Create pm4 packet
  o Kick kernel_queue's doorbell
● ioctl(CREATE_QUEUE)
  o Get queue_id
  o Map doorbell to application
Application Computation ...
● The HW has the ring_base_addr user-space address
  o The application enqueues AQL packets and waits for the signal
● The application has the HW doorbell mmio address
  o Used to kick the hardware
● The driver does nothing
● Until the application sends ioctl(KFD_IOC_DESTROY_QUEUE) or the application finishes
Hardware Queue Deactivation
1. Application sends ioctl(KFD_IOC_DESTROY_QUEUE)
2. Task exit notifier
dqm->destroy_queue For SW Policy
● destroy_mqd
  o acquire_queue(kgd, pipe_id, queue_id);
  o write_register(kgd, CP_HQD_DEQUEUE_REQUEST, DEQUEUE_REQUEST_DRAIN);
dqm->destroy_queue For HW Policy
● dqm->destroy_queues
  o pm_send_unmap_queue
    ■ Send a pm4 packet of UNMAP_QUEUES
  o pm_send_query_status(KFD_FENCE_COMPLETED)
Hardware Queue Deactivation (2)
● The task exit notifier calls iommu_pasid_shutdown_callback
  o Registered in kgd2kfd_device_init->amd_iommu_set_invalidate_ctx_cb
  o Called from the mmu_notifier's release function (the mmu_notifier is registered in radeon_kfd_bind_process_to_device->amd_iommu_bind_pasid)
Agenda
● Introduction to HSA
● HSA Driver
● IOMMU
  o Concepts
    ■ GCR3
    ■ PPR
  o Source Code Detail
● HSA Virtualization
Introduction to IOMMU
● The user application writes AQL packets into the ring, whose address is a virtual address
● Device accesses therefore need VA to PA translation
[Figure: the HSA GPU's device table entry points to the GCR3 table. The GCR3 entry for PASID=2 is assigned kfd_process->mm->pgd.]
PRI & PPR
● The operating system is usually required to pin memory pages used for I/O.
● The IOMMU provides a mechanism that lets peripherals use unpinned pages for I/O.
● Only supported by AMD IOMMU_v2
PRI & PPR
● PRI (page request interface)
  o The peripheral requests memory management service from the host OS (e.g. page fault service for the peripheral)
  o Issued by the peripheral
● PPR (peripheral page service request)
  o When the IOMMU receives a valid PRI request, it creates a PPR message in the request log to request changes to the virtual address space
  o Issued by the IOMMU as an interrupt
● Used to request IO page table changes
  o The IOMMU driver can register a PPR notifier
module_init(amd_iommu_v2_init)
● amd_iommu_register_ppr_notifier(&ppr_nb);
  o PPR callback
    ■ ppr_notifier function
amd_iommu_bind_pasid
● Called when a kfd_process is created
  o mmu_notifier_register(&pasid_state->mn, pasid_state->mm);
  o amd_iommu_domain_set_gcr3(dev_state->domain, pasid, __pa(pasid_state->mm->pgd));
[Figure: the HSA GPU's device table entry points to the GCR3 table. The GCR3 entry for PASID=2 is assigned kfd_process->mm->pgd.]
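The effect of writing the process page table root into the GCR3 entry for a PASID can be sketched with a toy model. It assumes a single-level page table stored as a Python dictionary and 4 KiB pages; `Gcr3Table`, `bind_pasid`, and `translate` are hypothetical names, not the IOMMU driver's API.

```python
PAGE_SIZE = 4096


class Gcr3Table:
    """Toy GCR3 table: one page table root per PASID."""

    def __init__(self):
        self.entries = {}   # pasid -> page table (virtual page -> physical frame)

    def bind_pasid(self, pasid, page_table):
        # The IOMMU keeps a reference to the process page table, not a copy.
        self.entries[pasid] = page_table

    def translate(self, pasid, va):
        """Translate a device access (pasid, VA) into a physical address."""
        page, offset = divmod(va, PAGE_SIZE)
        frame = self.entries[pasid][page]
        return frame * PAGE_SIZE + offset


process_page_table = {0x10: 0x80}          # VA page 0x10 -> physical frame 0x80
gcr3 = Gcr3Table()
gcr3.bind_pasid(2, process_page_table)
pa = gcr3.translate(2, 0x10123)

# A mapping added by the process later is visible to the device as well,
# because both sides walk the same table.
process_page_table[0x11] = 0x90
pa_late = gcr3.translate(2, 0x11004)
```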
PRI & PPR Flow
1. The peripheral issues a PRI to the IOMMU
2. The IOMMU writes a PPR request to the PPR log (the log entry contains fault address, pasid, device_id, tag, flags)
3. The IOMMU sends an interrupt to the CPU
PPR Flow
● When the irq comes
  o status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);
  o if (status & MMIO_STATUS_PPR_INT_MASK)
● ppr_notifier
  o Registered in amd_iommu_v2_init
● do_fault
do_fault
● Uses the get_user_pages() API to pin the faulting pages into memory
  o mm_struct, fault_addr
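The whole path from PRI to do_fault can be sketched as a toy event flow. All names (`PprIommu`, `page_request`, `resident_pages`) are hypothetical, the log entry keeps only three of the fields listed in the slides, and pinning a page is reduced to recording its page number.

```python
PAGE_SIZE = 4096


class PprIommu:
    """Toy IOMMU that turns page requests into PPR log entries."""

    def __init__(self):
        self.ppr_log = []
        self.ppr_notifier = None

    def register_ppr_notifier(self, callback):
        self.ppr_notifier = callback

    def page_request(self, pasid, device_id, fault_addr):
        """PRI from a peripheral: log a PPR entry, then interrupt the CPU."""
        self.ppr_log.append(
            {"pasid": pasid, "device_id": device_id, "fault_addr": fault_addr}
        )
        self.interrupt()

    def interrupt(self):
        """Interrupt handler: drain the PPR log through the notifier."""
        while self.ppr_log:
            self.ppr_notifier(self.ppr_log.pop(0))


resident_pages = {}   # pasid -> set of page numbers made resident


def do_fault(entry):
    """Stand-in for the fault handler: make the faulting page resident."""
    page = entry["fault_addr"] // PAGE_SIZE
    resident_pages.setdefault(entry["pasid"], set()).add(page)


iommu = PprIommu()
iommu.register_ppr_notifier(do_fault)
iommu.page_request(pasid=2, device_id=0x10, fault_addr=0x5008)
```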
Flow Review
[Figure: Application -> Runtime Library (libhsakmt) -> HSA-aware Kernel (KFD, IOMMU Driver) -> IOMMU and HSA Device.]
Calls made by the Runtime Library:
● open("/dev/kfd")
● ioctl(KFD_IOC_SET_MEMORY_POLICY)
● ioctl(KFD_IOC_CREATE_QUEUE)
● ioctl(KFD_IOC_DESTROY_QUEUE)
Agenda
● Introduction to HSA
● HSA Driver
● IOMMU
● HSA Virtualization
  o Background
  o Virtualization Idea & Architecture
System Virtualization
● How to do it?
  o For same-ISA virtualization, just load the guest OS kernel into memory and set the PC
● Guest OSes must be isolated from each other!
● How to manage system resources?
  o CPU virtualization
  o Memory virtualization
  o I/O virtualization
CPU Virtualization
Without Virtualization Extension
[Figure: the guest application and the guest OS both run in user mode; the hypervisor runs in kernel mode.]
● Problem 1: system resource control
  o Needs binary translation or para-virtualization
● Problem 2: unnecessary traps, e.g. system calls
  o Causes a performance drop
With Virtualization Extension
[Figure: the guest application runs in user mode, the guest OS in kernel mode, and the hypervisor in hypervisor mode.]
● Solution to problem 1: trap critical instructions (controlled by the hypervisor)
● Solution to problem 2: system calls trap into the guest OS inside kernel mode
Critical Instruction
● Privileged instruction: causes a trap if executed in unprivileged mode
● Sensitive instruction: interacts with system resources
● Critical instruction: sensitive but non-privileged instruction
[Figure: diagram relating privileged, sensitive, and non-privileged instructions.]
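The three definitions reduce to a set difference, which a few lines can make concrete. The instruction names below are an illustrative assumption based on classic x86 (where POPF and SGDT are commonly cited as sensitive but unprivileged); they are not taken from the slides.

```python
# Illustrative instruction sets; the membership is an assumption for the example.
privileged = {"LGDT", "MOV_TO_CR3"}                 # trap when run unprivileged
sensitive = {"LGDT", "MOV_TO_CR3", "POPF", "SGDT"}  # touch system resources

# Critical instructions are sensitive but do not trap on their own.
critical = sensitive - privileged

# An ISA can be virtualized by trap-and-emulate alone only if this set is empty.
trap_and_emulate_ok = not critical
```

Because `critical` is not empty here, a hypervisor without hardware support has to fall back on binary translation or para-virtualization for those instructions.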
Memory Virtualization
Without Virtualization Extension
[Figure: Guest APP, Guest OS, Hypervisor, System Memory. A shadow page table maintained by the hypervisor maps a Guest Virtual Address (GVA) directly to a Host Physical Address (HPA), bypassing the Guest Physical Address (GPA).]
With Virtualization Extension
[Figure: Guest APP, Guest OS, Hypervisor. The page table maintained by the guest OS (GOS) maps GVA to GPA; the page table maintained by the hypervisor maps GPA to HPA.]
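Both schemes must produce the same host physical address, which a small sketch can check. It assumes single-level page tables stored as dictionaries and 4 KiB pages; `translate` and `build_shadow` are hypothetical helper names.

```python
PAGE_SIZE = 4096


def translate(page_table, addr):
    """Translate an address through one single-level page table."""
    page, offset = divmod(addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


def build_shadow(guest_pt, host_pt):
    """Compose GVA->GPA and GPA->HPA into one GVA->HPA shadow table."""
    return {gva_page: host_pt[gpa_page] for gva_page, gpa_page in guest_pt.items()}


guest_pt = {0x1: 0x20}     # maintained by the guest OS: GVA page -> GPA page
host_pt = {0x20: 0x300}    # maintained by the hypervisor: GPA page -> HPA page
gva = 0x1ABC

# With the extension: hardware walks both tables.
two_stage_hpa = translate(host_pt, translate(guest_pt, gva))

# Without the extension: the hypervisor precomputes a shadow table.
shadow_hpa = translate(build_shadow(guest_pt, host_pt), gva)
```

The shadow table has to be rebuilt whenever the guest changes `guest_pt`, which is the maintenance cost the hardware extension removes.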
I/O Virtualization
● Four techniques
  o Full virtualization
  o Virtio
  o Directed I/O (also called device pass-through)
  o SR-IOV (Single-Root I/O Virtualization)
● The last two are useful only when the I/O virtualization extension is supported
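Full Virtualization
[Figure: the guest driver in the guest OS issues port I/O, which traps into the hypervisor. Device emulation in the hypervisor calls the physical device driver, which drives the physical device.]
out 0x1F3, 0x00
out 0x1F4, 0x00
out 0x1F5, 0x03
out 0x1F6, 0xE8    (start at logical sector 1000 (0x3E8))
out 0x1F2, 0x08    (8 sectors)
out 0x1F7, 0x20    (read command (0x20))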
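Virtio
● Accessing the Virtio MMIO region causes a trap
Virtio in KVM+Qemu
[Figure: the front-end driver in the guest OS and the back-end driver in QEMU share a virtqueue. KVM and the real device driver run in the host kernel, and the real device driver accesses the disk.]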
Device Pass-through
[Figure: Device Pass-Through compared with Single Root I/O Virtualization (SR-IOV).]
I/O Virtualization Extension - IOMMU
[Figure: on the CPU side, the MMU translates a Guest Virtual Address (GVA) of the guest APP to a Guest Physical Address (GPA) with the page table maintained by the guest OS (GOS), and GPA to a Host Physical Address (HPA) with the page table maintained by the hypervisor. On the device side, the guest's device driver programs the I/O device with GPAs, and the IOMMU translates GPA to HPA with a page table maintained by the hypervisor before the access reaches system memory.]
More About I/O Virtualization
● Principle: how to virtualize the HW interface
  o Full virtualization & virtio: let qemu emulate it
  o Device pass-through: assign the HW interface to the guest
● A Full GPU Virtualization Solution with Mediated Pass-Through
  o Intel Corporation @ USENIX ATC'14
  o HW interfaces
    ■ Frame buffer: partitioning
    ■ Command buffer: partitioning
    ■ I/O registers: emulate
    ■ GPU page tables: emulate
[Figure: one HSA device serves queues owned by host applications (Application 1 Queue 1, Application 1 Queue 2) and by guest applications (Guest 1 Application 1 Queue 1, Guest 2 Application 1 Queue 1). Each queue has its own rptr and wptr.]
● The application kicks the doorbell
● The HSA device accesses the application's queue
● IOMMU address translation (VA->PA)
● Guest queues need IOMMU second-level address translation support (GVA->IPA->PA)
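[Figure: in the guest OS, the guest application calls the HSA Runtime Library, which talks to the Virtio-KFD front-end driver. Through a shared virtqueue, the front-end passes (1) the guest application queue address and (2) the guest process page table to the back-end driver in Qemu (a host process). In the host OS, KVM, KFD, and the IOMMU driver enable IOMMU two-stage translation for the HSA device.]
Calls made by the HSA Runtime Library:
● open("/dev/kfd")
● ioctl(KFD_IOC_SET_MEMORY_POLICY)
● ioctl(KFD_IOC_CREATE_QUEUE)
● ioctl(KFD_IOC_DESTROY_QUEUE)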
Handle I/O Page Fault
[Figure: the IOMMU raises an interrupt to the IOMMU driver in the host OS. The driver gets the fault address and the process context (Process1, Process2, or Process3).]
Handle Guest I/O Page Fault
[Figure: the IOMMU raises an interrupt to the IOMMU driver in the host OS, which gets the fault address and process context and keeps a shadow PPR log. In the guest OS, virtio-iommu (1) allocates a guest fault log region and (2) mmaps it; the Qemu back-end maps this guest fault log region. The host OS also contains the KFD driver and KVM; the guest OS runs Guest Process 1, 2, and 3.]
Reference
● https://github.com/HSAFoundation/HSA-Drivers-Linux-AMD
● http://www.hsafoundation.com/standards/
● http://www.intel.com/content/www/us/en/pci-express/pci-sig-sr-iov-primer-sr-iov-technology-paper.html
● https://www.usenix.org/system/files/conference/atc14/atc14-paper-tian.pdf
Q&A
Thanks!