HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding
Presentation HSA-4138, HSAemu – A Full System Emulator for HSA Platform, by Yeh Ching Chung and Jiun-Hung Ding at the AMD Developer Summit (APU13), November 11-13, 2013
HSAemu – A Full System Emulator for HSA Platform
Prof. Yeh‐Ching Chung
System Software Laboratory
Department of Computer Science
National Tsing Hua University
National Tsing Hua University ® copyright OIA
Outline
Introduction to HSA
Design of HSAemu
Performance Evaluation
Conclusions and Future Work
Introduction to HSA
HSA is an industry standard that defines a next-generation hardware/software architecture for heterogeneous computing
Hardware Platform of HSA
Simplified HSA Software Stack
[Diagram: layered software stack, top to bottom]
Application SW: Application; Domain Specific Libs (Bolt, OpenCV™, … many others)
Runtimes: Renderscript/OpenCL Runtime; OpenGL-ES Runtime; Other Runtime
HSA Software: HSA Runtime; HSAIL; HSA Finalizer
Drivers: Legacy Driver; Kernel Driver; Ctl
Differentiated HW: CPU(s); GPU(s); Other Accelerators; GPU ISA
Specification of Simple HSA Platform
Hardware
– Memory
  • Shared Virtual Memory (hUMA)
  • Cache Coherency Domains
  • Memory-Based Signaling and Synchronization for CPU and GPU
– Task Control
  • Architected Queuing Language (AQL)
  • Efficient Syscall Infrastructure
  • Preemptive Context Switching
– Debugging Infrastructure
  • Allow system software to set Instruction/Memory/Conditional, etc., breakpoints
– Exception Handling
  • GPU trap handler to trigger GPU interrupt for GPU exception
Software
– HSA Runtime APIs
  • Initialization of HSA components
  • Topology discovery
  • Manage AQL packets
  • Dispatch application tasks
  • Signal HW and wait for result
  • Recycle available resources
– User Mode Queue
  • Store AQL packets
– Virtual ISA - HSAIL
  • A low level instruction set designed for parallel computing
What Is HSAemu
HSAemu is a full system emulator that supports the following HSA features
– Shared virtual memory between CPU and GPU
– Memory based signaling and synchronization
– Multiple user level command queues
– Preemptive GPU context switching
– Concurrent execution of CPU threads and GPU threads
– HSA runtime
– Finalizer
A Project Sponsored by MediaTek (MTK)
Currently, it supports simple HSA platform simulation
– Functional-accurate simulation
– Cycle-accurate simulation
Architecture of HSAemu
HSAemu consists of 6 components
– HSA Runtime
– CPU Simulation Module
– GPU Task Dispatcher
– Functional-Accurate GPU Simulator (Fast-GPU Simulator)
– Cycle-Accurate GPU Simulator (Multi2Sim)
– GPU Helper Functions
HSAemu Runtime
User Mode Queue
– Store AQL packets
AQL Queue Manager
– Manage AQL packets in User Mode Queue
AQL Command Dispatcher
– Launch the execution of kernel jobs on HSAemu
Support OpenCL runtime
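The interplay of the three runtime pieces on this slide can be illustrated with a minimal sketch. It is not HSAemu code: the `AqlPacket` fields, the class names, and the callable that stands in for the HSA Signal Handler are all hypothetical, and a real AQL packet carries many more fields.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AqlPacket:
    """Hypothetical, heavily simplified stand-in for an AQL dispatch packet."""
    kernel_name: str
    grid_size: int
    workgroup_size: int


class UserModeQueue:
    """Bounded FIFO that stores AQL packets written by the application."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._packets = deque()

    def enqueue(self, packet):
        # Refuse the packet when the queue is full so the caller can retry later.
        if len(self._packets) >= self.capacity:
            return False
        self._packets.append(packet)
        return True

    def dequeue(self):
        # Return the oldest packet, or None when the queue is empty.
        return self._packets.popleft() if self._packets else None

    def __len__(self):
        return len(self._packets)


class AqlCommandDispatcher:
    """Launches a kernel job: enqueue a packet, then signal the consumer."""

    def __init__(self, queue, signal_handler):
        self.queue = queue
        # Assumed interface: a callable that receives the queue that has new work.
        self.signal_handler = signal_handler

    def dispatch(self, kernel_name, grid_size, workgroup_size):
        packet = AqlPacket(kernel_name, grid_size, workgroup_size)
        if not self.queue.enqueue(packet):
            return False
        self.signal_handler(self.queue)
        return True
```

In this sketch the queue manager role is folded into `UserModeQueue` itself; a dispatcher built with `AqlCommandDispatcher(UserModeQueue(capacity=2), handler)` reports `False` on the third `dispatch` call until a packet is dequeued.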
CPU Simulation Module (1)
PQEMU
– Perform multicore CPU simulation
HSA Signal Handler
– Receive AQL command from HSA Runtime and launch GPU simulation
CPU Simulation Module (2)
PQEMU
– A parallel system emulator based on QEMU
– Two efficient synchronization models (UCC/SCC)
– Dynamic binary translation (DBT) technique
– A project sponsored by MTK
Agent code, HSA runtime, and operating system are run on PQEMU
[Figure: Unified Code Cache (UCC) Model. Four CPUs, each with its own DBT, share one Code Cache]
“PQEMU: A Parallel System Emulator Based on QEMU” (ICPADS 2011)
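The idea of several emulated CPUs sharing one translation cache can be sketched as follows. This is only an illustration of the general concept, not PQEMU's design: the single coarse lock, the `toy_translate` function, and the fixed 4-byte block size are assumptions made to keep the example short.

```python
import threading


class UnifiedCodeCache:
    """One translation cache shared by all emulated CPUs (illustrative only)."""

    def __init__(self, translate):
        self._translate = translate      # guest PC -> host callable
        self._blocks = {}
        self._lock = threading.Lock()    # assumed: one coarse lock for simplicity
        self.translations = 0

    def lookup(self, guest_pc):
        # Translate on a miss; later lookups from any CPU reuse the cached block.
        with self._lock:
            block = self._blocks.get(guest_pc)
            if block is None:
                block = self._translate(guest_pc)
                self._blocks[guest_pc] = block
                self.translations += 1
            return block


def toy_translate(guest_pc):
    """Hypothetical 'translation': a host callable returning the next guest PC."""
    def host_block():
        return guest_pc + 4
    return host_block


def run_vcpu(cache, start_pc, steps, final_pcs, index):
    # Execution loop of one emulated CPU: look up a block, run it, repeat.
    pc = start_pc
    for _ in range(steps):
        pc = cache.lookup(pc)()
    final_pcs[index] = pc


def simulate(num_vcpus, start_pc, steps):
    cache = UnifiedCodeCache(toy_translate)
    final_pcs = [None] * num_vcpus
    threads = [
        threading.Thread(target=run_vcpu, args=(cache, start_pc, steps, final_pcs, i))
        for i in range(num_vcpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache.translations, final_pcs
```

Because every emulated CPU walks the same guest addresses here, each block is translated once no matter how many CPUs execute it, which is the benefit a shared cache is meant to provide.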
GPU Task Dispatcher (1)
AQL Command Monitor
– Receive signal from HSA Signal Handler
– Copy AQL packets from User Mode Queue to HW AQL Queue
– Launch AQL Packet Worker
AQL Packet Worker
– Dequeue AQL packets from HW AQL Queue
– Parse AQL packet
– Dispatch kernel jobs to Fast-GPU Simulator or M2S-GPU Simulator according to the kernel information
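The two roles on this slide can be sketched as two small functions. The packet representation (a dict) and the `cycle_accurate` key used to pick a simulator are assumptions; the slides only say that the choice depends on "the kernel information".

```python
from collections import deque


def aql_command_monitor(user_mode_queue, hw_aql_queue):
    """Move every pending packet from the user mode queue to the HW AQL queue."""
    copied = 0
    while user_mode_queue:
        hw_aql_queue.append(user_mode_queue.popleft())
        copied += 1
    return copied


def aql_packet_worker(hw_aql_queue, simulators):
    """Dequeue, parse, and dispatch each packet to the simulator it asks for."""
    results = []
    while hw_aql_queue:
        packet = hw_aql_queue.popleft()
        # Hypothetical field: a flag in the kernel information selects the backend.
        target = "m2s" if packet.get("cycle_accurate") else "fast"
        results.append(simulators[target](packet))
    return results


def handle_signal(user_mode_queue, simulators):
    """What happens after the HSA Signal Handler notifies the dispatcher."""
    hw_aql_queue = deque()
    aql_command_monitor(user_mode_queue, hw_aql_queue)
    return aql_packet_worker(hw_aql_queue, simulators)
```

Here `simulators` is a plain mapping such as `{"fast": run_fast_gpu, "m2s": run_m2s_gpu}`, where both callables are placeholders for the two GPU simulators.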
GPU Task Dispatcher (2)
Execution Flow
GPU Task Dispatcher (3)
Signal from HSA Signal Handler
GPU Task Dispatcher (4)
Copy AQL packets from User Mode Queue
GPU Task Dispatcher (5)
Ask AQL Packet Worker to parse AQL Packet
GPU Task Dispatcher (6)
Launch Fast-GPU Simulator
GPU Task Dispatcher (7)
Launch M2S-GPU Simulation
Fast‐GPU Simulator
A functional-accurate simulator for generic GPU model simulation
– HSAIL Translator
  • Act as a Finalizer
  • Use static binary translation technique to translate BRIG file to host executable binary file (x86) based on LLVM
  • Host SSE instruction optimization
– GPU Thread Scheduler
  • Simulate a generic GPU model
HSAIL Translator (1)
Architecture
HSAIL Translator (2)
Launch LLVM HSAIL Translator
HSAIL Translator (3)
Construct Control Flow Graph of HSAIL
HSAIL Translator (4)
Translate HSAIL to LLVM IR
HSAIL Translator (5)
Translate LLVM IR to Host Executable Object File
HSAIL Translator (6)
Load Host Executable Object File to memory
HSAIL Translator (7)
Link to GPU Helper Functions
HSAIL Translator (8)
Store the translation result to GPU Code Cache
HSAIL Translator (2)
Host SSE instruction optimization
– Reconstruct the control flow graph of kernel function
– Use bitmap masking and packing/unpacking algorithms to generate host SSE instructions
HSAIL Translator (3)
Example: The control flow graph for kernel function $foo
HSAIL Translator (4)
Reconstruct the control flow graph by depth-first traversal
Perform bitmap masking and packing & unpacking algorithms
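The slides do not spell out the masking and packing algorithms, so the following is only a sketch of the general technique they name: pack the scalars of several work items into one vector, execute both sides of a branch for all lanes, and use a per-lane bitmap to select the result. The lane count of four (four 32-bit values in a 128-bit SSE register), the example kernel, and all function names are assumptions, and Python lists stand in for vector registers.

```python
LANES = 4  # assumed: four 32-bit lanes in one 128-bit SSE register


def pack(values, start):
    """Pack up to LANES consecutive per-work-item scalars into one vector."""
    return values[start:start + LANES]


def unpack(vector, out, start):
    """Scatter a vector back to the per-work-item output slots."""
    out[start:start + LANES] = vector


def masked_select(mask, then_vec, else_vec):
    """Per lane, keep the 'then' value where the mask bit is set."""
    return [t if m else e for m, t, e in zip(mask, then_vec, else_vec)]


def kernel_scalar(data):
    """Reference kernel, one work item at a time: y = -x if x < 0 else 2 * x."""
    return [-x if x < 0 else 2 * x for x in data]


def kernel_vectorized(data):
    """Same kernel with the branch replaced by a mask over LANES work items."""
    out = [0] * len(data)
    for start in range(0, len(data), LANES):
        x = pack(data, start)
        mask = [v < 0 for v in x]        # bitmap: which lanes take the branch
        then_vec = [-v for v in x]       # both paths run for every lane
        else_vec = [2 * v for v in x]
        unpack(masked_select(mask, then_vec, else_vec), out, start)
    return out
```

Both versions return the same list for any input; the vectorized form trades extra arithmetic on the untaken path for the absence of per-work-item branches.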
GPU Thread Scheduler
Simulate a generic GPU model
– GPU Thread Scheduler assigns work groups to free CU threads in the GPU Thread Pool
– Each CU thread executes all work items in a work group
– The maximum number of CU threads is limited by host operating system
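The scheduling policy described above maps naturally onto a host thread pool. The sketch below uses Python's `ThreadPoolExecutor` as a stand-in for the GPU Thread Pool; the function names, the requirement that the grid size be a multiple of the work group size, and the `vec_add` kernel are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_work_group(kernel, group_id, workgroup_size, args):
    """Body of one CU thread: run every work item of one work group in order."""
    for local_id in range(workgroup_size):
        global_id = group_id * workgroup_size + local_id
        kernel(global_id, *args)
    return group_id


def launch_kernel(kernel, grid_size, workgroup_size, args, num_cu_threads):
    """Assign work groups to free CU threads until the whole grid has run."""
    if grid_size % workgroup_size != 0:
        raise ValueError("grid_size must be a multiple of workgroup_size")
    num_groups = grid_size // workgroup_size
    with ThreadPoolExecutor(max_workers=num_cu_threads) as pool:
        futures = [
            pool.submit(run_work_group, kernel, group_id, workgroup_size, args)
            for group_id in range(num_groups)
        ]
        # result() re-raises any exception thrown inside a CU thread.
        return sorted(f.result() for f in futures)


def vec_add(global_id, a, b, out):
    """Example kernel: each work item adds one pair of elements."""
    out[global_id] = a[global_id] + b[global_id]
```

With `num_cu_threads` smaller than the number of work groups, the pool hands the next pending group to whichever CU thread becomes free first.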
M2S‐GPU Simulator (1)
A cycle-accurate simulator for AMD Southern Islands GPU model simulation
– HSAIL Translator
  • Translate BRIG file to GPU binary
– M2S Bridge
  • Bridge Multi2Sim GPU Model to HSAemu
– M2S GPU Module
  • Simulate a cycle-accurate GPU model
M2S‐GPU Simulator (2)
HSAIL Translator
– Act as a Finalizer
– Translate HSAIL to AMD Southern Islands GPU binary
– Use static binary translation technique based on LLVM
M2S‐GPU Simulator (3)
M2S Bridge: An interface to launch M2S GPU Module
– Initialize the data structures used by AMD Southern Islands GPU, including a memory register for AMD Southern Islands GPU to access the shared system memory in HSAemu
– Invoke M2S GPU Module (the AMD Southern Islands GPU module in Multi2Sim)
M2S‐GPU Simulator (4)
M2S GPU Module
– A cycle-accurate AMD Southern Islands GPU simulator in Multi2Sim
Memory access is performed by the HSAemu memory helper function to comply with the hUMA model
GPU Helper Functions (1)
Memory Helper Function
– A soft-mmu of GPU with a page table worker and a TLB to enable hUMA model
– Support the redirect access of a local segment memory to a non-shared private memory in GPU
Kernel Information Helper Function
– Collect and return information of GPU simulation and current execution state
– Retrieve kernel information such as work item ID, work group size, etc., from AQL packet
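A software MMU with a TLB in front of a page table lookup can be sketched in a few lines. Nothing here is taken from HSAemu: the 4 KiB page size, the single-level dict page table, the LRU replacement policy, and the `SoftMMU` and `PageFault` names are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size


class PageFault(Exception):
    """Raised when a virtual page has no mapping in the page table."""


class SoftMMU:
    """GPU-side address translation over a page table shared with the CPU."""

    def __init__(self, page_table, tlb_entries=4):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.tlb = OrderedDict()       # small LRU cache of recent translations
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(vpn)            # mark as most recently used
            pfn = self.tlb[vpn]
        else:
            self.misses += 1
            if vpn not in self.page_table:       # page table lookup on a TLB miss
                raise PageFault(hex(vaddr))
            pfn = self.page_table[vpn]
            self.tlb[vpn] = pfn
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)     # evict the least recently used entry
        return pfn * PAGE_SIZE + offset
```

Because the GPU model translates through the same page table the CPU populated, a pointer handed over by the CPU needs no copying, and the `hits` and `misses` counters give the kind of TLB event trace a simulator can report.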
GPU Helper Functions (2)
Mathematical Helper Function
– Simulate special mathematical instructions such as trigonometric instructions by calling the corresponding mathematical functions in standard library
Synchronization Helper Function
– Barrier synchronization implementation for generic GPU model simulation
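When one host thread executes all work items of a work group, a barrier cannot simply block, because the remaining work items would never run. One possible way to handle this, shown below purely as an assumption and not as HSAemu's method, is to write each work item as a Python generator that yields at every barrier and to advance all work items phase by phase.

```python
def run_work_group_with_barriers(kernel, workgroup_size, *args):
    """Run all work items of one group on a single thread, honoring barriers.

    Each work item is a generator; `yield` marks a barrier. Returns the number
    of barriers the group passed.
    """
    active = [kernel(local_id, *args) for local_id in range(workgroup_size)]
    barriers = 0
    while active:
        still_running = []
        for item in active:
            try:
                next(item)                   # run until the next barrier or the end
                still_running.append(item)
            except StopIteration:
                pass
        # Either every work item reached the barrier or every one finished.
        if still_running and len(still_running) != len(active):
            raise RuntimeError("work items diverged at a barrier")
        if still_running:
            barriers += 1
        active = still_running
    return barriers


def reverse_kernel(local_id, src, scratch, dst):
    """Example kernel: reverse a block through group memory."""
    scratch[local_id] = src[local_id]        # phase 1: write to group memory
    yield                                    # barrier: wait for all writes
    dst[local_id] = scratch[len(src) - 1 - local_id]
```

Without the barrier, work item 0 would read `scratch[3]` before work item 3 had written it; running the group in phases makes the reversed output correct.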
hUMA Model in HSAemu
Unified coherent address space
– GPU can access a virtual memory page allocated by CPU
Soft-mmu is simulated for GPU
– TLB hit/miss events can be traced
Memory segment access
– Global memory segment access is handled by memory helper function
– Group memory segment access is handled by host ld/st instructions
Recall: Hardware Simulation of HSAemu
HSA hardware components simulated
– Multicore CPU: A parallel multicore CPU model simulation
– Functional-Accurate GPU: A generic GPU model simulation
– Cycle-Accurate GPU: AMD Southern Islands GPU model simulation
– hUMA: A unified address space between CPU and GPU simulation
– Synchronization Primitive: Barrier instruction simulation
– Hardware AQL Queue: A HW dispatch queue for GPU simulation
Recall: Software Utilities of HSAemu
HSA software utilities designed
– HSA Runtime: HSA runtime library (OpenCL runtime)
– Topology Discovery: Discover the current platform topology
– User Mode Queue: A queue for each user application
– Signal Event: Notify GPU to work
– HSAIL Generator: A PTX to HSAIL source level translator
– BRIG Generator: Generate a binary format from a Kernel file
– HSAIL Translator: Translate HSAIL to host executable binary
– GPU Code Cache: Store translated host binaries
Performance Evaluation
Experimental Environment
Benchmarks:
– Nearest Neighbor (NN), K-Means, FFT, FWT, N-Body
– Binary Search, Bitonic Sort, Reduction, FWT
Scalability of Fast‐GPU Simulator
Comparison of NN, K-means and FWT benchmarks on 32 physical cores
The speedup is scalable when # of CU threads < # of host physical cores
SSE Optimization of Fast‐GPU Simulator
Performance comparison of FFT with SSE optimization turned on and off
N‐Body Simulation by Fast‐GPU Simulator
N‐Body Simulation
All host physical CPUs are running
Comparison of HSAemu and Multi2Sim
[Bar chart: benchmark speedup, series multi2sim / HSAemu / Hybrid]
Fast-GPU Sim > M2S-GPU sim > Multi2Sim

Benchmark            multi2sim   HSAemu      Hybrid
BinarySearch         1           2.931317    2.873768
BitonicSort          1           18.88827    0.921835
FastWalshTransform   1           8.645516    2.407809
Reduction            1           6.294213    2.105663
Conclusions
An HSA-compliant full system emulator has been implemented
– A functional-accurate simulator for generic GPU model
– A cycle-accurate simulator for AMD Southern Islands GPU model (from Multi2Sim)
The HSAIL Translator acts as a finalizer that enables the integration of HSAemu with existing simulators, for example, Multi2Sim
Open source – Nov. 12, 2013
– http://hsaemu.org/
Future work
Enhance HSAemu by implementing more HSA features
Integrate HSAemu with some existing cycle-accurate GPU simulators
Design a cycle-accurate simulator based on PQEMU for generic CPU model
Design a cycle-accurate simulator based on PQEMU for big.LITTLE CPU model
Q & A