solros: a data-centric operating system architecture for heterogeneous...
TRANSCRIPT
Solros: A Data-Centric Operating System Architecturefor Heterogeneous Computing
Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap,Steffen Maass, Heeseung Jo, Taesoo Kim
Virginia Tech, eBay, Georgia Tech, Chonbuk National University
April 26, 2018
Changwoo Min Solros: Data-Centric OS April 26, 2018 1 / 21
Cambrian Explosion of Processor Architecture
Changwoo Min Solros: Data-Centric OS April 26, 2018 2 / 21
Specialization of general-purpose processors
Cambrian Explosion of Processor Architecture
Changwoo Min Solros: Data-Centric OS April 26, 2018 2 / 21
Specialization of general-purpose processors
Generalization of co-processors
Cambrian Explosion of Processor Architecture
Changwoo Min Solros: Data-Centric OS April 26, 2018 2 / 21
Specialization of general-purpose processors
Generalization of co-processors
Specialization of co-processors
Blazingly fast IO Devices
Changwoo Min Solros: Data-Centric OS April 26, 2018 3 / 21
Blazingly fast storage/memory
Blazingly fast IO Devices
Changwoo Min Solros: Data-Centric OS April 26, 2018 3 / 21
Blazingly fast storage/memory
Blazingly fast network
Blazingly fast IO Devices
Changwoo Min Solros: Data-Centric OS April 26, 2018 3 / 21
Blazingly fast storage/memory
Blazingly fast network
How to exploit the full potential of such hardware devices without pain?
System-wide performance
Ease of programming
Outline
1 Heterogeneous Computing Architectures
2 Solros: Split-Kernel ApproachSolros ArchitectureOperating System Services
3 Evaluation
Changwoo Min Solros: Data-Centric OS April 26, 2018 4 / 21
Host-Centric Approach
Host OS controls co-processors and IO devices
Examples: OpenCL, CUDA
Application
Host processor
OS
Mem Core
I/O deviceSSD / NIC
Co-processorApplication
Mem Core controldata
Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21
Host-Centric Approach
Host OS controls co-processors and IO devices
Examples: OpenCL, CUDA
Application
Host processor
OS
Mem Core
I/O deviceSSD / NIC
Co-processorApplication
Mem Core
①
controldata
Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21
Host-Centric Approach
Host OS controls co-processors and IO devices
Examples: OpenCL, CUDA
Application
Host processor
OS
Mem Core
I/O deviceSSD / NIC
Co-processorApplication
Mem Core
①
②
controldata
Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21
Host-Centric Approach
Host OS controls co-processors and IO devices
Examples: OpenCL, CUDA
Application
Host processor
OS
Mem Core
I/O deviceSSD / NIC
Co-processorApplication
Mem Core
①
②
③ controldata
Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21
Host-Centric Approach
Host OS controls co-processors and IO devices
Examples: OpenCL, CUDA
Application
Host processor
OS
Mem Core
I/O deviceSSD / NIC
Co-processorApplication
Mem Core
①
②
③ controldata
Problem
Redundant data communicationComplex to program and hard to optimize
Changwoo Min Solros: Data-Centric OS April 26, 2018 5 / 21
Coprocessor-Centric Architecture
Co-processors control IO devices
Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]
Application
Host processor
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS
Mem Core
OS
controldata
Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21
Coprocessor-Centric Architecture
Co-processors control IO devices
Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]
Application
Host processor
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS
Mem Core
①
OS
controldata
Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21
Coprocessor-Centric Architecture
Co-processors control IO devices
Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]
Application
Host processor
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS
Mem Core
①
②
OS
controldata
Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21
Coprocessor-Centric Architecture
Co-processors control IO devices
Examples: Xeon Phi (Linux), GPUfs [ASPLOS13], GPUNet [OSDI14]
Application
Host processor
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS
Mem Core
①
②
OS
controldata
Problem
Significant effort required for porting IO stack to co-processorNot completely exploiting powerful host processors
Changwoo Min Solros: Data-Centric OS April 26, 2018 6 / 21
Outline
1 Heterogeneous Computing Architectures
2 Solros: Split-Kernel ApproachSolros ArchitectureOperating System Services
3 Evaluation
Changwoo Min Solros: Data-Centric OS April 26, 2018 7 / 21
Solros Goal
Ease of programming
Best use of processor architecture
System-wide optimization
Changwoo Min Solros: Data-Centric OS April 26, 2018 8 / 21
Solros Goal
Ease of programming
Best use of processor architecture
System-wide optimization
Challenge
Co-processor needs IO abstraction
IO stacks is branch-divergent and difficult to parallelize
It needs system-wide information
Changwoo Min Solros: Data-Centric OS April 26, 2018 8 / 21
Solros Architecture
Split-Kernel Architecture
Data-plane OS
Runs on a co-processorProvides IO abstractionDelegates actual IO operations to a control-plane OS
Control-plane OS
Runs on a host processorRuns actual IO stackPerforms system-wide coordination
Changwoo Min Solros: Data-Centric OS April 26, 2018 9 / 21
Solros Architecture
Control-plane OS: actual OS service + system-wide coordination
Data-plane OS: thin communication layer to host processor
controldata
Application
Host processor
OS proxy
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS stub
Core Mem
Policy
Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21
Solros Architecture
Control-plane OS: actual OS service + system-wide coordination
Data-plane OS: thin communication layer to host processor
controldata
Application
Host processor
OS proxy
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS stub
Core Mem
① Policy
Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21
Solros Architecture
Control-plane OS: actual OS service + system-wide coordination
Data-plane OS: thin communication layer to host processor
controldata
Application
Host processor
OS proxy
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS stub
Core Mem
①
②
Policy
Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21
Solros Architecture
Control-plane OS: actual OS service + system-wide coordination
Data-plane OS: thin communication layer to host processor
controldata
Application
Host processor
OS proxy
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS stub
Core Mem
①
②③
Policy
Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21
Solros Architecture
Control-plane OS: actual OS service + system-wide coordination
Data-plane OS: thin communication layer to host processor
controldata
Application
Host processor
OS proxy
Mem Core
I/O deviceSSD / NIC
Application
Co-processor
OS stub
Core Mem
①
②③
Policy
Co-processor has OS abstraction with minimal effort
Best use of each of the fat and lean processors
Efficient global coordination among devices (policy)
Changwoo Min Solros: Data-Centric OS April 26, 2018 10 / 21
Operating System Services
1 Transport service
2 Filesystem service
3 Network service
Changwoo Min Solros: Data-Centric OS April 26, 2018 11 / 21
Operating System Services
1 Transport service
2 Filesystem service
3 Network service
Changwoo Min Solros: Data-Centric OS April 26, 2018 12 / 21
Transport Service
High performance data transfer among devices are challenging:
Uniform data transfer among devices
High contention in massively-parallel co-processor
Asymmetric performance between host processor and co-processor
Changwoo Min Solros: Data-Centric OS April 26, 2018 13 / 21
Transport Service
High performance data transfer among devices are challenging:
Uniform data transfer among devices
High contention in massively-parallel co-processor
Asymmetric performance between host processor and co-processor
Our approach
Uniform data transfer ⇒ system-mapped PCIe window
High contention ⇒ combining, replication, interleaving, etc.
Asymmetric performance ⇒ flexibly configurable (host DMA enginevs. co-processor DMA engine)
Changwoo Min Solros: Data-Centric OS April 26, 2018 13 / 21
Transport Service
High performance data transfer among devices are challenging:
Uniform data transfer among devices
High contention in massively-parallel co-processor
Asymmetric performance between host processor and co-processor
Our approach
Uniform data transfer ⇒ system-mapped PCIe window
High contention ⇒ combining, replication, interleaving, etc.
Asymmetric performance ⇒ flexibly configurable (host DMA enginevs. co-processor DMA engine)
See details in the paper
Changwoo Min Solros: Data-Centric OS April 26, 2018 13 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
Host processor
SSD
DMA engine
Application
Co-processor
File system stub①
File system proxy
File system
PCIe
controldata
Zero-copy of data between co-processor memory and SSDMinimal data transfer
Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub①
File system proxy
File system
PCIe
controldata
Zero-copy of data between co-processor memory and SSDMinimal data transfer
Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub① ③
File system proxy
File system
PCIe
controldata
Zero-copy of data between co-processor memory and SSDMinimal data transfer
Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub① ③
File system proxy
File system
PCIe
④
controldata
Zero-copy of data between co-processor memory and SSDMinimal data transfer
Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub① ③
File system proxy
File system
PCIe
④
⑤ controldata
Zero-copy of data between co-processor memory and SSDMinimal data transfer
Changwoo Min Solros: Data-Centric OS April 26, 2018 14 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub①
File system proxy
File system
PCIe
controldata
Buffer cache
Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus
Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub① ③
File system proxy
Buffer cache
File system
PCIe
controldata
Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus
Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub① ③
File system proxy
Buffer cache
File system
PCIe
controldata
④
Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus
Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub① ③
File system proxy
Buffer cache
File system
PCIe
controldata
⑤
④
Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus
Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21
Filesystem Service
Peer-to-peer operation
Buffered operation
②
Host processor
SSD
DMA engine
Application
Co-processor
File system stub① ③
File system proxy
Buffer cache
File system
PCIe
controldata
⑤
④⑥
Reduce storage IO by leveraging shared buffer cache among co-processorsAvoid performance anomaly of peer-to-peer communication over PCIe bus
Changwoo Min Solros: Data-Centric OS April 26, 2018 15 / 21
Implementation
Host: 2-socket Xeon processor (12 cores each)
Co-processor: 4 Xeon Phi (KNC, 61 cores, Linux, PCIe Gen 3x16)
Storage device: 4 NVMe SSD
NIC: 100 Gbps Ethernet
ModuleLines of code
Added lines Deleted lines
Transport service 1,035 365
File system ServiceStub 5,957 2,073Proxy 2,338 124
Network ServiceStub 2,921 79Proxy 5,609 34
NVMe device driver 924 25SCIF kernel module 60 14
Total 18,844 2,714
Changwoo Min Solros: Data-Centric OS April 26, 2018 16 / 21
Evaluation
Questions:
Performance of Solros services
Impact on real-world applications
Changwoo Min Solros: Data-Centric OS April 26, 2018 17 / 21
Performance of Solros Services
0
0.5
1
1.5
2
2.5
32KB
64KB
128KB
256KB
512KB
1MB
2MB
4MB
0
20
40
60
80
100
10 100 1000
GB
/se
c
(a) file random read on SSD
Phi-SolrosPhi-Linux
Per
cen
tag
eo
fre
qu
ests
(%)
Latency(usec)
(b) TCP latency: 64-byte message
Phi-SolrosPhi-Linux
File IO performance: 19x faster than the stock Linux on Xeon PhiTCP latency (99 percentile): 7x shorter than the stock Linux on Xeon Phi
Changwoo Min Solros: Data-Centric OS April 26, 2018 18 / 21
Performance of Solros Services
0
2
4
6
Phi-Linux Phi-Solros0
2
4
6
8
10
Phi-Linux Phi-Solros
La
ten
cy(m
sec)
(a) file random read on SSD
StorageBlock/Transport
File system
(b) TCP latency: 64-byte message
Proxy/TransportNetwork stack
Significant performance gain in data transportRunning IO stack on co-processor is slower
Changwoo Min Solros: Data-Centric OS April 26, 2018 19 / 21
Real-world Application - Image Search
Image search engine is running on Xeon Phi
Image database is on NVMe SSD (shared read-only)
Image search queries are from network
0
100
200
300
400
0
100
200
0 100 200 300
Ela
pse
dT
ime
(sec
)
Phi-SolrosPhi-Linux
Ba
nd
wid
th(M
B/
s)
Time(sec)
Phi-SolrosPhi-Linux
Solros performs 2x faster than stock Linux on Xeon Phi
Changwoo Min Solros: Data-Centric OS April 26, 2018 20 / 21
Conclusion
Solros, a new operating system architecture for co-processors and fastIO devices
Control-plane, data-plane architecture allow:
Supporting high-level OS abstraction on co-processorEfficient global coordination among devicesNear ideal IO performance from co-processor
We will release source code soon
Changwoo Min Solros: Data-Centric OS April 26, 2018 21 / 21
Image Search - Scalability
Increase the number of Xeon Phi to 4
0
1
2
3
4
1 2 3 4
Nor
ma
lize
dsp
eed
up
The number of Xeon Phi co-processors
Solros load balancing mechanism achieves near linear scaling
Changwoo Min Solros: Data-Centric OS April 26, 2018 22 / 21
Related work
Control-plane/data-plane OS: Arrakis [OSDI’14], IX [OSDI’14]
OS for heterogeneous systems: Helios [SOSP]09], M3 [ASPLOS’16],Hydra [ASPLOS’08]
IO support for GPU: PTask [SOSP’11], GPUfs [ASPLOS’13], GPUnet[OSDI’14]
Changwoo Min Solros: Data-Centric OS April 26, 2018 23 / 21
Transport service
Host's physical address space
Host RAM PCIe Window mmio
DRAM
NICDRAM mmio
Xeon PhiDRAM mmio
NVMe
Host's virtualaddress space
App 1 App 2
...
mmioDevice's physical
address sapce
mapped to
(across devices)
mapped to
(in-host)
Changwoo Min Solros: Data-Centric OS April 26, 2018 24 / 21
Network Service (TCP)
Outbound operation
Inbound operation
Host processor
NIC
Application
Co-processor
TCP stub
TCP proxy
Load balancer
TCP stack
PCIe
③
①
②
controldata
④
⑤
A load balancer on a host distributes incoming TCP connections to one ofleast-loaded co-processors.
See details in the paper.
Changwoo Min Solros: Data-Centric OS April 26, 2018 25 / 21
Discussion
Hardware support other than Xeon Phi
Two atomic instructions: transport serviceMMU: isolation among co-processor applications
Scalability of control-plane OS
Limited by scalability of OS service, PCIe interconnect, andperformance of IO devices
Changwoo Min Solros: Data-Centric OS April 26, 2018 26 / 21
Real-world Application - Text Search
CLucene text indexing engine running on Xeon Phi
Text data is on NVMe SSD
0
500
1000
1500
2000
2500
3000
0.0
0.5
1.0
1.5
0 50 100 150
Ela
pse
dT
ime
(sec
)
Phi-SolrosPhi-Linux
Ba
nd
wid
th(G
B/
s)
Time(sec)
Phi-SolrosPhi-Linux (virtio)
Solros performs 19x faster than stock Linux (ext4/virtio) on Xeon Phi
Changwoo Min Solros: Data-Centric OS April 26, 2018 27 / 21