Using Uncacheable Memory to Improve Unity Linux Performance
Ning Qu, Xiaogang Gou, Xu Cheng
Microprocessor Research and Development Center, Peking University
Peking University
Issues
[Diagram: CPU with ICache, DCache and TLB; DMA; Main Memory]
No snooping
Hardware table walking in main memory
Cache coherency problem everywhere!
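The coherency problem named on this slide can be illustrated with a toy model. This is only a sketch: `ToySystem`, its methods, the address `0x100` and the packet strings are hypothetical names invented for illustration, not part of Unity Linux. It assumes a CPU data cache that is never snooped, so a DMA write to main memory leaves a stale copy in the cache until software invalidates it or the CPU reads the buffer uncached.

```python
class ToySystem:
    """Main memory plus a CPU data cache with no snooping (hypothetical toy model)."""

    def __init__(self):
        self.memory = {}   # address -> value held in main memory
        self.dcache = {}   # address -> value held in the CPU data cache

    def cpu_read(self, addr, cacheable=True):
        if not cacheable:
            return self.memory[addr]               # uncacheable: always read memory
        if addr not in self.dcache:
            self.dcache[addr] = self.memory[addr]  # fill the cache on a miss
        return self.dcache[addr]

    def dma_write(self, addr, value):
        self.memory[addr] = value                  # DMA bypasses the cache; nothing snoops

    def invalidate(self, addr):
        self.dcache.pop(addr, None)                # software coherency operation


def demo():
    s = ToySystem()
    s.memory[0x100] = "old packet"
    s.cpu_read(0x100)                              # CPU caches the old contents
    s.dma_write(0x100, "new packet")               # device delivers new data
    stale = s.cpu_read(0x100)                      # cache hit on the stale copy
    uncached = s.cpu_read(0x100, cacheable=False)  # uncacheable read sees the new data
    s.invalidate(0x100)
    fresh = s.cpu_read(0x100)                      # miss after invalidate, refilled from memory
    return stale, uncached, fresh
```

In this model the cached read returns the old packet until the line is invalidated, while the uncacheable read needs no coherency operation at all.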
Unity SoC architecture
[Block diagram. Labels: UniCore32, UniCore-F64 (CP2), I_BUS, D_BUS, BIU, CP0, CP1, IMMU, I-Cache, DMMU, D-Cache, APB Bridge, PCI Bridge, EMI, 10/100M MAC, SPI, IIC, UART1, UART0, 28 GPIO, RTC, INTC, PowerM., OST, ResetC, System Control Modules, 6-channel DMA]
Issues cont.
[Diagram, two panels, each showing: process I/O buffer (User Process), kernel I/O buffer (Linux Kernel), I/O device buffer (I/O Device), with a DMA transfer labeled in each panel]
Poor temporal locality!
Motivation
Heavy cost of Cache coherency operations
  Many high-end embedded processors have a Cache, but many of them have very limited support for guaranteeing Cache coherency
Poor locality leads to more data Cache pollution
  Cache relies on the property of locality
  Some programs have poor locality, for example TCP/IP processing
How to avoid the disadvantages?
Uncacheable memory may be a solution!
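A small simulation can make the pollution argument concrete. This is a sketch under stated assumptions: a direct-mapped cache of 64 lines with 32-byte lines, a 32-line "hot" working set that is reused every round, and a stream of packet data in which each line is touched exactly once. The class `DirectMappedCache`, the function `hot_misses` and all sizes are hypothetical choices for illustration, not measurements from the Unity platform.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model that only tracks tags and misses."""

    def __init__(self, num_lines, line_size=32):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        block = addr // self.line_size
        index = block % self.num_lines
        if self.tags[index] == block:
            return True
        self.tags[index] = block
        return False


def hot_misses(stream_cacheable, rounds=10, num_lines=64, line_size=32):
    """Count misses on the hot working set while packet data streams past."""
    cache = DirectMappedCache(num_lines, line_size)
    hot = [i * line_size for i in range(32)]   # reused every round
    stream_base = 1 << 20                      # streamed data lives elsewhere
    misses = 0
    for r in range(rounds):
        for addr in hot:
            if not cache.access(addr):
                misses += 1
        for i in range(num_lines):             # each streamed line is touched once
            addr = stream_base + (r * num_lines + i) * line_size
            if stream_cacheable:
                cache.access(addr)             # pollutes the cache
            # uncacheable stream: goes straight to memory, cache untouched
    return misses
```

With these assumed sizes, streaming through the cache evicts the whole working set every round, whereas an uncacheable stream leaves only the initial cold misses.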
Contributions
Analyze the scenarios in which Cache doesn’t perform well, and propose that uncacheable memory has two advantages:
  Eliminates most Cache coherency operations
  Avoids Cache pollution
Apply uncacheable memory in Unity Linux to improve I/O performance. Some important aspects improve by 5% to 29%.
Outline
Issues
Motivation
Contribution
Uncacheable Memory
Evaluation
Related Work
Conclusions
Recv Packet Flow
[Diagram: buffers in the I/O Device, Kernel Space and User Space (User Buffer) across step 1 to step 4. Operation labels: flush cache, DMA copy, simple data processing, CPU copy. Annotation: using uncacheable memory]
Send Packet Flow
[Diagram: buffers in the I/O Device, Kernel Space and User Space (User Buffer) across step 1 to step 4. Operation labels: clean cache, DMA copy, simple data processing, CPU copy. Annotation: using uncacheable memory]
Cacheable vs. Uncacheable
                Send                                        Receive
CH processing   1. copy from U to K                         1. clean & invalidate data cache
                2. clean data cache                         2. copy from K to U
NC processing   1. copy from U to K(N)                      1. copy from K(N) to U
side effect     1. accessing uncacheable memory is slower   1. accessing uncacheable memory is slower
                2. no data cache pollution                  2. no data cache pollution
                3. no cache clean operation                 3. no cache flush operation
DMA send and receive cost analysis
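The trade-off in this comparison can be sketched as a simple cost model. Everything numeric here is an assumption made for illustration: the per-word and per-line cycle counts in `Costs` are invented, not measured on Unity hardware, and the model reads CH as the cacheable kernel buffer path, NC as the uncacheable one, U as the user buffer, K as the kernel buffer and K(N) as an uncacheable kernel buffer. It also ignores cache pollution, which the slide lists as a separate benefit of the NC path.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical cycle costs; none of these values come from the slides."""
    cached_load: float = 1.0          # per word, load from cacheable memory
    cached_store: float = 1.0         # per word, store to cacheable memory
    uncached_load: float = 6.0        # per word, load from uncacheable memory
    uncached_store: float = 4.0       # per word, store to uncacheable memory
    clean_per_line: float = 10.0      # per cache line, clean (write back)
    invalidate_per_line: float = 10.0 # per cache line, clean & invalidate
    words_per_line: int = 8


def lines(words, c):
    """Number of cache lines covered by `words` words (rounded up)."""
    return -(-words // c.words_per_line)


def send_cost(words, c, uncacheable):
    load_user = words * c.cached_load                  # U is always cacheable
    if uncacheable:
        return load_user + words * c.uncached_store    # copy from U to K(N)
    copy = load_user + words * c.cached_store          # copy from U to K
    return copy + lines(words, c) * c.clean_per_line   # then clean data cache


def recv_cost(words, c, uncacheable):
    store_user = words * c.cached_store                # U is always cacheable
    if uncacheable:
        return words * c.uncached_load + store_user    # copy from K(N) to U
    flush = lines(words, c) * c.invalidate_per_line    # clean & invalidate first
    return flush + words * c.cached_load + store_user  # then copy from K to U
```

With the default values above the NC path comes out more expensive, which reflects only the "uncacheable memory is slower" side effect; which path wins in practice depends on the real access and coherency costs and on the pollution effect this sketch leaves out.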
Cacheable vs. Uncacheable cont.
DMA Send: [cost equation not recoverable; term labels: Cache clean cost, load U to Cache, load U into Cache, load K to Cache, store to K]
DMA Recv: [cost equation not recoverable; term labels: Cache flush cost, load U into Cache and store, load K to Cache, load K]
Cacheable vs. Uncacheable cont.
Recv and Send Performance CH vs NC
Using Uncacheable Memory
Implemented in Unity Linux, ported from Linux 2.4.17
Uncacheable page table
  eliminates Cache coherency operations when modifying the page tables
Uncacheable socket buffer for sending
  eliminates Cache coherency operations
  avoids data Cache pollution
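Why an uncacheable page table removes coherency operations can be sketched with a toy write-back cache. This is an illustration under assumptions, not the Unity Linux implementation: `WriteBackSystem`, its methods, the address `0x40` and the `pte:` strings are hypothetical. It assumes the hardware table walker reads main memory only, so a page table entry written through a write-back data cache stays invisible to the walker until the line is cleaned.

```python
class WriteBackSystem:
    """Main memory plus dirty lines of a write-back D-cache (hypothetical toy model)."""

    def __init__(self):
        self.memory = {}   # address -> value in main memory
        self.dirty = {}    # address -> value written but not yet cleaned

    def cpu_write(self, addr, value, cacheable=True):
        if cacheable:
            self.dirty[addr] = value      # stays in the cache until cleaned
        else:
            self.memory[addr] = value     # uncacheable store goes to memory

    def clean(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.dirty.pop(addr)   # write the line back

    def table_walk(self, addr):
        return self.memory.get(addr)      # walker reads main memory only


def demo_page_table():
    cached = WriteBackSystem()
    cached.memory[0x40] = "pte:old"
    cached.cpu_write(0x40, "pte:new")     # update sits in the D-cache
    before_clean = cached.table_walk(0x40)
    cached.clean(0x40)                    # explicit coherency operation
    after_clean = cached.table_walk(0x40)

    uncached = WriteBackSystem()
    uncached.memory[0x40] = "pte:old"
    uncached.cpu_write(0x40, "pte:new", cacheable=False)
    no_clean_needed = uncached.table_walk(0x40)
    return before_clean, after_clean, no_clean_needed
```

In this model the cacheable update is visible to the walker only after the clean, while the uncacheable update is visible immediately.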
Outline
Motivation
Issues
Contribution
Uncacheable Memory
Evaluation
Related Work
Conclusions
Methodology
Benchmarks: Netperf, Lmbench and Modified Andrew benchmark.
Experiment environment
  160 MHz Unity network computer with 256 MB DRAM and an SoC built-in 10M/100M Ethernet card
  Dell 4600 server with two Intel Xeon PIII 700 MHz processors, 4 GB DRAM, and a 1000M/100M Ethernet card
All benchmarks are executed in single-user mode on NFS.
Netperf Benchmark Results
Netperf TCP_STREAM Send Performance
Netperf Benchmark Results cont.
Netperf TCP_RR Performance
Lmbench Benchmark Results
Lmbench Performance
Modified Andrew Benchmark Results
Modified Andrew Benchmark
Related Work
Related work: accelerating uncacheable memory performance
  New memory types
    Intel: write-combining
    MIPS R10000: uncached-accelerated page
  New instructions
    SPARC V9, ARM, Unity II: block move instructions
Future work: new memory type support
  Reads like common cache with low pollution
  Writes like Write-Combining without write-allocate
Conclusions
This paper focuses on uncacheable memory usage.
  Pros: eliminating coherency operations and avoiding data Cache pollution
  Cons: slow access time
Uncacheable memory can perform well with a careful design that takes system specifics into account
Thank You!
Questions?