
BigStation: Enable Scalable Real-time Signal Processing in Large MU-MIMO Systems

Qing Yang, Xiaoxiao Li§, Hongyi Yao¶, Ji Fang‡, Kun Tan†, Wenjun Hu†, Jiansong Zhang†, Yongguang Zhang†

†Microsoft Research Asia, Beijing, China; MSRA and CUHK, Hong Kong; §MSRA and Tsinghua University, Beijing, China; ¶MSRA and USTC, Hefei, Anhui, China; ‡MSRA and BJTU, Beijing, China

Motivation

• Demand for more wireless capacity
  – Proliferation of mobile devices: wireless access is primary
  – Data-intensive applications: video, tele-presence
  – "The amount of net traffic carried on wireless will exceed the amount of wired traffic by 2015" (Cisco VNI 2011–2016)

Can we engineer the next wireless network to match the existing wired network: gigabit wireless throughput to every user?

How to Gain More Wireless Capacity

• More spectrum (DSA)
  – Spectrum is a scarce, shared resource; there is a limit
• Spectrum reuse (micro cell, pico cell, …)
  – Existing cells are already small (like Wi-Fi)
  – Increased deployment and management complexity
• Spatial multiplexing (MU-MIMO)
  – More promising

Background: MU-MIMO

• Transmit to/Receive from multiple mobile stations


[Figure: an access point (AP) with m antennas performs joint signal processing for mobile stations with n total client antennas]

Uplink:    Y = HS,   Ŝ = (H*H)⁻¹H*Y
Downlink:  S = H*(HH*)⁻¹X,   so Y = HS = X

• In theory, capacity scales linearly with the number of AP antennas

How Many Antennas Do We Need?

• … for a gigabit wireless link per user

# of ant   1       2      4      8      16      32      64      128
20MHz      72.2M   144M   289M   578M   1.2G    2.3G    4.6G    9.2G
40MHz      150M    300M   600M   1.2G   2.4G    4.8G    9.6G    19.2G
80MHz      325M    650M   1.3G   2.6G   5.2G    10.4G   20.8G   41.6G
160MHz     650M    1.3G   2.6G   5.2G   10.4G   20.8G   41.6G   83.2G

(regions of the table, left to right: 802.11n → 802.11ac → large-scale MU-MIMO systems)

Gigabit to 20 concurrent users: a 160 MHz channel with at least 40 antennas

Challenge

• Can we build a scalable AP to support such large-scale MU-MIMO operation?
  – That is, as n, and with it m, grows large?

[Figure: the same AP diagram: joint signal processing across m AP antennas and n total client antennas]

Computation and Throughput Requirement: a Back-of-the-Envelope Estimation

• Setting: 160 MHz, 40 antennas
• Data path:
  – 160 MHz channel width → r = 5 Gbps of samples per antenna
  – 40 antennas → 200 Gbps in total
• Computation (totals reproduced in the snippet below):
  – Channel inversion (once every frame): O(mn²r/t_f) → 269 GOPS
  – Spatial demultiplexing/precoding: O(mnr) → 1.5 TOPS
  – Channel decoding: O(nr) → 5.5 TOPS
  – 7.27 TOPS in total!
• A state-of-the-art multi-core CPU achieves only 50 GOPS

A Single Central Processing Unit

[Figure: the BigStation AP with all joint signal processing concentrated in a single central processing unit]

BigStation: Parallelizing to Scale

[Figure: the BigStation AP replaces the central unit with many simple processing units connected by an inter-connecting network, between the m AP antennas and the n total client antennas]

Outline

• Parallel architecture

• Parallel algorithms and optimization

• Performance

• Conclusion


Naive Architecture

• A pool of processing servers
  – Send all samples of the same frame to one server…
• Enough processing capability with ⌈t_p/t_f⌉ servers
• Issue: long processing latency for a frame (~1 s)
  – Wireless protocols require millisecond latencies

Our Approach: Distributed Pipeline

• Parallelize MU-MIMO processing into a 3-stage pipeline: channel inversion → spatial demultiplexing → channel decoding (a toy sketch follows below)
• At each stage, the computation is further parallelized among multiple servers

Data Partitioning across Servers

• Exploit the data parallelism inside MU-MIMO (see the sketch below)
  – Partition the OFDM signal by subcarriers
  – Partition the OFDM signal by spatial streams

Example

• Gigabit to 20 users
  – 160 MHz → 468 parallel subcarriers
• Subcarrier partitioning
  – Each server needs to handle a minimum of 10 Mbps of data
• Spatial-stream partitioning
  – Each server needs to handle 5 Gbps of data
• Generally within an existing server's processing capability
  – Multi-core (4–16 cores)
  – 10G NIC

Summary

• Distributed pipeline for low latency
• Exploit data parallelism across servers at each processing stage
• If a single datum is still beyond the capability of a single processing unit:
  – Build a deeper pipeline (see paper for details)

Outline

• Parallel architecture

• Parallel algorithms and optimization

• Performance

• Conclusion


Computation Partitioning in a Server

• Three key operations in MU-MIMO

– Matrix multiplication

– Matrix inversion

– Viterbi decoding (channel decoding)


Parallel Matrix Multiplication

• Divide-and-conquer: split H row-wise into H₁ (Core 1) and H₂ (Core 2); then

  H*H = [H₁* H₂*][H₁; H₂] = H₁*H₁ + H₂*H₂

  so each core computes one partial product and the results are summed (sketch below).

Parallel Matrix Inversion

• Based on the Gauss-Jordan method
• Augment the channel matrix with the identity, [H | I]; the row operations that reduce H to I simultaneously turn the right half into the inverse, leaving [I | H⁻¹]
• Columns of the augmented matrix are partitioned between cores (Core 1, Core 2), each applying the same row operations to its own column slice in parallel

Parallel Viterbi Decoding

• Challenge: sequential operations on a continuous (soft-)bit stream
• Solution:
  – Artificially divide the bit stream into blocks, decoded in parallel (Core 1, Core 2, …)
  – Add overlaps between adjacent blocks to ensure each decoder converges to the optimal path

Parallel Viterbi Decoding

• How to choose the right block size?
  – A tradeoff between latency and overhead
• Our goal: fully utilize the computation capacity while keeping L minimal
• Optimal block size: L* = 2Du/(mv − u)
  – u: stream bit rate; v: processing rate per core; m: # of cores; D: overlap between adjacent blocks (a worked example follows below)

Optimization: Lock-free Computing Structure

• Complex interaction between communication and computation threads
• Contention at the output buffer; making the structure lock-free yields a 1.31× speedup

Optimization: Communication

• Parallelizing communication among multiple cores

• Dealing with incast problem

– Application-level flow control

• Isolating communication and computation on different cores


Outline

• Parallel architecture

• Parallel algorithms and optimization

• Performance

• Conclusion


Micro-benchmarks

• Platform: Dell server with an Intel Xeon E5520 CPU (2.26 GHz, 4 cores)

[Chart: channel inversion performance]

Micro-benchmarks

[Charts: spatial demultiplexing and Viterbi decoding performance]

Micro-benchmarks

[Charts: end-to-end performance at 6 users / 100 Mbps, 20 users / 600 Mbps, and 50 users / 1 Gbps]

Prototype

• Software radio: Sora MIMO Kit
  – 4× phase-coherent radio chains
  – Extensible with an external clock

Capacity Gain

• Capped at a constant value due to random user selection!

Capacity Gain

• Overprovisioned AP antennas: 6.8× capacity gain

Processing Delay

• 860 μs processing delay (charts: light load, 1 frame per 10 ms, vs. heavy load, back-to-back frames)

Things I Didn't Talk About (related and future work)

• How to get channel state in a scalable way
  – Argos [Shepard et al., MobiCom 2012]
  – JMB [Rahul et al., SIGCOMM 2012]
• MU-MIMO MAC
  – Better user selection than random? (future work)
• Automatic gain control in large-scale MU-MIMO
  – Future work

Conclusions

• Scalable processing of large MU-MIMO systems is possible
  – Exploit the parallelism of MU-MIMO operations across processing servers
  – Develop a distributed processing pipeline
• Large-scale MU-MIMO is a promising way to scale wireless capacity by another 100×
  – Yet many challenges remain (user selection, AGC, …)

Thanks! I'll take your questions!
