

GPUs for Online Deep Learning Applications

Chris Fougner

Content

• Deploying a streaming speech recognition service

• GPU deployments within Baidu

Missing Content

• Songbai Pu

• FPGA vs. GPU discussion

Speech Recognition

你好 ("Hello")

Breaking down an utterance

• "Take me to Philz Coffee on Middlefield"

• Recent state-of-the-art neural networks for speech recognition have ~200M parameters

• Translates to ~50B FLOPs for a 2.53s utterance, or 20 GFLOP per second of audio

• Users want a response in ~100ms
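A back-of-envelope check of the arithmetic on this slide (a sketch; the ~50B figure is taken from the slide, not re-derived from the network architecture):

```python
# Numbers taken from the slides above, not re-derived from the model.
params = 200e6          # ~200M parameters
utterance_s = 2.53      # example utterance length in seconds
total_flops = 50e9      # ~50B FLOPs quoted for that utterance

# Work required per second of streamed audio
flops_per_audio_second = total_flops / utterance_s
print(f"~{flops_per_audio_second / 1e9:.0f} GFLOP per second of audio")  # ~20 GFLOP
```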

50B FLOPs in a datacenter

• Doesn't smell like a typical datacenter application

Typical datacenter model

• Typical setup yields 2-3 concurrent users

Borrow tricks from training

• Use GPUs?

• Batch utterances?

CPU vs GPU

                         CPU (Intel E5-2660 v3)   GPU (Nvidia K1200*)
TDP                      105 W                    45 W
Price                    $1500 USD                $300 USD
Peak FMA FLOPs           0.4 TFLOPs               1.1 TFLOPs
Memory bandwidth         68 GB/s                  80 GB/s
Max units / server       2                        4-8
Float 16-bit libraries   No                       Yes

*or Tesla M4 shortly.
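A rough efficiency comparison from the table (a sketch using the quoted peak FMA figures, which are vendor peaks rather than sustained throughput):

```python
# Specs copied from the table above; "peak FMA" is a vendor peak, not sustained.
parts = {
    "E5-2660 v3": {"tdp_w": 105, "price_usd": 1500, "peak_tflops": 0.4},
    "K1200":      {"tdp_w": 45,  "price_usd": 300,  "peak_tflops": 1.1},
}

for name, s in parts.items():
    gflops = s["peak_tflops"] * 1e3
    print(f"{name}: {gflops / s['tdp_w']:.1f} GFLOPs/W, "
          f"{gflops / s['price_usd']:.2f} GFLOPs/$")
```

On peak numbers alone the K1200 comes out roughly 6x ahead per watt and roughly 14x per dollar; the rest of the talk is about actually realizing that headroom.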

K1200 GPU server

• Naive approach: directly replace CPU with GPU, and we get 2x users per server

[Chart: users per server (0-8), E5-2660 v3 vs K1200]

Batching

• Single utterance: one matrix-vector product per layer (W × h)

• Batched: activation vectors stacked into a matrix, one matrix-matrix product (W × H)
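The idea of the batching diagram, sketched with NumPy (sizes are illustrative, not from the talk): stacking per-utterance activation vectors h into a matrix H turns many matrix-vector products into one matrix-matrix product, which uses the hardware far more efficiently:

```python
import time

import numpy as np

# Sizes are illustrative, not from the talk.
W = np.random.rand(2048, 2048).astype(np.float32)   # one layer's weights
hs = [np.random.rand(2048, 1).astype(np.float32) for _ in range(16)]

t0 = time.perf_counter()
singles = [W @ h for h in hs]     # 16 separate matrix-vector products (W x h)
t1 = time.perf_counter()
H = np.hstack(hs)                 # stack the utterances column-wise
Y = W @ H                         # one matrix-matrix product (W x H)
t2 = time.perf_counter()

# Same answers, up to float32 accumulation-order differences
assert np.allclose(np.hstack(singles), Y, rtol=1e-3, atol=1e-3)
print(f"single/batched time ratio: {(t1 - t0) / (t2 - t1):.1f}")
```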

Is batching feasible?

[Diagram: user requests arriving over time; Batch Dispatch groups concurrent requests into batches]
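A minimal Batch Dispatch sketch (a hypothetical implementation for illustration, not Baidu's production code): requests queue up, and a worker collects up to `max_batch` of them, waiting at most a short window so latency stays bounded:

```python
import queue
import threading
import time

requests = queue.Queue()
results = {}

def dispatch_loop(process_batch, max_batch=8, window=0.01):
    while True:
        first = requests.get()            # block for the first request
        if first is None:                 # shutdown sentinel
            return
        batch = [first]
        deadline = time.monotonic() + window
        while len(batch) < max_batch:     # gather more until the window closes
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                item = requests.get(timeout=timeout)
            except queue.Empty:
                break
            if item is None:              # re-queue sentinel for the outer loop
                requests.put(None)
                break
            batch.append(item)
        for req_id, out in process_batch(batch):
            results[req_id] = out

def double_all(batch):                    # stand-in for one batched forward pass
    return [(req_id, x * 2) for req_id, x in batch]

worker = threading.Thread(target=dispatch_loop, args=(double_all,))
worker.start()
for i in range(5):
    requests.put((i, i))
requests.put(None)
worker.join()
print(results)   # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}
```

The window bounds the extra latency any single request pays for batching, which is why the 98th-percentile latency curve stays flat as concurrency grows.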

GPU + Batch Dispatch

• With GPU + Batch Dispatch, 10x throughput over naive CPU

[Chart: users per server (0-35), E5-2660 v3 vs K1200 vs K1200 + Batch Dispatch]

Impact on latency?

[Chart: 98th-percentile latency (ms, 0-300) vs concurrent users (0-30), Single Batch vs Batch Dispatch]

Borrow code from training?

• Bonus: highly optimized code shared between research and production. Huge productivity boost

• E.g. switching from LSTM to GRU models in production code took < 1 day

Baidu GPU deployment

• Image classification

• Machine translation

Image classification

• Feature extraction for image classification uses neural networks

[Charts: queries per second (0-150) and latency (ms, 0-200), E5-2620 v2 vs K1200]

Machine translation

• translate.baidu.com uses neural networks

[Charts: queries per second (0-8) and latency (ms, 0-400), E5-2620 v2 vs K1200]

Conclusions

• GPUs are an efficient way to boost performance and decrease latency of floating point intensive tasks in production

• Use Batch Dispatch to increase throughput

• GPUs for neural network applications allow you to share code between research and production

Mention

• Songbai Pu and Zhiqian Wang from Baidu China

Thank you!

• Questions?
