

GPUs for Online Deep Learning Applications

Chris Fougner

Content

• Deploying a streaming speech recognition service

• GPU deployments within Baidu

Missing Content

• Songbai Pu

• FPGA vs. GPU discussion

Speech Recognition

你好 ("Hello")

Breaking down an utterance

• "Take me to Philz Coffee on Middlefield"

• Recent state-of-the-art neural networks for speech recognition have ~200M parameters

• Translates to ~50B FLOPs for a 2.53s utterance, or 20 GFLOP per second of audio

• Users want a response in ~100ms
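A back-of-envelope check of the arithmetic on this slide (a sketch; the ~50B figure is taken from the slide, not re-derived from the network architecture):

```python
# Numbers taken from the slides above, not re-derived from the model.
params = 200e6          # ~200M parameters
utterance_s = 2.53      # example utterance length in seconds
total_flops = 50e9      # ~50B FLOPs quoted for that utterance

# Work required per second of streamed audio
flops_per_audio_second = total_flops / utterance_s
print(f"~{flops_per_audio_second / 1e9:.0f} GFLOP per second of audio")  # ~20 GFLOP
```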

50B FLOPs in a datacenter

• Doesn't smell like a typical datacenter application

Typical datacenter model

• Typical setup yields 2-3 concurrent users

Borrow tricks from training

• Use GPUs?

• Batch utterances?

CPU vs GPU

                         CPU (Intel E5-2660 v3)   GPU (Nvidia K1200*)
TDP                      105 W                    45 W
Price                    $1500 USD                $300 USD
Peak FMA FLOPs           0.4 TFLOPs               1.1 TFLOPs
Memory bandwidth         68 GB/s                  80 GB/s
Max units / server       2                        4-8
Float 16-bit libraries   No                       Yes

*or Tesla M4 shortly.
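A rough efficiency comparison from the table (a sketch using the quoted peak FMA figures, which are vendor peaks rather than sustained throughput):

```python
# Specs copied from the table above; "peak FMA" is a vendor peak, not sustained.
parts = {
    "E5-2660 v3": {"tdp_w": 105, "price_usd": 1500, "peak_tflops": 0.4},
    "K1200":      {"tdp_w": 45,  "price_usd": 300,  "peak_tflops": 1.1},
}

for name, s in parts.items():
    gflops = s["peak_tflops"] * 1e3
    print(f"{name}: {gflops / s['tdp_w']:.1f} GFLOPs/W, "
          f"{gflops / s['price_usd']:.2f} GFLOPs/$")
```

On peak numbers alone the K1200 comes out roughly 6x ahead per watt and roughly 14x per dollar; the rest of the talk is about actually realizing that headroom.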

K1200 GPU server

• Naive approach: directly replace CPU with GPU, and we get 2x users per server

[Chart: users per server (0-8), E5-2660 v3 vs K1200]

Batching

• Single utterance: one matrix-vector product per layer (W × h)

• Batched: activation vectors stacked into a matrix, one matrix-matrix product (W × H)
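The idea of the batching diagram, sketched with NumPy (sizes are illustrative, not from the talk): stacking per-utterance activation vectors h into a matrix H turns many matrix-vector products into one matrix-matrix product, which uses the hardware far more efficiently:

```python
import time

import numpy as np

# Sizes are illustrative, not from the talk.
W = np.random.rand(2048, 2048).astype(np.float32)   # one layer's weights
hs = [np.random.rand(2048, 1).astype(np.float32) for _ in range(16)]

t0 = time.perf_counter()
singles = [W @ h for h in hs]     # 16 separate matrix-vector products (W x h)
t1 = time.perf_counter()
H = np.hstack(hs)                 # stack the utterances column-wise
Y = W @ H                         # one matrix-matrix product (W x H)
t2 = time.perf_counter()

# Same answers, up to float32 accumulation-order differences
assert np.allclose(np.hstack(singles), Y, rtol=1e-3, atol=1e-3)
print(f"single/batched time ratio: {(t1 - t0) / (t2 - t1):.1f}")
```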

Is batching feasible?

[Diagram: user requests arriving over time; Batch Dispatch groups concurrent requests into batches]
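A minimal Batch Dispatch sketch (a hypothetical implementation for illustration, not Baidu's production code): requests queue up, and a worker collects up to `max_batch` of them, waiting at most a short window so latency stays bounded:

```python
import queue
import threading
import time

requests = queue.Queue()
results = {}

def dispatch_loop(process_batch, max_batch=8, window=0.01):
    while True:
        first = requests.get()            # block for the first request
        if first is None:                 # shutdown sentinel
            return
        batch = [first]
        deadline = time.monotonic() + window
        while len(batch) < max_batch:     # gather more until the window closes
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                item = requests.get(timeout=timeout)
            except queue.Empty:
                break
            if item is None:              # re-queue sentinel for the outer loop
                requests.put(None)
                break
            batch.append(item)
        for req_id, out in process_batch(batch):
            results[req_id] = out

def double_all(batch):                    # stand-in for one batched forward pass
    return [(req_id, x * 2) for req_id, x in batch]

worker = threading.Thread(target=dispatch_loop, args=(double_all,))
worker.start()
for i in range(5):
    requests.put((i, i))
requests.put(None)
worker.join()
print(results)   # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8}
```

The window bounds the extra latency any single request pays for batching, which is why the 98th-percentile latency curve stays flat as concurrency grows.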

GPU + Batch Dispatch

• With GPU + Batch Dispatch, 10x throughput over naive CPU

[Chart: users per server (0-35), E5-2660 v3 vs K1200 vs K1200 + Batch Dispatch]

Impact on latency?

[Chart: 98th-percentile latency (ms, 0-300) vs concurrent users (0-30), Single Batch vs Batch Dispatch]

Borrow code from training?

• Bonus: highly optimized code shared between research and production. Huge productivity boost

• E.g. switching from LSTM to GRU models in production code took < 1 day

Baidu GPU deployment

• Image classification

• Machine translation

Image classification

• Feature extraction for image classification uses neural networks

[Charts: queries per second (0-150) and latency (ms, 0-200), E5-2620 v2 vs K1200]

Machine translation

• translate.baidu.com uses neural networks

[Charts: queries per second (0-8) and latency (ms, 0-400), E5-2620 v2 vs K1200]

Conclusions

• GPUs are an efficient way to boost performance and decrease latency of floating point intensive tasks in production

• Use Batch Dispatch to increase throughput

• GPUs for neural network applications allow you to share code between research and production

Mention

• Songbai Pu and Zhiqian Wang from Baidu China

Thank you!

• Questions?
