Post on 05-Jul-2020

@ 2018 Alibaba Group

Applied AI Architecture @ Alibaba Infrastructure

Lingjie XU, Director, Heterogeneous Computing

AIS

Who I am

Joined Alibaba Spring 2017

Leading Applied AI Architecture team, focusing on AI HW acceleration and HW/SW co-play

Held multiple senior architect and management roles in GPU domain

Who we are

Data Platform

Alibaba Cloud

Logistics: Cainiao

E-Commerce: Taobao, Tmall, Alibaba.com, 1688.com, AliExpress

Finance: Alipay, Micro-Credit, Insurance, Funds

Alibaba Infrastructure Service

Juhuasuan

Cloud Partners, Private cloud, Special cloud, Public cloud

Alibaba Infrastructure

IDC: Modular, Eco-Friendly, Automation

Network: 100G, SDN, Security

GOC: Monitor, Analyze, Act

Server: High Perf, Low Power, Scalability

Technology Overview

Business Platform

BI, Recommendation, NLP, Vision, Search, …

Algorithm Platform, Data Platform, Computing Platform

OS, Middleware, Storage, Database

IDC, Server, Processor, Network, Operation

Efficiency

Security

Datacenters

Zhangbei Datacenter

(Fresh air cooling system)

Best PUE < 1.2

New Frontier: Server immersion cooling

PUE ~1.0

Qiandaohu Datacenter

(Lake water cooling system)

PUE < 1.3
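PUE (power usage effectiveness) is total facility power divided by IT equipment power, so the figures on this slide translate directly into cooling and distribution overhead. The sketch below works that out for a hypothetical 1 MW IT load; the load is an assumption for illustration, and the slide's bounds ("< 1.2", "< 1.3", "~1.0") are treated as if they were exact values.

```python
def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE value."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * pue


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Power spent on everything that is not IT equipment (cooling, distribution)."""
    return facility_power_kw(it_load_kw, pue) - it_load_kw


IT_LOAD_KW = 1000.0  # hypothetical IT load, not a figure from the slides

# PUE figures from the slide, with bounds treated as exact values (assumption).
SLIDE_PUE = {
    "Qiandaohu, lake water cooling": 1.3,
    "Zhangbei, fresh air cooling": 1.2,
    "Server immersion cooling": 1.0,
}

for site, pue in SLIDE_PUE.items():
    print(f"{site}: PUE {pue} -> {overhead_kw(IT_LOAD_KW, pue):.0f} kW overhead")
```

Under these assumptions, moving from PUE 1.3 to 1.2 removes about a third of the non-IT overhead, and a PUE near 1.0 leaves almost none.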

Network

Massive Scale + Diverse Applications + Bursty Traffic + Fast Growth

Global Infrastructure

Compute & Storage

NPU

GN6: 8-way GPU Server

• SXM2 or PCIe

• Decoupled modular design

• Configurable topology

Topologies: Balanced, Common, Cascade

[Figure: one block diagram per topology, each showing CPU0 and CPU1 connected through PCIe switches to the GPUs]
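To make the effect of a configurable topology concrete, the sketch below models three wirings as trees and traces the path that GPU-to-GPU traffic would take. The wiring is an assumption based on how the names Balanced, Common and Cascade are commonly used for PCIe GPU servers; it is not recovered from the slide's figure, and it uses two hypothetical switches (`SW0`, `SW1`) with four GPUs each for simplicity.

```python
# Hypothetical wiring per topology: PCIe switch -> upstream device.
# These mappings are illustrative assumptions, not taken from the slide.
UPLINKS = {
    "balanced": {"SW0": "CPU0", "SW1": "CPU1"},  # one switch under each CPU
    "common": {"SW0": "CPU0", "SW1": "CPU0"},    # both switches under one CPU
    "cascade": {"SW0": "CPU0", "SW1": "SW0"},    # second switch behind the first
}


def build(topology: str) -> dict:
    """Return a child -> parent map, with CPU0 as the root of the tree."""
    parent = {"CPU1": "CPU0"}  # the CPU0-CPU1 edge stands for the inter-socket link
    parent.update(UPLINKS[topology])
    for g in range(8):
        parent[f"GPU{g}"] = "SW0" if g < 4 else "SW1"
    return parent


def path_to_root(node: str, parent: dict) -> list:
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def route(a: str, b: str, parent: dict) -> list:
    """Devices traversed between a and b, via their lowest common ancestor."""
    pa, pb = path_to_root(a, parent), path_to_root(b, parent)
    common = next(n for n in pa if n in pb)
    return pa[: pa.index(common) + 1] + pb[: pb.index(common)][::-1]


def crossing(route_nodes: list) -> str:
    """Classify the most expensive hop on a route by how many CPUs it touches."""
    cpus = sum(1 for n in route_nodes if n.startswith("CPU"))
    return ["PCIe switches only", "CPU root complex", "inter-socket link"][cpus]


for name in UPLINKS:
    r = route("GPU0", "GPU4", build(name))
    print(f"{name:>8}: {' -> '.join(r)}  ({crossing(r)})")
```

In this model, the balanced wiring gives each CPU its own GPUs but sends cross-group traffic over the inter-socket link, while the cascade wiring keeps all GPU-to-GPU traffic on the switches at the cost of a shared uplink to the CPU.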

Data, Computing Power, Algorithm

The Wave of AI Revolution

Deep Learning @ Alibaba

CloudSearch

PAI

iDST

Ant

Ads

City Brain

New Retail

Database Acceleration

Video Analysis

NLP

Cloud

Deep Learning Evolution

[Figure: ImageNet Classification Top-5 Error % (axis 0 to 30) and network depth in layers (axis 0 to 300), by ILSVRC year]

Top-5 error %: ILSVRC'10 28.2; ILSVRC'11 25.8; ILSVRC'12 AlexNet 16.4; ILSVRC'13 11.7; ILSVRC'14 VGG 7.3; ILSVRC'14 GoogLeNet 6.7; ILSVRC'15 ResNet 3.57; ILSVRC'16 GBD-Net 2.99

Depth labels: shallow, 8 layers, 19 layers, 22 layers, 152 layers, 269 layers
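The error values on this chart can be turned into relative improvements, which shows where the largest steps happened. The sketch below uses only the eight top-5 error figures printed on the slide, read in chart order; the pairing of each value with its ILSVRC label follows that order and is an assumption about the original figure.

```python
# (label, top-5 error %) in the order the values appear on the slide.
TOP5_ERROR = [
    ("ILSVRC'10", 28.2),
    ("ILSVRC'11", 25.8),
    ("ILSVRC'12 AlexNet", 16.4),
    ("ILSVRC'13", 11.7),
    ("ILSVRC'14 VGG", 7.3),
    ("ILSVRC'14 GoogLeNet", 6.7),
    ("ILSVRC'15 ResNet", 3.57),
    ("ILSVRC'16 GBD-Net", 2.99),
]


def relative_drop(before: float, after: float) -> float:
    """Fraction of the previous error that was removed."""
    return (before - after) / before


# Relative improvement of each entry over the one before it.
steps = [
    (label, relative_drop(prev_err, err))
    for (_, prev_err), (label, err) in zip(TOP5_ERROR, TOP5_ERROR[1:])
]

for label, drop in steps:
    print(f"{label:<22} {drop:6.1%} lower error than the previous entry")

total_drop = relative_drop(TOP5_ERROR[0][1], TOP5_ERROR[-1][1])
largest_step = max(steps, key=lambda s: s[1])
print(f"Overall reduction: {total_drop:.1%}; largest single step: {largest_step[0]}")
```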

PaiLiTao

• Category Prediction

• Object Detection

• Feature Extraction

• Index Searching

• Sorting & Output
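The five stages listed above can be read as a pipeline in which each stage narrows or transforms the query before the next. The sketch below is a toy illustration of that flow only: the stage names and order come from the slide, while every function body, the `CATALOG` data and the cosine-similarity search are hypothetical stand-ins, not a description of the production system.

```python
import math

# Hypothetical catalog: each item has a category and a precomputed feature vector.
CATALOG = [
    {"id": "shoe-1", "category": "shoes", "vec": [0.9, 0.1, 0.0]},
    {"id": "shoe-2", "category": "shoes", "vec": [0.7, 0.3, 0.1]},
    {"id": "bag-1", "category": "bags", "vec": [0.1, 0.9, 0.2]},
]


def predict_category(image: dict) -> str:
    return image["hint_category"]  # stand-in for a category classifier


def detect_object(image: dict) -> list:
    return image["pixels"]  # stand-in for a detector that crops the main object


def extract_features(region: list) -> list:
    return region  # stand-in for a CNN embedding; the toy "pixels" are already a vector


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_index(vec: list, category: str) -> list:
    """Score only the catalog items in the predicted category."""
    return [(item["id"], cosine(vec, item["vec"]))
            for item in CATALOG if item["category"] == category]


def sort_and_output(hits: list, k: int) -> list:
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


def visual_search(image: dict, k: int = 2) -> list:
    category = predict_category(image)      # 1. Category Prediction
    region = detect_object(image)           # 2. Object Detection
    vec = extract_features(region)          # 3. Feature Extraction
    hits = search_index(vec, category)      # 4. Index Searching
    return sort_and_output(hits, k)         # 5. Sorting & Output


query = {"hint_category": "shoes", "pixels": [0.8, 0.2, 0.0]}
print(visual_search(query))
```

Predicting the category first restricts the index search to one slice of the catalog, which is one plausible reason for that stage to come before feature extraction.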

OCR

• Tens of millions of images

• CNN model

• Single-character accuracy 99.6%

• Overall accuracy 93%

• 8-way distributed GPU solution

• 7x training speed
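Two of the figures above lend themselves to a quick back-of-the-envelope check. The sketch assumes the 7x speedup is measured against a single GPU, which the slide does not state, and it uses a simple independence model to relate per-character and overall accuracy; how the slide actually defines "overall accuracy" is not given, so the second result is illustrative only.

```python
import math


def scaling_efficiency(speedup: float, workers: int) -> float:
    """Fraction of ideal linear scaling achieved by a distributed run."""
    return speedup / workers


def chars_for_overall_accuracy(char_acc: float, overall_acc: float) -> float:
    """Text length n at which char_acc ** n equals overall_acc.

    Assumes character errors are independent, which is a simplification.
    """
    return math.log(overall_acc) / math.log(char_acc)


# 7x speedup on 8 GPUs, assuming a single-GPU baseline.
efficiency = scaling_efficiency(7.0, 8)
print(f"Scaling efficiency: {efficiency:.1%}")

# Length of text at which 99.6% per-character accuracy compounds to 93%.
n = chars_for_overall_accuracy(0.996, 0.93)
print(f"Equivalent text length under independence: about {n:.0f} characters")
```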

Translation, Voice, Insurance

Deep Learning Everywhere

Heterogeneous Machine Learning Platform

*data from NVIDIA GTC 2017

Hardware Accelerated AI

• Training: Compute Intensive, Time Cost

• Inference: Service Oriented, Response Time

• Eco-System: Framework, Libs, Precisions

• Hardware Dividends for everyone

Tipping point:

• Google TPUs

• Volta TensorCore

• New hardware accelerators for AI

Edge – Forces of Gravity

Data

Latency

Privacy

TCO

Customer

Device

Premises

Hyperscale

DC

<1 km, 1-8 km, 8-20 km

Regional DC

Core DC
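The distance bands on this slide put a floor under latency, since a signal cannot travel faster than light in fibre. The sketch below estimates that floor for the upper end of each band. It assumes a propagation speed of about 200,000 km/s in optical fibre and counts propagation only; switching, queuing and processing delays, which usually dominate in practice, are ignored, and the slide does not say which tier each band belongs to.

```python
# Assumption: light in optical fibre travels at roughly two thirds of c.
FIBRE_KM_PER_S = 200_000.0


def round_trip_us(distance_km: float) -> float:
    """Round-trip propagation delay in microseconds over a fibre of this length."""
    one_way_s = distance_km / FIBRE_KM_PER_S
    return 2 * one_way_s * 1e6


# Upper bound of each distance band shown on the slide.
BANDS = [("<1 km", 1.0), ("1-8 km", 8.0), ("8-20 km", 20.0)]

for label, km in BANDS:
    print(f"{label:>8}: up to {round_trip_us(km):.0f} us round trip (propagation only)")
```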

Function Computing

Lower Dev Cost

On-demand Use

Platform Differentiation

Increased Utilization

Less Control

Not Portable

Private Cloud

Increased costs for optimization

Technology Challenges

• Quality of Service

• Infrastructure Utilization

• Accelerator Efficiency

• Capacity Granularity

• Multi-Tenancy Management

• Demand Projection

• Scheduling

• Compatibility

Opportunities & Challenges

Fine-Grained Monitoring

For Efficiency

Perf

Power

Stability

Deep Customization

Best TCO

Competitive

Empower “Traditional” Algorithms

Thank You!

E-mail: lingjie.xu@alibaba-inc.com
