fujitsu world tour 2017 shaping tomorrow with you · data is generated by many iot devices and the...

Copyright 2017 FUJITSU

Intel Inside®. Powerful Productivity Outside.

shaping tomorrow with you

Fujitsu World Tour 2017 Fujitsu North America Technology Forum 2017



New Computing Paradigms: Architecture Innovations beyond Moore’s Law

TAKESHI HORIE Head of Computer Systems Laboratory

FUJITSU LABORATORIES LTD.


Why computing now?

Data explosion

Data is generated by many IoT devices and the amount of data is exploding.

Computing creates knowledge and intelligence from data. But traditional computing cannot handle it.

End of Moore’s law

For 50 years we have enjoyed device technology scaling. But that is ending.

Fundamentally rethink new computing architecture


Demand for Computing and Fujitsu Computer Systems


Computer performance

Since ENIAC was developed 70 years ago, computer performance has doubled every 1.5 years.

[Figure: Computations per second per computer, 1930 to 2010, log scale from 1.E+00 to 1.E+12, starting at ENIAC; trend: 2x / 1.5 years]

Photo: ENIAC, 1946 (U.S. federal government)
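As a rough illustration of what "2x / 1.5 years" compounds to, the sketch below takes the slide's doubling period at face value. The function name and the 70-year horizon used in the example are illustrative choices, not figures read off the chart.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Cumulative growth after `years` if performance doubles every period.

    The 1.5-year doubling period is the rate stated on the slide; applying
    it uniformly over the whole span is a simplifying assumption.
    """
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    for span in (1.5, 15, 70):
        print(f"{span:>5} years -> x{growth_factor(span):.3g}")
```

Seventy years corresponds to roughly 47 doublings, which is why the chart needs a logarithmic axis.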


Computing demand for scientific applications

Although computing has enabled applications in a variety of fields, much higher computing power is still required to solve complex real-world problems.

Heart simulation: joint research with the University of Tokyo

Tsunami simulation: joint research with Tohoku University - International Research Institute for Disaster

Life science and drug manufacturing

Global change prediction for reducing disaster

Industrial innovation

New material and energy creation

Origin of matter and the universe


Computing demand for financial applications

Tokyo Stock Exchange, Inc. (TSE) is one of the world's top trading markets and lists around 3,800 issues. Daily trading value exceeds three trillion yen.

Trading volume is increasing year by year.

For high-frequency trading, response time was reduced from 2 ms to 500 μs in 5 years.

[Figure: Trading Volume in TSE 1st section, 1949 to 2015, in millions (axis 0 to 900)]

[Figure: Response Time of TSE: 2 ms (2010), 900 μs (2012), 500 μs (2015)]


Fujitsu computer systems

[Figure: Timeline of Fujitsu computer systems, 1950 to 2010, in four categories: Supercomputer, Mainframe, Enterprise Servers, Ubiquitous Terminal]

Systems shown: FACOM100 (1954), FACOM230-10 (1965), M-190 (1976), OASYS100 (1980), VP-100 (1982), M-780 (1985), FM TOWNS (1989), M-1800 (1990), DS90 (1991), VPP-500 (1992), FM V (1993), GS21 (2002), PRIMEQUEST (2005), PRIMEHPC FX10 (2011), Arrows (2011), SPARC M10 (2013)


Fujitsu microprocessors

[Figure: Fujitsu microprocessor roadmap by period (to 1999, 2000 to 2003, 2004 to 2007, 2008 to 2011, 2012 to 2015, 2016 onward) in three product lines: Mainframe, UNIX, Supercomputer]

Processors and systems shown: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 VIIIfx, SPARC64 IXfx, SPARC64 X, SPARC64 X+, SPARC64 XIfx; GS8600, GS8800, GS8800B, GS8900, GS21, GS21 600, GS21 900, GS21 M2600; K computer

High performance: Store Ahead, Branch History, Prefetch, Single-chip CPU, Non-Blocking $, O-O-O Execution, Super-Scalar, L2$ on Die, Multi-core, Multi-thread, HPC-ACE, System on Chip, Hardware Barrier, Virtual Machine Architecture, Software On Chip, High-speed Interconnect

High reliability: $ECC, Register/ALU Parity, Instruction Retry, $ Dynamic degradation, RC/RT/History


Fujitsu high performance computing

Fujitsu provides many HPC solutions to satisfy various customer demands.

Support for both supercomputers with original CPU and x86 cluster systems

Post-K will be developed in collaboration with RIKEN and ARM

K computer (Co-developed with RIKEN)

x86 Cluster

Post-K (Co-developed with RIKEN and ARM)

PRIMEHPC FX100

Oakforest PACS

Original CPU

BX900 Cluster (Co-developed with JAEA)


IoT and Data Explosion


IoT connects everything

By 2020, 50 billion devices will be connected and generate data constantly.

[Figure: Billions of devices (axis 10 to 50) by year, 1990 to 2020, plotted against the worldwide population (src: CISCO)]

Annotations: only 1 million PCs were connected to the Internet; the number of devices exceeded the worldwide population; more than 50 billion devices in 2020


Data explosion

As the amount of data explodes, it exceeds the capability of traditional ICT. New processing is needed to create valuable information from unstructured data.

[Figure: Data explosion: amount of data by year, 1990 to 2020, with levels marked at 1 ZB, 40 ZB, and 1 YB]

1 ZB = 10^21 bytes, 1 YB = 10^24 bytes

Amount of data will reach 40 zettabytes by 2020 and 1 yottabyte by 2030

Unstructured data: IoT, sensors

Structured data: business data, RDB
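The two projections on this slide (40 ZB by 2020, 1 YB by 2030) imply an average yearly growth rate. The following is a minimal sketch of that arithmetic; the helper name is hypothetical, and smooth compound growth between the two points is an assumption.

```python
ZB = 10**21  # bytes in one zettabyte, as defined on the slide
YB = 10**24  # bytes in one yottabyte, as defined on the slide


def annual_growth_rate(start_bytes, end_bytes, years):
    """Compound annual growth rate between two data volumes."""
    return (end_bytes / start_bytes) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    rate = annual_growth_rate(40 * ZB, 1 * YB, years=10)
    print(f"Implied growth 2020 to 2030: {rate:.1%} per year")
```

The total factor is 25x over ten years, which works out to a little under 40% per year.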


Data lifecycle and processing

New processing throughout the data lifecycle creates knowledge and intelligence.

Pre-process data at the edge

Collect and distribute Information

Extract value from volume of data

Provide solutions with knowledge and Intelligence

Data Explosion

Cloud

AI

IoT

Knowledge Integration


New computing for data explosion

New computing extracts knowledge and intelligence from data, and enables delivery of new applications and services.

Knowledge

Knowledge and intelligence computing

Data processing

Extract value from volume

Numerical computing

Information

New applications and services

Intelligence


Technology Trend for Computing


Moore’s law and microprocessor trend

[Figure: Performance trend of microprocessors, 1970 to 2030, log scale from 10^0 to 10^9, including # of cores and growth rates (CAGR). Source: estimated based on Stanford, K. Rupp]

Annotations: Moore’s law drives processor performance; power consumption limits performance; end of Moore’s law. Year markers: 2005, 2025


Trade-off line of Moore’s law

Device technology scaling has brought both higher performance and higher power efficiency over the past 50 years.

The trade-off line is determined by the device technology of each generation. As technology scales, the trade-off line moves upward.

Technology scaling will stop around 2025.

Power efficiency × (Performance)^2 = K ∝ s^5 (s: scaling factor)

[Figure: Advancement of Moore’s trade-off line: power efficiency (a.u., 1 to 10^4) versus performance (a.u., 10^2 to 10^5), with trade-off lines for 1990, 2000, 2010, and 2025, and regions marked Mobile and Server]

Technology scaling will no longer be a driver for computing
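The trade-off relation on this slide can be explored numerically. The sketch below treats the relation as an exact equality in arbitrary units; the function names and the sample values of K and s are illustrative assumptions, not values from the talk.

```python
def power_efficiency(performance, k):
    """Power efficiency on the trade-off line: efficiency * performance**2 = K."""
    return k / performance ** 2


def scaled_constant(k, s):
    """New trade-off constant after scaling by factor s, using K proportional to s**5."""
    return k * s ** 5


if __name__ == "__main__":
    k = 1e8  # arbitrary units, chosen only for illustration
    for perf in (1e2, 1e3, 1e4):
        print(f"performance {perf:.0e} -> efficiency {power_efficiency(perf, k):.3g}")
    # One generation with s = 2 moves the whole line up by 2**5 = 32x.
    print(scaled_constant(k, 2) / k)
```

On a fixed line, a 10x gain in performance costs 100x in power efficiency, so once s stops growing the only way off the line is a change of architecture.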


Computing innovations

Continue to create new computing paradigms for unlimited performance growth

[Figure: Performance versus year, 2010 to 2030: Conventional Computing Paradigm, Domain Specific Computing (Adapted), New Computing Paradigm]


Computing Architecture Innovation


Data explosion and challenges

Overcome challenges by innovation for computing and data processing

[Figure: Data explosion: amount of data by year, 2000 to 2030, split into structured data and unstructured data]

Challenges • Process technology • Network bandwidth • Power consumption • Computing power


Our proposal for computing architecture innovation

Create new computing paradigm for data explosion

[Figure: Data explosion: amount of data by year, 2000 to 2030, split into structured data and unstructured data, reaching 40 ZB (40 × 10^21 B) and then 1 YB (10^24 B)]

Limits of power, transmission, integration, and processing

Challenges • Process Technology • Network Bandwidth • Power Consumption • Computing Power

Labels: Data explosion; New Computing Architecture; Moore’s Law Computing; Hyperconnected Cloud; Cloud Computing; System


Hyperconnected Cloud

R&D vision and strategy: “Hyperconnected Cloud”

Web-scale ICT provides computing and data processing power through service-oriented connection.

AI and security are embedded at every layer to create knowledge in a safe and secure society.


New computing architecture

From numerical to media, knowledge, and intelligence processing

[Figure: Computing landscape with axes Processing and New metrics, bounded by the Limit of Moore’s Law]

Areas shown: Conventional Computing, Supercomputer, Accelerator, Neural Computing (Learning), Neural Computing (Inference), Brain-Inspired Computing, Approximate Computing, Quantum Computing


Direction of new computing architecture

Conventional: Strict Accuracy, General Purpose, Many Core

New Computing: Relaxed Accuracy, Simple and Specific Core, Extreme Parallelism


Domain specific computing

Achieve extremely high performance, simple operation and low cost by specializing hardware and software in specific application domains

[Figure: Computing landscape (axes: Processing, New metrics) highlighting Supercomputer, Accelerator, Quantum Computing, Neural Computing, Quantum-Inspired Computing, Media Processing]



Media Processing


Needs for image retrieval

Office workers routinely create and store numerous documents that contain images, such as presentation materials.

The massive volume of stored image materials is not sufficiently reused.

10% of work time at offices is wasted searching for wanted documents.

A more intuitive search method is needed: “search by image” increases productivity.


Partial image retrieval

Find images based on matches with a part of the query image

A general-purpose server takes a long time to process the massive calculations required for partial matching.

[Figure: A query image is searched against a massive image DB; the search results include partial matches and enlarged/reduced images]

Partial image retrieval must be accelerated to search for a target image intuitively and efficiently.


Image search acceleration system: demonstration

We developed technology for instantaneous searches of a target image from a massive volume of images


Image search acceleration system: architecture and implementation

Designed special engines for feature extraction and matching on an FPGA

[Figure: Server with database. CPU: I/O processing, overall control. FPGA: partial image retrieval engine (feature extraction, matching)]

Dedicated processing unit for feature extraction: 32 cores, F.E. 0 to F.E. 31 (32-way parallelization)

Dedicated processing unit for matching: Match 0 to Match 5, each with H.D. 0 to H.D. 63; 64-way x 6 cores (384-way parallelization)

F.E.: Feature extraction

H.D.: Hamming distance calculation

Press release on Feb. 2nd 2016

Extreme Parallelism

Simple & Specific Core

Relaxed Accuracy
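The matching unit is described only as a bank of Hamming distance calculators. As a software sketch of that operation, the code below compares a query feature against a database of features packed into integers. The binary-descriptor representation, the function names, and the sample values are assumptions for illustration; the slides do not describe the actual feature format.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two binary features differ."""
    return bin(a ^ b).count("1")


def best_match(query, database):
    """Return (index, distance) of the database feature closest to the query.

    Each distance is independent of the others, which is what makes the
    operation easy to spread across many simple parallel units.
    """
    distances = [hamming_distance(query, feature) for feature in database]
    index = min(range(len(database)), key=distances.__getitem__)
    return index, distances[index]


if __name__ == "__main__":
    db = [0b0000, 0b1010, 0b0100]  # hypothetical 4-bit features
    print(best_match(0b1011, db))
```

Because the inner operation is just XOR followed by a bit count, it maps naturally onto small dedicated cores rather than a general-purpose pipeline.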


Image search acceleration system: performance and applications

[Figure: Throughput. Conventional server (many core): 200 images/sec. Media domain specific server (FPGA): 12,000 images/sec. More than 50 times]

“Search by image” makes document creation more productive and can be applied to medical and weather applications

Applications: Documents, Medical, Weather


Neural Computing


Neural computing comes back again

Since 2012, deep learning algorithms and enhanced computing capability have enabled much higher object recognition rates than before.

[Figure: Manual design: input image, feature extraction (manual), classification, results. Automatic extraction (deep learning): input image, feature extraction and classification (automatic), results]

[Figure: General object recognition rate (axis 0.00 to 0.30), 2011 to 2015: conventional machine learning algorithm versus neural computing; large difference; improving every year]

[Figure: Neural network (feedforward) with input, weights w_ij, and outputs y_1, y_2, ..., y_n; learning and inference]
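For readers unfamiliar with the feedforward network in the figure, a single layer can be sketched as a weighted sum followed by a nonlinearity. This is a generic textbook formulation, not code from the talk; the sigmoid activation and the `w[i][j]` indexing (input i to output j) are assumptions.

```python
import math


def sigmoid(z):
    """Logistic activation; one common choice, assumed here for illustration."""
    return 1.0 / (1.0 + math.exp(-z))


def forward_layer(x, w, activation=sigmoid):
    """Compute y_j = activation(sum_i w[i][j] * x[i]) for every output j."""
    n_out = len(w[0])
    return [
        activation(sum(w[i][j] * x[i] for i in range(len(x))))
        for j in range(n_out)
    ]


if __name__ == "__main__":
    x = [1.0, 2.0]                    # input
    w = [[0.5, -1.0], [0.25, 0.75]]   # hypothetical weights w_ij
    print(forward_layer(x, w))
```

Inference is repeated application of this forward step; learning additionally adjusts the weights, which is the more compute-intensive part.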


Computing for deeper neural network

To achieve higher accuracy, neural networks have become deeper and larger.

Processing speed: learning with a deeper neural network is time consuming.

Processing capacity: the limited memory size of a GPU is critical for a larger neural network.

[Figure: Neural network size trend: GPU memory size and NN size (batch = 8) in GB (axis 0 to 18) by year, 1998 to 2016, for LeNet, AlexNet, VGGNet, and ResNet; GPU memory is ~16 GB]


Fastest learning w/ HPC technology

Developed high-speed technology to process deep learning

Using "AlexNet," 64 GPUs in parallel achieve 27 times the speed of a single GPU, the world's fastest processing

Press release on Aug. 9th 2016

[Figure: Learning speed at the same accuracy: 1 GPU; conventional (64 GPUs); our approach (64 GPUs). Our approach: 27x faster learning speed than 1 GPU (60x faster execution speed) and 1.8x faster than conventional]
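One way to read the 27x figure is as parallel efficiency, that is, speedup divided by the number of GPUs. The sketch below performs only that division; the helper name is hypothetical, and the conventional-approach speedup is derived from the slide's 1.8x ratio rather than stated on it.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved."""
    return speedup / workers


if __name__ == "__main__":
    ours = parallel_efficiency(27, 64)
    # Derived, not stated: 27x is 1.8x faster than conventional on 64 GPUs.
    conventional = parallel_efficiency(27 / 1.8, 64)
    print(f"our approach: {ours:.1%}, conventional: {conventional:.1%}")
```

This says nothing about why the scaling is sub-linear; it only puts the two reported numbers on a common scale.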


Doubles deep learning neural network scale

Developed technology that streamlines the use of internal GPU memory to support the growing neural network scale needed for higher machine learning accuracy

Enabled neural network learning at up to twice the scale possible with previous technology

Response after press release

“How A New Technology Promises To Make Learning More Powerful Than It Already Is” By Kelvin Murae, Forbes

[Figure: Conventional versus our approach with the same memory: 2x more images, 4% more accuracy]

Press release on Sep. 21st 2016


Deep learning processor: DLU™ (Deep Learning Unit)

Dedicated architecture for deep learning

Supercomputer’s interconnect (Tofu interconnect): max 100,000 DLU connection

Extremely low power design

[Figure: DLU block diagram: HBM2, Host I/F, and DPU-0, DPU-1, ..., DPU-n, each DPU containing multiple DPEs]

Extreme Parallelism

Relaxed Accuracy

Press release on Nov. 29th 2016



Quantum-Mechanics-Inspired Computing


Motivation: combinatorial optimization problem

An efficient approach is needed to cope with the explosion of combinations.

Various combinatorial optimization problems exist in the real world: power delivery, disaster recovery, investment portfolio.

Vehicle routing problem: finding the optimal routes for delivering vehicles to 2,000 customers

[Figure: Delivery map with a depot, a port, and customer1, customer2, customer3]

Calculation time increases exponentially with the number of customers.

We need to choose the optimal route out of an enormous number of combinations to minimize the cost.

[Figure: Routing patterns. Each pattern assigns Customer1, Customer2, Customer3, ..., Customer2000 to the 1st, 2nd, 3rd, ... visit. ~10^7535 order combinations. Which pattern has the minimum cost?]
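To get a feel for how quickly visiting orders multiply, the sketch below counts the permutations of n customers (n!) on a log scale. Counting orderings as n! is an assumption made here for illustration; the slide's own figure may count combinations differently, for example by including vehicle assignments.

```python
import math


def log10_orderings(n):
    """log10 of n!, the number of ways to order n customers.

    Uses lgamma(n + 1) = ln(n!) so that very large n does not overflow.
    """
    return math.lgamma(n + 1) / math.log(10)


if __name__ == "__main__":
    for n in (10, 32, 100, 2000):
        print(f"{n:>5} customers -> about 10^{log10_orderings(n):.0f} orderings")
```

Even a few dozen customers rule out exhaustive enumeration, which is the motivation for the search-based approaches that follow.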


Our strategy to solve optimization problem

Create high-speed and widely applicable architecture

[Figure: Speed (slow to fast) versus applicability (from limitation of problems to applicable to practical problems), positioning the conventional processor, the quantum computer*, and our goal]

* Quantum Annealing type

• Locating power grid failure

• Pick-up and delivery of 2000 depots

• Locating failures in 20-breaker power grid

• Map coloring


Quantum-Mechanics-Inspired Computer

Architecture to meet usability and scalability for combinatorial optimization

• Solve practical problems by using CMOS digital design

• Realize scalability for larger problems and speed enhancement

Features

• Simple core reduces data movement and control overheads.

• Massively-parallel stochastic search is implemented to accelerate search paths.

[Figure labels: multiple engines for larger problems; further speed up achieved by parallelism; speed up by parallel score calculation and transition facilitation]

Press release on Oct. 20th 2016


Extreme Parallelism

Relaxed Accuracy
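The slides do not give the search algorithm itself. As a loosely related illustration of stochastic search with all candidate moves scored at every step, here is a plain simulated annealing sketch on a small QUBO problem. The QUBO formulation, the cooling schedule, and every name and parameter below are assumptions; this is not Fujitsu's architecture.

```python
import math
import random


def energy(Q, x):
    """QUBO energy: sum over i, j of Q[i][j] * x[i] * x[j]."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def flip_delta(Q, x, k):
    """Energy change if bit k is flipped (the score of that candidate move)."""
    d = 1 - 2 * x[k]  # +1 for a 0 -> 1 flip, -1 for a 1 -> 0 flip
    cross = sum((Q[k][j] + Q[j][k]) * x[j] for j in range(len(x)) if j != k)
    return d * (Q[k][k] + cross)


def anneal(Q, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Simulated annealing that scores every single-bit flip at each step."""
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    e = energy(Q, x)
    best_x, best_e = x[:], e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        # All scores are independent; hardware could compute them in parallel.
        deltas = [flip_delta(Q, x, k) for k in range(n)]
        accepted = [k for k, d in enumerate(deltas)
                    if d <= 0 or rng.random() < math.exp(-d / t)]
        if not accepted:
            continue  # no move passed the acceptance test at this step
        k = rng.choice(accepted)
        x[k] = 1 - x[k]
        e += deltas[k]
        if e < best_e:
            best_x, best_e = x[:], e
    return best_x, best_e


if __name__ == "__main__":
    # Toy problem: reward each bit set to 1, penalize adjacent pairs of 1s.
    Q = [[-1, 2, 0],
         [0, -1, 2],
         [0, 0, -1]]
    print(anneal(Q))
```

Scoring every flip before choosing one raises the chance that some move is accepted at each step, which is one plausible reading of "parallel score calculation"; the slides do not confirm this interpretation.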


Evaluation of our prototype

Engine performance evaluated using FPGA implementation

12,000x speedup confirmed on a 32-city traveling salesman problem

[Figure: Time to solution (sec), log scale from 0.1 to 10,000: conventional processor* versus FPGA implementation with parallel score calculation and transition facilitation (this work). Speedup factors labeled 2x, 1000x, and 6x; 12,000x overall]

*3.5-GHz Intel Xeon E5
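The chart labels three intermediate factors (2x, 1000x, 6x) alongside the 12,000x total. If those factors compose multiplicatively, which is an assumption since the slide does not say so explicitly, the total follows directly:

```python
import math

# Factors as labeled on the chart; which stage each belongs to is not
# fully recoverable from the slide text.
stage_speedups = [2, 1000, 6]

total = math.prod(stage_speedups)
print(total)
```

The product matches the overall figure on the slide, which supports (but does not prove) the multiplicative reading.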


Demonstration


Ecosystem of combinatorial optimizer

Collaborate with universities, research institutes and industries to apply our technologies to practical problems

[Figure: Ecosystem around the Combinatorial Optimizer: Fujitsu, universities, and research institutes; software development environment; enhanced engine; open framework on cloud service; PoB with early users; user community; practical applications (delivery, CAD, AI)]


Approximate computing

Optimizing accuracy to the target workload enables higher performance and higher energy efficiency at the same time.

[Figure: Computing landscape (axes: Processing, New metrics) showing Supercomputer, Accelerator, and Quantum Computing]


Summary


Computing innovations beyond Moore’s law

Fujitsu will continue to innovate computing architecture

[Figure: Performance versus year, 2010 to 2030: Conventional Computing Paradigm; Domain Specific Computing (Media Processing, Neural Computing, Quantum-Inspired Computing); New Computing Paradigm]