TRANSCRIPT
0 Copyright 2017 FUJITSU
Intel Inside®. Powerful Productivity Outside. Intel Inside®. Powerful Productivity Outside.
shaping tomorrow with you
Fujitsu World Tour 2017 Fujitsu North America Technology Forum 2017
New Computing Paradigms: Architecture Innovations beyond Moore’s Law
TAKESHI HORIE Head of Computer Systems Laboratory
FUJITSU LABORATORIES LTD.
Why computing now?
Data explosion
Data is generated by many IoT devices and the amount of data is exploding.
Computing creates knowledge and intelligence from data, but traditional computing cannot handle it.
End of Moore’s law
For 50 years we have enjoyed device technology scaling, but that is ending.
We must fundamentally rethink computing architecture
Demand for Computing and Fujitsu Computer Systems
Computer performance
Since ENIAC was developed 70 years ago, computer performance has doubled every 1.5 years.
[Figure: computations per second per computer, 1930 to 2010, log scale from 1.E+00 to 1.E+12; the trend starts at ENIAC (1946, U.S. federal government) and grows 2x / 1.5 years]
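As a rough check of what "2x every 1.5 years" adds up to, the cumulative growth over 70 years can be computed directly. This is only a sketch of the arithmetic; the function name growth_factor is illustrative and not from the slides.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Cumulative growth after `years` if performance doubles every period."""
    return 2.0 ** (years / doubling_period_years)


years = 70
doublings = years / 1.5
# About 46.7 doublings, i.e. a growth factor on the order of 10^14.
print(f"doublings: {doublings:.1f}, growth: {growth_factor(years):.2e}x")
```

About 47 doublings correspond to a growth factor on the order of 10^14.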
Computing demand for scientific applications
Although computing has enabled applications in a variety of fields, much higher computing power is still required to solve complex real-world problems.
Heart simulation: joint research with the University of Tokyo
Tsunami simulation: joint research with Tohoku University - International Research Institute for Disaster
Life science and drug manufacturing
Global change prediction for reducing disaster
Industrial innovation
New material and energy creation
Origin of matter and the universe
Computing demand for financial applications
Tokyo Stock Exchange, Inc. (TSE) is one of the world's top trading markets and lists around 3,800 issues. Daily trading value exceeds three trillion yen.
Trading volume is increasing year by year
For high frequency trading, response time was reduced from 2 ms to 500 μs in 5 years
[Figure: trading volume in TSE 1st section, 1949 to 2015, vertical axis 0 to 900 million]
[Figure: response time of TSE: 2 ms (2010), 900 μs (2012), 500 μs (2015)]
Fujitsu computer systems
[Figure: timeline of Fujitsu computer systems, 1950 to 2010, in four categories: Supercomputer, Mainframe, Enterprise Servers, Ubiquitous Terminal]
FACOM100 (1954)
FACOM230-10 (1965)
M-190 (1976)
OASYS100 (1980)
VP-100 (1982)
M-780 (1985)
FM TOWNS (1989)
M-1800 (1990)
DS90 (1991)
VPP-500 (1992)
FM V (1993)
GS21 (2002)
PRIMEQUEST (2005)
PRIMEHPC FX10 (2011)
Arrows (2011)
SPARC M10 (2013)
Fujitsu microprocessors
[Figure: Fujitsu microprocessor roadmap by period (to 1999, 2000 - 2003, 2004 - 2007, 2008 - 2011, 2012 - 2015, 2016 onward) across three lines: Mainframe, UNIX, and Supercomputer]
Mainframe processors: GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21, GS21 900, GS21 M2600
UNIX and supercomputer processors: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 VIIIfx (K computer), SPARC64 IXfx, SPARC64 X, SPARC64 X+, SPARC64 XIfx
High performance technologies: Store Ahead, Branch History, Prefetch, Single-chip CPU, Non-Blocking $, O-O-O Execution, Super-Scalar, L2$ on Die, Multi-core, Multi-thread, HPC-ACE, System on Chip, Hardware Barrier, Virtual Machine Architecture, Software On Chip, High-speed Interconnect
High reliability technologies: $ECC, Register/ALU Parity, Instruction Retry, $ Dynamic degradation, RC/RT/History
Fujitsu high performance computing
Fujitsu provides many HPC solutions to satisfy various customer demands.
Support for both supercomputers with original CPU and x86 cluster systems
Post-K will be developed in collaboration with RIKEN and ARM
Original CPU: K computer (co-developed with RIKEN), PRIMEHPC FX100, Post-K (co-developed with RIKEN and ARM)
x86 cluster: Oakforest-PACS, BX900 Cluster (co-developed with JAEA)
IoT and Data Explosion
IoT connects everything
By 2020, 50 billion devices will be connected and generate data constantly.
[Figure: billions of connected devices (0 to 50) by year, 1990 to 2020, compared with the world population (src: CISCO). Annotations: only 1 million PCs were connected to the Internet; the number of devices exceeded the world population; more than 50 billion devices in 2020]
Data explosion
As the amount of data explodes, it exceeds the capability of traditional ICT. New processing is needed to create valuable information from unstructured data.
[Figure: amount of data by year, 1990 to 2020, split into structured data (business data, RDB) and unstructured data (IoT, sensors)]
1 ZB = 10^21 bytes, 1 YB = 10^24 bytes
The amount of data will reach 40 zettabytes by 2020 and 1 yottabyte by 2030
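The two projections on the data explosion slide (40 ZB by 2020, 1 YB by 2030) imply a compound annual growth rate, which can be worked out as below. This is a sketch of the arithmetic only; annual_growth_rate is an illustrative helper name, not something from the talk.

```python
ZB = 10 ** 21  # zettabyte, in bytes
YB = 10 ** 24  # yottabyte, in bytes


def annual_growth_rate(start_bytes, end_bytes, years):
    """Compound annual growth rate between two data volumes."""
    return (end_bytes / start_bytes) ** (1.0 / years) - 1.0


# 40 ZB (2020) to 1 YB (2030) is a 25x increase over 10 years.
rate = annual_growth_rate(40 * ZB, 1 * YB, years=10)
print(f"implied growth: {rate:.1%} per year")
```

A 25-fold increase over ten years corresponds to roughly 38% growth per year.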
Data lifecycle and processing
New processing throughout the data lifecycle creates knowledge and intelligence.
Pre-process data at the edge
Collect and distribute Information
Extract value from volume of data
Provide solutions with knowledge and Intelligence
Data Explosion
Cloud
AI
IoT
Knowledge Integration
New computing for data explosion
New computing extracts knowledge and intelligence from data, and enables delivery of new applications and services.
Knowledge
Knowledge and intelligence computing
Data processing
Extract value from volume
Numerical computing
Information
New applications and services
Intelligence
Technology Trend for Computing
Moore’s law and microprocessor trend
[Figure: performance trend of microprocessors and number of cores by year, 1970 to 2030, log scale from 10^0 to 10^9, with growth rates labelled as CAGR (source: estimated based on Stanford, K. Rupp). Three phases with boundaries marked at 2005 and 2025: Moore’s law drives processor performance; power consumption limits performance; end of Moore’s law]
Trade-off line of Moore’s law
Device technology scaling has brought higher performance as well as higher power efficiency over the past 50 years.
The trade-off line is determined by the device technology of each generation. As technology scales, the trade-off line moves upward.
Technology scaling will stop around 2025.
s: scaling factor
$\text{Power efficiency} \times (\text{Performance})^2 = K \propto s^5$
[Figure: power efficiency (a.u., 1 to 10^4) versus performance (a.u., 10^2 to 10^5), with trade-off lines for 1990, 2000, 2010 and 2025; labels: Mobile, Server, advancement of Moore’s trade-off line]
Technology scaling will no longer be a driver for computing
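The trade-off relation on this slide (power efficiency times performance squared equals a constant K, with K proportional to the fifth power of the scaling factor s) can be explored numerically. The sketch below assumes arbitrary units and a hypothetical starting constant k0; the function names are illustrative.

```python
import math


def power_efficiency(performance, k):
    """Point on the trade-off line: efficiency * performance**2 = k."""
    return k / performance ** 2


def scaled_constant(k, s):
    """Trade-off constant after device scaling by factor s (K grows as s**5)."""
    return k * s ** 5


k0 = 1.0e6  # hypothetical constant for one technology generation (a.u.)

# On a fixed line, doubling performance costs a factor of 4 in efficiency.
print(power_efficiency(100, k0), power_efficiency(200, k0))

# Scaling by s = 2 lifts the line by 2**5 = 32 at the same performance.
k1 = scaled_constant(k0, 2)
print(power_efficiency(100, k1) / power_efficiency(100, k0))

# At unchanged efficiency, performance can grow by sqrt(32), about 5.7x.
print(math.sqrt(k1 / k0))
```

Once s stops growing, K stays fixed, and any further performance gain on the same line has to be paid for in power efficiency.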
Computing innovations
Continue to create new computing paradigms for unlimited performance growth
[Figure: performance versus year, 2010 to 2030: conventional computing paradigm versus new computing paradigm; labels: Domain Specific Computing, Adapted]
Computing Architecture Innovation
Data explosion and challenges
Overcome challenges by innovation for computing and data processing
[Figure: amount of data by year, 2000 to 2030, split into structured and unstructured data (data explosion)]
Challenges
• Process technology
• Network bandwidth
• Power consumption
• Computing power
Our proposal for computing architecture innovation
Create new computing paradigm for data explosion
[Figure: amount of data by year, 2000 to 2030, split into structured and unstructured data, reaching 40 ZB (40 × 10^21 B) and then 1 YB (10^24 B); the data explosion runs into the limits of power, transmission, integration, and processing]
Challenges
• Process technology
• Network bandwidth
• Power consumption
• Computing power
[Figure labels: Data explosion; New Computing Architecture; Moore’s Law; Computing; Hyperconnected Cloud; Cloud Computing; System]
Hyperconnected Cloud
R&D vision and strategy: “Hyperconnected Cloud”
Web scale ICT provides computing and data processing power through service-oriented connection
AI and security are embedded at every layer to create knowledge in a safe and secure society
New computing architecture
From numerical to media, knowledge, and intelligence processing
[Figure: types of processing positioned relative to the limit of Moore’s law and new metrics: conventional computing, supercomputer, accelerator, approximate computing, neural computing (learning), neural computing (inference), brain-inspired computing, quantum computing]
Direction of new computing architecture
Conventional: Strict Accuracy, General Purpose, Many Core
New Computing: Relaxed Accuracy, Simple and Specific Core, Extreme Parallelism
Domain specific computing
Achieve extremely high performance, simple operation and low cost by specializing hardware and software in specific application domains
[Figure: types of processing positioned relative to the limit of Moore’s law and new metrics: conventional computing, supercomputer, accelerator, approximate computing, neural computing (learning), neural computing (inference), brain-inspired computing, quantum computing; domain specific areas highlighted: Neural Computing, Quantum-Inspired Computing, Media Processing, Approximate Computing]
Media Processing
Needs for image retrieval
Office workers routinely create and store numerous documents that contain images, such as presentation materials.
The massive volume of stored image material is not sufficiently reused.
10% of office work time is wasted searching for documents.
A more intuitive search method is needed: “search by image” increases productivity
Partial image retrieval
Find images based on matches with a part of the query image
A general-purpose server takes a long time to process the massive calculations required for partial matching
[Figure: a query image is searched against a massive image DB; search results include partial matches and enlarged or reduced images]
Partial image retrieval must be accelerated to search for a target image intuitively and efficiently
Image search acceleration system: demonstration
We developed technology for instantaneous searches of a target image from a massive volume of images
Image search acceleration system: architecture and implementation
Designed special engines for feature extraction and matching with FPGA
[Figure: server containing a CPU (I/O processing, overall control), an FPGA partial image retrieval engine (feature extraction, matching), and a database]
Dedicated processing unit for feature extraction: 32 cores, F.E. 0 to F.E. 31 (32-way parallelization); F.E.: feature extraction
Dedicated processing unit for matching: 6 cores, Match 0 to Match 5, each with H.D. 0 to H.D. 63 (64-way x 6 cores, 384-way parallelization); H.D.: Hamming distance calculation
Press release on Feb. 2nd 2016
Extreme Parallelism
Simple & Specific Core
Relaxed Accuracy
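The matching unit on this slide is built from Hamming distance calculators. A minimal software sketch of that idea follows, assuming each image feature is a fixed-width binary code stored as an integer; the 8-bit codes, the threshold and the function names are made up for illustration and are not taken from the press release.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two binary feature codes differ."""
    return bin(a ^ b).count("1")


def best_matches(query: int, database: list, max_distance: int) -> list:
    """Return (distance, index) pairs within max_distance, closest first.

    Each comparison is independent of the others, which is why hardware
    can evaluate many of them at the same time.
    """
    scored = ((hamming_distance(query, feat), idx)
              for idx, feat in enumerate(database))
    return sorted((d, i) for d, i in scored if d <= max_distance)


# Hypothetical 8-bit feature codes for four stored images.
database = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
query = 0b10110110
print(best_matches(query, database, max_distance=2))
```

Because the operation is only XOR plus a bit count, a small dedicated core is enough, which fits the "simple and specific core" and "extreme parallelism" labels on the slide.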
Image search acceleration system: performance and applications
[Figure: throughput: conventional server 200 images/sec; media domain specific server 12,000 images/sec (more than 50 times)]
“Search by image” makes document creation more productive and can be applied to medical and weather applications
Documents Medical Weather
FPGA Many core
Neural Computing
Neural computing comes back again
Since 2012, deep learning algorithms and enhanced computing capability have enabled much higher object recognition rates than ever before.
[Figure: manual design: input image, feature extraction (manual design), features, classification, results. Deep learning: input image, feature extraction (automatic), features, classification (automatic), results]
[Figure: general object recognition rate, 2011 to 2015, vertical axis 0.00 to 0.30: a large difference between conventional machine learning algorithms and neural computing, which improves every year]
[Figure: feedforward neural network with input, weights w_ij and outputs y_1, y_2, ..., y_n; learning and inference]
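The feedforward network in the figure computes each output from weighted sums of its inputs. A minimal inference sketch is given below; the sigmoid activation, the bias terms and the example weights are assumptions for illustration, since the slide shows only the inputs, the weights w_ij and the outputs y.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice; the slide names none)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(x, w, b):
    """One layer: y_j = f(sum_i w[i][j] * x[i] + b[j])."""
    return [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def feedforward(x, layers):
    """Inference: pass the input through each (weights, biases) pair in turn."""
    for w, b in layers:
        x = layer_forward(x, w, b)
    return x


# Hypothetical 2-3-1 network with hand-picked weights.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1, -0.1]),
    ([[1.0], [-1.0], [0.5]], [0.0]),
]
print(feedforward([1.0, 2.0], layers))
```

Learning adjusts the weights w_ij from training data, whereas inference only runs this forward pass, so inference is far cheaper per input.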
Computing for deeper neural network
To achieve higher accuracy, neural networks have become deeper and larger.
Processing speed: computing for learning with deeper neural networks is time consuming
Processing capacity: limited memory size on GPU is critical for larger neural networks
[Figure: neural network size trend, 1998 to 2016: GPU memory size versus NN size (batch = 8), vertical axis memory size 0 to 18 GB, with LeNet, AlexNet, VGGNet and ResNet marked and GPU memory at ~16 GB]
Fastest learning w/ HPC technology
Developed high-speed technology to process deep learning
Using "AlexNet," 64 GPUs in parallel achieve 27 times the speed of a single GPU for the world's fastest processing
Press release on Aug. 9th 2016
[Figure: learning speed at the same accuracy, 1 GPU versus 64 GPUs: our approach (64 GPUs) gives 27x faster learning speed (60x faster execution speed) and is 1.8x faster than the conventional approach (64 GPUs)]
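The quoted speedup can be turned into a parallel efficiency figure, which shows how far 64 GPUs are from ideal linear scaling. This is a sketch of the arithmetic only; parallel_efficiency is an illustrative helper name.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling that was actually achieved."""
    return speedup / workers


# 27x learning speedup on 64 GPUs, as stated on the slide.
eff = parallel_efficiency(27, 64)
print(f"parallel efficiency: {eff:.1%}")
```

An efficiency of about 42% suggests that communication between GPUs is a significant cost at this scale.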
Doubles deep learning neural network scale
Developed technology to streamline internal memory of GPUs to support growing neural network scale that works to heighten machine learning accuracy
Enabled neural network machine learning at a scale up to twice what was possible with previous technology
Response after press release
“How A New Technology Promises To Make Learning More Powerful Than It Already Is” By Kelvin Murae, Forbes
4% more accuracy
Conventional Our approach
Same memory
2x more images
Press release on Sep. 21st 2016
Deep learning processor: DLU™ (Deep Learning Unit)
Dedicated architecture for deep learning
Supercomputer’s interconnect (Tofu interconnect): max. 100,000 DLU connections
Extremely low power design
[Figure: DLU block diagram: host I/F, HBM2, and DPU-0 to DPU-n, each DPU containing multiple DPEs]
Simple & Specific Core
Extreme Parallelism
Relaxed Accuracy
Press release on Nov. 29th 2016
Quantum-Mechanics-Inspired Computing
Motivation: combinatorial optimization problem
An efficient approach is needed to handle the explosion of combinations
Various combinatorial optimization problems in the real world: power delivery, disaster recovery, investment portfolio
Vehicle routing problem: finding the optimal routes for delivering vehicles to 2,000 customers
[Figure: route map with a depot, a port, and customer1, customer2, customer3]
Calculation time increases exponentially with the number of customers
We need to choose the optimal route out of an enormous number of combinations to minimize the cost
[Figure: routing pattern: choosing which of Customer1 to Customer2000 is visited 1st, 2nd, 3rd, ... gives ~10^7535 order combinations; which one has the minimum cost?]
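To get a feel for the explosion of combinations, the number of possible visiting orders of n customers can be estimated as n!. The sketch below assumes that plain permutation count, which may differ from how the slide counts combinations, and uses the log-gamma function so that the huge number never has to be built.

```python
import math


def log10_factorial(n):
    """log10(n!) computed via the log-gamma function, since lgamma(n+1) = ln(n!)."""
    return math.lgamma(n + 1) / math.log(10)


for n in (10, 100, 2000):
    print(f"{n} customers: about 10^{log10_factorial(n):.1f} visiting orders")
```

For n = 2000 this gives roughly 10^5736, the same astronomically large scale as the count quoted on the slide, although the exact exponent depends on how the combinations are counted. Exhaustive enumeration is out of the question either way.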
Our strategy to solve optimization problems
Create a high-speed and widely applicable architecture
[Figure: speed (slow to fast) versus applicability (limitation of problems to applicable to practical problems): conventional processor, quantum computer*, our goal]
* Quantum annealing type
• Locating power grid failure
• Pick-up and delivery of 2000 depots
• Locating failures in 20-breaker power grid
• Map coloring
Quantum-Mechanics-Inspired Computer
Architecture to meet usability and scalability for combinatorial optimization
Solve practical problems by using CMOS digital design
Realize scalability for larger problems and speed enhancement
Features
Simple core reduces data movement and control overheads.
Massively-parallel stochastic search is implemented to accelerate search paths.
Multiple engines for larger problems
Further speed up achieved by parallelism
Speed up by parallel score calculation and transition facilitation
Press release on Oct. 20th 2016
Simple & Specific Core
Extreme Parallelism
Relaxed Accuracy
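The slide names a massively-parallel stochastic search with parallel score calculation and transition facilitation, but gives no algorithmic detail. The sketch below is one plausible software reading and not Fujitsu's design: an annealing-style search on a small binary quadratic (QUBO) problem that scores every single-bit flip at each step and, when no flip is accepted, raises an offset that makes the next transition easier. All names, parameters and the example matrix are assumptions.

```python
import math
import random


def energy(q, x):
    """QUBO cost: sum over i <= j of q[i][j] * x[i] * x[j] (q upper-triangular)."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def flip_delta(q, x, k):
    """Change in cost if bit k is flipped, without recomputing the full cost."""
    field = q[k][k] + sum(q[min(i, k)][max(i, k)] * x[i]
                          for i in range(len(x)) if i != k)
    return field if x[k] == 0 else -field


def parallel_trial_search(q, steps=2000, t_start=2.0, t_end=0.05,
                          offset_step=0.1, seed=0):
    rng = random.Random(seed)
    n = len(q)
    x = [rng.randint(0, 1) for _ in range(n)]
    e = energy(q, x)
    best, best_e = x[:], e
    offset = 0.0
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        # Score every candidate flip; hardware could do this concurrently.
        deltas = [flip_delta(q, x, k) for k in range(n)]
        accepted = [k for k, d in enumerate(deltas)
                    if d - offset <= 0
                    or rng.random() < math.exp(-(d - offset) / t)]
        if accepted:
            k = rng.choice(accepted)  # apply one accepted flip
            x[k] ^= 1
            e += deltas[k]
            offset = 0.0
            if e < best_e:
                best, best_e = x[:], e
        else:
            offset += offset_step  # stuck: make the next move easier
    return best, best_e


# Hypothetical 3-variable problem: reward each bit, penalize adjacent pairs.
q = [[-1, 2, 0],
     [0, -1, 2],
     [0, 0, -1]]
print(parallel_trial_search(q))
```

Scoring all flips per step trades extra arithmetic for fewer wasted steps, which suits many simple cores, and the offset is one way to escape local minima without waiting for a rare random acceptance.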
Evaluation of our prototype
Engine performance evaluated using FPGA implementation
A 12,000x speedup was confirmed using a 32-city traveling salesman problem
[Figure: time to solution (sec), log scale from 0.1 to 10,000, for a conventional processor* and this work on FPGA; speedup factors shown: 2x, 1000x, 6x, and 12,000x overall; labels: Parallel Score Calculation, Transition Facilitation]
* 3.5-GHz Intel Xeon E5
Demonstration
Ecosystem of combinatorial optimizer
Collaborate with universities, research institutes and industries to apply our technologies to practical problems
[Figure: ecosystem around the combinatorial optimizer; labels: Fujitsu, Universities, Research Institute, Software Development Environment, Enhanced Engine, Open Framework on Cloud Service, Practical Application, Delivery, CAD, AI, PoB, Early Users, User Community]
Approximate Computing
Approximate computing
Optimizing accuracy to the target workload enables higher performance and higher energy efficiency at the same time.
[Figure: types of processing positioned relative to the limit of Moore’s law and new metrics: conventional computing, supercomputer, accelerator, approximate computing, neural computing (learning), neural computing (inference), brain-inspired computing, quantum computing; Approximate Computing highlighted]
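One common form of approximate computing is reduced-precision arithmetic. The sketch below is an illustration and not a method described in the talk: it quantizes two vectors to 8-bit integers, computes their dot product in integer arithmetic, and compares the result with the full-precision value. The function names and test vectors are assumptions, and quantize assumes the input is not all zeros.

```python
def quantize(values, bits=8):
    """Map floats to signed integers of the given width; returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax  # assumes a non-zero input
    return [round(v / scale) for v in values], scale


def approx_dot(a, b, bits=8):
    """Dot product computed on quantized integers, then rescaled."""
    qa, sa = quantize(a, bits)
    qb, sb = quantize(b, bits)
    return sum(x * y for x, y in zip(qa, qb)) * sa * sb


a = [i / 10 for i in range(1, 21)]       # 0.1 ... 2.0
b = [1.0 + 0.05 * i for i in range(20)]  # 1.0 ... 1.95
exact = sum(x * y for x, y in zip(a, b))
approx = approx_dot(a, b)
rel_error = abs(approx - exact) / abs(exact)
print(f"exact={exact:.4f} approx={approx:.4f} relative error={rel_error:.3%}")
```

If the target workload tolerates an error of this size, narrow integer units can replace floating-point units, which is where the performance and energy gains would come from.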
Summary
Computing innovations beyond Moore’s law
Fujitsu will continue to innovate computing architecture
[Figure: performance versus year, 2010 to 2030: conventional computing paradigm versus new computing paradigm; domain specific computing: media processing, neural computing, quantum-inspired computing, approximate computing]