Deep Learning using the IBM Power cluster at Virginia Tech

Ahmed Ibrahim, Computational Scientist, Virginia Tech. Join the Conversation #OpenPOWERSummit. Deep Learning using the IBM Power cluster at Virginia Tech.


Post on 22-May-2020


TRANSCRIPT

Page 1: Deep Learning using the IBM Power cluster at Virginia Tech

Ahmed Ibrahim, Computational Scientist, Virginia Tech

Join the Conversation #OpenPOWERSummit

Deep Learning using the IBM Power cluster at Virginia Tech

Page 2:

Overview

Advanced Research Computing manages centralized high-performance computing infrastructure for the Virginia Tech research community:

• 7 Colleges, 100+ Departments
• Biocomplexity Institute of Virginia Tech
• Fralin Life Science Institute
• Institute for Critical Technology and Applied Science
• Institute for Creativity, Arts and Technology
• Virginia Tech Transportation Institute
• Virginia Tech Carilion Research Institute
• Ted and Karyn Hume Center for National Security and Technology
• and others…

Page 3:

Overview

• Machine learning: mathematical and statistical tools used to build models from data
○ healthcare analytics
○ business analytics
○ security analytics

• Deep learning: class of methods that rely on deep neural networks, such as convolutional neural networks (CNNs)
• wide range of applications
- image recognition
- natural language processing
- general kinds of classification tasks

• training: build a CNN to apply to a particular problem
• inference: apply the CNN to extract information from data
• algorithms are “data hungry”
• training is bandwidth intensive
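The difference between training and inference can be illustrated with a very small sketch. This is not a CNN: a one-feature logistic classifier stands in for the network, and the toy data, function names, learning rate, and epoch count are all assumptions chosen for illustration. The point is only that training fits parameters to labeled data, while inference applies the fitted parameters to new data.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, lr=0.5, epochs=200):
    """Training: fit weight w and bias b by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(w * x + b) - y  # prediction error on one sample
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def infer(params, x):
    """Inference: apply the already-trained parameters to a new input."""
    w, b = params
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical toy data: negative inputs are class 0, positive inputs class 1.
toy_x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
toy_y = [0, 0, 0, 1, 1, 1]

params = train(toy_x, toy_y)                            # done once, expensive
predictions = [infer(params, x) for x in (-1.5, 1.5)]   # done many times, cheap
```

In a real deep learning workload the training loop dominates the cost, which is why it is the part that benefits most from GPUs and fast data movement.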

Page 4:

• IBM “Minsky” compute node: higher performance per node, which increases research productivity per unit time from a single physical rack

• Fourteen nodes, IBM Power8 CPU + 4 NVIDIA P100 GPUs per node
• Data movement limits performance for most applications; NVLink accelerates data transfers within a node
• Deep learning applications see the largest performance advantages
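To see why data movement matters, a back-of-the-envelope estimate of transfer time is often enough. The sketch below divides a payload size by a link bandwidth. The bandwidth constants are nominal peak figures used here as assumptions (roughly 16 GB/s per direction for PCIe 3.0 x16, and 40 GB/s per direction for two ganged first-generation NVLink links), and the batch shape is hypothetical. Measured bandwidth on a real node will be lower, so treat the numbers as placeholders to replace with your own measurements.

```python
def transfer_seconds(num_bytes, gigabytes_per_second):
    """Idealized transfer time: payload size divided by link bandwidth."""
    if gigabytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (gigabytes_per_second * 1e9)


# Hypothetical batch: 256 RGB images of 224x224 pixels stored as float32.
BATCH_BYTES = 256 * 3 * 224 * 224 * 4

# Assumed nominal peak bandwidths per direction, in GB/s (not measurements).
PCIE3_X16_GBPS = 16.0
NVLINK_TWO_LINKS_GBPS = 40.0

pcie_time = transfer_seconds(BATCH_BYTES, PCIE3_X16_GBPS)
nvlink_time = transfer_seconds(BATCH_BYTES, NVLINK_TWO_LINKS_GBPS)

# Ratio of the two idealized times; depends only on the assumed bandwidths.
speedup = pcie_time / nvlink_time

print(f"PCIe:   {pcie_time * 1e3:.2f} ms per batch")
print(f"NVLink: {nvlink_time * 1e3:.2f} ms per batch")
print(f"Ratio:  {speedup:.2f}x")
```

The estimate ignores latency, protocol overhead, and overlap of transfers with computation, so it gives a rough bound on per-batch transfer time and not a prediction of end-to-end training speed.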

Image source: NVIDIA

Virginia Tech Deep Learning System (Huckleberry)

Page 5:

• IBM PowerAI software supports the major open-source deep learning frameworks, optimized for the POWER + NVIDIA architecture

Image source: IBM

Virginia Tech Deep Learning System (Huckleberry)

Page 6:

• System Setup (March-April 2017)
- Brought system online and integrated it with existing infrastructure

• Virginia Tech High Performance Computing Day (24 March 2017)
- Financial analytics demo
- DIGITS image recognition demo

• Researcher Early Access period (April-Sept 2017)
- Feedback from specific research teams at Virginia Tech
- Worked with IBM to address concerns and needs as they arose

• Access opened to entire Virginia Tech research community (Sept 2017)

• OpenPOWER Day (28-29 September 2017)

Huckleberry Deployment

Page 7:

• 2-Day Workshop hosted jointly by ARC/NLI in Torgersen 1100

• Day 1 (28 September): Speakers from IBM, Mellanox,
- OpenPOWER academic partnerships
- Software-defined networking and in-network computing
- Machine learning with Spark / SystemML
- Deep learning with PowerAI
- 70 attendees

• Day 2 (29 September): Hands-on tutorials on IBM PowerAI software
- Used Huckleberry for hands-on workshop
- Introduction to TensorFlow with Jupyter notebooks
- Introduction to Distributed Deep Learning (DDL)
- 60 attendees


IBM OpenPOWER Workshop


Page 9:

Research Highlights

Distributed environmental monitoring.

Heterogeneous topology control.

Ryan K. Williams, Assistant Professor, Electrical & Computer Engineering

• Distributed autonomous systems
- Coordinated motion control for heterogeneous multi-agent networks.
- Decision-making in multi-agent systems, e.g., auctioned task allocation and POMDPs.
- Distributed algorithms for estimation, optimization, and learning.

• Applications
- Cloud/edge computing for distributed autonomy.
- Networked dynamical security.
- Urban autonomy, vehicular networks, and smart infrastructure.
- Agricultural autonomy.
- Defense / national security.

Page 10:

Research Highlights

Jia-Bin Huang, Assistant Professor, Electrical & Computer Engineering, Virginia Tech

Page 11:

Research Highlights

Virginia Tech using the IBM Power cluster. P100 16 GB GPUs with NVLink.

16k layers still under development.

Ultra Deep Neural Networks

Page 12:

Research Highlights

Distributed Deep Learning using the IBM Power cluster at Virginia Tech

● Train a simple model on multiple GPUs.

● Use data parallelism.

● Try different batch sizes.

● Measure training time.
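The four steps above can be sketched in plain Python. This is a simulation under stated assumptions, not the experiment from the talk: the "GPUs" are simulated workers that run one after another in a single process, the model is a one-parameter linear fit, and every name, the learning rate, and the data are hypothetical. It shows the core of data parallelism: each worker computes a gradient on its shard of the batch, the gradients are averaged (the role an all-reduce plays on real hardware), and every worker applies the same update.

```python
import time


def gradient(w, xs, ys):
    """Mean-squared-error gradient for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split(items, num_workers):
    """Split a batch into equal contiguous shards, one per simulated GPU."""
    if len(items) % num_workers != 0:
        raise ValueError("batch size must be divisible by the worker count")
    shard = len(items) // num_workers
    return [items[i * shard:(i + 1) * shard] for i in range(num_workers)]


def data_parallel_step(w, xs, ys, num_workers, lr):
    """One update: per-shard gradients, then an average (stand-in for all-reduce)."""
    shards = zip(split(xs, num_workers), split(ys, num_workers))
    grads = [gradient(w, sx, sy) for sx, sy in shards]
    return w - lr * sum(grads) / num_workers


def train(xs, ys, batch_size, num_workers, lr=0.01, epochs=20):
    """Train with a given global batch size and return (weight, seconds)."""
    w = 0.0
    start = time.perf_counter()
    for _ in range(epochs):
        for i in range(0, len(xs), batch_size):
            w = data_parallel_step(
                w, xs[i:i + batch_size], ys[i:i + batch_size], num_workers, lr
            )
    return w, time.perf_counter() - start


# Hypothetical data generated from y = 3x, so the fitted weight should approach 3.
xs = [float(i % 8 + 1) for i in range(64)]
ys = [3.0 * x for x in xs]

# Try different global batch sizes on 4 simulated workers and time each run.
results = {bs: train(xs, ys, bs, num_workers=4) for bs in (8, 16, 32)}
for bs, (w, seconds) in results.items():
    print(f"batch={bs:3d}  w={w:.4f}  time={seconds * 1e3:.3f} ms")
```

Because the workers here run sequentially, the timings only show how to measure a run and do not reflect multi-GPU speedup. With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why the result does not depend on the number of workers.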

Image source: LeCun 98

Page 13:

Research Highlights

Distributed Deep Learning using the IBM Power cluster at Virginia Tech
