WorDS of Data Science in the Presence of Heterogenous Computing Architectures

WorDS of Data Science in the Presence of Heterogenous Computing Architectures. WorDS.sdsc.edu. Dr. Ilkay Altintas, Founder and Director, Workflows for Data Science (WorDS) Center of Excellence, San Diego Supercomputer Center, UC San Diego

Upload: ilkay-altintas-phd

Post on 14-Jul-2015


TRANSCRIPT

Page 1: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

WorDS of Data Science in the Presence of Heterogenous Computing Architectures

WorDS.sdsc.edu

Dr. Ilkay Altintas
Founder and Director, Workflows for Data Science (WorDS) Center of Excellence
San Diego Supercomputer Center, UC San Diego

 

Page 2: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education

•  Established as a national supercomputer resource center in 1985 by NSF

•  A world leader in HPC, data-intensive computing, and scientific data management

•  Current strategic focus on “Big Data”

1985  

today  

Page 3: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

     

Workflows for Data Science Center

•  Scientific Workflow Automation Technologies Research
•  Workflows for Cloud Systems
•  Big Data Applications
•  Reproducible Science
•  Workforce Training and Education
•  Development and Consulting Services

Focus on the question, not the technology!

10+ years of data science R&D experience as a Center.

Page 4: WorDS of Data Science in the Presence of Heterogenous Computing Architectures
Page 5: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

So,  what  is  a  workflow?  

Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors

Shop   Prepare   Cook   Store  

Page 6: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Let’s make pasta this evening!

Shop   Prepare   Cook   Store

30  minutes  

30  minutes  

15  minutes  

3  minutes  

15  minutes  

3  minutes  

Page 7: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

How to Cook Everything Fast

“How to Cook Everything Fast is a book of kitchen innovations. Time management— the essential principle of fast cooking— is woven into revolutionary recipes that do the thinking for you. You’ll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read—and let the recipes guide you quickly and easily toward a delicious result.”

Image and quote source: amazon.com

Page 8: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

What if you have more than one cook?

Page 9: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

MAP
•  Input: veggies
•  User-defined function (UDF): chop
•  Output: chopped groups of each kind of veggie

Page 10: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

REDUCE
•  Input: chopped batches for each veggie type
•  User-defined function (UDF): combine based on veggie type as key
•  Output: a bowl of veggies per veggie kind
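The MAP and REDUCE slides can be sketched in a few lines of plain Python. This is only an illustration of the pattern on one machine, not Hadoop code: the function names `chop`, `combine` and `map_reduce` and the sample veggies are assumptions made up for this example.

```python
from collections import defaultdict


def chop(veggie):
    """Map UDF: turn one veggie into a (kind, chopped piece) pair."""
    kind, name = veggie
    return kind, f"chopped {name}"


def combine(kind, pieces):
    """Reduce UDF: gather all chopped pieces of one kind into a bowl."""
    return kind, sorted(pieces)


def map_reduce(veggies):
    # MAP: apply the UDF to every input item independently.
    mapped = [chop(v) for v in veggies]
    # SHUFFLE: group the mapped values by key (the veggie kind).
    groups = defaultdict(list)
    for kind, piece in mapped:
        groups[kind].append(piece)
    # REDUCE: apply the UDF once per key.
    return dict(combine(kind, pieces) for kind, pieces in groups.items())


# Hypothetical input: (kind, individual veggie) pairs.
veggies = [("carrot", "carrot 1"), ("onion", "onion 1"), ("carrot", "carrot 2")]
bowls = map_reduce(veggies)
print(bowls)
```

Each call to `chop` is independent of the others, which is what lets several cooks (or compute nodes) work on the map step at the same time.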

Page 11: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Thanksgiving dinner preparation: more planning and tasks?

Menu Item         Preparation Time   Cooking Time   Cooling Time
Turkey            30 minutes         4 hours        15 minutes
Veggies           30 minutes         45 minutes     None
Cranberry Sauce   5 minutes          30 minutes     2 hours
Soup              20 minutes         30 minutes     None
Pie               30 minutes         5 minutes      1 day

•  When do you start cooking?
•  In what order do you cook?
•  Can you cook some menu items in parallel?
•  Who cooks what?
•  …
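The first of these questions can be answered by scheduling backwards from dinner time. The sketch below uses the times from the table and assumes, purely for illustration, a dinner at 18:00 on 26 November 2015 and enough cooks and ovens that no menu item has to wait for another. The function name `latest_start_times` is made up for this example.

```python
from datetime import datetime, timedelta

# (preparation, cooking, cooling) in minutes, taken from the table above.
MENU = {
    "Turkey": (30, 4 * 60, 15),
    "Veggies": (30, 45, 0),
    "Cranberry Sauce": (5, 30, 2 * 60),
    "Soup": (20, 30, 0),
    "Pie": (30, 5, 24 * 60),
}


def latest_start_times(menu, dinner_time):
    """Return the latest moment each item can be started and still be ready."""
    starts = {}
    for item, (prep, cook, cool) in menu.items():
        starts[item] = dinner_time - timedelta(minutes=prep + cook + cool)
    return starts


dinner = datetime(2015, 11, 26, 18, 0)  # hypothetical dinner time
starts = latest_start_times(MENU, dinner)

# Print the plan in the order the work has to begin.
for item, start in sorted(starts.items(), key=lambda pair: pair[1]):
    print(f"{start:%a %H:%M}  start {item}")
```

With fewer cooks or a single oven the items would compete for resources, and the plan would need a real scheduler rather than independent subtraction.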

Page 12: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Data Science Workflows
- Programmable, Reusable and Reproducible Scalability -

•  Access and query data
•  Scale computational analysis
•  Increase reuse
•  Save time, energy and money
•  Formalize and standardize

Real-Time Hazards Management: wifire.ucsd.edu

Data-Parallel Bioinformatics: bioKepler.org

Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu

kepler-project.org   WorDS.sdsc.edu

Page 13: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Why  scalable  and  reproducible  data  science?  

Page 14: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

The Big Picture is Supporting the Scientist

Conceptual SWF

Executable SWF

From  “Napkin  Drawings” to  Executable  Workflows  

Fasta  File  

Circonspect  

 Average  Genome  Size  

 Combine  Results   PHACCS  

Page 15: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

The Big Picture is Supporting the Data Scientist

Conceptual SWF

Executable SWF

From “Napkin Drawings” to Executable Workflows…

SBNL workflow

Local Learner

Data Quality Evaluation

Local Ensemble Learning

Quality Evaluation & Data Partitioning

Big Data

Master Learner

Master Ensemble Learning

Final BN Structure

Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning

Page 16: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Ptolemy II: A laboratory for investigating design

KEPLER: A problem-solving environment for Scientific Workflow

KEPLER = “Ptolemy II + X” for Scientific Workflows

Kepler is a Scientific Workflow System

•  A cross-project collaboration… initiated August 2003

•  2.4 released 04/2013

www.kepler-project.org

•  Builds upon the open-source Ptolemy II framework

Page 17: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

A Toolbox with Many Tools

Need expertise to identify which tool to use when and how! Require computation models to schedule and optimize execution!

•  Data
   –  Search, database access, IO operations, streaming data in real-time…

•  Compute
   –  Data-parallel patterns, external execution, …

•  Network operations
•  Provenance and fault tolerance

Page 18: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

So,

how does this relate to data science, big data and supercomputing?

Page 19: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Distributed Computing

Computing using more than one computer connected through a network.

•  Types of distributed computing:
   –  Computers in local area network
   –  Cluster or High-Performance Computing
   –  Grid
   –  Cloud

Page 20: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Cluster or High-Performance Computing

•  Built from multiple computers

•  May have
   –  parallel file system
   –  high-speed network

•  Provides a scheduler to manage the machines and submitted jobs
   –  SGE/OGE, PBS, Condor, LSF, SLURM

Page 21: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Parallelization

Multiple processes or threads running at the same time

•  Execution environments
   –  One machine
   –  Distributed machines

•  Parallelism Types
   –  Computation/task parallelism
   –  Data parallelism
   –  Pipeline parallelism

Page 22: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

[Figure: three diagrams labeled Task Parallelism, Data Parallelism and Pipeline Parallelism. Each shows numbered tasks (Task 1, Task 2, Task 3, ...) marked Finished, Running or Waiting; the data-parallel and pipeline diagrams are fed by an Input Data Set.]

There  are  different  styles  of  parallelism!  
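A minimal sketch of the three styles, using only the Python standard library. The helper functions `clean`, `upper` and `stage` and the tiny input list are assumptions for illustration; threads are used here to show the structure of each style, not to demonstrate a speed-up.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def clean(text):
    return text.strip()


def upper(text):
    return text.upper()


data = [" a ", " b ", " c "]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Task parallelism: different, independent tasks run at the same time.
    total = pool.submit(sum, [1, 2, 3])
    longest = pool.submit(max, ["aa", "b"], key=len)
    task_results = (total.result(), longest.result())

    # Data parallelism: the same task runs on different pieces of the input.
    data_results = list(pool.map(clean, data))


def stage(func, inbox, outbox):
    """One pipeline stage: process items until the None sentinel arrives."""
    for item in iter(inbox.get, None):
        outbox.put(func(item))
    outbox.put(None)  # pass the end-of-stream marker downstream


# Pipeline parallelism: stage 2 can work on item 1 while stage 1 handles item 2.
q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=stage, args=(clean, q_in, q_mid)),
    threading.Thread(target=stage, args=(upper, q_mid, q_out)),
]
for worker in workers:
    worker.start()
for item in data:
    q_in.put(item)
q_in.put(None)
pipeline_results = list(iter(q_out.get, None))
for worker in workers:
    worker.join()

print(task_results, data_results, pipeline_results)
```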

Page 23: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Big Data: Short Definition

•  Some features (“V’s”) of big data
   –  Volume: amount of data
   –  Velocity: speed of data in and out
   –  Variety: range of data types and sources
   –  Veracity: trustworthiness of data

Picture credit: IBM 2012

Page 24: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Distributed Data-Parallel Computing

•  A parallel and scalable programming model for Big Data
   –  Input data is automatically partitioned onto multiple nodes
   –  Programs are distributed and executed in parallel on the partitioned data blocks

MapReduce

Move program to data!

Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM

Page 25: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Distributed Data-Parallel (DDP) Patterns

Patterns for data distribution and parallel data processing

•  A higher-level programming model
   –  Moving computation to data
   –  Good scalability and performance acceleration
   –  Run-time features such as fault-tolerance
   –  Easier parallel programming than MPI and OpenMP

Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM

Page 26: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Hadoop

•  Open source implementation of MapReduce

•  A distributed file system across compute nodes (HDFS)
   –  Automatic data partition
   –  Automatic data replication

•  Master and workers/slaves architecture

•  Automatic task re-execution for failed tasks

Spark

•  Fast Big Data Engine
   –  Keeps data in memory as much as possible

•  Resilient Distributed Datasets (RDDs)
   –  Evaluated lazily
   –  Keeps track of lineage for fault tolerance

•  More operators than just Map and Reduce

•  Can run on YARN (Hadoop v2)
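The two RDD properties listed above, lazy evaluation and lineage tracking, can be illustrated with a toy class. `ToyDataset` is a hypothetical single-machine stand-in written for this note; it is not Spark's RDD API and does not distribute anything.

```python
class ToyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's RDD API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original input data
        self.lineage = lineage   # recorded (operation, function) steps

    def map(self, func):
        # Nothing is computed here; the step is only recorded.
        return ToyDataset(self._source, self.lineage + (("map", func),))

    def filter(self, predicate):
        return ToyDataset(self._source, self.lineage + (("filter", predicate),))

    def collect(self):
        # Evaluation happens only now, by replaying the recorded lineage.
        items = list(self._source)
        for operation, func in self.lineage:
            if operation == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items


calls = []


def double(x):
    calls.append(x)  # record every real invocation
    return 2 * x


dataset = ToyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_collect = list(calls)  # still empty: nothing has run yet
result = dataset.collect()
recomputed = dataset.collect()      # a lost result can be rebuilt from lineage
print(result)
```

Because the dataset remembers how it was derived, a lost result can be recomputed from the source instead of being restored from a replica, which is the idea behind lineage-based fault tolerance.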

Page 27: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Getting Value out of All This

Page 28: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

My favorite definition of Data Science

“By "Data Science", we mean almost everything that has something to do with data: Collecting, analyzing, modeling...... yet the most important part is its applications --- all sorts of applications.”
Journal of Data Science (http://www.jds-online.com/about)

Implies -- programming, data analysis, and problem solving

Page 29: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Some  P’s  of  Data  Science  

People

Process

Platforms

Purpose

Programmability

Page 30: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

There are more: provenance, publication, product, performance, policy, profit, ...

Page 31: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

People…  

People

Page 32: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Data Scientist Skill Set

http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/

Page 33: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Unicorn?

http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html

Page 34: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Solution: Scale the Data Scientists

Standardize the data science process, not the tools!

Standardized processes enable data scientists to communicate with business and programming partners.

Also, what these definitions really mean is “computational and data scientists”.

Page 35: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Some  P’s  of  Data  Science  

Process

Page 36: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Defining a Typical Data Science Process

Find data, Access data, Acquire data, Move data

Clean data, Integrate data, Subset data, Pre-process data

Analyze data, Process data

Interpret results, Summarize results, Visualize results, Post-process results

Some questions to ask:
•  Where and how do I get the data?
•  What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
•  How do I integrate or subset datasets, e.g., knowledge representation, …?
•  How do I analyze the data and what is the analysis function?
•  What are the parameters to customize each step?
•  What are the computing needs to schedule and run each step?
•  How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?

configurable automated analysis
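One way to read “configurable automated analysis” is as a fixed sequence of steps whose behavior is driven by parameters. The sketch below is a hedged illustration of that reading: the step functions, the `config` keys and the sample readings are all assumptions made up for this example, not part of any WorDS or Kepler API.

```python
def acquire(config):
    # Stand-in for finding, accessing and moving data.
    return list(config["raw_data"])


def prepare(records, config):
    # Clean (drop missing values), then subset (keep values above a threshold).
    cleaned = [r for r in records if r is not None]
    return [r for r in cleaned if r >= config["min_value"]]


def analyze(records, config):
    # The analysis function is itself a parameter of the process.
    return config["analysis"](records)


def report(result, config):
    return f"{config['label']}: {result}"


def run_process(config):
    """Run the same four steps for any configuration."""
    records = acquire(config)
    records = prepare(records, config)
    result = analyze(records, config)
    return report(result, config)


# Hypothetical configuration for one run.
config = {
    "raw_data": [3, None, 10, 7, 1],
    "min_value": 3,
    "analysis": max,
    "label": "max reading",
}
summary = run_process(config)

# Re-running with a different analysis function needs no change to the steps.
total_summary = run_process({**config, "analysis": sum, "label": "total"})
print(summary, total_summary, sep="\n")
```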

Page 37: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Some  P’s  of  Data  Science  

People

Process

Purpose

Page 38: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Purpose…

“You've got to think about big things while you're doing small things, so that all the small things go in the right direction.” – Alvin Toffler

use cases => purpose and value

Page 39: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Integration of Many Tools to Serve a Purpose

Need toolboxes with many tools for:
•  data access,
•  analysis,
•  scalable execution,
•  fault tolerance,
•  provenance tracking,
•  reporting
•  ...

Business Analysis

Operations Research

Adapted from: B. Tierney, 2013

Page 40: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Many Alternatives

•  Alternative tools
•  Multiple modes of scalability
•  Support for each step of the development and production process
•  Different reporting needs for exploration and production stages

Build   Explore   Scale   Report

Page 41: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Build Once, Run Many Times…

•  Data science process should support experimental work and dynamic scalability on many platforms

•  Scalability based on:
   –  data volume and velocity
   –  dynamic modeling needs
   –  highly-optimized HPC codes
   –  changes in network, storage and computing availability

Page 42: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Scalability across platforms…

People

Process

Platforms

Purpose

Page 43: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Running on Heterogeneous Computing Resources

- Execution of programs where they run most efficiently -

Gordon   Trestles

Local Cluster Resources

NSF/DOE: TeraScale Resources (XSEDE)

(Gordon)   (Comet)

(Stampede)   (Lonestar)

Private Cluster: User Owned Resources

Different executables have different computing architecture needs!

e.g., memory-intensive, compute-intensive, I/O-intensive
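A minimal sketch of matching an executable's profile to the resource that suits it best. Everything here is hypothetical: the resource names, their numbers and the `pick_resource` rule are invented for illustration and do not describe Gordon, Comet, Stampede or any real scheduler.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    memory_gb: int
    cores: int
    io_mb_per_s: int


@dataclass
class Job:
    name: str
    profile: str  # "memory", "compute" or "io"


# Hypothetical resource catalogue; the numbers are made up.
RESOURCES = [
    Resource("local-cluster", memory_gb=64, cores=16, io_mb_per_s=200),
    Resource("large-memory-node", memory_gb=1024, cores=32, io_mb_per_s=400),
    Resource("many-core-node", memory_gb=128, cores=256, io_mb_per_s=300),
    Resource("flash-storage-node", memory_gb=64, cores=16, io_mb_per_s=4000),
]

# Which resource attribute matters most for each job profile.
KEY_FOR_PROFILE = {"memory": "memory_gb", "compute": "cores", "io": "io_mb_per_s"}


def pick_resource(job, resources=RESOURCES):
    """Choose the resource with the largest value of the attribute the job needs."""
    attribute = KEY_FOR_PROFILE[job.profile]
    return max(resources, key=lambda resource: getattr(resource, attribute))


jobs = [Job("assembly", "memory"), Job("simulation", "compute"), Job("indexing", "io")]
placement = {job.name: pick_resource(job).name for job in jobs}
print(placement)
```

A real placement decision would also weigh queue wait times, data locality and cost, which this static rule ignores.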

Page 44: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Challenges for Heterogeneous Computing

•  Dynamic scheduling optimization needed
   –  Based on network availability
   –  Data transfer and locality
   –  Energy efficiency
   –  Availability of exascale memory hierarchies
   –  Workload changes

•  Better programmable communication between workflow systems and infrastructure for computing, storage and network

Page 45: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Programmability  for  scalability,  reusability  and  reproducibility  

People

Process

Platforms

Purpose

Programmability

Page 46: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Using Big Data Computing in Bioinformatics

- Improving Programmability, Scalability and Reproducibility -

biokepler.org

Page 47: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Gateways  and  other  user  environments  

bioKepler  Kepler  and  Provenance  Framework  

BioLinux     Galaxy   Clovr     Hadoop  

CLOUD  and  OTHER  COMPUTING  RESOURCES  e.g.,  SGE,  Amazon,  FutureGrid,  XSEDE  

www.bioKepler.org

A coordinated ecosystem of biological and technological packages for bioinformatics!

Page 48: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Same approach can be applied to machine learning and other application areas!

- REUSABILITY and REPURPOSABILITY -

Page 49: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Flexible programming of K-means

•  R: Programming language and software environment for statistical computing and graphics.

•  KNIME: Platform for data analytics.

•  MLlib: Scalable machine learning library running on the Spark cluster computing framework.

•  Mahout: Scalable machine learning library based on MapReduce.
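For reference, the algorithm that each of these tools packages can be sketched in plain Python. This is a minimal, single-machine version of the standard K-means iteration (Lloyd's algorithm) on 2-D points. It does not use the API of R, KNIME, MLlib or Mahout, and the sample points and starting centroids are made up.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster

        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups of hypothetical points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The assignment step is independent for every point, which is why the scalable libraries above can spread it across a cluster.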

Page 50: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Scalable Bayesian Network Learning

SBNL workflow

Local Learner

Data Quality Evaluation

Local Ensemble Learning

Quality Evaluation & Data Partitioning

Big Data

Master Learner

Master Ensemble Learning

Final BN Structure

Kepler Workflow

Page 51: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

BN Workflow

•  Top level workflow
   –  PartitionData: RExpression actor that contains R script for the data partitioning step
   –  DDPNetworkLearner: Composite actor using MapReduce to perform parallel ensemble learning

Page 52: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

WorDS – Simple and Scalable Big Data Solutions using Workflows

Focus on the use case, not the technology!

•  Develop new big data science technologies and infrastructure

•  Develop data science workflow applications through combination of tools, technologies and best practices

•  Hands on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, Yarn, Cascading

•  Technology briefings and applied classes on end-to-end support for data science

Page 53: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Using Workflows and Cyberinfrastructure for Wildfire Resilience

- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -

wifire.ucsd.edu  

Page 54: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)

Development of: “cyberinfrastructure” for “analysis of large dimensional heterogeneous real-time sensed data” for fire resilience before, during and after a wildfire

Page 55: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

What is lacking in disaster management today is…

a system integration of real-time sensor networks, satellite imagery, near-real time data management tools, wildfire simulation tools, and connectivity to emergency command centers

… before, during and after a firestorm.

Page 56: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

http://nbcr.ucsd.edu/

Integrated Multi-Scale Biomedical Modeling Workflows in NBCR

Page 57: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Local Execution Option

User MD-Parameter Configuration Option

Molecular Dynamic CADD Workflow

Amber Molecular Dynamics Package

Local: NBCR Cluster Resources

NSF/DOE: TeraScale Resources (XSEDE)

(Stampede)

NBCR and User Owned Cloud Resources

(Comet)

BENEFITS:
•  Enable users to configure MD job parameters through command-line, GUI or web interface.
•  Scale for multiple compounds in parallel
•  Run on multiple computing platforms
•  Increase reuse
•  Provenance

GPU or Gordon Execution Option

Page 58: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

http://hpc.pnl.gov/IPPD/

Predicting Workflow Performance from Provenance

Page 59: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

https://smartmanufacturingcoalition.org/

Workflows-as-a-Service

Page 60: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

To Sum Up

•  Workflows and provenance are well-adopted in scientific infrastructures today, with success
•  WorDS Center applies these concepts to advanced dynamic data-driven analytics applications

•  One size does not fit all!
   –  Many diverse environments and requirements
   –  Need to orchestrate at a higher level
   –  Higher level programming components for each domain

•  Lots of future challenges on
   –  Optimized execution on heterogeneous platforms
   –  Programmable interface to workload, storage and network needed
   –  Increasing reuse within and across application domains
   –  Querying and integration of workflow provenance data into performance prediction

Page 61: WorDS of Data Science in the Presence of Heterogenous Computing Architectures

Questions?

WorDS Director: Ilkay Altintas, Ph.D.

Email: altintas@sdsc.edu