towards a data science...

24
www.cineca.it Symposium on HPC & Data-Intensive Applications in Earth Science November 14 th , Trieste Carlo Cavazzoni Giuseppe Fiameni Towards a data science infrastructure Challenges and treats to afford the data deluge

Upload: others

Post on 20-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Symposium on HPC & Data-Intensive Applications in Earth Science

November 14th, Trieste

Carlo Cavazzoni Giuseppe Fiameni

Towards a data science infrastructure

Challenges and treats to afford the data deluge

Page 2: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

http://www.wired.com/wired/issue/16-07 September 2008

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 2

Page 3: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

The 4 paradigms of Scientific Research

1. Theory 2. Experiment or Observation 3. Simulation of theory or model

Supercomputers 4. Data-driven (Big Data) or The Fourth

Paradigm: Data-Intensive Scientific Discovery (aka Data Science) •  More data; less models

26/11/2014 3 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 4: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it 4

The Shift Towards a data centric view

ü  All science is becoming data-dominated •  Experiment, computation, theory •  Fourth paradigm

§  http://research.microsoft.com/en-us/collaboration/fourthparadigm/

ü  Classes of data •  Collections, observations, experiments, simulations •  Software •  Publications

ü  Totally new methodologies •  Algorithms, mathematics, culture

ü Data become the medium for multidisciplinarity science

26/11/2014 4 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 5: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

DIKW Process

ü  Data becomes ü  Information becomes ü  Knowledge becomes ü  Wisdom or Decisions

•  Community acceptance of results is important (experiments reproducibility)

•  Volume of bits&bytes decreases as we proceed down DIKW pipeline

26/11/2014 5 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 6: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Data as an asset

“One of the most significant changes of the past decade has been the widespread recognition of data as an asset rather than the refuse of research.”

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 6

Page 7: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it Source: Ricardo Guimarães Herrmann

Data Deluge is also Information/Knowledge/Wisdom/Decision/

System Deluge?

26/11/2014 7 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 8: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it 26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 8

Courtesy of Geoffrey Fox

Page 9: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Pyramid or Iceberg

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 9

HPC

HPDA/HTC

Utility Computing, Services, Visualization,

Analytics, Cloud

DATA management

Varie

ty

Velo

city

Fundamental question:

how to remove boundaries among layers? How do we

enable inter-operability/integration?

Page 10: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Science Computing Environments

ü  Large Scale Supercomputers – multicore nodes linked by high performance low latency network •  Increasingly with GPU/Accelerator enhancement •  Suitable for highly parallel simulations

ü  High Throughput Systems – cluster/farm typically aimed at pleasingly parallel jobs •  Classic example is LHC data analysis

ü  High Performance Data Analytics - combination of high computation resources and analytics methods •  MapReduce •  Graph •  Relation and NoSQL database

ü  Federation of resources to enable convenient access to multiple backend systems including supercomputers

ü  Use Services (SaaS) •  Portals make access convenient and •  Workflow integrates multiple processes into a single job •  Visualization either remote or in-situ

10 26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 11: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Classic HPC systems do help

HPC

Data Analytics

• 3D physics simulations

• Linear Algebra • FLOPS

• Graph oriented algorithm

• GTEPS/IOPS

How to overcome this bottleneck? (Non volatile memory, SSD, MapReduce)

26/11/2014 11 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 12: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Emerging use case: the HBP case study

ü  Human Brain Project: understand the human brain •  six ICT platforms to Neuroinformatics, Brain

Simulation, High Performance Computing, Medical Informatics, Neuromorphic Computing and Neurorobotics

Data Management

Data Analytics

High Performance

Visualization

SaaS/Cloud

26/11/2014 12 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 13: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

ü  Neuroimage processing ü  Genome data processing (NGS) ü  Geo Science ü High Throughput Material Science ü  Pharmacy, Oil&Gas and Finance ü  Astrophysics

They require more and more data processing & analytics Together with the typical HPC Number crunching

Work-load

Other emerging use cases

26/11/2014 13 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 14: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Workspace Workspace Workspace

Front End Cluster

DB Web serv.

Web Archive FTP

Tape

HPC Engine

Laboratories

PRACE EUDAT

Other Data Sources

External Data Sources

Human Brain Prj

HPC Engine

HPC “island” Infrastructure

26/11/2014 14 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 15: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Workspace 6PByte

Core Data Processing

viz Big mem DB

Data mover processing

Web serv.

Web Hadoop

Core Data Store

Repository 5PByte

Tape 12

PByte Internal data

sources

Cloud service

Scale-Out Data Processing

FERMI

X86 Cluster

Laboratories

PRACE EUDAT

Other Data Sources

External Data Sources

Human Brain Prj

SaaS APP

Analytics APP Parallel APP

(data centric) Infrastructure

= NEW

26/11/2014 15 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 16: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

The PICO system

Total Nodes CPU Cores per Nodes Memory (RAM) Notes

Compute login node 66 Intel Xeon E5 2670

v2 @2.5Ghz 20 128 GB

Visualization node 2 Intel Xeon E5 2670 v2 @ 2.5Ghz 20 128 GB 2 GPU Nvidia K40

Big Mem node 2 Intel Xeon E5 2650 v2 @ 2.6 Ghz 16 512 GB 1 GPU Nvidia K20

BigInsight node 4 Intel Xeon E5 2650 v2 @ 2.6 Ghz 16 64 GB 32TB of local disk

SSD Storage 40 TB

http://www.hpc.cineca.it/hardware/pico 26/11/2014 16 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 17: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

The PICO System

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 17

Page 18: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

New Services

ü  Data management •  Preservation •  Sharing •  Moving

ü  Data analytics J •  MPI/OpenMP •  Higher GTEPS/IOPS •  MapReduce (Hadoop/Spark)

ü  Relational and NoSQL database ü  Visualization ü  Cloud

•  Support for RDMA •  Various O.S. (Microsoft)

ü  Higher integration among the systems •  Everything is connected through IB

ü  ??

26/11/2014 18 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 19: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

What Applications work in Clouds

ü  Embarrassing parallel applications with roughly independent data or spawning independent simulations •  Long tail of science and integration of distributed

sensors ü  Commercial and Science Data analytics

•  that can use MapReduce and need an interactive environment

•  that use legacy applications ü  Industries that require high isolation (tenancy)

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 19

Page 20: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Virtualization is dead, long live to virtualization

ü HPC in a container •  light weight virtualization mechanism which isolates each

application by packaging it into virtual containers •  every application has its own dependencies, which

include both software (services, libraries) and hardware (CPU, memory) resources

•  run on any Linux server making the application portable and giving the users flexibility to run the application on any framework

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 20

http://www.docker.com/whatisdocker

Page 21: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Conclusion

“Large-scale science and engineering problems require collaborative use of many compute and data resources, including supercomputers and large-scale data storage systems, all of which must be integrated with applications and data that are developed by different teams of researchers or that are obtained from different instruments and all of which are at different geographic locations.”

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 21

Page 22: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Initiatives

ü  Tools and techniques for massive data analysis , Cineca Bologna (PATC Course), Dec 15th-16th 2014 (complete L )

ü  PICO: the CINECA solutions for Big Data Science , Cineca-Bologna, Dec 5th 2014

ü  Call for interest at this link . Deadline is Monday, November 17th at 9:00 am http://goo.gl/Wtv8oM

26/11/2014 22 Symposium on HPC & Data-Intensive Applications in Earth Science

Page 23: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

The EUDAT Project

ü  European Data Infrastructure for addressing data management challenges at EU level

ü  5 core services •  Long Term Preservation •  Staging towards computational facility •  Sharing •  Search through meta-data •  Deposit (like DropBox)

ü Many scientific communities involved •  EPOS •  ENES •  LifeWatch •  …..

26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 23

Page 24: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application

www.cineca.it

Thanks for your attention! www.hpc.cineca.it

26/11/2014 24 Symposium on HPC & Data-Intensive Applications in Earth Science