![Page 1: Towards a data science infrastructureindico.ictp.it/event/a13229/session/19/contribution/78/material/0/0.pdf · application by packaging it into virtual containers • every application](https://reader036.vdocument.in/reader036/viewer/2022071010/5fc7f22c28a15719c8453ac0/html5/thumbnails/1.jpg)
www.cineca.it
Symposium on HPC & Data-Intensive Applications in Earth Science
November 14th, Trieste
Carlo Cavazzoni Giuseppe Fiameni
Towards a data science infrastructure
Challenges and threats in facing the data deluge
http://www.wired.com/wired/issue/16-07 September 2008
26/11/2014 Symposium on HPC & Data-Intensive Applications in Earth Science 2
The 4 paradigms of Scientific Research
1. Theory
2. Experiment or Observation
3. Simulation of theory or model
  • Supercomputers
4. Data-driven (Big Data), or The Fourth Paradigm: Data-Intensive Scientific Discovery (aka Data Science)
  • More data; fewer models
The Shift Towards a data centric view
✓ All science is becoming data-dominated
  • Experiment, computation, theory
  • Fourth paradigm
    ▪ http://research.microsoft.com/en-us/collaboration/fourthparadigm/
✓ Classes of data
  • Collections, observations, experiments, simulations
  • Software
  • Publications
✓ Totally new methodologies
  • Algorithms, mathematics, culture
✓ Data become the medium for multidisciplinary science
DIKW Process
✓ Data becomes
✓ Information becomes
✓ Knowledge becomes
✓ Wisdom or Decisions
• Community acceptance of results is important (reproducibility of experiments)
• Volume of bits and bytes decreases as we proceed down the DIKW pipeline
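As a rough illustration of the shrinking volume along the DIKW pipeline, the sketch below reduces a list of raw readings to a summary, then to a single finding, then to a decision. The sample readings, the 30.0 threshold, and all function names are hypothetical, and the JSON-encoded size is only a crude stand-in for "volume".

```python
import json
import statistics

# Hypothetical raw data: temperature readings (degrees C) from one sensor.
readings = [21.5, 22.0, 35.2, 34.8, 36.1, 22.3, 21.9, 35.5]

def to_information(data):
    """Data -> Information: summarize the raw readings."""
    return {"mean": round(statistics.mean(data), 2), "max": max(data), "n": len(data)}

def to_knowledge(info, threshold=30.0):
    """Information -> Knowledge: interpret the summary against a threshold."""
    return {"hot": info["max"] > threshold}

def to_decision(knowledge):
    """Knowledge -> Decision: pick an action."""
    return "alert" if knowledge["hot"] else "ok"

def size_in_bytes(obj):
    """Size of the JSON encoding, used here as a crude volume measure."""
    return len(json.dumps(obj).encode("utf-8"))

information = to_information(readings)
knowledge = to_knowledge(information)
decision = to_decision(knowledge)

stages = [("data", readings), ("information", information),
          ("knowledge", knowledge), ("decision", decision)]
sizes = [size_in_bytes(obj) for _, obj in stages]
for (name, _), size in zip(stages, sizes):
    print(f"{name:12s} {size:4d} bytes")
```

Each stage keeps less than the one before it, which is the point the slide makes about volume decreasing down the pipeline.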
Data as an asset
“One of the most significant changes of the past decade has been the widespread recognition of data as an asset rather than the refuse of research.”
Source: Ricardo Guimarães Herrmann
Data Deluge is also Information/Knowledge/Wisdom/Decision/System Deluge?
Courtesy of Geoffrey Fox
Pyramid or Iceberg
[Figure: layered diagram. Layers as listed: HPC; HPDA/HTC; Utility Computing, Services, Visualization, Analytics, Cloud; DATA management. Side labels: Variety, Velocity.]
Fundamental question: how do we remove the boundaries among layers? How do we enable interoperability/integration?
Science Computing Environments
✓ Large Scale Supercomputers: multicore nodes linked by a high-performance, low-latency network
  • Increasingly with GPU/accelerator enhancement
  • Suitable for highly parallel simulations
✓ High Throughput Systems: cluster/farm typically aimed at pleasingly parallel jobs
  • Classic example is LHC data analysis
✓ High Performance Data Analytics: combination of high computation resources and analytics methods
  • MapReduce
  • Graph
  • Relational and NoSQL databases
✓ Federation of resources to enable convenient access to multiple backend systems, including supercomputers
✓ Use Services (SaaS)
  • Portals make access convenient
  • Workflow integrates multiple processes into a single job
  • Visualization, either remote or in-situ
Classic HPC systems do help
HPC
• 3D physics simulations
• Linear Algebra
• FLOPS
Data Analytics
• Graph-oriented algorithms
• GTEPS/IOPS
How to overcome this bottleneck? (Non-volatile memory, SSD, MapReduce)
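To make the FLOPS versus GTEPS contrast concrete: graph workloads are commonly rated in traversed edges per second (TEPS) rather than in floating-point operations. The following is a minimal sketch, assuming a tiny hypothetical graph and a plain breadth-first search; real benchmarks such as Graph500 use far larger generated graphs and stricter counting rules, so the number printed here says nothing about any actual system.

```python
import time
from collections import deque

def bfs_edge_count(adjacency, source):
    """Breadth-first search that returns (visited set, edges examined)."""
    visited = {source}
    queue = deque([source])
    edges_examined = 0
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            edges_examined += 1  # every adjacency entry we look at counts once
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited, edges_examined

# Hypothetical undirected graph stored as adjacency lists.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

start = time.perf_counter()
visited, edges = bfs_edge_count(graph, 0)
elapsed = time.perf_counter() - start

# TEPS = edges examined / elapsed seconds; GTEPS = TEPS / 1e9.
teps = edges / elapsed if elapsed > 0 else float("inf")
print(f"visited {len(visited)} vertices, examined {edges} edges")
print(f"about {teps / 1e9:.6f} GTEPS on this toy graph")
```

The inner loop does almost no arithmetic; its cost is dominated by irregular memory access, which is why such workloads stress memory and I/O rather than floating-point units.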
Emerging use case: the HBP case study
✓ Human Brain Project: understand the human brain
  • six ICT platforms dedicated to Neuroinformatics, Brain Simulation, High Performance Computing, Medical Informatics, Neuromorphic Computing and Neurorobotics
Data Management
Data Analytics
High Performance
Visualization
SaaS/Cloud
Other emerging use cases
✓ Neuroimage processing
✓ Genome data processing (NGS)
✓ Geoscience
✓ High Throughput Material Science
✓ Pharmacy, Oil & Gas and Finance
✓ Astrophysics
They require more and more data processing and analytics, together with the typical HPC number-crunching workload.
HPC “island” Infrastructure
[Figure: diagram of separate HPC islands. Labels: Workspace (×3); Front End Cluster; DB; Web serv.; Web; Archive; FTP; Tape; HPC Engine (×2); Laboratories; PRACE; EUDAT; Other Data Sources; External Data Sources; Human Brain Prj.]
(Data centric) Infrastructure
[Figure: diagram of the data-centric infrastructure, with a "= NEW" legend marking new components. Labels: Workspace 6 PByte; Core Data Processing; viz; Big mem; DB; Data mover; processing; Web serv.; Web; Hadoop; Core Data Store; Repository 5 PByte; Tape 12 PByte; Internal data sources; Cloud service; Scale-Out Data Processing; FERMI; X86 Cluster; Laboratories; PRACE; EUDAT; Other Data Sources; External Data Sources; Human Brain Prj; SaaS APP; Analytics APP; Parallel APP.]
The PICO system
| Node type | Total Nodes | CPU | Cores per Node | Memory (RAM) | Notes |
|---|---|---|---|---|---|
| Compute/login node | 66 | Intel Xeon E5 2670 v2 @ 2.5 GHz | 20 | 128 GB | |
| Visualization node | 2 | Intel Xeon E5 2670 v2 @ 2.5 GHz | 20 | 128 GB | 2 GPU Nvidia K40 |
| Big Mem node | 2 | Intel Xeon E5 2650 v2 @ 2.6 GHz | 16 | 512 GB | 1 GPU Nvidia K20 |
| BigInsight node | 4 | Intel Xeon E5 2650 v2 @ 2.6 GHz | 16 | 64 GB | 32 TB of local disk |

SSD Storage: 40 TB
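The aggregate size of the system follows from the per-node figures listed above. The short calculation below is a derived illustration, not a figure stated on the slide; it simply multiplies node counts by cores and RAM per node.

```python
# Node counts, cores per node, and RAM per node (GB) as listed for PICO.
pico_nodes = {
    "Compute/login": {"nodes": 66, "cores": 20, "ram_gb": 128},
    "Visualization": {"nodes": 2, "cores": 20, "ram_gb": 128},
    "Big Mem": {"nodes": 2, "cores": 16, "ram_gb": 512},
    "BigInsight": {"nodes": 4, "cores": 16, "ram_gb": 64},
}

total_nodes = sum(spec["nodes"] for spec in pico_nodes.values())
total_cores = sum(spec["nodes"] * spec["cores"] for spec in pico_nodes.values())
total_ram_gb = sum(spec["nodes"] * spec["ram_gb"] for spec in pico_nodes.values())

print(f"nodes: {total_nodes}")        # 74
print(f"cores: {total_cores}")        # 1456
print(f"RAM:   {total_ram_gb} GB")    # 9984 GB
```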
http://www.hpc.cineca.it/hardware/pico
The PICO System
New Services
✓ Data management
  • Preservation
  • Sharing
  • Moving
✓ Data analytics ☺
  • MPI/OpenMP
  • Higher GTEPS/IOPS
  • MapReduce (Hadoop/Spark)
✓ Relational and NoSQL database
✓ Visualization
✓ Cloud
  • Support for RDMA
  • Various O.S. (Microsoft)
✓ Higher integration among the systems
  • Everything is connected through IB
✓ ??
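The MapReduce model named in the list can be sketched in plain Python without Hadoop or Spark. This is only an illustration of the map, shuffle, and reduce phases on a hypothetical three-line input; the function names are made up for the example, and it is not how either framework is invoked.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

# Hypothetical input; in a real deployment each line could live on a different node.
lines = [
    "big data needs big storage",
    "data moves to compute",
    "compute moves to data",
]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Because each call to `map_phase` and each call to `reduce_phase` touches only its own input, both phases can be spread across many nodes; only the shuffle requires moving data between them.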
What Applications work in Clouds
✓ Embarrassingly parallel applications with roughly independent data or spawning independent simulations
  • Long tail of science and integration of distributed sensors
✓ Commercial and Science Data analytics
  • that can use MapReduce and need an interactive environment
  • that use legacy applications
✓ Industries that require high isolation (tenancy)
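A small sketch of the first category, independent simulations that never communicate with each other. It runs eight Monte Carlo estimates of π, each with its own seed, and only combines the results at the end. The seeds, sample count, and function name are hypothetical; a thread pool is used here purely to keep the sketch portable, whereas on a cloud each run would more likely be a separate process, virtual machine, or batch job.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def estimate_pi(seed, samples=20_000):
    """One independent Monte Carlo run with its own random generator."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

seeds = range(8)  # one seed per independent simulation

# The runs share no state, so they can be dispatched in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    estimates = list(pool.map(estimate_pi, seeds))

pi_estimate = sum(estimates) / len(estimates)
print(f"{len(estimates)} independent runs, mean estimate {pi_estimate:.3f}")
```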
Virtualization is dead, long live virtualization
✓ HPC in a container
  • lightweight virtualization mechanism which isolates each application by packaging it into virtual containers
  • every application has its own dependencies, which include both software (services, libraries) and hardware (CPU, memory) resources
  • runs on any Linux server, making the application portable and giving users the flexibility to run it on any framework
http://www.docker.com/whatisdocker
Conclusion
“Large-scale science and engineering problems require collaborative use of many compute and data resources, including supercomputers and large-scale data storage systems, all of which must be integrated with applications and data that are developed by different teams of researchers or that are obtained from different instruments and all of which are at different geographic locations.”
Initiatives
✓ Tools and techniques for massive data analysis, Cineca Bologna (PATC Course), Dec 15th-16th 2014 (complete ☹)
✓ PICO: the CINECA solutions for Big Data Science, Cineca Bologna, Dec 5th 2014
✓ Call for interest at this link: http://goo.gl/Wtv8oM. Deadline is Monday, November 17th at 9:00 am
The EUDAT Project
✓ European Data Infrastructure for addressing data management challenges at EU level
✓ 5 core services
  • Long Term Preservation
  • Staging towards computational facility
  • Sharing
  • Search through metadata
  • Deposit (like Dropbox)
✓ Many scientific communities involved
  • EPOS
  • ENES
  • LifeWatch
  • …
Thanks for your attention! www.hpc.cineca.it