Big process for big data

Upload: ian-foster

Post on 10-May-2015

Category:

Technology


DESCRIPTION

Talk at DOE CIO's Big Data Tech Summit -- latest take on why and wherefore of software as a service (SaaS) for science, and the Globus Online work we are doing, with various DOE examples.

TRANSCRIPT

  • 1. Big process for big data: Process automation for data-driven science. Ian Foster, Computation Institute, Mathematics and Computer Science Division, Department of Computer Science, Argonne National Laboratory & The University of Chicago. Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012. www.ci.anl.gov www.ci.uchicago.edu
  • 2. Big data is not new at DOE. Large Hadron Collider: Higgs discovery only possible because of the extraordinary achievements of grid computing (Rolf Heuer, CERN DG). 15 PB/year; 173 TB/day; 500 MB/sec; LHC Computing Grid (10+ GB/sec).
  • 3. But it is now ubiquitous: e.g., genomics. Source: Kahn, Science, 331 (6018): 728-729.
  • 4. But it is now ubiquitous: e.g., genomics. In 6 years: computing x10 (x30 at DOE). Source: Kahn, Science, 331 (6018): 728-729.
  • 5. But it is now ubiquitous: e.g., genomics. In 6 years: computing x10 (x30 at DOE); genome sequencing x10^5. Source: Kahn, Science, 331 (6018): 728-729.
  • 6. Now ubiquitous: e.g., light sources. 18 orders of magnitude in 5 decades! 12 orders of magnitude in 6 decades. Credit: Linda Young.
  • 7. Now ubiquitous: e.g., light sources. Source: Francesco de Carlo.
  • 8. Local flows already exceed those of LHC. [Diagram: Argonne data flows in TB/day (estimates) among external sources, the Advanced Photon Source, the Argonne Leadership Computing Facility, short-term storage, long-term storage, data analysis, and other sources that remain to be quantified. Flow values shown: 163, 99, 143, 10, 100, 50, 150, 100.]
  • 9. Big data demands new analysis models. Today vs. desired. Source: Francesco de Carlo.
  • 10. Its velocity and variety as well as volume. [Diagram labels: Proteomics, Phenotypes, Transcriptomics, Genomes, Growth curves, Metabolomics, Metabolic model, Reconciled model, Phenotype predictions, Flux predictions, Integrated model, Assembly, Annotation, Hypotheses, Regulon prediction, Regulatory model, Pathway designs.] Credit: Chris Henry et al.
  • 11. Exponentially increasing complexity. Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data.
  • 12. [No transcript text]
  • 13. Tripit exemplifies process automation. Columns: Me; Other services. Steps: Book flights; Record flights; Suggest hotel; Book hotel; Record hotel; Get weather; Prepare maps; Share info; Monitor prices; Monitor flight.
  • 14. Big data requires big process. Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data. Research IT as a service: outsourced, intuitive, integrative, secure, performant, reliable.
  • 15. Characterizing big process requirements. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting. [Diagram labels: Telescope, Simulation, Staging, Ingest, Registry, Community Repository, Analysis, Next-gen genome sequencer, Archive, Mirror.] Accelerate discovery and innovation by outsourcing difficult tasks.
  • 16. Characterizing big process requirements. Text and diagram as on slide 15, plus: Data movement is a frequent challenge. Between facilities, archives, researchers. Many files, large data volumes. With security, reliability, performance.
  • 17. Globus Online: Big process for big data. Data movement as a service. Secure, automated, reliable, high-speed movement, synchronization of many files.
  • 18. 6,000 users. 500 M files, 7 PB moved. 99.9% availability.
  • 19. Examples of Globus Online in action. K. Heitmann (ANL) moves 22 TB cosmology data at 5 Gb/s (LANL - ANL). B. Winjum (UCLA) moves 900K-file plasma physics datasets (UCLA - NERSC). Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience. Supercomputer centers, genome facilities, light sources, universities all recommend it.
  • 20. Automation expands use of networks. Sizes of transfers Jan-Jun; size of circles prop. to log size. Red=NERSC/LBL/ESnet; Green=ORNL/BNL; Blue=ANL; Yellow=FNAL; Grey=Other. [Plot: Transfers Jan-June 2012, size (bytes) vs. time; y-axis bytes_xfered, 1e+00 to 1e+12; x-axis Jan to Jul.]
  • 21. Need much more than data movement. Text and diagram as on slide 15.
  • 22. Need much more than data movement. Ingest, cataloging, integration. Sharing, collaboration, annotation. Identity, groups, security. Analysis, simulation, visualization. ... [Diagram as on slide 15, without Telescope and Simulation.] Accelerate discovery and innovation by outsourcing difficult tasks.
  • 23. Earth System Grid: Data movement. Outsource data transfer: client data download; replication between sites. No ESGF client software needed. 20+ times faster than HTTP. earthsystemgrid.org
  • 24. Kbase: Identity, group, data movement. kbase.science.energy.gov
  • 25. Genomics: Data movement and analysis. Galaxy-based workflow management: web-based UI; drag-n-drop workflow creation; easily add new tools; analytical tools run on scalable computers. Globus Online provides high-performance, fault-tolerant, secure file transfer between all data endpoints. [Diagram labels: Public data, Integrated data libraries, Storage, Sequencing centers, Seq center, Research lab, Local cluster/cloud, Galaxy in cloud, Data management, Data analysis.] Source: Ravi Madduri.
  • 26. Integrating observation and simulation. (1) Cloud properties and precipitation characteristics in large-scale models and cloud-resolving models (e.g., CMIP5 models, GCRM). (2) Precipitating storm structures; storm lifecycles; statistical representation of storm scale properties. (3) Predictive cloud models. Steps: Retrieve; Compare; Construct structured 4-D atmospheric state (CAN); Analytics. [Figure: Percentage of mapped radar domain in Darwin with returns >10 dBz over the period 19 to 22 January 2006.] Credit: Scott Collis.
  • 27. Integrating observation and simulation. Level 1, Level 2, Level 3: PBs, TBs, GBs. Credit: Salman Habib, Katrin Heitmann.
  • 28. Integrating observation and simulation. Credit: Salman Habib, Katrin Heitmann.
  • 29. In summary: Big process for big data. Accelerate discovery and innovation worldwide by providing research IT as a service. Outsource time-consuming tasks to provide large numbers of researchers with unprecedented access to powerful tools; enable a massive shortening of cycle times in time-consuming research processes; and reduce research IT costs via economies of scale. Accelerate existing science; enable new science.
  • 30. Thank you! foster@anl.gov www.ci.anl.gov www.mcs.anl.gov www.globusonline.org