Kepler/SPA Extensions for Scientific Workflows – Now and Upcoming
1
Kepler/SPA Extensions for Scientific Workflows – Now and Upcoming
Ilkay Altintas
SWAT lead
San Diego Supercomputer Center
[email protected] (sdsc.edu)
Bertram Ludäscher
Dept. of Computer Science & Genome Center
University of California, Davis
[email protected] (ucdavis.edu)
+ many other SDM/SPA & Kepler contributors!
San Diego Supercomputer Center
UC Davis, Department of Computer Science
2
KEPLER/CSP: Contributors, Sponsors, Projects
Ilkay Altintas: SDM, NLADR, Resurgence, EOL, …
Kim Baldridge: Resurgence, NMI
Chad Berkley: SEEK
Shawn Bowers: SEEK
Terence Critchlow: SDM
Tobin Fricke: ROADNet
Jeffrey Grethe: BIRN
Christopher H. Brooks: Ptolemy II
Zhengang Cheng: SDM
Dan Higgins: SEEK
Efrat Jaeger: GEON
Matt Jones: SEEK
Werner Krebs: EOL
Edward A. Lee: Ptolemy II
Kai Lin: GEON
Bertram Ludaescher: SDM, SEEK, GEON, BIRN, ROADNet
Mark Miller: EOL
Steve Mock: NMI
Steve Neuendorffer: Ptolemy II
Jing Tao: SEEK
Mladen Vouk: SDM
Xiaowen Xin: SDM
Yang Zhao: Ptolemy II
Bing Zhu: SEEK
…
www.kepler-project.org
LLNL, NCSU, SDSC, UCB, UCD, UCSB, UCSD, …, Zurich
Collab. tools: IRC, cvs, skype, Wiki: hotTopics, FAQs, ..
3
GEON Dataset Generation & Registration
(a co-development in KEPLER)
[Figure: workflow screenshot annotated with contributors: Xiaowen (SDM/SPA); Edward et al. (Ptolemy); Yang (Ptolemy); Efrat (GEON); Ilkay (SDM/SPA); SQL database access (JDBC); Matt, Chad, Dan et al. (SEEK)]
% Makefile
$> ant run
4
Update: endo-SPA (exo-Kepler), endo-Kepler (exo-SPA), … w/o counting peas…
• No/minor changes:
  – XSLT, email, …
• Web service actor (SDM)
  – Updated: dynamic operation display, error reporting
• Command line actor (SDM)
  – Updated: improved interface and error handling
• SSH2 actor (SDM)
  – New: implements ssh2 protocol for remote execution (no plain password sent over the wire)
• Timestamp actor (SDM)
  – New: for logging
• BrowserUIv2.0 (SDM)
  – reimplemented, improved interface
  – v3.0 planned (“catching” http-get/post via localhost)
• Execution logger (SDM)
  – New: workflow “black box” for keeping track of runs
• Documentation framework (SDM)
  – Autogenerated actor documentation (new doclets and taglets)
• Ontology-based actor and dataset classification (SEEK)
  – Finding relevant components: actors and datasets, suggesting possible connections, …
• Kepler/SRB toolkit (GEON, SDM, SEEK, …)
  – improved interfaces, new functions
• …
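As an illustration of how a command-line step, a timestamp, and an execution "black box" could fit together, here is a minimal Python sketch. It is not the Kepler implementation (Kepler actors are Java classes built on Ptolemy II); the names `RunRecord`, `ExecutionLog`, and `run_command_step` are hypothetical and exist only for this example.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RunRecord:
    """One entry in the (hypothetical) execution log."""
    command: list
    started: str
    finished: str
    returncode: int
    stdout: str
    stderr: str


@dataclass
class ExecutionLog:
    """In-memory stand-in for a workflow 'black box' that tracks runs."""
    records: list = field(default_factory=list)

    def failed(self):
        """Return the records of all steps that did not exit with code 0."""
        return [r for r in self.records if r.returncode != 0]


def timestamp():
    """UTC timestamp in ISO 8601 form, used for logging."""
    return datetime.now(timezone.utc).isoformat()


def run_command_step(command, log, timeout=60):
    """Run one external command, capture its output, and log the outcome.

    Failures to launch (missing executable) and timeouts are recorded
    with return code -1 instead of raising, so a workflow can decide
    how to react.
    """
    started = timestamp()
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        code, out, err = proc.returncode, proc.stdout, proc.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        code, out, err = -1, "", str(exc)
    record = RunRecord(list(command), started, timestamp(), code, out, err)
    log.records.append(record)
    return record


if __name__ == "__main__":
    log = ExecutionLog()
    ok = run_command_step([sys.executable, "-c", "print('hello')"], log)
    bad = run_command_step([sys.executable, "-c", "import sys; sys.exit(3)"], log)
    print(ok.stdout.strip(), bad.returncode, len(log.failed()))
```

Keeping the log separate from the step function means every step type (command line, web service, remote execution) could append to the same record of runs.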
5
Application Pull vs Technology Push
• Use case driven (application pull)
  – PIW, TSI-1, TSI-2, …
  – Solve technology issues along the way
  (+) solves the particular scientists’ problem
  (-) one-of-a-kind solutions, little generic & reusable technology
Example:
  – TSI-1 and TSI-2 are conceptually almost identical scientific (“Grid/HPC/HTC”) workflows
  – but implemented very differently, so reuse is limited: e.g., evolving/customizing one into the other is hard/impossible…
6
Application Pull vs Technology Push
• Technology driven (technology push)
  – Generic application integration mechanisms:
    • web service actor, harvester, command-line actors, ssh2 actor, BrowserUI, …
  – Specialized interfaces to HPC/HTC systems:
    • Large-scale data management:
      – SDSC SRB toolkit (set of SRB actors),
      – SRM?, PVFS2?, MPI-IO?, …
    • Interfacing with generic job schedulers:
      – NIMROD, Condor, APST, …
  – Interfacing with scientific packages:
    – Statistics toolkit (R, …), GIS (Grass, ArcIMS, Mapserver…)
    – GAMESS toolkit, APBS (visualization)…
(+) developing reusable technology / toolkits
(!) still need guidance by domain scientists’ problems, but need to lift one-off solutions into a general SWF engineering methodology
7
Increasing number of Kepler actors…
8
… creating prototype workflows and test cases (for automated tests) …
9
… putting them together in generic, reusable packages, e.g. Kepler/SRB toolkit
SRB holdings @ SDSC only: 404 TB in 59 million files across 5167 users (12/16/’04, Reagan Moore)
10
KEPLER/R Toolkit (under development)
Source: Dan Higgins, Kepler/SEEK
11
12
New Developments & Directions
13
Ontology-based Actor & Dataset Discovery
Ontology-based actor (service) and dataset search
Result Display
14
15
16
Example: GAMESS Quantum-mechanics cheminformatics workflow
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics)
17
Towards a Framework for “Grid/HPC/HTC” WFs & Job Management
18
Technology-oriented meeting: May 12th Ptolemy/Kepler Miniconference in Berkeley
19
What’s needed, what’s next
• Build generic toolkits / packages
• Don’t reinvent – Reuse!
  – Improved R coupling, SCIRun coupling, …
• SWF Framework that lets scientists choose…
  – SRB (Sput, Sget,…), SRM, MPI-IO, GlobusTK (GridFTP, …), Sabul, …, pNetCDF, parallel-R, … packages
  – Condor, Nimrod, … schedulers
  – GRASS, …
• General purpose SWF system/PSE that scientists can use themselves
20
Towards a KEPLER School of Expression
(Flow-based Design Patterns)
• Generality vs specialization of actors
  – also loosely coupled vs tightly coupled
• Data transformation pipelines
  – alternate compute and data transformation steps
• Stage-execute-fetch pattern (Grid/HPC/HTC-WFs)
• Loops, higher-order functions (map, foldr, …)
  – cf. Taverna’s automatic loop insertion based on data types
[Figure: F-map. Labels: producer, [f1, f2, …, fn] (methods/functions), X]
[Figure: map. Labels: f, producer, [x1, x2, …, xn], [f(x1), …, f(xn)]]
[Figure: actors A, B, C linked by “connect”: JDBC/SRB connection tokens, proxies, certificates]
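The higher-order patterns named on this slide can be sketched in a few lines of Python. This is only an illustration, not Kepler code: `map_actor`, `f_map_actor`, and `foldr` are hypothetical names, and reading "F-map" as "apply every function in a list to the same input" is an assumption drawn from the figure labels.

```python
from functools import reduce


def map_actor(f, xs):
    """map: apply one function f to each token of [x1, ..., xn]."""
    return [f(x) for x in xs]


def f_map_actor(fs, x):
    """F-map (assumed reading): apply each of [f1, ..., fn] to the same x."""
    return [f(x) for f in fs]


def foldr(f, init, xs):
    """Right fold: f(x1, f(x2, ... f(xn, init)))."""
    # Walk the list from the right so the last element meets init first.
    return reduce(lambda acc, x: f(x, acc), reversed(xs), init)


if __name__ == "__main__":
    print(map_actor(lambda x: x * x, [1, 2, 3]))
    print(f_map_actor([abs, str], -2))
    print(foldr(lambda x, acc: x - acc, 0, [1, 2, 3]))
```

In a dataflow setting, the benefit of such actors is that the loop lives in one reusable component instead of being wired by hand into each workflow.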
21
Blurring Design (ToDo) and Execution
22
Kepler@UC Davis Genome Center: Scientific Workflows to Support the Complete (Wet-lab) Experiment Lifecycle
• Try to capture and (semi-)automate the Experiment Lifecycle:
  – Discover similar experiments, …
  – reuse, customize,
  – execute, monitor,
  – manage results,
  – Register back to an experiment repository
Support Experiment Design, Execution, & Reuse
Scientific workflows and semantic extensions (ontologies, metadata++)
23
Summary: What we could/should do
• Push technology:
  – Distributed Kepler & “detached” execution
  – Making Kepler more X-aware, where …
    • … X=Data plumbing (SRB toolkit, GridTK, others, …)
    • … X=Grid & Scheduling (need a “Grid director”? Condor director?),
    • … X=Parameter-sweep (“Nimrod/APST”… director?)
    • … X=Statistics & other specialized packages (R, parallel-R?, …, Grass, … )
    • … X=Visualization (SCIRun, …)
  – Semantic extensions
    • Actors and datasets have “semantic types” to support resource discovery, WF design, …
• Create “Packages” or “Rolls”
  – … targeting certain scientific user groups & communities
• SWF Life-cycle support:
  – Design, execution, monitoring, archival, re-use/re-run
  – Design patterns, “Kepler School of Expression”
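To make the "semantic types" idea concrete, the following Python sketch shows one way discovery and connection checks could work over a concept hierarchy. Everything in it is an assumption for illustration: the toy ontology, the actor names, and the functions `is_subconcept`, `find_actors`, and `can_connect` are hypothetical and do not reflect the SEEK/Kepler implementation.

```python
# Hypothetical toy ontology: each concept maps to its parent concept.
ONTOLOGY = {
    "SpeciesAbundance": "EcologicalMeasurement",
    "EcologicalMeasurement": "Measurement",
    "Rainfall": "ClimateMeasurement",
    "ClimateMeasurement": "Measurement",
}

# Hypothetical actors annotated with (input semantic type, output semantic type).
ACTORS = {
    "SummaryStatistics": ("Measurement", "Statistic"),
    "BiodiversityIndex": ("SpeciesAbundance", "EcologicalMeasurement"),
    "RainfallInterpolator": ("Rainfall", "ClimateMeasurement"),
}


def is_subconcept(concept, ancestor):
    """True if concept equals ancestor or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)  # None once the root is passed
    return False


def find_actors(dataset_type):
    """Actors whose input type is general enough to accept the dataset."""
    return sorted(
        name for name, (in_type, _) in ACTORS.items()
        if is_subconcept(dataset_type, in_type)
    )


def can_connect(producer, consumer):
    """Suggest a connection if the producer's output fits the consumer's input."""
    out_type = ACTORS[producer][1]
    in_type = ACTORS[consumer][0]
    return is_subconcept(out_type, in_type)


if __name__ == "__main__":
    print(find_actors("SpeciesAbundance"))
    print(can_connect("BiodiversityIndex", "SummaryStatistics"))
```

The same subsumption check serves both uses mentioned on the slide: finding relevant actors for a dataset and judging whether two actors can be wired together during workflow design.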