A common and sustainable big data infrastructure in support of weather prediction research and education in universities Unidata Modeling Research in the Cloud Workshop, 5/31/17


Page 1: A common and sustainable big data infrastructure in ......Brian Ancell (Texas Tech), William Capehart (SDSM), Clark Evans (UW Milwaukee), Robert Fowell, Kevin Tyle (U Albany), Steven

A common and sustainable big data infrastructure in support of weather prediction research and education in universities

Unidata Modeling Research in the Cloud Workshop, 5/31/17

Page 2

Carlos Maltzahn Background

• Current Research
  • High-performance ultra-scale storage and data management
  • End-to-end Performance management and QoS
  • Reproducible Evaluation of Systems
  • Network Intermediaries
• Other Research
  • Data Management Games
  • Information Retrieval
  • Cooperation Dynamics

• Adjunct Professor, Computer Science, UC Santa Cruz
• Director, UCSC Systems Research Laboratory (SRL)
• Director, Center for Research in Open Source Software (CROSS), cross.ucsc.edu
• Director, UCSC/LANL Institute for Scalable Scientific Data Management (ISSDM)
• 1999-2004: Performance Engineer, Netapp
• Advising 6 Ph.D. students
• Graduated 5 Ph.D. students
• I do this 100% of my time!

Page 3

Project

• NSF-funded Scientific Software Integration project (SSI) of the Software Infrastructure for Sustained Innovation (SI2) program
• Goal: sustainable community SW framework

Collaborators

Carlos Maltzahn, Ivo Jimenez (UC Santa Cruz),
Josh Hacker, John Exby, Kate Fossell (NCAR),
Mohan Ramamurthy (Unidata),
Gretchen Mullendore, Timothy See (UND),
Brian Ancell (Texas Tech),
William Capehart (SDSM),
Clark Evans (UW Milwaukee),
Robert Fowell, Kevin Tyle (U Albany),
Steven Greybush (Penn State),
Russ Schumacher (CSU).

Page 4

Problem

Informed by EarthCube Users workshops

• Poor reproducibility of data-intensive science
• Impact on education and research
• Impaired availability of intermediate results
• Unnecessary duplication of work, steep learning curves
• Communities of practice are falling behind
• Limited ability to adopt new technologies

Page 5

Domain: Numerical Weather Prediction

• NWP groups at universities use supercomputing time to create large ensembles
• Current practice:
  • Keep ensembles in scratch space or download to local infrastructure
  • Don’t share ensemble products, don’t share tools
  • Rewards for results, not data

Page 6

General Approach

• Establish “nuclei”: pieces of technology that
  • Are easily shareable
  • Have the ability to grow & improve over time
  • Ensure “buy-in” from researchers and students
• Examples:
  • Wikipedia
  • Linux kernel
• Infrastructures to enable community-driven review and improvement

Page 7

Big Weather Web Nuclei

1. Large ensemble distributed over 7 universities: Gretchen Mullendore (UND), Brian Ancell (Texas Tech), William Capehart (SDSM), Clark Evans (UW Milwaukee), Robert Fowell (SUNY Albany), Steven Greybush (Penn State), Russ Schumacher (CSU).

2. Common storage, linking, and cataloging methodology: Data Investigation and Sharing Environment
  • Permanent naming and high availability of data and experiments
  • Connecting data, platform, tools, analysis

3. Software Container technologies for easy deployment and reproducibility
  • Self-contained: software can be instantly deployed in common environments
  • Naming and versioning: compact reference mechanisms for complex environments
  • Good for reproducibility and education

Page 8

Nucleus 1: Large, distributed ensembles

• Testing the distributed ensemble framework and tools
• Sharing of “knowledge products”
  • Initialization methods
  • Physics options
  • Workflow scripts for producing & analyzing data
• Success: BWW PIs are using the BWW ensemble to do science
• Tracking data authorship and community impact
  • We have a DOI but access has to be managed (expense of data egress, see below)
  • Ensemble is evolving over time
• Dissemination of framework & tools
  • See NCAR’s “WRF in a box” work

Page 10

Nucleus 2: Common storage, linking, and cataloging methodology

• Enable figures in publications & teaching materials to link to environments, tools, and data that produced them
• Provided in a form that is reusable
  • Easy install of environment and tools
  • Creation and access to data products without need to download everything
  • Data products by themselves link back to their antecedents in a reusable way.

Page 11

Nucleus 2: Common storage, linking, and cataloging methodology

Use cloud services instead of on-premise installations

• Converts hard technical, management, and funding questions into just funding questions
• Started with commercial cloud: AWS ($9k/month credit)
  • 50 TB so far on S3: $800/month
  • THREDDS server on EC2: $160/month
  • Particular thanks to John Exby and Kevin Tyle
• Challenges:
  • Commercial cloud: cost of data egress. Planning move to XSEDE/TACC/Wrangler
  • Long-term management of storage commons (with better-than-scratch-space policies)
  • Long-term naming: getting a DOI is the easy part -- long-term availability?
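The monthly figures on this slide can be put in perspective with a little arithmetic. The sketch below uses only the numbers quoted above (50 TB on S3 at $800/month, a THREDDS server on EC2 at $160/month, a $9k/month credit). The egress price per GB is a hypothetical placeholder chosen for illustration, not a quoted AWS rate, and the function names are ours.

```python
# Figures quoted on the slide (USD per month unless noted).
S3_STORED_TB = 50
S3_COST = 800
THREDDS_EC2_COST = 160
AWS_CREDIT = 9_000

# Hypothetical placeholder for illustration only; not a quoted AWS price.
ASSUMED_EGRESS_USD_PER_GB = 0.09


def cost_per_tb(monthly_cost: float, stored_tb: float) -> float:
    """Monthly storage cost per terabyte."""
    return monthly_cost / stored_tb


def baseline_monthly_cost() -> float:
    """Storage plus THREDDS server, before any egress charges."""
    return S3_COST + THREDDS_EC2_COST


def credit_fraction_used() -> float:
    """Share of the monthly credit consumed by the baseline cost."""
    return baseline_monthly_cost() / AWS_CREDIT


def egress_cost(downloaded_tb: float,
                usd_per_gb: float = ASSUMED_EGRESS_USD_PER_GB) -> float:
    """Egress charge for downloading `downloaded_tb`, taking 1 TB = 1000 GB."""
    return downloaded_tb * 1000 * usd_per_gb


if __name__ == "__main__":
    print(f"S3 cost per TB per month: ${cost_per_tb(S3_COST, S3_STORED_TB):.2f}")
    print(f"Baseline monthly cost:    ${baseline_monthly_cost():.2f}")
    print(f"Share of credit used:     {credit_fraction_used():.1%}")
    print(f"One full download (assumed rate): ${egress_cost(S3_STORED_TB):.2f}")
```

Under the assumed rate, a single full download of the stored ensemble would cost several times the monthly storage bill, which is the kind of pressure the egress challenge refers to. With a different rate the ratio changes, so the placeholder should be replaced with the provider's current price list before drawing conclusions.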

Page 12

Nucleus 3: Software Containers

• See earlier talk “Collaborative WRF-based research & education, enabled by software containers” by Josh Hacker, John Exby, and Kate Fossell
• Practical Falsifiable Research (Popper, falsifiable.us, see poster)
  • Apply open-source software community practices to experiment management
  • Script everything, leverage workflow systems & DevOps tools
    • See also Eric Klavins’ Aquarium Project: klavinslab.org/aquarium.html
  • Keep everything in git repositories
  • Use software containers for all software
  • Naming convention to automate running and validating experiments
  • Conventions for compact computing environment description
  • See poster by Ivo Jimenez
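To illustrate how a naming convention can automate running and validating experiments, here is a minimal sketch. It assumes a hypothetical layout in which each experiment is a subdirectory containing a `run.sh` and a `validate.sh`; these file names, the `experiments` root, and the function names are illustrative assumptions and are not taken from the Popper documentation.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

# Hypothetical convention: every experiment directory holds these two scripts.
RUN_SCRIPT = "run.sh"
VALIDATE_SCRIPT = "validate.sh"


def discover_experiments(root: Path) -> List[Path]:
    """Return subdirectories of `root` that follow the naming convention."""
    if not root.is_dir():
        return []
    return sorted(
        d for d in root.iterdir()
        if d.is_dir()
        and (d / RUN_SCRIPT).is_file()
        and (d / VALIDATE_SCRIPT).is_file()
    )


def run_and_validate(experiment: Path) -> bool:
    """Run the experiment, then its validation; True only if both exit with 0."""
    for script in (RUN_SCRIPT, VALIDATE_SCRIPT):
        result = subprocess.run(["sh", script], cwd=experiment)
        if result.returncode != 0:
            return False
    return True


def check_all(root: Path) -> Dict[str, bool]:
    """Map each discovered experiment name to its pass/fail status."""
    return {exp.name: run_and_validate(exp) for exp in discover_experiments(root)}


if __name__ == "__main__":
    # "experiments" is a hypothetical root directory for this sketch.
    for name, passed in check_all(Path("experiments")).items():
        print(f"{name}: {'ok' if passed else 'FAILED'}")
```

Because the scripts are found by name rather than listed in a configuration file, adding an experiment only requires creating a directory that follows the convention. The same loop can then run in a continuous-integration job against a git repository.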

Page 13

BWW Outreach

• 2015 Unidata Users Meeting
• 2015 AGU Townhall
  • ~50 attendees
  • 10 new bww-users subscribers
• 2016 Presentation at AMS
• 2016 Unidata Workshops
• WRF in a box in the classroom
  • 2016 UND class by Tim See (UND)
  • 2017 UND class by Gretchen Mullendore (UND)
• William Capehart (SDSM) using BWW ensembles (2 papers)
• Popper
  • Jimenez et al. VarSys’16, Chicago, IL
  • Jimenez et al. USENIX ;login:, Winter ’16
  • Guest lecture in 2017 UND class by Gretchen Mullendore

Page 14

• Websites:
  • bigweatherweb.org
  • www.ral.ucar.edu/projects/ncar-docker-wrf
  • falsifiable.us
• Email list: [email protected]
• Slack: bigwxweb.slack.com, invitations: Kevin Tyle, [email protected]
• Contact: Carlos Maltzahn, [email protected]