data intensive scalable omputing - ed lazowskalazowska.cs.washington.edu/snowbird08/disc.pdfcloud...
TRANSCRIPT
DDataataIIntensiventensive
SScalablecalableCComputingomputing
http://www.cs.cmu.edu/~bryant
Randal E. BryantCarnegie Mellon University
– 2 –
Examples of Big Data SourcesExamples of Big Data Sources
WalWal--MartMart267 million items/day, sold at 6,000 storesHP building them 4PB data warehouseMine data to manage supply chain, understand market trends, formulate pricing strategies
Sloan Digital Sky SurveySloan Digital Sky SurveyNew Mexico telescope captures 200 GB image data / dayLatest dataset release: 10 TB, 287 million celestial objectsSkyServer provides SQL access Next generation LSST even bigger
– 3 –
Our Data-Driven WorldOur Data-Driven World
ScienceScienceData bases from astronomy, genomics, natural languages, seismic modeling, …
HumanitiesHumanitiesScanned books, historic documents, …
CommerceCommerceCorporate sales, stock market transactions, census, airline traffic, …
EntertainmentEntertainmentInternet images, Hollywood movies, MP3 files, …
MedicineMedicineMRI & CT scans, patient records, …
– 4 –
Cloud Computing VarietiesCloud Computing Varieties
““I’ve got terabytes of data. I’ve got terabytes of data. Tell me what they mean.”Tell me what they mean.”
Very large, shared data repositoryComplex analysisData-intensive scalable computing (DISC)
““I don’t want to be a system I don’t want to be a system administrator. You handle my administrator. You handle my data & applications.”data & applications.”
Hosted servicesDocuments, web-based email, etc.Can access from anywhereEasy sharing and collaboration
– 5 –
CS Research IssuesCS Research IssuesApplicationsApplications
Language translation, image processing, …
Application SupportApplication SupportMachine learning over very large data setsWeb crawling
ProgrammingProgrammingAbstract programming models to support large-scale computationDistributed databases
System DesignSystem DesignError detection & recovery mechanismsResource scheduling and load balancingDistribution and sharing of data across system
– 6 –
Getting StartedGetting StartedGoalGoal
Get faculty & students active in DISC
Software: HadoopSoftware: HadoopOpen source project inspired by Google infrastructure
Distributed file systemMapReduce programming environment
Supported and used by YahooPrototype on single machine, map onto cluster
– 7 –
Hardware: Rely on Kindness of OthersHardware: Rely on Kindness of Others
Google setting up dedicated cluster for university useLoaded with open-source software
Including HadoopIBM providing additional software supportNSF will determine how facility should be used.
– 8 –
More Sources of KindnessMore Sources of KindnessYahoo: Major supporter of HadoopYahoo plans to work with other universities
– 9 –
Big-Data Computing Study GroupBig-Data Computing Study Group
Co-organized by REB & Thomas Kwan (Yahoo!)Supported by Computing Community Consortium
– 10 –
BDCSG ActivitiesBDCSG Activities
Hadoop SummitHadoop Summit350+ people showed upPower of Open Source
DataData--Intensive Computing SymposiumIntensive Computing Symposium~100 from universities, companies, govt. labs, NSF14 invited speakers
Google, Yahoo!, Microsoft, IntelCMU, UC Berkeley, Cornell, MIT, Johns Hopkins, UIUC, UWNSF
– 11 –
NSF InvolvementNSF Involvement
– 12 –
Curriculum DevelopmentCurriculum Development
Workshop for educators July 16–18, 2008
– 13 –
Christophe Christophe BiscigliaBisciglia
UW/GoogleCatalyst / instigator
– 14 –
Future WorkshopsFuture Workshops
– 15 –
Concluding ThoughtsConcluding Thoughts
The World is Ready for a New Approach to LargeThe World is Ready for a New Approach to Large--Scale Scale ComputingComputing
Optimized for data-driven applicationsTechnology favoring centralized facilities
Storage capacity & computer power growing faster than network bandwidth
Industry is Catching on QuicklyIndustry is Catching on QuicklyLarge crowd for Hadoop Summit
University Researchers / Educators Eager to Get University Researchers / Educators Eager to Get InvolvedInvolved
Spans wide range of CS disciplinesAcross multiple institutions