data intensive scalable omputing - ed lazowskalazowska.cs.washington.edu/snowbird08/disc.pdfcloud...

15
D D ata ata I I ntensive ntensive S S calable calable C C omputing omputing http://www.cs.cmu.edu/~bryant Randal E. Bryant Carnegie Mellon University

Upload: others

Post on 25-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

DDataataIIntensiventensive

SScalablecalableCComputingomputing

http://www.cs.cmu.edu/~bryant

Randal E. BryantCarnegie Mellon University

Page 2: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 2 –

Examples of Big Data SourcesExamples of Big Data Sources

WalWal--MartMart267 million items/day, sold at 6,000 storesHP building them 4PB data warehouseMine data to manage supply chain, understand market trends, formulate pricing strategies

Sloan Digital Sky SurveySloan Digital Sky SurveyNew Mexico telescope captures 200 GB image data / dayLatest dataset release: 10 TB, 287 million celestial objectsSkyServer provides SQL access Next generation LSST even bigger

Page 3: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 3 –

Our Data-Driven WorldOur Data-Driven World

ScienceScienceData bases from astronomy, genomics, natural languages, seismic modeling, …

HumanitiesHumanitiesScanned books, historic documents, …

CommerceCommerceCorporate sales, stock market transactions, census, airline traffic, …

EntertainmentEntertainmentInternet images, Hollywood movies, MP3 files, …

MedicineMedicineMRI & CT scans, patient records, …

Page 4: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 4 –

Cloud Computing VarietiesCloud Computing Varieties

““I’ve got terabytes of data. I’ve got terabytes of data. Tell me what they mean.”Tell me what they mean.”

Very large, shared data repositoryComplex analysisData-intensive scalable computing (DISC)

““I don’t want to be a system I don’t want to be a system administrator. You handle my administrator. You handle my data & applications.”data & applications.”

Hosted servicesDocuments, web-based email, etc.Can access from anywhereEasy sharing and collaboration

Page 5: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 5 –

CS Research IssuesCS Research IssuesApplicationsApplications

Language translation, image processing, …

Application SupportApplication SupportMachine learning over very large data setsWeb crawling

ProgrammingProgrammingAbstract programming models to support large-scale computationDistributed databases

System DesignSystem DesignError detection & recovery mechanismsResource scheduling and load balancingDistribution and sharing of data across system

Page 6: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 6 –

Getting StartedGetting StartedGoalGoal

Get faculty & students active in DISC

Software: HadoopSoftware: HadoopOpen source project inspired by Google infrastructure

Distributed file systemMapReduce programming environment

Supported and used by YahooPrototype on single machine, map onto cluster

Page 7: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 7 –

Hardware: Rely on Kindness of OthersHardware: Rely on Kindness of Others

Google setting up dedicated cluster for university useLoaded with open-source software

Including HadoopIBM providing additional software supportNSF will determine how facility should be used.

Page 8: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 8 –

More Sources of KindnessMore Sources of KindnessYahoo: Major supporter of HadoopYahoo plans to work with other universities

Page 9: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 9 –

Big-Data Computing Study GroupBig-Data Computing Study Group

Co-organized by REB & Thomas Kwan (Yahoo!)Supported by Computing Community Consortium

Page 10: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 10 –

BDCSG ActivitiesBDCSG Activities

Hadoop SummitHadoop Summit350+ people showed upPower of Open Source

DataData--Intensive Computing SymposiumIntensive Computing Symposium~100 from universities, companies, govt. labs, NSF14 invited speakers

Google, Yahoo!, Microsoft, IntelCMU, UC Berkeley, Cornell, MIT, Johns Hopkins, UIUC, UWNSF

Page 11: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 11 –

NSF InvolvementNSF Involvement

Page 12: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 12 –

Curriculum DevelopmentCurriculum Development

Workshop for educators July 16–18, 2008

Page 13: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 13 –

Christophe Christophe BiscigliaBisciglia

UW/GoogleCatalyst / instigator

Page 14: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 14 –

Future WorkshopsFuture Workshops

Page 15: Data Intensive Scalable omputing - Ed Lazowskalazowska.cs.washington.edu/snowbird08/DISC.pdfCloud Computing VarietiesCloud Computing Varieties “I’ve got terabytes of data. Tell

– 15 –

Concluding ThoughtsConcluding Thoughts

The World is Ready for a New Approach to LargeThe World is Ready for a New Approach to Large--Scale Scale ComputingComputing

Optimized for data-driven applicationsTechnology favoring centralized facilities

Storage capacity & computer power growing faster than network bandwidth

Industry is Catching on QuicklyIndustry is Catching on QuicklyLarge crowd for Hadoop Summit

University Researchers / Educators Eager to Get University Researchers / Educators Eager to Get InvolvedInvolved

Spans wide range of CS disciplinesAcross multiple institutions