
Data-Intensive Scalable Computing

http://www.cs.cmu.edu/~bryant

Randal E. Bryant
Carnegie Mellon University

– 2 –

Examples of Big Data Sources

Wal-Mart
267 million items/day, sold at 6,000 stores
HP building them a 4 PB data warehouse
Mine data to manage supply chain, understand market trends, formulate pricing strategies

Sloan Digital Sky Survey
New Mexico telescope captures 200 GB of image data per day
Latest dataset release: 10 TB, 287 million celestial objects
SkyServer provides SQL access
Next-generation LSST even bigger

– 3 –

Our Data-Driven World

Science
Databases from astronomy, genomics, natural languages, seismic modeling, …

Humanities
Scanned books, historic documents, …

Commerce
Corporate sales, stock market transactions, census, airline traffic, …

Entertainment
Internet images, Hollywood movies, MP3 files, …

Medicine
MRI & CT scans, patient records, …

– 4 –

Cloud Computing Varieties

“I’ve got terabytes of data. Tell me what they mean.”
Very large, shared data repository
Complex analysis
Data-intensive scalable computing (DISC)

“I don’t want to be a system administrator. You handle my data & applications.”
Hosted services
Documents, web-based email, etc.
Can access from anywhere
Easy sharing and collaboration

– 5 –

CS Research Issues

Applications
Language translation, image processing, …

Application Support
Machine learning over very large data sets
Web crawling

Programming
Abstract programming models to support large-scale computation
Distributed databases

System Design
Error detection & recovery mechanisms
Resource scheduling and load balancing
Distribution and sharing of data across system

– 6 –

Getting Started

Goal
Get faculty & students active in DISC

Software: Hadoop
Open source project inspired by Google infrastructure
Distributed file system
MapReduce programming environment
Supported and used by Yahoo
Prototype on single machine, map onto cluster
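The slide names the MapReduce programming environment without showing what a program in that model looks like. The following is a minimal single-machine sketch of the model in plain Python, using word counting as the example. It is not the Hadoop API: the names map_fn, reduce_fn, and run_mapreduce are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical name): emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step (hypothetical name): sum the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially on one machine."""
    # Shuffle: collect every mapped value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each key is reduced independently of the others, which is what
    # lets a cluster framework spread the reduce work across machines.
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    lines = ["big data big clusters", "data intensive computing"]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

Because the map and reduce functions touch only their own inputs, the same two functions could in principle be handed to a cluster framework unchanged, which is the sense of "prototype on single machine, map onto cluster".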

– 7 –

Hardware: Rely on Kindness of Others

Google setting up dedicated cluster for university use
Loaded with open-source software
Including Hadoop
IBM providing additional software support
NSF will determine how facility should be used.

– 8 –

More Sources of Kindness

Yahoo: Major supporter of Hadoop
Yahoo plans to work with other universities

– 9 –

Big-Data Computing Study Group

Co-organized by REB & Thomas Kwan (Yahoo!)
Supported by Computing Community Consortium

– 10 –

BDCSG Activities

Hadoop Summit
350+ people showed up
Power of Open Source

Data-Intensive Computing Symposium
~100 from universities, companies, govt. labs, NSF
14 invited speakers
Google, Yahoo!, Microsoft, Intel
CMU, UC Berkeley, Cornell, MIT, Johns Hopkins, UIUC, UW
NSF

– 11 –

NSF Involvement

– 12 –

Curriculum Development

Workshop for educators July 16–18, 2008

– 13 –

Christophe Bisciglia

UW/Google
Catalyst / instigator

– 14 –

Future Workshops

– 15 –

Concluding Thoughts

The World is Ready for a New Approach to Large-Scale Computing
Optimized for data-driven applications
Technology favoring centralized facilities
Storage capacity & computer power growing faster than network bandwidth
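A rough calculation illustrates why limited network bandwidth favors keeping computation near the data. The sketch below estimates how long it would take to move the 10 TB Sloan dataset release mentioned earlier over a single link. The 1 Gb/s link speed, decimal units, and full sustained utilization are assumptions made for illustration; none of them come from the slides.

```python
def transfer_hours(size_bytes: float, link_bits_per_sec: float) -> float:
    """Hours to move size_bytes over a link running at full, sustained speed."""
    return size_bytes * 8 / link_bits_per_sec / 3600


TB = 10**12    # decimal terabyte (assumption)
GBPS = 10**9   # 1 gigabit per second (assumed link speed)

if __name__ == "__main__":
    hours = transfer_hours(10 * TB, 1 * GBPS)
    print(f"{hours:.1f} hours")  # 22.2 hours
```

Under these assumptions a single copy takes most of a day, before any analysis begins.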

Industry is Catching on Quickly
Large crowd for Hadoop Summit

University Researchers / Educators Eager to Get Involved
Spans wide range of CS disciplines
Across multiple institutions
