cs 4604: introduction to database management...
TRANSCRIPT
![Page 1: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/1.jpg)
CS 4604: Introduction to Database Management Systems
B. Aditya Prakash
Lecture #12: NoSQL and MapReduce
NOSQL (some slides from Xiao Yu)
Prakash 2018, VT CS 4604
Why NoSQL?
RDBMS
§ The predominant choice in storing data
– Not so true for data miners, since we mostly work with txt files.
§ First formulated in 1969 by Codd
– We are using RDBMS everywhere
Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”
When RDBMS met Web 2.0
Slide from Lorenzo Alberton, “NoSQL Databases: Why, what and when”
What to do if data is really large?
§ Petabytes (exabytes, zettabytes, …)
§ Google processed 24 PB of data per day (2009)
§ FB adds 0.5 PB per day
BIG data
What’s Wrong with Relational DB?
§ Nothing is wrong. You just need to use the right tool.
§ Relational is hard to scale.
– Easy to scale reads
– Hard to scale writes
What’s NoSQL?
§ The misleading term “NoSQL” is short for “Not Only SQL”.
§ Non-relational, schema-free, non-(quite)-ACID
– More on ACID transactions later in class
§ Horizontally scalable, distributed, easy replication support
§ Simple API
Four (emerging) NoSQL Categories
§ Key-value (K-V) stores
– Based on Distributed Hash Tables / Amazon’s Dynamo paper*
– Data model: (global) collection of K-V pairs
– Example: Voldemort
§ Column Families
– BigTable clones**
– Data model: big table, column families
– Example: HBase, Cassandra, Hypertable
* G. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 07
** F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06
Four (emerging) NoSQL Categories
§ Document databases
– Inspired by Lotus Notes
– Data model: collections of K-V collections
– Example: CouchDB, MongoDB
§ Graph databases
– Inspired by Euler & graph theory
– Data model: nodes, relations, K-V on both
– Example: AllegroGraph, VertexDB, Neo4j
Focus of Different Data Models
Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”
C-A-P “theorem”
§ Consistency
§ Availability
§ Partition Tolerance
[Figure: diagram positioning RDBMS and NoSQL (most) relative to these three properties]
When to use NoSQL?
§ Bigness
§ Massive write performance
– Twitter generates 7 TB per day (2010)
§ Fast key-value access
§ Flexible schema or data types
§ Schema migration
§ Write availability
– Writes need to succeed no matter what (CAP, partitioning)
§ Easier maintainability, administration and operations
§ No single point of failure
§ Generally available parallel computing
§ Programmer ease of use
§ Use the right data model for the right problem
§ Avoid hitting the wall
§ Distributed systems support
§ Tunable CAP tradeoffs
From http://highscalability.com/
Key-Value Stores
id      hair_color  age  height
1923    Red         18   6’0”
3371    Blue        34   NA
…       …           …    …
[Figure: the table in a relational db (left) next to the same data as a store/domain in a key-value db (right)]
§ Find users whose age is above 18?
§ Find all attributes of user 1923?
§ Find users whose hair color is Red and age is 19? (Join operation)
§ Calculate average age of all grad students?
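The questions above differ in how well they fit a key-value interface. A minimal sketch, assuming a plain Python dict stands in for the store (this is not the API of Voldemort or any other real system; `get` and `scan` are hypothetical helper names): a lookup by id is a single operation, while any query on attribute values has to visit every record.

```python
# Hypothetical in-memory stand-in for a key-value store: key -> attribute map.
store = {
    1923: {"hair_color": "Red", "age": 18, "height": "6'0\""},
    3371: {"hair_color": "Blue", "age": 34},  # height is NA, so the attribute is simply absent
}

def get(key):
    """Key lookup: the operation a key-value store is built for."""
    return store.get(key)

def scan(predicate):
    """Any query that is not a key lookup degenerates into a full scan."""
    return [key for key, attrs in store.items() if predicate(attrs)]

all_attributes_of_1923 = get(1923)
older_than_18 = scan(lambda attrs: attrs.get("age", 0) > 18)
red_and_19 = scan(lambda attrs: attrs.get("hair_color") == "Red" and attrs.get("age") == 19)
```

Note that the second record has no `height` entry at all, which is how a schema-free store typically represents a missing value.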
Voldemort in LinkedIn
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)
Voldemort vs MySQL
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)
Column Families – BigTable like
F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06
BigTable Data Model
The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page.
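The caption above can be made concrete with nested dictionaries. This is only a sketch of the data model, not of how Bigtable or HBase store data; the helper names `row_key` and `read_latest`, the integer timestamps, and the cell values are illustrative assumptions.

```python
def row_key(host: str) -> str:
    """Reverse the dot-separated host so pages of one domain sort next to each other."""
    return ".".join(reversed(host.split(".")))

# table[row][column_family][qualifier][timestamp] -> value  (assumed layout)
table = {
    row_key("www.cnn.com"): {
        "contents": {"": {6: "<html>...v3", 5: "<html>...v2"}},
        "anchor": {
            "cnnsi.com": {9: "CNN"},       # anchor text used by a page that links here
            "my.look.ca": {8: "CNN.com"},
        },
    }
}

def read_latest(row: str, family: str, qualifier: str = "") -> str:
    """Return the most recent version stored in one cell."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]
```

Reversing the host name means that rows for `www.cnn.com` and `money.cnn.com` share the prefix `com.cnn`, so a range scan over that prefix touches one domain only.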
BigTable Performance
Document Database - MongoDB
[Figure: a table in a relational db next to documents in a collection]
§ Initial release 2009
§ Open source, document db
§ JSON-like documents with dynamic schema
MongoDB Product Deployment
And much more…
Graph Database
Data Model Abstraction:
• Nodes
• Relations
• Properties
Neo4j - Build a Graph
Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”
A Debatable Performance Evaluation
Conclusion
§ Use the right data model for the right problem
THE HADOOP ECOSYSTEM
Single vs Cluster
§ 4 TB HDDs are coming out
§ Cluster?
– How many machines?
– Handle machine and drive failure
– Need redundancy, backup, ...
3% of 100K HDDs fail in <= 3 months
Source: Failure Trends in a Large Disk Drive Population, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
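To see why drive failure cannot be ignored at cluster scale, the quoted figure can be turned into a daily rate. This is back-of-the-envelope arithmetic only; treating "3 months" as 90 days and spreading failures evenly over that period are assumptions made for illustration.

```python
drives = 100_000
failure_fraction = 0.03   # 3% fail within the first 3 months (figure quoted above)
days = 90                 # assumption: "3 months" taken as roughly 90 days

failed = drives * failure_fraction   # drives lost over the period
per_day = failed / days              # average failures per day if spread evenly
print(f"{failed:.0f} failed drives, about {per_day:.0f} per day")
```

At that rate a drive is lost somewhere in the cluster many times per day, so recovery has to be automatic rather than a manual operation.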
Hadoop
§ Open source software
– Reliable, scalable, distributed computing
§ Can handle thousands of machines
§ Written in Java
§ A simple programming model
§ HDFS (Hadoop Distributed File System)
– Fault tolerant (can recover from failures)
http://hadoop.apache.org
Idea and Solution
§ Issue: Copying data over a network takes time
§ Idea:
– Bring computation close to the data
– Store files multiple times for reliability
§ Map-Reduce addresses these problems
– Google’s computational/data manipulation model
– Elegant way to work with big data
– Storage infrastructure: file system
• Google: GFS. Hadoop: HDFS
– Programming model
• Map-Reduce
Map-Reduce [Dean and Ghemawat 2004]
§ Abstraction for simple computing
– Hides details of parallelization, fault-tolerance, data-balancing
– MUST read! http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
Hadoop vs NoSQL
§ Hadoop: computing framework
– Supports data-intensive applications
– Includes MapReduce, HDFS etc. (we will study MR mainly next)
§ NoSQL: Not only SQL databases
– Can be built ON Hadoop. E.g. HBase.
Storage Infrastructure
§ Problem:
– If nodes fail, how to store data persistently?
§ Answer:
– Distributed File System:
• Provides global file namespace
• Google GFS; Hadoop HDFS
§ Typical usage pattern
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common
Distributed File System
§ Chunk servers
– File is split into contiguous chunks
– Typically each chunk is 16-64 MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks
§ Master node
– a.k.a. NameNode in Hadoop’s HDFS
– Stores metadata about where files are stored
– Might be replicated
§ Client library for file access
– Talks to master to find chunk servers
– Connects directly to chunk servers to access data
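The chunking and replication rules above can be sketched in a few lines. This is an illustration of the idea only, not the placement policy of GFS or HDFS; the function names, the round-robin rack choice, and the rack labels are assumptions.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the upper end of the range quoted above
REPLICATION = 3

def split_into_chunks(file_size: int, chunk_size: int = CHUNK_SIZE):
    """Return (offset, length) for each contiguous chunk of a file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

def place_replicas(num_chunks: int, racks: list, replication: int = REPLICATION):
    """Assign every chunk to `replication` distinct racks, round-robin (assumed policy)."""
    if replication > len(racks):
        raise ValueError("need at least as many racks as replicas")
    return {
        chunk_id: [racks[(chunk_id + r) % len(racks)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }

chunks = split_into_chunks(200 * 1024 * 1024)            # a 200 MB file
placement = place_replicas(len(chunks), ["rack-a", "rack-b", "rack-c", "rack-d"])
```

The `placement` dictionary plays the role of the master's metadata: a client would ask for it once and then read the chunks directly from the listed locations.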
Programming Model: MapReduce
Warm-up task:
§ We have a huge text document
§ Count the number of times each distinct word appears in the file
§ Sample application:
– Analyze web server logs to find popular URLs
Task: Word Count
Case 1:
– File too large for memory, but all <word, count> pairs fit in memory
Case 2:
§ Count occurrences of words:
– words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
§ Case 2 captures the essence of MapReduce
– Great thing is that it is naturally parallelizable
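The pipeline in Case 2 can be mirrored step by step in Python. This is a sketch under stated assumptions: `words` here splits on letters and lowercases the text, which the original `words` command is not said to do, and `sort_uniq_count` is a hypothetical name for the `sort | uniq -c` stage.

```python
import re
from itertools import groupby

def words(text: str):
    """Return the words in the text, in order (the `words` stage)."""
    return re.findall(r"[a-z']+", text.lower())

def sort_uniq_count(tokens):
    """Equivalent of `sort | uniq -c`: sort, then count each run of equal items."""
    return [(word, sum(1 for _ in run)) for word, run in groupby(sorted(tokens))]

counts = sort_uniq_count(words("the crew of the space shuttle saw the crew"))
```

The three stages line up with MapReduce: `words` is the map, `sorted` is the group-by-key step, and counting each run is the reduce.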
MapReduce: Overview
§ Sequentially read a lot of data
§ Map:
– Extract something you care about
§ Group by key: Sort and Shuffle
§ Reduce:
– Aggregate, summarize, filter or transform
§ Write the result
Outline stays the same, Map and Reduce change to fit the problem
MapReduce: The Map Step
[Figure: map is applied to each input key-value pair and produces intermediate key-value pairs]
MapReduce: The Reduce Step
[Figure: intermediate key-value pairs are grouped by key into key-value groups; reduce is applied to each group and produces output key-value pairs]
More Specifically
§ Input: a set of key-value pairs
§ Programmer specifies two methods:
– Map(k, v) → <k’, v’>*
• Takes a key-value pair and outputs a set of key-value pairs
– E.g., key is the filename, value is a single line in the file
• There is one Map call for every (k, v) pair
– Reduce(k’, <v’>*) → <k’, v’’>*
• All values v’ with same key k’ are reduced together and processed in v’ order
• There is one Reduce function call per unique key k’
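The two signatures above are enough to write a tiny single-machine driver. This is a sketch, not Hadoop's API: `map_reduce`, `count_words_in_line`, and `sum_values` are hypothetical names, everything runs in one process, and the values are passed to reduce in arrival order rather than sorted.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn and reduce_fn over (k, v) pairs, entirely in memory."""
    # Map: one call per input (k, v) pair, each yielding zero or more (k', v') pairs.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(map_fn(k, v))
    # Group by key: collect all v' that share the same k'.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce: one call per unique k'.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

def count_words_in_line(filename, line):
    # key is the filename, value is a single line in the file
    yield (filename, len(line.split()))

def sum_values(filename, counts):
    yield (filename, sum(counts))

lines = [("a.txt", "one two"), ("a.txt", "three"), ("b.txt", "four five six")]
words_per_file = map_reduce(lines, count_words_in_line, sum_values)
```

Swapping in a different pair of functions changes the computation without touching the driver, which is the point of the abstraction.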
MapReduce: Word Counting
Big document:
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need …
MAP (provided by the programmer): Read input and produce a set of (key, value) pairs
(The,1) (crew,1) (of,1) (the,1) (space,1) (shuttle,1) (Endeavor,1) (recently,1) …
Group by key: Collect all pairs with the same key
(crew,1) (crew,1) (space,1) (the,1) (the,1) (the,1) (shuttle,1) (recently,1) …
Reduce (provided by the programmer): Collect all values belonging to the key and output (key, value) pairs
(crew,2) (space,1) (the,3) (shuttle,1) (recently,1) …
The data is read sequentially; only sequential reads are needed.
Word Count Using MapReduce
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
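The pseudocode above translates almost line for line into Python generators. The sketch below is a single-process illustration; `run` is a hypothetical stand-in for the group-by-key step that the framework would normally perform, and splitting on whitespace is an assumption about what counts as a word.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of the document
    for w in value.split():
        yield (w, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterable of counts
    result = 0
    for v in values:
        result += v
    yield (key, result)

def run(documents):
    """Stand-in for the framework: map every document, group by key, then reduce."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for k, v in map_fn(name, text):
            groups[k].append(v)
    return dict(pair for k, vs in groups.items() for pair in reduce_fn(k, vs))

counts = run({"doc.txt": "the crew of the space shuttle"})
```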
Map-Reduce (MR) as SQL
§ select count(*) from DOCUMENT group by word
[Figure annotations on the query: Mapper, Reducer]
Map-Reduce: Environment
Map-Reduce environment takes care of:
§ Partitioning the input data
§ Scheduling the program’s execution across a set of machines
§ Performing the group by key step
§ Handling machine failures
§ Managing required inter-machine communication
Map-Reduce: A diagram
Big document
MAP: Read input and produce a set of key-value pairs
Group by key: Collect all pairs with the same key (Hash merge, Shuffle, Sort, Partition)
Reduce: Collect all values belonging to the key and output
Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work
Map-Reduce
§ Programmer specifies:
– Map and Reduce and input files
§ Workflow:
– Read inputs as a set of key-value pairs
– Map transforms input kv-pairs into a new set of k'v'-pairs
– Sorts & shuffles the k'v'-pairs to output nodes
– All k'v'-pairs with a given k' are sent to the same reduce
– Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
– Write the resulting pairs to files
§ All phases are distributed with many tasks doing the work
[Figure: Input 0, Input 1, Input 2 → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → Out 0, Out 1]
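The rule that all pairs with a given k' reach the same reduce task is usually enforced by a partitioning function of the form hash(k') mod R, where R is the number of reduce tasks. A minimal sketch of that idea follows; using `zlib.crc32` as the hash is an assumption made here so the result is stable across runs, and `partition` and `shuffle` are hypothetical names.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Pick the reduce task for a key; the same key always maps to the same task."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route every (k', v') pair to the bucket of its reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for k, v in pairs:
        buckets[partition(k, num_reducers)].append((k, v))
    return buckets

buckets = shuffle([("the", 1), ("crew", 1), ("the", 1), ("space", 1)], num_reducers=2)
```

Because the function depends only on the key, every map task can route its output independently and still agree with all other map tasks on the destination.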
Data Flow
§ Input and final output are stored on a distributed file system (FS):
– Scheduler tries to schedule map tasks “close” to physical storage location of input data
§ Intermediate results are stored on local FS of Map and Reduce workers
§ Output is often input to another MapReduce task
Coordination: Master
§ Master node takes care of coordination:
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
§ Master pings workers periodically to detect failures
Dealing with Failures
§ Map worker failure
– Map tasks completed or in-progress at worker are reset to idle
– Reduce workers are notified when task is rescheduled on another worker
§ Reduce worker failure
– Only in-progress tasks are reset to idle
– Reduce task is restarted
§ Master failure
– MapReduce task is aborted and client is notified
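The worker-failure rules above amount to a small piece of bookkeeping on the master. The sketch below is illustrative only; the `Task` record and `handle_worker_failure` are hypothetical names, not part of Hadoop. Completed map tasks are redone because their output sits on the failed worker's local file system, whereas completed reduce output is already on the distributed file system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    status: str = "idle"           # "idle", "in-progress", or "completed"
    worker: Optional[str] = None   # worker the task is (or was) assigned to

def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks lost with a failed worker and return them for rescheduling."""
    reset = []
    for t in tasks:
        if t.worker != failed_worker:
            continue
        # In-progress work is always lost; completed work is lost only for map tasks.
        lost = t.status == "in-progress" or (t.kind == "map" and t.status == "completed")
        if lost:
            t.status, t.worker = "idle", None
            reset.append(t)
    return reset

tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in-progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in-progress", "w1"),
    Task("map", "completed", "w2"),
]
rescheduled = handle_worker_failure(tasks, "w1")
```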
PROBLEMS SUITED FOR MAP-REDUCE
Example: Host size
§ Suppose we have a large web corpus
§ Look at the metadata file
– Lines of the form: (URL, size, date, …)
§ For each host, find the total number of bytes
– That is, the sum of the page sizes for all URLs from that particular host
§ Other examples:
– Link analysis and graph processing
– Machine Learning algorithms
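The host-size task has the same shape as word count: the map emits (host, size) and the reduce sums. A minimal single-process sketch follows; the comma-separated line layout, the sample URLs, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def map_fn(line):
    # Assumed line layout: URL, size in bytes, date, ... separated by commas.
    url, size, *_ = [field.strip() for field in line.split(",")]
    yield (urlparse(url).netloc, int(size))

def reduce_fn(host, sizes):
    yield (host, sum(sizes))

def host_sizes(lines):
    """Group the mapped sizes by host, then reduce each group to a total."""
    groups = defaultdict(list)
    for line in lines:
        for host, size in map_fn(line):
            groups[host].append(size)
    return dict(pair for host, sizes in groups.items() for pair in reduce_fn(host, sizes))

metadata = [
    "http://example.com/a.html, 1200, 2018-10-01",
    "http://example.com/b.html, 800, 2018-10-02",
    "http://example.org/index.html, 500, 2018-10-02",
]
totals = host_sizes(metadata)
```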
Example: Language Model
§ Statistical machine translation:
– Need to count number of times every 5-word sequence occurs in a large corpus of documents
§ Very easy with MapReduce:
– Map:
• Extract (5-word sequence, count) from document
– Reduce:
• Combine the counts
In HW5
§ You’ll deal with n-grams
– An n-gram is a contiguous sequence of n items from a given sequence of text or speech
§ Example
§ Sentence: “the rain in Spain falls mainly on the plain”
– 2-grams: the rain, rain in, in Spain, Spain falls, etc.
– 3-grams: the rain in, rain in Spain, in Spain falls, …
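The definition above can be checked on the example sentence with a short function. This is a sketch that assumes words are separated by whitespace; `ngrams` is a hypothetical helper name.

```python
def ngrams(text: str, n: int):
    """Return every contiguous run of n whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the rain in Spain falls mainly on the plain"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```

A sentence of 9 words yields 8 2-grams and 7 3-grams, since each n-gram starts at a distinct position and the last n - 1 positions cannot start one.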
In HW5
§ You will work with the Google 4-gram corpus. Example:
– analysis is often described 1991 10 1 1
– analysis is often described 1992 30 2 1
§ We will ask you to
– Find total occurrence counts (this will be similar to just word count)
• in the example above, “analysis is often described” occurs a total of 10 + 30 = 40 times.
– Convert 4-grams to 2-grams (think what should be the mapper and reducer for this)
• Example: “analysis is often described” will give rise to the following 2-grams: analysis is, is often, often described
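The total-count task can be sketched on the two example lines. The field layout assumed here (four words, then the year, then the occurrence count, then two further counts) is inferred from the example and the 10 + 30 = 40 arithmetic above, not from a format specification; the function names are hypothetical.

```python
from collections import defaultdict

def map_fn(line):
    # Assumed layout: four words, year, occurrence count, and two further counts.
    fields = line.split()
    yield (" ".join(fields[:4]), int(fields[5]))

def total_counts(lines):
    totals = defaultdict(int)
    for line in lines:
        for gram, count in map_fn(line):
            totals[gram] += count   # reduce step: sum the counts per 4-gram
    return dict(totals)

totals = total_counts([
    "analysis is often described 1991 10 1 1",
    "analysis is often described 1992 30 2 1",
])
```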
Degree of graph Example
§ Find degree of every node in a graph
Example: In a friendship graph, what is the number of friends of every person:
Node 6 = 1, Node 2 = 3, Node 4 = 3, Node 1 = 2, Node 3 = 2, Node 5 = 3
Degree of each node in a graph
§ Suppose you have the edge list == a table!
Schema? Edges(from, to)
Edge list: (6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4), ...
Degree of each node in a graph
§ Suppose you have the edge list == a table!
Schema? Edges(from, to)
Edge list: (6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4), ...
SQL for degree list?
SELECT from, count(*) FROM Edges GROUP BY from
Degreeofeachnodeinagraph
§ So in SQL (remember):
SELECT "from", COUNT(*) FROM Edges GROUP BY "from"
§ MapReduce?
Mapper: emit(from, 1)
Reducer: emit(from, count())
Edges (from, to): (6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4), ...
i.e., essentially equivalent to the ‘word-count’ example :)
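The mapper and reducer for the degree count can be sketched in plain Python. This runs locally and only imitates the MapReduce flow; the `map_reduce` driver is a hypothetical stand-in for Hadoop's shuffle step, and the input is just the six edge rows visible on the slide.

```python
from collections import defaultdict


def mapper(edge):
    """For each edge (from, to), emit (from, 1)."""
    src, _dst = edge
    yield src, 1


def reducer(node, ones):
    """Emit (from, count): the number of 1s collected for this node."""
    return node, sum(ones)


def map_reduce(edges):
    """Stand-in for the shuffle phase: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for edge in edges:
        for key, value in mapper(edge):
            groups[key].append(value)
    return dict(reducer(node, ones) for node, ones in sorted(groups.items()))


# Only the rows visible on the slide; the full edge list is longer.
edges = [(6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4)]
degrees = map_reduce(edges)
# degrees == {3: 1, 4: 3, 5: 1, 6: 1}
```

Replacing the node id with a word gives exactly the word-count job: the mapper emits a 1 per occurrence and the reducer sums them.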
![Page 62: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/62.jpg)
Conclusions
§ Hadoop is a distributed data-intensive computing framework
§ MapReduce
– Simple programming paradigm
– Surprisingly powerful (may not be suitable for all tasks though)
§ Hadoop has a specialized File System and a Master-Slave Architecture to scale up
![Page 63: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/63.jpg)
NoSQL and Hadoop
§ Hot area with several new problems
– Good for academic research
– Good for industry
= Fun AND Profit :)
![Page 64: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/64.jpg)
POINTERS AND FURTHER READING
![Page 65: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/65.jpg)
Implementations
§ Google
– Not available outside Google
§ Hadoop
– An open-source implementation in Java
– Uses HDFS for stable storage
– Download: http://lucene.apache.org/hadoop/
§ Aster Data
– Cluster-optimized SQL database that also implements MapReduce
![Page 66: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/66.jpg)
Cloud Computing
§ Ability to rent computing by the hour
– Additional services, e.g., persistent storage
§ Amazon’s “Elastic Compute Cloud” (EC2)
§ Aster Data and Hadoop can both be run on EC2
![Page 67: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/67.jpg)
Reading
§ Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
– http://labs.google.com/papers/mapreduce.html
§ Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System
– http://labs.google.com/papers/gfs.html
![Page 68: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/68.jpg)
Resources
§ Hadoop Wiki
– Introduction
• http://wiki.apache.org/lucene-hadoop/
– Getting Started
• http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
– Map/Reduce Overview
• http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
• http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
– Eclipse Environment
• http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
§ Javadoc
– http://lucene.apache.org/hadoop/docs/api/
![Page 69: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/69.jpg)
Resources
§ Releases from Apache download mirrors
– http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
§ Nightly builds of source
– http://people.apache.org/dist/lucene/hadoop/nightly/
§ Source code from subversion
– http://lucene.apache.org/hadoop/version_control.html
![Page 70: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-12.pdf · § simple API Prakash 2018 VT CS 4604 10 Four (emerging) NoSQL Categories](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6e4c39268f941a8e28fc57/html5/thumbnails/70.jpg)
Further Reading
§ Programming model inspired by functional language primitives
§ Partitioning/shuffling similar to many large-scale sorting systems
– NOW-Sort ['97]
§ Re-execution for fault tolerance
– BAD-FS ['04] and TACC ['97]
§ Locality optimization has parallels with Active Disks/Diamond work
– Active Disks ['01], Diamond ['04]
§ Backup tasks similar to Eager Scheduling in Charlotte system
– Charlotte ['96]
§ Dynamic load balancing solves similar problem as River's distributed queues
– River ['99]