compsci516 database systems lecture 20 (additional ... · mongodb • mongodb is an open source...

29
CompSci 516 Database Systems Lecture 20 (Additional/Optional Slides) NoSQL/MongoDB Instructor: Sudeepa Roy Duke CS, Fall 2018 CompSci 516: Database Systems 1

Upload: others

Post on 08-May-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

CompSci 516DatabaseSystems

Lecture20(Additional/OptionalSlides)

NoSQL/MongoDB

Instructor:Sudeepa Roy

DukeCS,Fall2018 CompSci516:DatabaseSystems 1

Page 2: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

DukeCS,Fall2018 CompSci516:DatabaseSystems 2

AdditionalandOptionalSlidesonMongoDB

(MaybeusefulforHW3)https://docs.mongodb.comhttps://docs.mongodb.com/manual/reference/sql-comparison/

Page 3: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

MongoDB

• MongoDBisanopensourcedocumentstorewritteninC++• providesindexesoncollections• lockless• providesadocumentquerymechanism• supportsautomaticsharding• Replicationismostlyusedforfailover• doesnotprovidetheglobalconsistencyofatraditionalDBMS

– butyoucangetlocalconsistencyontheup-to-dateprimarycopyofadocument

• supportsdynamicquerieswithautomaticuseofindices,likeRDBMSs

• alsosupportsmap-reduce– helpscomplexaggregationsacrossdocs

• providesatomicoperationsonfields

DukeCS,Fall2018 CompSci516:DatabaseSystems 3

Optionalslide:Readyourself

Page 4: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

MongoDB:AtomicOpsonFields• Theupdatecommandsupports“modifiers”thatfacilitateatomic

changestoindividualvalues– $setsetsavalue– $inc incrementsavalue– $pushappendsavaluetoanarray– $pushAll appendsseveralvaluestoanarray– $pullremovesavaluefromanarray,and$pullAll removesseveral

valuesfromanarray• Sincetheseupdatesnormallyoccur“inplace”,theyavoidthe

overheadofareturntriptotheserver• Thereisan“updateifcurrent”conventionforchangingadocument

onlyiffieldvaluesmatchagivenpreviousvalue• MongoDBsupportsafindAndModify commandtoperforman

atomicupdateandimmediatelyreturntheupdateddocument– usefulforimplementingqueuesandotherdatastructuresrequiring

atomicity

DukeCS,Fall2018 CompSci516:DatabaseSystems 4

Optionalslide:Readyourself

Page 5: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

MongoDB:Index• MongoDBindicesareexplicitlydefinedusinganensureIndex call– anyexistingindicesareautomaticallyusedforqueryprocessing

• Tofindallproductsreleasedlastyear(2015)orlatercostingunder$100youcouldwrite:

• db.products.find({released:{$gte:newDate(2015,1,1,)},price{‘$lte’:100},})

DukeCS,Fall2018 CompSci516:DatabaseSystems 5

Optionalslide:Readyourself

Page 6: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

MongoDB:Data

• MongoDBstoresdatainabinaryJSON-likeformatcalledBSON– BSONsupportsboolean,integer,float,date,stringandbinarytypes

–MongoDBcanalsosupportlargebinaryobjects,eg.imagesandvideos

– Thesearestoredinchunksthatcanbestreamedbacktotheclientforefficientdelivery

DukeCS,Fall2018 CompSci516:DatabaseSystems 6

Optionalslide:Readyourself

Page 7: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

MongoDB:Replication

• MongoDBsupportsmaster-slavereplicationwithautomaticfailoverandrecovery– Replication(andrecovery)isdoneatthelevelofshards

– Replicationisasynchronousforhigherperformance,sosomeupdatesmaybelostonacrash

DukeCS,Fall2018 CompSci516:DatabaseSystems 7

Optionalslide:Readyourself

Page 8: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

AdditionalSlidesonMongoDBandJSON

DukeCS,Fall2018 CompSci516:DatabaseSystems 8

ACK:SlidesbyProf.JunYang

Page 9: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

JSON(JavaScriptObjectNotation)

• Verylightweightdataexchangeformat– MuchlessverboseandeasiertoparsethanXML– IncreasinglyusedfordataexchangeoverWeb:manyWebAPIsuseJSONtoreturnresponses/results

• BasedonJavaScript– ConformstoJavaScriptobject/arraysyntax—youcandirectlymanipulateJSONrepresentationsinJavaScript

• Butithasgainedwidespreadsupportbyallprogramminglanguages

9

Page 10: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

ExampleJSONvs.XML

10

<bibliography><book ISBN="ISBN-10" price="80.00"><title>Foundations of Databases</title><author>Abiteboul</author><author>Hull</author><author>Vianu</author><publisher>Addison Wesley</publisher><year>1995</year><section><title>Section 1</title><section><title>Section 1.1</title></section><section><title>Section 1.2</title></section>

</section><section><title>Section 2</title>

</section></book>

</bibliography>

[{ "ISBN": "ISBN-10","price": 80.00,"title": "Foundations of Databases","authors": [ "Abiteboul", "Hull", "Vianu" ],"publisher": "Addison Wesley","year": 1995,"sections": [{ "title": "Section 1","sections": [{ "title": "Section 1.1” },{ "title": "Section 1.2” }

]},{ "title": "Section 2" }

]}, … …

]

Page 11: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

JSONdatamodel

• Twobasicconstructs– Array:comma-separatedlistof“things”enclosedbybrackets

• Orderisimportant– Object:comma-separatedsetofpairsenclosedbybraces;eachpairconsistsofanattributename(string)andavalue(any“thing”)• Orderisunimportant• Attributenames“should”beuniquewithinanobject

• Simpletypes:numbers,strings(indoublequotes),andspecialvalues“true”,“false”,and“null”

• Thing=asimplevalueoranarrayoranobject11

[{ "ISBN": "ISBN-10","price": 80.00,"title": "Foundations of Databases","authors": [ "Abiteboul", "Hull", "Vianu" ],"publisher": "Addison Wesley","year": 1995,"sections": [{ "title": "Section 1","sections": [{ "title": "Section 1.1” },{ "title": "Section 1.2” }

]},{ "title": "Section 2" }

]}, … …

]

Page 12: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

JSONSchema

• Recalltheadvantagesofhavingaschema– Definesastructure,helpscatcherrors,facilitatesexchange/automation,informsoptimization…

• JustlikerelationaldataandXML,JSONisgettingaschemastandardtoo!– Upandcoming,butstilladraftatthisstage

12

{"definitions": {"sections": {"type": "array","description": "Sections.","sections": {"$ref":"#definitions/sections"},"minItems": 0

}},"title": "Book","type": "object","properties": {"ISBN": {"type": "string","description": "The book's ISBN number."

},"price": {"type": "number","description": "The book's price.","exclusiveMinimum": 0

},... ..."sections": {"$ref":"#definitions/sections"},}

}... ...

}

Page 13: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

MongoDB

• Oneofthe“NoSQL”posterchildren• Startedin2007• Targetingsemi-structureddatainJSON• Designedtobeeasyto“scaleout”• Goodsupportforindexing,partitioning,replication

• NiceintegrationinWebdevelopmentstacks• Not-so-greatsupportforjoins(orcomplexqueries)ortransactions

13

Page 14: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

InsideaMongoDBdatabase

• Database=anumberof“collections”• Collection=alistof“documents”• Document=aJSONobject– Musthavean_id attributewhosevaluecanuniquelyidentifyadocumentwithinthecollection

☞Inotherwords,adatabasehascollectionsofsimilarlystructured“documents”– Muchliketablesofrecords,asopposedtoonebigXMLdocumentthatcontainsalldata

14

Page 15: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

QueryingMongoDB

• find() andsort()– Analogoustosingle-tableselection/projection/sort

• “Aggregation”pipeline– With“stages”analogoustorelationaloperators– Join,group-by,restructuring,etc.

• MapReduce:– Supportsuser-definedfunctions– Wewillsavethistopicuntillaterinthiscourse

☞Wewon’tcoversyntaxforcreating/updatingMongoDBdatabasesinlecture– See“Help”ofthecoursewebsiteandreadthemanuals!

15

Page 16: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Keyfeaturestolookoutfor

• QuerieswrittenasJSONobjectsthemselves!– Naturalinsomecases(e.g.,forspecifyingconditionsonsubsetsofattributes),butawkward/misleadinginothers

• Simplepathexpressionsusingthe“dotnotation”– AnalogoustoXPath“/”

• Arrayswithinobjects– Workonnestedarraydirectlyusingconstructslikedot-indexnotation,$elemMatch,$map,and$filter

– Or“unnest”anarraysoitselementsgetpairedwiththeownerobjectinturnforpipelineprocessing• Afundamentalconceptinworkingwithnesteddata

16

Page 17: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

BasicMongoDBfind()

• Allbooksdb.bib.find()

• Bookswithtitle“FoundationsofDatabases”db.bib.find({ title: "Foundations of Databases" })

• Bookswhosetitlecontains“Database”or“database”andwhosepriceislowerthan$50db.bib.find({ title:/[dD]atabase/, price:{$lt:50} })

• Bookswithpricebetween$70and$100db.bib.find({$and:[{price:{$gte:70}}, {price:{$lte:100}}]})

– Bytheway,whywouldn’tthefollowingwork?db.bib.find({ price:{$gte:70}, price:{$lte:100} })

• BooksauthoredbyWidomdb.bib.find({ authors: "Widom" })

– Notetheimplicitexistentialquantification 17

• Assumedb referstothedatabaseanddb.bib referstothecollectionofbooks

• Add.toArray() atendtogetprettyoutput• YouneedtodothisforHomework3!

Page 18: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Nogeneral“twig”matching!

• Supposeforamomentpublisher isanobjectitself,withattributesname,state,andcountry

• Thefollowingquerywon’tgetyoudatabasebooksbyUSpublishers:db.bib.find({ title: /[dD]atabase/,

publisher: { country: "US" } })

– Instead,theconditiononpublisher issatisfiedonlyifitisanobjectwithexactlyoneattribute,andthisattributemustbenamedcountry andhasvalue"US"

– WhathappensisthatMongoDBcheckstheequalityagainst{country: "US"} asanobject,notasapattern!

18

Page 19: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Moreonnestedstructures

• DotnotationforXPath-likepathexpressions– Bookswheresomesubsectiontitlecontains“1.1”

db.bib.find({ "sections.sections.title": /1\.1/ })• Notewethatneedtoquotetheexpression• Again,iftheexpressionreturnsmultiplethings,theconditiononlyneedstoholdforatleastoneofthem

• Use$elemMatch toensurethatthesamearrayelementsatisfiesmultipleconditions,e.g.:db.bib.find({ sections: { $elemMatch: {

title: /Section/,"sections.title": /1\.1/

}}})

• Dotnotationforspecifyingarrayelements– BookswhosefirstauthorisAbiteboul

db.bib.find({ "authors.0": "Abiteboul" })• Note0-basedindexing;again,needtoquotetheexpression

19

Page 20: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

find() withprojectionandsorting

• Listjustthebookpricesandnothingelsedb.bib.find({ price: { $exists: true } },

{ _id: 0, price: 1 })– The(optional)secondargumenttofind() specifieswhattoproject:1meanstoreturn,0meanstoomit• _id isreturnedbydefaultunlessotherwisespecified

• Listbooksbutnotsubsections,orderedbyISBNdb.bib.find({}, {"sections.sections":0}).sort({ISBN:1})

– Outputfromfind() isfurthersortedbysort() ,where1/-1meanascending/descendingorder

☞“Aggregationpipelines”(next)arebettersuitedforconstructingmorecomplexoutput

20

Page 21: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

MongoDBaggregationpipeline

• Idea:thinkofaqueryasperformingasequenceof“stages,”eachtransforminganinputsequenceofJSONobjectstoanoutputsequenceofJSONobjects

• “Aggregation”isamisnomer:thereareallkindsofstages– Selection($match),projection($project),sorting($sort)

• Muchofwhichfind() andsort() alreadydo– Computing/addingattributeswithgeneralizedprojection($project/$addFields),unnesting embeddedarrays($unwind),andrestructuringoutput($replaceRoot)• Operatorstotransform/filterarrays($map/$filter)

– Join($lookup)– Groupingandaggregation($group)

• Operatorstoaggregate(e.g.,$sum)orcollectintoanarray($push)21

Page 22: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Example:ThecongressMongoDBdatabase

• Twocollections,people andcommittees– Eachobjectinpeople isalegislator• roles =arrayofobjects

– Eachobjectincommittees isacommittee• members =arrayofobjects• subcommittees =anarrayofsubcommitteeobjects,eachwithitsownmembers array• Eachmemberobject’sid fieldreferencesalegislator

_id

22

Page 23: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

23

[{

"_id" : "B000944","birthday" : ISODate("1952-11-09T00:00:00Z"),"gender" : "M","name" : "Sherrod Brown","roles" : [{"district" : 13,"enddate" : ISODate("1995-01-03T00:00:00Z"),"party" : "Democrat","startdate" : ISODate("1993-01-05T00:00:00Z"),"state" : "OH","type" : "rep"

},{"district" : 13,"enddate" : ISODate("1997-01-03T00:00:00Z"),"party" : "Democrat","startdate" : ISODate("1995-01-04T00:00:00Z"),"state" : "OH","type" : "rep"

}, … …]

},… …

]

[{"_id" : "HSAG","displayname" : "House Committee on Agriculture","type" : "house","members" : [{"id" : "C001062","role" : "Chair"

},{"id" : "G000289"

}, … …],"subcommittees" : [{"code" : "15","displayname" : "Conservation and Forestry","members" : [{"id" : "L000491","role" : "Chair"

},{"id" : "T000467"

}, … …]

}, … …]

},… …

]

Page 24: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Selection/projection/sorting

FindRepublicanlegislators,outputonlytheirnameandgender,sortbyname

db.people.aggregate([{ $match: {

"roles.party": "Republican”} },{ $project: {

_id: false,name: true,gender: true

} },{ $sort: {

name: 1} }

])

24

• aggregate() takesanarrayofstages• Noteagainquotingthedotnatation• Noteagainthesemanticsofcomparinga

listofvalues:i.e.,thequeryfindslegislatorswhohaveeverservedrolesasRepublicans

Page 25: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Generalizedprojection

FindRepublicanlegislators,outputtheirname,gender,androlesasanarrayoftypes(sen orrep)

db.people.aggregate([{ $match: {

"roles.party": "Republican"} },{ $addFields: {

compact_roles: {$map: { input: "$roles",

as: "role",in: "$$role.type" }

}} },{ $project: {

_id: false,name: true,gender: true,roles: "$compact_roles"

} }])

25

• Use“: "$xxx" ”totellMongoDBtointerpretxxx asafieldinthe“current”objectinsteadofjustastringliteral

• In$map,as definesanewvariabletoloopoverelementsintheinput array

• Foreachinputelement,$mapcomputesthein expressionandappendsitsvaluetotheoutputarray• Use“: "$$xxx" ”totellMongoDB

thatxxx isanewvariablecreatedduringexecution(asopposedtoafieldinthecurrentobject)

Page 26: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Unnesting andrestructuring

Createalistofsubcommittees:foreach,simplydisplayitsnameandthenameofthecommitteeitbelongsto

db.committees.aggregate([{ $unwind: "$subcommittees" },{ $replaceRoot: { newRoot: {

committee: "$displayname",subcommittee: "$subcommittees.displayname”

} } }])

26

Foreachinputcommittee,$unwind loopsoveritssubcommittees array,oneelementatatime,andoutputsacopyofthecommitteeobject,withitssubcommittees valuereplacedwiththissingleelement

Page 27: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Join

Foreachcommittee(ignoreitssubcommittees),displayitsnameandthenameofitschairdb.committees.aggregate([{ $addFields: {

chair_member: { $filter: {input: "$members",as: "member",cond: { $eq: ["$$member.role",

"Chairman"] }} }

} },{ $lookup: {

from: "people",localField: "chair_member.id",foreignField: "_id",as: "chair_person"

} },{ $project: {

_id: false,name: "$displayname",chair: { $arrayElemAt:["$chair_person.name",0] }

} },])

27

• In$lookup,localField specifiestheattributeinthecurrentobjectwhosevaluewillbeusedforlookup

• from specifiesthecollectioninwhichtolookforjoiningobjects;foreignField specifiestheattributethereintobejoined

• $lookup createsanattributeinthecurrentobjectwiththenamespecifiedbyas,andsetsitvaluetoanarrayholdingalljoiningobjects

☞ Non-equalityjoinsarealsopossible,withmorecomplexsyntax

$arrayElemAt extractsanarrayelementbyitsindex("chair_person.0.name" doesn’tworkhere)

• $filter filtersinput arrayaccordingtocond andproducesandoutputarray

Page 28: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Groupingandaggregation

• Countlegislatorsbygender,andlistthenamesoflegislatorsforeachgender

db.people.aggregate([{ $group: {

_id: "$gender",count: { $sum: 1 },list: { $push: "$name" }

} }])

28

• Therequired_id specifiesthegroupingexpression,whosevaluebecomestheidentifyingattributeofoutputobjects(onepergroup)

• Otherattributesholdaggregatevalues,computedusingaggregationoperators• $sum computeatotalbyaddingeachinput• $push createsanarraybyappendingeachinput

Page 29: CompSci516 Database Systems Lecture 20 (Additional ... · MongoDB • MongoDB is an open source document store written in C++ • provides indexes on collections • lockless •

Summaryanddiscussion

• JSONislikemuchmorelightweightversionofXML– Butperhapsnotasgoodformixedcontents

• WritingqueriesJSONissometimesconvenient,butconfusinginmanysituations

• Queryasaspipeline:lessdeclarative,butarguablyeasiertoimplement(especiallytoparallelize)

• Nestedstructuresrequiresmorequeryconstructs– $unwind stage,$elemMatch/$map/$filter/$push/$arrayElemAt operators,etc.– Distinctionbetweenthetop-levelandnestedarraysisannoying

• E.g.,$match stageand$filter operatorbasicallydothesamething• XQueryismuchnicerinthisregard(withabilitytonestqueriesinreturn)

☞ ThereisactuallyXQuery-likelanguageforJSONcalled“JSONiq,”butitremainslessknown

29