l21: joins 2 - northeastern university

208 L21: Joins 2 CS3200 Database design (sp18 s2) https://course.ccs.neu.edu/cs3200sp18s2/ 4/2/2018

Post on 23-Dec-2021


Page 1: L21: Joins 2 - Northeastern University

208

L21: Joins 2

CS3200 Database design (sp18 s2)
https://course.ccs.neu.edu/cs3200sp18s2/
4/2/2018

Page 2: L21: Joins 2 - Northeastern University

209

Announcements!

• Please pick up your exam if you have not yet
• Changed class calendar
• Outline today
  - Joins
  - Relational algebra

• Next class
  - Query Optimizations

Page 3: L21: Joins 2 - Northeastern University

210

Page 4: L21: Joins 2 - Northeastern University

211

Page 5: L21: Joins 2 - Northeastern University

212

Group Projects: what is your experience?

Source: Found on the Web as variation of http://www.inquisitr.com/160288/graph-what-i-learned-from-group-projects/

Page 6: L21: Joins 2 - Northeastern University

213

Page 7: L21: Joins 2 - Northeastern University

214

BNLJ: Some quick facts.

• We use M buffer pages as:
  - 1 page for S
  - 1 page for output
  - M-2 pages for R

• If P(R) <= M-2
  - then we do one pass over S, and we run in time P(R) + P(S) + OUT.
  - Note: This is optimal for our cost model!
  - Thus, if min{P(R), P(S)} <= M-2 we should always use BNLJ

• We use this at the end of hash join. We define the end condition: one of the buckets is smaller than M-2!

BNLJ cost: P(R) + (P(R) / (M-2)) * P(S) + OUT
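The buffer split above (M-2 pages for R, 1 page for S, 1 page for output) can be sketched in memory as follows. This is only an illustration under stated assumptions: relations are Python lists of tuples, the join attribute is the first field, and the names `pages`, `bnlj`, and `rows_per_page` are hypothetical helpers, not part of the lecture.

```python
from typing import List, Tuple


def pages(rel: List[Tuple], rows_per_page: int) -> List[List[Tuple]]:
    """Split a relation into fixed-size 'pages' (lists of rows)."""
    return [rel[i:i + rows_per_page] for i in range(0, len(rel), rows_per_page)]


def bnlj(R: List[Tuple], S: List[Tuple], M: int, rows_per_page: int = 1):
    """Block nested loop join on the first field.

    Returns (joined_rows, page_reads). Output writes (OUT) are not counted.
    """
    if M < 3:
        raise ValueError("need at least 3 buffer pages")
    r_pages, s_pages = pages(R, rows_per_page), pages(S, rows_per_page)
    block = M - 2                      # pages of R held in memory at once
    out, reads = [], 0
    for i in range(0, len(r_pages), block):
        chunk_pages = r_pages[i:i + block]
        reads += len(chunk_pages)      # read up to M-2 pages of R
        chunk = [r for page in chunk_pages for r in page]
        for s_page in s_pages:         # one full pass over S per chunk of R
            reads += 1
            for s in s_page:
                for r in chunk:
                    if r[0] == s[0]:
                        out.append(r + s[1:])
    return out, reads


R = [(5, "b"), (3, "j"), (0, "a")]
S = [(7, "f"), (0, "j"), (3, "g")]
rows_big, reads_big = bnlj(R, S, M=5)      # P(R) <= M-2: one pass over S
rows_small, reads_small = bnlj(R, S, M=3)  # one page of R per pass over S
```

With one tuple per page, the first call reads P(R) + P(S) = 6 pages, and the second reads P(R) + (P(R) / (M-2)) * P(S) = 12 pages, matching the cost expression.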

Page 8: L21: Joins 2 - Northeastern University

215

Smarter than Cross-Products: From Quadratic to Nearly Linear

• All joins that compute the full cross-product have some quadratic term
  - For example we saw:
    NLJ:  P(R) + T(R) * P(S) + OUT
    BNLJ: P(R) + (P(R) / (M-2)) * P(S) + OUT

• Next we’ll see some (nearly) linear joins:
  - ~O(P(R) + P(S) + OUT), where again OUT could be quadratic but is usually better

We get this gain by taking advantage of structure - moving to equality constraints (“equijoin”) only!

Page 9: L21: Joins 2 - Northeastern University

216

Index Nested Loop Join (INLJ)

Compute R ⋈ S on A:
  Given index idx on S.A:
  for r in R:
    for s in idx(r[A]):
      yield r, s

Cost: P(R) + T(R) * L + OUT

where L is the IO cost to access all the distinct values in the index; assuming these fit on one page, L ~ 3 is a good estimate.

→ We can use an index (e.g. B+ Tree) to avoid doing the full cross-product!
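A minimal runnable version of the INLJ loop, assuming an in-memory dictionary as a stand-in for the index on S.A (a real system would use something like a B+ Tree on disk). The helper names `build_index` and `inlj` are hypothetical, and the join attribute is assumed to be the first field of each tuple.

```python
from collections import defaultdict


def build_index(S, key=0):
    """Map each join-key value to the list of S tuples carrying it."""
    idx = defaultdict(list)
    for s in S:
        idx[s[key]].append(s)
    return idx


def inlj(R, idx, key=0):
    """For each r in R, probe the index instead of scanning all of S."""
    for r in R:
        for s in idx.get(r[key], []):   # .get avoids inserting empty entries
            yield r, s


R = [(5, "b"), (3, "j"), (0, "a")]
S = [(7, "f"), (0, "j"), (3, "g")]
pairs = list(inlj(R, build_index(S)))
```

Each tuple of R triggers one index probe, which is where the T(R) * L term in the cost comes from.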

Page 10: L21: Joins 2 - Northeastern University

217

Better Join Algorithms

• 2. Sort-Merge Join (SMJ)

• 3. Hash Join (HJ)

• Comparison: SMJ vs. HJ

Page 11: L21: Joins 2 - Northeastern University

218

2. Sort-Merge Join (SMJ)

Page 12: L21: Joins 2 - Northeastern University

219

What we will learn next

• Sort-Merge Join

• “Backup” & Total Cost

• Optimizations

Page 13: L21: Joins 2 - Northeastern University

220

Sort Merge Join (SMJ): Basic Procedure

• To compute R ⋈ S on A:

• Sort R, S on A using external merge sort

• Scan sorted files and “merge”

• [May need to “backup” - see next subsection]

Note that if R, S are already sorted on A, SMJ will be awesome!

Note that we are only considering equality join conditions here

Page 14: L21: Joins 2 - Northeastern University

221

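SMJ Example: R ⋈ S on A with 3 page buffer

• For simplicity: Let each page be one tuple, and let the first value be A

[Diagram: main-memory buffer (empty) above the disk; on disk, not yet sorted, R holds the tuples (0,a), (3,j), (5,b) and S holds (0,j), (3,g), (7,f)]

We show the file HEAD, which is the next value to be read!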

Page 15: L21: Joins 2 - Northeastern University

222

SMJ Example: R ⋈ S on A with 3 page buffer

• 1. Sort the relations R, S on the join key (first value)

[Diagram: after sorting, on disk R = (0,a), (3,j), (5,b) and S = (0,j), (3,g), (7,f)]

Page 16: L21: Joins 2 - Northeastern University

223

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Scan and “merge” on join key!

[Diagram: the buffer holds (0,a) from R and (0,j) from S; still on disk are (3,j), (5,b) from R and (3,g), (7,f) from S]

Page 17: L21: Joins 2 - Northeastern University

224

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Scan and “merge” on join key!

[Diagram: the buffer holds (0,a) and (0,j); the keys match, so (0,a,j) is written to the output; still on disk are (3,j), (5,b) from R and (3,g), (7,f) from S]

Page 18: L21: Joins 2 - Northeastern University

225

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Scan and “merge” on join key!

[Diagram: the buffer holds (3,j) from R and (3,g) from S; the output now contains (0,a,j) and (3,j,g); still on disk are (5,b) from R and (7,f) from S]

Page 19: L21: Joins 2 - Northeastern University

226

SMJ Example: R ⋈ S on A with 3 page buffer

• 2. Done!

[Diagram: both files have been fully scanned; the output contains (0,a,j) and (3,j,g)]
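The example can be traced with a short in-memory sketch. It assumes no duplicate join keys (as in this example), uses Python's built-in sort as a stand-in for external merge sort, and treats the first field as the join attribute A; `smj_unique` is a hypothetical name.

```python
def smj_unique(R, S):
    """Sort-merge join on the first field, assuming unique join keys."""
    R = sorted(R, key=lambda t: t[0])   # stand-in for external merge sort
    S = sorted(S, key=lambda t: t[0])
    i = j = 0
    out = []
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1                      # advance the side with the smaller key
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            out.append(R[i] + S[j][1:])  # keys match: emit the joined tuple
            i += 1
            j += 1
    return out


result = smj_unique([(5, "b"), (3, "j"), (0, "a")],
                    [(7, "f"), (0, "j"), (3, "g")])
```

Here `result` holds the two joined tuples from the example, (0,a,j) and (3,j,g), and each input is scanned only once after sorting.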

Page 20: L21: Joins 2 - Northeastern University

227

What happens with duplicate join keys?

Page 21: L21: Joins 2 - Northeastern University

228

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Diagram: sorted R begins with (0,a), (0,b); sorted S begins with (0,j), (0,g), followed by (7,f); the buffer holds (0,a) and (0,j)]

Page 22: L21: Joins 2 - Northeastern University

229

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Diagram: the buffer holds (0,a) and (0,j); the keys match, so (0,a,j) is written to the output]

Page 23: L21: Joins 2 - Northeastern University

230

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Diagram: the scan of S advances to (0,g); the buffer holds (0,a) and (0,g); the output now contains (0,a,j) and (0,a,g)]

Page 24: L21: Joins 2 - Northeastern University

231

Multiple tuples with Same Join Key: “Backup”

• 1. Start with sorted relations, and begin scan / merge…

[Diagram: the scan of R advances to (0,b), which must be matched against (0,j) again; the output so far contains (0,a,j) and (0,a,g)]

Have to “backup” in the scan of S and read tuple we’ve already read!

Page 25: L21: Joins 2 - Northeastern University

232

Backup

• At best, no backup → scan takes P(R) + P(S) reads
  - For ex: if no duplicate values in join attribute

• At worst (e.g. full backup each time), scan could take P(R) * P(S) reads!
  - For ex: if all duplicate values in join attribute, i.e. all tuples in R and S have the same value for the join attribute
  - Roughly: For each page of R, we’ll have to back up and read each page of S…

• Often not that bad however, plus we can:
  - Leave more data in buffer (for larger buffers)
  - Can “zig-zag” (see animation)
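A sketch of the merge phase with “backup”, assuming in-memory lists with the join attribute as the first field. The `mark` variable and the `s_visits` counter are illustrative names introduced here; the counter tallies how many times a matching S tuple is read, including re-reads after a backup.

```python
def smj_with_backup(R, S):
    """Sort-merge join that rewinds S when R repeats a join key."""
    R = sorted(R, key=lambda t: t[0])   # stand-in for external merge sort
    S = sorted(S, key=lambda t: t[0])
    out, i, j, s_visits = [], 0, 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            key, mark = R[i][0], j      # mark = start of S's group for this key
            while i < len(R) and R[i][0] == key:
                j = mark                # "backup": rewind S to the group start
                while j < len(S) and S[j][0] == key:
                    out.append(R[i] + S[j][1:])
                    s_visits += 1
                    j += 1
                i += 1
    return out, s_visits


# Duplicate key 0 on both sides, as in the "backup" slides.
rows, visits = smj_with_backup([(0, "a"), (0, "b")],
                               [(0, "j"), (0, "g"), (7, "f")])

# Worst case: every tuple shares one key, so the work is T(R) * T(S).
_, worst = smj_with_backup([(0, x) for x in "abc"], [(0, y) for y in "xyz"])
```

In the first call the two S tuples with key 0 are each read twice (once per matching R tuple); in the all-duplicates case every S tuple is re-read for every R tuple.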

Page 26: L21: Joins 2 - Northeastern University

233

SMJ: Total cost

• Cost of SMJ is cost of sorting R and S…

• Plus the cost of scanning: ~P(R) + P(S)
  - Because of backup: in worst case P(R) * P(S); but this would be very unlikely

• Plus the cost of writing out: ~P(R) + P(S) but in worst case T(R) * T(S)

~Sort(P(R)) + Sort(P(S)) + P(R) + P(S) + OUT

Recall: Sort(N) ≈ 2N * (⌈log_{M-1}(N / 2M)⌉ + 1)

Note: this is using repacking, where we estimate that we can create initial runs of length ~2M

External merge: slides p26
External merge sort: slides p43

Page 27: L21: Joins 2 - Northeastern University

234

SMJ Illustrated

[Diagram: given M buffer pages, the unsorted input relations R and S each go through the Sort Phase (Ext. Merge Sort: split & sort, then merge, merge); the Merge / Join Phase then scans both sorted files and the joined output file is created!]

Page 28: L21: Joins 2 - Northeastern University

235

SMJ vs. BNLJ: Comparison

• If we have M = 100 buffer pages, P(R) = 1000 pages and P(S) = 500 pages:
• Cost for SMJ:
  - Sort:
  - Merge:
  - Sum:

• What is BNLJ?

Page 29: L21: Joins 2 - Northeastern University

236

SMJ vs. BNLJ: Comparison

• If we have M = 100 buffer pages, P(R) = 1000 pages and P(S) = 500 pages:
• Cost for SMJ:
  - Sort: Sort both in two passes: 2*2*1000 + 2*2*500 = 6,000 IOs
  - Merge: Merge phase 1000 + 500 = 1,500 IOs
  - Sum: 7,500 IOs + OUT

• What is BNLJ?
  - 500 + 1000 * (500/98) ≈ 5,500 IOs + OUT (taking 500/98 ≈ 5)

• But, if we have M = 35 buffer pages?
  - Sort Merge has same behavior (still 2 passes)
  - BNLJ? 15,500 IOs + OUT!

SMJ is ~linear vs. BNLJ is quadratic… But it’s all about the memory.
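The arithmetic on this slide can be checked with a small calculator. The function names are hypothetical, OUT is left out of every total, and the sort cost uses the estimate 2N * (ceil(log_{M-1}(N / 2M)) + 1). The BNLJ figures of 5,500 and 15,500 appear to correspond to rounding P(S) / (M-2) down; that is my reading rather than something the slide states, so the sketch exposes the rounding as a parameter.

```python
import math


def sort_passes(n_pages: int, M: int) -> int:
    """Passes of external merge sort with repacking (initial runs of ~2M pages)."""
    runs = n_pages / (2 * M)
    merge_passes = max(0, math.ceil(math.log(runs, M - 1)))
    return merge_passes + 1


def smj_cost(p_r: int, p_s: int, M: int) -> int:
    """Sort both inputs (read + write per pass), then one merge scan of each."""
    sort_r = 2 * p_r * sort_passes(p_r, M)
    sort_s = 2 * p_s * sort_passes(p_s, M)
    return sort_r + sort_s + p_r + p_s


def bnlj_cost(p_outer: int, p_inner: int, M: int, round_up: bool = False) -> int:
    """P(outer) + (P(outer) / (M-2)) * P(inner), with a choice of rounding."""
    ratio = p_outer / (M - 2)
    chunks = math.ceil(ratio) if round_up else math.floor(ratio)
    return p_outer + chunks * p_inner


smj_100 = smj_cost(1000, 500, M=100)
smj_35 = smj_cost(1000, 500, M=35)
bnlj_100 = bnlj_cost(500, 1000, M=100)   # smaller relation S as the outer
bnlj_35 = bnlj_cost(500, 1000, M=35)
```

SMJ stays at 7,500 IOs for both buffer sizes, while BNLJ grows from 5,500 to 15,500. Rounding the number of outer chunks up instead (`round_up=True`) gives 6,500 and 16,500, which does not change the comparison.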

Page 30: L21: Joins 2 - Northeastern University

237

Takeaway points from SMJ

• If input already sorted on join key, skip the sorts.
  - SMJ is basically linear.
  - Nasty but unlikely case: Many duplicate join keys.

• SMJ needs to sort both relations
  - If max{P(R), P(S)} < M^2 then cost is 3(P(R) + P(S)) + OUT

Page 31: L21: Joins 2 - Northeastern University

239

L21: The Relational Model

CS3200 Database design (sp18 s2)
https://course.ccs.neu.edu/cs3200sp18s2/
4/2/2018

Page 32: L21: Joins 2 - Northeastern University

240

Our next focus

• The Relational Model

• Relational Algebra

• Relational Algebra Pt. II [Optional: may skip]

Page 33: L21: Joins 2 - Northeastern University

241

1. The Relational Model & Relational Algebra

Page 34: L21: Joins 2 - Northeastern University

242

What you will learn about in this section

• The Relational Model

• Relational Algebra: Basic Operators

• Execution

Page 35: L21: Joins 2 - Northeastern University

243

Motivation

The Relational model is precise, implementable, and we can operate on it (query/update, etc.)

Database maps internally into this procedural language.

Page 36: L21: Joins 2 - Northeastern University

244

A Little History

• Relational model due to Edgar “Ted” Codd, a mathematician at IBM in 1970
  - "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387

• IBM didn’t want to use relational model (take money from IMS)
  - Apparently used in the moon landing…

Won Turing award 1981

Page 37: L21: Joins 2 - Northeastern University

245

The Relational Model: Schemata

• Relational Schema:

Students(sid: string, name: string, gpa: float)

  - Relation name: Students
  - Attributes: sid, name, gpa
  - String, float, int, etc. are the domains of the attributes

Page 38: L21: Joins 2 - Northeastern University

246

The Relational Model: Data

Student

sid  name   gpa
001  Bob    3.2
002  Joe    2.8
003  Mary   3.8
004  Alice  3.5

An attribute (or column) is a typed data entry present in each tuple in the relation

The number of attributes is the arity of the relation

Page 39: L21: Joins 2 - Northeastern University

247

The Relational Model: Data

Student

sid  name   gpa
001  Bob    3.2
002  Joe    2.8
003  Mary   3.8
004  Alice  3.5

A tuple or row (or record) is a single entry in the table having the attributes specified by the schema

The number of tuples is the cardinality of the relation

Page 40: L21: Joins 2 - Northeastern University

248

The Relational Model: Data

A relational instance is a set of tuples all conforming to the same schema

Recall: In practice DBMSs relax the set requirement, and use multisets (or bags).

Student

sid  name   gpa
001  Bob    3.2
002  Joe    2.8
003  Mary   3.8
004  Alice  3.5

Page 41: L21: Joins 2 - Northeastern University

249

To Reiterate

• A relational schema describes the data that is contained in a relational instance

Let R(f1: Dom1, …, fm: Domm) be a relational schema; then, an instance of R is a subset of Dom1 x Dom2 x … x Domm

In this way, a relational schema R is a total function from attribute names to types
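The "schema as a function from attribute names to types" view can be illustrated with a dictionary. This is only a sketch: Python's `str` and `float` stand in for the domains, rows are modeled as dictionaries keyed by attribute name, and `conforms` is a hypothetical helper.

```python
# Schema: a total mapping from attribute names to types (domains).
students_schema = {"sid": str, "name": str, "gpa": float}

# Instance: a collection of tuples, each keyed by attribute name.
student = [
    {"sid": "001", "name": "Bob", "gpa": 3.2},
    {"sid": "002", "name": "Joe", "gpa": 2.8},
    {"sid": "003", "name": "Mary", "gpa": 3.8},
    {"sid": "004", "name": "Alice", "gpa": 3.5},
]


def conforms(row: dict, schema: dict) -> bool:
    """True if the row has exactly the schema's attributes, each in its domain."""
    return row.keys() == schema.keys() and all(
        isinstance(row[attr], dom) for attr, dom in schema.items()
    )


arity = len(students_schema)   # number of attributes
cardinality = len(student)     # number of tuples
all_valid = all(conforms(row, students_schema) for row in student)
bad_row_ok = conforms({"sid": "005", "name": "Eve", "gpa": "Apple"}, students_schema)
```

In this model attribute names matter and their order does not, since rows are looked up by name.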

Page 42: L21: Joins 2 - Northeastern University

250

One More Time

• A relational schema describes the data that is contained in a relational instance

A relation R of arity t is a function: R : Dom1 x … x Domt → {0,1}

I.e. returns whether or not a tuple of matching types is a member of it

Then, the schema is simply the signature of the function

Note here that order matters, attribute name doesn’t… We’ll (mostly) work with the other model (last slide) in which attribute name matters, order doesn’t!
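The positional view, where a relation is a function into {0,1}, can be sketched as a membership test. `make_relation` is a hypothetical helper; tuples are plain positional Python tuples, so there are no attribute names at all.

```python
def make_relation(rows):
    """Return the characteristic function of a set of positional tuples."""
    members = frozenset(rows)

    def R(*values) -> int:
        return 1 if values in members else 0

    return R


Student = make_relation([
    ("001", "Bob", 3.2),
    ("002", "Joe", 2.8),
    ("003", "Mary", 3.8),
    ("004", "Alice", 3.5),
])

in_relation = Student("001", "Bob", 3.2)
swapped = Student("Bob", "001", 3.2)   # same values, different order
```

Swapping two positions yields a different tuple, which is why order matters in this model while attribute names play no role.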

Page 43: L21: Joins 2 - Northeastern University

251

A relational database

• A relational database schema is a set of relational schemata, one for each relation

• A relational database instance is a set of relational instances, one for each relation

Two conventions:
1. We call relational database instances simply databases
2. We assume all instances are valid, i.e., satisfy the domain constraints

Page 44: L21: Joins 2 - Northeastern University

252

A Course Management System (CMS)

• Relation DB Schema
  - Students(sid: string, name: string, gpa: float)
  - Courses(cid: string, cname: string, credits: int)
  - Enrolled(sid: string, cid: string, grade: string)

Relation Instances

Students

sid  name  gpa
101  Bob   3.2
123  Mary  3.8

Courses

cid  cname  credits
564  564-2  4
308  417    2

Enrolled

sid  cid  grade
123  564  A

Note that the schemas impose effective domain / type constraints, i.e. Gpa can’t be “Apple”

Page 45: L21: Joins 2 - Northeastern University

253

2nd Part of the Model: Querying

“Find names of all students with GPA > 3.5”

SELECT S.name
FROM Students S
WHERE S.gpa > 3.5;

We don’t tell the system how or where to get the data - just what we want, i.e., Querying is declarative

To make this happen, we need to translate the declarative query into a series of operators… we’ll see this next!

Actually, I showed how to do this translation for a much richer language!
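To try the query, here is a minimal sketch using Python's standard `sqlite3` module with an in-memory database loaded with the CMS instances from the previous slide. The column types (TEXT, REAL, INTEGER) are my choice of SQLite equivalents for string, float, and int.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL);
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

INSERT INTO Students VALUES ('101', 'Bob', 3.2), ('123', 'Mary', 3.8);
INSERT INTO Courses  VALUES ('564', '564-2', 4), ('308', '417', 2);
INSERT INTO Enrolled VALUES ('123', '564', 'A');
""")

# The declarative query: we state what we want, not how to fetch it.
cursor = conn.execute("SELECT S.name FROM Students S WHERE S.gpa > 3.5;")
names = [row[0] for row in cursor]
conn.close()
```

One caveat: SQLite uses type affinity rather than strict column types by default, so it would not by itself reject a gpa of "Apple"; stricter systems enforce the domain constraint at insert time.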

Page 46: L21: Joins 2 - Northeastern University

254

Virtues of the model

• Physical independence (logical too), Declarative

• Simple, elegant, clean: Everything is a relation

• Why did it take multiple years?
  - Doubted it could be done efficiently.

Page 47: L21: Joins 2 - Northeastern University

255

2. Relational Algebra

Page 48: L21: Joins 2 - Northeastern University

256

RDBMS Architecture

• How does a SQL engine work?

SQL Query → Relational Algebra (RA) Plan → Optimized RA Plan → Execution

  - SQL Query: Declarative query (from user)
  - Relational Algebra (RA) Plan: Translate to relational algebra expression
  - Optimized RA Plan: Find logically equivalent - but more efficient - RA expression
  - Execution: Execute each operator of the optimized plan!
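As a rough illustration of the "translate to operators, then execute" idea, the earlier query (names of students with GPA > 3.5) can be written as a composition of two relational algebra operators. The functions `select` and `project` below are hypothetical in-memory stand-ins for selection and projection, with tuples modeled as dictionaries; a real engine would also optimize the plan before running it.

```python
def select(predicate, relation):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(attrs, relation):
    """Projection: keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in relation]


students = [
    {"sid": "001", "name": "Bob", "gpa": 3.2},
    {"sid": "002", "name": "Joe", "gpa": 2.8},
    {"sid": "003", "name": "Mary", "gpa": 3.8},
    {"sid": "004", "name": "Alice", "gpa": 3.5},
]

# Plan: project name over (select gpa > 3.5 over Students),
# executed operator by operator from the inside out.
plan_result = project(["name"], select(lambda t: t["gpa"] > 3.5, students))
```

Only Mary qualifies, since Alice's GPA of 3.5 does not satisfy the strict inequality.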