index selection and external sorting€¦ · understanding the workload • a workload is a mix of...

Post on 14-May-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

CompSci 516DatabaseSystems

Lecture9IndexSelection

andExternalSorting

Instructor:Sudeepa Roy

1DukeCS,Fall2017 CompSci516:DatabaseSystems

Announcements• Privateprojectthreadscreatedonpiazza– Pleaseusethesethreads(andnotemails)forallcommunicationsonyourproject

• Projectproposaldeadlineextendeduntil9/28Thursday5pm

• HW2tobepostedsoon–WillreceiveemailwithyourCSemailIDforAWSaccountsfromKeping/Yilin

DukeCS,Fall2017 CompSci516:DatabaseSystems 2

Today

• Indexselection• Externalsort

DukeCS,Fall2017 CompSci516:DatabaseSystems 3

ReadingMaterial

• Index:asinLecture7/8

• Externalsorting:• [RG]– Externalsorting:Chapter13

• [GUW]– Chapter15.4.1

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 4

Acknowledgement:Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.Ramakrishnan andDr.Gehrke.

SelectionofIndexes

DukeCS,Fall2017 CompSci516:DatabaseSystems 5

DifferentFileOrganizations

Searchkey=<age,sal>

Considerfollowingoptions:

• Heapfiles– randomorder;insertatend-of-file

• Sortedfiles– sortedon<age,sal>

• ClusteredB+treefile– searchkey<age,sal>

• Heapfilewithunclustered B+-treeindex– onsearchkey<age,sal>

• Heapfilewithunclustered hashindex– onsearchkey<age,sal>

DukeCS,Fall2017 CompSci516:DatabaseSystems 6

Weneedtounderstandtheimportanceofappropriatefileorganizationandindex

PossibleOperations

• Scan– Fetchallrecordsfromdisktobufferpool

• Equalitysearch– Findallemployeeswithage=23andsal =50– Fetchpagefromdisk,thenlocatequalifyingrecordinpage

• Rangeselection– Findallemployeeswithage>35

• Insertarecord– identifythepage,fetchthatpagefromdisk,insetrecord,writeback

todisk(possiblyotherpagesaswell)

• Deletearecord– similartoinsert

DukeCS,Fall2017 CompSci516:DatabaseSystems 7

UnderstandingtheWorkload• Aworkloadisamixofqueries andupdates

• Foreachqueryintheworkload:– Whichrelationsdoesitaccess?– Whichattributesareretrieved?– Whichattributesareinvolvedinselection/joinconditions?How

selectivearetheseconditionslikelytobe?

• Foreachupdateintheworkload:– Whichattributesareinvolvedinselection/joinconditions?How

selectivearetheseconditionslikelytobe?– Thetypeofupdate(INSERT/DELETE/UPDATE),andtheattributesthatare

affected

DukeCS,Fall2017 CompSci516:DatabaseSystems 8

ChoiceofIndexes

• Whatindexesshouldwecreate?– Whichrelationsshouldhaveindexes?Whatfield(s)shouldbethesearchkey?Shouldwebuildseveralindexes?

• Foreachindex,whatkindofanindexshoulditbe?– Clustered?Hash/tree?

DukeCS,Fall2017 CompSci516:DatabaseSystems 9

MoreonChoiceofIndexes• Oneapproach:– Considerthemostimportantqueries– Considerthebestplanusingthecurrentindexes– Seeifabetterplanispossiblewithanadditionalindex.– Ifso,createit.– Obviously,thisimpliesthatwemustunderstandhowaDBMSevaluatesqueriesandcreatesqueryevaluationplans

– Wewilllearnqueryexecutionandoptimizationlater- Fornow,wediscusssimple1-tablequeries.

• Beforecreatinganindex,mustalsoconsidertheimpactonupdatesintheworkload

DukeCS,Fall2017 CompSci516:DatabaseSystems 10

Trade-offsforIndexes• Indexescanmake– queriesgofaster– updatesslower

• Requirediskspace,too

DukeCS,Fall2017 CompSci516:DatabaseSystems 11

IndexSelectionGuidelines• AttributesinWHEREclausearecandidatesforindexkeys

– Exactmatchconditionsuggestshashindex– Rangequerysuggeststreeindex– Clusteringisespeciallyusefulforrangequeries

• canalsohelponequalityqueriesiftherearemanyduplicates

• Trytochooseindexesthatbenefitasmanyqueriesaspossible– Sinceonlyoneindexcanbeclusteredperrelation,chooseitbasedon

importantqueriesthatwouldbenefitthemostfromclustering

• Multi-attributesearchkeysshouldbeconsideredwhenaWHEREclausecontainsseveralconditions– Orderofattributesisimportantforrangequeries

• Note:clusteredindexshouldbeusedjudiciously– expensiveupdates,althoughcheaperthansortedfiles

DukeCS,Fall2017 CompSci516:DatabaseSystems 12

ExamplesofClusteredIndexes• B+treeindexonE.age canbeusedtogetqualifyingtuples

• Howselectiveisthecondition?– everyone>40,indexnotofmuchhelp,scanisasgood

– Suppose10%>40.Then?

• Dependsoniftheindexisclustered– otherwisecanbemoreexpensivethanalinearscan

– ifclustered,10%I/O(+indexpages)

SELECT E.dnoFROM Emp EWHERE E.age>40

DukeCS,Fall2017 CompSci516:DatabaseSystems 13

Whatisagoodindexingstrategy?

Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?

ExamplesofClusteredIndexesGroup-Byquery

• UseE.age assearchkey?– BadIfmanytupleshaveE.age >10orifnot

clustered….– …usingE.age indexandsortingtheretrieved

tuplesbyE.dno maybecostly

• ClusteredE.dno indexmaybebetter– Firstgroupby,thencounttupleswithage>

10– goodwhenage>10isnottooselective

• Note:thefirstoptionisgoodwhentheWHEREconditionishighlyselective(fewtupleshaveage>10),thesecondisgoodwhennothighlyselective

SELECT E.dno, COUNT (*)FROM Emp EWHERE E.age>10GROUP BY E.dno

DukeCS,Fall2017 CompSci516:DatabaseSystems 14

Whatisagoodindexingstrategy?

Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?

ExamplesofClusteredIndexes

Equalityqueriesandduplicates

• ClusteringonE.hobby helps– hobbynotacandidatekey,several

tuplespossible

• Doesclusteringhelpnow?– (eid =key)– Notmuch– atmostonetuplesatisfiesthe

condition

SELECT E.dnoFROM Emp EWHERE E.hobby=‘Stamps’

DukeCS,Fall2017 CompSci516:DatabaseSystems 15

SELECT E.dnoFROM Emp EWHERE E.eid=50

Whatisagoodindexingstrategy?

Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?

IndexeswithCompositeSearchKeys• CompositeSearchKeys:Searchona

combinationoffields

• Equalityquery:Everyfieldvalueisequaltoaconstantvalue.E.g.wrt<sal,age>index:– age=20andsal =75

• Rangequery:Somefieldvalueisnotaconstant.E.g.:– sal >10– whichcombination(s)

wouldhelp?

– <age,sal>doesnothelp– B+tree on<sal>or<sal,age>helps– hastobeaprefix

sue 13 75

bobcaljoe 12

10

208011

12

name age sal

<sal, age>

<age, sal> <age>

<sal>

12,2012,1011,80

13,75

20,1210,12

75,1380,11

11121213

10207580

Data recordssorted by name

Data entries in indexsorted by <sal,age>

Data entriessorted by <sal>

Examples of composite keyindexes using lexicographic order.

CompositeSearchKeys• ToretrieveEmp recordswithage=30AND sal=4000,anindexon

<age,sal>wouldbebetterthananindexonage oranindexonsal– firstfindage=30,amongthemsearchsal =4000

• Ifconditionis:20<age<30AND 3000<sal<5000:– Clusteredtreeindexon<age,sal>or<sal,age>isbest.

• Ifconditionis:age=30AND 3000<sal<5000:– Clustered<age,sal>indexmuchbetterthan<sal,age>index– moreindexentriesareretrievedforthelatter

• Compositeindexesarelarger,updatedmoreoftenDukeCS,Fall2017 CompSci516:DatabaseSystems 17

Index-OnlyPlans• Anumberofqueriescanbeansweredwithoutretrievingany

tuplesfromoneormoreoftherelationsinvolvedifasuitableindexisavailable

SELECT E.dno, COUNT(*)FROM Emp EGROUP BY E.dno

SELECT E.dno, MIN(E.sal)FROM Emp EGROUP BY E.dno

SELECT AVG(E.sal)FROM Emp EWHERE E.age=25 AND

E.sal BETWEEN 3000 AND 5000

<E.dno>

<E.dno,E.sal>Tree index!

<E. age,E.sal>Tree index!

DukeCS,Fall2017 CompSci516:DatabaseSystems 18

• Forindex-onlystrategies,clusteringisnotimportant

ExternalSorting

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 19

quickreviewofmergesort onwhiteboard

WhySort?• Aclassicproblemincomputerscience• Datarequestedinsortedorder

– e.g.,findstudentsinincreasinggpa order

• SortingisfirststepinbulkloadingB+treeindex• Sortingusefulforeliminatingduplicatecopiesinacollectionofrecords

• Sort-mergejoinalgorithminvolvessorting• Problem:sort1Gbofdatawith1MbofRAM

– needtominimizethecostofdiskaccess

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 20

2-WaySort:Requires3Buffers• SupposeN=2k pagesinthefile• Pass0:Readapage,sortit,writeit.

– repeatforall2k pages– onlyonebufferpageisused

• Pass1:– Readtwopages,sort(merge)themusingoneoutputpage,writethemtodisk– repeat2k-1times– threebufferpagesused

• Pass2,3,4,…..continue

Main memory buffers

INPUT 1

INPUT 2

OUTPUT

DiskDiskDukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 21

Two-WayExternalMergeSort• Eachsortedsub-fileiscalledarun– eachruncancontainmultiplepages

• Eachpassweread+writeeachpageinfile.

• Npagesinthefile,• =>thenumberofpasses

• Sotoal costis:

• Nottoopractical,butusefultolearnbasicconceptsforexternalsorting

é ù= +log2 1N

é ù( )2 12N Nlog +

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS0

PASS1

PASS2

PASS3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,62,6 4,9 7,8 1,3 2

2,34,6

4,78,9

1,35,6 2

2,34,46,78,9

1,23,56

1,22,33,44,56,67,8

9

GeneralExternalMergeSort

• TosortafilewithNpagesusingBbufferpages:– Pass0:useBbufferpages:

• Produce⌈N/B⌉ sortedrunsofBpageseach.– Pass1,2,…,etc.:mergeB-1runs tooneoutputpage

• keepwritingtodiskoncetheoutputpageisfull

B Main memory buffers

INPUT 1

INPUT B-1

OUTPUT

DiskDisk

INPUT 2. . . . . .. . .

• Suppose we have more than 3 buffer pages. • How can we utilize them?

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 23

CostofExternalMergeSort

• Numberofpasses:1+⌈logB-1⌈N/B⌉⌉• Cost=2N*(#ofpasses)– why2times?• E.g.,with5bufferpages,tosort108pagefile:• Pass0:sorting5pagesatatime

– ⌈108/5⌉ =22sortedrunsof5pageseach(lastrunisonly3pages)

• Pass1:4-waymerge– ⌈22/4⌉ =6sortedrunsof20pageseach(lastrunisonly8pages)

• Pass2:4-waymerge– (but2-wayforthelasttworuns)– [6/4⌉ =2sortedruns,80pagesand28pages

• Pass3:2-waymerge(only2runsremaining)– Sortedfileof108pages

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 24

NumberofPassesofExternalSort

N B=3 B=5 B=9 B=17 B=129 B=257100 7 4 3 2 1 11,000 10 5 4 3 2 210,000 13 7 5 4 2 2100,000 17 9 6 5 3 31,000,000 20 10 7 5 3 310,000,000 23 12 8 6 4 3100,000,000 26 14 9 7 4 41,000,000,000 30 15 10 8 5 4

HighBisgood,althoughCPUcostincreases

12

I/OforExternalMergeSort

• If10bufferpages– eithermerge9runsatatimewithoneoutputbuffer– or8runswithtwooutputbuffers

• If#pageI/Oisthemetric– goalisminimizethe#passes– eachpageisreadandwrittenineachpass

• Ifwedecidetoreadablock ofbpagessequentially– Suggestsweshouldmakeeachbuffer(input/output)beablock ofpages

– Butthiswillreducefan-outduringmergepasses• i.e.notasmanyrunscanbemergedagainanymore

– Inpractice,mostfilesstillsortedin2-3passes

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 26

DoubleBuffering• ToreduceCPUwaittimeforI/Orequesttocomplete,canprefetch into`shadowblock’.

OUTPUT

OUTPUT'

Disk Disk

INPUT 1

INPUT k

INPUT 2

INPUT 1'

INPUT 2'

INPUT k'

block sizeb

B main memory buffers, k-way mergeDukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 27

UsingB+TreesforSorting

• Scenario:TabletobesortedhasB+treeindexonsortingcolumn(s)

• Idea:Canretrievedataentries(thenrecords)inorderbytraversingleafpages.

• Isthisagoodidea?• Casestoconsider:

– B+treeisclustered: Goodidea!– B+treeisnotclustered: Couldbeaverybadidea!

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 28

ClusteredB+TreeUsedforSorting

• Cost:roottotheleft-mostleaf,thenretrieveallleafpages(Alternative1)

• IfAlternative2isused?Additionalcostofretrievingdatarecords:eachpagefetchedjustonce

☛ Always better than external sorting!

(Directs search)

Data Records

Index

Data Entries("Sequenceset")

17

UnclusteredB+TreeUsedforSorting

• Alternative(2)fordataentries;eachdataentrycontainsrid ofadatarecord

• Ingeneral,oneI/Operdatarecord!

(Directs search)

Data Records

Index

Data Entries("Sequenceset")

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 30

Summary

• Externalsortingisimportant;DBMSmaydedicatepartofbufferpoolforsorting!

• ExternalmergesortminimizesdiskI/Ocost:– Pass0:ProducessortedrunsofsizeB(#bufferpages)– Laterpasses:mergeruns– #ofrunsmergedatatimedependsonB,andblocksize.– LargerblocksizemeanslessI/Ocostperpage.– Largerblocksizemeanssmaller#runsmerged.– Inpractice,#ofrunsrarelymorethan2or3

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 31

top related