CompSci 516 Database Systems - Duke University

Post on 20-Jun-2020


CompSci 516 Database Systems

Lecture 1: Introduction and Data Models

Instructor: Sudeepa Roy

Duke CS, Fall 2017 CompSci 516: Database Systems

Course Website

• http://www.cs.duke.edu/courses/fall17/compsci516/
• Please check frequently for updates!
• New Room: LSRC D106

Instructor

• Sudeepa Roy
  – sudeepa@cs.duke.edu
  – https://users.cs.duke.edu/~sudeepa/
  – Office hour: Mondays 11:30am-12:30pm, LSRC D325
• About myself
  – Assistant Professor in CS
  – PhD: UPenn, Postdoc: Univ. of Washington
  – Joined Duke CS in Fall 2015
  – Research interests:
    • Databases (theory and applications)
    • Data analysis, causality, explaining answers
    • Uncertain data, data provenance, crowdsourcing

Two (half-)TAs

• Yilin Gao
  – yilin.gao@duke.edu
  – Office hour: Wed, 3-4pm, Location: TBD
• Keping Wang
  – keping.wang@duke.edu
  – Office hour: Thurs, 3-4pm, Location: TBD
• Both CompSci 516 veterans!

Logistics

• Homework submission: Sakai
  – All enrolled students are already there
• Discussion forum: Piazza
  – All enrolled students are already there
  – Send me an email if you have not received a welcome email from Piazza
• Lecture slides will be uploaded before the class
  – but will be updated after the class

Grading

• Three homeworks: 30%
• Project: 15%
• Two midterms: 25 + 25 = 50%
• Class participation: 5%

Grading Strategy

• Relative grading
  – The actual grade distribution at the end will depend on the performance of the entire class on all the components.
  – The topper of the class gets A+ irrespective of the number, and only “above expectation” performances get A+.
  – No fixed lowest grade or grade distribution.
  – Everyone can get a good grade by working hard!

Homework

• Due 2-3 weeks after they are posted / the previous hw is due
  – ALWAYS start early!
• No late days
  – Contact the instructor if you have a *valid* reason to be late
  – Another exam, project, or hw is NOT a valid reason; we will always be fair to all
  – Computer crash / sudden interview trips / medical issues (following official procedures) may count as valid reasons
  – No guarantee that your request will be granted; again, start early!
• To be done individually

Homework Overview

• You will learn how to use traditional and new database systems in the homework
  – Have to learn them mostly on your own, following tutorials available online and with some help from the TA
• HW1: Traditional DBMS
  – SQL and Postgres
• HW2: Distributed data processing
  – Spark and AWS
• HW3: NOSQL
  – e.g. MongoDB

Exams

• Midterm-1: Oct 11 (Wed)
• Midterm-2: Nov 29 (Wed)
• In class
• Closed book, closed notes, no electronic devices
• Total weight: 25% + 25% = 50%
• Exams will test your understanding of the material
• Both exams are comprehensive
  – would include every lecture up to the midterm

Projects

• 15% weight
• In groups of 3-4
  – You can look for group members through Piazza by announcing your general area of interest or if you have a problem in mind
  – Each group member should do approx. equal work
• Show your creativity and researcher side!
• Work done should be at least equivalent to
  – a hw * no. of group members
• All group members will get the same grade

Project Topics

• Anything related to “Data”
  – Data management/processing/cleaning
  – Data visualization
  – Data exploration or analysis
  – Applications of data (to any field)
  – Theoretical findings with data
  – New tool for data analysis
• Choose a project according to your research interest
• You can check out major database conferences for ideas, e.g.
  – Demonstrations (build a prototype solving a problem or improving UI)
    • SIGMOD’17: http://sigmod2017.org/sigmod-program/#posters
    • SIGMOD’16: http://sigmod2016.org/sigmod_demo_list.shtml
    • VLDB’17: http://www.vldb.org/2017/accepted_papers_demo_track.php
    • VLDB’16: http://vldb2016.persistent.com/demonstrations.php
  – Research papers (solve a problem, do experiments with data)
    • Check out papers in SIGMOD and VLDB from recent years
  – You can check out previous years too, and conferences from your own research area

Project Deliverables

1. Project proposal (due: 9/20 (W), 1-3 pages)
  – problem selection is part of the project
  – 3 weeks from now
  – but start asap: look for problems, do related work study, find an interesting question, let me know your initial thoughts, all by the deadline
2. Midterm progress report (due: 10/25 (W), 3-5 pages)
3. Final project report (due: 11/30 (Th), 4-8 pages)
4. A final 5-10 min project presentation and/or demonstration (in the last 1-2 classes)

Project Evaluation Criteria

Scale of 100:
1. Well-motivated? 10
2. Novel? 10
3. Comprehensive related work survey? 10
4. Quality of writing? 10
  – should reflect all other factors too except class presentation
5. Class presentation/demo? 15
  – should reflect all other factors too except writing
6. Technical contributions? 45
  – Problem formulation / Algorithms / Experiments / Theory / System / User interface / Efficiency / Usability / Dataset exploration etc.

Class Participation

• 5% weight
• Includes
  – Participation in class (Q/A)
  – Pop-up quiz (you will get a token by email to enroll in “gradiance”)
    • Participation + correct answering (lowest two scores will be dropped)
  – Evaluating others’ projects during the project presentation

In general,
• Actively participate in the class!
  – Ask questions in class and on Piazza
  – Stop me as many times as you need to understand the lectures
  – Answer each other’s questions on Piazza
• Also send (anonymous or not) feedback, suggestions, or concerns on Piazza
  – there is a “feedback” folder

Reading Material

• Will mostly follow the “cow book” by Ramakrishnan-Gehrke
  – The chapter numbers will be posted
• You do not have to buy the books, but it will be good to consult them from time to time
• You should be prepared to do quite a bit of reading from various books and papers

What is this course about?

• This is a graduate-level database course in CS
• We will cover principles, internals, and applications of database systems in depth
• We will also have an introduction to a few advanced research topics in databases (later in the course)

A Quick Survey

• Have you taken an undergrad database course earlier?
  – CS 316 / equivalent?
• Are you familiar with
  – SQL?
  – RA? (σ, Π, ×, ⨝, ρ, ∪, ∩, −)
  – Keys, foreign keys?
  – Index in databases?
  – Logic: ∧, ∨, ∀, ∃, ¬, ∈, ⇒
  – Transactions?
  – Map-reduce / Spark?
• Have you ever worked with a dataset?
  – relational database, text, csv, XML
• Have you ever used a database system?
  – PostGres, MySQL, SQL Server, SQL Azure

What will be covered?

• Database concepts
  – Data Models, SQL, Views, Constraints, RA, Normalization
• Principles and internals of database management systems (DBMS)
  – Indexing, Query Execution-Algorithms-Optimization, Transactions, Parallel and Distributed Query Processing, MapReduce
• Advanced and research topics in databases
  – e.g. Datalog, NOSQL, Data mining, Data warehouse
  – More will be added in the “TBD” lectures
• We will go fast for some basic topics in databases covered in undergrad db courses
  – Data model, SQL, RA
  – But ask me to slow down if you are not familiar with them

What this course is NOT about

• Spark, AWS, cluster computing…
  – Partially covered in a HW and a lecture
• Machine learning based analytics
• Statistical methods for data analytics
• Python, R, …
• Programming

Background

• You should have some understanding (at the CS undergraduate level) of
  – data structures, discrete maths, algorithms
  – databases
  – or have to learn these yourself as necessary
• Need to pick up new coding frameworks and programming languages on your own
  – and how to process data using them
  – Homework assignments will mostly be self-taught
  – …with help from the TA
• Will involve some mathematical and analytical reasoning too

Why should we care about databases?

• We are in a data-driven world
• “Big Data” is supposed to change the mode of operation for almost every single field
  – Science, Technology, Healthcare, Business, Manufacturing, Journalism, Government, Education, …
• We must know how to collect, store, process, and analyze such data

Why should we care about databases? (Science)

• From “Big Data” wiki: “The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. If all sensor data were recorded in LHC, …. this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world.”

Why should we care about databases? (Technology)

• From “Big Data” wiki:
  – eBay.com uses two data warehouses at 7.5 PB and 40 PB (1 PB = 10^15 bytes) as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising
  – Facebook handles 50 billion photos from its user base
  – As of August 2012, Google was handling roughly 100 billion searches per month

Why should we care about databases? (Healthcare, Media, Manufacturing, Sports, …)

• From “Big Data” wiki:
  – Healthcare: digitization of patients’ data, prescriptive analytics
  – Media: tailor articles and advertisements that reach targeted people, validate claims
    • “Computational Journalism” project in the Duke DB group
  – Manufacturing: supply planning
  – Sports: improve training, understand competitors

Why should we care about databases?

• Simply storing such large datasets in a flat file stops working at some point
  – Need efficient model, storage, and processing
• A DBMS takes care of such issues
  – the user only has to run queries to process such datasets
  – much simpler than writing low-level code

Today

• DBMS
• Data Models
• [RG] 1.1, 1.3-1.5

What is a Database?

• A database is a collection of data
  – typically related and describing activities of an organization

And what does it contain?

• A database may contain information about
  – Entities
    • students, faculty, courses, classroom
  – Relationships between entities
    • students’ enrollment, faculty teaching courses, rooms for courses

Why use a DBMS?

• i.e. why not use a file system and a programming language?
• Suppose a company has a large collection of data on employees, departments, products, sales etc.
• Requirements:
  – Quickly answer questions on data
    • Note that all the data may not fit in main memory
  – Concurrent access: apply changes consistently
  – Restricted access (e.g. salary)

Why use a DBMS?

• A DBMS is a piece of software (i.e. a big program written by someone else) that makes these tasks easier
  – Quick access
  – Robust access
  – Safe access
  – Simpler access
• Next: some nice properties of a DBMS

Why use a DBMS?

1. Data Independence
  – Application programs should not be exposed to the data representation and storage
  – DBMS provides an abstract view of the data
2. Efficient Data Access
  – A DBMS utilizes a variety of sophisticated techniques to store and retrieve data (from disk) efficiently

Why use a DBMS?

3. Data Integrity and Security
  – DBMS enforces “integrity constraints”, e.g. check whether total salary is less than the budget
  – DBMS enforces “access controls”, e.g. whether salary information can be accessed by a particular user
4. Data Administration
  – Centralized professional data administration by experienced users can manage data access, organize data representation to minimize redundancy, and fine-tune the storage
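The integrity-constraint idea in item 3 can be tried out with a small sketch. It uses Python's built-in sqlite3 module rather than the Postgres setup planned for the homework, and the Departments table, its columns, and the helper try_set_total_salary are hypothetical names invented for illustration. The point is only that the DBMS, not the application, rejects an update that would push total salary above the budget.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the CHECK clause is the integrity constraint.
conn.execute(
    """
    CREATE TABLE Departments (
        name         TEXT PRIMARY KEY,
        budget       REAL NOT NULL,
        total_salary REAL NOT NULL,
        CHECK (total_salary <= budget)
    )
    """
)
conn.execute("INSERT INTO Departments VALUES ('CS', 1000000, 800000)")
conn.commit()


def try_set_total_salary(conn, dept, new_total):
    """Return True if the DBMS accepts the update, False if it rejects it."""
    try:
        conn.execute(
            "UPDATE Departments SET total_salary = ? WHERE name = ?",
            (new_total, dept),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The constraint was violated; undo the failed statement's transaction.
        conn.rollback()
        return False


accepted = try_set_total_salary(conn, "CS", 900000)   # within budget
rejected = try_set_total_salary(conn, "CS", 1200000)  # exceeds budget
current_total = conn.execute(
    "SELECT total_salary FROM Departments WHERE name = 'CS'"
).fetchone()[0]
```

SQLite has no user accounts, so the access-control half of item 3 is not shown here; server systems such as Postgres handle that with privileges granted per user.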

Why use a DBMS?

5. Concurrent Access and Crash Recovery
  – DBMS schedules concurrent accesses to the data such that the users think that the data is being accessed by only one user at a time
  – DBMS protects data from system failures
6. Reduced Application Development Time
  – Supports many functions that are common to a number of applications accessing data
  – Provides high-level interface
  – Facilitates quick and robust application development
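As a rough illustration of item 5, the sketch below groups two updates into one transaction and simulates a failure between them. It assumes Python's sqlite3 module and a hypothetical Accounts table; a raised exception stands in for a real crash, so this shows only the all-or-nothing behavior, not an actual recovery from a system failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Accounts (owner TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move money between two accounts as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE owner = ?",
                (amount, src),
            )
            if fail_midway:
                raise RuntimeError("simulated failure between the two updates")
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE owner = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return {owner: balance} for all accounts."""
    return dict(conn.execute("SELECT owner, balance FROM Accounts ORDER BY owner"))


before = balances(conn)
failed_ok = transfer(conn, "alice", "bob", 30, fail_midway=True)
after_failure = balances(conn)   # the half-done debit has been rolled back
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

Without the transaction, the simulated failure would leave alice debited and bob not credited; writing that bookkeeping by hand is the kind of low-level code a DBMS saves the application from.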

When NOT to use a DBMS?

• A DBMS is optimized for certain kinds of workloads and manipulations
• There may be applications with tight real-time constraints or a few well-defined critical operations
• The abstract view of the data provided by the DBMS may not suffice
• To run complex statistical/ML analytics on large datasets

Data Model

• Applications need to model some real world units
• Entities:
  – Students, Departments, Courses, Faculty, Organization, Employee, …
• Relationships:
  – Course enrollments by students, product sales by an organization
• A data model is a collection of high-level data description constructs that hide many low-level storage details

Data Model Can Specify:

1. Structure of the data
  – like arrays or structs in a programming language
  – but at a higher level (conceptual model)
2. Operations on the data
  – unlike a programming language, not any operation can be performed
  – allow limited sets of queries and modifications
  – a strength, not a weakness!
3. Constraints on the data
  – what the data can be
  – e.g. a movie has exactly one title

Important Data Models

• Structured Data
• Semi-structured Data
• Unstructured Data

What are these?

Important Data Models

• Structured Data
  – All elements have a fixed format
  – Relational Model (table)
• Semi-structured Data
  – Some structure but not fixed
  – Hierarchically nested tagged-elements in tree structure
  – XML
• Unstructured Data
  – No structure
  – text, image, audio, video
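The three kinds of data can be put side by side in a short Python sketch. The student values are borrowed from the Students example used later in the lecture; the XML layout, the phone element, and the free-text sentence are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Structured: every record has the same fixed fields, like rows of a relation.
structured = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
]

# Semi-structured: nested tagged elements; records need not share the same fields.
xml_text = (
    "<students>"
    "<student sid='53666'><name>Jones</name><login>jones@cs</login></student>"
    "<student sid='53688'><name>Smith</name><phone>unknown</phone></student>"
    "</students>"
)
semi_structured = ET.fromstring(xml_text)
fields_per_student = [
    sorted(child.tag for child in student) for student in semi_structured
]

# Unstructured: no schema at all, just raw content such as free text.
unstructured = "Jones is 18 and has a GPA of 3.4, while Smith studies in EE."

same_width = all(len(row) == 5 for row in structured)
```

Here every tuple in structured has exactly five fields, whereas the two XML student elements carry different child tags, and the text string carries no fields at all.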

Relational Data Model

• Proposed by Edgar (Ted) Codd in 1970
  – won the Turing Award for it!
• Motivation:
  – Simplicity
  – Better logical and physical data independence

Relational Data Model

• The data description construct is a Relation
  – Represented as a “table”
  – Basically a “set” of records (set semantic)
  – order does not matter
  – and all records are distinct
• However, this is true for the relational model, not for standard DBMSs
  – they allow duplicate rows (bag semantic)
  – unless restricted by key constraints. Why?

Students

sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

Bag: {1,1,2,2,3,2,1,5,6,1}
Set: {1,2,3,5,6}

Bag vs. Set

• Why “bag semantic” and not “set semantic” in standard DBMSs?
  – Primarily performance reasons
  – Duplicate elimination is expensive (requires sorting)
  – Some operations like “projection”s are much more efficient on bags than sets
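The bag/set distinction can be checked with a small sketch, assuming Python's sqlite3 module as a stand-in DBMS. It first uses the bag {1,1,2,2,3,2,1,5,6,1} from the slide, then projects the name column of the Students table with and without duplicate elimination.

```python
import sqlite3
from collections import Counter

# The bag from the slide: duplicates are kept, each with a multiplicity.
bag = [1, 1, 2, 2, 3, 2, 1, 5, 6, 1]
bag_counts = Counter(bag)
as_set = sorted(set(bag))  # duplicate elimination yields the set

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
        ("53831", "Madayan", "madayan@music", 11, 1.8),
        ("53832", "Guldu", "guldu@music", 12, 2.0),
    ],
)

# Plain projection keeps duplicates (bag semantic): one output row per input row.
bag_projection = sorted(r[0] for r in conn.execute("SELECT name FROM Students"))

# DISTINCT asks the DBMS to do the extra duplicate-elimination work (set semantic).
set_projection = sorted(
    r[0] for r in conn.execute("SELECT DISTINCT name FROM Students")
)
```

The plain projection returns five names with Smith twice, while the DISTINCT version returns four; the second query is the one that needs the additional sorting or hashing step mentioned above.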


Relational Data Model

Students

sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

Labels on the table: Attribute/Column/Field, Tuple/Row/Record, Value

What is a poorly chosen attribute in this relation?

• Relational database = a set of relations
• A Relation: made up of two parts
  1. Schema
  2. Instance

Schema and Instance

• One schema can have multiple instances
• Schema:
  – A template for describing an entity/relationship (e.g. students)
  – specifies name of relation + name and type of each column
  – e.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Instance:
  – When we fill in actual data values in a schema
  – a table, has rows and columns
  – each row/tuple follows the schema and domain constraints
  – #Rows = cardinality, #fields = degree or arity
  – example below

Cardinality = 3, degree = 5

sid    name   login        age  gpa
53666  Jones  jones@cs     18   3.4
53688  Smith  smith@ee     18   3.2
53650  Smith  smith1@math  19   3.8
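The schema/instance split can be made concrete with a short sketch, again assuming Python's sqlite3 module. Mapping string to TEXT and real to REAL is an assumption about how the slide's types translate to SQLite, and new_instance and cardinality_and_degree are hypothetical helper names.

```python
import sqlite3

# The schema: relation name plus the name and type of each column.
SCHEMA = (
    "CREATE TABLE Students "
    "(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)


def new_instance(rows):
    """Create a fresh database holding one instance of the Students schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def cardinality_and_degree(conn):
    """Return (#rows, #fields) of the current Students instance."""
    cur = conn.execute("SELECT * FROM Students")
    degree = len(cur.description)      # one description entry per column
    cardinality = len(cur.fetchall())  # one fetched tuple per row
    return cardinality, degree


instance_a = new_instance(
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ]
)
instance_b = new_instance([("53666", "Jones", "jones@cs", 18, 3.4)])

stats_a = cardinality_and_degree(instance_a)
stats_b = cardinality_and_degree(instance_b)
```

Both instances share the same schema, so the degree is 5 in each case; only the cardinality differs (3 for the slide's example, 1 for the smaller instance).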

Levels of Abstractions in a DBMS

• Physical schema
  – Storage as files, row vs. column store, indexes
  – will discuss these in later lectures

[Figure: layers from bottom to top: Disk, Physical Schema, Logical Schema, and three External Schemas]

Levels of Abstractions in a DBMS

• Logical/Conceptual schema
  – describes the stored data in the physical schema
• Decided by conceptual schema design
  – e.g. ER Diagram
    • not covered in this course
  – Normalization
    • will be covered

Students(sid: string, name: string, login: string, age: integer, gpa: real)

Levels of Abstractions in a DBMS

• External schema
  – different “views” of the database to different users
  – will discuss views later
• One physical and logical schema, but there can be multiple external schemas

Data Independence

• Application programs are insulated from changes in the way the data is structured and stored
• A very important property of a DBMS
• Logical and Physical

Logical Data Independence

• Users can be shielded from changes in the logical structure of data
• e.g. Students:
  Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Divide into two relations
  Students_public(sid: string, name: string, login: string)
  Students_private(sid: string, age: integer, gpa: real)
• Still a “view” Students can be obtained using the above new relations
  – by “joining” them with sid
• A user who queries this view Students will get the same answer as before
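The Students split above can be reproduced in a minimal sketch, assuming Python's sqlite3 module. The relation and column names come from the slide; the helper functions and the sample query are hypothetical. The same query text is run against the original single table and against the view built over the two new relations.

```python
import sqlite3

ROWS = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
]

# The application's query: it only ever mentions "Students".
QUERY = "SELECT name, gpa FROM Students WHERE age = 18 ORDER BY sid"


def original_db():
    """Logical schema before the change: one Students relation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
    )
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def split_db():
    """Logical schema after the change: two relations plus a view named Students."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Students_public  (sid TEXT PRIMARY KEY, name TEXT, login TEXT);
        CREATE TABLE Students_private (sid TEXT PRIMARY KEY, age INTEGER, gpa REAL);
        CREATE VIEW Students AS
            SELECT p.sid AS sid, p.name AS name, p.login AS login,
                   q.age AS age, q.gpa AS gpa
            FROM Students_public p JOIN Students_private q ON p.sid = q.sid;
        """
    )
    conn.executemany(
        "INSERT INTO Students_public VALUES (?, ?, ?)", [r[:3] for r in ROWS]
    )
    conn.executemany(
        "INSERT INTO Students_private VALUES (?, ?, ?)",
        [(r[0], r[3], r[4]) for r in ROWS],
    )
    return conn


answer_before = original_db().execute(QUERY).fetchall()
answer_after = split_db().execute(QUERY).fetchall()
```

Because the view keeps the old relation name and columns, the query text does not change and both databases return the same rows.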

Physical Data Independence

• The logical/conceptual schema insulates users from changes in physical storage details
  – how the data is stored on disk
  – the file structure
  – the choice of indexes
• The application remains unaltered
  – But the performance may be affected by such changes
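One way to see physical data independence is to change the choice of indexes and rerun an unchanged query. The sketch below assumes Python's sqlite3 module; the index name idx_students_age and the helper functions are hypothetical, and EXPLAIN QUERY PLAN is a SQLite-specific command whose output format varies between versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)
conn.commit()

# The application's query never mentions how the data is stored.
QUERY = "SELECT sid, name FROM Students WHERE age = 18 ORDER BY sid"


def run_query(conn):
    return conn.execute(QUERY).fetchall()


def query_plan(conn):
    """Return the human-readable plan steps (last column of each plan row)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]


result_before = run_query(conn)
plan_before = query_plan(conn)

# A purely physical change: add an index on age. The query text is untouched.
conn.execute("CREATE INDEX idx_students_age ON Students(age)")

result_after = run_query(conn)
plan_after = query_plan(conn)
```

The answers match before and after the index is created; what may change is the plan the DBMS picks, which can be inspected by printing plan_before and plan_after, and hence the performance.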

Very important

Understand the Course Policy

See “what is allowed / not allowed”

You will be reminded in every hw assignment too