CompSci 516: Database Systems
Lecture 1: Introduction and Data Models
Instructor: Sudeepa Roy
Duke CS, Fall 2017


Course Website

• http://www.cs.duke.edu/courses/fall17/compsci516/
• Please check frequently for updates!
• New room: LSRC D106

Instructor

• Sudeepa Roy
  – [email protected]
  – https://users.cs.duke.edu/~sudeepa/
  – Office hour: Mondays 11:30 am-12:30 pm, LSRC D325
• About myself
  – Assistant Professor in CS
  – PhD: UPenn, Postdoc: Univ. of Washington
  – Joined Duke CS in Fall 2015
  – Research interests:
    • Databases (theory and applications)
    • Data analysis, causality, explaining answers
    • Uncertain data, data provenance, crowdsourcing

Two (half-)TAs

• Yilin Gao
  – [email protected]
  – Office hour: Wed, 3-4 pm, Location: TBD
• Keping Wang
  – [email protected]
  – Office hour: Thurs, 3-4 pm, Location: TBD
• Both are CompSci 516 veterans!

Logistics

• Homework submission: Sakai
  – All enrolled students are already there
• Discussion forum: Piazza
  – All enrolled students are already there
  – Send me an email if you have not received a welcome email from Piazza
• Lecture slides will be uploaded before the class
  – but will be updated after the class

Grading

• Three homework assignments: 30%
• Project: 15%
• Two midterms: 25% + 25% = 50%
• Class participation: 5%

Grading Strategy

• Relative grading
  – The actual grade distribution at the end will depend on the performance of the entire class on all the components.
  – The topper of the class gets an A+ irrespective of the number, and only “above expectation” performances get an A+.
  – No fixed lowest grade or grade distribution.
  – Everyone can get a good grade by working hard!

Homework

• Due 2-3 weeks after they are posted or the previous hw is due
  – ALWAYS start early!
• No late days: contact the instructor if you have a *valid* reason to be late
  – Another exam, project, or hw is NOT a valid reason; we will always be fair to all
  – Computer crash, sudden interview trips, or medical issues (following official procedures) may count as valid reasons
  – No guarantee that your request will be granted, so again, start early!
• To be done individually

Homework Overview

• You will learn how to use traditional and new database systems in the homework
  – You have to learn them mostly on your own, following tutorials available online and with some help from the TA
• HW1: Traditional DBMS
  – SQL and Postgres
• HW2: Distributed data processing
  – Spark and AWS
• HW3: NOSQL
  – e.g. MongoDB

Exams

• Midterm-1: Oct 11 (Wed)
• Midterm-2: Nov 29 (Wed)
• In class
• Closed book, closed notes, no electronic devices
• Total weight: 25% + 25% = 50%
• Exams will test your understanding of the material
• Both exams are comprehensive
  – each would include every lecture up to that midterm

Projects

• 15% weight
• In groups of 3-4
  – You can look for group members through Piazza by announcing your general area of interest or a problem you have in mind
  – Each group member should do approximately equal work
• Show your creativity and researcher side!
• Work done should be at least equivalent to
  – a hw × no. of group members
• All group members will get the same grade

Project Topics

• Anything related to “Data”
  – Data management/processing/cleaning
  – Data visualization
  – Data exploration or analysis
  – Applications of data (to any field)
  – Theoretical findings with data
  – New tool for data analysis
• Choose a project according to your research interest
• You can check out major database conferences for ideas, e.g.
  – Demonstrations (build a prototype solving a problem or improving UI)
    • SIGMOD’17: http://sigmod2017.org/sigmod-program/#posters
    • SIGMOD’16: http://sigmod2016.org/sigmod_demo_list.shtml
    • VLDB’17: http://www.vldb.org/2017/accepted_papers_demo_track.php
    • VLDB’16: http://vldb2016.persistent.com/demonstrations.php
  – Research papers (solve a problem, do experiments with data)
    • Check out papers in SIGMOD and VLDB from recent years
  – You can check out previous years too, and conferences from your own research area

Project Deliverables

1. Project proposal (due: 9/20 (W), 1-3 pages)
  – problem selection is part of the project
  – 3 weeks from now
  – but start asap: look for problems, do a related work study, find an interesting question, and let me know your initial thoughts, all by the deadline
2. Midterm progress report (due: 10/25 (W), 3-5 pages)
3. Final project report (due: 11/30 (Th), 4-8 pages)
4. A final 5-10 min project presentation and/or demonstration (in the last 1-2 classes)

Project Evaluation Criteria

Scale of 100:
1. Well-motivated? 10
2. Novel? 10
3. Comprehensive related work survey? 10
4. Quality of writing? 10
  – should reflect all other factors too, except class presentation
5. Class presentation/demo? 15
  – should reflect all other factors too, except writing
6. Technical contributions? 45
  – Problem formulation / Algorithms / Experiments / Theory / System / User interface / Efficiency / Usability / Dataset exploration etc.

Class Participation

• 5% weight
• Includes
  – Participation in class (Q/A)
  – Pop-up quiz (you will get a token by email to enroll in “gradiance”)
    • Participation + correct answering (lowest two scores will be dropped)
  – Evaluating others’ projects during the project presentation

In general,
• Actively participate in the class!
  – Ask questions in class and on Piazza
  – Stop me as many times as you need to understand the lectures
  – Answer each other’s questions on Piazza
• Also send (anonymous or not) feedback, suggestions, or concerns on Piazza
  – there is a “feedback” folder

Reading Material

• Will mostly follow the “cow book” by Ramakrishnan and Gehrke
  – The chapter numbers will be posted
• You do not have to buy the books, but it will be good to consult them from time to time
• You should be prepared to do quite a bit of reading from various books and papers

What is this course about?

• This is a graduate-level database course in CS
• We will cover principles, internals, and applications of database systems in depth
• We will also have an introduction to a few advanced research topics in databases (later in the course)

A Quick Survey

• Have you taken an undergrad database course earlier?
  – CS 316 / equivalent?
• Are you familiar with
  – SQL?
  – RA? (σ, π, ×, ⨝, ρ, ∪, ∩, −)
  – Keys, foreign keys?
  – Index in databases?
  – Logic: ∧, ∨, ∀, ∃, ¬, ∈, ⇒
  – Transactions?
  – Map-reduce / Spark?
• Have you ever worked with a dataset?
  – relational database, text, csv, XML
• Have you ever used a database system?
  – PostgreSQL, MySQL, SQL Server, SQL Azure

What will be covered?

• Database concepts
  – Data Models, SQL, Views, Constraints, RA, Normalization
• Principles and internals of database management systems (DBMS)
  – Indexing; Query Execution, Algorithms, and Optimization; Transactions; Parallel and Distributed Query Processing; MapReduce
• Advanced and research topics in databases
  – e.g. Datalog, NOSQL, Data mining, Data warehouse
  – More will be added in the “TBD” lectures
• We will go fast for some basic topics in databases covered in undergrad db courses
  – Data model, SQL, RA
  – But ask me to slow down if you are not familiar with them

What this course is NOT about

• Spark, AWS, cluster computing…
  – Partially covered in a HW and a lecture
• Machine learning based analytics
• Statistical methods for data analytics
• Python, R, …
• Programming

Background

• You should have some understanding (at the CS undergraduate level) of
  – data structures, discrete maths, algorithms
  – databases
  – or have to learn these yourself as necessary
• Need to pick up new coding frameworks and programming languages on your own
  – and how to process data using them
  – Homework assignments will mostly be self-taught
  – …with help from the TA
• Will involve some mathematical and analytical reasoning too

Why should we care about databases?

• We are in a data-driven world
• “Big Data” is supposed to change the mode of operation for almost every single field
  – Science, Technology, Healthcare, Business, Manufacturing, Journalism, Government, Education, …
• We must know how to collect, store, process, and analyze such data

Why should we care about databases?

Science

• From “Big Data” wiki: “The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. If all sensor data were recorded in LHC, …. this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world.”

Why should we care about databases?

Technology

• From “Big Data” wiki:
  – eBay.com uses two data warehouses at 7.5 PB (1 PB = 10^15 bytes) and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising
  – Facebook handles 50 billion photos from its user base
  – As of August 2012, Google was handling roughly 100 billion searches per month

Why should we care about databases?

Healthcare, Media, Manufacturing, Sports, …

• From “Big Data” wiki:
  – Healthcare: digitization of patients’ data, prescriptive analytics
  – Media: tailor articles and advertisements that reach targeted people, validate claims
    • “Computational Journalism” project in the Duke DB group
  – Manufacturing: supply planning
  – Sports: improve training, understand competitors

Why should we care about databases?

• Simply storing such large datasets in a flat file stops working at some point
  – Need an efficient model, storage, and processing
• A DBMS takes care of such issues
  – the user only has to run queries to process such datasets
  – much simpler than writing low-level code
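To make the "queries vs. low-level code" point concrete, here is a minimal sketch that answers the same question twice: once by parsing a flat file by hand and once with a declarative query. It uses Python's built-in `sqlite3` module as a stand-in DBMS; the sample rows, the function names, and the GPA threshold are illustrative assumptions, not course material.

```python
import csv
import io
import sqlite3

# Hypothetical sample data, standing in for a much larger flat file.
FLAT_FILE = """sid,name,gpa
53666,Jones,3.4
53688,Smith,3.2
53650,Smith,3.8
"""


def names_above_flat_file(text, threshold):
    """Low-level approach: parse every record and filter it by hand."""
    reader = csv.DictReader(io.StringIO(text))
    return sorted(row["name"] for row in reader if float(row["gpa"]) > threshold)


def names_above_dbms(conn, threshold):
    """DBMS approach: state what is wanted; the system decides how to get it."""
    rows = conn.execute(
        "SELECT name FROM Students WHERE gpa > ? ORDER BY name", (threshold,)
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [("53666", "Jones", 3.4), ("53688", "Smith", 3.2), ("53650", "Smith", 3.8)],
)

flat_result = names_above_flat_file(FLAT_FILE, 3.3)
dbms_result = names_above_dbms(conn, 3.3)
```

Both versions return the same names here. The difference is that the hand-written scan hard-codes the file format and the access strategy, while the query leaves storage and access paths to the DBMS.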

Today

• DBMS
• Data Models
• [RG] 1.1, 1.3-1.5

What is a Database? And what does it contain?

• A database is a collection of data
  – typically related and describing the activities of an organization
• A database may contain information about
  – Entities
    • students, faculty, courses, classrooms
  – Relationships between entities
    • students’ enrollment, faculty teaching courses, rooms for courses

Why use a DBMS?

• i.e. why not use a file system and a programming language?
• Suppose a company has a large collection of data on employees, departments, products, sales, etc.
• Requirements:
  – Quickly answer questions on data
    • Note that all the data may not fit in main memory
  – Concurrent access: apply changes consistently
  – Restricted access (e.g. salary)

Why use a DBMS?

• A DBMS is a piece of software (i.e. a big program written by someone else) that makes these tasks easier
  – Quick access
  – Robust access
  – Safe access
  – Simpler access
• Next: some nice properties of a DBMS

Why use a DBMS?

1. Data Independence
  – Application programs should not be exposed to the data representation and storage
  – The DBMS provides an abstract view of the data
2. Efficient Data Access
  – A DBMS utilizes a variety of sophisticated techniques to store and retrieve data (from disk) efficiently

Why use a DBMS?

3. Data Integrity and Security
  – The DBMS enforces “integrity constraints”, e.g. checks whether the total salary is less than the budget
  – The DBMS enforces “access controls”, e.g. whether salary information can be accessed by a particular user
4. Data Administration
  – Centralized professional data administration: experienced users can manage data access, organize the data representation to minimize redundancy, and fine-tune the storage
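A minimal sketch of an integrity constraint being enforced by the system rather than by application code, using Python's built-in `sqlite3`. The `Employees` table, the salary cap, and the helper `try_insert` are hypothetical. The slide's example (total salary below a budget) spans many rows; this sketch uses a simpler per-row `CHECK` instead. SQLite does not implement user-level access controls, so those are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employees (
        eid    INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        -- Hypothetical per-row rule: salaries must lie in a fixed range.
        salary REAL NOT NULL CHECK (salary >= 0 AND salary <= 200000)
    )
    """
)


def try_insert(conn, eid, name, salary):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO Employees VALUES (?, ?, ?)", (eid, name, salary)
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(conn, 1, "Ada", 120000.0)   # within the allowed range
rejected = try_insert(conn, 2, "Bob", 500000.0)   # violates the CHECK constraint
row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
```

Because the rule lives in the schema, every application that writes to the table is held to it, not only the ones that remember to check.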

Why use a DBMS?

5. Concurrent Access and Crash Recovery
  – The DBMS schedules concurrent accesses to the data such that the users think that the data is being accessed by only one user at a time
  – The DBMS protects data from system failures
6. Reduced Application Development Time
  – Supports many functions that are common to a number of applications accessing data
  – Provides a high-level interface
  – Facilitates quick and robust application development
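The protection against failures can be sketched with a transaction that is interrupted halfway. This uses Python's built-in `sqlite3`; the `Accounts` table, the `transfer` helper, and the simulated failure are assumptions made for illustration. It shows only atomic rollback after an application error, which is a small stand-in for real crash recovery, and it does not exercise concurrent users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Accounts (owner TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commit if the block finishes, roll back if it raises
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE owner = ?",
                (amount, src),
            )
            if fail_midway:
                raise RuntimeError("simulated failure between the two updates")
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE owner = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by owner."""
    return dict(conn.execute("SELECT owner, balance FROM Accounts"))


failed = transfer(conn, "alice", "bob", 30, fail_midway=True)
after_failure = balances(conn)   # the half-done debit has been undone
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

After the failed attempt the balances are unchanged, so the database never exposes the state in which money has left one account without arriving in the other.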

When NOT to use a DBMS?

• A DBMS is optimized for certain kinds of workloads and manipulations
• There may be applications with tight real-time constraints or a few well-defined critical operations
• The abstract view of the data provided by the DBMS may not suffice
• When running complex statistical/ML analytics on large datasets

Data Model

• Applications need to model some real-world units
• Entities:
  – Students, Departments, Courses, Faculty, Organization, Employee, …
• Relationships:
  – Course enrollments by students, product sales by an organization
• A data model is a collection of high-level data description constructs that hide many low-level storage details

A Data Model Can Specify:

1. Structure of the data
  – like arrays or structs in a programming language
  – but at a higher level (conceptual model)
2. Operations on the data
  – unlike a programming language, not every operation can be performed
  – allows limited sets of queries and modifications
  – a strength, not a weakness!
3. Constraints on the data
  – what the data can be
  – e.g. a movie has exactly one title

Important Data Models

• Structured Data
• Semi-structured Data
• Unstructured Data

What are these?

Important Data Models

• Structured Data
  – All elements have a fixed format
  – Relational Model (table)
• Semi-structured Data
  – Some structure, but not fixed
  – Hierarchically nested tagged elements in a tree structure
  – XML
• Unstructured Data
  – No structure
  – text, image, audio, video
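The three categories can be illustrated by writing similar student information in each form. This is a sketch with made-up records; the XML tag names and the helper `ages_from_xml` are assumptions, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

# Structured: every record has the same fixed fields (a relational tuple).
structured = [("53666", "Jones", 18), ("53688", "Smith", 18)]

# Semi-structured: nested tagged elements; the second student has no <age>.
semi_structured = ET.fromstring(
    "<students>"
    "<student><sid>53666</sid><name>Jones</name><age>18</age></student>"
    "<student><sid>53688</sid><name>Smith</name></student>"
    "</students>"
)

# Unstructured: free text with no schema at all.
unstructured = "Jones (sid 53666) is 18; Smith enrolled later."


def ages_from_xml(root):
    """Map sid to age, using None where the optional <age> element is absent."""
    result = {}
    for student in root.findall("student"):
        age = student.findtext("age")
        result[student.findtext("sid")] = int(age) if age is not None else None
    return result


xml_ages = ages_from_xml(semi_structured)
```

The code reading the XML has to cope with a missing element, which is exactly the "some structure, but not fixed" property; the tuple form needs no such check, and the free text offers nothing to query without further processing.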

Relational Data Model

• Proposed by Edgar F. (Ted) Codd in 1970
  – won the Turing Award for it!
• Motivation:
  – Simplicity
  – Better logical and physical data independence

Relational Data Model

• The data description construct is a Relation
  – Represented as a “table”
  – Basically a “set” of records (set semantics)
  – order does not matter
  – and all records are distinct
• However, this is true for the relational model, not for standard DBMSs
  – they allow duplicate rows (bag semantics)
  – unless restricted by key constraints. Why?

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

Bag: {1, 1, 2, 2, 3, 2, 1, 5, 6, 1}
Set: {1, 2, 3, 5, 6}
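The bag and set from the slide can be reproduced directly in Python. This sketch uses `collections.Counter` as one convenient way to represent a bag (a multiset); that choice is an assumption for illustration, not how a DBMS stores rows.

```python
from collections import Counter

# The values from the slide's example, in the order they are listed.
values = [1, 1, 2, 2, 3, 2, 1, 5, 6, 1]

# Bag semantics: duplicates are kept, so each value carries a multiplicity.
bag = Counter(values)

# Set semantics: duplicates collapse, so each value appears at most once.
as_set = set(values)

bag_size = sum(bag.values())   # counts every occurrence
set_size = len(as_set)         # counts distinct values only
```

Order is irrelevant in both cases; the only difference is whether multiplicities are remembered.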

Bag vs. Set

• Why “bag semantics” and not “set semantics” in standard DBMSs?
  – Primarily performance reasons
  – Duplicate elimination is expensive (typically requires sorting or hashing)
  – Some operations like “projection” are much more efficient on bags than on sets

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0
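Projection is where the difference shows up in practice. In the sketch below, which loads the Students instance into an in-memory SQLite database via Python's `sqlite3`, projecting onto `name` yields a duplicate under bag semantics, and the extra duplicate-elimination step has to be requested explicitly with `DISTINCT`. The SQLite column types are assumptions.

```python
import sqlite3

STUDENTS = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
    ("53831", "Madayan", "madayan@music", 11, 1.8),
    ("53832", "Guldu", "guldu@music", 12, 2.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", STUDENTS)

# Bag semantics: one output row per input row, duplicates included.
bag_projection = sorted(
    name for (name,) in conn.execute("SELECT name FROM Students")
)

# Set semantics: the DBMS must do extra work to remove duplicates.
set_projection = sorted(
    name for (name,) in conn.execute("SELECT DISTINCT name FROM Students")
)
```

The results are sorted in Python only to make them easy to compare; neither query promises any row order on its own.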

Relational Data Model

Students
sid    name     login          age  gpa
53666  Jones    jones@cs       18   3.4
53688  Smith    smith@ee       18   3.2
53650  Smith    smith1@math    19   3.8
53831  Madayan  madayan@music  11   1.8
53832  Guldu    guldu@music    12   2.0

Labels on the table: Attribute/Column/Field (a column), Tuple/Row/Record (a row), Value (a single cell)

What is a poorly chosen attribute in this relation?

• Relational database = a set of relations
• A Relation: made up of two parts
  1. Schema
  2. Instance

Schema and Instance

• One schema can have multiple instances
• Schema:
  – A template for describing an entity/relationship (e.g. students)
  – specifies the name of the relation + the name and type of each column
    e.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Instance:
  – When we fill in actual data values in a schema
  – a table, has rows and columns
  – each row/tuple follows the schema and domain constraints
  – #rows = cardinality, #fields = degree or arity
  – example below

Cardinality = 3, degree = 5
sid    name   login        age  gpa
53666  Jones  jones@cs     18   3.4
53688  Smith  smith@ee     18   3.2
53650  Smith  smith1@math  19   3.8
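The schema/instance split maps onto two separate steps in a DBMS: declaring the table and then filling it. A minimal sketch with Python's `sqlite3` follows. Mapping `string` to `TEXT` and `real` to `REAL` is an assumption about how to express the slide's types in SQLite, whose typing is more permissive than a strict domain constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: relation name plus the name and type of each column.
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)

# Instance: the actual rows currently stored under that schema.
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

# Cardinality = number of rows in the instance.
cardinality = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

# Degree (arity) = number of columns in the schema.
# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Students)")]
degree = len(columns)
```

Inserting or deleting rows changes the cardinality of the instance but leaves the degree, a property of the schema, untouched.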

Levels of Abstraction in a DBMS

• Physical schema
  – Storage as files, row vs. column store, indexes
  – will discuss these in later lectures

[Figure: layers from bottom to top: Disk, Physical Schema, Logical Schema, and several External Schemas]

Levels of Abstraction in a DBMS

• Logical/Conceptual schema
  – describes the stored data in the physical schema
• Decided by conceptual schema design
  – e.g. ER Diagram
    • not covered in this course
  – Normalization
    • will be covered

Students(sid: string, name: string, login: string, age: integer, gpa: real)

[Figure: layers from bottom to top: Disk, Physical Schema, Logical Schema, and several External Schemas]

Levels of Abstraction in a DBMS

• External schema
  – different “views” of the database for different users
  – will discuss views later
• One physical and one logical schema, but there can be multiple external schemas

[Figure: layers from bottom to top: Disk, Physical Schema, Logical Schema, and several External Schemas]
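As a rough sketch of "multiple external schemas over one logical schema", the example below defines two views over the same Students table using Python's `sqlite3`. The view names `Student_directory` and `Gpa_summary`, and the idea of which user group sees which view, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One logical schema.
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

# External schema 1 (hypothetical): a directory that hides age and gpa.
conn.execute(
    "CREATE VIEW Student_directory AS SELECT sid, name, login FROM Students"
)

# External schema 2 (hypothetical): aggregate statistics only, no individuals.
conn.execute(
    "CREATE VIEW Gpa_summary AS "
    "SELECT COUNT(*) AS n, AVG(gpa) AS avg_gpa FROM Students"
)

# cursor.description lists the columns a query exposes; index 0 is the name.
directory_columns = [
    d[0] for d in conn.execute("SELECT * FROM Student_directory").description
]
n, avg_gpa = conn.execute("SELECT n, avg_gpa FROM Gpa_summary").fetchone()
```

Both views are computed from the same stored table, so there is still a single copy of the data underneath.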

Data Independence

• Application programs are insulated from changes in the way the data is structured and stored
• A very important property of a DBMS
• Two kinds: logical and physical

Logical Data Independence

• Users can be shielded from changes in the logical structure of data
• e.g. Students:
    Students(sid: string, name: string, login: string, age: integer, gpa: real)
• Divide into two relations:
    Students_public(sid: string, name: string, login: string)
    Students_private(sid: string, age: integer, gpa: real)
• Still, a “view” Students can be obtained using the above new relations
  – by “joining” them on sid
• A user who queries this view Students will get the same answer as before
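The split-and-rejoin idea can be sketched with Python's `sqlite3`: the same query string is run against the original single-table design and against the two-table design with a `Students` view on top. The sample rows, the query, and the helper function names are illustrative assumptions.

```python
import sqlite3

ROWS = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
    ("53650", "Smith", "smith1@math", 19, 3.8),
]

# The application's query; it refers only to the name "Students".
QUERY = "SELECT name, gpa FROM Students WHERE age = 18 ORDER BY sid"


def original_design():
    """Students stored as a single relation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Students "
        "(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
    )
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def split_design():
    """Students split into two relations, with a view that joins them on sid."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Students_public (sid TEXT, name TEXT, login TEXT)")
    conn.execute("CREATE TABLE Students_private (sid TEXT, age INTEGER, gpa REAL)")
    conn.executemany(
        "INSERT INTO Students_public VALUES (?, ?, ?)",
        [(sid, name, login) for sid, name, login, _, _ in ROWS],
    )
    conn.executemany(
        "INSERT INTO Students_private VALUES (?, ?, ?)",
        [(sid, age, gpa) for sid, _, _, age, gpa in ROWS],
    )
    conn.execute(
        "CREATE VIEW Students AS "
        "SELECT p.sid, p.name, p.login, q.age, q.gpa "
        "FROM Students_public AS p JOIN Students_private AS q ON p.sid = q.sid"
    )
    return conn


before = original_design().execute(QUERY).fetchall()
after = split_design().execute(QUERY).fetchall()
```

The query text is identical in both runs; only the definition of `Students` changed, which is the shielding the slide describes.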

Physical Data Independence

• The logical/conceptual schema insulates users from changes in physical storage details
  – how the data is stored on disk
  – the file structure
  – the choice of indexes
• The application remains unaltered
  – But the performance may be affected by such changes
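A small sketch of physical data independence with Python's `sqlite3`: an index is added between two runs of the same query, and the answer does not change. The index name `idx_students_login` and the sample rows are assumptions; with three rows no performance difference would be observable, so none is measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
        ("53650", "Smith", "smith1@math", 19, 3.8),
    ],
)

# The application's query never mentions how the data is stored.
QUERY = "SELECT name FROM Students WHERE login = ?"

before = conn.execute(QUERY, ("smith@ee",)).fetchall()

# A purely physical change: add an index on the lookup column.
conn.execute("CREATE INDEX idx_students_login ON Students (login)")

after = conn.execute(QUERY, ("smith@ee",)).fetchall()

# PRAGMA index_list returns one row per index; index 1 holds the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list(Students)")]
```

On a large table the index could change how quickly the query runs, but the query text and its result stay the same.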

Very important

• Understand the Course-Policy
• See “what is allowed / not allowed”
• You will be reminded in every hw assignment too