Distributed Database Systems

Jaweed Yazdani, King Fahd University of Petroleum and Minerals



TRANSCRIPT

Page 1: Distributed Database Systems

Distributed Database Systems

Jaweed Yazdani
King Fahd University of Petroleum and Minerals

Page 2: Distributed Database Systems

File Systems

Page 3: Distributed Database Systems

Database Management Systems

Page 4: Distributed Database Systems

Why Distributed Database Systems?

Page 5: Distributed Database Systems

Distributed Computing?

A concept in search of a definition and a name.

A number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks.

Page 6: Distributed Database Systems

Distributed Computing: Synonymous Terms

distributed function
distributed data processing
multiprocessors/multicomputers
satellite processing
backend processing
dedicated/special purpose computers
timeshared systems
functionally modular systems

Page 7: Distributed Database Systems

What gets distributed?

Processing logic
Functions
Data
Control

Page 8: Distributed Database Systems

What is a Distributed Database System?

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.

A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

Distributed database system (DDBS) = DDB + D-DBMS

Page 9: Distributed Database Systems

What is not a DDBS?

A timesharing computer system
A loosely or tightly coupled multiprocessor system
A database system which resides at one of the nodes of a network of computers: this is a centralized database on a network node

Page 10: Distributed Database Systems

Time-Sharing System

Page 11: Distributed Database Systems

Centralized DBMS on a Network

Page 12: Distributed Database Systems

Multiple Clients/Single Server

Page 13: Distributed Database Systems

Multiple Clients/Multiple Servers

Page 14: Distributed Database Systems

Distributed DBMS Environment

Page 15: Distributed Database Systems

Implicit Assumptions

Data stored at a number of sites
  each site logically consists of a single processor
Processors at different sites are interconnected by a computer network
  no multiprocessors
Distributed database is a database, not a collection of files
  data logically related as exhibited through the users' access patterns
D-DBMS is a full-fledged DBMS

Page 16: Distributed Database Systems

Shared-Memory Architecture

Page 17: Distributed Database Systems

Shared-Disk Architecture

Page 18: Distributed Database Systems

Shared-Nothing Architecture

Page 19: Distributed Database Systems

D-DBMS Applications

Manufacturing, especially multi-plant manufacturing
Military command and control
Corporate MIS
Airlines
Hotel chains
Any organization which has a decentralized organization structure

Page 20: Distributed Database Systems

Distributed DBMS Advantages

Transparent management of distributed, fragmented, and replicated data
Improved reliability/availability through distributed transactions
Improved performance
Easier and more economical system expansion

Page 21: Distributed Database Systems

Transparency

Fundamental issue is to provide data independence in the distributed environment
Transparency is the separation of the higher level semantics of a system from the lower level implementation issues.
  Network (distribution) transparency
  Replication transparency
  Fragmentation transparency
    • horizontal fragmentation: selection
    • vertical fragmentation: projection
    • hybrid
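
The pairing of horizontal fragmentation with selection and vertical fragmentation with projection can be sketched in a few lines of Python. This is only an illustration: the EMP rows, the helper names `horizontal_fragment` and `vertical_fragment`, and the choice of ENO as the key kept in every vertical fragment are assumptions, not part of the slides.

```python
# Hypothetical EMP relation; the rows are made up for illustration.
EMP = [
    {"ENO": 1, "ENAME": "A", "TITLE": "Engineer"},
    {"ENO": 2, "ENAME": "B", "TITLE": "Analyst"},
    {"ENO": 3, "ENAME": "C", "TITLE": "Engineer"},
]

def horizontal_fragment(rows, predicate):
    """Selection: keep whole rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def vertical_fragment(rows, attrs):
    """Projection: keep only the named attributes of every row."""
    return [{a: row[a] for a in attrs} for row in rows]

# Two horizontal fragments; the relation is rebuilt by taking their union.
emp_engineers = horizontal_fragment(EMP, lambda r: r["TITLE"] == "Engineer")
emp_others = horizontal_fragment(EMP, lambda r: r["TITLE"] != "Engineer")
horizontal_rebuilt = sorted(emp_engineers + emp_others, key=lambda r: r["ENO"])

# Two vertical fragments that both keep the key ENO (assumed here);
# the relation is rebuilt by joining them on that key.
emp_names = vertical_fragment(EMP, ["ENO", "ENAME"])
emp_titles = vertical_fragment(EMP, ["ENO", "TITLE"])
titles_by_eno = {r["ENO"]: r for r in emp_titles}
vertical_rebuilt = [{**r, **titles_by_eno[r["ENO"]]} for r in emp_names]
```

Fragmentation transparency means a user queries EMP and never names `emp_engineers` or `emp_titles` directly; the system performs the union or join.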

Page 22: Distributed Database Systems

An Example

Page 23: Distributed Database Systems

Database Management Systems

Page 24: Distributed Database Systems

Distributed Query

    SELECT ENAME, SAL
    FROM   EMP, ASSGN, PAY
    WHERE  DUR > 12
    AND    EMP.ENO = ASSGN.ENO
    AND    EMP.TITLE = PAY.TITLE

Page 25: Distributed Database Systems

Distributed Database – User View

Page 26: Distributed Database Systems

Distributed Database – Reality

Page 27: Distributed Database Systems

Performance Enhancement

Proximity of data to its points of use
  Requires some support for fragmentation and replication
Parallelism in execution
  Inter-query parallelism
  Intra-query parallelism

Page 28: Distributed Database Systems

Parallelism Requirements

Have as much of the data required by each application at the site where the application executes
  Full replication
How about updates?
  Updates to replicated data require implementation of distributed concurrency control and commit protocols

Page 29: Distributed Database Systems

Scalability Requirements

Issue is database scaling
Emergence of microprocessor and workstation technologies
  Demise of Grosch's law
  Client-server model of computing
Data communication cost vs telecommunication cost

Page 30: Distributed Database Systems

Distribution Transparency

The extent to which the distribution of data to sites is shielded from users and applications
Data distribution strategies:
  Basic distribution tables
  Extracted tables
  Snapshot tables
  Replicated tables
  Fragmented tables
    Horizontal fragmentation
      • Primary
      • Derived
    Vertical fragmentation
    Mixed fragmentation

Page 31: Distributed Database Systems

. . . Distribution Transparency

Location (network) transparency
  hiding the details of the distribution of data in the network
Fragmentation transparency
  When a data item is requested, the specific fragment need not be named
Replication transparency
  A user need not know how a data item is replicated
Naming and local autonomy
  two solutions in distributed databases:
    a central name server
    each site prefixes its own site identifier to any name it generates: site5.deposit.f3.r2
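
A minimal sketch of the site-prefix naming scheme follows. It assumes that `site5.deposit.f3.r2` reads as site 5, relation `deposit`, fragment 3, replica 2; the slide gives only the example string, so this field layout and the helper names `global_name` and `parse_global_name` are assumptions.

```python
def global_name(site, relation, fragment, replica):
    """Build a system-wide unique name by prefixing the generating site's id.

    The field layout (site, relation, fragment, replica) is an assumed
    reading of the slide's example 'site5.deposit.f3.r2'.
    """
    return f"site{site}.{relation}.f{fragment}.r{replica}"

def parse_global_name(name):
    """Split a global name back into its four assumed components."""
    site, relation, fragment, replica = name.split(".")
    return {
        "site": int(site[len("site"):]),
        "relation": relation,
        "fragment": int(fragment[1:]),
        "replica": int(replica[1:]),
    }

example = global_name(5, "deposit", 3, 2)
```

Because each site only ever prefixes its own identifier, no central name server is needed to keep names unique. The price is that the name exposes location, which is why an alias layer is usually placed on top to restore location transparency.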

Page 32: Distributed Database Systems

Issues in Distributed Databases

Distributed Database Design
  how to distribute the database
  replicated & non-replicated database distribution
  a related problem in directory management
Query Processing
  convert user transactions to data manipulation instructions
  optimization problem
    min{cost = data transmission + local processing}
  general formulation is NP-hard
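
The objective min{cost = data transmission + local processing} can be illustrated by scoring a handful of candidate plans and keeping the cheapest. Everything numeric here is invented for illustration: the plan names, byte counts, operation counts, and the two unit costs are assumptions, and a real optimizer searches a far larger plan space.

```python
def plan_cost(bytes_shipped, local_ops, cost_per_byte, cost_per_op):
    """cost = data transmission + local processing (both in arbitrary units)."""
    return bytes_shipped * cost_per_byte + local_ops * cost_per_op

# Hypothetical candidate plans: (bytes shipped over the network, local operations).
candidate_plans = {
    "ship_R": (1000, 50),
    "ship_S": (4000, 20),
}

COST_PER_BYTE = 1.0   # assumed unit cost of transmitting one byte
COST_PER_OP = 10.0    # assumed unit cost of one local operation

costs = {
    name: plan_cost(shipped, ops, COST_PER_BYTE, COST_PER_OP)
    for name, (shipped, ops) in candidate_plans.items()
}
best_plan = min(costs, key=costs.get)
```

With these made-up numbers `ship_R` costs 1500.0 and `ship_S` costs 4200.0, so the transmission term dominates the choice even though `ship_S` does less local work.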

Page 33: Distributed Database Systems

Issues in Distributed Databases (contd.)

Concurrency Control
  synchronization of concurrent accesses
  consistency and isolation of transactions' effects
  deadlock management
Reliability
  how to make the system resilient to failures
  atomicity and durability

Page 34: Distributed Database Systems

Relationship Between Issues

Page 35: Distributed Database Systems

Directory Management

Metadata (catalog) can be stored:
  Centrally
    Master site
  Distributively
    Maximize autonomy
  Fully replicated
Metadata caching
  Distributed approach with caching of metadata from remote sites

Page 36: Distributed Database Systems

. . . Directory Management

How to ensure that the cached metadata is up-to-date?
  Owner responsibility
  Versions
  Dependency tracking
  Optimized execution plans

Page 37: Distributed Database Systems

Distributed Concurrency Control

Nonreplicated Scheme
  Each site maintains a local lock manager to administer lock and unlock requests for local data
  Deadlock handling is more complex
Single-Coordinator Approach
  The system maintains a single lock manager that resides in a single chosen site
  Can be used with replicated data
  Advantages
    simple implementation
    simple deadlock handling
  Disadvantages
    bottleneck
    vulnerability

Page 38: Distributed Database Systems

. . . Distributed Concurrency Control

Majority Protocol
  A lock manager at each site
  When a transaction wishes to lock a data item Q, which is replicated in n different sites, it must send a lock request to more than half of the n sites in which Q is stored
  complex to implement
  difficult to handle deadlocks
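
The majority rule can be sketched as follows. This is a simplification under stated assumptions: the class `SiteLockManager` and the function `majority_lock` are hypothetical names, only exclusive locks are modelled, requests are sent to all n replica sites rather than just over half of them, and a failed attempt releases whatever it obtained instead of waiting.

```python
class SiteLockManager:
    """Hypothetical per-site lock manager holding one exclusive lock per item."""

    def __init__(self):
        self.locks = {}  # data item -> transaction currently holding it

    def request(self, item, txn):
        holder = self.locks.get(item)
        if holder is None or holder == txn:
            self.locks[item] = txn
            return True
        return False

    def release(self, item, txn):
        if self.locks.get(item) == txn:
            del self.locks[item]


def majority_lock(item, txn, replica_sites):
    """Lock item for txn if more than half of the replica sites grant it."""
    granted = [site for site in replica_sites if site.request(item, txn)]
    if len(granted) > len(replica_sites) // 2:
        return True
    # No majority: give back the partial grants so other transactions can proceed.
    for site in granted:
        site.release(item, txn)
    return False


sites = [SiteLockManager() for _ in range(3)]
first = majority_lock("Q", "T1", sites)    # T1 obtains all three replicas
second = majority_lock("Q", "T2", sites)   # T2 cannot reach a majority
```

Two transactions can never both hold a majority of the same n replicas, which is what makes the rule safe. The deadlock difficulty the slide mentions arises when each of several transactions holds a minority and waits, a case this sketch sidesteps by releasing.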

Page 39: Distributed Database Systems

. . . Distributed Concurrency Control

Biased Protocol
  A variation of the majority protocol
  When a transaction needs a shared lock on data item Q, it requests the lock from the lock manager at one site containing a replica of Q
  When a transaction needs an exclusive lock, it requests the lock from the lock manager at all sites containing a replica of Q
Primary Copy
  A replica is selected as the primary copy
  Locks are requested from the primary site
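
The asymmetry of the biased protocol (shared lock at one replica, exclusive lock at every replica) can be sketched as below. The names `ReplicaLockManager`, `biased_read_lock`, and `biased_write_lock` are hypothetical, the reader always picks the first replica for simplicity, and a failed write releases its partial grants rather than waiting.

```python
class ReplicaLockManager:
    """Hypothetical lock manager for one replica site (shared/exclusive modes)."""

    def __init__(self):
        self.shared = {}     # item -> set of transactions holding shared locks
        self.exclusive = {}  # item -> transaction holding the exclusive lock

    def lock_shared(self, item, txn):
        holder = self.exclusive.get(item)
        if holder is not None and holder != txn:
            return False
        self.shared.setdefault(item, set()).add(txn)
        return True

    def lock_exclusive(self, item, txn):
        holder = self.exclusive.get(item)
        other_readers = self.shared.get(item, set()) - {txn}
        if other_readers or (holder is not None and holder != txn):
            return False
        self.exclusive[item] = txn
        return True

    def unlock(self, item, txn):
        self.shared.get(item, set()).discard(txn)
        if self.exclusive.get(item) == txn:
            del self.exclusive[item]


def biased_read_lock(item, txn, replicas):
    """Shared lock: one replica is enough (the first one, by assumption)."""
    return replicas[0].lock_shared(item, txn)


def biased_write_lock(item, txn, replicas):
    """Exclusive lock: every replica must grant it, otherwise undo and fail."""
    granted = []
    for site in replicas:
        if site.lock_exclusive(item, txn):
            granted.append(site)
        else:
            for g in granted:
                g.unlock(item, txn)
            return False
    return True


replicas = [ReplicaLockManager() for _ in range(3)]
read_ok = biased_read_lock("Q", "T1", replicas)
write_blocked = biased_write_lock("Q", "T2", replicas)
```

Reads cost a single message exchange while writes cost one per replica, so the scheme favours read-heavy workloads. Primary copy is the limiting case in which both kinds of lock go to one designated site.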

Page 40: Distributed Database Systems

Deadlock Handling

Local waits-for graph (WFG) is not enough
Centralized site
Distributed deadlock avoidance
  A unique priority is assigned to each transaction (time stamp + site number)
  Wait-die: if priority(T_i) < priority(T_j) then T_i waits, else T_i dies
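
The wait-die rule reduces to one comparison. The sketch below assumes that a priority is the pair (timestamp, site number) compared lexicographically, so that a smaller value means an older transaction and the site number only breaks timestamp ties; the slide states the ingredients but not this ordering, and the function names are hypothetical.

```python
def priority(timestamp, site_number):
    """Unique priority: the timestamp, with the site number as tie-breaker.

    Assumption: tuples compare lexicographically and a smaller value
    denotes an older transaction.
    """
    return (timestamp, site_number)


def wait_die(requester, holder):
    """Decide what requester T_i does when T_j holds the lock it needs.

    Follows the slide's rule: if priority(T_i) < priority(T_j) then
    T_i waits, else T_i dies (is aborted and restarted).
    """
    return "wait" if requester < holder else "die"


older = priority(10, 1)
younger = priority(20, 2)
```

Since a transaction only ever waits for one with a larger priority value, a cycle of waits is impossible, so no waits-for graph has to be assembled across sites. The usual refinement is that a transaction that dies restarts with its original timestamp so it eventually becomes the oldest.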

Page 41: Distributed Database Systems

. . . Deadlock Handling

Decentralized deadlock detection
  Path pushing technique
    Periodically each site analyzes its local WFG and lists all paths. For each path T_i -> ... -> T_j, it sends the path information to every site where T_j might be blocked because of a lock wait

    WFG1: T_1 -> T_2 -> T_3 and T_3 has made an RPC to site 2
    WFG2: T_3 -> T_4 and T_4 has made an RPC to site 3
    WFG3: T_4 -> T_2
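
The three-site example can be replayed in Python to see where the deadlock becomes visible. This is a sketch only: edges are stored as pairs, "pushing" a path is modelled by concatenating edge lists, and `find_cycle` is a hypothetical helper doing a plain depth-first search.

```python
def find_cycle(edges):
    """Return one cycle in the waits-for graph as a list of nodes, or None."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, []).append(holder)

    def visit(node, path):
        if node in path:
            return path[path.index(node):] + [node]
        for nxt in graph.get(node, []):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# Local waits-for graphs from the slide's example.
wfg1 = [("T1", "T2"), ("T2", "T3")]
wfg2 = [("T3", "T4")]
wfg3 = [("T4", "T2")]

# Site 1 pushes its path ending in T3 to site 2, where T3 is waiting.
at_site2 = wfg2 + wfg1
# Site 2 pushes the extended path ending in T4 to site 3, where T4 is waiting.
at_site3 = wfg3 + at_site2

cycle_site1 = find_cycle(wfg1)
cycle_site2 = find_cycle(at_site2)
cycle_site3 = find_cycle(at_site3)
```

No single local graph contains a cycle, and neither does the merged view at site 2. Only at site 3 does the loop through T2, T3 and T4 appear, which is why the local WFG alone is not enough.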

Page 42: Distributed Database Systems

Distributed Transaction Management

More complex, since several sites may participate in executing a transaction
Transaction manager
  maintaining a log for recovery purposes
  concurrency control
Transaction coordinator
  Starting the execution of the transaction
  Breaking the execution into a number of subtransactions
  Coordinating the termination of the transaction

Page 43: Distributed Database Systems

... Distributed Transaction Management

Additional failures
  site failure
  link failure
  loss of message
  network partition
Atomicity of transactions
  All participating sites either commit or abort
Protocols to ensure atomicity
  Two-phase commit
  Three-phase commit

Page 44: Distributed Database Systems

Two-phase Commit Protocol

Let T be a transaction initiated at site S_i with coordinator C_i
When all the sites at which T has executed inform C_i that T has completed, then C_i starts the two-phase commit:
Phase 1
  C_i adds the record <prepare T> to the log
  C_i sends a prepare T message to all sites at which T executed
  if a site answers no, it adds <abort T> to its log and then responds by sending an abort T message to C_i
  if a site answers yes, it adds <ready T> to its log and then responds by sending a ready T message to C_i
Phase 2
  if C_i receives a ready T message from all the participating sites, T can be committed; C_i adds <commit T> to the log and sends a commit T message to all participating sites to be added to their logs
  if C_i receives an abort T message or when a prespecified interval of time has elapsed, it adds <abort T> to the log and then sends an abort T message
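
The two phases can be traced with a small in-memory simulation. It is a sketch under stated assumptions: the class `Participant` and the function `two_phase_commit` are hypothetical, messages are plain method calls, logs are Python lists, and the timeout case is not modelled (a missing answer would simply be treated as an abort vote).

```python
class Participant:
    """Hypothetical participating site with an in-memory log."""

    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.log = []

    def prepare(self, txn):
        # Phase 1: log the vote first, then answer the coordinator.
        if self.vote_yes:
            self.log.append(f"<ready {txn}>")
            return "ready"
        self.log.append(f"<abort {txn}>")
        return "abort"

    def decide(self, txn, decision):
        # Phase 2: record the coordinator's decision unless already logged.
        record = f"<{decision} {txn}>"
        if record not in self.log:
            self.log.append(record)


def two_phase_commit(txn, coordinator_log, participants):
    """Run both phases for txn and return 'commit' or 'abort'."""
    coordinator_log.append(f"<prepare {txn}>")
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append(f"<{decision} {txn}>")
    for p in participants:
        p.decide(txn, decision)
    return decision


log_ok = []
sites_ok = [Participant("S1"), Participant("S2")]
outcome_ok = two_phase_commit("T", log_ok, sites_ok)

log_bad = []
sites_bad = [Participant("S1"), Participant("S2", vote_yes=False)]
outcome_bad = two_phase_commit("T", log_bad, sites_bad)
```

Each site writes its log record before sending the matching message. That ordering is what lets a site that crashes mid-protocol work out from its log alone how far it had got.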

Page 45: Distributed Database Systems

Handling of Failures

Failure of a participating site S_k
When the site recovers it examines its log and:
  If the log contains a <commit T> record, S_k executes redo(T)
  If the log contains an <abort T> record, S_k executes undo(T)
  If the log contains a <ready T> record, the site must consult C_i to determine the fate of T. If C_i is up, it notifies S_k as to whether T committed or aborted. If C_i is down, S_k must find the fate of T from other sites by sending a query-status T message to all the sites in the system
  If the log has no control records concerning T, S_k executes undo(T)
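
The recovery cases for a participating site form a short decision function. In this sketch `recover_participant` is a hypothetical name, the log is a list of record strings, and `ask_fate` is an assumed callback that stands in for both consulting C_i and sending query-status messages to the other sites; it returns "commit", "abort", or None when nobody can answer. The "unresolved" result for that last case is likewise an assumption.

```python
def recover_participant(log, txn, ask_fate):
    """Return the action S_k takes for txn after restarting: redo, undo, or unresolved."""
    if f"<commit {txn}>" in log:
        return "redo"
    if f"<abort {txn}>" in log:
        return "undo"
    if f"<ready {txn}>" in log:
        # The site voted yes but never learned the outcome: it must ask.
        fate = ask_fate(txn)
        if fate == "commit":
            return "redo"
        if fate == "abort":
            return "undo"
        return "unresolved"  # assumed label: no reachable site knows yet
    # No control records: the site failed before voting, so T cannot have committed.
    return "undo"


def nobody_knows(txn):
    """Assumed stand-in for the case where no site can report the fate."""
    return None
```

The last branch is safe because the coordinator only commits after receiving a ready message from every participant, and a site with no `<ready T>` record never sent one.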

Page 46: Distributed Database Systems

... Handling of Failures

Failure of the coordinator
  If an active site contains a <commit T> in its log, then T must be committed
  If an active site contains an <abort T> in its log, then T must be aborted
  If some active site does not contain <ready T> in its log, then T must be aborted
  If none of the above cases holds, the active sites must wait for C_i to recover (blocking problem)
Failure of a link
  same as before
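
The four coordinator-failure rules can be applied in order to the logs of the sites that are still up. The function name `decide_without_coordinator` and the representation of each log as a list of record strings are assumptions made for this sketch.

```python
def decide_without_coordinator(active_logs, txn):
    """Return 'commit', 'abort', or 'wait' from the logs of the active sites."""
    if any(f"<commit {txn}>" in log for log in active_logs):
        return "commit"
    if any(f"<abort {txn}>" in log for log in active_logs):
        return "abort"
    if any(f"<ready {txn}>" not in log for log in active_logs):
        # Some site never voted yes, so the coordinator cannot have committed.
        return "abort"
    # Every active site is ready and none knows the outcome: blocking problem.
    return "wait"


all_ready = [["<ready T>"], ["<ready T>"]]
one_committed = [["<ready T>", "<commit T>"], ["<ready T>"]]
one_not_ready = [["<ready T>"], []]
```

The "wait" outcome is the blocking case: all active sites have voted yes, so the failed coordinator may have decided either way, and locks held by T stay held until it recovers.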

Page 47: Distributed Database Systems

Three-phase Commit Protocol

The major disadvantage of the two-phase commit protocol is that coordinator failure may result in blocking
The three-phase commit protocol is designed to avoid the possibility of blocking

Page 48: Distributed Database Systems

Distributed Query Optimization

Other factors of distributed query processing
  Cost of data transmission
  Potential gain from parallel processing
  Relative speed of processing at each site
Join strategies
  Ship one operand
  Ship both operands
  Semijoin
  Distributed nested loop
Fragment access optimization
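
Of the join strategies listed, the semijoin is the one whose benefit is easiest to see in code. The sketch below assumes R and S live at different sites and counts what would cross the network; the function name `semijoin_strategy`, the sample rows, and the use of ENO as the join attribute (borrowed from the earlier EMP/ASSGN query) are illustrative assumptions.

```python
def semijoin_strategy(r_rows, s_rows, key):
    """Join R and S on key, shipping only join keys one way and matching rows back."""
    # Step 1: the site holding R sends just its distinct join-key values.
    r_keys = {row[key] for row in r_rows}
    # Step 2: the site holding S keeps only rows that can join and ships them back.
    s_reduced = [row for row in s_rows if row[key] in r_keys]
    # Step 3: the site holding R finishes the join locally.
    joined = [{**r, **s} for r in r_rows for s in s_reduced if r[key] == s[key]]
    return joined, len(r_keys), len(s_reduced)


# Hypothetical fragments of EMP and ASSGN stored at two different sites.
EMP = [{"ENO": 1, "ENAME": "A"}, {"ENO": 2, "ENAME": "B"}]
ASSGN = [{"ENO": 1, "DUR": 24}, {"ENO": 3, "DUR": 6}, {"ENO": 3, "DUR": 18}]

result, keys_sent, rows_returned = semijoin_strategy(EMP, ASSGN, "ENO")
```

Here two key values travel one way and a single ASSGN row travels back, instead of all three ASSGN rows as in ship-one-operand. The semijoin pays off when few rows of S match, and it can cost more than shipping the whole operand when most of them do.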

Page 49: Distributed Database Systems

Distributed Integrity Constraints

Fragment integrity
Distributed integrity constraints

Page 50: Distributed Database Systems

Truly Distributed Capabilities

Date's twelve rules
  Local autonomy
  No reliance on a central site for any particular service
  Continuous operation
  Location independence
  Fragmentation independence
  Replication independence
  Distributed query processing
  Distributed transaction management
  Hardware independence
  Operating system independence
  Network independence
  DBMS independence