1 comparison of access methods for time-evolving data betty salzberg and vassilis tsotras cs599,...
Post on 20-Dec-2015
214 views
TRANSCRIPT
1
Comparison of Access Methods for Time-Evolving Data
Betty Salzberg and Vassilis Tsotras
CS599, Temporal and Spatial Databases Course.
Presented by:
Atousa Golpayegani
11/16/2000
2
Temporal database design problem
Access Methods
DBMS Software
Access Method?
Temporal Queries
Time Dimension
Database Designer
B+-tree
R-treesnapshot-index
time-index
3
Temporal database design problem
Access Methods
DBMS Software
Access Method
Temporal Queries
Selected Time Dimension
Access MethodSelection Criteria
Database Designer
B+-tree R-tree
snapshot-index
time-index
4
Outline
Brief introduction to temporal databases
Criteria for comparison of access methods
Efficient method design for transaction/Bitemporal data
Method classification and comparison
5
Conventional Vs. Temporal Databases
Conventional database captures only a single snapshot of the modeled reality, usually the most current. Can not support past and future data.
Ex: “ the current salary of an employee”.
Temporal database supports some time domain and is thus able to manage time varying data. User-defined time is excluded.
Ex: “ the current and the previous salaries of an employee since hiring”.
6
What is Stored in a Temporal Database?
Tuple-versioning Temporal model is used. In this model Database is a set of records that store the versions of the real life objects. Each record has: A time invariant key A number of time variant attributes (for simplicity just one) One or two intervals
Start time End time
Ex:
Employee (ss#, name, salary, [t1, now) )
7
Taxonomy of Time in a Temporal Database
Transaction Time Is defined as the time when a fact is stored in the database.
Valid Time Is defined as the time when a fact becomes effective (valid) in
reality. Bitemporal Time
Is the combination of the above two types. Time is assumed to be discrete, and consecutive nonnegative
integers Any change is assumed to occur only at an indicated time
Addition of an object Deletion of an object Value change of an object’s attribute
8
Transaction Time Database
Time Interval shows when an object was added and deleted. An object is alive from the time it is added to the database until
it is deleted. Ex. Object ‘a’ is the only one that is currently alive.
No record is physically deleted (logical deletion). Therefore database has logical state Ex. Object ‘b’ is deleted.
Logical state at time t consists of those records whose transaction time interval contains t. Ex. S(T4) = (‘a’ , ‘b’)
T1 T2 T3 T4 T5 Time
ab b
Logical state T4
9
Assumptions in a Transaction Time Database
Past states can not be changed.
Linear transaction time evolution A New database state is created by updating only the current
database state. Another option is called branching transaction time evolution, where
new states can be created from any past database state.
Implicit updating assumption If an object is updated at time t, the database system will be
updated at the same time. There is no delay.
10
Transaction Time Database Example
T1 T2 T3 T4 T5 Time
(SS1,A,X1)(SS2,B,X2) (SS12B,X2)
key Salary Transaction Time
SS1
SS2
SS2
X1
X2
X2
[ T1, now)
[ T2, T3)
[T4, T5)
Name
A
B
B
11
T1 T2 T3 T4 T5 Time
ab b
Logical state T4
Transaction Time Database Access Method Characteristic
Store the past logical states.
Support addition/deletion/modification changes on the objects of the current logical state.
Efficiently access and query objects in any of logical states.
12
Valid Time Database
Time Interval shows the validity period of an object in reality. Ex. The period that a contract is valid.
An object that is not valid anymore will be physically deleted from the database.
Only the latest “snapshot” of the collection of interval-objects are kept.
Valid TimeValid Time
I1I2
State i-1 State i
13
Valid Time Database Access Method Characteristic
Store the latest collection of interval-objects.
Support addition/deletion/modification changes to this collection.
Efficiently query the interval-objects contained in the collection.
Valid Time
I2
14
Bitemporal Database
Supports both transaction time and valid time. Instead of a single collection of interval-objects, there is a
sequence of collections, C(ti), indexed by transaction time.
Transaction time and valid time are two orthogonal time dimensions.
Ex. (Contract#, amount, duration, [t1, now))
t1 t4t3t2C(t1) C(t4)C(t3)C(t2)
15
Bitemporal Database Access Method Characteristic
Store its past logical states.
Support addition/deletion/modification changes on the interval-objects of its current logical state.
Efficiently access and query the interval-objects on any of its states.
16
Criteria for Comparison of Access Methods
Query
Access Method Costs
Index Pagination and Data Clustering
Migration of Past Data to Another Location
Lower Bounds on I/O Complexity
17
Query
One of the criteria for the comparison is how efficient an access method answers a query.
Three general classes of queries are chosen.
They can be used for both valid time and transaction time since
from a query perspective, valid time and transaction time are
simply collections of intervals.
18
Query Type I
Given a contiguous interval T, find all objects alive during this interval.
Representative query for this type:
Pure-timeslice query : A special case of type I when interval
T is reduced to a single time instant t
Ex.” Find all employees working at the company at time T’ “
T1 T2 T3 T8 T9 Time
ab b
TT’
19
Query Type II
Given a key range and contiguous time interval T, find the objects with keys in the given range that are alive during interval T.
Representative query for this type:
Range-timeslice query : A special case of type II when interval T is reduced to a single time instant t
Ex. “ Find the employees working at the company at time T’ and whose ssn belongs in range K “
T1 T2 T3 T8 T9 Time
ab b
TT’Range = {a,b}
c
20
Query Type III
Given a key range, find a history of objects in this range.
Representative query for this type: Pure-Key query : A special case of type III where the key
range is reduced to a single key
Ex. “Find the salary history of employee with ssn b“
T1 T2 T3 T8 T9 Time
ab b
Range = {b}c
21
Bitemporal Queries
Bitemporal queries will be a combination of the previous types.
Ex.“ Find all the company contracts that were valid on v = Jan,1,94 as recorded in the database during transaction time interval T = May,1 - May,20, 1993.
22
Three-Entry Notation Query-Type Representation
Key / Valid / Transaction
Values for Each entry: Point Range ‘ * ’ ‘ - ‘
Example Transaction time Pure-Timeslice: “ * / - / point “ Valid Pure-Key: “ point / * / - “
23
Access Method Costs
The performance of an access method is characterized by three costs:
1) Storage Space, to physically store the data records and the structures of the access method
2) Update Processing Time, is the time to update the method’s data structures as a result of a change.
3) Query Time, for each of the basic queries.
24
Storage Space
For Transaction time and Bitemporal methods, storage space is a function of ‘n’, O(n). ‘n’ is the number of changes in the database.
Ex. For 1000 insertions and 1000 deletions, n is 2000.
For Valid time methods, storage space is a function of ‘l ’, O(l).
‘l ’ is the number of interval-objects currently stored in the method..
Ex. If seven Objects are stored in the database, l = 7.
25
Update Processing Time
Update processing time is dependent on the access method index structure.
For tree structured access method, it is a logarithmic time.
Ex. The best in-core algorithm for valid pure-timeslice query, requires O (log l ) update processing time per change.
26
Query Time
A method’s query time is a function of the query answer size, denoted by ‘ a ‘.
It is usually consists of two parts:
Logarithmic time for searching the access method structure to find the answer records.
Time that is required to retrieve the result.
Ex. The best in-core algorithm for valid pure-timeslice query, requires O (log l + a) query time.
27
Index Pagination and Data clustering
In a database environment the major cost of computation is based on how many pages are transferred between main memory and secondary memory.
Index pagination. How well index nodes of a method are paginated on the secondary memory.
Since index is used to search for and update data, its pagination greatly improves the performance of the access method.
Ex. B+-tree is a well paginated index, since it needs O(logB r) page accesses for searching or updating r objects, using pages of size B.
28
Index Pagination and Data clustering
Data Clustering. Is to store ‘logically’ related data, physically close.
It can substantially improve the performance of an access method since in order to answer query, fewer pages are accessed.
Ex. A transaction pure-timeslice query takes O(logB n + a/B) page accesses if the result are clustered in pages of size B. ‘n’ is the number of changes in the database and ‘a’ is the answer size. If result is not clustered O(logB n + a) page accesses are required.
29
Migration of Past Data to Another Location
Methods that support transaction time maintain all their past states. This property can easily result in excessive amounts of data.
Methods that support the following two criteria will have a better performance:
Whether or not past data can be separated from the current data, so that the smaller collection of the current data can be accessed more efficiently.
Whether data is appended sequentially to the method and never changed, so that write-once read-many (WORM) devices can be used.
30
Migration of Past Data to Another Location
Since current data is more frequently accessed than past data, If a method supports these criteria, past data can be moved to a higher capacity, but slower medium such as optical disk.
There are two ways for this separation:
With Manual approach, a process vacuums all records that are ‘dead’ when the process is called and moves them to the optical disk.
With Automated approach, any objects that become ‘dead’ (logically deleted) will be moved to the optical disk
31
Lower Bound on I/O Complexity
The minimum number of I/Os to solve a transaction time query is at least (logB n + a/B)
Since in a paginated environment O(logB n) is the minimum number of I/Os to search for the answer in the index structure and O(a/B) is the minimum number of pages to store the answer.
This lower bound is under the assumption that all time instants have the same probability of being in the query predicate.
If queries have special probabilities, the logarithmic search part will be improved. Ex. If most queries ask for the recent times.
32
Lower Bound on I/O Complexity
The minimum number of I/Os to solve a valid time query is also at least (logB l + a/B).
A method is called “ I/O Optimal “ if it achieves:
O(logB n + a/B) query time and O(n/B) space for transaction time
O(logB l + a/B) query time and O(l/B) space for valid time.
O(logB m) update processing per change. ‘m’ is the number of alive objects when the update takes place.
33
Efficient Method Design for Transaction/Bitemporal Data
The Question is how to efficiently store large amount of data produced by a transaction time database.
There are two approaches for storing the data to answer pure-timeslice queries efficiently (copy, log):
Copy approach stores a copy of the transaction database state (timeslice) for each transaction time that at least one change occurred.
T1 T2 T3 T4 T5 Time
ab b
CopyCopy Copy CopyCopy
34
Efficient Method Design for Transaction/Bitemporal Data
These copies are indexed by time t. Since changes arrive in order, therefore indexes are paginated (which provides a minimal query time).
It means O(logB n) page accesses are needed to find the timeslice and O(a/B) pages to retrieve the result. Therefore the page accesses and also query time is O(logB n + a/B)
T1 T2 T3 T4 T5 Time
ab b
Copy 2 Rec.Copy 1 Rec. Copy 2 Rec. Copy 2 Rec.Copy 2 Rec.
35
Disadvantages:
The space used can in the worst case be O(n2/ B) pages.
1 + 2 + 3 + 4 + 5 + . . . + n = n *(n+1)/2 = (n2 + n)/2 O(n2/ B ).
Update processing per change in the worst: O(n/ B ).
Efficient Method Design for Transaction/Bitemporal Data
T1 T2 T3 T4 T5 Time
ab
Copy 3 Rec.Copy 1 Rec. Copy 2 Rec. Copy 5 Rec. . . .Copy 4 Rec.
dc
e
36
T1 T2 T3 T4 T5 Time
ab b
Copy 1 Rec.Copy 1 Rec. Copy 1 Rec. Copy 1 Rec.Copy 1 Rec.
Efficient Method Design for Transaction/Bitemporal Data
Log approach stores only the changes that occur in the database time-stamped by the time instant on which they occurred.
The space used is reduced to O(n/ B) pages. The Update processing per change is also reduces to O(1). The Query time increased to O(n/ B) Since to find a timeslice all
the past records may have to be searched
37
Efficient Method Design for Transaction/Bitemporal Data
For the Pure-Key queries a better approach than copy and log is to store history of each key separately, This creates a key-Only method.
For the Range-Timeslice queries it is best to cluster by transaction time and key within pages.
For the Bitemporal queries the copy approach can be used to store the collections at given transaction time and a log approach to store changes between copies.
Another approach is to index the bitemporal objects on a single time axis. This approach is not efficient because very few of the accessed objects may satisfy the valid-time predicate.
38
Method Classification
Transaction access method is classified into three categories:
Time-only : clustering data by time only.
Key-only : clustering data by key only.
Time-key : clustering data by both time and key.
39
Examples of Method Classes
Transaction Time
time-only key-only time-key append-only tree reverse chaining composite indexes
time-index accession lists segment-R tree
differential file approach time sequence array write-once B-tree
achievable time index C-lists time-split B-tree
snap shot index multiversion access window method structure
overlapping B-tree
40
Time Index Access Method (Time Only)
41
Time Index Access Method (Time Only)
Based B+-Tree access method on the time axis. New nodes are always added on the rightmost leaf of the index. It is equivalent to Copy approach
Query Time Update processing
timeSpace
Pure-TimeSlice
O(logB n + a/B) O(n/B) O(n2/B)
Range-TImeSclice
O(M logB n + a)O(n/B) O(n2/B)
Pure-Key N/A N/A N/A
42
Reverse chaining (Key-Only)
B+-treeIndex
Current Past
B+-treeIndex
Key
(K1,A1,I1) ø
(K2,A2,I2) P2
(K3,A3,I3) P3
(K2,A2”,I2”)(K2,A2’,I2’) ø
(K3,A3’,I3’) ø
(K4,A4,I4) ø
(K4,A4”,I4”)
(K4,A4’,I4’)
Key
43
Reverse Chaining
The current and past data are stored separately. Previous version of a given key are linked together in reverse
chronological order. Current data is assume to be queried more often.(smaller data size
means faster query) Each tuple includes
Key Attribute value Life span interval Pointer (which points to the previous version, if any, of this key).
Update processing is O(logB n) , the query time is O(logB n + a), and the used space is O(n / B)
44
Comparison of Access Methods for time evolving data
45
The End