TRANSCRIPT
1
Presented by: Victor Gonzalez-Castro, Lachlan MacKinnon
A survey “Off the Record” – Using Alternative Data Models to Increase Data Density in Data Warehouse Environments.
2
Agenda
• Introduction
• Data Sparsity
• State of the art
    • Relational Model
    • The Triple Store
    • The Binary Model
    • The Associative Model
    • The Transrelational Model
• Our proposal
• Questions
3
Introduction
• In Data Warehouse environments, data sparsity is a common issue that remains unresolved.
• Alternative data models that abandon the traditional record storage/manipulation structure have been researched.
• We are investigating the use of these alternative data models to increase data density, with the aim of decreasing data sparsity.
4
Origin of Data Sparsity
• Data sparsity originates from the aim of answering all possible user queries from the information stored in a Data Warehouse that contains Nulls.
Fig.1. A three level dimension (Time Dimension: Year, Month, Day) and Nulls. After [6]
5
Origin of Data Sparsity (Cont…)
• Data Sparsity is the result of the Cartesian product of all dimensions and all aggregation levels.
Fig.2. Data Sparsity (sparse) and data density (dense). From [6].
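As a rough illustration of the Cartesian-product effect, the sketch below counts the cells implied by a set of dimensions and compares that with the number of cells that actually hold a value. The dimension sizes and the fact count are made-up numbers chosen for the example, not figures from the survey or from [6].

```python
from math import prod


def density(dimension_sizes, populated_cells):
    """Fraction of the dimensional cross product that actually holds a value."""
    total_cells = prod(dimension_sizes)  # Cartesian product of all dimensions
    return populated_cells / total_cells


# Hypothetical cube: 365 days x 1,000 products x 50 stores.
sizes = [365, 1000, 50]
facts = 912_500  # assumed number of non-Null cells

d = density(sizes, facts)
print(f"total cells: {prod(sizes):,}")
print(f"density:  {d:.2%}")
print(f"sparsity: {1 - d:.2%}")
```

Adding aggregation levels (Month, Year) to a dimension enlarges the cross product further, which is why the number of empty cells grows so quickly.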
6
State of the art. (Relational)
• The Relational Model [7] uses the traditional record storage/manipulation structure.
1234 Nut Red London
• It is the base model against which the other models will be compared.
• All RDBMSs manage sparsity (missing information) poorly.
• Codd [7] suggested a fundamental change in the Relational Model Version 2: the use of a four-valued logic.
• No one has implemented this fundamental change.
7
State of the art. (Relational)
• Major players in the relational market
/ SQL Server
8
State of the art. (TripleStore)
• The Triple Store [1],[2] uses a structure called the Name Store to keep all the names.
Identifier  Name
1           Nut
2           Red
3           London
…           …
• To construct the processing structure, it uses Triples.
1  2  3
4  5  6
…  …  …
l  m  n
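The two structures can be sketched in a few lines. This is only an illustration of the idea, not the TriStarp implementation: the class and function names are invented here, the slide does not say what role each of the three positions plays, and the second triple reuses the Bolt/Green/Paris values from the Binary Model slide purely as an assumption.

```python
class NameStore:
    """Maps each distinct name to an integer identifier and back."""

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = {}

    def intern(self, name):
        # Reuse the identifier if the name is already stored.
        if name not in self._id_by_name:
            new_id = len(self._id_by_name) + 1
            self._id_by_name[name] = new_id
            self._name_by_id[new_id] = name
        return self._id_by_name[name]

    def name(self, identifier):
        return self._name_by_id[identifier]


names = NameStore()
triples = set()  # each triple holds three identifiers, never the names themselves


def add_triple(a, b, c):
    triples.add((names.intern(a), names.intern(b), names.intern(c)))


add_triple("Nut", "Red", "London")    # identifiers 1, 2, 3 as on the slide
add_triple("Bolt", "Green", "Paris")  # assumed names for identifiers 4, 5, 6

for t in sorted(triples):
    print(t, "->", tuple(names.name(i) for i in t))
```

Because every name is stored once in the Name Store, a name that appears in many triples costs only one identifier per occurrence.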
9
State of the art. (TripleStore)
• The major project in Triple Store is TriStarp.
• TriStarp was established in 1984, led by Peter King with support from IBM Hursley Labs.
• Dr. Sharman from IBM Hursley [1] is visiting the TriStarp team.
• Current directions
    • Further development of the persistent Triple Store Repository.
    • Continuing research on the graph-based model.
    • Extending technology to manage partially structured data.
10
State of the art. (Binary)
• The Binary Model [4] considers that all tables are Binary tables.
Sur  Pname  Color  City
s1   Nut    Red    London
s2   Bolt   Green  Paris
s3   Screw  Blue   Oslo

Sur  Pname
s1   Nut
s2   Bolt
s3   Screw

Sur  Color
s1   Red
s2   Green
s3   Blue

Sur  City
s1   London
s2   Paris
s3   Oslo
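The decomposition shown in the tables can be sketched as follows. The function names are invented for this example, and the sketch says nothing about how MonetDB or any other system stores its binary tables; it only shows that the record table and the three binary tables carry the same information.

```python
records = [
    {"Sur": "s1", "Pname": "Nut", "Color": "Red", "City": "London"},
    {"Sur": "s2", "Pname": "Bolt", "Color": "Green", "City": "Paris"},
    {"Sur": "s3", "Pname": "Screw", "Color": "Blue", "City": "Oslo"},
]


def to_binary_tables(rows, key="Sur"):
    """Split each non-key attribute into its own (surrogate, value) table."""
    tables = {}
    for row in rows:
        for attr, value in row.items():
            if attr != key:
                tables.setdefault(attr, []).append((row[key], value))
    return tables


def rebuild(tables, key="Sur"):
    """Join the binary tables back together on the surrogate."""
    rows = {}
    for attr, pairs in tables.items():
        for sur, value in pairs:
            rows.setdefault(sur, {key: sur})[attr] = value
    return list(rows.values())


binary = to_binary_tables(records)
for attr, pairs in binary.items():
    print(attr, pairs)
print(rebuild(binary) == records)
```

The join on Sur in rebuild is the "Joins" linkage that the comparison table later attributes to the Binary Model.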
11
State of the art. (Binary)
• A major project in the Binary Model [4] is MonetDB.
• It is a DBMS designed to provide high performance on complex queries against real-world-sized databases.
• It achieves this goal using innovations at all layers of a DBMS: a storage model based on vertical fragmentation, processing speed by self-tuning relational operators, algorithms designed to exploit modern hardware, self-managing indexing structures, modular and extensible software architecture, etc.
• It is developed at CWI, the Institute for Mathematics and Computer Science Research of the Netherlands.
12
State of the art. (Associative)
• The Associative Model [3] comprises two types of data structures: Items and Links.
Identifier  Name
77          Nut
08          Red
32          London
12          That is
67          Is located in
• It differs from Binary and Triple Store in one fundamental way: associations themselves may be either the source or the target of other associations.
• It uses Quadruplets.
Identifier  Source  Verb  Target
74          77      12    08
03          74      67    32
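A small sketch makes the nesting concrete. It uses the Items and Links from the slide, keeps identifiers as strings so that leading zeros survive, and resolves a link recursively; the dictionary layout and the render function are assumptions made for this example, not the storage format of any Associative Model product.

```python
# Items: identifier -> name (values taken from the slide).
items = {
    "77": "Nut",
    "08": "Red",
    "32": "London",
    "12": "That is",
    "67": "Is located in",
}

# Links: identifier -> (source, verb, target). Link 03 uses link 74 as its source.
links = {
    "74": ("77", "12", "08"),
    "03": ("74", "67", "32"),
}


def render(identifier):
    """Spell out an item or a link; a link may point at other links."""
    if identifier in items:
        return items[identifier]
    source, verb, target = links[identifier]
    return f"({render(source)} {render(verb)} {render(target)})"


print(render("74"))
print(render("03"))
```

Link 03 has link 74 as its source, which is exactly the case that a plain triple or binary table cannot express directly.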
13
State of the art. (Associative)
• The major product in the Associative Model is SentencesDB.
• Instead of using a separate, unique table for every different type of data, it uses a single, generic structure to contain all types of data.
• Information about the logical structure of the data and the rules that govern it are stored alongside the data in the database.
• The programs are truly reusable, and no longer need to be amended when the data structures change.
14
State of the art. (Transrelational)
• The TransRelational™ Model [5] keeps the Relational Model itself but abandons the record storage structure. It uses two structures:
    • The Record Reconstruction Table.
    • The Field Values Table.
• Since there is currently no instantiation of the Transrelational Model available, we will build an implementation of the essential algorithms.
Field Values Table
P#  PNAME  COLOR  CITY
P1  Bolt   Blue   London
P2  Cam    Blue   London
P3  Cog    Green  London
P4  Nut    Red    Oslo
P5  Screw  Red    Paris
P6  Screw  Red    Paris
Record Reconstruction Table
P#  PNAME  COLOR  CITY
4   3      2      1
1   1      4      4
5   6      5      6
6   4      1      3
2   2      3      2
3   5      6      5
15
Transrelational. Algorithms
1. A file for the suppliers relation
P#  PNAME  COLOR  CITY
P1  Nut    Red    London
P2  Bolt   Green  Paris
P3  Screw  Blue   Oslo
P4  Screw  Red    London
P5  Cam    Blue   Paris
P6  Cog    Red    London
2. Sort each column in ascending order.
Field Values Table (FVT)
P#  PNAME  COLOR  CITY
P1  Bolt   Blue   London
P2  Cam    Blue   London
P3  Cog    Green  London
P4  Nut    Red    Oslo
P5  Screw  Red    Paris
P6  Screw  Red    Paris
Record Reconstruction Table (RRT)
P#  PNAME  COLOR  CITY
4   3      2      1
1   1      4      4
5   6      5      6
6   4      1      3
2   2      3      2
3   5      6      5
Record reconstruction (cells are written [row, column]):
1. Go to cell [1,1] of the FVT and fetch the value stored there (P1).
2. Go to the same cell [1,1] in the RRT and fetch the value (4). It means that the next field value (PNAME) is in the 4th row of the FVT. Go to that cell and fetch the value (Nut).
3. Go to the corresponding RRT cell [4,2] and fetch the row number (4). The next field value (3rd column, COLOR) is in the 4th row of the FVT (Red).
4. Go to the corresponding RRT cell [4,3] and fetch the value (1). The next field value (4th column, CITY) is in the 1st row of the FVT (London).
5. Go to the corresponding RRT cell [1,4] and fetch the value (1). A 5th column does not exist, so the walk wraps around to the 1st column, 1st row of the FVT (P1), where it started.
Reconstructed record
P#  PNAME  COLOR  CITY
P1  Nut    Red    London
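The construction and the reconstruction walk can be sketched as below. This is a minimal illustration written for this transcript, not the algorithms from [5] nor the implementation the authors plan to build: function names are invented, row numbers are 0-based rather than 1-based, and ties between equal values are broken by original record order, so the RRT it produces is one valid table for this data and need not match the slide cell for cell.

```python
records = [
    ("P1", "Nut", "Red", "London"),
    ("P2", "Bolt", "Green", "Paris"),
    ("P3", "Screw", "Blue", "Oslo"),
    ("P4", "Screw", "Red", "London"),
    ("P5", "Cam", "Blue", "Paris"),
    ("P6", "Cog", "Red", "London"),
]


def build_tables(rows):
    """Return (fvt, rrt) as lists of columns, using 0-based row numbers."""
    n_rows, n_cols = len(rows), len(rows[0])
    # order[c][r] = index of the record whose value sits in FVT row r of column c
    order = [sorted(range(n_rows), key=lambda i, c=c: rows[i][c])
             for c in range(n_cols)]
    # position[c][i] = FVT row where record i's value for column c ended up
    position = [{rec: r for r, rec in enumerate(col)} for col in order]
    fvt = [[rows[i][c] for i in order[c]] for c in range(n_cols)]
    # Each RRT cell points to the FVT row of the same record in the next column;
    # the last column wraps around to the first.
    rrt = [[position[(c + 1) % n_cols][i] for i in order[c]]
           for c in range(n_cols)]
    return fvt, rrt


def reconstruct(fvt, rrt, start_row):
    """Follow the RRT pointers from the first column through every column."""
    record, row = [], start_row
    for c in range(len(fvt)):
        record.append(fvt[c][row])
        row = rrt[c][row]
    return tuple(record)


fvt, rrt = build_tables(records)
print(reconstruct(fvt, rrt, 0))
```

Starting from row 0 of the first column follows the same path as steps 1 to 5: P1, then Nut, Red and London.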
16
Alternative Data Models Comparison
Model            Storage Structure    Linkage Structure
Relational       Table (Relation)     By position
Triple Store     Name Store           Triple Store
Binary           Binary Table         Joins
Associative      Items                Links
Transrelational  Field Values Table   Record Reconstruction Table
17
Our proposal (Our aims)
• To carry out an impartial survey on alternative Data Models.
• To determine whether the use of alternative data models can improve data density in Data Warehouse environments.
• Observe the effect that such data density increase has on the data sparsity.
18
Our proposal (How…)
• We intend to use an implementation of each data model.
TransRelational™
• We will use the TPC-H data set to load each database.
• We will run a set of benchmark metrics where available; where none exist, we will develop our own metrics to determine relative performance, and then consider relative data density and sparsity.
19
Just Remember…
• Instead of storing data horizontally, do it vertically and eliminate duplicate values.
123  Bolt   Black  Paris
456  Screw  Blue   London
789  Nut    White
234  Nail
567
Here are the Savings
• We are abandoning the traditional Record Structure; we are going “off the record”.
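To put a number on the savings, the sketch below counts how many values a record-by-record layout stores against how many distinct values a column-by-column layout would keep. It reuses the six-row table from the Transrelational slides as sample data; the function name is invented, and the count deliberately ignores the linkage structure each model needs in order to rebuild records, so it is an upper bound on the saving rather than a measurement.

```python
columns = ["P#", "PNAME", "COLOR", "CITY"]
records = [
    ("P1", "Nut", "Red", "London"),
    ("P2", "Bolt", "Green", "Paris"),
    ("P3", "Screw", "Blue", "Oslo"),
    ("P4", "Screw", "Red", "London"),
    ("P5", "Cam", "Blue", "Paris"),
    ("P6", "Cog", "Red", "London"),
]


def count_values(rows):
    """Values stored per record versus distinct values stored per column."""
    by_record = sum(len(row) for row in rows)
    by_column = {name: len({row[i] for row in rows})
                 for i, name in enumerate(columns)}
    return by_record, by_column


by_record, by_column = count_values(records)
print("values stored record by record:", by_record)
print("distinct values per column:", by_column)
print("values saved:", by_record - sum(by_column.values()))
```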
20
Questions?
22
References
1. G C H Sharman and N Winterbottom, The Universal Triple Machine: a Reduced Instruction Set Repository Manager. Proceedings of BNCOD 6, pp 189-214, 1988.
2. TriStarp Web Site: http://www.dcs.bbk.ac.uk/~tristarp. Updated November, 2000.
3. Simon Williams. The Associative Model of Data, Second Edition, Lazy Software Ltd. ISBN: 1-903453-01-1 www.lazysoft.com
4. MonetDB. ©1994-2004 by CWI. http://monetdb.cwi.nl
5. Date, C.J. An Introduction to Database Systems, Eighth Edition. Appendix A: The Transrelational Model. Addison-Wesley. 2004. USA. ISBN: 0-321-18956-6.
6. Pendse, Nigel. Database explosion. http://www.olapreport.com Updated Aug, 2003.
7. Codd, E.F. The Relational Model for Database Management Version 2. Addison-Wesley. 1990. ISBN 0-201-14192-2.