Data Processing over Very Large Relational Databases



Post on 20-Jun-2015



DESCRIPTION

Final presentation of my dissertation thesis, focused on orientation in, analysis of and finding information in large or unknown relational databases, and on data visualisation.

TRANSCRIPT

Page 1: Data Processing over very Large Relational Databases

Data Processing over Very Large Databases

Ing. Ľuboš Takáč

Supervisor: doc. Ing. Michal Zábovský, PhD.

Faculty of Management Science and Informatics

University of Žilina

Page 2: Data Processing over very Large Relational Databases

Large Databases

• VLDB (very large databases)

• Relational Databases with hundreds of tables and millions of rows

Page 3: Data Processing over very Large Relational Databases

The Problem

• How to understand a relational database model so that we can find information in it.

• Orientation in a large RDB – its difficulty is given by the complexity of the RDB model.

• Modification and development of the RDB.

Page 4: Data Processing over very Large Relational Databases

Existing approaches

• Database metrics

• Database visualization

• Database-to-ontology mapping and examination of the ontology

Page 5: Data Processing over very Large Relational Databases

Database Metrics

• A database metric is a function that assigns a numeric value to an object from the database.

• Examples of table metrics

– DRT(T) – depth of relational tree

– TS(T) – table size

– RD(T) – referential degree

– …

• Rankings – grouping metrics with different weights.
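
As a rough illustration of table metrics and a ranking, the following Python sketch computes two per-table numbers on a small in-memory SQLite database and combines them into a weighted score. The slides give only the metric names, so the definitions here are assumptions: RD(T) is taken to be the number of foreign keys declared in table T, and TS(T) is approximated by the row count. The demo schema, the function names and the weights are hypothetical.

```python
import sqlite3

# Hypothetical demo schema; not taken from the presentation.
DEMO_SCHEMA = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id)
);
CREATE TABLE order_item (
    id INTEGER PRIMARY KEY,
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES product(id)
);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO product VALUES (1, 'Pen');
INSERT INTO orders VALUES (10, 1), (11, 2);
INSERT INTO order_item VALUES (100, 10, 1), (101, 10, 1), (102, 11, 1);
"""


def build_demo():
    """Create the demo database in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DEMO_SCHEMA)
    return conn


def table_names(conn):
    """List user tables, skipping SQLite's internal ones."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def referential_degree(conn, table):
    """Assumed RD(T): number of foreign keys declared in the table."""
    rows = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
    # Column 0 is the foreign-key id; a composite key spans several rows.
    return len({row[0] for row in rows})


def table_size(conn, table):
    """Assumed TS(T): number of rows in the table."""
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def ranking(conn, w_rd=1.0, w_ts=0.1):
    """Weighted sum of the two metrics, highest score first."""
    scores = {
        t: w_rd * referential_degree(conn, t) + w_ts * table_size(conn, t)
        for t in table_names(conn)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Calling `ranking(build_demo())` lists the tables from the most to the least "important" under these assumed weights; other metrics can be added to the weighted sum in the same way.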

Page 6: Data Processing over very Large Relational Databases

RDB Visualization

• Database schema visualization.

• A standard ER diagram is insufficient for a large RDB model.

Page 7: Data Processing over very Large Relational Databases
Page 8: Data Processing over very Large Relational Databases
Page 9: Data Processing over very Large Relational Databases

SchemaBall

• Visualization of large or complex RDB schemas.

• Using RDB metrics and rankings.

• We implemented and enhanced this solution.

Page 10: Data Processing over very Large Relational Databases

SchemaBall

Page 11: Data Processing over very Large Relational Databases

Visualization of RDB schema graph

• Vertex and edge weighted graph based on RDB metrics.

• Using Gephi for visualization

– automatically generated layout

– interactive visualization (selection and examination of nodes and edges)

– using graph algorithms
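
One simple way to hand a weighted schema graph to Gephi is an edge-list CSV. The sketch below assumes Gephi's spreadsheet (CSV) import with `Source`, `Target` and `Weight` columns; the slides do not say how the graph was actually exported, and the function name and sample edge are hypothetical.

```python
import csv
import io


def edges_to_csv(edges):
    """Serialize (source, target, weight) triples as an edge-list CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for source, target, weight in edges:
        writer.writerow([source, target, weight])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical edge: table "orders" references table "customer",
    # weighted by some metric value (here simply 3).
    print(edges_to_csv([("orders", "customer", 3)]))
```

The resulting text can be saved to a file and loaded through Gephi's import dialog; vertex weights would need a second, node-list file.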

Page 12: Data Processing over very Large Relational Databases
Page 13: Data Processing over very Large Relational Databases
Page 14: Data Processing over very Large Relational Databases

Analysis of RDB Graph

• Three approaches

– graph of the RDB model (vertex – table, edge – foreign key relation)

– alternative (vertex – table, edge – one foreign key relation for each tuple)

– graph of tuples (vertex – tuple, edge – foreign key relation between tuples)
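
The three graph views can be sketched in Python on a toy database. This is only one reading of the slide: the second approach is interpreted here as a multigraph over tables with one edge per referencing tuple. The data layout (`TABLES`, `FOREIGN_KEYS`) and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: table name -> list of rows.
TABLES = {
    "customer": [{"id": 1}, {"id": 2}],
    "orders": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 1},
        {"id": 12, "customer_id": 2},
    ],
}
# (child table, child column, parent table, parent column)
FOREIGN_KEYS = [("orders", "customer_id", "customer", "id")]


def schema_edges(fks):
    """Approach 1: vertex = table, one edge per foreign key."""
    return [(child, parent) for child, _, parent, _ in fks]


def per_tuple_edges(tables, fks):
    """Approach 2 (as read here): vertex = table, one edge per referencing tuple."""
    edges = []
    for child, col, parent, pcol in fks:
        parent_keys = {row[pcol] for row in tables[parent]}
        for row in tables[child]:
            if row.get(col) in parent_keys:
                edges.append((child, parent))
    return edges


def tuple_edges(tables, fks):
    """Approach 3: vertex = tuple (table, row index), edge = FK between tuples."""
    edges = []
    for child, col, parent, pcol in fks:
        index = {row[pcol]: i for i, row in enumerate(tables[parent])}
        for i, row in enumerate(tables[child]):
            j = index.get(row.get(col))
            if j is not None:
                edges.append(((child, i), (parent, j)))
    return edges


def degrees(edges):
    """Vertex degree for an undirected edge list (parallel edges counted)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree
```

Feeding each edge list to `degrees` and then counting how often each degree value occurs gives the kind of vertex-degree distribution examined for each approach.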

Page 15: Data Processing over very Large Relational Databases

Analysis of RDB Graph – first approach

[Figure: distribution function of vertex degree. x-axis: vertex degree (1 to 29); y-axis: probability (0.00 to 0.16).]

Page 16: Data Processing over very Large Relational Databases

Analysis of RDB Graph – second approach

[Figure: distribution function of vertex degree. x-axis: vertex degree (0 to 1.35E+07); y-axis: probability (0 to 1).]

Page 17: Data Processing over very Large Relational Databases

Analysis of RDB Graph – third approach

[Figure: distribution function of vertex degree. x-axis: vertex degree; y-axis: count.]

Page 18: Data Processing over very Large Relational Databases

Analysis of RDB Graph – Scale-free networks

• Connected graph with a Yule-Simon distribution of vertex degree.

• P(k) ∼ k^(−γ), where the exponent γ is usually between 2 and 3
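
In a scale-free network the degree distribution follows a power law, P(k) ∝ k^(−γ). One common way to estimate the exponent from an observed degree sequence is the approximate discrete maximum-likelihood estimator γ ≈ 1 + n / Σ ln(k_i / (k_min − 0.5)). The sketch below uses that estimator as an illustration; it is a swapped-in technique, not necessarily the one used in the thesis, and the function name and the `k_min` default are assumptions.

```python
import math


def estimate_exponent(degrees, k_min=1):
    """Approximate maximum-likelihood estimate of a power-law exponent.

    Only degrees >= k_min are used; k_min must be at least 1.
    """
    tail = [k for k in degrees if k >= k_min]
    if k_min < 1 or not tail:
        raise ValueError("need k_min >= 1 and at least one degree >= k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Hypothetical degree sequence: many low-degree vertices, a few hubs.
    sample = [1] * 60 + [2] * 20 + [3] * 10 + [5] * 6 + [12] * 3 + [29]
    print(round(estimate_exponent(sample), 2))
```

A heavier tail (more high-degree "centers") lowers the estimate, while a sequence dominated by degree-1 vertices raises it.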

Page 19: Data Processing over very Large Relational Databases

Visualization of RDB schema network

Page 20: Data Processing over very Large Relational Databases

Analysis of RDB Graph – Conclusion

• The RDB model is scale-free.

• To understand an RDB, you must first understand its centers (there are only a few of them).

• The very useful metric NR(T) (number of references) was validated by the analysis of the RDB graph.

• We created 2 new metrics based on the three approaches mentioned above.

Page 21: Data Processing over very Large Relational Databases

A Method for Analyzing Large RDB

• Find the components of the schema graph (tables = vertices, FKs = edges).

• Examine each component in order, starting with the largest.

– If you get a lone table, it is very probably an archive; try to verify this or find another purpose.

– Otherwise, visualize it via an ER diagram, SchemaBall or a graph using table metrics.
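
The first two steps of the method can be sketched as a connected-components search over the schema graph. This is a minimal Python sketch using breadth-first search; the function names and the sample table names are hypothetical, and foreign keys are given as (child table, parent table) pairs.

```python
from collections import defaultdict, deque


def schema_components(tables, foreign_keys):
    """Connected components of the schema graph, largest first."""
    neighbours = defaultdict(set)
    for child, parent in foreign_keys:
        neighbours[child].add(parent)
        neighbours[parent].add(child)

    seen = set()
    components = []
    for start in tables:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            table = queue.popleft()
            component.append(table)
            for other in neighbours[table]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))
    # Largest component first; ties broken alphabetically.
    components.sort(key=lambda c: (-len(c), c))
    return components


def lone_tables(components):
    """Single-table components, i.e. the archive-table candidates."""
    return [c[0] for c in components if len(c) == 1]


if __name__ == "__main__":
    tables = ["customer", "orders", "order_item", "product", "log_A", "audit_A"]
    fks = [("orders", "customer"), ("order_item", "orders"), ("order_item", "product")]
    comps = schema_components(tables, fks)
    print(comps)
    print(lone_tables(comps))
```

Components with more than one table are then the ones worth visualizing, while the lone tables are checked for an archive role or another purpose.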

Page 22: Data Processing over very Large Relational Databases

Practical Example

• Unknown complex RDB

– 332 tables

– 2339 attributes

– 192 foreign keys

– Size 2.4 GB

Page 23: Data Processing over very Large Relational Databases

All tables

Page 24: Data Processing over very Large Relational Databases

Archive Tables

• Each lone table is an archive table, following the “_A” naming convention.

Page 25: Data Processing over very Large Relational Databases

Component A

Page 26: Data Processing over very Large Relational Databases

Component B

Page 27: Data Processing over very Large Relational Databases
Page 28: Data Processing over very Large Relational Databases

RDBAnalyzer

• supports all RDB systems that support JDBC; easily scalable; online connection

• features

– large online RDB schema visualization

– finding the components of the graph

– schema graph creation, visualization and export (Gephi)

– transformation of the RDB to a tuple graph

– metrics charts, parallel coordinates visualization

Page 29: Data Processing over very Large Relational Databases

RDBAnalyzer

Page 30: Data Processing over very Large Relational Databases

RDB to Ontology Mapping

– better understanding and searching for information without knowledge of the RDB model, data mining from the RDB

– can be used by web search engines to search in RDBs

– getting information from an RDB by people who do not understand RDB technology (laymen)

– a method for merging multiple databases (ontology merging)

– interactive searching for information (Protégé)

Page 31: Data Processing over very Large Relational Databases

RDB Schema NORTHWIND (ER-Diagram)

Page 32: Data Processing over very Large Relational Databases

OntoGraf (Protégé)

Page 33: Data Processing over very Large Relational Databases
Page 34: Data Processing over very Large Relational Databases

How to find information in Ontologies

• using query language (SPARQL)

• interactively (e.g. Protégé)

– using OntoGraf combined with text searching

– exploring entities and individuals

Page 35: Data Processing over very Large Relational Databases
Page 36: Data Processing over very Large Relational Databases

Disadvantages & Problems of RDBs Mapped to Ontologies

• Difficult to keep the data up to date (static & dynamic ontology creation).

• Aggregated queries are very slow.

• Existing tools cannot cope with large RDBs (or large ontologies).

Page 37: Data Processing over very Large Relational Databases

Conclusion & Scientific Contribution

• Design and creation of a method for orientation in, understanding of and finding information in large or unknown relational databases (RDBAnalyzer supports the mentioned principles).

• Detection of RDB graph characteristics (scale-free network) and use of this knowledge to create 2 new metrics and validate 1 existing metric.

• Design and creation of a method for finding information in ontologies generated from an RDB.

Page 38: Data Processing over very Large Relational Databases

Thank you for your attention!
