
A. Benabdelkader, 2009


Managing Distributed Data

Bio-Wise Information Management


Ammar Benabdelkader

Amsterdam, 25-05-2009

http://www.science.uva.nl/~ammar


Managing Distributed Data

• Introduction to Relational Databases

• The relational model

• Distributed Databases

• Homogeneous / heterogeneous databases

• e-Science Database Integrator (e-DBI)

• e-DBI, a scenario case

• Data Warehouse

• Multi-dimensional data model


Introduction to Relational Databases

• Relational Model

• Primary/Foreign keys

• Integrity constraints

• Data Normalization

• Queries/Join

• Views


© Prentice Hall, 2002

Relational Model

• Definition: A relation is a named, two-dimensional table of data
  • A table is made up of rows (records) and columns (attributes or fields)

• Requirements:
  • Every relation has a unique name
  • Every attribute value is atomic (not multivalued, not composite)
  • Every row is unique (no two rows can have exactly the same values for all their fields)
  • Attributes (columns) in tables have unique names
  • The order of the columns is irrelevant
  • The order of the rows is irrelevant


Key Fields

• Keys are special fields that serve two main purposes:
  • Primary keys are unique identifiers of the relation. Examples include employee numbers, social security numbers, etc. This is how we can guarantee that all rows are unique
  • Foreign keys are identifiers that enable a dependent relation to refer to its parent relation

• Keys can be simple (a single field) or composite (more than one field)



Key Fields - Example

[Figure: example tables annotated with the following labels]

• Primary key
• Foreign key (implements the 1:N relationship between customer and order)
• Composite primary key (implements the M:N relationship between order and product)
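The three kinds of key in this example can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the schema from the slide: the table and column names (customer, orders, product, order_line) are assumptions chosen to mirror the customer/order/product relationships named in the labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,              -- simple primary key
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id)  -- foreign key: 1:N customer -> orders
);
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE order_line (
    order_id    INTEGER NOT NULL REFERENCES orders(order_id),
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)            -- composite key: M:N orders <-> product
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (1006, 1)")
conn.execute("INSERT INTO product VALUES (7, 'Desk')")
conn.execute("INSERT INTO order_line VALUES (1006, 7, 2)")

# The composite primary key guarantees one row per (order, product) pair.
try:
    conn.execute("INSERT INTO order_line VALUES (1006, 7, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

The order_line table is what turns the M:N relationship into two 1:N relationships, one toward orders and one toward product.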


Integrity Constraints

• Domain Constraints
  • Allowable values for an attribute

• Entity Integrity
  • No primary key attribute may be null. All primary key fields MUST have data

• Referential Integrity
  • Rule that states that any foreign key value MUST match a primary key value in the relation on the one side of the relationship (or be null, where the foreign key is optional)



Integrity Constraints - Example

[Figure: relational schema in which referential integrity constraints are drawn as arrows from the dependent table to the parent table]
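A minimal sketch of the three constraint types, again using Python's sqlite3 module. The schema and the helper `violates` are hypothetical. Two SQLite-specific details are worth noting: foreign keys are only enforced after `PRAGMA foreign_keys = ON`, and a non-integer primary key must be declared NOT NULL explicitly for entity integrity to be checked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    customer_id  TEXT PRIMARY KEY NOT NULL,           -- entity integrity
    credit_limit INTEGER CHECK (credit_limit >= 0)    -- domain constraint
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id TEXT NOT NULL
                REFERENCES customer(customer_id)      -- referential integrity
);
""")


def violates(sql):
    """Return True if the statement is rejected by an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


conn.execute("INSERT INTO customer VALUES ('C1', 500)")

domain_violation = violates("INSERT INTO customer VALUES ('C2', -10)")    # negative limit
entity_violation = violates("INSERT INTO customer VALUES (NULL, 100)")    # null primary key
referential_violation = violates("INSERT INTO orders VALUES (1, 'C9')")   # no such customer
valid_order_rejected = violates("INSERT INTO orders VALUES (2, 'C1')")    # parent row exists

print(domain_violation, entity_violation, referential_violation, valid_order_rejected)
```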


Data Normalization

• Primarily a tool to validate and improve a logical design so that it satisfies certain constraints that avoid unnecessary duplication of data

• The process of decomposing relations with anomalies to produce smaller, well-structured relations



Well-Structured Relations

• A relation that contains minimal data redundancy and allows users to insert, delete, and update rows without causing data inconsistencies

• Goal is to avoid anomalies
  • Insertion Anomaly: adding new rows forces the user to create duplicate data
  • Deletion Anomaly: deleting rows may cause a loss of data that would be needed for other future rows
  • Modification Anomaly: changing data in a row forces changes to other rows because of duplication



Data Normalization

• First Normal Form (1NF)
  • No multivalued attributes
  • Every attribute value is atomic

• Second Normal Form (2NF)
  • 1NF plus: every non-key attribute must be defined by the entire key, not by only part of the key
  • No partial functional dependencies

• Third Normal Form (3NF)
  • 2NF plus no transitive dependencies (one attribute functionally determines a second, which functionally determines a third)
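To make the normal forms concrete, the sketch below decomposes one flat relation in plain Python. The data and attribute names are invented for illustration. In the flat rows, the key is (order_id, product_id); product_name depends on only part of that key (a partial dependency), and customer_name depends on order_id only through customer_id (a transitive dependency).

```python
# Flat relation: (order_id, product_id, quantity, product_name, customer_id, customer_name)
flat_rows = [
    (1006, 7, 2, "Desk", 1, "Ada"),
    (1006, 8, 1, "Lamp", 1, "Ada"),
    (1007, 7, 4, "Desk", 2, "Bob"),
]

# Decompose so that every non-key attribute depends on the whole key of its
# own relation and on nothing else.
order_line = {(o, p): q for o, p, q, _, _, _ in flat_rows}   # (order_id, product_id) -> quantity
product = {p: name for _, p, _, name, _, _ in flat_rows}     # product_id -> product_name
orders = {o: c for o, _, _, _, c, _ in flat_rows}            # order_id -> customer_id
customer = {c: name for _, _, _, _, c, name in flat_rows}    # customer_id -> customer_name

# Joining the four relations back together reproduces the flat rows,
# so the decomposition loses no information.
rebuilt = sorted(
    (o, p, q, product[p], orders[o], customer[orders[o]])
    for (o, p), q in order_line.items()
)
lossless = rebuilt == sorted(flat_rows)

# A product name is now stored once, so renaming it is a single update
# instead of one change per order line (no modification anomaly).
product[7] = "Standing Desk"

print(lossless, len(product), len(customer))
```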



SQL

• Structured Query Language
  • The standard for relational database management systems (RDBMS)

• SQL-92 Standard -- Purpose:
  • Specify syntax/semantics for data definition and manipulation
  • Define data structures
  • Enable portability
  • Specify minimal (level 1) and complete (level 2) standards
  • Allow for later growth/enhancement to standard



SQL

• Data Definition Language (DDL)
  • Create table
  • Create view, etc.

• Data Manipulation Language (DML)
  • Insert statement: adds data to a table
  • Delete statement: removes rows from a table
  • Update statement: modifies data in existing rows
  • Select statement: queries single or multiple tables
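The statements listed above can be tried end to end with Python's sqlite3 module. This is a small sketch on an invented product table; prices are stored as whole numbers to keep the arithmetic exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure.
conn.execute(
    "CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, price INTEGER)"
)

# DML, INSERT: add data to the table.
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "Desk", 120), (2, "Lamp", 30), (3, "Chair", 45)],
)

# DML, UPDATE: modify data in existing rows.
conn.execute("UPDATE product SET price = price + 5 WHERE name = 'Lamp'")

# DML, DELETE: remove rows.
conn.execute("DELETE FROM product WHERE name = 'Chair'")

# DML, SELECT: query the table.
rows = conn.execute("SELECT name, price FROM product ORDER BY product_id").fetchall()
print(rows)
```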


Views

• Views provide users controlled access to tables

• Advantages of views:
  • Simplify query commands
  • Provide data security
  • Enhance programming productivity
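A short sketch of a view, using sqlite3 and an invented employee table. The view exposes only two columns and only one department, which is the sense in which a view gives controlled access. SQLite itself has no user privileges, so in a multi-user DBMS the security benefit would come from granting access to the view rather than to the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER);
INSERT INTO employee VALUES
    (1, 'Ada', 'Lab', 5200),
    (2, 'Bob', 'Lab', 4100),
    (3, 'Eve', 'IT',  4800);

-- The view hides the salary column and every non-Lab row.
CREATE VIEW lab_staff AS
    SELECT emp_id, name FROM employee WHERE dept = 'Lab';
""")

cursor = conn.execute("SELECT * FROM lab_staff ORDER BY emp_id")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(columns, rows)
```

Queries against lab_staff stay short because the filter and the column list live in the view definition.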



The SELECT Statement

• Used for queries on single or multiple tables

• Clauses of the SELECT statement:
  • SELECT: List the columns (and expressions) that should be returned from the query
  • FROM: Indicate the table(s) or view(s) from which data will be obtained
  • WHERE: Indicate the conditions under which a row will be included in the result
  • GROUP BY: Indicate categorization of results
  • HAVING: Indicate the conditions under which a category (group) will be included
  • ORDER BY: Sorts the result according to specified criteria
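One query that uses all six clauses, run through sqlite3 on an invented order_line table. The comments note what each clause contributes; WHERE filters rows before grouping, HAVING filters groups afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_line (order_id INTEGER, product TEXT, quantity INTEGER, price INTEGER);
INSERT INTO order_line VALUES
    (1006, 'Desk',  2, 120),
    (1006, 'Lamp',  1,  30),
    (1007, 'Desk',  1, 120),
    (1008, 'Lamp',  3,  30),
    (1008, 'Chair', 0,  45);
""")

query = """
SELECT order_id, SUM(quantity * price) AS total   -- columns and expressions to return
FROM order_line                                   -- source table
WHERE quantity > 0                                -- row filter, applied before grouping
GROUP BY order_id                                 -- one result row per order
HAVING SUM(quantity * price) >= 100               -- group filter, applied after grouping
ORDER BY total DESC                               -- sort the final result
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Order 1008 totals 90 and is therefore dropped by the HAVING clause.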



SQL Operators

• Boolean operators: AND, OR, NOT

• String comparison using the wildcard %: LIKE

• Scalar aggregate using COUNT

• Vector aggregate using GROUP BY

• Qualifying results by categories using HAVING
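The operators above, sketched with sqlite3 on an invented customer table. A scalar aggregate returns a single value for the whole table, while a vector aggregate returns one value per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (name TEXT, city TEXT);
INSERT INTO customer VALUES
    ('Smith',  'Amsterdam'),
    ('Smits',  'Utrecht'),
    ('Jansen', 'Amsterdam'),
    ('Bakker', 'Utrecht'),
    ('Visser', 'Leiden');
""")

# Boolean operators combined with a LIKE wildcard match.
like_rows = conn.execute(
    "SELECT name FROM customer WHERE name LIKE 'Smi%' AND NOT city = 'Utrecht'"
).fetchall()

# Scalar aggregate: one value for the whole table.
total = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

# Vector aggregate: one value per group.
per_city = conn.execute(
    "SELECT city, COUNT(*) FROM customer GROUP BY city ORDER BY city"
).fetchall()

# HAVING keeps only the groups that satisfy the condition.
busy_cities = conn.execute(
    "SELECT city, COUNT(*) FROM customer GROUP BY city HAVING COUNT(*) > 1 ORDER BY city"
).fetchall()

print(like_rows, total, per_city, busy_cities)
```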



Joins

• Join: a relational operation that causes two or more tables with a common domain to be combined into a single table or view

• The common columns in joined tables are usually the primary key of the dominant table and the foreign key of the dependent table in 1:M relationships.



Joins - Example

[Figure: four related tables: customers, orders, order lines, products]


Multiple Table Join Example

• Assemble all information necessary to create an invoice for order number 1006

SELECT C.CUSTOMER_ID, C.NAME, C.ADDRESS, CITY, STATE, POSTAL_CODE,
       O.ORDER_ID, O.DATE, QUANTITY, P.NAME, PRICE, (QUANTITY * PRICE)
FROM CUSTOMER_T C, ORDER_T O, ORDER_LINE_T OL, PRODUCT_T P
WHERE C.CUSTOMER_ID = O.CUSTOMER_ID
  AND O.ORDER_ID = OL.ORDER_ID
  AND OL.PRODUCT_ID = P.PRODUCT_ID
  AND O.ORDER_ID = 1006;

• Four tables are involved in this join

• Each pair of tables requires an equality-check condition in the WHERE clause, matching primary keys against foreign keys


Multiple Table Join Example

[Figure: query result, with columns labeled as coming from the CUSTOMER_T table, the ORDER_T table, and the PRODUCT_T table]
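A runnable version of a four-table invoice join, using sqlite3. The table names follow the CUSTOMER_T, ORDER_T and PRODUCT_T labels in the figure; ORDER_LINE_T, the reduced column lists, and all sample rows are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER_T (CUSTOMER_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE ORDER_T (
    ORDER_ID    INTEGER PRIMARY KEY,
    CUSTOMER_ID INTEGER REFERENCES CUSTOMER_T(CUSTOMER_ID)
);
CREATE TABLE PRODUCT_T (PRODUCT_ID INTEGER PRIMARY KEY, NAME TEXT, PRICE INTEGER);
CREATE TABLE ORDER_LINE_T (
    ORDER_ID   INTEGER REFERENCES ORDER_T(ORDER_ID),
    PRODUCT_ID INTEGER REFERENCES PRODUCT_T(PRODUCT_ID),
    QUANTITY   INTEGER,
    PRIMARY KEY (ORDER_ID, PRODUCT_ID)
);
INSERT INTO CUSTOMER_T VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO ORDER_T VALUES (1006, 2), (1007, 1);
INSERT INTO PRODUCT_T VALUES (4, 'Desk', 120), (5, 'Lamp', 30);
INSERT INTO ORDER_LINE_T VALUES (1006, 4, 1), (1006, 5, 2), (1007, 4, 3);
""")

# Three equality conditions link the four tables (primary key = foreign key);
# the last condition selects the order to invoice.
invoice = conn.execute("""
SELECT C.NAME, O.ORDER_ID, P.NAME, OL.QUANTITY, (OL.QUANTITY * P.PRICE)
FROM CUSTOMER_T C, ORDER_T O, ORDER_LINE_T OL, PRODUCT_T P
WHERE C.CUSTOMER_ID = O.CUSTOMER_ID
  AND O.ORDER_ID = OL.ORDER_ID
  AND OL.PRODUCT_ID = P.PRODUCT_ID
  AND O.ORDER_ID = 1006
ORDER BY P.PRODUCT_ID
""").fetchall()
print(invoice)
```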


Distributed Databases

A single logical database that is spread physically across computers in multiple locations that are connected by a data communications link

• Homogeneous - Same DBMS at each node
  • Autonomous - Independent DBMSs
  • Non-autonomous - Central, coordinating DBMS
  • Easy to manage, difficult to enforce

• Heterogeneous - Different DBMSs at different nodes
  • Autonomous - Independent DBMSs
  • Difficult to manage, preferred by independent organizations



Homogeneous, Non-Autonomous Database

• Data is distributed across all the nodes

• Same DBMS at each node

• All data is managed by the distributed DBMS (no exclusively local data)

• All access is through one, global schema

• The global schema is the union of all the local schemas



Homogeneous Database

[Figure: homogeneous distributed database with identical DBMSs at each node. Source: adapted from Bell and Grimson, 1992.]



Typical Heterogeneous Environment

• Data distributed across all the nodes

• Different DBMSs may be used at each node

• Local access is done using the local DBMS and schema


Data Integration For e-Science

• Data Management Architecture

• e-DBI data Integration Layer

• e-DBI Data Integration Approach

• e-DBI Implementation Strategy

• e-DBI Demo


Data Management: High Level Architecture Design

[Figure: layered architecture numbered 1 to 4. Components shown: application-specific data sources; standard DBMS (Oracle, MySQL, etc.); file servers (SRM, SRB, GridFTP, etc.); Data Sources Manager; e-DBI; virtual / materialized user-specific catalogs]


Data Integration Layer

[Figure: e-Science Database Integrator (e-DBI). Components shown: virtual/materialized user-specific repositories; DS Registry; MD Collector; Meta Data Integrator (pre-defined, dynamic); Data Sources Manager; data sources (DS) and meta data (MD)]


e-DBI – Approach

1. Define a virtual database (VDB), using any relational database

2. Select the needed information from the different data sources (tables):

• Filter the data

• Rename table name and attributes

• Reformat the data (apply any conversion if required)

3. Transfer the data into the new VDB, by copying the information

4. Enhance the VDB

• Set new constraints

• Merge or fuse data

• Apply additional reformatting, etc.

5. Update the VDB

• Check availability and completeness at the sources at any time

• Decide whether to perform an update or a data replacement

The current implementation of e-DBI supports the following data sources:

• Oracle, Sybase, MySQL, XML, Excel spreadsheets, etc.
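The five steps of the approach can be sketched in a few lines of Python with sqlite3. This is an illustration of the idea only, not e-DBI code: the two in-memory sources, their tables, the `refresh` helper and all data are invented, and step 5 is reduced to the simplest policy (replace the full contents on every refresh).

```python
import sqlite3

# Two hypothetical sources, stood up in memory for illustration only.
lab_a = sqlite3.connect(":memory:")
lab_a.executescript("""
CREATE TABLE samples (id INTEGER, gene TEXT, temp_f REAL, valid INTEGER);
INSERT INTO samples VALUES (1, 'brca1', 98.6, 1), (2, 'tp53', 212.0, 0), (3, 'egfr', 32.0, 1);
""")
lab_b = sqlite3.connect(":memory:")
lab_b.executescript("""
CREATE TABLE measurements (sample_no INTEGER, gene_symbol TEXT, temp_c REAL);
INSERT INTO measurements VALUES (10, 'KRAS', 25.0);
""")

# Step 1: define the virtual database (VDB) in any relational DBMS.
vdb = sqlite3.connect(":memory:")
vdb.execute("CREATE TABLE experiment (source TEXT, sample_id INTEGER, gene TEXT, temp_c REAL)")


def refresh(target):
    """Steps 2, 3 and 5: select, filter, rename and reformat, then copy into the VDB."""
    rows_a = lab_a.execute(
        "SELECT 'lab_a', id, UPPER(gene), ROUND((temp_f - 32) * 5.0 / 9.0, 1) "
        "FROM samples WHERE valid = 1"      # filter rows, rename columns, convert units
    ).fetchall()
    rows_b = lab_b.execute(
        "SELECT 'lab_b', sample_no, gene_symbol, temp_c FROM measurements"
    ).fetchall()
    with target:                            # one transaction: replace the old contents
        target.execute("DELETE FROM experiment")
        target.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?)", rows_a + rows_b)


refresh(vdb)

# Step 4: enhance the VDB, here by setting a new uniqueness constraint.
vdb.execute("CREATE UNIQUE INDEX experiment_key ON experiment (source, sample_id)")

rows = vdb.execute(
    "SELECT source, sample_id, gene, temp_c FROM experiment ORDER BY source, sample_id"
).fetchall()

# Step 5: a source has changed, so refresh the VDB again.
lab_b.execute("INSERT INTO measurements VALUES (11, 'MYC', 30.0)")
refresh(vdb)
count_after_refresh = vdb.execute("SELECT COUNT(*) FROM experiment").fetchone()[0]

print(rows, count_after_refresh)
```

Full replacement is the easy choice for small sources; an incremental update would need a way to detect which source rows changed, which is the decision step 5 refers to.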


e-DBI – DBMS Driver Registry

Description: The e-DBI DBMS Driver Registry allows the user to register supported DBMSs from the application. The information required to register a DBMS includes the DBMS driver and the URL format.


e-DBI – DS Registry

Description: The e-DBI Data Source Registry allows the user to register, from the application, the data sources that will be used during the integration process. The information to be registered includes: DS name, host, port, driver, user name, and user password.
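The two registries can be pictured as two small record types. The sketch below is hypothetical and does not reflect e-DBI's actual classes: `DbmsDriver`, `DataSource` and the `url` method are invented names, and the JDBC-style driver class and URL pattern are assumptions based on e-DBI being built on a JDBC client. The fields themselves are the ones listed in the two descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbmsDriver:
    """One entry of the DBMS Driver Registry: a driver and its URL format."""
    name: str
    driver: str
    url_format: str


@dataclass(frozen=True)
class DataSource:
    """One entry of the DS Registry: name, host, port, driver, user name, password."""
    name: str
    host: str
    port: int
    driver: DbmsDriver
    user: str
    password: str

    def url(self) -> str:
        # Fill the registered URL format with this data source's details.
        return self.driver.url_format.format(host=self.host, port=self.port, name=self.name)


# Assumed example values; nothing here comes from a real deployment.
mysql = DbmsDriver(
    name="MySQL",
    driver="com.mysql.jdbc.Driver",
    url_format="jdbc:mysql://{host}:{port}/{name}",
)
source = DataSource(
    name="genes", host="db.example.org", port=3306,
    driver=mysql, user="reader", password="secret",
)

print(source.url())
```

Keeping the URL format with the driver means a new data source only needs its own connection details, not knowledge of how each DBMS spells its connection string.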


e-DBI Interface


e-DBI – MD Collector

Description: The e-DBI Meta Data Collector allows the user to identify, from the application, the subset of meta data to be used for integration. In addition, the MD Collector allows a limited set of meta data conversions to be applied to a single data source, namely: renaming, conversion, aggregation, and type casting.

Metadata Collector


e-DBI – MD Integrator

Description: The e-DBI Meta Data Integrator allows the user to perform, from the application, MD integration across the different data sources, based on the set of meta data gathered through the MD Collector. The MD Integrator will allow a full integration of meta data from the different sources, including data merging and data aggregation.

Metadata Integrator


e-DBI – Implementation Strategy

The e-DBI implementation is based on the open source Squirrel SQL project (http://squirrel-sql.sourceforge.net).

e-DBI targets the following additional challenges:

1. Provide an interface that is more suitable and convenient for the scientist
  • Hide unnecessary details
  • Re-organize the architecture of the interface

2. Enhance the data integration functionalities:
  • Provide a hybrid solution between federated and warehousing approaches
  • Facilitate schema update and data refreshment


Data Warehousing and OLAP Technology

• What is a data warehouse?

• A multi-dimensional data model

• Data warehouse architecture

• Data Warehouse Process

• Data Cube example

• Slice and Dice Queries


What is a Data Warehouse?

• Defined in many different ways, but not rigorously.

• “A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.” -- Barry Devlin, IBM Consultant

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” -- W. H. Inmon


Data Warehouse

• Subject-Oriented
  • Organized around major subjects, such as gene, ontology, customer, product.
  • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

• Integrated
  • Constructed by integrating multiple, heterogeneous data sources
  • Data cleaning and data integration techniques are applied

• Time-Variant
  • Data warehouse provides information from a historical perspective
  • Every key structure contains an element of time, explicitly or implicitly

• Non-Volatile
  • A physically separate store of data transformed from the operational environment.
  • Operational update of data does not occur in the data warehouse.


DW – Multidimensional model

• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube

• Data is modeled as fact table(s) and multiple dimensions
  • A fact table contains measures and keys to each of the related dimension tables (e.g. Call Details)
  • Dimension tables, such as Caller (name, address, tel) or Time (day, week, month, quarter, year)
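A minimal star-schema sketch in sqlite3, loosely following the Call Details example. The table names, the choice of measures (call duration in seconds, call cost in cents) and all rows are assumptions for illustration. The fact table holds only keys and measures; descriptive attributes live in the dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE caller (caller_key INTEGER PRIMARY KEY, name TEXT, address TEXT, tel TEXT);
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    day TEXT, week INTEGER, month TEXT, quarter TEXT, year INTEGER
);

-- Fact table: one key per dimension, plus the measures.
CREATE TABLE call_detail (
    caller_key    INTEGER REFERENCES caller(caller_key),
    time_key      INTEGER REFERENCES time_dim(time_key),
    call_duration INTEGER,   -- measure, in seconds
    call_cost     INTEGER    -- measure, in cents
);

INSERT INTO caller VALUES (1, 'Ada', 'Amsterdam', '020-1'), (2, 'Bob', 'Utrecht', '030-2');
INSERT INTO time_dim VALUES
    (1, '2009-01-15',  3, 'Jan', 'Q1', 2009),
    (2, '2009-04-20', 17, 'Apr', 'Q2', 2009);
INSERT INTO call_detail VALUES (1, 1, 60, 10), (1, 2, 120, 20), (2, 1, 30, 5), (1, 1, 90, 15);
""")

# A typical warehouse query: aggregate a measure, grouped by dimension attributes.
cost_by_caller_quarter = conn.execute("""
SELECT c.name, t.quarter, SUM(f.call_cost)
FROM call_detail f
JOIN caller   c ON c.caller_key = f.caller_key
JOIN time_dim t ON t.time_key   = f.time_key
GROUP BY c.name, t.quarter
ORDER BY c.name, t.quarter
""").fetchall()
print(cost_by_caller_quarter)
```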


DW - Conceptual Modeling

• Modeling data warehouses: dimensions & measures
  • Star schema: A fact table in the middle connected to a set of dimension tables
  • Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation


Example of Star Schema

• A fact table in the middle connected to a set of dimension tables

[Figure: star schema with the following tables]

• Fact table Call Detail: Time, Caller key, Call Type, Location key, Date, Bonus key; measures: Call Duration, Call Cost, Credit
• Dimension Bonus: Bonus key, Bonus Type, Date, Amount
• Dimension Location: Location key, Street, City key
• Dimension Customer: Customer key, Customer Name, Address, Telephone, Fax
• Dimension Call Type: Type key, SIM Pack, Number Format, Supplier


Example of Snowflake Schema

• Dimensional hierarchy is added to the dimension tables

[Figure: the Call Detail schema with two dimensions normalized further; the fact table and the Bonus, Time and Customer tables are also shown]

• Location (Location key, Street, City key) refers to City (City key, City, Province, Country)
• Call Type (Type key, SIM Pack, Number Format, Supplier key) refers to Supplier (Supplier key, Supplier type)


Example of Fact Constellation

• Multiple fact tables share dimension tables, viewed as a collection of stars

[Figure: the Call Detail schema with a second fact table added]

• Sales fact table: Customer key, Product key, Dealer key, Location, To Location, Unit Price, Total cost
• Dimension Dealer: Dealer key, Dealer name, Dealer type, Location
• Tables repeated from the Call Detail schema: Bonus, Location, Time, Customer, Call Type


Example of Fact Constellation

[Figure: generic fact constellation]

• Fact table: S_Key_1 to S_Key_4 and Measure_1 to Measure_3 (all Num)
• Dimension table: S_Key_1 (Num) and Attribute_1 to Attribute_5 (all TXT)
• The fact table appears three times: with Dimensions 1 to 4, with Dimensions 2, 3 and 4, and with Dimensions 1, 2 and 4


Atlas data warehouse for integrative bioinformatics


Data Warehouse Process

• Data warehouses are designed around a business process rather than the business requirements.

• Business processes correspond to the data flows within a data warehouse. These processes (ETL) are:
  • Extract and Load the data
  • Clean and Transform the data into a form that copes with large data volumes and provides good query performance
  • Back up and Archive data
  • Manage queries and direct them to the appropriate tables

[Figure: Data Sources → ETL → Data Warehouse]
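The extract, clean/transform and load steps can be sketched as three small Python functions feeding an in-memory sqlite3 warehouse. Everything here is invented for illustration: the raw records, the cleaning rules (normalize caller names, drop rows with an unparseable duration, convert cost to cents) and the call_fact table.

```python
import sqlite3

# Hypothetical raw records as they might arrive from an operational source.
raw_calls = [
    {"caller": " ada ", "duration": "60", "cost": "0.10"},
    {"caller": "Bob", "duration": "n/a", "cost": "0.05"},   # unparseable duration
    {"caller": "ADA", "duration": "120", "cost": "0.20"},
]


def extract():
    """Extract: read the records from the source."""
    return list(raw_calls)


def transform(records):
    """Clean and transform: normalize names, drop bad rows, convert cost to cents."""
    clean = []
    for record in records:
        try:
            duration = int(record["duration"])
        except ValueError:
            continue  # skip rows whose duration cannot be parsed
        name = record["caller"].strip().title()
        cents = round(float(record["cost"]) * 100)
        clean.append((name, duration, cents))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS call_fact (caller TEXT, duration_s INTEGER, cost_cents INTEGER)"
    )
    with conn:
        conn.executemany("INSERT INTO call_fact VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))

summary = warehouse.execute(
    "SELECT caller, SUM(duration_s), SUM(cost_cents) FROM call_fact GROUP BY caller"
).fetchall()
print(summary)
```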


Multi-Tiered Architecture

…basically outlines how the DW components fit together

[Figure: multi-tiered data warehouse architecture. Components shown: Source 1, Source 2, …, Source n; ETL; Staging Area; Warehouse manager; Detailed Info., Summary Info. and Meta Data; Archive of Detailed Information; Query Manager; Reports]


A Sample Data Cube

[Figure: data cube with three axes, each with a "sum" margin: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Scientists (sc1, sc2, sc3), and Experiments (East, Center, West). Highlighted cell: total annual experiments for East by scientist sc3]


Slice and Dice Queries

• Slice and Dice: select and project on one or more dimensions

[Figure: data cube with dimensions experiments, scientists and genes, sliced at scientist = “Smith”]
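In relational terms, a slice fixes one dimension to a single value and a dice restricts two or more dimensions at once. The sketch below shows both with sqlite3 on an invented runs table whose three dimensions match the cube in the figure (scientist, gene, experiment); the n_runs measure and all rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (scientist TEXT, gene TEXT, experiment TEXT, n_runs INTEGER);
INSERT INTO runs VALUES
    ('Smith', 'BRCA1', 'exp1', 3),
    ('Smith', 'TP53',  'exp1', 1),
    ('Smith', 'BRCA1', 'exp2', 2),
    ('Jones', 'BRCA1', 'exp1', 4),
    ('Jones', 'TP53',  'exp2', 5);
""")

# Slice: fix one dimension (scientist = 'Smith') and keep the other two.
slice_rows = conn.execute("""
SELECT gene, experiment, n_runs
FROM runs
WHERE scientist = 'Smith'
ORDER BY gene, experiment
""").fetchall()

# Dice: select on several dimensions at once, giving a smaller sub-cube.
dice_rows = conn.execute("""
SELECT scientist, gene, n_runs
FROM runs
WHERE scientist IN ('Smith', 'Jones') AND gene = 'BRCA1' AND experiment = 'exp1'
ORDER BY scientist
""").fetchall()

print(slice_rows, dice_rows)
```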
