A. Benabdelkader, 2009

Managing Distributed Data

Bio-Wise Information Management

Enhancing Value … ….

Ammar Benabdelkader

Amsterdam, 25-05-2009

http://www.science.uva.nl/~ammar

Managing Distributed Data

• Introduction to Relational Databases

• The relational model

• Distributed Databases

• Homogeneous / heterogeneous databases

• e-Science Database Integrator (e-DBI)

• e-DBI, a scenario case

• Data Warehouse

• Multi-dimensional data model

Introduction to Relational Databases

• Relational Model

• Primary/Foreign keys

• Integrity constraints

• Data Normalization

• Queries/Join

• Views

Relational Model

• Definition: A relation is a named, two-dimensional table of data
  • A table is made up of rows (records) and columns (attributes or fields)

• Requirements:
  • Every relation has a unique name
  • Every attribute value is atomic (not multivalued, not composite)
  • Every row is unique (no two rows can have exactly the same values for all their fields)
  • Attributes (columns) in a table have unique names
  • The order of the columns is irrelevant
  • The order of the rows is irrelevant

© Prentice Hall, 2002

Key Fields

• Keys are special fields that serve two main purposes:
  • Primary keys are unique identifiers of the relation. Examples include employee numbers and social security numbers. This is how we can guarantee that all rows are unique
  • Foreign keys are identifiers that enable a dependent relation to refer to its parent relation

• Keys can be simple (a single field) or composite (more than one field)

Key Fields - Example

• Primary key

• Foreign key (implements the 1:N relationship between customer and order)

• Composite primary key (implements the M:N relationship between order and product)
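As a minimal sketch of how these three kinds of key might be declared, the following uses Python's built-in sqlite3 module with an in-memory database. The table and column names (customer, orders, product, order_line) and the helper functions are illustrative assumptions, not taken from the slide's figure.

```python
import sqlite3

# Hypothetical schema; all names are illustrative only.
SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,               -- simple primary key
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id)   -- foreign key: 1:N customer to orders
);
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE order_line (
    order_id    INTEGER NOT NULL REFERENCES orders(order_id),
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)             -- composite key: M:N orders to product
);
"""


def build_schema():
    """Create the example tables in a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def primary_key_columns(conn, table):
    """Return the primary-key column names of a table, in key order."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
    keyed = sorted((row[5], row[1]) for row in rows if row[5] > 0)
    return [name for _, name in keyed]
```

Calling `primary_key_columns(build_schema(), "order_line")` lists both columns of the composite key, while `customer` reports a single key column.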

Integrity Constraints

• Domain Constraints
  • Allowable values for an attribute

• Entity Integrity
  • No primary key attribute may be null. All primary key fields MUST have data

• Referential Integrity
  • Rule that states that any foreign key value MUST match a primary key value in the relation on the "one" side of the relationship

Integrity Constraints - Example

Referential integrity constraints are drawn as arrows from the dependent table to the parent table
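The three kinds of constraint can be observed in a small experiment. This is a sketch only, using Python's sqlite3 module; the tables, columns, and the `rejected` helper are hypothetical. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; other DBMSs typically enforce them by default.

```python
import sqlite3


def rejected(conn, sql):
    """Return True if the DBMS rejects the statement with an integrity error."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


def demo_constraints():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
    conn.executescript("""
        CREATE TABLE customer (
            customer_id TEXT NOT NULL PRIMARY KEY,   -- entity integrity
            credit      INTEGER CHECK (credit >= 0)  -- domain constraint
        );
        CREATE TABLE orders (
            order_id    INTEGER NOT NULL PRIMARY KEY,
            customer_id TEXT NOT NULL
                        REFERENCES customer(customer_id)  -- referential integrity
        );
        INSERT INTO customer VALUES ('C1', 100);
    """)
    return {
        "domain": rejected(conn, "INSERT INTO customer VALUES ('C2', -5)"),
        "entity": rejected(conn, "INSERT INTO customer VALUES (NULL, 50)"),
        "referential": rejected(conn, "INSERT INTO orders VALUES (10, 'C9')"),
        "valid": rejected(conn, "INSERT INTO orders VALUES (11, 'C1')"),
    }
```

The first three inserts are refused (negative credit, null primary key, order for a non-existent customer); the last one references an existing customer and is accepted.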

Data Normalization

• Primarily a tool to validate and improve a logical design so that it satisfies certain constraints that avoid unnecessary duplication of data

• The process of decomposing relations with anomalies to produce smaller, well-structured relations

Well-Structured Relations

• A relation that contains minimal data redundancy and allows users to insert, delete, and update rows without causing data inconsistencies

• The goal is to avoid anomalies:
  • Insertion anomaly: adding new rows forces the user to create duplicate data
  • Deletion anomaly: deleting rows may cause a loss of data that would be needed for other future rows
  • Modification anomaly: changing data in a row forces changes to other rows because of duplication

Data Normalization

• First Normal Form (1NF)
  • No multivalued attributes
  • Every attribute value is atomic

• Second Normal Form (2NF)
  • 1NF plus: every non-key attribute must be defined by the entire key, not by only part of the key
  • No partial functional dependencies

• Third Normal Form (3NF)
  • 2NF plus no transitive dependencies (a transitive dependency arises when one attribute functionally determines a second, which functionally determines a third)
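A transitive dependency and its removal can be illustrated with a few rows of plain Python data. The relation below is hypothetical: order_id determines customer_id, which in turn determines customer_name, so the flat relation is not in third normal form. Decomposing it stores each customer name once.

```python
# Hypothetical flat relation with a transitive dependency:
# order_id -> customer_id -> customer_name
FLAT_ORDERS = [
    # (order_id, customer_id, customer_name)
    (1001, 1, "Ada"),
    (1002, 1, "Ada"),
    (1003, 2, "Bob"),
]


def decompose(flat_rows):
    """Split the flat relation into two relations without the transitive dependency."""
    orders = sorted({(order_id, customer_id) for order_id, customer_id, _ in flat_rows})
    customers = sorted({(customer_id, name) for _, customer_id, name in flat_rows})
    return orders, customers


def rejoin(orders, customers):
    """Natural join on customer_id; it reproduces the flat relation (lossless decomposition)."""
    names = dict(customers)
    return sorted((order_id, cust_id, names[cust_id]) for order_id, cust_id in orders)


orders, customers = decompose(FLAT_ORDERS)
```

After the decomposition, renaming a customer touches exactly one row in `customers`, which is the modification anomaly the normal forms are meant to avoid.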

SQL

• Structured Query Language

• The standard for relational database management systems (RDBMS)

• SQL-92 Standard, purpose:
  • Specify syntax/semantics for data definition and manipulation
  • Define data structures
  • Enable portability
  • Specify minimal (level 1) and complete (level 2) standards
  • Allow for later growth/enhancement to the standard

SQL

• Data Definition Language (DDL)
  • Create table
  • Create view, etc.

• Data Manipulation Language (DML)
  • Insert statement: adds data to a table
  • Delete statement: removes rows from a table
  • Update statement: modifies data in existing rows
  • Select statement: queries single or multiple tables
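The DDL and DML statements listed here can be tried out with Python's built-in sqlite3 module. The following is a minimal sketch; the `scientist` table and its rows are made up for illustration.

```python
import sqlite3


def run_dml_demo():
    conn = sqlite3.connect(":memory:")
    # DDL: define the table structure.
    conn.execute("CREATE TABLE scientist (id INTEGER PRIMARY KEY, name TEXT, lab TEXT)")
    # DML, insert: add rows.
    conn.executemany(
        "INSERT INTO scientist (id, name, lab) VALUES (?, ?, ?)",
        [(1, "Smith", "East"), (2, "Jones", "West"), (3, "Lee", "East")],
    )
    # DML, update: modify data in an existing row.
    conn.execute("UPDATE scientist SET lab = 'Center' WHERE name = ?", ("Jones",))
    # DML, delete: remove a row.
    conn.execute("DELETE FROM scientist WHERE id = ?", (3,))
    # DML, select: query the table.
    return conn.execute("SELECT id, name, lab FROM scientist ORDER BY id").fetchall()
```

The function returns the two remaining rows, with the lab of Jones changed by the update.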

Views

• Views provide users controlled access to tables

• Advantages of views:
  • Simplify query commands
  • Provide data security
  • Enhance programming productivity
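As an illustration of the "controlled access" idea, the sketch below defines a view that exposes only some columns of a table. The `employee` table, its rows, and the view name `employee_public` are hypothetical; the code uses Python's sqlite3 module.

```python
import sqlite3


def view_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER);
        INSERT INTO employee VALUES (1, 'Ada', 'R&D', 5000), (2, 'Bob', 'Sales', 4000);
        -- The view exposes names and departments but hides the salary column.
        CREATE VIEW employee_public AS SELECT id, name, dept FROM employee;
    """)
    cursor = conn.execute("SELECT * FROM employee_public ORDER BY id")
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()
```

Users who query `employee_public` see three columns only; whether they can also be denied access to the base table depends on the permission system of the DBMS, which SQLite does not provide.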

The SELECT Statement

• Used for queries on single or multiple tables

• Clauses of the SELECT statement:
  • SELECT: lists the columns (and expressions) that should be returned from the query
  • FROM: indicates the table(s) or view(s) from which data will be obtained
  • WHERE: indicates the conditions under which a row will be included in the result
  • GROUP BY: indicates the categorization of results
  • HAVING: indicates the conditions under which a category (group) will be included
  • ORDER BY: sorts the result according to specified criteria

SQL Operators

• Boolean operators: AND, OR, NOT

• String comparison using the % wildcard: LIKE

• Scalar aggregate using COUNT

• Vector aggregate using GROUP BY

• Qualifying results by categories using HAVING
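The clauses and operators above can be combined in one query. The following sketch, using Python's sqlite3 module, contrasts a scalar aggregate (one value for the whole table) with a vector aggregate (one value per group). The `experiment` table and its contents are invented for the example.

```python
import sqlite3


def select_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE experiment (id INTEGER PRIMARY KEY, scientist TEXT, region TEXT);
        INSERT INTO experiment (scientist, region) VALUES
            ('Smith', 'East'), ('Smith', 'East'), ('Smith', 'West'),
            ('Smythe', 'East'), ('Jones', 'West');
    """)
    # Scalar aggregate: a single count for the whole table.
    total = conn.execute("SELECT COUNT(*) FROM experiment").fetchone()[0]
    # Vector aggregate: one count per scientist.
    # WHERE filters rows (LIKE with the % wildcard, AND, NOT),
    # HAVING filters groups, ORDER BY sorts the result.
    per_scientist = conn.execute("""
        SELECT scientist, COUNT(*) AS n
        FROM experiment
        WHERE scientist LIKE 'Sm%' AND NOT region = 'Center'
        GROUP BY scientist
        HAVING COUNT(*) >= 2
        ORDER BY n DESC, scientist
    """).fetchall()
    return total, per_scientist
```

WHERE keeps the rows of Smith and Smythe, GROUP BY forms one group per scientist, and HAVING then drops the Smythe group because it contains a single row.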

Joins

• Join: a relational operation that causes two or more tables with a common domain to be combined into a single table or view

• The common columns in joined tables are usually the primary key of the dominant table and the foreign key of the dependent table in 1:M relationships

Joins - Example

[Figure: the tables used in the join example: customers, orders, order lines, products]

Multiple Table Join Example

• Assemble all information necessary to create an invoice for order number 1006

SELECT C.CUSTOMER_ID, C.NAME, C.ADDRESS, CITY, STATE, POSTAL_CODE,
       O.ORDER_ID, O.DATE, QUANTITY, P.NAME, PRICE, (QUANTITY * PRICE)
FROM CUSTOMER_T C, ORDER_T O, ORDER_LINE_T OL, PRODUCT_T P
WHERE C.CUSTOMER_ID = O.CUSTOMER_ID
  AND O.ORDER_ID = OL.ORDER_ID
  AND OL.PRODUCT_ID = P.PRODUCT_ID
  AND O.ORDER_ID = 1006;

• Four tables are involved in this join

• Each pair of tables requires an equality-check condition in the WHERE clause, matching primary keys against foreign keys

Multiple Table Join Example

[Figure: result of the join, with column groups labeled "From CUSTOMER_T table", "From ORDER_T table", and "From PRODUCT_T table"]
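A runnable variant of the four-table invoice join is sketched below with Python's sqlite3 module. Several things here are assumptions rather than facts from the slides: the table name ORDER_LINE_T, the reduced column lists, the sample rows, and the placement of the customer foreign key in the order table (so that customers join to orders on CUSTOMER_ID). The `_T` suffix also avoids using the reserved word ORDER as a table name.

```python
import sqlite3

# One join condition per pair of related tables, plus the order filter.
INVOICE_QUERY = """
SELECT C.CUSTOMER_ID, C.NAME, O.ORDER_ID, OL.QUANTITY, P.NAME, P.PRICE,
       (OL.QUANTITY * P.PRICE) AS LINE_TOTAL
FROM CUSTOMER_T C, ORDER_T O, ORDER_LINE_T OL, PRODUCT_T P
WHERE C.CUSTOMER_ID = O.CUSTOMER_ID
  AND O.ORDER_ID = OL.ORDER_ID
  AND OL.PRODUCT_ID = P.PRODUCT_ID
  AND O.ORDER_ID = ?
ORDER BY P.PRODUCT_ID
"""


def invoice_lines(order_id):
    """Return the invoice lines for one order from a small made-up data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE CUSTOMER_T (CUSTOMER_ID INTEGER PRIMARY KEY, NAME TEXT);
        CREATE TABLE ORDER_T (ORDER_ID INTEGER PRIMARY KEY, CUSTOMER_ID INTEGER);
        CREATE TABLE PRODUCT_T (PRODUCT_ID INTEGER PRIMARY KEY, NAME TEXT, PRICE INTEGER);
        CREATE TABLE ORDER_LINE_T (ORDER_ID INTEGER, PRODUCT_ID INTEGER, QUANTITY INTEGER,
                                   PRIMARY KEY (ORDER_ID, PRODUCT_ID));
        INSERT INTO CUSTOMER_T VALUES (2, 'Customer A');
        INSERT INTO ORDER_T VALUES (1006, 2), (1007, 2);
        INSERT INTO PRODUCT_T VALUES (4, 'Desk', 650), (5, 'Chair', 325);
        INSERT INTO ORDER_LINE_T VALUES (1006, 4, 1), (1006, 5, 2), (1007, 4, 3);
    """)
    return conn.execute(INVOICE_QUERY, (order_id,)).fetchall()
```

With four tables there are three join conditions; leaving one out would pair every row of the unjoined table with every other result row.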

Distributed Databases

A single logical database that is spread physically across computers in multiple locations that are connected by a data communications link

• Homogeneous - Same DBMS at each node
  • Autonomous - Independent DBMSs
  • Non-autonomous - Central, coordinating DBMS
  • Easy to manage, difficult to enforce

• Heterogeneous - Different DBMSs at different nodes
  • Autonomous - Independent DBMSs
  • Difficult to manage, preferred by independent organizations

Homogeneous, Non-Autonomous Database

• Data is distributed across all the nodes

• Same DBMS at each node

• All data is managed by the distributed DBMS (no exclusively local data)

• All access is through one global schema

• The global schema is the union of all the local schemas
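The idea of a global schema as the union of local schemas can be imitated on a single machine. In this sketch, two in-memory SQLite databases attached to one connection stand in for two nodes running the same DBMS; the node names, the `customer` table, and its rows are hypothetical. A real distributed DBMS would add location transparency, distributed transactions, and query optimization, none of which is modeled here.

```python
import sqlite3


def global_customers():
    conn = sqlite3.connect(":memory:")
    # Each attached database plays the role of one node.
    conn.execute("ATTACH DATABASE ':memory:' AS node_a")
    conn.execute("ATTACH DATABASE ':memory:' AS node_b")
    fragments = (("node_a", [(1, "Ada")]), ("node_b", [(2, "Bob"), (3, "Eve")]))
    for node, rows in fragments:
        # Same local schema on every node (homogeneous setting).
        conn.execute(f"CREATE TABLE {node}.customer (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(f"INSERT INTO {node}.customer VALUES (?, ?)", rows)
    # The global relation is the union of the local fragments.
    return conn.execute("""
        SELECT id, name FROM node_a.customer
        UNION ALL
        SELECT id, name FROM node_b.customer
        ORDER BY id
    """).fetchall()
```

A user of the global schema would see the combined relation and need not know which node holds which row.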

Homogeneous Database

[Figure: homogeneous distributed database; label: Identical DBMSs]

Source: adapted from Bell and Grimson, 1992.

Typical Heterogeneous Environment

• Data distributed across all the nodes

• Different DBMSs may be used at each node

• Local access is done using the local DBMS and schema

Data Integration For e-Science

• Data Management Architecture

• e-DBI Data Integration Layer

• e-DBI Data Integration Approach

• e-DBI Implementation Strategy

• e-DBI Demo

Data Management: High Level Architecture Design

[Figure: layered data management architecture, with layers numbered 1 to 4. Labels as extracted: Standard DBMS (Oracle, MySQL, etc.); File Servers (SRM, SRB, GridFTP, etc.); App. Specific data sources; Data Sources Manager; e-DBI; Virtual / Materialized user-specific catalogs]

Data Integration Layer

[Figure: components of the e-Science Database Integrator (e-DBI). Labels as extracted: Virtual/Materialized user-specific repositories; DS Registry; MD Collector; Pre-defined; Dynamic; Meta Data Integrator; Data Sources Manager; DS; MD]

e-DBI – Approach

1. Define a virtual database (VDB), using any relational database

2. Select the needed information from the different data sources (tables):
  • Filter the data
  • Rename table names and attributes
  • Reformat the data (apply any conversion if required)

3. Transfer the data into the new VDB, by copying the information

4. Enhance the VDB
  • Set new constraints
  • Merge or fuse data
  • Apply additional reformatting, etc.

5. Update the VDB
  • Check availability and completeness at the sources at any time
  • Decide whether to perform an update or a data replacement

The current implementation of e-DBI supports the following data sources:

• Oracle, Sybase, MySQL, XML, Excel spreadsheets, etc.
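Steps 1 to 4 of the approach can be sketched in a few lines. This is not e-DBI code: it is an illustrative Python sketch in which two in-memory SQLite databases stand in for a data source and the VDB, and every table name, column name, and value (GENE_TBL, gene, the gene names and lengths) is made up.

```python
import sqlite3


def build_vdb():
    # A stand-in data source; in e-DBI this could be Oracle, MySQL, and so on.
    source = sqlite3.connect(":memory:")
    source.executescript("""
        CREATE TABLE GENE_TBL (GID INTEGER, GNAME TEXT, LEN_KB REAL, ORGANISM TEXT);
        INSERT INTO GENE_TBL VALUES
            (1, 'genA', 81.0, 'human'), (2, 'genB', 19.1, 'human'), (3, 'genC', 22.4, 'mouse');
    """)

    # Step 1: define the virtual database (VDB) in any relational database.
    vdb = sqlite3.connect(":memory:")
    vdb.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, length_bp INTEGER)")

    # Step 2: select from the source while filtering (human genes only),
    # renaming attributes, and reformatting (upper case, kilobases to base pairs).
    selected = source.execute("""
        SELECT GID AS gene_id,
               UPPER(GNAME) AS symbol,
               CAST(ROUND(LEN_KB * 1000) AS INTEGER) AS length_bp
        FROM GENE_TBL
        WHERE ORGANISM = 'human'
    """).fetchall()

    # Step 3: transfer the data into the VDB by copying.
    vdb.executemany("INSERT INTO gene VALUES (?, ?, ?)", selected)

    # Step 4: enhance the VDB, here by setting a new uniqueness constraint.
    vdb.execute("CREATE UNIQUE INDEX gene_symbol_uq ON gene (symbol)")
    return vdb.execute("SELECT * FROM gene ORDER BY gene_id").fetchall()
```

Step 5 (updating the VDB) would repeat steps 2 and 3 against the current source contents and either merge the new rows or replace the table.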

e-DBI – DBMS Driver Registry

Description: The e-DBI DBMS Driver Registry allows the user to register supported DBMSs from the application. The information required to register a DBMS includes the DBMS driver and the URL format.

e-DBI – DS Registry

Description: The e-DBI Data Source Registry allows the user to register, from the application, the data sources that will be used during the integration process. The information to be registered includes: DS name, host, port, driver, user name, and user password.

e-DBI Interface

e-DBI – MD Collector

Description: The e-DBI Meta Data Collector allows the user to identify, from the application, the subset of metadata to be used for integration. In addition, the MD Collector allows a limited set of metadata conversions to be applied to the individual data sources, namely: renaming, conversion, aggregation, and type casting.

e-DBI – MD Integrator

Description: The e-DBI Meta Data Integrator allows the user to perform, from the application, MD integration across the different data sources, based on the set of metadata gathered through the MD Collector. The MD Integrator will allow a full integration of metadata from the different sources, including data merging and data aggregation.

e-DBI – Implementation Strategy

The e-DBI implementation is based on the open source Squirrel SQL project (http://squirrel-sql.sourceforge.net).

e-DBI targets the following additional challenges:

1. Provide an interface that is more suitable and convenient for the scientist
  • Hide unnecessary details
  • Reorganize the architecture of the interface

2. Enhance the data integration functionalities:
  • Provide a hybrid solution between the federated and warehousing approaches
  • Facilitate schema update and data refreshment

Data Warehousing and OLAP Technology

• What is a data warehouse?

• A multi-dimensional data model

• Data warehouse architecture

• Data Warehouse Process

• Data Cube example

• Slice and Dice Queries

What is a Data Warehouse?

• Defined in many different ways, but not rigorously.

• “A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.” -- Barry Devlin, IBM Consultant

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” -- W. H. Inmon

Data Warehouse

• Subject-Oriented
  • Organized around major subjects, such as gene, ontology, customer, product
  • Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

• Integrated
  • Constructed by integrating multiple, heterogeneous data sources
  • Data cleaning and data integration techniques are applied

• Time-Variant
  • The data warehouse provides information from a historical perspective
  • Every key structure contains an element of time, explicitly or implicitly

• Non-Volatile
  • A physically separate store of data transformed from the operational environment
  • Operational update of data does not occur in the data warehouse

DW – Multidimensional model

• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube

• Data is modeled as fact table(s) and multiple dimensions
  • The fact table contains measures and keys to each of the related dimension tables (e.g. Call Details)
  • Dimension tables, such as Caller (name, address, tel) or Time (day, week, month, quarter, year)

DW - Conceptual Modeling

• Modeling data warehouses: dimensions & measures
  • Star schema: a fact table in the middle connected to a set of dimension tables
  • Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  • Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

A fact table in the middle connected to a set of dimension tables

• Fact table (Call Detail): Time, Caller key, Call Type, Location key, Call Duration, Call Cost, Credit

• Dimension tables:
  • Bonus (Bonus key, Bonus Type, Date, Amount)
  • Location (Location key, Street, City key)
  • Customer (Customer key, Customer Name, Address, Telephone, Fax)
  • Call Type (Type key, SIM Pack, Number Format, Supplier)

• Other labels in the figure: measures, Date, Bonus key
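A reduced version of this star schema can be written down and queried as follows. The sketch uses Python's sqlite3 module and keeps only two dimensions and two measures; the snake_case column names, the units, and all sample rows are assumptions made for the example.

```python
import sqlite3


def cost_per_sim_pack():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables (a subset of the attributes in the example).
        CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
        CREATE TABLE call_type (type_key INTEGER PRIMARY KEY, sim_pack TEXT);
        -- Fact table: keys to the dimensions plus numeric measures.
        CREATE TABLE call_detail (
            caller_key    INTEGER REFERENCES customer(customer_key),
            call_type_key INTEGER REFERENCES call_type(type_key),
            call_duration INTEGER,  -- measure, in seconds (assumed unit)
            call_cost     INTEGER   -- measure, in cents (assumed unit)
        );
        INSERT INTO customer VALUES (1, 'Ada'), (2, 'Bob');
        INSERT INTO call_type VALUES (1, 'Prepaid'), (2, 'Contract');
        INSERT INTO call_detail VALUES (1, 1, 60, 50), (1, 2, 120, 100), (2, 1, 30, 25);
    """)
    # A typical star join: aggregate a measure, grouped by a dimension attribute.
    return conn.execute("""
        SELECT t.sim_pack, SUM(f.call_cost)
        FROM call_detail AS f
        JOIN call_type AS t ON f.call_type_key = t.type_key
        GROUP BY t.sim_pack
        ORDER BY t.sim_pack
    """).fetchall()
```

Every analytical query follows the same pattern: join the fact table to the dimensions of interest, group by dimension attributes, and aggregate the measures.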

Example of Snowflake Schema

Dimensional hierarchy is added to the dimension tables

• Fact table (Call Detail): Time, Caller key, Call Type, Location key, Call Duration, Call Cost, Credit

• Dimension tables:
  • Bonus (Bonus key, Bonus Type, Date, Amount)
  • Location (Location key, Street, City key)
  • City (City key, City, Province, Country)
  • Customer (Customer key, Customer Name, Address, Telephone, Fax)
  • Call Type (Type key, SIM Pack, Number Format, Supplier key)
  • Supplier (Supplier key, Supplier type)

• Other labels in the figure: measures, Date, Bonus key

Example of Fact Constellation

Multiple fact tables share dimension tables, viewed as a collection of stars

• Fact table (Call Detail): Time, Caller key, Call Type, Location key, Call Duration, Call Cost, Credit

• Fact table (Sales): Customer key, Product key, Dealer key, Location, Unit Price, Total cost

• Dimension tables:
  • Bonus (Bonus key, Bonus Type, Date, Amount)
  • Location (Location key, Street, City key)
  • Customer (Customer key, Customer Name, Address, Telephone, Fax)
  • Call Type (Type key, SIM Pack, Number Format, Supplier key)
  • Dealer (Dealer key, Dealer name, Dealer type, Location)

• Other labels in the figure: measures, Date, Bonus key, To Location

Example of Fact Constellation

[Figure: generic fact constellation with three fact tables and their dimension tables]

• Each fact table: S_Key_1, S_Key_2, S_Key_3, S_Key_4, Measure_1, Measure_2, Measure_3 (all of type Num)

• Each dimension table n (Dimension 1 to Dimension 4): S_Key_n (Num), Attribute_1 to Attribute_5 (TXT)

Atlas data warehouse for integrative bioinformatics

Data Warehouse Process

• Data warehouses are designed around a business process rather than the business requirements.

• Business processes correspond to the data flows within a data warehouse. These processes are (ETL):
  • Extract and load the data
  • Clean and transform the data into a form that can cope with large data volumes and provide good query performance
  • Back up and archive data
  • Manage queries and direct them to the appropriate tables

[Figure: Data Sources, ETL, Data Warehouse]
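The extract, transform, and load steps can be sketched as three small functions. Everything below is illustrative: the raw rows, the cleaning rules, and the `experiment_fact` table are assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational source.
RAW_ROWS = [
    {"scientist": " smith ", "region": "east", "experiments": "4"},
    {"scientist": "Jones", "region": "WEST", "experiments": "2"},
    {"scientist": "", "region": "east", "experiments": "7"},  # incomplete row
]


def extract():
    """Extract: read the raw rows from the source."""
    return list(RAW_ROWS)


def transform(rows):
    """Clean and transform: drop incomplete rows, normalize text, cast counts."""
    clean = []
    for row in rows:
        name = row["scientist"].strip().title()
        if not name:
            continue  # skip rows without a scientist
        clean.append((name, row["region"].strip().title(), int(row["experiments"])))
    return clean


def load(rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiment_fact (scientist TEXT, region TEXT, experiments INTEGER)"
    )
    conn.executemany("INSERT INTO experiment_fact VALUES (?, ?, ?)", rows)
    return conn


warehouse = load(transform(extract()))
```

Keeping the three steps as separate functions mirrors the data flow on the slide and makes each step testable on its own.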

Multi-Tiered Architecture

…basically outlines how the DW components fit together

[Figure: multi-tiered data warehouse architecture. Labels as extracted: Source 1, Source 2, Source n; ETL; Staging Area; Warehouse manager; Detailed Info.; Summary Info.; MetaData; Archive Detailed Information; Query Manager; Report]

A Sample Data Cube

[Figure: three-dimensional data cube. Axis names as extracted: Date, Scientists, Experiments. Axis values as extracted: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum; sc1, sc2, sc3, sum; East, Center, West, sum. Highlighted cell: total annual experiments for East by scientist sc3]

Slice and Dice Queries

• Slice and Dice: select and project on one or more dimensions

[Figure: cube with dimensions experiments, scientists, and genes; slice at scientist = “Smith”]
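Slice and dice operations can be sketched on a small cube held in a Python dictionary. The cell values, the gene and experiment identifiers, and the function names are hypothetical; only the three dimension names and the scientist "Smith" come from the slide.

```python
# Hypothetical cube cells: (scientist, gene, experiment) -> measure
DIMENSIONS = ("scientist", "gene", "experiment")
CUBE = {
    ("Smith", "g1", "e1"): 3,
    ("Smith", "g2", "e1"): 1,
    ("Smith", "g2", "e2"): 4,
    ("Jones", "g1", "e1"): 2,
    ("Jones", "g1", "e2"): 5,
}


def slice_cube(cube, dimension, value):
    """Slice: select one value on a dimension, then project that dimension away."""
    i = DIMENSIONS.index(dimension)
    return {key[:i] + key[i + 1:]: v for key, v in cube.items() if key[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep the cells whose coordinates lie in the allowed value sets."""
    def keep(key):
        return all(key[DIMENSIONS.index(d)] in values for d, values in allowed.items())
    return {key: v for key, v in cube.items() if keep(key)}
```

For example, `slice_cube(CUBE, "scientist", "Smith")` returns a two-dimensional table of genes by experiments, while `dice_cube(CUBE, experiment={"e2"})` returns a smaller cube that still has all three dimensions.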