A. Benabdelkader, 2009
Managing Distributed Data
Bio-Wise Information Management
Ammar Benabdelkader
Amsterdam, 25-05-2009
http://www.science.uva.nl/~ammar
Managing Distributed Data
• Introduction to Relational Databases
• The relational model
• Distributed Databases
• Homogeneous / heterogeneous databases
• e-Science Database Integrator (e-DBI)
• e-DBI, a scenario case
• Data Warehouse
• Multi-dimensional data model
Introduction to Relational Databases
• Relational Model
• Primary/Foreign keys
• Integrity constraints
• Data Normalization
• Queries/Join
• Views
© Prentice Hall, 2002
Relational Model
• Definition: A relation is a named, two-dimensional table of data
  • A table is made up of rows (records) and columns (attributes or fields)
• Requirements:
  • Every relation has a unique name
  • Every attribute value is atomic (not multivalued, not composite)
  • Every row is unique (can’t have two rows with exactly the same values for all their fields)
  • Attributes (columns) in tables have unique names
  • The order of the columns is irrelevant
  • The order of the rows is irrelevant
Key Fields
• Keys are special fields that serve two main purposes:
  • Primary keys are unique identifiers of the relation. Examples include employee numbers, social security numbers, etc. This is how we can guarantee that all rows are unique
  • Foreign keys are identifiers that enable a dependent relation to refer to its parent relation
• Keys can be simple (a single field) or composite (more than one field)
Key Fields - Example
[Figure: example tables annotated with their keys]
• Primary key
• Foreign key (implements a 1:N relationship between customer and order)
• Composite primary key (implements an M:N relationship between order and product)
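The three kinds of keys in the example can be sketched with Python's standard sqlite3 module and an in-memory SQLite database. This is a minimal illustration, not the schema from the slide's figure: the table names (customer_t, order_t, order_line_t, product_t), the columns, and the sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer_t (
    customer_id INTEGER PRIMARY KEY,           -- simple primary key
    name        TEXT NOT NULL
);
CREATE TABLE product_t (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE order_t (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer_t(customer_id)   -- foreign key: 1:N customer to order
);
CREATE TABLE order_line_t (
    order_id    INTEGER NOT NULL REFERENCES order_t(order_id),
    product_id  INTEGER NOT NULL REFERENCES product_t(product_id),
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)         -- composite key: M:N order to product
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO customer_t VALUES (1, 'Customer A')")
conn.executemany("INSERT INTO product_t VALUES (?, ?)", [(10, "Desk"), (11, "Lamp")])
conn.executemany("INSERT INTO order_t VALUES (?, ?)", [(1006, 1), (1007, 1)])
conn.executemany("INSERT INTO order_line_t VALUES (?, ?, ?)",
                 [(1006, 10, 2), (1006, 11, 1), (1007, 10, 1)])


def insert_fails(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# The same (order, product) pair cannot appear twice in order_line_t.
duplicate_line_rejected = insert_fails("INSERT INTO order_line_t VALUES (1006, 10, 5)")
# One customer can own many orders (the 1:N side of the foreign key).
orders_per_customer = conn.execute(
    "SELECT COUNT(*) FROM order_t WHERE customer_id = 1").fetchone()[0]
```

The composite primary key on order_line_t is what lets one order contain many products and one product appear on many orders, while still keeping every row unique.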
Integrity Constraints
• Domain Constraints
  • Allowable values for an attribute
• Entity Integrity
  • No primary key attribute may be null. All primary key fields MUST have data
• Referential Integrity
  • Rule that states that any foreign key value MUST match a primary key value in the relation on the one side (the parent relation)
Integrity Constraints - Example
[Figure: example schema] Referential integrity constraints are drawn via arrows from the dependent table to the parent table
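A small sketch of how a DBMS enforces domain, entity, and referential integrity, using Python's sqlite3 module. The tables, the two-letter state rule, and the sample values are assumptions for illustration only. Note that SQLite needs an explicit NOT NULL on a non-integer primary key and an explicit PRAGMA to check foreign keys; other DBMSs may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to check foreign keys
conn.executescript("""
CREATE TABLE customer_t (
    customer_id TEXT NOT NULL PRIMARY KEY,       -- entity integrity: key must have data
    state       TEXT CHECK (length(state) = 2)   -- domain constraint (assumed rule)
);
CREATE TABLE order_t (
    order_id    INTEGER NOT NULL PRIMARY KEY,
    customer_id TEXT NOT NULL
                REFERENCES customer_t(customer_id)  -- dependent table points to parent
);
INSERT INTO customer_t VALUES ('C1', 'NY');
""")


def violates(sql):
    """Run one statement and report whether the DBMS rejected it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


domain_violation = violates("INSERT INTO customer_t VALUES ('C2', 'New York')")
entity_violation = violates("INSERT INTO customer_t VALUES (NULL, 'CA')")
referential_violation = violates("INSERT INTO order_t VALUES (1, 'C9')")  # no such parent
valid_order_accepted = not violates("INSERT INTO order_t VALUES (2, 'C1')")
```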
Data Normalization
• Primarily a tool to validate and improve a logical design so that it satisfies certain constraints that avoid unnecessary duplication of data
• The process of decomposing relations with anomalies to produce smaller, well-structured relations
Well-Structured Relations
• A relation that contains minimal data redundancy and allows users to insert, delete, and update rows without causing data inconsistencies
• Goal is to avoid anomalies
  • Insertion Anomaly – adding new rows forces user to create duplicate data
  • Deletion Anomaly – deleting rows may cause a loss of data that would be needed for other future rows
  • Modification Anomaly – changing data in a row forces changes to other rows because of duplication
Data Normalization
• First Normal Form
  • No multivalued attributes
  • Every attribute value is atomic
• Second Normal Form
  • 1NF plus: every non-key attribute must be defined by the entire key, not by only part of the key
  • No partial functional dependencies
• Third Normal Form
  • 2NF plus no transitive dependencies (one attribute functionally determines a second, which functionally determines a third)
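To make the third normal form concrete, the following sketch removes a transitive dependency (order_id determines customer_id, which determines customer_name) by decomposing one table into two. The tables and rows are hypothetical; the point is only that the modification anomaly disappears after decomposition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Unnormalized design: the customer name is repeated on every order row.
CREATE TABLE order_flat (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL,
    customer_name TEXT NOT NULL        -- depends on customer_id, not on order_id
);
INSERT INTO order_flat VALUES (1, 7, 'Smith'), (2, 7, 'Smith'), (3, 8, 'Jones');

-- 3NF decomposition: the transitively dependent attribute gets its own relation.
CREATE TABLE customer_t (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE order_t (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer_t(customer_id)
);
INSERT INTO customer_t SELECT DISTINCT customer_id, customer_name FROM order_flat;
INSERT INTO order_t    SELECT order_id, customer_id FROM order_flat;
""")

# Renaming a customer: one row per order in the flat design, one row in 3NF.
rows_changed_flat = conn.execute(
    "UPDATE order_flat SET customer_name = 'Smyth' WHERE customer_id = 7").rowcount
rows_changed_3nf = conn.execute(
    "UPDATE customer_t SET customer_name = 'Smyth' WHERE customer_id = 7").rowcount
customer_count = conn.execute("SELECT COUNT(*) FROM customer_t").fetchone()[0]
```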
SQL
• Structured Query Language
• The standard for relational database management systems (RDBMS)
• SQL-92 Standard -- Purpose:
  • Specify syntax/semantics for data definition and manipulation
  • Define data structures
  • Enable portability
  • Specify minimal (level 1) and complete (level 2) standards
  • Allow for later growth/enhancement to standard
SQL
• Data Definition Language (DDL)
  • Create table
  • Create view, etc.
• Data Manipulation Language (DML)
  • Insert statement: adds data to a table
  • Delete statement: removes rows from a table
  • Update statement: modifies data in existing rows
  • Select statement: queries single or multiple tables
Views
• Views provide users controlled access to tables
• Advantages of views:
  • Simplify query commands
  • Provide data security
  • Enhance programming productivity
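A view that both simplifies a query and hides a sensitive column might look like the sketch below. The employee_t table, the research_staff_v view, and the rows are assumptions for illustration. SQLite has no GRANT statement, so the access-control side is only implied here: in a multi-user DBMS, users would be given rights on the view rather than on the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_t (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    department  TEXT NOT NULL,
    salary      REAL NOT NULL            -- sensitive column
);
INSERT INTO employee_t VALUES
    (1, 'Ada', 'Research', 5200.0),
    (2, 'Bob', 'Research', 4100.0),
    (3, 'Cy',  'Finance',  4700.0);

-- The view exposes only non-sensitive columns and hides the filter condition.
CREATE VIEW research_staff_v AS
    SELECT employee_id, name FROM employee_t WHERE department = 'Research';
""")

# Users query the view like a table; the salary column is not reachable through it.
view_rows = conn.execute(
    "SELECT name FROM research_staff_v ORDER BY name").fetchall()
view_columns = [col[0] for col in
                conn.execute("SELECT * FROM research_staff_v").description]
```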
The SELECT Statement
• Used for queries on single or multiple tables
• Clauses of the SELECT statement:
  • SELECT: List the columns (and expressions) that should be returned from the query
  • FROM: Indicate the table(s) or view(s) from which data will be obtained
  • WHERE: Indicate the conditions under which a row will be included in the result
  • GROUP BY: Indicate categorization of results
  • HAVING: Indicate the conditions under which a category (group) will be included
  • ORDER BY: Sorts the result according to specified criteria
SQL Operators
• Boolean operators: AND, OR, NOT
• String comparison using wildcards (%): LIKE
• Scalar aggregate using COUNT
• Vector aggregate using GROUP BY
• Qualifying results by categories using HAVING
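The SELECT clauses and operators listed above can be tried on a toy table. The sketch below uses Python's sqlite3 module; the order_t table, its columns, and its five rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_t (order_id INTEGER PRIMARY KEY, customer TEXT, state TEXT, total REAL);
INSERT INTO order_t VALUES
    (1, 'Smith',  'NY', 120.0),
    (2, 'Smythe', 'NY',  80.0),
    (3, 'Jones',  'CA', 200.0),
    (4, 'Smith',  'CA',  50.0),
    (5, 'Brown',  'TX',  30.0);
""")

# Scalar aggregate: COUNT over the whole table yields a single value.
order_count = conn.execute("SELECT COUNT(*) FROM order_t").fetchone()[0]

# Boolean operators and LIKE with the % wildcard.
sm_customers = conn.execute("""
    SELECT DISTINCT customer FROM order_t
    WHERE customer LIKE 'Sm%' AND NOT state = 'TX'
    ORDER BY customer""").fetchall()

# Vector aggregate: one result row per group. WHERE filters rows before
# grouping; HAVING filters the groups afterwards.
busy_states = conn.execute("""
    SELECT state, COUNT(*) AS n_orders, SUM(total) AS revenue
    FROM order_t
    WHERE total > 40
    GROUP BY state
    HAVING COUNT(*) >= 2
    ORDER BY state""").fetchall()
```

The last query shows the clause order in action: the TX order is dropped by WHERE, the remaining rows are grouped by state, and only groups with at least two orders survive HAVING.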
Joins
• Join – a relational operation that causes two or more tables with a common domain to be combined into a single table or view
• The common columns in joined tables are usually the primary key of the dominant table and the foreign key of the dependent table in 1:M relationships
Joins - Example
[Figure: the four tables used in the join example: customers, orders, order lines, products]
Multiple Table Join Example
• Assemble all information necessary to create an invoice for order number 1006
• Four tables are involved in this join
• Each pair of tables requires an equality-check condition in the WHERE clause, matching primary keys against foreign keys

SELECT C.CUSTOMER_ID, C.NAME, C.ADDRESS, CITY, STATE, POSTAL_CODE,
       O.ORDER_ID, O.DATE, QUANTITY, P.NAME, PRICE, (QUANTITY * PRICE)
FROM CUSTOMER_T C, ORDER_T O, ORDER_LINE_T OL, PRODUCT_T P
WHERE C.CUSTOMER_ID = O.CUSTOMER_ID
  AND O.ORDER_ID = OL.ORDER_ID
  AND OL.PRODUCT_ID = P.PRODUCT_ID
  AND O.ORDER_ID = 1006;
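The invoice join can be run end to end on a small in-memory database, as in the sketch below. The schema is a reduced version of the four tables (address columns are omitted), and the column names order_date and line_total as well as all sample rows are assumptions. The query chains customer to order, order to order line, and order line to product, one equality condition per link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_t (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product_t  (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE order_t    (order_id INTEGER PRIMARY KEY, order_date TEXT,
                         customer_id INTEGER REFERENCES customer_t(customer_id));
CREATE TABLE order_line_t (order_id INTEGER REFERENCES order_t(order_id),
                           product_id INTEGER REFERENCES product_t(product_id),
                           quantity INTEGER,
                           PRIMARY KEY (order_id, product_id));
INSERT INTO customer_t VALUES (2, 'Customer A');
INSERT INTO product_t  VALUES (4, 'Desk', 300.0), (5, 'Chair', 75.0);
INSERT INTO order_t    VALUES (1006, '2009-05-25', 2), (1007, '2009-05-26', 2);
INSERT INTO order_line_t VALUES (1006, 4, 1), (1006, 5, 2), (1007, 5, 4);
""")

invoice_sql = """
SELECT c.customer_id, c.name, o.order_id, o.order_date,
       p.name, ol.quantity, p.price, ol.quantity * p.price AS line_total
FROM customer_t c, order_t o, order_line_t ol, product_t p
WHERE c.customer_id = o.customer_id      -- customer to order
  AND o.order_id    = ol.order_id        -- order to order line
  AND ol.product_id = p.product_id       -- order line to product
  AND o.order_id    = 1006
ORDER BY p.product_id
"""
invoice_lines = conn.execute(invoice_sql).fetchall()
invoice_total = sum(row[-1] for row in invoice_lines)
```

Leaving out any one of the three equality conditions would produce a Cartesian product between the unlinked tables, which is why four tables need at least three join conditions.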
Multiple Table Join Example
[Figure: query result, with columns annotated as coming from the CUSTOMER_T, ORDER_T, and PRODUCT_T tables]
Distributed Databases
A single logical database that is spread physically across computers in multiple locations that are connected by a data communications link
• Homogeneous - Same DBMS at each node
  • Autonomous - Independent DBMSs
  • Non-autonomous - Central, coordinating DBMS
  • Easy to manage, difficult to enforce
• Heterogeneous - Different DBMSs at different nodes
  • Autonomous - Independent DBMSs
  • Difficult to manage, preferred by independent organizations
Homogeneous, Non-Autonomous Database
• Data is distributed across all the nodes
• Same DBMS at each node
• All data is managed by the distributed DBMS (no exclusively local data)
• All access is through one global schema
• The global schema is the union of all the local schemas
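The idea of a global schema as the union of local schemas can be imitated on one machine. In the sketch below, a single sqlite3 connection stands in for the distributed DBMS and two attached in-memory databases stand in for nodes; this is only an analogy, since no network or real distribution is involved. The node names, tables, and rows are assumptions.

```python
import sqlite3

# One connection plays the role of the distributed DBMS; each attached
# database stands in for a node (both are in-memory here).
ddbms = sqlite3.connect(":memory:")
ddbms.execute("ATTACH DATABASE ':memory:' AS node_a")
ddbms.execute("ATTACH DATABASE ':memory:' AS node_b")
ddbms.executescript("""
CREATE TABLE node_a.customer_t (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE node_b.order_t    (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO node_a.customer_t VALUES (1, 'Smith');
INSERT INTO node_b.order_t    VALUES (1006, 1), (1007, 1);
""")


def local_schema(node):
    """Return the table names stored at one node (its local schema)."""
    rows = ddbms.execute(
        f"SELECT name FROM {node}.sqlite_master WHERE type = 'table'").fetchall()
    return {name for (name,) in rows}


# The global schema is the union of the local schemas.
global_schema = local_schema("node_a") | local_schema("node_b")

# All access goes through the single connection, regardless of where data lives.
orders_per_customer = ddbms.execute("""
    SELECT c.name, COUNT(*) FROM customer_t c
    JOIN order_t o ON o.customer_id = c.customer_id
    GROUP BY c.name""").fetchall()
```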
Homogeneous Database
[Figure: homogeneous distributed database with identical DBMSs at the nodes]
Source: adapted from Bell and Grimson, 1992.
Typical Heterogeneous Environment
• Data distributed across all the nodes
• Different DBMSs may be used at each node
• Local access is done using the local DBMS and schema
Data Integration For e-Science
• Data Management Architecture
• e-DBI Data Integration Layer
• e-DBI Data Integration Approach
• e-DBI Implementation Strategy
• e-DBI Demo
Data Management: High Level Architecture Design
[Figure: high-level architecture with numbered markers 1 to 4. Components shown:]
• Standard DBMS (Oracle, MySQL, etc.)
• File Servers (SRM, SRB, GridFTP, etc.)
• App. specific data sources
• Data Sources Manager
• e-DBI
• Virtual / Materialized user-specific catalogs
Data Integration Layer
[Figure: the e-Science Database Integrator (e-DBI) integration layer. Components shown:]
• Virtual/Materialized user-specific repositories
• DS Registry
• MD Collector
• Pre-defined / Dynamic
• Meta Data Integrator
• Data Sources Manager
• Data sources (DS) and their metadata (MD)
e-DBI – Approach
1. Define a virtual database (VDB), using any relational database
2. Select the needed information from the different data sources (tables):
  • Filter the data
  • Rename tables and attributes
  • Reformat the data (apply any conversion if required)
3. Transfer the data into the new VDB, by copying the information
4. Enhance the VDB
  • Set new constraints
  • Merge or fuse data
  • Apply additional reformatting, etc.
5. Update the VDB
  • Check availability and completeness at the sources at any time
  • Decide whether to perform an update or a data replacement
The current implementation of e-DBI supports the following data sources:
  • Oracle, Sybase, MySQL, XML, Excel spreadsheets, etc.
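The five steps of the approach can be walked through with a toy example. The sketch below is not e-DBI code and does not use its API; it only mimics the steps with Python's sqlite3 module. The source table EXPERIMENT, the target table experiment_result, the Fahrenheit-to-Celsius conversion, and the function load_vdb are all hypothetical, and the update in step 5 is done by full data replacement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                   # main: stands in for a data source
conn.execute("ATTACH DATABASE ':memory:' AS vdb")    # step 1: define the VDB
conn.executescript("""
CREATE TABLE main.EXPERIMENT (EXP_ID INTEGER, SCIENTIST TEXT, TEMP_F REAL, STATUS TEXT);
INSERT INTO main.EXPERIMENT VALUES
    (1, 'smith', 212.0, 'done'), (2, 'jones', 32.0, 'done'), (3, 'smith', 50.0, 'failed');
""")


def load_vdb():
    """Steps 2 to 4: select, filter, rename, reformat, copy, then enhance."""
    conn.executescript("""
    DROP TABLE IF EXISTS vdb.experiment_result;
    CREATE TABLE vdb.experiment_result AS
        SELECT EXP_ID                AS experiment_id,   -- rename attributes
               upper(SCIENTIST)      AS scientist,       -- reformat
               (TEMP_F - 32) * 5 / 9 AS temperature_c    -- unit conversion
        FROM main.EXPERIMENT
        WHERE STATUS = 'done';                           -- filter
    -- Step 4: enhance the VDB with a constraint the source did not have.
    CREATE UNIQUE INDEX vdb.ux_experiment ON experiment_result (experiment_id);
    """)


load_vdb()
rows_after_first_load = conn.execute(
    "SELECT COUNT(*) FROM vdb.experiment_result").fetchone()[0]

# Step 5: the source changes; here the VDB is refreshed by full replacement.
conn.execute("UPDATE main.EXPERIMENT SET STATUS = 'done' WHERE EXP_ID = 3")
conn.commit()
load_vdb()
vdb_rows = conn.execute(
    "SELECT experiment_id, scientist, temperature_c FROM vdb.experiment_result "
    "ORDER BY experiment_id").fetchall()
```

Full replacement is the simplest refresh strategy; an incremental update would instead compare source and VDB rows and apply only the differences, which is the decision the last bullet of step 5 refers to.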
e-DBI – DBMS Driver Registry
Description: The e-DBI DBMS Driver Registry allows the user to register supported DBMSs from the application. The information required to register a DBMS includes the DBMS driver and the URL format.
e-DBI – DS Registry
Description: The e-DBI Data Source Registry allows the user to register, from the application, the data sources that will be used during the integration process. The information to be registered includes: DS name, host, port, driver, user name, and user password.
e-DBI Interface
e-DBI – MD Collector
Description: The e-DBI Meta Data Collector allows the user to identify, from the application, the subset of metadata to be used for integration. In addition, the MD Collector allows a limited set of metadata conversions to be applied to the individual data sources, namely: renaming, conversion, aggregation, and type casting.
[Figure: Metadata Collector]
e-DBI – MD Integrator
Description: The e-DBI Meta Data Integrator allows the user to perform, from the application, MD integration across the different data sources, based on the set of metadata gathered through the MD Collector. The MD Integrator will allow a full integration of metadata from the different sources, including data merging and data aggregation.
[Figure: Metadata Integrator]
e-DBI – Implementation Strategy
The e-DBI implementation is based on the open source SQuirreL SQL project (http://squirrel-sql.sourceforge.net).
e-DBI targets the following additional challenges:
1. Provide an interface that is more suitable and convenient for the scientist
  • Hide unnecessary details
  • Re-organize the architecture of the interface
2. Enhance the data integration functionalities:
  • Provide a hybrid solution between federated and warehousing approaches
  • Facilitate schema update and data refreshment
Data Warehousing and OLAP Technology
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data Warehouse Process
• Data Cube example
• Slice and Dice Queries
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
• “A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.” -- Barry Devlin, IBM Consultant
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” -- W. H. Inmon
Data Warehouse
• Subject-Oriented
  • Organized around major subjects, such as gene, ontology, customer, product.
  • Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
• Integrated
  • Constructed by integrating multiple, heterogeneous data sources
  • Data cleaning and data integration techniques are applied
• Time-Variant
  • Data warehouse provides information from a historical perspective
  • Every key structure contains an element of time, explicitly or implicitly
• Non-Volatile
  • A physically separate store of data transformed from the operational environment.
  • Operational update of data does not occur in the data warehouse.
DW – Multidimensional model
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
• Data is modeled as fact table(s) and multiple dimensions
  • Fact table contains measures and keys to each of the related dimension tables (e.g. Call Details)
  • Dimension tables, such as Caller (name, address, Tel) or Time (day, week, month, quarter, year)
DW - Conceptual Modeling
• Modeling data warehouses: dimensions & measures
  • Star schema: A fact table in the middle connected to a set of dimension tables
  • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
A fact table in the middle connected to a set of dimension tables
• Fact table: Call Detail
  • Keys: Time, Caller key, Call Type, Location key, Date, Bonus key
  • Measures: Call Duration, Call Cost, Credit
• Dimension table: Bonus (Bonus key, Bonus Type, Date, Amount)
• Dimension table: Location (Location key, Street, City key)
• Dimension table: Customer (Customer key, Customer Name, Address, Telephone, Fax)
• Dimension table: Call Type (Type key, SIM Pack, Number Format, Supplier)
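A star schema is typically queried by joining the fact table to the dimensions it references and aggregating the measures. The sketch below builds a reduced version of the Call Detail example with Python's sqlite3 module; only a subset of the slide's attributes is used, and the snake_case names, the unit of call_duration, and all rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (a subset of the slide's attributes).
CREATE TABLE customer  (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE call_type (type_key INTEGER PRIMARY KEY, sim_pack TEXT);
CREATE TABLE location  (location_key INTEGER PRIMARY KEY, street TEXT);

-- Fact table: foreign keys to each dimension plus the numeric measures.
CREATE TABLE call_detail (
    caller_key    INTEGER REFERENCES customer(customer_key),
    call_type     INTEGER REFERENCES call_type(type_key),
    location_key  INTEGER REFERENCES location(location_key),
    call_duration INTEGER,   -- measure (seconds, an assumed unit)
    call_cost     REAL       -- measure
);
INSERT INTO customer  VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO call_type VALUES (1, 'Prepaid'), (2, 'Contract');
INSERT INTO location  VALUES (1, 'Main Street');
INSERT INTO call_detail VALUES
    (1, 1, 1, 60, 0.5), (1, 1, 1, 120, 1.0), (1, 2, 1, 30, 0.25), (2, 2, 1, 600, 4.0);
""")

# A star join: the fact table joined to its dimensions, with the measures
# aggregated per combination of dimension attributes.
cost_by_customer_and_type = conn.execute("""
    SELECT c.customer_name, t.sim_pack,
           SUM(f.call_duration) AS total_duration, SUM(f.call_cost) AS total_cost
    FROM call_detail f
    JOIN customer  c ON c.customer_key = f.caller_key
    JOIN call_type t ON t.type_key = f.call_type
    GROUP BY c.customer_name, t.sim_pack
    ORDER BY c.customer_name, t.sim_pack""").fetchall()
```

In a snowflake variant, a dimension such as Location would itself join to a further table (City), adding one more JOIN to the same query shape.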
Example of Snowflake Schema
Dimensional hierarchy is added to the dimension tables
• Fact table: Call Detail
  • Keys: Time, Caller key, Call Type, Location key, Date, Bonus key
  • Measures: Call Duration, Call Cost, Credit
• Dimension table: Bonus (Bonus key, Bonus Type, Date, Amount)
• Dimension table: Location (Location key, Street, City key)
  • City (City key, City, Province, Country)
• Dimension table: Customer (Customer key, Customer Name, Address, Telephone, Fax)
• Dimension table: Call Type (Type key, SIM Pack, Number Format, Supplier key)
  • Supplier (Supplier key, Supplier type)
Example of Fact Constellation
Multiple fact tables share dimension tables, viewed as a collection of stars
• Fact table: Sales (Customer key, Product key, Dealer key, Location, Unit Price, Total cost, To Location)
• Dimension table: Dealer (Dealer key, Dealer name, Dealer type, Location)
• Fact table: Call Detail
  • Keys: Time, Caller key, Call Type, Location key, Date, Bonus key
  • Measures: Call Duration, Call Cost, Credit
• Dimension table: Bonus (Bonus key, Bonus Type, Date, Amount)
• Dimension table: Location (Location key, Street, City key)
• Dimension table: Customer (Customer key, Customer Name, Address, Telephone, Fax)
• Dimension table: Call Type (Type key, SIM Pack, Number Format, Supplier key)
Example of Fact Constellation
[Figure: three generic star schemas that share dimension tables]
• Each Fact Table: S_Key_1, S_Key_2, S_Key_3, S_Key_4, Measure_1, Measure_2, Measure_3 (all of type Num)
• Each Dimension n: S_Key_n (Num), Attribute_1 to Attribute_5 (TXT)
• Dimensions drawn with the first fact table: 1, 2, 3, 4
• Dimensions drawn with the second fact table: 2, 3, 4
• Dimensions drawn with the third fact table: 1, 2, 4
Atlas data warehouse for integrative bioinformatics
Data Warehouse Process
• Data warehouses are designed around a business process rather than the business requirements.
• Business processes correspond to the data flows within a data warehouse. These processes are (ETL):
  • Extract and Load the data
  • Clean and Transform the data into a form that copes with large data volumes and provides good query performance
  • Back up and Archive data
  • Manage queries and direct them to the appropriate tables
[Figure: Data Sources → ETL → Data Warehouse]
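The extract, transform, and load steps can be sketched in a few lines of Python. Everything here is hypothetical: the two in-code "sources", their formats, the cleaning rules in clean, and the measurement_fact table are assumptions chosen to show the flow, not a description of any particular warehouse.

```python
import sqlite3

# Extract: rows as they might arrive from two sources with different conventions.
source_1 = [("smith ", "2009-05-25", "12.5"), ("JONES", "2009-05-25", "n/a")]
source_2 = [("Smith", "25/05/2009", "7.5")]


def clean(name, date, value):
    """Transform one raw row; return None if it cannot be repaired."""
    try:
        amount = float(value)
    except ValueError:
        return None                       # drop rows with unusable measures
    if "/" in date:                       # normalize DD/MM/YYYY to ISO format
        day, month, year = date.split("/")
        date = f"{year}-{month}-{day}"
    return (name.strip().title(), date, amount)


def run_etl(conn, *sources):
    """Load cleaned rows into the warehouse table; return how many were loaded."""
    conn.execute("CREATE TABLE IF NOT EXISTS measurement_fact "
                 "(scientist TEXT, day TEXT, amount REAL)")
    cleaned = []
    for source in sources:
        for raw in source:
            row = clean(*raw)
            if row is not None:
                cleaned.append(row)
    conn.executemany("INSERT INTO measurement_fact VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return len(cleaned)


warehouse = sqlite3.connect(":memory:")
loaded = run_etl(warehouse, source_1, source_2)
totals = warehouse.execute(
    "SELECT scientist, day, SUM(amount) FROM measurement_fact "
    "GROUP BY scientist, day").fetchall()
```

After cleaning, the two spellings of the same scientist and the two date formats collapse into one consistent value, so the warehouse query can aggregate across sources.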
Multi-Tiered Architecture
…basically outlines how the DW components fit together
[Figure: multi-tiered architecture. Components shown: Source 1, Source 2, …, Source n; ETL; Staging Area; Warehouse manager; Detailed Info.; Summary Info.; Meta Data; Archive Detailed Information; Query Manager; Reports]
A Sample Data Cube
[Figure: data cube. Axis labels: Date, Scientists, Experiments. Values shown: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum; sc1, sc2, sc3, sum; East, Center, West, sum. Highlighted cell: total annual experiments for East by scientist sc3]
Slice and Dice Queries
• Slice and Dice: select and project on one or more dimensions
[Figure: cube with dimensions experiments, scientists, and genes, sliced at scientist = “Smith”]
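Expressed in SQL, a slice is a selection on one dimension and a dice is a selection on several. The sketch below stores a tiny cube as one row per cell; the cube table, the measure n_results, and the values are invented for illustration, with the dimension names taken from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per cell of a small cube with dimensions scientist, gene, experiment.
CREATE TABLE cube (scientist TEXT, gene TEXT, experiment TEXT, n_results INTEGER);
INSERT INTO cube VALUES
    ('Smith', 'g1', 'e1', 4), ('Smith', 'g2', 'e1', 1), ('Smith', 'g1', 'e2', 2),
    ('Jones', 'g1', 'e1', 3), ('Jones', 'g2', 'e2', 5);
""")

# Slice: fix one dimension to a single value (select) and keep the
# remaining dimensions (project).
slice_smith = conn.execute("""
    SELECT gene, experiment, n_results FROM cube
    WHERE scientist = 'Smith'
    ORDER BY gene, experiment""").fetchall()

# Dice: restrict several dimensions at once, producing a sub-cube.
dice = conn.execute("""
    SELECT scientist, gene, experiment, n_results FROM cube
    WHERE scientist IN ('Smith', 'Jones') AND gene = 'g1' AND experiment = 'e1'
    ORDER BY scientist""").fetchall()
```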