lecture @dhbw: data warehouse part 05: data …a collaborative data management practice focused on...
TRANSCRIPT
![Page 1: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/1.jpg)
A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART 05: DATA SECURITY – DATA CULTUREANDREAS BUCKENHOFER, DAIMLER TSS
![Page 2: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/2.jpg)
ABOUT ME
Andreas BuckenhoferSenior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: Analytics
Contact/Connect
vcard
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect
• Over 20 years experience with
database technologies
• Over 20 years experience with Data
Warehousing
• International project experience
![Page 3: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/3.jpg)
After the end of this lecture you will be able to
• Understand importance of Data Security
• Data Classification
• GDPR
• Understand importance of Data Culture
WHAT YOU WILL LEARN TODAY
Data Warehouse / DHBWDaimler TSS 3
![Page 4: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/4.jpg)
DATA SECURITY
Data Warehouse / DHBWDaimler TSS 4
![Page 5: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/5.jpg)
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
Data Warehouse / DHBWDaimler TSS 5
Data Warehouse
FrontendBackend
External data sources
Internal data sources
Staging
Layer
(Input
Layer)
OLTP
OLTP
Core
Warehouse
Layer
(Storage
Layer)
Mart Layer
(Output
Layer)
(Reporting
Layer)
Integration
Layer
(Cleansing
Layer)
Aggregation
Layer
Metadata Management
Security
DWH Manager incl. Monitor
![Page 6: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/6.jpg)
• End users should only have access to the Data Marts in a DWH
• Grant privileges and roles to users
• Good practice:
• Grant privileges directly to a role; do not nest roles (grant role to role)
• Grant role to users
• Distinguish end users, administrators and technical users with different password
policies
• Much more difficult to grant access in a Data Lake
• Tool maturity
• Data often unknown with schema-on-read
DATA ACCESS
Data Warehouse / DHBWDaimler TSS 6
![Page 7: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/7.jpg)
CIA TRIAD – DATA CLASSIFICATION
Data Warehouse / DHBWDaimler TSS 7
Confidentiality
IntegrityAvailability
whether or not
information is kept
secret or private,
e.g. data theft
whether the
information is kept
accurate, e.g.
faking data
ensuring that
information is
available when it is
needed, e.g.
blackout
![Page 8: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/8.jpg)
Regulation in EU law on data protection and privacy for all individuals in the
EU
Requirements like
• Data protection by design and by default
• Right to erasure / Right to be forgotten
• Records of processing activities
GDPR – GENERAL DATA PROTECTION REGULATION
Data Warehouse / DHBWDaimler TSS 8
Not relevant for exam
![Page 9: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/9.jpg)
• "capture-it-all" approach causes serious questions of privacy
• always ask yourself why you are capturing or storing data
CHALLENGES DWH & BIG DATA
Data Warehouse / DHBWDaimler TSS 9
Source: https://martinfowler.com/bliki/Datensparsamkeit.html
Challenges like
• Combining of sensitive data
sources allowed?
• Export restrictions that
forbid to combine data from
Germany, US, China
![Page 10: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/10.jpg)
Anonymization
• Irreversibly destroys any way to identify an individual
• GDPR does not apply
Pseudonymization
• Additional information required to re-identify an individual
• GDPR does apply
ANONYMIZATION VS PSEUDONYMIZATION
Data Warehouse / DHBWDaimler TSS 10
Not relevant for exam
![Page 11: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/11.jpg)
TYPICAL SECURITY FUNCTIONALITIES AND MEASURES IN DATABASES
Data Warehouse / DHBWDaimler TSS 11
Auditing
Encryption
Masking (static, dynamic)
Row Level Security
Authentication & Password
policies
Patching
Secure installation
SSL – secure
communication
Authorization & roles,
profiles, ressource limits
Security checklists
![Page 12: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/12.jpg)
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
![Page 13: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/13.jpg)
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
![Page 14: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/14.jpg)
DATA CULTURE
Data Warehouse / DHBWDaimler TSS 14
![Page 15: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/15.jpg)
• “Taping into the potential of data involves data
sensemaking. I prefer the term over the more
popular term analytics because it better fits the full
range of activities that are needed” (page 44)
• “Data sensemaking requires a significant
investment in the development of human skills”(page 74)
• DWH/Big Data Architecture (instead of
technology/tools)
• Data integration, Data visualization, Data modeling, …
STEPHEN FEW: BIG DATA, BIG DUPE (ANALYTICS PRESS, 2018)
Data Warehouse / DHBWDaimler TSS 15
![Page 16: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/16.jpg)
DATA LANDSCAPE MANIFESTO@SCOUT24
Data Warehouse / DHBWDaimler TSS 16
Source: Data Festival, Munich 2019
Not relevant for exam
![Page 17: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/17.jpg)
A collaborative data management practice focused on improving the
communication, integration and automation of data flows between data
managers and consumers across an organization.
The goal of DataOps is to create predictable delivery and change
management of data, data models and related artifacts. DataOps uses
technology to automate data delivery with the appropriate levels of security,
quality and metadata to improve the use and value of data in a dynamic
environment.
DATAOPS
Data Warehouse / DHBWDaimler TSS 17
Source: https://medium.com/data-ops/the-best-dataops-articles-of-q3-2018-c39882be3d7b
Not relevant for exam
![Page 18: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/18.jpg)
NETFLIX’ THREE CREDOS FOR DATA CULTURE
Data Warehouse / DHBWDaimler TSS 18
Source: Phil Simons’ webinar on “The visual organization: data visualization, big data, and the quest for better decisions” from
Harvard Business Review (2014)
Data catalog: Data should be accessible, easy to discover, and easy to process for everyone
Data storytelling: Whether your dataset is large or small, being able to visualize it makes it easier to explain
Fail fast and iterate: The longer you take to find the data, the less valuable it becomes
Not relevant for exam
![Page 19: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/19.jpg)
Establish a culture for
• Sharing data across the organization by following privacy and ethics
• Collaborating around data products
• Using data for decisions instead of e.g. experience
• Speed: fail fast and iterate
Being successful as a data-driven company requires the active involvement of
all employees
DATA CULTURE
Data Warehouse / DHBWDaimler TSS 19
Not relevant for exam
![Page 20: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/20.jpg)
DIGITIZATION - NEW GIANTSBAT – BAIDU + ALIBABA + TENCENT
Data Warehouse / DHBWDaimler TSS 20
Sources:
https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/
https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl
Not relevant for exam
![Page 21: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/21.jpg)
We’re entering a new world in which
data may be more important than
software.
[Tim O’Reilly, Founder O’Reilly Media]
Data is a precious thing and will last longer than
the systems themselves.
[Tim Berners-Lee, Father of the Worldwide Web]
Information is the oil of the 21st
century
[Peter Sondergaard, Gartner]
Everything we do in the digital realm ... creates a data trail.
And if that trail exists, chances are someone is using it.
[Douglas Rushkoff, Author]
![Page 22: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/22.jpg)
Data creation is exploding
[Gavin Belson, HBOs Silicon Valley]
Data is the new gold
[Open Data Initiative, European Commission]
In a world deluged by irrelevant
information, clarity is power.
[Yuval Noah Harari, Author]
Big data is not about the data
[Gary King, Harvard University]
![Page 23: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/23.jpg)
DATA WAREHOUSE
Data Warehouse / DHBWDaimler TSS 23
Applications come, applications go.The data, however, lives forever.
It is not about building applications;it really is about the data
underneath these applications(Tom Kyte, Oracle)
![Page 24: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/24.jpg)
Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSDomicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
Data Warehouse / DHBWDaimler TSS 24
THANK YOU
![Page 25: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/25.jpg)
SQL STANDARD IS EVOLVING
Data Warehouse / DHBWDaimler TSS 25
Source: https://vimeo.com/289497563
![Page 26: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/26.jpg)
Write an SQL statement that computes the most recent data for each
customer.
Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 26
Customer_key Name Status Valid_from
1 Brown Single 01-MAY-2014
2 Bush Married 05-JAN-2015
1 Miller Married 15-DEC-2015
3 Stein 15-DEC-2015
3 Stein Single 18-DEC-2015
SIN1.sql
![Page 27: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/27.jpg)
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 27
![Page 28: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/28.jpg)
create table s_customer (
customer_key integer NOT NULL
, cust_name varchar2(100) NOT NULL
, status varchar2(10)
, valid_from date NOT NULL
, CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)
);
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown',
'Single', to_date('01.05.2014', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush',
'Married', to_date('05.01.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller',
'Married', to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',
NULL, to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein',
'Single', to_date('18.12.2015', 'DD.MM.YYYY'));
commit;
EXERCISE: COMPUTE MOST RECENT ROWS
Data Warehouse / DHBWDaimler TSS 28
![Page 29: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/29.jpg)
SELECT s.*
FROM S_CUSTOMER s
JOIN (SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key) b
ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 1: MAX-FUNCTION
Data Warehouse / DHBWDaimler TSS 29
S2IN.sql
![Page 30: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/30.jpg)
SELECT s.*
FROM S_CUSTOMER s
WHERE NOT EXISTS (SELECT 1
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from);
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 2: EXISTS
Data Warehouse / DHBWDaimler TSS 30
S2IN.sql
![Page 31: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/31.jpg)
SELECT s.*
FROM S_CUSTOMER s
WHERE s.valid_from = (SELECT MAX(i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key);
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 3: MAX IN CORRELATED SUB-SELECT
Data Warehouse / DHBWDaimler TSS 31
S2IN.sql
![Page 32: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/32.jpg)
SELECT *
FROM (SELECT coalesce ((SELECT min (i.valid_from)
FROM S_CUSTOMER i
WHERE s.customer_key = i.customer_key
AND s.valid_from < i.valid_from
), to_date ('31.12.9999',
'DD.MM.YYYY')) as end_ts,
s.*
FROM S_CUSTOMER s)
WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 4: COALESCE WITH SUB-SELECT
Data Warehouse / DHBWDaimler TSS 32
S2IN.sql
![Page 33: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/33.jpg)
WITH max_cust as (
SELECT i.customer_key,
max(i.valid_from) as max_valid_from
FROM S_CUSTOMER i
GROUP BY i.customer_key)
SELECT s.*
FROM S_CUSTOMER s
JOIN max_cust b ON s.customer_key = b.customer_key
AND s.valid_from = b.max_valid_from;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 5: MAX-FUNCTION AND WITH-CLAUSE
Data Warehouse / DHBWDaimler TSS 33
S2IN.sql
![Page 34: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/34.jpg)
partition data
compute functions over these partitions
Rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]
return result
EXERCISE: COMPUTE MOST RECENT ROWSSQL ANALYTIC / WINDOWING FUNCTIONS
Data Warehouse / DHBWDaimler TSS 34
![Page 35: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/35.jpg)
WITH lead_cust as (
SELECT lead (s.valid_from, 1) OVER (PARTITION BY
s.customer_key
ORDER BY s.valid_from ASC) as end_ts
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.end_ts IS NULL;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 6: ANALYTIC / WINDOWING FUNCTION LEAD
Data Warehouse / DHBWDaimler TSS 35
S3IN.sql
![Page 36: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/36.jpg)
WITH lead_cust as (
SELECT row_number() OVER(PARTITION BY s.customer_key
ORDER BY s.valid_from DESC) as rn
, s.*
FROM s_customer s)
SELECT *
FROM lead_cust b
WHERE b.rn = 1;
EXERCISE: COMPUTE MOST RECENT ROWSSOLUTION 7: ANALYTIC FUNCTION ROW_NUMBER
Data Warehouse / DHBWDaimler TSS 36
S3IN.sql
![Page 37: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/37.jpg)
• Check execution plans, execution time including service + response time,
resource usage for final decision
• Solutions with Analytic / Windowing do not need self-join and show better
statistics compared to the other shown solutions
• Analytic / Windowing functions are very powerful
• Remark: Usage of with-clause in SQL statements is preferable compared
to sub-selects as it improves readability, understandability, maintainability
MAX OR ANALYTIC / WINDOWING FUNCTIONS WHICH ALTERNATIVE WOULD YOU RECOMMEND?
Data Warehouse / DHBWDaimler TSS 37
![Page 38: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/38.jpg)
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
Data Warehouse / DHBWDaimler TSS 38
![Page 39: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/39.jpg)
TEMPORAL DATA STORAGE (BITEMPORAL DATA)
Data Warehouse / DHBWDaimler TSS 39
10.09. 20.09. 30.09. 10.10.
Time
Price: 15EUR Price: 16EUR
New Price of 16EUR is
entered into the DB
Valid
Time
(20.09.)
Transaction
Time
(10.09.)
![Page 40: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/40.jpg)
• Time period when a fact is true in the real world
• The end user determines start and end date/time (or just a date/time for events)
Business validity:
Valid time
• Time period when a fact stored in the database is known
• ETL process determines start and end date/time
Technical validity:Transaction time
• Combines both Valid and Transaction TimeBitemporal data
TEMPORAL DATA STORAGE (BITEMPORAL DATA)DEFINITION
Data Warehouse / DHBWDaimler TSS 40
![Page 41: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/41.jpg)
• SQL standard SQL:2011
• But different implementations by RDBMSes like Oracle, DB2, SQL Server
and others
• Different syntax!
• Different coverage of standard!
• Very useful for slowly changing dimensions type 2, but also for other
purposes
TEMPORAL DATA STORAGE (BITEMPORAL DATA)SQL STANDARD
Data Warehouse / DHBWDaimler TSS 41
![Page 42: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/42.jpg)
CREATE TABLE customer_address
( customerID INTEGER NOT NULL
, name VARCHAR(100)
, city VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 42
![Page 43: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/43.jpg)
INSERT INTO customer_address VALUES
(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');
UPDATE customer_address FOR PORTION OF BUSINESS_TIME
FROM '22.05.2013' TO '31.12.2013'
SET city = 'San Diego' WHERE customerID = 1;
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 43
customerID Name City Valid_start Valid_end
1 Miller Seattle 01.01.2013 22.05.2013
1 Miller San Diego 22.05.2013 31.12.2013
![Page 44: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/44.jpg)
SELECT *
FROM customer_address
FOR BUSINESS_TIME AS OF '17.05.2013';
DB2 VALID TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 44
![Page 45: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/45.jpg)
CREATE TABLE customer_info(
customerId INTEGER NOT NULL,
comment VARCHAR(1000) NOT NULL,
sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
PERIOD SYSTEM_TIME (sys_start, sys_end)
);
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 45
![Page 46: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/46.jpg)
Transaction on 15.10.2013:
INSERT INTO customer_info VALUES( 1, 'comment 1');
Transaction on 31.10.2013
UPDATE customer_address SET comment = 'comment 2'
WHERE customerID = 1;
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 46
CustomerId comment Sys_start Sys_end
1 Comment 2 31.10.2013 31.12.2999
![Page 47: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/47.jpg)
SELECT *
FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';
Data comes from a history table:
Valid Time and Transaction Time can be combined = Bitemporal table
DB2 TRANSACTION TIME EXAMPLE
Data Warehouse / DHBWDaimler TSS 47
CustomerId comment Sys_start Sys_end
1 Comment 1 15.10.2013 31.10.2013
![Page 48: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/48.jpg)
• Very important performance improvement technique
• Good for many reads with high selectivity, write penalty
• B-trees most common
INDEXING - WHY
Data Warehouse / DHBWDaimler TSS 48
root
branch branch
leaf leaf leaf
…
…
Table
![Page 49: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/49.jpg)
• DBs index Primary Keys by default
• Dimension table columns
that are regularly used in where clauses
are candidates
• Maybe foreign Key columns in Fact table
(see also later Star Transformation)
INDEXING A STAR SCHEMA – WHICH COLUMNS ARE CANDIDATES FOR AN INDEX?
Data Warehouse / DHBWDaimler TSS 49
![Page 50: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/50.jpg)
• Fact table has normally much more rows compared to dimension tables
• Common join techniques would need to join first dimension table with the
fact table
• Alternative technique: evaluate all dimensions
(cartesian join)
• Then join into fact table in last step
• Oracle uses Bitmap indexes on
foreign key columns in fact tables to achieve
Star Join; not supported by many DBs
STAR TRANSFORMATION
Data Warehouse / DHBWDaimler TSS 50
![Page 51: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/51.jpg)
PARTITIONING
Data Warehouse / DHBWDaimler TSS 51
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
3 C CC CCC
Col1 Col2
1 A
2 B
3 C
Col3 col4
AA AAA
BB BBB
CC CCC
Col1 Col2 Col3 col4
3 C CC CCC
Col1 Col2 Col3 col4
1 A AA AAA
2 B BB BBB
Vertical partitioning Horizontal partitioning
![Page 52: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/52.jpg)
• Very powerful feature in a DWH to reduce workload
• Split table into logical smaller tables
• Avoidance of full table scans
• How could a table be split?
• Introduction to (Oracle) partitioning:
https://asktom.oracle.com/partitioning-for-developers.htm
HORIZONTAL PARTITIONING
Data Warehouse / DHBWDaimler TSS 52
![Page 53: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/53.jpg)
• By range
• Most common
• Use date field like order data to partition table into months, days, etc
• By list
• Use field that has limited number of different values, e.g. split customer data by
country if end users most likely select customers from within a country
• By hash
• Use a filed that most likely splits the data in evenly distributed chunks
HORIZONTAL PARTITIONING – SPLITTING OPTIONS
Data Warehouse / DHBWDaimler TSS 53
![Page 54: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/54.jpg)
• Statements are normally executed on one CPU
• Parallelism allows the DB to distribute the execution to several CPUs
• Powerful combination with partitioning
• Parallelism is limited by the number of CPUs: if parallelism is too high, performance
will degrade
• Intra-query parallelism and inter-query parallelism
PARALLELISM
Data Warehouse / DHBWDaimler TSS 54
![Page 55: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/55.jpg)
• Data compression + Index compression
• Store more data in a block/page = read more data during I/O
• If CPU resources are available, often a very powerful feature to improve performance
• Additionally reduce storage
• Additionally reduce backup time
COMPRESSION
Data Warehouse / DHBWDaimler TSS 55
![Page 56: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/56.jpg)
• Relational columnar In-Memory DB
• Materialized Views / Query Tables
ALREADY COVERED IN A PREVIOUS LECTURE
Data Warehouse / DHBWDaimler TSS 56
![Page 57: LECTURE @DHBW: DATA WAREHOUSE PART 05: DATA …A collaborative data management practice focused on improving the communication, integration and automation of data flows between data](https://reader035.vdocument.in/reader035/viewer/2022081513/5f0d94097e708231d43b0d03/html5/thumbnails/57.jpg)
• Recapture ETL and DB specific topics
• Which topics do you remember, or do you find important?
• Write down 1-2 topics on stick-it cards.
EXERCISE - RECAPTURE ETL AND DB SPECIFICS
Data Warehouse / DHBWDaimler TSS 57