A Daimler AG company
Lecture @DHBW: Data Warehouse
06 Data Catalog, Data Security, Frontend
Andreas Buckenhofer
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com
Registered office and court of registration: Ulm / Commercial register no. (HRB): 3844 / Management: Martin Haselbach (Chairman), Steffen Bäuerle
© Daimler TSS I Template Revision
Andreas Buckenhofer, Senior DB Professional
Since 2009 at Daimler TSS
Department: Machine Learning Solutions
Business Unit: Analytics
• Oracle ACE Associate
• DOAG responsible for InMemory DB
• Lecturer at DHBW
• Certified Data Vault Practitioner 2.0
• Certified Oracle Professional
• Certified IBM Big Data Architect
• Over 20 years of experience with database technologies
• Over 20 years of experience with Data Warehousing
• International project experience
Daimler TSS Data Warehouse / DHBW 3
Change Log
Date Changes
14.11.2019 Initial version
What you will learn today
After this lecture you will be able to
• Explain the concepts behind modern metadata management
  • Data Catalog
• Have an overview of frontend requirements
  • Information Design
• Understand the importance of Data Security and Data Ethics
  • Data Classification
  • GDPR
  • Risks
• Understand the necessity of a Data Culture
Data Catalog
Standard Data Warehouse architecture
[Diagram: Data Warehouse with backend and frontend. Inputs: internal data sources (OLTP) and external data sources. Backend layers: Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, Mart Layer (Output Layer / Reporting Layer). Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor.]
Making it easier to discover datasets: available since 05-Sep-2018
Source: Google announcement https://www.blog.google/products/search/making-it-easier-discover-datasets/
Find the right data
• With data science and analytics on the rise and under way to being democratized, the importance of being able to find the right data to investigate hypotheses and derive insights is paramount
Source: https://www.zdnet.com/article/google-can-now-search-for-datasets-first-research-then-the-world
• Google Dataset Search helps to find external data
• Schema.org defines an open metadata format; the dataset itself need not be open or free
• Search engines can interpret the format
• Ranking of data
• Helps users discover where the data is so that they can access it directly from the source
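To make the Schema.org bullet more concrete, here is a minimal sketch of how a dataset description could be expressed as JSON-LD, the format that search engines read from web pages. `Dataset`, `name`, `description`, `url`, `license`, and `creator` are Schema.org terms; the helper function and all values (dataset name, URLs, organization) are invented placeholders, not taken from the lecture.

```python
import json

def build_dataset_jsonld(name, description, url, license_url, creator_name):
    """Return a Schema.org 'Dataset' description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "creator": {"@type": "Organization", "name": creator_name},
    }
    return json.dumps(record, indent=2)

# Hypothetical example values, for illustration only
jsonld = build_dataset_jsonld(
    name="Example vehicle production counts",
    description="Monthly production counts per plant (made-up example).",
    url="https://example.org/datasets/production-counts",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Example Org",
)
print(jsonld)
```

The record only describes the dataset; the data itself stays at the source and may still require separate access rights.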
What about internal data?
What is metadata?
Data about other data
Types of metadata
Business Metadata
• Business terms
• Domain model
• Glossary, etc
Legal Metadata
• Sensitive data
• GDPR
• Security classification
Profiling
• Density
• Cardinality
• Min, Max, Median
• Popularity / Ranking
Technical Metadata
• Schema
• Table
• Column
• etc
Tagging / Linkage
Critical parts
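The profiling metadata listed above can be derived automatically from the data. The following is a minimal sketch for a single column; it assumes that "density" means the share of non-missing values and "cardinality" the number of distinct values, which is one common reading but not defined on the slide. The function name and sample values are hypothetical.

```python
from statistics import median

def profile_column(values):
    """Compute simple profiling metadata for one column.

    None marks a missing value. Density is the share of non-missing
    values; cardinality is the number of distinct non-missing values.
    """
    present = [v for v in values if v is not None]
    profile = {
        "row_count": len(values),
        "density": len(present) / len(values) if values else 0.0,
        "cardinality": len(set(present)),
    }
    if present:
        profile.update(min=min(present), max=max(present), median=median(present))
    return profile

# Made-up sample column
mileage_km = [12000, 45000, None, 45000, 8000]
print(profile_column(mileage_km))
```

Popularity / ranking is not covered here because it depends on usage information rather than on the column values themselves.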
Does Metadata Management provide answers to such questions across the whole workflow?
Phases: Search for data, Work with data
Workflow steps: Find, Understand, Trust, Access, Write
• How to get access to the data?
• What tables are important?
• What table contains production dates?
• What is the difference between production_date and prod_dt?
• How is this column calculated?
• How to join the tables?
• Is FIN unique?
• Who knows about the data?
• Is the data reliable?
Cataloging source systems
Many formats = many connectors
• RDBMS (Oracle, Db2, SQL Server, Teradata, …)
• Hadoop (HDFS, Hive, …; on-premises, Cloud)
• NoSQL DBs
• Files (Excel, csv, …)
• Powerdesigner, Erwin, and other data modeling tools
Metadata differences: both tables contain the same information but have different metadata
Vehicle_id Engine_id Cab_id Axle_id
WDB123 1234 ABCD XY12
Vehicle_id Type Id
WDB123 Engine 1234
WDB123 Cab ABCD
WDB123 Axle XY12
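The two layouts above can be converted into each other, which shows where the difference lies: in the wide table the component type is part of the column name (metadata), while in the long table it is a value in the Type column (data). A minimal sketch of the wide-to-long direction, with a hypothetical helper name:

```python
def wide_to_long(row, key="Vehicle_id", suffix="_id"):
    """Unpivot one wide row into a list of (key, Type, Id) rows."""
    long_rows = []
    for column, value in row.items():
        if column == key:
            continue
        # The component type is taken from the column name, e.g. Engine_id -> Engine
        component = column[: -len(suffix)] if column.endswith(suffix) else column
        long_rows.append({key: row[key], "Type": component, "Id": value})
    return long_rows

wide = {"Vehicle_id": "WDB123", "Engine_id": "1234", "Cab_id": "ABCD", "Axle_id": "XY12"}
for r in wide_to_long(wide):
    print(r)
```

A catalog that only imports column names would see Engine, Cab, and Axle in the first table but not in the second, unless it also profiles the values.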
Metadata import used to be simple with RDBMS
Where is the data and where is the metadata in this logfile?
Data Lake: decentralized control of the data
Data Lake / Hadoop
• Easy approach: access the Hive Metastore and import metadata
  • Prerequisite: all data/files in HDFS require Hive access
  • But this prerequisite is unrealistic
  • Many logs are just dumped into the file system
• Interpreting ALL files with catalog software is unrealistic, too
  • Huge computing power required
  • Huge number of variations (cloud, on-premises, software versions), for which vendors of catalog software lack support
• Sources should deliver metadata
Cataloging @google
Source: https://ai.google/research/pubs/pub45390
Heavy usage of automation and machine learning
Cataloging at Netflix, Twitter, LinkedIn, etc.
Company: Link
• Netflix (Metacat): https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520 and https://github.com/Netflix/metacat
• Twitter: https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
• LinkedIn (WhereHows): https://github.com/linkedin/WhereHows and https://github.com/linkedin/WhereHows/wiki
• Google (Goods): https://ai.google/research/pubs/pub45390 and https://www.buckenhofer.com/2016/10/goods-how-to-post-hoc-organize-the-data-lake/
• Uber: https://eng.uber.com/databook/
• eBay: https://www.ebayinc.com/stories/blogs/tech/bigdata-governance-hive-metastore-listener-for-apache-atlas-use-cases/
• Lyft: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
Cataloging @uber
Source: https://eng.uber.com/databook/
Cataloging @twitter
Source: https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
Cataloging @linkedin (open source)
Source: https://github.com/LinkedIn/Wherehows/wiki
Catalogs are everywhere … Google, Amazon
USER EXPERIENCE / INVENTORY
Inventory vs user experience
Suppliers provide inventory
• A Catalog should list everything that is actually available
Consumers require user experience
• A Catalog should provide data usage statistics, ratings, data samples, statistical profiles, lineage, lists of users and stewards, and tips on how the data should be interpreted
Automation, crowd knowledge, and experts
• Limitation of permissions to a trusted group
  • A trusted group documents few datasets very well
  • But most of the metadata is not documented
  • Failure of many past approaches
• Automation, crowd knowledge and experts required
  • Automation to get broad coverage and use existing information like query logs
  • Crowd to increase broad coverage
  • Experts to confirm or reject "guesses"
-> Combination of coverage and accuracy
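As one example of the automation mentioned above, a catalog could estimate table popularity from a query log. The sketch below counts table references with a deliberately naive regular expression; a real implementation would need a proper SQL parser. The log entries, table names, and function name are invented for illustration.

```python
import re
from collections import Counter

# Naive pattern: the identifier that follows FROM or JOIN (assumption: no quoting, no CTEs)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def table_popularity(query_log):
    """Count how often each table is referenced in a list of SQL statements."""
    counts = Counter()
    for sql in query_log:
        for table in TABLE_REF.findall(sql):
            counts[table.lower()] += 1
    return counts

# Made-up query log
query_log = [
    "SELECT * FROM sales.orders o JOIN sales.customers c ON o.cust_id = c.id",
    "select count(*) from sales.orders",
    "SELECT name FROM hr.employees",
]
print(table_popularity(query_log).most_common())
```

Such counts are only "guesses" about importance; in the approach described above, experts would confirm or reject them.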
Types of metadata
Business Metadata
• List of business terms
• Glossary, etc
Legal Metadata
• Sensitive data
• GDPR
• Security classification
Profiling
• Density
• Cardinality
• Min, Max, Median
• Popularity / Ranking
Technical Metadata
• Schema
• Table
• Column
• etc
Tagging / Linkage
Critical parts
Automation, crowd knowledge and experts required
Data Catalog – Amazon for information
Data Catalog
Technical Metadata
Business Metadata
Collective Intelligence
Expert Sourcing
Data Access
Governance
Machine Learning
Automation
Inventory
User experience & enrichment
Catalog search
Is the data Catalog a “metadata management reloaded”?
Name it as you like, but there are some critical developments
• Automation, Collective intelligence and expert knowledge
• Enable crowd sourcing and get help from other users
• Help to understand quality of data and usage of datasets
• Rating of information
• Web application for search / collaboration and API to access metadata
• Governance and legal framework for e.g. GDPR scenarios
• Capture metadata for security and end-user data consumption
• Identify the owner of the dataset and get access to source data
Frontend
Visualization in everyday life
Russian Campaign of Napoleon
Source: https://de.wikipedia.org/wiki/Charles_Joseph_Minard
Mapping the 1854 London Cholera Outbreak
Source: https://www1.udel.edu/johnmack/frec682/cholera/
Exercise: visualize as much as possible
Revenue in €
             2014     2015     2016
Canada     16,000   14,000   17,000
England     8,000    9,000    8,000
France      7,000    4,000    5,000
USA        60,000   85,000   90,000
Germany     4,000   10,000   15,000
Australia  10,000    8,000   15,000
Revenue   105,000  130,000  150,000
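Before choosing a chart it can help to derive a few figures from the table, such as each country's share of the total and its growth over the period. The following is a minimal sketch of that preparation step using the numbers from the table (country names given in English); the function names are hypothetical and this is only one of many possible ways to approach the exercise.

```python
revenue = {
    "Canada":    {2014: 16_000, 2015: 14_000, 2016: 17_000},
    "England":   {2014: 8_000,  2015: 9_000,  2016: 8_000},
    "France":    {2014: 7_000,  2015: 4_000,  2016: 5_000},
    "USA":       {2014: 60_000, 2015: 85_000, 2016: 90_000},
    "Germany":   {2014: 4_000,  2015: 10_000, 2016: 15_000},
    "Australia": {2014: 10_000, 2015: 8_000,  2016: 15_000},
}
years = [2014, 2015, 2016]

# Column totals; they should match the last row of the table
totals = {y: sum(per_year[y] for per_year in revenue.values()) for y in years}

def share_of_total(country, year):
    """Share of a country's revenue in the yearly total, in percent."""
    return 100 * revenue[country][year] / totals[year]

def growth(country, start=2014, end=2016):
    """Relative revenue change between two years, in percent."""
    first, last = revenue[country][start], revenue[country][end]
    return 100 * (last - first) / first

# Sort by 2016 revenue so the largest contributors come first
for country in sorted(revenue, key=lambda c: revenue[c][2016], reverse=True):
    print(f"{country:10s} share 2016: {share_of_total(country, 2016):5.1f} %  "
          f"growth 2014-2016: {growth(country):+6.1f} %")
```

Sorting by value and separating absolute size (share) from development (growth) already suggests two different views, for example a sorted bar chart and a line chart per country.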
Possible Solution 1
Possible Solution 2
Interface to the end user
• Reporting (Standard, ad-hoc)
• OLAP interactive analysis
• Dashboards, Scorecards
• Advanced Analytics / Data Mining / Text Mining
• Search & Discovery
Reporting (Standard, ad-hoc)
• Standard Reports
  • Prepared static reports that can be executed on request by end users
  • Are executed at the end of an ETL process and, e.g., sent by email to end users
  • Normally based on fact tables and their dimensions
  • Reports are often lists similar to Excel sheets but can also contain graphics (e.g. line charts)
• Ad-hoc Reports
  • End users create their own reports ("self-service")
OLAP interactive analysis
ROLAP / MOLAP Client Frontend
• Prepared cubes (multidimensional or relational fact tables)
• User can perform interactive analysis of data
• Rollup / drill-down
• Pivot
• Slicing
• Dicing
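The interactive operations listed above can be illustrated on a tiny in-memory fact table. This is a minimal sketch in plain Python, not how an OLAP engine is implemented; the fact rows, dimension names, and function names are made up for illustration.

```python
from collections import defaultdict

# Tiny made-up fact table: (year, region, product, revenue)
facts = [
    (2015, "Europe", "Car", 100),
    (2015, "Europe", "Truck", 40),
    (2015, "America", "Car", 80),
    (2016, "Europe", "Car", 120),
    (2016, "America", "Truck", 60),
]
DIMS = {"year": 0, "region": 1, "product": 2}

def rollup(rows, *dims):
    """Aggregate revenue by the given dimensions (fewer dimensions = higher level)."""
    result = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        result[key] += row[3]
    return dict(result)

def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]

def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in rows if all(r[DIMS[d]] in vals for d, vals in allowed.items())]

def pivot(rows, row_dim, col_dim):
    """Pivot: arrange aggregated revenue as row_dim x col_dim."""
    table = defaultdict(dict)
    for (r, c), total in rollup(rows, row_dim, col_dim).items():
        table[r][c] = total
    return dict(table)

print(rollup(facts, "year"))             # roll-up to the year level
print(rollup(facts, "year", "region"))   # drill-down to year and region
print(slice_(facts, "year", 2015))       # slice: only 2015
print(dice(facts, region={"Europe"}, product={"Car", "Truck"}))
print(pivot(facts, "region", "year"))    # regions as rows, years as columns
```

Drill-down is shown here simply as a roll-up with one more dimension; the two operations move in opposite directions along the same hierarchy.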
Dashboards, Scorecards
• "Progress reports"
• Provide an overall view of KPIs (Key Performance Indicators)
• Combination of several elements from Reporting and/or OLAP (e.g. line charts) into an overall view (like a "cockpit")
• Dashboard is more focused on operational goals
  • High-level overview of what is happening
• Scorecard is more focused on strategic goals
  • Plan a strategy and identify why something happens
Advanced Analytics / Data Mining / Text Mining: The future of software development?
Software is eating the world
Machine learning will eat software
• For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/Go
• On the other hand, data collection isn’t always easy, e.g. billing SW
Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development
Machine Learning will change SW development
• Google’s Jeff Dean has reported that 500 lines of TensorFlow code have replaced 500,000 lines of code in Google Translate
• We shouldn’t understate the difficulty of training a neural network of any complexity, but neither should we underestimate the problem of managing and debugging a gigantic codebase
• The developer has to become a teacher, a curator of training data, and an analyst of results
Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development
Source: https://twitter.com/DynamicWebPaige/status/915326707107844097
Source: https://twitter.com/markmadsen/status/1194622452430712833?s=20
Search & Discovery
• Not just numerical data
• Analysis of new data types gets more and more important
• Text
• GPS coordinates
• Pictures
• Videos
• Data can be available in RDBMS (e.g. text modules/indexes available), Hadoop or SQL DBs
Many graphical elements to use in reports
Source: https://github.com/d3/d3/wiki/Gallery
Many graphical elements … chamber of horror
Source: Hichert / Faisst, http://www.backup-page.hichert.com/
Many graphical elements … chamber of horror
Some remarks about previous slides
• 3D elements introduce clutter and add no information
• Pie charts rarely make sense
• Line chart barely readable
• Labels are placed outside of the graphic
• Tachometer costs a lot of space and show
• Too much color in general
• Color without meaning, e.g. red should be used for alarms / errors
Do you usually use 3D?
36 + 89 + 57 + 61 = 100
Source: BIKE magazine 8/2019
Did you know? Pizza is a real-time chart of how much pizza is left
Story telling with appropriate visualization
Famous example by Hans Rosling (watch 3:08 onwards)
https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=de
InfoGraphics - Chart Guide
Source: https://chart.guide/poster
Top infographics - The Forces Shaping the Future of the Global Economy
Source: https://www.visualcapitalist.com/our-top-infographics-of-2018/
Top infographics - The Forces Shaping the Future of the Global Economy: details
Source: https://www.visualcapitalist.com/the-8-major-forces-shaping-the-future-of-the-global-economy/
Information Design
• Information design is the practice of presenting information in a way that fosters efficient and effective understanding of it. (Source: Wikipedia, https://en.wikipedia.org/wiki/Information_design)
• Some authors are well known for their criticism of many graphical representations - they provide rules for good information design
• Edward Tufte
• Stephen Few
• Rolf Hichert
Which productgroup has the highest win in june?
Daimler TSS Data Warehouse / DHBW 56
![Page 57: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/57.jpg)
Which productgroup has the highest win in june?Eye tracking
Daimler TSS Data Warehouse / DHBW 57
![Page 58: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/58.jpg)
Which product group has the highest profit in June? Improved version
![Page 59: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/59.jpg)
Which product group has the highest profit in June? Eye tracking (improved version)
![Page 60: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/60.jpg)
Information Design: Reduce to the essentials
Define standards, e.g.
• Always use the same colors, and use them with care, e.g.
  • red = negative
  • green = positive
• Pie charts are rarely useful and should be avoided
  • prefer a bar chart or line chart
• No 3D elements, as they don’t enhance information but introduce clutter
• Standardize abbreviations, e.g. PY = previous year
![Page 61: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/61.jpg)
Eye-tracking - before and after
![Page 62: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/62.jpg)
Table with integrated bar charts
Source: Hichert, http://www.hichert.com/de/resource/table-template-02/
![Page 63: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/63.jpg)
BI end user roles
Consumers / BI Users
• Use reports, OLAP and dashboards to obtain information
Power Users
• Use reports, OLAP and dashboards to obtain information
• Create new reports and dashboards
Data Scientists
• Statistical / mathematical geeks
• Analyze / explore data
• Need to analyze raw (non-cleansed, non-transformed) data
![Page 64: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/64.jpg)
Summary
• Infographics
  • Comprehensive graphics
  • Storytelling
• Information Design
  • Use information efficiently and effectively
![Page 65: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/65.jpg)
Data Security
![Page 66: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/66.jpg)
Standard Data Warehouse architecture
[Figure: layered Data Warehouse architecture, split into Backend and Frontend]
• Sources: internal data sources (OLTP systems) and external data sources
• Layers: Staging Layer (Input Layer) → Integration Layer (Cleansing Layer) → Core Warehouse Layer (Storage Layer) → Aggregation Layer → Mart Layer (Output Layer, Reporting Layer)
• Cross-cutting components: Metadata Management, Security, DWH Manager incl. Monitor
![Page 67: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/67.jpg)
GDPR – General Data Protection Regulation
Regulation in EU law on data protection and privacy for all individuals in the EU
Requirements include
• Data protection by design and by default
• Right to erasure / Right to be forgotten
• Right to data portability
• Records of processing activities
![Page 68: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/68.jpg)
Challenges DWH & Big Data
• A "capture-it-all" approach raises serious privacy questions
• Always ask yourself why you are capturing or storing data
Source: https://martinfowler.com/bliki/Datensparsamkeit.html
Challenges like
• Is combining sensitive data sources allowed?
• Export restrictions that forbid combining data from Germany, the US, and China
![Page 69: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/69.jpg)
Data processing principles (GDPR Art. 5(1))
| Principle | Description |
| --- | --- |
| Lawfulness, fairness and transparency (Rechtmäßigkeit, Verarbeitung nach Treu und Glauben, Transparenz) | Processing must be lawful and comprehensible for individuals |
| Purpose limitation (Zweckbindung) | Use data for the defined purpose only |
| Data minimisation (Datenminimierung) | Data must be adequate, relevant and limited to what is necessary for the purpose |
| Accuracy (Richtigkeit) | Personal data must be correct |
| Storage limitation (Speicherbegrenzung) | Keep data no longer than necessary |
| Integrity and confidentiality (Integrität und Vertraulichkeit) | Protection against unauthorised or unlawful processing and against accidental loss, destruction or damage |
![Page 70: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/70.jpg)
Pseudonymization vs Anonymization
Personal data, e.g. name, VIN, IP address, home town, telematic data
• Pseudonymization (e.g. hashing): pseudonymized data are data that still allow re-identification with additional information
• Anonymization: anonymized data are data where the data subject is no longer identifiable
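The difference between the two terms can be illustrated with a minimal Python sketch. It is not taken from the lecture: the slide only says "hashing", and the use of a keyed hash (HMAC-SHA-256), the key, the VIN value and all function names are assumptions made for illustration. The point is that whoever holds the additional information (here: the key plus a list of candidate VINs) can re-identify the record, which is why the result is pseudonymized and not anonymized.

```python
import hashlib
import hmac
from typing import List, Optional

# Hypothetical secret key; it would have to be stored separately from the data.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 hash (HMAC) of value as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def reidentify(pseudonym: str, candidates: List[str],
               key: bytes = SECRET_KEY) -> Optional[str]:
    """Re-identify a pseudonym by recomputing the hash for known candidates."""
    for candidate in candidates:
        if hmac.compare_digest(pseudonymize(candidate, key), pseudonym):
            return candidate
    return None


# Hypothetical record: the VIN is a direct identifier, the mileage is payload.
record = {"vin": "WDB1234567890", "mileage_km": 42000}
pseudonymized = {
    "vin_pseudonym": pseudonymize(record["vin"]),
    "mileage_km": record["mileage_km"],
}
```

The same input always yields the same pseudonym, so records can still be joined across tables; that is useful for analytics and is exactly what keeps re-identification possible.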
![Page 71: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/71.jpg)
Anonymization by deleting data
| Name | Email | Gender | Birthdate | ZIP | Salary |
| --- | --- | --- | --- | --- | --- |
| Peter Miller | [email protected] | M | 15.05.2001 | 89075 | 80.000 |
| Martin Bush | [email protected] | F | 22.07.1967 | 70079 | 85.000 |
| Susan Dill | [email protected] | F | 03.11.1978 | 60067 | 90.000 |
![Page 72: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/72.jpg)
Anonymization by deleting data
| Name | Email | Gender | Birthdate | ZIP | Salary |
| --- | --- | --- | --- | --- | --- |
| | | M | 15.05.2001 | 89075 | 80.000 |
| | | F | 22.07.1967 | 70079 | 85.000 |
| | | F | 03.11.1978 | 60067 | 90.000 |
87% of people in the US can likely be uniquely identified by gender, birthdate and ZIP code (Latanya Sweeney)
https://dataprivacylab.org/projects/identifiability/paper1.pdf
Risk test: https://cpg.doc.ic.ac.uk/individual-risk/
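Why deleting name and email is not enough can be checked with a few lines of Python. The sketch below is an illustration, not part of the lecture: it counts how many rows share the same combination of gender, birthdate and ZIP code. The size of the smallest such group is the k of "k-anonymity", a term the slides do not use. Salaries are written as plain integers, assuming the slide uses the German thousands separator; the function name is hypothetical.

```python
from collections import Counter

# Rows from the slide after deleting name and email:
# (gender, birthdate, zip, salary)
rows = [
    ("M", "15.05.2001", "89075", 80000),
    ("F", "22.07.1967", "70079", 85000),
    ("F", "03.11.1978", "60067", 90000),
]


def smallest_group_size(table, key_columns=(0, 1, 2)):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[i] for i in key_columns) for row in table)
    return min(groups.values())


k = smallest_group_size(rows)  # 1: every remaining row is still unique
```

A result of 1 means that anyone who knows a person's gender, birthdate and ZIP code can pick out exactly one row, and with it the salary.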
![Page 73: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/73.jpg)
Facebook Data Leak
• Personality app
• Data from 3 million users
• Data were anonymized according to Facebook
• Gender, birthdate, and zip ↯ ↯ ↯
• Results from personality tests
• (Additionally, username and password were available to access the data)
Source: https://www.newscientist.com/article/2168713-huge-new-facebook-data-leak-exposed-intimate-details-of-3m-users/
![Page 74: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/74.jpg)
Fine for false anonymization
Source: https://edpb.europa.eu/news/national-news/2019/danish-data-protection-agency-proposes-dkk-12-million-fine-danish-taxi_en
![Page 75: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/75.jpg)
CIA triad – data classification
• Confidentiality: whether or not information is kept secret or private, e.g. data theft
• Integrity: whether the information is kept accurate, e.g. faking data
• Availability: ensuring that information is available when it is needed, e.g. blackout
![Page 76: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/76.jpg)
Typical security functionalities and measures in databases
• Auditing
• Encryption
• Masking (static, dynamic)
• Row Level Security
• Authentication & password policies
• Patching
• Secure installation
• SSL/TLS – secure communication
• Authorization & roles, profiles, resource limits
• Security checklists
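As an illustration of the "Masking (static, dynamic)" item, here is a minimal Python sketch under stated assumptions. It does not show any database product's masking feature; the masking rule, the role name `privileged` and all function names are hypothetical. Static masking writes a masked copy of the data (e.g. for a test system), while dynamic masking leaves the stored data unchanged and masks at read time depending on who is asking.

```python
def mask_email(email: str) -> str:
    """Keep the first character and the domain; replace the rest with '*'."""
    local, _, domain = email.partition("@")
    if not domain:
        return "*" * len(email)
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


# Hypothetical source rows.
customers = [
    {"name": "Peter Miller", "email": "peter.miller@example.com", "salary": 80000},
]


def static_mask(rows):
    """Static masking: produce a permanently masked copy of the data."""
    return [{**row, "email": mask_email(row["email"])} for row in rows]


def read_rows(rows, role):
    """Dynamic masking: stored rows stay unchanged, masking depends on the role."""
    if role == "privileged":
        return [dict(row) for row in rows]
    return static_mask(rows)
```

Calling `read_rows(customers, "analyst")` returns masked e-mail addresses, while `read_rows(customers, "privileged")` returns the original values.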
![Page 77: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/77.jpg)
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
![Page 78: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/78.jpg)
Source: https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/
![Page 79: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/79.jpg)
Data Culture
![Page 80: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/80.jpg)
Data quality and Meta Data management
Domain knowledge
Data culture
Digitization hot spots: BiMa study 2018 (BARC + Sopra Steria Consulting)
![Page 81: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/81.jpg)
Digitization - New giants „BAT“: Baidu (Google) + Alibaba (Amazon) + Tencent (Facebook)
Sources:
https://venitism.wordpress.com/2017/12/15/beware-of-the-bats-baidu-alibaba-and-tencent/
https://www.afr.com/brand/business-summit/baidu-alibaba-tencent-to-disrupt-facebook-amazon-netflix-google-in-asia-20180228-h0wrdl
![Page 82: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/82.jpg)
Stephen Few: Big Data, Big Dupe (Analytics Press, 2018)
• “Tapping into the potential of data involves data sensemaking. I prefer the term over the more popular term analytics because it better fits the full range of activities that are needed” (page 44)
• “Data sensemaking requires a significant investment in the development of human skills” (page 74)
• DWH/Big Data Architecture (instead of technology/tools)
• Data integration, Data visualization, Data modeling, …
![Page 83: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/83.jpg)
Data landscape manifesto @scout24
Source: Data Festival, Munich 2019
![Page 84: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/84.jpg)
Netflix’s three credos for data culture
Source: Phil Simon’s webinar on “The visual organization: data visualization, big data, and the quest for better decisions” from Harvard Business Review (2014)
• Data catalog: Data should be accessible, easy to discover, and easy to process for everyone
• Data storytelling: Whether your dataset is large or small, being able to visualize it makes it easier to explain
• Fail fast and iterate: The longer you take to find the data, the less valuable it becomes
![Page 85: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/85.jpg)
From DevOps to DataOps
A collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and consumers across an organization.
The goal of DataOps is to create predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate data delivery with the appropriate levels of security, quality and metadata to improve the use and value of data in a dynamic environment.
Source: https://medium.com/data-ops/the-best-dataops-articles-of-q3-2018-c39882be3d7b
![Page 86: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/86.jpg)
Data Culture
Establish a culture for
• Sharing data across the organization while respecting privacy and ethics
• Collaborating around data products and data platforms
• Data-driven decisions instead of decisions based on, e.g., experience or the highest-paid person in the room
• Speed: fail fast and iterate
Being successful as a data-driven company requires the active involvement of all employees
![Page 87: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/87.jpg)
We’re entering a new world in which data may be more important than software.
[Tim O’Reilly, Founder O’Reilly Media]
Data is a precious thing and will last longer than the systems themselves.
[Tim Berners-Lee, Father of the World Wide Web]
Information is the oil of the 21st century
[Peter Sondergaard, Gartner]
Everything we do in the digital realm ... creates a data trail. And if that trail exists, chances are someone is using it.
[Douglas Rushkoff, Author]
![Page 88: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/88.jpg)
Data creation is exploding
[Gavin Belson, HBO’s Silicon Valley]
Data is the new gold
[Open Data Initiative, European Commission]
In a world deluged by irrelevant information, clarity is power.
[Yuval Noah Harari, Author]
Big data is not about the data
[Gary King, Harvard University]
![Page 89: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/89.jpg)
Data Warehouse
Applications come, applications go.
The data, however, lives forever.
It is not about building applications;
it really is about the data underneath these applications
(Tom Kyte, Oracle)
![Page 90: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/90.jpg)
![Page 91: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/91.jpg)
Visual Vocabulary
https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary
http://www.vizwiz.com/2018/07/visual-vocabulary.html
![Page 92: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/92.jpg)
Learn how to replace code
• Machine learning will no doubt change software development in significant ways
• Software developers will put much more effort into data collection and preparation
• Developers will have to do more than just collect data; they’ll have to build data pipelines and the infrastructure to manage those pipelines. We’ve called this “Data Engineering”
• Data engineers will be responsible for maintaining the data pipeline: ingesting data, cleaning data, feature engineering, and model discovery
Source: https://www.oreilly.com/ideas/what-machine-learning-means-for-software-development
![Page 93: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/93.jpg)
Types of metadata (1)
Business Metadata
• Definition of business vocabulary and relationships
• Definition of the value range
• Linkage to physical representation
![Page 94: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/94.jpg)
Types of metadata (2)
Report and ETL metadata
• Report definitions
• Data sources
• Column definitions
• Computations
Logical and physical metadata of data model
• Table structure
• Definition of columns
• Relationships between tables and columns
• Dimension hierarchy
![Page 95: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/95.jpg)
Benefits of Metadata management
Source: Detlef Apel: Datenqualität erfolgreich steuern, dpunkt 2015, chapter 14
• Data Lineage and dependencies
• Generating and controlling DWH processes
• Improve SW development quality
• Increase comprehensibility of KPIs
![Page 96: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/96.jpg)
Technical metadata management: very often not successful
[Figure: a central Metadata Repository collecting technical metadata from OLTP-1, OLTP-2, several microservices, the DWH and the Data Lake]
Who enriches / tags technical metadata with legal and business-relevant information?
![Page 97: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/97.jpg)
Data Catalog a hot topic
• New Data Catalog vendors are entering the market
• Established vendors rebrand and enrich their existing tools
![Page 98: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/98.jpg)
Alation architecture
Not just an RDBMS for structured metadata, but also storage engines for text data
![Page 99: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/99.jpg)
Data scientist sexiest job of the 21st century?
Source: https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-article
![Page 100: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/100.jpg)
SQL standard is evolving
Source: https://vimeo.com/289497563
![Page 101: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/101.jpg)
Exercise: compute most recent rows
Write an SQL statement that returns the most recent row for each customer.
Script to create the table including data: https://github.com/abuckenhofer/dwh_course/tree/master/scripts
| Customer_key | Name | Status | Valid_from |
| --- | --- | --- | --- |
| 1 | Brown | Single | 01-MAY-2014 |
| 2 | Bush | Married | 05-JAN-2015 |
| 1 | Miller | Married | 15-DEC-2015 |
| 3 | Stein | | 15-DEC-2015 |
| 3 | Stein | Single | 18-DEC-2015 |
SIN1.sql
![Page 102: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/102.jpg)
Exercise: compute most recent rows
![Page 103: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/103.jpg)
Exercise: compute most recent rows
create table s_customer (
    customer_key integer NOT NULL,
    cust_name    varchar2(100) NOT NULL,
    status       varchar2(10),
    valid_from   date NOT NULL,
    CONSTRAINT s_customer_pk PRIMARY KEY (customer_key, valid_from)
);
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Brown', 'Single', to_date('01.05.2014', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (2, 'Bush', 'Married', to_date('05.01.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (1, 'Miller', 'Married', to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein', NULL, to_date('15.12.2015', 'DD.MM.YYYY'));
insert into s_customer (customer_key, cust_name, status, valid_from) values (3, 'Stein', 'Single', to_date('18.12.2015', 'DD.MM.YYYY'));
commit;
![Page 104: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/104.jpg)
Exercise: compute most recent rows - solution 1: max function
-- Join every row to the latest valid_from of its customer
SELECT s.*
  FROM S_CUSTOMER s
  JOIN (SELECT i.customer_key,
               max(i.valid_from) as max_valid_from
          FROM S_CUSTOMER i
         GROUP BY i.customer_key) b
    ON s.customer_key = b.customer_key
   AND s.valid_from = b.max_valid_from;
S2IN.sql
![Page 105: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/105.jpg)
Exercise: compute most recent rows - solution 2: exists
-- Keep a row only if no later row exists for the same customer
SELECT s.*
  FROM S_CUSTOMER s
 WHERE NOT EXISTS (SELECT 1
                     FROM S_CUSTOMER i
                    WHERE s.customer_key = i.customer_key
                      AND s.valid_from < i.valid_from);
S2IN.sql
![Page 106: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/106.jpg)
Exercise: compute most recent rows - solution 3: max in correlated sub-select
-- Compare each row with the latest valid_from of its own customer
SELECT s.*
  FROM S_CUSTOMER s
 WHERE s.valid_from = (SELECT MAX(i.valid_from)
                         FROM S_CUSTOMER i
                        WHERE s.customer_key = i.customer_key);
S2IN.sql
![Page 107: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/107.jpg)
Exercise: compute most recent rows - solution 4: coalesce with sub-select
-- end_ts = valid_from of the next row of the same customer;
-- the most recent row has no successor and gets the dummy date 31.12.9999
SELECT *
  FROM (SELECT coalesce ((SELECT min (i.valid_from)
                            FROM S_CUSTOMER i
                           WHERE s.customer_key = i.customer_key
                             AND s.valid_from < i.valid_from
                         ), to_date ('31.12.9999', 'DD.MM.YYYY'))
               as end_ts,
               s.*
          FROM S_CUSTOMER s)
 WHERE end_ts = to_date ('31.12.9999', 'DD.MM.YYYY');
S2IN.sql
![Page 108: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/108.jpg)
Exercise: compute most recent rows - solution 5: max function and with-clause
-- Same logic as solution 1, with the aggregate moved into a named sub-query
WITH max_cust as (
     SELECT i.customer_key,
            max(i.valid_from) as max_valid_from
       FROM S_CUSTOMER i
      GROUP BY i.customer_key)
SELECT s.*
  FROM S_CUSTOMER s
  JOIN max_cust b ON s.customer_key = b.customer_key
                 AND s.valid_from = b.max_valid_from;
S2IN.sql
![Page 109: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/109.jpg)
Exercise: compute most recent rows - SQL analytic / windowing functions
• Partition data
• Compute functions over these partitions
  • rank [sequential order], first [first row], last [last row], lag [previous row], lead [next row]
• Return result
![Page 110: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/110.jpg)
Exercise: compute most recent rows - solution 6: analytic / windowing function lead
-- lead() returns the valid_from of the next row per customer;
-- it is NULL for the most recent row
WITH lead_cust as (
     SELECT lead (s.valid_from, 1) OVER (PARTITION BY s.customer_key
                                         ORDER BY s.valid_from ASC) as end_ts
          , s.*
       FROM s_customer s)
SELECT *
  FROM lead_cust b
 WHERE b.end_ts IS NULL;
S3IN.sql
![Page 111: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/111.jpg)
Exercise: compute most recent rows - solution 7: analytic function row_number
-- Number the rows per customer from newest to oldest and keep number 1
WITH rn_cust as (
     SELECT row_number() OVER (PARTITION BY s.customer_key
                               ORDER BY s.valid_from DESC) as rn
          , s.*
       FROM s_customer s)
SELECT *
  FROM rn_cust b
 WHERE b.rn = 1;
S3IN.sql
![Page 112: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/112.jpg)
Max or Analytic / Windowing functions: which alternative would you recommend?
• Check execution plans, execution time (service + response time) and resource usage before making a final decision
• Solutions with analytic / windowing functions do not need a self-join and show better statistics than the other solutions shown
• Analytic / windowing functions are very powerful
• Remark: a WITH clause is preferable to sub-selects in SQL statements because it improves readability, understandability and maintainability
![Page 113: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/113.jpg)
Temporal data storage (Bitemporal data)
![Page 114: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/114.jpg)
Temporal data storage (Bitemporal data)
[Figure: timeline with the dates 10.09., 20.09., 30.09. and 10.10. The price changes from 15 EUR to 16 EUR. Transaction time (10.09.): the new price of 16 EUR is entered into the DB. Valid time (20.09.): the new price becomes valid.]
![Page 115: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/115.jpg)
Temporal data storage (Bitemporal data)
Definition
Business validity: Valid time
• Time period when a fact is true in the real world
• The end user determines start and end date/time (or just a date/time for events)
Technical validity: Transaction time
• Time period when a fact stored in the database is known
• ETL process determines start and end date/time
Bitemporal data
• Combines both valid time and transaction time
![Page 116: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/116.jpg)
Temporal data storage (Bitemporal data)
SQL standard
• Temporal tables are part of the SQL standard since SQL:2011
• But RDBMSes like Oracle, Db2, SQL Server and others implement them differently
• Different syntax!
• Different coverage of the standard!
• Very useful for slowly changing dimensions type 2, but also for other purposes
![Page 117: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/117.jpg)
Db2 Valid Time example
CREATE TABLE customer_address
( customerID INTEGER NOT NULL
, name VARCHAR(100)
, city VARCHAR(100)
, valid_start DATE NOT NULL
, valid_end DATE NOT NULL
, PERIOD BUSINESS_TIME(valid_start, valid_end)
, PRIMARY KEY(customerID, BUSINESS_TIME WITHOUT OVERLAPS) );
![Page 118: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/118.jpg)
Db2 Valid Time example
INSERT INTO customer_address VALUES
(1, 'Miller', 'Seattle', '01.01.2013', '31.12.2013');
UPDATE customer_address FOR PORTION OF BUSINESS_TIME
FROM '22.05.2013' TO '31.12.2013'
SET city = 'San Diego' WHERE customerID = 1;
Resulting rows:
customerID  name    city       valid_start  valid_end
1           Miller  Seattle    01.01.2013   22.05.2013
1           Miller  San Diego  22.05.2013   31.12.2013
![Page 119: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/119.jpg)
Db2 Valid Time example
SELECT *
FROM customer_address
FOR BUSINESS_TIME AS OF '17.05.2013';
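To see what the database does behind the valid-time statements, the row splitting and the AS OF lookup can be emulated by hand. The following Python sketch uses SQLite, which has no BUSINESS_TIME support, so the helper functions update_city_for_portion and as_of are hypothetical stand-ins written for this illustration and not part of any Db2 API. It assumes ISO date strings (which compare correctly as text) and periods that include the start date and exclude the end date.

```python
import sqlite3


def build_customer_address():
    """Create the table from the slide (without the PERIOD clause) and load one row."""
    con = sqlite3.connect(":memory:")
    con.execute(
        """CREATE TABLE customer_address (
               customerID  INTEGER NOT NULL,
               name        TEXT,
               city        TEXT,
               valid_start TEXT NOT NULL,
               valid_end   TEXT NOT NULL)"""
    )
    con.execute(
        "INSERT INTO customer_address VALUES "
        "(1, 'Miller', 'Seattle', '2013-01-01', '2013-12-31')"
    )
    return con


def update_city_for_portion(con, customer_id, new_city, start, end):
    """Emulate UPDATE ... FOR PORTION OF BUSINESS_TIME FROM start TO end."""
    overlapping = con.execute(
        "SELECT rowid, name, city, valid_start, valid_end FROM customer_address "
        "WHERE customerID = ? AND valid_start < ? AND valid_end > ?",
        (customer_id, end, start),
    ).fetchall()
    for rowid, name, city, row_start, row_end in overlapping:
        if row_start < start:
            # Keep the old value for the part before the updated portion.
            con.execute(
                "INSERT INTO customer_address VALUES (?, ?, ?, ?, ?)",
                (customer_id, name, city, row_start, start),
            )
        if row_end > end:
            # Keep the old value for the part after the updated portion.
            con.execute(
                "INSERT INTO customer_address VALUES (?, ?, ?, ?, ?)",
                (customer_id, name, city, end, row_end),
            )
        # Shrink the existing row to the updated portion and set the new value.
        con.execute(
            "UPDATE customer_address SET city = ?, valid_start = ?, valid_end = ? "
            "WHERE rowid = ?",
            (new_city, max(row_start, start), min(row_end, end), rowid),
        )


def as_of(con, day):
    """Emulate SELECT ... FOR BUSINESS_TIME AS OF day."""
    return con.execute(
        "SELECT customerID, name, city, valid_start, valid_end "
        "FROM customer_address WHERE valid_start <= ? AND valid_end > ?",
        (day, day),
    ).fetchall()
```

After update_city_for_portion(con, 1, 'San Diego', '2013-05-22', '2013-12-31'), the table holds the two rows shown in the example, and as_of(con, '2013-05-17') returns the Seattle row.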
![Page 120: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/120.jpg)
Db2 transaction Time example
CREATE TABLE customer_info(
customerId INTEGER NOT NULL,
comment VARCHAR(1000) NOT NULL,
sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
PERIOD SYSTEM_TIME (sys_start, sys_end)
);
![Page 121: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/121.jpg)
Db2 transaction Time example
Transaction on 15.10.2013:
INSERT INTO customer_info (customerId, comment) VALUES (1, 'comment 1');
Transaction on 31.10.2013:
UPDATE customer_info SET comment = 'comment 2'
WHERE customerId = 1;
Current table after both transactions:
customerId  comment    sys_start   sys_end
1           comment 2  31.10.2013  31.12.2999
![Page 122: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/122.jpg)
Db2 transaction Time example
SELECT *
FROM customer_info FOR SYSTEM_TIME AS OF '17.10.2013';
Data comes from a history table:
customerId  comment    sys_start   sys_end
1           comment 1  15.10.2013  31.10.2013
Valid Time and Transaction Time can be combined = Bitemporal table
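The history-table mechanism behind transaction time can also be sketched by hand. The following Python example on SQLite is an emulation under assumptions: the table customer_info_hist and the functions insert_row, update_comment and as_of are hypothetical names, and the timestamp is passed in explicitly instead of being generated by the database, so that the result is reproducible. The open end value 2999-12-31 follows the example table above.

```python
import sqlite3

END_OF_TIME = "2999-12-31"  # open-ended sys_end, as in the example table


def create_tables(con):
    """Create the current table and a history table with the same columns."""
    for table in ("customer_info", "customer_info_hist"):
        con.execute(
            f"""CREATE TABLE {table} (
                    customerId INTEGER NOT NULL,
                    comment    TEXT NOT NULL,
                    sys_start  TEXT NOT NULL,
                    sys_end    TEXT NOT NULL)"""
        )


def insert_row(con, customer_id, comment, now):
    """A new row is known to the database from `now` on."""
    con.execute(
        "INSERT INTO customer_info VALUES (?, ?, ?, ?)",
        (customer_id, comment, now, END_OF_TIME),
    )


def update_comment(con, customer_id, comment, now):
    """Move the old version to the history table, then overwrite the current row."""
    con.execute(
        "INSERT INTO customer_info_hist "
        "SELECT customerId, comment, sys_start, ? FROM customer_info "
        "WHERE customerId = ?",
        (now, customer_id),
    )
    con.execute(
        "UPDATE customer_info SET comment = ?, sys_start = ? WHERE customerId = ?",
        (comment, now, customer_id),
    )


def as_of(con, ts):
    """Emulate FOR SYSTEM_TIME AS OF ts by reading current and history rows."""
    return con.execute(
        """SELECT customerId, comment, sys_start, sys_end
           FROM (SELECT * FROM customer_info
                 UNION ALL
                 SELECT * FROM customer_info_hist)
           WHERE sys_start <= ? AND sys_end > ?""",
        (ts, ts),
    ).fetchall()
```

With an insert on 2013-10-15 and an update on 2013-10-31, as_of(con, '2013-10-17') reads the old version from the history table, while a later timestamp reads the current table.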
![Page 123: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/123.jpg)
Indexing - why
• Very important performance improvement technique
• Good for many reads with high selectivity, but every index adds a write penalty
• B-trees are the most common index type
[Figure: B-tree index with a root block, branch blocks and leaf blocks; the leaf entries point to rows in the table]
![Page 124: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/124.jpg)
Indexing a star schema: which columns are candidates for an index?
• DBs index primary keys by default
• Dimension table columns that are regularly used in WHERE clauses are candidates
• Possibly foreign key columns in the fact table (see Star transformation later)
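Whether an index is actually used can be checked in the execution plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN from Python to compare the plan for a selective WHERE clause before and after creating an index. The table dim_customer, its columns and the index name are hypothetical, and the exact plan text differs between SQLite versions and between database products.

```python
import sqlite3


def query_plan(con, sql):
    """Return the plan details SQLite reports for a statement as one string."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


def plans_before_and_after_index():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE dim_customer ("
        "customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    con.executemany(
        "INSERT INTO dim_customer (name, country) VALUES (?, ?)",
        [(f"customer {i}", "DE" if i % 2 else "FR") for i in range(100)],
    )
    # A selective filter on a dimension column that is not the primary key.
    query = "SELECT * FROM dim_customer WHERE name = 'customer 42'"
    before = query_plan(con, query)  # no suitable index yet: full scan
    con.execute("CREATE INDEX idx_dim_customer_name ON dim_customer (name)")
    after = query_plan(con, query)   # the new index can be used for the lookup
    con.close()
    return before, after
```

The first plan should describe a scan of the whole table, the second a search that names idx_dim_customer_name.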
![Page 125: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/125.jpg)
Star transformation
• A fact table normally has many more rows than the dimension tables
• Common join techniques would first join one dimension table with the fact table
• Alternative technique: evaluate all dimensions first (cartesian join)
• Then join into the fact table as the last step
• Oracle uses bitmap indexes on the foreign key columns of the fact table to achieve a Star Join; not supported by many DBs
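The idea of combining bitmap indexes on the fact table's foreign key columns can be illustrated in a few lines of Python. This is a simplified sketch of the principle only, not Oracle's implementation: each bitmap is a Python integer in which bit i stands for fact row i, bitmaps of one dimension are combined with OR, and the results of several dimensions are combined with AND before any fact row is read. The fact rows and key names are hypothetical.

```python
# Hypothetical fact rows with two foreign keys into dimension tables.
FACTS = [
    {"date_key": 1, "product_key": 10, "amount": 5},
    {"date_key": 1, "product_key": 20, "amount": 7},
    {"date_key": 2, "product_key": 10, "amount": 3},
    {"date_key": 2, "product_key": 20, "amount": 9},
]


def build_bitmap_index(fact_rows, column):
    """Map each key value to a bitmap; bit i is set if fact row i has that value."""
    index = {}
    for position, row in enumerate(fact_rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index


def star_filter(fact_rows, bitmap_indexes, wanted_keys):
    """Return the fact rows whose keys match the filtered dimension keys.

    wanted_keys maps a foreign key column to the set of dimension keys
    that survived the filter on that dimension.
    """
    combined = (1 << len(fact_rows)) - 1  # start with all rows selected
    for column, keys in wanted_keys.items():
        column_bitmap = 0
        for key in keys:
            column_bitmap |= bitmap_indexes[column].get(key, 0)  # OR within a dimension
        combined &= column_bitmap  # AND across dimensions
    return [
        row for position, row in enumerate(fact_rows)
        if (combined >> position) & 1
    ]


INDEXES = {
    "date_key": build_bitmap_index(FACTS, "date_key"),
    "product_key": build_bitmap_index(FACTS, "product_key"),
}
```

For example, star_filter(FACTS, INDEXES, {"date_key": {2}, "product_key": {10}}) touches only the fact row that matches both dimension filters.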
![Page 126: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/126.jpg)
Partitioning
Original table:
Col1  Col2  Col3  Col4
1     A     AA    AAA
2     B     BB    BBB
3     C     CC    CCC
Vertical partitioning (split by columns):
Col1  Col2
1     A
2     B
3     C

Col3  Col4
AA    AAA
BB    BBB
CC    CCC
Horizontal partitioning (split by rows):
Col1  Col2  Col3  Col4
3     C     CC    CCC

Col1  Col2  Col3  Col4
1     A     AA    AAA
2     B     BB    BBB
![Page 127: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/127.jpg)
Horizontal partitioning
• Very powerful feature in a DWH to reduce workload
• Splits a table into smaller logical tables
• Avoids full table scans
• How could a table be split?
• Introduction to (Oracle) partitioning: https://asktom.oracle.com/partitioning-for-developers.htm
![Page 128: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/128.jpg)
Horizontal Partitioning: splitting options
• By range
  • Most common
  • Use a date field like order date to partition the table into months, days, etc.
• By list
  • Use a field that has a limited number of different values, e.g. split customer data by country if end users most likely select customers from within a country
• By hash
  • Use a field that most likely splits the data into evenly distributed chunks
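The three splitting options differ only in how a row is mapped to a partition. The following Python sketch shows one possible mapping for each option. The partition names, the list of countries, the number of hash partitions and the sample orders are assumptions made for this illustration, and a real database defines these mappings in DDL rather than in application code.

```python
import zlib
from collections import defaultdict
from datetime import date


def range_partition(order_date):
    """By range: one partition per month of the order date."""
    return f"p_{order_date.year}_{order_date.month:02d}"


def list_partition(country, listed=("DE", "FR", "US")):
    """By list: one partition per listed country, a default partition for the rest."""
    return f"p_{country}" if country in listed else "p_other"


def hash_partition(customer_id, partitions=4):
    """By hash: a stable hash of the key, reduced to the number of partitions."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % partitions


def partition_rows(rows, key_func):
    """Group rows by the partition that key_func assigns to each row."""
    parts = defaultdict(list)
    for row in rows:
        parts[key_func(row)].append(row)
    return dict(parts)


# Hypothetical sample orders.
ORDERS = [
    {"customer_id": 1, "country": "DE", "order_date": date(2019, 10, 3)},
    {"customer_id": 2, "country": "FR", "order_date": date(2019, 11, 14)},
    {"customer_id": 3, "country": "BR", "order_date": date(2019, 11, 20)},
]
```

With partition_rows(ORDERS, lambda r: range_partition(r["order_date"])), a query for November 2019 only needs to read the partition p_2019_11 instead of scanning the full table.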
![Page 129: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/129.jpg)
Parallelism
• Statements are normally executed on one CPU
• Parallelism allows the DB to distribute the execution to several CPUs
• Powerful combination with partitioning
• Parallelism is limited by the number of CPUs: if parallelism is too high, performance will degrade
• Intra-query parallelism and inter-query parallelism
![Page 130: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/130.jpg)
Compression
• Data compression + Index compression
• Store more data in a block/page = read more data during I/O
• If CPU resources are available, often a very powerful feature to improve performance
• Additionally reduce storage
• Additionally reduce backup time
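The trade-off between CPU and I/O can be made tangible with a generic compression library. The following Python sketch compresses a block of repetitive rows with zlib and compares the sizes. It only illustrates the principle: databases use their own compression formats, and the sample rows are hypothetical.

```python
import zlib


def block_sizes(rows):
    """Return the raw and the compressed size in bytes of a block of rows."""
    raw = "\n".join(rows).encode("utf-8")
    compressed = zlib.compress(raw)  # costs CPU time when writing
    return len(raw), len(compressed)


def roundtrip(rows):
    """Compress and decompress the rows; reading costs CPU time as well."""
    raw = "\n".join(rows).encode("utf-8")
    return zlib.decompress(zlib.compress(raw)).decode("utf-8").split("\n")


# Hypothetical rows with many repeated values, typical for DWH tables.
SAMPLE_ROWS = [f"{i},DE,Stuttgart,open" for i in range(1000)]
```

block_sizes(SAMPLE_ROWS) returns a compressed size well below the raw size, so the same amount of I/O transfers more rows, at the price of the CPU work for compressing and decompressing.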
![Page 131: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/131.jpg)
Already covered in a previous lecture
• Relational columnar In-Memory DB
• Materialized Views / Query Tables
![Page 132: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/132.jpg)
Exercise - recap ETL and DB specifics
• Recap ETL and DB-specific topics
• Which topics do you remember, or which do you find important?
• Write down 1-2 topics on sticky notes.
![Page 133: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/133.jpg)
Working with Alation
• Onboarding of data sources
• IT just creates cover and grants access
• Self-service: Done by application owners including source system connection
• No need for central password management
• Scales for onboarding of many systems
![Page 134: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/134.jpg)
Schema and its tables
![Page 135: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/135.jpg)
Table and its columns with sample data
![Page 136: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/136.jpg)
Columns and relationships
![Page 137: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/137.jpg)
Legal tags
GDPR and other regulations
Associate legal tags
• Articles 16-21
• Identify data
• Right to erasure
• Right to be forgotten
![Page 138: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/138.jpg)
Central vs local Data Catalogs
Central Data Catalog
• Integrated views
• Mammoth task
• No redundancy
Local Data Catalogs (reality)
• Legal requirements
• Feasibility
• Tool support very weak
[Figure: Source 1, Source 2 and Source 3 with their Data Catalogs, shown once for the central and once for the local setup]
![Page 139: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/139.jpg)
Summary
information is beautiful
Source: https://www.youtube.com/watch?v=hOex1iU57iw
![Page 140: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/140.jpg)
Data access
• End users should only have access to the Data Marts in a DWH
• Grant privileges and roles to users
• Good practice:
• Grant privileges directly to a role; do not nest roles (grant role to role)
• Grant role to users
• Distinguish end users, administrators and technical users with different password policies
• Much more difficult to grant access in a Data Lake
• Tool maturity
• Data often unknown with schema-on-read
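The good practice of granting privileges only to roles and roles only to users can be modeled in a few lines. The following Python sketch is a toy model and not a database API: the class AccessControl, the role name mart_reader and the object names are hypothetical. It deliberately offers no way to grant a role to another role, which keeps the resolution of a user's privileges to a single lookup step.

```python
class AccessControl:
    """Toy model: privileges go to roles, roles go to users, roles are not nested."""

    def __init__(self):
        self.role_privileges = {}  # role -> set of (privilege, object)
        self.user_roles = {}       # user -> set of roles

    def grant_privilege_to_role(self, role, privilege, obj):
        self.role_privileges.setdefault(role, set()).add((privilege, obj))

    def grant_role_to_user(self, role, user):
        if role not in self.role_privileges:
            raise ValueError(f"unknown role: {role}")
        self.user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user, privilege, obj):
        # One level only: look at the roles of the user, never at roles of roles.
        return any(
            (privilege, obj) in self.role_privileges[role]
            for role in self.user_roles.get(user, ())
        )


access = AccessControl()
# End users only get read access to a Data Mart object.
access.grant_privilege_to_role("mart_reader", "SELECT", "sales_mart.fact_sales")
access.grant_role_to_user("mart_reader", "alice")
```

Here access.is_allowed("alice", "SELECT", "sales_mart.fact_sales") is true, while any object outside the Data Mart stays inaccessible to her.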
![Page 141: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/141.jpg)
Evaluation criteria
Technical Metadata
Business Metadata incl. Glossary
Tagging (Linkage)
Collective Intelligence(Collaboration)
Search
Security
Source connectors
Data profiling
Data access
Lineage
API
Versioning
Architecture
Components
Prerequisites
Licencing
![Page 142: Lecture @DHBW: Data Warehousebuckenhofer/20192DWH/...•For many tasks, it’s easier to collect the data than to explicitly write the program, e.g. face recognition or chess/go •On](https://reader036.vdocument.in/reader036/viewer/2022070713/5ed384d792fae60862734a29/html5/thumbnails/142.jpg)
Data Consumers: DATA PREPARATION
Over 75%
• of time is spent on data preparation
• say they least enjoy data preparation