A company of Daimler AG
LECTURE @DHBW: DATA WAREHOUSE
PART IV: BIG DATA INTRODUCTION
ANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB Professional
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
ANDREAS BUCKENHOFER, DAIMLER TSS GMBH
Data Warehouse / DHBW, Daimler TSS
“Forming good abstractions and avoiding complexity
is an essential part of a successful data architecture”
Data has always been my main focus throughout my long career in the area of
data integration. I work for Daimler TSS as Database Professional and Data Architect
with over 20 years of experience in Data Warehouse projects. I have been working with
Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new
things, experiment, and program every day.
I share my knowledge in internal presentations and as a speaker at international
conferences. I regularly give a full lecture on Data Warehousing and a seminar on
modern data architectures at Baden-Wuerttemberg Cooperative State University
DHBW. I also gained international experience through a two-year project in Greater
London and several business trips to Asia.
I’m responsible for In-Memory DB Computing at the independent German Oracle User
Group (DOAG) and was honored by Oracle as ACE Associate. I hold current
certifications such as “Certified Data Vault 2.0 Practitioner (CDVP2)”, “Big Data
Architect”, “Oracle Database 12c Administrator Certified Professional”, “IBM
InfoSphere Change Data Capture Technical Professional”, etc.
Contact/Connect
As a 100% Daimler subsidiary, we give
100 percent, always and never less.
We love IT and pull out all the stops to
aid Daimler's development with our
expertise on its journey into the future.
Our objective: We make Daimler the
most innovative and digital mobility
company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision making process,
ability to react quickly)
Daimler TSS
LOCATIONS
Daimler TSS China
Hub Beijing
10 employees
Daimler TSS Malaysia
Hub Kuala Lumpur
42 employees
Daimler TSS India
Hub Bangalore
22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
• By the end of this lecture you will be able to
• Understand the ideas behind
• Big Data
• NoSQL
• NewSQL
WHAT YOU WILL LEARN TODAY
LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE
[Figure: logical standard data warehouse architecture. Internal data sources (OLTP) and external data sources feed the data warehouse (backend and frontend), which is organized into the Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, and Mart Layer (Output Layer, Reporting Layer). Metadata Management, Security, and DWH Manager incl. Monitor are shown alongside the layers.]
Google
• New hardware and software architectures to store and process the exponentially growing quantity of websites it needed to index
Amazon
• Webscale: transactional processing capability that could operate at massive scale
TWO MAJOR DEVELOPMENTS IN THE 2000S LED TO NEW DATABASES
ONE SIZE DOES NOT FIT ALL
Source: Harrison, Next Generation Databases, Apress 2016, p. 4
Stonebraker: “One Size Fits All”: An Idea Whose Time Has Come and Gone
https://cs.brown.edu/~ugur/fits_all.pdf
NOSQL – KEY VALUE STORES
• Simple data model with pairs of
• Unique key
• Values (atomic or complex)
• Access only possible via key
• Main use case is caching
• Examples: Redis, Aerospike, Oracle NoSQL
Key Value
userID1 ISBN1
userID2 ISBN2, ISBN8, ISBN9
userID3
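The key-value model on this slide can be sketched in a few lines of Python. This is a toy in-memory illustration, not the API of Redis, Aerospike, or Oracle NoSQL; the class name `KeyValueStore` and its methods are hypothetical, and the empty value for userID3 is an assumption based on the blank cell in the example.

```python
from typing import Dict, List, Optional


class KeyValueStore:
    """Toy in-memory key-value store: every access goes through the key."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}

    def put(self, key: str, value: List[str]) -> None:
        # The store treats the value as opaque; it is only replaced as a whole.
        self._data[key] = value

    def get(self, key: str) -> Optional[List[str]]:
        # Lookup by key is the only supported access path.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
store.put("userID1", ["ISBN1"])
store.put("userID2", ["ISBN2", "ISBN8", "ISBN9"])
store.put("userID3", [])  # assumption: no books stored for this user yet

print(store.get("userID2"))
print(store.get("userID9"))  # unknown key
```

A question such as "which users hold ISBN8?" cannot be answered without scanning all keys, which is why the slide names caching as the main use case.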
NOSQL – DOCUMENT STORES
• Structures like XML, JSON, BSON are stored in the DB
• Flexible schema
• Data and metadata are mixed
• Access via key or index
• DB does not interpret model
• Examples: MongoDB, CouchDB
ID: 12345
Name: Mustermann
Born: 04.02.1992
ID: 637
Name: Berger
Address:
from 01.01.2005
zip: 89004
from 01.07.2010
zip: 80990
city: München
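The two example documents can be written as JSON-like structures to show what "flexible schema" and "data and metadata are mixed" mean in practice. This is a minimal sketch using plain Python dictionaries rather than MongoDB or CouchDB; the nesting of the address entries is one reading of the slide, and `find_by_zip` is a hypothetical helper that scans all documents where a real document store would use an index.

```python
import json

# Two documents with different shapes in the same collection:
# field names (metadata) are stored together with the values (data).
documents = {
    "12345": {"name": "Mustermann", "born": "04.02.1992"},
    "637": {
        "name": "Berger",
        "address": [
            {"from": "01.01.2005", "zip": "89004"},
            {"from": "01.07.2010", "zip": "80990", "city": "München"},
        ],
    },
}


def find_by_zip(docs, zip_code):
    """Return the IDs of documents that have an address with the given zip.

    A full scan stands in for the secondary index a document store would use.
    """
    return [
        doc_id
        for doc_id, doc in docs.items()
        if any(entry.get("zip") == zip_code for entry in doc.get("address", []))
    ]


# Access via key
print(json.dumps(documents["637"], ensure_ascii=False, indent=2))
# Access via a (simulated) index on address.zip
print(find_by_zip(documents, "80990"))
```

Document 12345 has no address at all, and the two address entries of document 637 do not share the same fields; neither case requires a schema change.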
NOSQL – WIDE COLUMN STORES
• Data are organized by keys and a flexible number of columns
• Column families separate data into different lists of columns
• Access via name (key)
• High scalability for writes and selective reads
• Examples: HBase, Cassandra
[Figure: wide-column layout. Each RowKey points to cells that consist of Name (Key), Value, and Timestamp.]
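The layout of row key, column family, column name, value, and timestamp can be sketched as nested dictionaries. This is an illustrative toy, not how HBase or Cassandra are implemented or accessed; the class `WideColumnTable`, the column families `profile` and `orders`, and the sample values are all assumptions made for the example. Unlike real wide-column stores, the sketch keeps only the latest version of each cell.

```python
import time
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table: row key -> column family -> column -> (value, timestamp)."""

    def __init__(self, column_families):
        # Column families are fixed up front; columns inside them are not.
        self.column_families = set(column_families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value, timestamp=None):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time() if timestamp is None else timestamp
        self.rows[row_key][family][column] = (value, ts)

    def get(self, row_key, family, column):
        # .get() avoids creating empty rows as a side effect of reading.
        cell = self.rows.get(row_key, {}).get(family, {}).get(column)
        return None if cell is None else cell[0]


table = WideColumnTable(["profile", "orders"])
table.put("user1", "profile", "name", "Berger", timestamp=1)
table.put("user1", "orders", "2018-07-01", "ISBN1", timestamp=2)
table.put("user2", "profile", "name", "Mustermann", timestamp=3)
table.put("user2", "profile", "city", "München", timestamp=4)

print(table.get("user1", "profile", "name"))
print(sorted(table.rows["user2"]["profile"]))  # user2 has a column user1 lacks
```

Each row carries its own set of columns, so adding the column `city` for user2 does not touch user1.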
NOSQL – GRAPH STORES
• The data model contains
• Edges
• Vertices
• Characteristics
• Relationships are of main interest
• Optimized for graph queries (graph traversal)
• Example: Neo4j
[Figure: example graph with the vertices user1, user2, user3, user4, and user5]
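A graph traversal over the five users can be sketched with an adjacency list and a breadth-first search. The edges below are invented for illustration, since the relationships in the slide's figure are not recoverable from the transcript, and `reachable_within` is a hypothetical helper, not a Neo4j API.

```python
from collections import deque

# Hypothetical directed "follows" edges between the five users on the slide.
follows = {
    "user1": ["user2", "user3"],
    "user2": ["user4"],
    "user3": ["user4", "user5"],
    "user4": [],
    "user5": ["user1"],
}


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal: map each vertex reachable within max_hops to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


print(reachable_within(follows, "user1", 1))
print(reachable_within(follows, "user1", 2))
```

In a relational schema the same question needs one self-join per hop; a graph store follows the edges directly, which is what "optimized for graph traversal" refers to.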
• “Choose 2” is not really correct, and neither is the diagram on the left
• Actually, the CAP theorem says that it is impossible for a system that guarantees
consistency to guarantee 100% availability in the presence of a network partition.
So if you can only choose one, it makes sense to choose availability. (But 100%
availability is not realistic anyway, e.g. 99.995%)
• If X is 4: X = 10; Y = X + 8
What is the value of Y?
CAP THEOREM AND BASE (BASICALLY AVAILABLE, SOFT STATE, EVENTUAL CONSISTENCY)
Source: http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html
[Figure: CAP triangle with the corners Consistency, Availability, and Partition Tolerance. CA: RDBMS. CP: MongoDB, HBase, Redis. AP: Cassandra, CouchDB, DynamoDB.]
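The question "If X is 4: X = 10; Y = X + 8. What is the value of Y?" can be played through with two copies of X. The following sketch assumes one reading of the slide: under eventual consistency a read may hit a replica that has not yet received the write. The `Replica` class, the `pending` replication log, and the helper functions are hypothetical and simulate replication lag in a single process.

```python
class Replica:
    """One copy of the variable X."""

    def __init__(self, x):
        self.x = x


primary = Replica(x=4)
replica = Replica(x=4)
pending = []  # writes accepted by the primary but not yet applied to the replica


def write_x(value):
    # The write is acknowledged as soon as the primary has it (availability first).
    primary.x = value
    pending.append(value)


def replicate():
    # Asynchronous replication catches up at some later point.
    while pending:
        replica.x = pending.pop(0)


def compute_y(node):
    return node.x + 8


write_x(10)
y_stale = compute_y(replica)    # replica still sees X = 4
y_primary = compute_y(primary)  # primary already sees X = 10
replicate()
y_after = compute_y(replica)    # replica has converged

print(y_stale, y_primary, y_after)
```

Depending on which copy serves the read, Y is 12 or 18 until replication has caught up. A strongly consistent system would only ever return 18 after the write is acknowledged.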
“In all such systems, we find developers spend a significant fraction of
their time building extremely complex and error-prone mechanisms to
cope with eventual consistency and handle data that may be out of date”
(Google white paper)
• Importance of SQL
• Importance of Consistency / ACID
• Example: VoltDB
NEWSQL
Source: http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p769-shute.pdf
Are built to run in the cloud
• Ubiquitous and flexible: standard containers run in any cloud
• Resilient and scalable: high availability, redundancy, graceful degradation
• Dynamic: rolling upgrades, autonomous operation
• Automatable: everything is code, e.g. infrastructure
• Observable: logging, tracing, metrics (Netflix: https://medium.com/netflix-techblog/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17 )
• Distributed: take advantage of distributed cloud
Examples: Google Spanner, CockroachDB
CLOUD-NATIVE DATABASES
Source:
CNCF (CLOUD NATIVE COMPUTING FOUNDATION) CLOUD NATIVE LANDSCAPE
Source: https://github.com/cncf/landscape
• NoSQL, NewSQL, and Big Data are vaguely defined, overhyped, and
overloaded terms
• NoSQL systems reject constraints of the relational model such as strict consistency and
schemas
• NewSQL systems retain many features of the relational model but enrich the model with
flexibility
• Big Data systems focus on technologies within the Hadoop ecosystem plus Spark
NOSQL, NEWSQL, BIGDATA
BIG DATA CHARACTERISTICS
Volume
• The amount of data
Velocity
• The speed at which data is generated
Variety
• The different types of data
Veracity
• The trustworthiness/accuracy of data
What is a high amount of data?
• Walmart handles more than 1 million customer transactions every hour,
which are imported into databases estimated to contain more than 2.5
petabytes (2560 terabytes) of data — the equivalent of 167 times the
information contained in all the books in the US Library of Congress
• If all sensor data from the Large Hadron Collider were recorded, the data flow
would be extremely hard to work with. The data flow would exceed an annual rate
of 150 million petabytes, or nearly 500 exabytes per day, before
replication
VOLUME
https://en.wikipedia.org/wiki/Big_data
What is a high amount of data?
• Telecommunications (usage): AT&T transfers about 30 petabytes of
data through its networks each day.
• Internet: Google processed about 24 petabytes of data per day in 2009
• As of January 2013, Facebook users had uploaded over 240 billion photos,
with 350 million new photos every day. For each uploaded photo,
Facebook generates and stores four images of different sizes, which
translated to a total of 960 billion images and an estimated 357
petabytes of storage
VOLUME
https://en.wikipedia.org/wiki/Big_data
1 Kilobyte kB = 1,000 Bytes = 10^3 Bytes
1 Megabyte MB = 1,000,000 Bytes = 10^6 Bytes
1 Gigabyte GB = 1,000,000,000 Bytes = 10^9 Bytes
1 Terabyte TB = 10^12 Bytes
1 Petabyte PB = 10^15 Bytes
1 Exabyte EB = 10^18 Bytes
1 Zettabyte ZB = 10^21 Bytes
1 Yottabyte YB = 10^24 Bytes
TB, PB, EB, ZB, YB
Source: https://www.cmswire.com/cms/information-management/big-data-smart-data-and-the-fallacy-that-lies-between-017956.php#null
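The unit table can be used to sanity-check the figures quoted on the Volume slides. The sketch below uses the decimal prefixes from the table; the helper `convert` and the variable names are made up for this example. The "2560 terabytes" quoted for Walmart appear to use a factor of 1024 per step, whereas the decimal table gives 2500 TB for 2.5 PB.

```python
# Decimal (SI) prefixes as listed in the table above, expressed in bytes.
PREFIXES = {
    "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two decimal byte units."""
    return value * PREFIXES[from_unit] / PREFIXES[to_unit]


# Walmart: 2.5 PB in TB, decimal versus a binary factor of 1024
walmart_tb_decimal = convert(2.5, "PB", "TB")
walmart_tb_binary = 2.5 * 1024

# LHC: 150 million PB per year, expressed in EB per day
lhc_eb_per_day = convert(150e6, "PB", "EB") / 365

# Facebook: 240 billion photos, four stored sizes each, 357 PB in total
facebook_images = 240 * 10**9 * 4
facebook_kb_per_image = convert(357, "PB", "kB") / facebook_images

print(walmart_tb_decimal, walmart_tb_binary)
print(round(lhc_eb_per_day), "EB per day")
print(round(facebook_kb_per_image), "kB per stored image")
```

The annual LHC figure works out to roughly 411 EB per day, so the quoted "nearly 500 exabytes per day" is a loose rounding, and the Facebook numbers imply an average of a few hundred kilobytes per stored image.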
What is high velocity?
• The Large Hadron Collider experiments represent about 150 million
sensors delivering data 40 million times per second. There are nearly 600
million collisions per second. After filtering and refraining from recording
more than 99.99995% of these streams, there are 100 collisions of
interest per second
• Internet of Things
• Connected, autonomous cars
VELOCITY
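The Large Hadron Collider figures can be checked with a few lines of arithmetic. This sketch assumes that the quoted filter rate can be applied to the collision count, which the slide does not state explicitly; all variable names are made up for the example.

```python
sensors = 150_000_000
deliveries_per_sensor_per_second = 40_000_000
collisions_per_second = 600_000_000
kept_per_second = 100

# Raw sensor readings produced every second
readings_per_second = sensors * deliveries_per_sensor_per_second

# Share of collisions that is discarded when only 100 per second are kept
kept_fraction = kept_per_second / collisions_per_second
discarded_percent = (1 - kept_fraction) * 100

print(f"{readings_per_second:.1e} readings per second")
print(f"{discarded_percent:.6f}% of collisions discarded")
```

Keeping 100 out of 600 million collisions per second means discarding more than 99.99995% of them, which is consistent with the figure on the slide.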
• Structured data like tables, typically stored in relational databases
• Unstructured data, usually generated by humans, e.g. natural language,
voice, Wikipedia, Twitter posts
• Semi-structured data has some structure in tags, but the structure varies between
documents, e.g. HTML, XML, JSON files, server logs
“Unstructured data” is a poor term, e.g. tweets are structured, too.
Better: data has low information density.
VARIETY
• Data involves some uncertainty and ambiguities
• Mistakes can be introduced by humans and machines
• #FakeNews
Data Quality is vital!
Garbage In – Garbage Out
Garbage data + perfect model => garbage results
VERACITY
WHAT HAPPENS IN AN INTERNET MINUTE?
Source: https://www.allaccess.com/merge/archive/28030/2018-update-what-happens-in-an-internet-minute#sthash.IKyiTou1.uxfs
• Information is the oil of the 21st century, and analytics is the
combustion engine (Peter Sondergaard, Gartner Research, 2011)
• Data creation is exploding. With all the selfies and useless files people
refuse to delete on the cloud. . . . The world’s data storage capacity will be
overtaken. . . . Data shortages, data rationing, data black markets . . .
data-geddon! (Gavin Belson, HBO's Silicon Valley, 2015)
• Data is the new gold (Open Data Initiative, European Commission)
• Big data is not about the data (Gary King, Harvard University)
SOME QUOTES
• Still no agreed definition
• Originally:
• Volume +
• Velocity +
• Variety
• Big data is a term used to refer to the study and applications of data
sets that are too complex for traditional data-processing application
software to adequately deal with. Big data challenges include capturing
data, data storage, data analysis, search, etc. [ https://en.wikipedia.org/wiki/Big_data ]
• Part of this lecture
BIG DATA DEFINITION 1(2)
• Modern usage of the term "big data" tends to refer to the use
of predictive analytics, user behavior analytics, or certain other
advanced data analytics methods that extract value from data, and
seldom to a particular size of data set [ https://en.wikipedia.org/wiki/Big_data ]
• Not part of this lecture
BIG DATA DEFINITION 2(2)
BIG DATA LANDSCAPE
Source: http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png
Check the websites of some NoSQL or NewSQL vendors
• Which (reference) customers do they have?
• What is the customer’s use case?
EXERCISE
Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
THANK YOU