lecture @dhbw: data warehouse part iv: big data … · hadoop and nosql since 2013. i keep my...

A company of Daimler AG

LECTURE @DHBW: DATA WAREHOUSE

PART IV: BIG DATA INTRODUCTION

ANDREAS BUCKENHOFER, DAIMLER TSS

ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas Buckenhofer

Senior DB Professional

[email protected]

Since 2009 at Daimler TSS

Department: Big Data

Business Unit: Analytics







ANDREAS BUCKENHOFER, DAIMLER TSS GMBH

Data Warehouse / DHBW, Daimler TSS

“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”

Data has always been my main focus throughout my long career in data integration. I work for Daimler TSS as Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.

I share my knowledge in internal presentations and as a speaker at international conferences. I regularly give a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University (DHBW). I also gained international experience through a two-year project in Greater London and several business trips to Asia.

I’m responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as ACE Associate. I hold current certifications such as “Certified Data Vault 2.0 Practitioner (CDVP2)”, “Big Data Architect”, “Oracle Database 12c Administrator Certified Professional”, and “IBM InfoSphere Change Data Capture Technical Professional”.

Contact/Connect
















As a 100% Daimler subsidiary, we give 100 percent, always and never less.

We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.

Our objective: We make Daimler the most innovative and digital mobility company.

NOT JUST AVERAGE: OUTSTANDING.

Daimler TSS

INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision-making process, ability to react quickly)


Daimler TSS

LOCATIONS


Daimler TSS China

Hub Beijing

10 employees

Daimler TSS Malaysia

Hub Kuala Lumpur

42 employees

Daimler TSS India

Hub Bangalore

22 employees

Daimler TSS Germany

7 locations

1000 employees*

Ulm (Headquarters)

Stuttgart

Berlin

Karlsruhe

* as of August 2017


WHAT YOU WILL LEARN TODAY

After this lecture you will be able to understand the ideas behind

• Big Data

• NoSQL

• NewSQL


LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE


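[Figure: logical standard Data Warehouse architecture. The Data Warehouse is divided into a backend and a frontend part.]

Data sources: internal data sources (OLTP) and external data sources

Layers of the Data Warehouse:

• Staging Layer (Input Layer)

• Integration Layer (Cleansing Layer)

• Core Warehouse Layer (Storage Layer)

• Aggregation Layer

• Mart Layer (Output Layer, Reporting Layer)

Cross-layer components: Metadata Management, Security, DWH Manager incl. Monitor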

TWO MAJOR DEVELOPMENTS IN THE 2000S LED TO NEW DATABASES

Google

• New hardware and software architectures to store and process the exponentially growing quantity of websites it needed to index

Amazon

• Web scale: transactional processing capability that could operate at massive scale


ONE SIZE DOES NOT FIT ALL


Source: Harrison, Next Generation Databases, Apress 2016, p. 4

Stonebraker: “One Size Fits All”: An Idea Whose Time Has Come and Gone

https://cs.brown.edu/~ugur/fits_all.pdf

NOSQL – KEY VALUE STORES


• Simple data model with pairs of

• Unique key

• Values (atomic or complex)

• Access only possible via key

• Main use case is caching

• Examples: Redis, Aerospike, Oracle NoSQL

Key      Value
userID1  ISBN1
userID2  ISBN2, ISBN8, ISBN9
userID3
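The key-value model above can be sketched in a few lines of Python. This is only an in-memory illustration, not the API of Redis, Aerospike, or Oracle NoSQL; the class `KeyValueStore` and its methods are hypothetical names. It shows the two properties from the slide: values are opaque to the store, and access is possible only via the key.

```python
class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is stored as-is; the store does not look inside it.
        self._data[key] = value

    def get(self, key, default=None):
        # The only access path is the key; there is no query by value.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("userID1", ["ISBN1"])
store.put("userID2", ["ISBN2", "ISBN8", "ISBN9"])
store.put("userID3", [])  # a key may exist with an empty value

print(store.get("userID2"))
print(store.get("userID4"))  # unknown key: falls back to the default
```

Finding all users who hold ISBN8 would require scanning every value, which is why key-value stores fit caching-style lookups better than ad-hoc queries.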

NOSQL – DOCUMENT STORES


• Structures like XML, JSON, BSON are stored in the DB

• Flexible schema

• Data and metadata are mixed

• Access via key or index

• DB does not interpret model

• Examples: MongoDB, CouchDB

ID: 12345
Name: Mustermann
Born: 04.02.1992

ID: 637
Name: Berger
Address:
  from 01.01.2005
    zip: 89004
  from 01.07.2010
    zip: 80990
    city: München
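The two example documents can be written as nested Python dictionaries to show what "flexible schema" and "access via key or index" mean in practice. This is a sketch only and does not use the MongoDB or CouchDB APIs; the nesting of the address entries and the helper `find_by_name` are assumptions made for illustration.

```python
import json

# Two documents with different fields: no shared schema is enforced.
documents = {
    12345: {"ID": 12345, "Name": "Mustermann", "Born": "04.02.1992"},
    637: {
        "ID": 637,
        "Name": "Berger",
        "Address": [
            {"from": "01.01.2005", "zip": "89004"},
            {"from": "01.07.2010", "zip": "80990", "city": "München"},
        ],
    },
}

# Secondary index on the "Name" field: field value -> list of document keys.
name_index = {}
for doc_id, doc in documents.items():
    name_index.setdefault(doc["Name"], []).append(doc_id)


def find_by_name(name):
    """Look up documents via the index instead of scanning all documents."""
    return [documents[doc_id] for doc_id in name_index.get(name, [])]


berger = find_by_name("Berger")[0]
latest_zip = berger["Address"][-1]["zip"]
print(json.dumps(berger, ensure_ascii=False, indent=2))
```

Field names travel with every document, which is what the slide means by data and metadata being mixed.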

NOSQL – WIDE COLUMN STORES


• Data are organized by keys and a flexible number of columns

• Column families separate data into different lists of columns

• Access via name (key)

• High scalability for writes and selective reads

• Examples: HBase, Cassandra

[Figure: a row is identified by its RowKey and holds a flexible number of columns; each column is a triple of Name (Key), Value, and Timestamp]
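The row layout described above (RowKey, column families, and columns stored as name, value, timestamp) can be mimicked with nested dictionaries. The sketch below is not the HBase or Cassandra API; `WideColumnTable`, the family names, and the sample rows are hypothetical.

```python
import time
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table: row key -> column family -> column -> (value, timestamp)."""

    def __init__(self, column_families):
        # Column families are fixed up front; columns inside them are not.
        self._families = set(column_families)
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value, timestamp=None):
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time() if timestamp is None else timestamp
        self._rows[row_key][family][column] = (value, ts)

    def get(self, row_key, family, column):
        cell = self._rows.get(row_key, {}).get(family, {}).get(column)
        return None if cell is None else cell[0]

    def columns(self, row_key, family):
        # Each row can carry a different set of columns.
        return sorted(self._rows.get(row_key, {}).get(family, {}))


table = WideColumnTable(["profile", "orders"])
table.put("user1", "profile", "name", "Mustermann", timestamp=1)
table.put("user2", "profile", "name", "Berger", timestamp=2)
table.put("user2", "profile", "city", "München", timestamp=3)

print(table.columns("user1", "profile"))
print(table.columns("user2", "profile"))
```

Rows user1 and user2 live in the same table but carry different columns, which is the "flexible number of columns" from the slide.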

NOSQL – GRAPH STORES


• The data model contains

• Edges

• Vertices

• Characteristics

• Relationships are of main interest

• Optimized for graph queries (graph traversal)

• Example: Neo4j

[Figure: example graph with the vertices user1, user2, user3, user4, and user5 connected by edges]
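A graph traversal can be illustrated with a breadth-first search over an adjacency list. The vertices are the five users from the figure; the edges are not recoverable from the slide, so the edge list below is an assumption. Neo4j would express such a query in Cypher, and the function `within_hops` is only a hypothetical stand-in.

```python
from collections import deque

# Assumed edges between the five users (the original figure is not recoverable).
edges = [
    ("user1", "user2"),
    ("user1", "user4"),
    ("user2", "user3"),
    ("user4", "user5"),
]

# Build an undirected adjacency list: vertex -> set of neighbours.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in sorted(graph.get(vertex, ())):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    return {v: d for v, d in distance.items() if v != start}


print(within_hops(graph, "user1", 1))  # direct contacts
print(within_hops(graph, "user1", 2))  # contacts of contacts as well
```

In a relational database the same question needs one self-join per hop, which is why graph stores optimize for traversal.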

CAP THEOREM AND BASE (BASICALLY AVAILABLE, SOFT STATE, EVENTUAL CONSISTENCY)

• “Choose 2” is not really correct, and neither is the diagram on the left

• Actually, the CAP theorem says that it is impossible for a system that guarantees consistency to guarantee 100% availability in the presence of a network partition.

So if you can only choose one, it makes sense to choose availability. (But 100% availability is not realistic anyway, e.g. 99.995%)

• If X is 4: X = 10; Y = X + 8

What is the value of Y?
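The question "If X is 4: X = 10; Y = X + 8" has two possible answers under eventual consistency, depending on which copy of X is read. The sketch below simulates this with two replicas and delayed replication. It is a deliberately simplified model; `Replica`, `write_x`, and `replicate` are hypothetical names, not a real database protocol.

```python
class Replica:
    """One copy of the data item X."""

    def __init__(self, x):
        self.x = x


primary = Replica(4)
follower = Replica(4)
pending = []  # writes accepted by the primary but not yet replicated


def write_x(value):
    # The write is acknowledged as soon as the primary has it (availability first).
    primary.x = value
    pending.append(value)


def replicate():
    # Asynchronous replication catches up later: eventual consistency.
    while pending:
        follower.x = pending.pop(0)


def compute_y(replica):
    return replica.x + 8


write_x(10)
y_stale = compute_y(follower)  # follower still holds the old X = 4
y_fresh = compute_y(primary)   # primary already holds X = 10
replicate()
y_after = compute_y(follower)  # replicas have converged

print(y_stale, y_fresh, y_after)
```

A strictly consistent system would only ever return 18. An eventually consistent one may return 12 for a while, and the application has to cope with that.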


Source: http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html

[Figure: CAP triangle with the corners Consistency, Availability, and Partition Tolerance]

CA: (RDBMS)

CP: MongoDB, HBase, Redis

AP: Cassandra, CouchDB, DynamoDB

NEWSQL

“In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date” (Google white paper)

• Importance of SQL

• Importance of Consistency / ACID

• Example: VoltDB

Source: http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p769-shute.pdf

CLOUD-NATIVE DATABASES

Are built to run in the cloud

• Ubiquitous and flexible: standard containers run in any cloud

• Resilient and scalable: high availability, redundancy, graceful degradation

• Dynamic: rolling upgrades, autonomous

• Automatable: everything is code, e.g. infrastructure

• Observable: logging, tracing, metrics (Netflix: https://medium.com/netflix-techblog/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17 )

• Distributed: take advantage of the distributed cloud

Examples: Google Spanner, CockroachDB

CNCF (CLOUD NATIVE COMPUTING FOUNDATION) CLOUD NATIVE LANDSCAPE

Source: https://github.com/cncf/landscape

NOSQL, NEWSQL, BIGDATA

• NoSQL, NewSQL, and Big Data are vaguely defined, overhyped, and overloaded terms

• NoSQL systems reject constraints of the relational model such as strict consistency and schemas

• NewSQL systems retain many features of the relational model but enrich the model with flexibility

• Big Data systems focus on technologies within the Hadoop ecosystem plus Spark


BIG DATA CHARACTERISTICS


Volume

• The amount of data

Velocity

• The speed at which data is generated

Variety

• The different types of data

Veracity

• The trustworthiness/accuracy of data

VOLUME

What is a high amount of data?

• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress

• If all sensor data were recorded at the Large Hadron Collider, the data flow would be extremely hard to work with. It would exceed an annual rate of 150 million petabytes, or nearly 500 exabytes per day, before replication

Source: https://en.wikipedia.org/wiki/Big_data


VOLUME

What is a high amount of data?

• Telecommunications (usage): AT&T transfers about 30 petabytes of data through its networks each day

• Internet: Google processed about 24 petabytes of data per day in 2009

• As of January 2013, Facebook users had uploaded over 240 billion photos, with 350 million new photos every day. For each uploaded photo, Facebook generates and stores four images of different sizes, which translated to a total of 960 billion images and an estimated 357 petabytes of storage




TB, PB, EB, ZB, YB

1 Kilobyte (kB) = 1,000 Bytes = 10^3 Bytes
1 Megabyte (MB) = 1,000,000 Bytes = 10^6 Bytes
1 Gigabyte (GB) = 1,000,000,000 Bytes = 10^9 Bytes
1 Terabyte (TB) = 10^12 Bytes
1 Petabyte (PB) = 10^15 Bytes
1 Exabyte (EB) = 10^18 Bytes
1 Zettabyte (ZB) = 10^21 Bytes
1 Yottabyte (YB) = 10^24 Bytes

Source: https://www.cmswire.com/cms/information-management/big-data-smart-data-and-the-fallacy-that-lies-between-017956.php#null
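The unit table can be used to check the volume figures quoted earlier. The sketch below uses the decimal prefixes from the table; the helper `convert` is a hypothetical name. Note that the Walmart figure "2.5 petabytes (2560 terabytes)" only works out with a factor of 1024 (binary prefixes), while the decimal table gives 2500 terabytes.

```python
from fractions import Fraction

# Decimal prefixes as listed in the table: unit -> power of ten.
DECIMAL_EXPONENT = {
    "B": 0, "kB": 3, "MB": 6, "GB": 9, "TB": 12,
    "PB": 15, "EB": 18, "ZB": 21, "YB": 24,
}


def convert(value, from_unit, to_unit):
    """Convert between decimal byte units using exact rational arithmetic."""
    value_in_bytes = Fraction(value) * 10 ** DECIMAL_EXPONENT[from_unit]
    return value_in_bytes / 10 ** DECIMAL_EXPONENT[to_unit]


# Walmart: 2.5 PB in terabytes, decimal versus binary factor.
walmart_tb_decimal = convert("2.5", "PB", "TB")
walmart_tb_binary = Fraction("2.5") * 1024

# Facebook: four stored images per uploaded photo, 357 PB in total.
images = 240 * 10**9 * 4
bytes_per_image = convert(357, "PB", "B") / images

print(walmart_tb_decimal, walmart_tb_binary)
print(images, float(convert(bytes_per_image, "B", "kB")), "kB per image")
```

The Facebook numbers imply an average of roughly 372 kB per stored image, which is a plausible size for resized photos.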

VELOCITY

What is high velocity?

• The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams, there are 100 collisions of interest per second

• Internet of Things

• Connected, autonomous cars


VARIETY

• Structured data, such as tables, typically stored in relational databases

• Unstructured data, usually generated by humans, e.g. natural language, voice, Wikipedia, Twitter posts

• Semi-structured data has some structure in tags, but the structure varies between documents, e.g. HTML, XML, JSON files, server logs

“Unstructured data” is a misleading term; e.g. tweets are structured, too.

Better: data has low information density.


VERACITY

• Data involves some uncertainty and ambiguities

• Mistakes can be introduced by humans and machines

• #FakeNews

Data Quality is vital!

Garbage In – Garbage Out

Garbage data + perfect model => garbage results


WHAT HAPPENS IN AN INTERNET MINUTE?


Source: https://www.allaccess.com/merge/archive/28030/2018-update-what-happens-in-an-internet-minute#sthash.IKyiTou1.uxfs


SOME QUOTES

• Information is the oil of the 21st century, and analytics is the combustion engine (Peter Sondergaard, Gartner Research, 2011)

• Data creation is exploding. With all the selfies and useless files people refuse to delete on the cloud. . . . The world’s data storage capacity will be overtaken. . . . Data shortages, data rationing, data black markets . . . data-geddon! (Gavin Belson, HBO's Silicon Valley, 2015)

• Data is the new gold (Open Data Initiative, European Commission)

• Big data is not about the data (Gary King, Harvard University)


BIG DATA DEFINITION 1 (2)

• Still no agreed definition

• Originally: Volume + Velocity + Variety

• Big data is a term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with. Big data challenges include capturing data, data storage, data analysis, search, etc. [ https://en.wikipedia.org/wiki/Big_data ]

• This definition is part of this lecture



BIG DATA DEFINITION 2 (2)

• Modern usage of the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set [ https://en.wikipedia.org/wiki/Big_data ]

• This definition is not part of this lecture



BIG DATA LANDSCAPE


Source: http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png


EXERCISE

Check the websites of some NoSQL or NewSQL vendors

• Which (reference) customers do they have?

• What are the customers’ use cases?

Source: http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png

Daimler TSS GmbH, Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS

Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle


THANK YOU
