A company of Daimler AG

LECTURE @DHBW: DATA WAREHOUSE
PART IX: BIG DATA ARCHITECTURES
ANDREAS BUCKENHOFER, DAIMLER TSS



ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas Buckenhofer
Senior DB [email protected]

Since 2009 at Daimler TSS / Department: Big Data / Business Unit: Analytics

ANDREAS BUCKENHOFER, DAIMLER TSS GMBH

Data Warehouse / DHBW / Daimler TSS 3

“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”

Data has always been my main focus during my long-time occupation in the area of data integration. I work for Daimler TSS as a Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.

I share my knowledge in internal presentations or as a speaker at international conferences. I'm regularly giving a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University DHBW. I also gained international experience through a two-year project in Greater London and several business trips to Asia.

I’m responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as an ACE Associate. I hold current certifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big Data Architect", "Oracle Database 12c Administrator Certified Professional", "IBM InfoSphere Change Data Capture Technical Professional", etc.

Contact/Connect: DHBW, DOAG, Xing

As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future. Our objective: We make Daimler the most innovative and digital mobility company.

NOT JUST AVERAGE: OUTSTANDING.

Daimler TSS

INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision-making processes, ability to react quickly)


Daimler TSS

LOCATIONS


Daimler TSS China: Hub Beijing, 10 employees

Daimler TSS Malaysia: Hub Kuala Lumpur, 42 employees

Daimler TSS India: Hub Bangalore, 22 employees

Daimler TSS Germany

7 locations

1000 employees*

Ulm (Headquarters)

Stuttgart

Berlin

Karlsruhe

* as of August 2017


• Describe Big Data architectures

• Lambda architecture

• Kappa architecture

DWH LECTURE - LEARNING TARGETS


• Well-known reference architectures exist for Data Warehouses

• Many tools and schema-on-read came with the Hadoop ecosystem

• The Big Data architecture was a “black box” at the beginning

• It gains more and more structure, with different layers instead of a “black box”

• Structure, modeling, organization, and governance instead of a tool-only focus

• The slides present several architectures with links to more information

BIG DATA / DATA LAKE ARCHITECTURES: INTRODUCTION


• Architecture by Nathan Marz

• Realtime and batch processing

• Batch layer stores and historizes raw data

• The serving layer has to union the batch and realtime layers

• Rather complex

• Author recommends graph data model and advises against schema-on-read

LAMBDA ARCHITECTURE


LAMBDA ARCHITECTURE

Diagram: a data stream feeds the Batch Layer (Batch Engine, raw history data) and the Real-Time Layer (Real-Time Engine); both deliver into the Serving Layer (Serving Backend, result data), which answers queries.

• Lack of human fault tolerance

• Bugs will be deployed

• Operational errors will happen, e.g. accidentally delete data

• Data loss / corruption is worst case scenario

• Without data loss / corruption, mistakes can be fixed if original data is still available

• Must design for human error like you’d design for any other fault

WHAT IS THE MAIN INEVITABLE PROBLEM IN DATA SYSTEMS?


• Atomic, immutable data

• Ensures that data cannot be deleted (accidentally)

• Fundamentally simpler

• Easy to implement on top of a distributed filesystem, e.g. Hadoop

• CR instead of CRUD:
• No updates
• No deletes
• Create (insert) and read (select) only

Immutability restricts the range of errors that can cause data loss/corruption

IMMUTABLE DATA

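The CR-only idea above can be sketched in a few lines. This is an illustrative sketch (class and method names are assumptions, not from the lecture): a store that exposes only create and read, so no operation exists that could corrupt previously written data.

```python
class ImmutableStore:
    """Sketch of a CR-only (create/read) store: append-only, no update/delete."""

    def __init__(self):
        self._records = []

    def create(self, record):
        """Append a record; existing entries are never touched."""
        self._records.append(record)

    def read(self, predicate=lambda r: True):
        """Select records matching a predicate; purely read-only."""
        return [r for r in self._records if predicate(r)]


store = ImmutableStore()
store.create({"user": "anne", "action": "follow"})
store.create({"user": "anne", "action": "unfollow"})
# Both facts are kept; the "current state" is derived by reading,
# never by updating or deleting earlier records.
anne_events = store.read(lambda r: r["user"] == "anne")
```

Because the API simply has no update or delete path, the range of errors that can destroy data shrinks to bugs in `create` itself.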

NATHAN MARZ VIEW ON SCHEMA


• Rawness: store the data as it is, no transformations

• Immutability: don’t update or delete data, just add more

• Graph-like schema recommended

„Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs“ (see page 103, Nathan Marz, „Big Data: Principles and best practices of scalable realtime data systems", Manning Publications)

Source image: Nathan Marz, James Warren: Big Data: Principles and best practices of scalable realtime data systems, Manning Publications 2015

“My own personal opinion is that data analysis is much less important than data re-analysis. It’s hard for a data team to get things right on the very first try, and the team shouldn’t be faulted for their honest efforts. When everything is available for review, and when more data is added over time, you’ll increase your chances of converging to someplace near the truth.”–Jules J. Berman.

http://www.odbms.org/blog/2014/07/big-data-science-interview-jules-j-berman/

ANALYTICS VS RE-ANALYTICS


NEW APPROACH TO BUILD DATA SYSTEMS


Diagram: immutable data (source of truth / single version of facts), stored in HDFS / NoSQL / RDBMS / NewSQL, with many (materialized) views computed on top of it; each query/application reads from one of these views.

WHAT IS THE LAMBDA ARCHITECTURE?


Diagram: the data stream feeds the Batch Layer (all data), which computes batch views in the Serving Layer, and the Speed Layer, which maintains real-time views; Query = merge of batch views and real-time views.

OVERVIEW LAMBDA ARCHITECTURE


https://de.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action

BATCH VIEWS AND REALTIME VIEWS


DATA ABSORPTION


https://dzone.com/articles/lambda-architecture-with-apache-spark

LAMBDA ARCHITECTURE – DATA EXAMPLE: COMPUTE FOLLOWER LIST


Data stream → Batch Layer: 1.1 insert Jim, 1.1 insert Anne, 2.1 remove Jim, 3.1 insert George, 5.1 insert John (now)

Data stream → Speed Layer: insert John (now)

Serving Layer batch view: Anne, George

Query (merge of batch and speed layer): Anne, George, John → Follower: 3
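The follower example can be expressed as a small sketch (function names are assumptions for illustration): the batch view is recomputed from the immutable master data, the speed layer holds only the newest event, and the query merges both.

```python
def batch_view(events):
    """Recompute the full follower set from the immutable master data."""
    followers = set()
    for op, name in events:
        if op == "insert":
            followers.add(name)
        elif op == "remove":
            followers.discard(name)
    return followers


def query(batch_followers, realtime_followers):
    """Serving layer: merge the batch view with the real-time view."""
    return batch_followers | realtime_followers


# Events already processed by the batch layer:
batch_events = [("insert", "Jim"), ("insert", "Anne"),
                ("remove", "Jim"), ("insert", "George")]
# Event that arrived after the last batch run, held by the speed layer:
speed_followers = {"John"}

result = query(batch_view(batch_events), speed_followers)
# result contains Anne, George, and John: 3 followers
```

As soon as the next batch run has processed the "insert John" event, the batch view alone yields the same answer and the speed-layer entry can be discarded.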

Computing the best answer in real time may not always be possible

• The exact answer can be computed in the batch layer and an approximate answer in the realtime layer

• Best of both worlds of performance and accuracy. For example, in a machine learning application, generating the batch model may require so much time and so many resources that the best result achievable in real time is computing approximate updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used.

EVENTUAL ACCURACY


WHICH TOOLS COULD BE USED IN THE LAMBDA ARCHITECTURE? DB REQUIREMENTS

• Batch layer: write sequentially once, bulk sequential read many times (easier for database vendors to implement than random access)

• Speed layer: random write, random read

• Serving layer: batch write, random read

The random writes and random reads of the speed layer and the random reads of the serving layer are the more challenging requirements for database vendors to implement.

WHICH TOOLS COULD BE USED IN THE LAMBDA ARCHITECTURE?


CLOUD VENDOR OFFERINGS, E.G. AMAZON AWS


https://aws.amazon.com/de/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-lambda-architecture-without-servers/

CLOUD VENDOR OFFERINGS, E.G. MICROSOFT AZURE


https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx

Source: Markus Schmidberger, "Big Data ist tot – Es lebe Business Intelligenz?", TDWI 2016

Source: Markus Schmidberger, "Big Data ist tot – Es lebe Business Intelligenz?", TDWI 2016

Lambda @glomex
• Enrich batch-driven data processing with real-time requirements
• Adapt the Lambda architecture to own requirements

Remark: AWS Lambda ≠ Lambda architecture

LAMBDA ARCHITECTURE – PROS AND CONS


Pro:
• The architecture emphasizes keeping data immutable; mistakes can be corrected via recomputation
• Reprocessing, one of the key challenges of stream processing and one that is very often ignored, is explicitly addressed

Con:
• Maintaining code that needs to produce the same result in two complex distributed systems
• Operational burden of running and debugging two systems

• Incoming data is sent to both the batch and the speed layer

• The batch layer constantly (re-)computes batch views

• Master data is stored in the batch layer in raw format: immutable & append-only

• Batch views contain all data except the most recent, due to the batch layer’s high latency

• Batch results replace data in the speed layer as soon as the batch data is newer

• The speed layer receives the data stream for real-time processing and uses incremental algorithms to refresh real-time views

• Real-time views contain the most up-to-date data only

• The serving layer merges results from the batch and speed layer

SUMMARY LAMBDA ARCHITECTURE


Jay Kreps wrote an article, „Questioning the Lambda architecture“

• He wrote about his experience with the Lambda architecture

• It works, but it is not very pleasant or productive

• Keeping code in sync between the two systems is really hard

• You need to build and operate two complex systems:

• A scalable, high-latency batch system

• A low-latency stream-processing system

• Instead of duct-taping batch & speed together:

→ Kappa architecture / log-centric architecture / stream data platform

QUESTIONING THE LAMBDA ARCHITECTURE


Source: https://www.oreilly.com/ideas/questioning-the-lambda-architecture

• Architecture by Jay Kreps

• Log-centric, write-ahead logging

• Each event is an immutable log entry and is added to the end of the log

• Read and write operations are separated

• Materialized views can be recomputed consistently from data in the log

KAPPA ARCHITECTURE

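A minimal sketch of the log-centric idea (all names here are assumptions, not Kafka API calls): every event is appended to an immutable log, and a materialized view is rebuilt at any time by replaying that log from the start.

```python
log = []  # the append-only event log: the system's source of truth


def append(event):
    """Write path: append to the end of the log, never modify in place."""
    log.append(event)


def materialize(events):
    """Rebuild a key/value view by replaying the log from the beginning."""
    view = {}
    for key, value in events:
        view[key] = value  # later log entries win
    return view


append(("user:1", "Anne"))
append(("user:2", "Jim"))
append(("user:1", "Anne B."))  # a correction is just another log entry

view = materialize(log)
# Replaying the full log always yields a consistent view, so a view with
# a bug can simply be thrown away and recomputed from the log.
```

Separating the write path (append) from the read path (materialized views) is exactly the read/write separation named in the bullets above.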

WHAT ARE THE THREE PARADIGMS OF PROGRAMMING?

• Request/Response

• Batch

• Stream processing


Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619

WHAT IS A LOG?


Source: Jay Kreps: I heart logs, O’Reilly 2014

• Database transactions/data: users, products, etc.

• Events: tweets, clicks, impressions, pageviews, etc.

• Application metrics: CPU usage, requests, etc.

• Application logs: service calls, errors, etc.

TYPES OF DATA


• A bank account’s current balance can be built from a complete list of its debits and credits, but the inverse is not true.

• In this way, the log of transactions is the more “fundamental” data structure than the database records storing the results of those transactions.

A software application’s database is better thought of as a series of time-ordered immutable facts collected since that system was born, instead of as a current snapshot of all data records as of right now.

WHAT IS A LOG?


Source: https://blog.parse.ly/post/1550/kreps-logs/
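The bank-account example above can be written out directly: the balance (a snapshot) is derived from the transaction log, but the log cannot be derived from the balance.

```python
# A complete, time-ordered list of immutable facts about one account:
transactions = [("credit", 100), ("debit", 30), ("credit", 50)]


def balance(txns):
    """Derive the current balance by folding over the transaction log."""
    return sum(amount if kind == "credit" else -amount
               for kind, amount in txns)


current = balance(transactions)
# The inverse is not possible: many different transaction histories
# produce the same balance, so the log is the more fundamental structure.
```

This is the sense in which the log of transactions is more "fundamental" than the database records that store its results.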

ENTERPRISE-WIDE ARCHITECTURE TODAY: DATA INTEGRATION CHALLENGE


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/


ENTERPRISE-WIDE ARCHITECTURE: USING LOG-CENTRIC APPROACH


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

ENTERPRISE-WIDE ARCHITECTURE: USING LOG-CENTRIC APPROACH AND CDC


Source: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/

LOG-CENTRIC ARCHITECTURE / KAPPA ARCHITECTURE / DATA STREAMING PLATFORM


Source: Jay Kreps: I heart logs, O’Reilly 2014

SUMMARY: ENTERPRISE-WIDE ARCHITECTURE


Source: Jay Kreps: I heart logs, O’Reilly 2014

Most common understanding (doing the "T" in ETL):

• Stream processing is the parallel processing of data in motion = computing on data directly as it is produced or received.

• Not necessarily transient, approximate, lossy (assumptions from Lambda architecture and other event processing systems)

WHAT IS STREAM PROCESSING?


MESSAGING SYSTEMS @LINKEDIN

1st attempt: ActiveMQ, RabbitMQ
• Problems: not distributed, throughput, persistence, ordering

2nd attempt: Kafka
• Key abstraction: logs
• Built from scratch as a distributed system by design
• Partitioning with local ordering
• Elastic scaling
• Fault tolerance


• Scalability

• Hundreds of MB/sec/server throughput

• Many TB per node

• Guarantees of a database

• All messages strictly ordered (within a partition)

• All data persistent

• Distributed by default

• Replication

• Partitioning

• Producers + consumers all fault tolerant and horizontally scalable

KAFKA: A MODERN DISTRIBUTED SYSTEM FOR STREAMS


ETL REVISITED WITH KAFKA CONNECT AND KAFKA STREAMS


Source: https://de.slideshare.net/JayKreps1/distributed-stream-processing-with-apache-kafka-71737619

• Apache Spark Streaming

• Apache Storm (Twitter, Nathan Marz)

• Apache Samza (LinkedIn)

• Apache Flink

• Apache Kafka Streams:

• Simple library

• Reprocessing

• No microbatches: everything is a stream

• Local state

• Key operations: filter, aggregate, join

WHAT ARE STREAM PROCESSING FRAMEWORKS?

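The key operations named above can be sketched in plain Python. This is an illustrative analogue, not the Kafka Streams API: generators stand in for unbounded streams, and the aggregation keeps local state (a running count per key), emitting an updated result for every event.

```python
def filter_stream(stream, predicate):
    """Filter: pass through only the events matching the predicate."""
    for event in stream:
        if predicate(event):
            yield event


def aggregate_stream(stream, key_fn):
    """Aggregate: maintain a running count per key as local state,
    emitting (key, count-so-far) after every event."""
    counts = {}
    for event in stream:
        key = key_fn(event)
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


# Example: count page views, excluding the cart page.
clicks = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
counted = list(aggregate_stream(
    filter_stream(iter(clicks), lambda e: e["page"] != "cart"),
    lambda e: e["page"]))
```

A real stream processor adds partitioning, fault-tolerant state, and joins across streams, but the per-event, stateful processing model is the same.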

Lambda architecture:

• Complex

• It works, but it is not very pleasant or productive

• Keeping code in sync is really hard

• Addresses the challenge of reprocessing:

• Bugs

• New requirements

Logs unify:

• Batch processing

• Stream processing


KAFKA @LINKEDIN


Source: https://de.slideshare.net/JayKreps1/i-32858698

SUMMARY: STREAM DATA PLATFORM USING KAFKA
UNIFYING BATCH AND STREAM PROCESSING


Source: https://de.slideshare.net/JayKreps1/i-32858698

• A message-oriented implementation requires an efficient messaging backbone that facilitates the exchange of data in a reliable and secure way with the lowest latency possible.

• Creating small, self-contained, data-driven applications that meld streaming data and microservices together is a good practice to break down large problems and projects into approachable chunks, reduce risk, and deliver value faster.

• Think of combinations of data-processing applications with microservices to deliver specific features and insights from a data stream.

KAFKA AND MICROSERVICES


KAPPA ARCHITECTURE


Diagram: the data stream feeds the Real-Time Layer (Real-Time Engine, raw history data), which delivers into the Serving Layer (Serving Backend, result data) to answer data queries.

Daimler TSS GmbH
Wilhelm-Runge-Straße 11, 89081 Ulm / Telefon +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle


THANK YOU

PS-3C ARCHITECTURE


Source: TDWI 2016

Diagram: Data Library (Storage Engine, raw history data) → 3C Layer (Preparation Engine, integrated subjects) → Serving Layer (Delivery Engine, result data) → data queries

• Architecture by Rogier Werschkull

• Store incoming data in Data Library Layer (Persistent staging = PS)

• Prepare data in a 3C layer following the “Concept – Context – Connector” model

• Concept + Connector can be virtualized on data in Data Library Layer

PS-3C ARCHITECTURE


• Architecture by Joe Caserta

• Big Data Warehouse may live in one or more platforms on premise or in the cloud

• Hadoop only

• Hadoop + MPP or RDBMS

• Additionally NoSQL or Search

POLYGLOT WAREHOUSE


POLYGLOT WAREHOUSE


Source: https://www.slideshare.net/CasertaConcepts/hadoop-and-your-data-warehouse

• Architecture by Claudia Imhoff

• Combines the stability and reliability of BI architectures while embracing new and innovative technologies and techniques

• 3 components that extend the EDW environment

• Investigative computing platform

• Data refinery

• Real-time (RT) analysis platform

THE EXTENDED DATA WAREHOUSE ARCHITECTURE (XDW)
THE ENTERPRISE ANALYTICS ARCHITECTURE


THE EXTENDED DATA WAREHOUSE ARCHITECTURE (XDW)
THE ENTERPRISE ANALYTICS ARCHITECTURE


Source: https://upside.tdwi.org/articles/2016/03/15/extending-traditional-data-warehouse.aspx

• Inflow Lake: accommodates a collection of data ingested from many different sources that are disconnected outside the lake but can be used together by being colocated within a single place

• Outflow Lake: a landing area for freshly arrived data available for immediate access or via streaming. It employs schema-on-read for the downstream data interpretation and refinement.

• Data Science Lab: most suitable for data discovery and for developing new advanced analytics models

GARTNER DATA LAKE ARCHITECTURE STYLES


Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

GARTNER DATA LAKE ARCHITECTURE STYLES


Source: http://blogs.gartner.com/nick-heudecker/data-lake-webinar-recap/

GARTNER – THE LOGICAL DWH


https://blogs.gartner.com/henry-cook/2018/01/28/the-logical-data-warehouse-and-its-jobs-to-be-done/

SUMMARY


Diagram: Landing Area (Storage Engine, raw history data) → Data Lake (Integration Engine, lightly integrated data) → Data Presentation (Delivery Engine, result data) → data queries

• Architecture by Eckerson Group

• DWHs exist together with MDM, ODS, and portions of the data lake as a collection of data that is curated, profiled, and trusted for enterprise reporting and analysis

DATA CORE – DAVE WELLS


https://www.eckerson.com/articles/the-future-of-the-data-warehouse

DATA CORE – DAVE WELLS


https://www.eckerson.com/articles/the-future-of-the-data-warehouse

DATA LIFECYCLE – DAVE WELLS


https://www.eckerson.com/articles/the-future-of-the-data-warehouse

DWH AND DATA LAKE – DAVE WELLSIN PARALLEL VS INSIDE


https://www.eckerson.com/articles/the-future-of-the-data-warehouse

• Dimensional modeling is not dead.

• The benefits are still valid in the age of Big Data, Hadoop, Spark, etc:

• Eliminate joins

• Data model is understandable for end users

• Well-suited for columnar storage + processing (e.g. SIMD)

• Nesting technique

• E.g. tables with lower granularity can be nested into a large fact table

• Usage in SQL: Flatten(kvgen(<json>))

DIMENSIONAL MODELING IN THE AGE OF BIG DATA

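The nesting technique above can be illustrated with a small sketch (plain Python, not the SQL `Flatten(kvgen(...))` call itself; function and field names are assumptions): lower-granularity line items are nested inside a fact record, and a flatten step produces one flat row per nested element for analysis.

```python
def flatten(records, nested_field):
    """Emit one flat row per element of the nested collection,
    repeating the parent record's remaining columns on each row."""
    for rec in records:
        for item in rec[nested_field]:
            row = {k: v for k, v in rec.items() if k != nested_field}
            row.update(item)
            yield row


# A fact record with nested, lower-granularity order lines:
orders = [{"order_id": 1,
           "lines": [{"sku": "A", "qty": 2},
                     {"sku": "B", "qty": 1}]}]

rows = list(flatten(orders, "lines"))
# Each order line becomes its own row carrying the parent order_id.
```

The nested form avoids a join at query time, while flattening recovers the familiar one-row-per-line-item shape when a flat view is needed.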

SELF-SERVICE DATA


https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Functional area: Data acceleration
Why important: With shadow analytics, users create redundant data copies.
Self-service approach: The system must be capable of autonomously identifying the best optimizations and adapting to emerging query patterns over time.

Functional area: Data catalog
Why important: Data consumers struggle to find data that is important to their work. Users keep private notes about data sources and data quality, meaning there is no governance.
Self-service approach: The catalog is built automatically as new data sources are brought online.

Functional area: Data virtualization
Why important: It is virtually impossible for an organization to centralize all data in a single system.
Self-service approach: Data consumers need to be able to access all data sets equally well, regardless of the underlying technology or location of the system.

SELF-SERVICE DATA


https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Functional area: Data curation
Why important: There is no single “shape” of data that works for everyone.
Self-service approach: Data consumers need the ability to interact with data sets from the context of the data itself, not exclusively from simple metadata that fails to tell the whole story. Data consumers should be capable of reshaping data to their own needs without writing any code or learning new languages.

Functional area: Data lineage
Why important: As data is accessed by data consumers and in different processes, it is important to track the provenance of the data, who accessed the data, how the data was accessed, what tools were used, and what results were obtained.
Self-service approach: As users reshape and share data sets with one another through a virtual context, a self-service data platform can seamlessly track these actions and all states of data along the way, providing full audit capabilities as well.

SELF-SERVICE DATA


https://www.oreilly.com/ideas/how-self-service-data-avoids-the-dangers-of-shadow-analytics

Functional area: Open source
Why important: Because data is essential to every area of every business, the underlying data formats and technologies used to access and process the data should be open source.
Self-service approach: Self-service data platforms build on open source standards like Apache Parquet, Apache Arrow, and Apache Calcite to store, query, and analyze data from any source.

Functional area: Security controls
Why important: Organizations safeguard their data assets with security controls that govern authentication (you are who you say you are), authorization (you can perform specific actions), auditing (a record of the actions you take), and encryption (you can only read the data if you have the right key).
Self-service approach: Self-service data platforms integrate with existing security controls of the organization, such as LDAP and Kerberos.

Hadoop (and for similar reasons Spark) has its strengths, but not as a DWH replacement, e.g.:

• Fast query reads only possible in HBase with an inflexible (use case specific) data model

• No sophisticated query optimizer

• Hadoop is very complex with many tools/versions/vendors and no standard

• Security is still at the beginning

IS THE DWH DEAD?


• The LDW is a multi-server / multi-engine architecture

• The LDW can have multiple engines, the main ones being: the data warehouse (DW), the operational data store (ODS), data marts, and the data lake

• We should not be trying to choose a single engine for all our requirements. Instead, we should sensibly distribute all our requirements across the various components we have to choose from

LOGICAL DWH (GARTNER)


https://blogs.gartner.com/henry-cook/2018/01/18/logical-data-warehouse-project-plans/