Beyond Map-Reduce (kti.tugraz.at/staff/rkern/courses/dbase2/slides_beyond.pdf)

Beyond Map-Reduce

Roman Kern

ISDS, TU Graz

2018-12-03

Roman Kern (ISDS, TU Graz) Dbase2 2018-12-03 1 / 31

Scalable Data Management

Optimised for distributed computing

Properties of data - best practice

Raw
- Unstructured data is more raw than structured data
- For example, normalisation of data may throw away information
- → Store the raw data, or the derived data iff the extraction algorithm is simple and accurate

Immutable
- Data is only added, but not modified → simple
- → fault-tolerant (data cannot be lost)
- In practice data items may be associated with a time-stamp

Perpetuity
- The eternal trueness of data (i.e. a fact)
- Data items should be assigned a time-stamp (the date when the fact has been true)

Data here denotes information that cannot be derived

When should one delete data?

Never
… unless no longer needed
- e.g. historical data
- To free up space and speed up the processing
… ethical reasons or regulations

When deleting data
- Make a copy where the to-be-deleted data is filtered out
- And run analytics jobs on the copy for testing purposes
- And finally swap the data sets

Rolling granularity (the older the data, the more coarse-grained)
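
The copy, test and swap steps above can be sketched as follows. This is only a minimal illustration under assumptions that are not in the slides: the dataset is a single JSON-lines file, and the names `delete_by_filtered_copy`, `should_delete` and `check` are hypothetical.

```python
import json
import os
import tempfile


def delete_by_filtered_copy(path, should_delete, check):
    """Rewrite the dataset at `path` without the records matching `should_delete`."""
    directory = os.path.dirname(os.path.abspath(path))
    # 1. Make a copy in which the to-be-deleted records are filtered out.
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, suffix=".tmp"
    ) as dst:
        for line in src:
            if not should_delete(json.loads(line)):
                dst.write(line)
        copy_path = dst.name
    # 2. Run a (test) analytics job on the copy before trusting it.
    with open(copy_path) as f:
        records = [json.loads(line) for line in f]
    if not check(records):
        os.remove(copy_path)
        raise ValueError("filtered copy failed the check; original left untouched")
    # 3. Swap the data sets.
    os.replace(copy_path, path)


# Hypothetical example data: drop everything older than 2015.
workdir = tempfile.mkdtemp()
dataset = os.path.join(workdir, "events.jsonl")
with open(dataset, "w") as f:
    for year in (2012, 2016, 2018):
        f.write(json.dumps({"year": year}) + "\n")

delete_by_filtered_copy(
    dataset,
    should_delete=lambda record: record["year"] < 2015,
    check=lambda records: len(records) > 0,
)
with open(dataset) as f:
    remaining = [json.loads(line)["year"] for line in f]
```

The original file is only replaced after the check on the copy has passed, so a faulty filter leaves the data untouched.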

Star Schema

Denormalised
- Simpler queries, (often) faster
- Data integrity is not enforced

Two different types of tables:
- Fact tables
- Dimension tables

Star Schema - Comparison of table types

              Fact Tables                     Dimension Tables
# Rows        Many                            Few
# Columns     Few                             Many
Granularity   Low                             High
Examples      Events, Snapshots, Aggregates   Product, Time, Geography
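
To make the two table types concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`), columns and rows are invented for illustration and do not come from the slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension table: few rows, many descriptive columns.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT, category TEXT, brand TEXT
    );
    -- Fact table: many rows, few columns (keys plus measures).
    CREATE TABLE fact_sales (
        product_id INTEGER, day TEXT, amount REAL
    );
""")
con.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?, ?)",
    [(1, "Espresso", "Coffee", "Acme"), (2, "Green tea", "Tea", "Acme")],
)
con.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, "2018-12-01", 2.5), (1, "2018-12-02", 2.5), (2, "2018-12-02", 3.0)],
)

# A typical star-schema query: aggregate the facts, grouped by a dimension attribute.
revenue_by_category = con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f JOIN dim_product AS d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
```

Note that nothing stops a fact row from referencing a missing product, which mirrors the point that data integrity is not enforced.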

Star Schema (Fact-based data model)

Deconstruct the data into fundamental units

Facts
- Atomic
- Timestamped

Optionally
- Identifiable

Data is just added, by default not deleted or updated

Figure: Traditional data model, each row represents a single data item

Figure: Fact-based data model, where data items are separated into multiple facts, each stored in its own table together with a timestamp

Figure: Fact-based data model, where one fact has changed by adding new information that supersedes existing information (User 1 is now located in Manchester)
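
The superseding behaviour in the figure can be sketched with an append-only list of timestamped facts. The `Fact` class, the helper names and the earlier location "London" are hypothetical; the slides only state that User 1 is now located in Manchester.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    entity: str      # e.g. a user id
    attribute: str   # e.g. "location"
    value: str
    timestamp: int   # when the fact became true


facts = []  # append-only: facts are never updated or deleted


def add_fact(entity, attribute, value, timestamp):
    facts.append(Fact(entity, attribute, value, timestamp))


def value_as_of(entity, attribute, timestamp):
    """Return the most recent value known at `timestamp`, or None."""
    candidates = [
        f for f in facts
        if f.entity == entity and f.attribute == attribute
        and f.timestamp <= timestamp
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.timestamp).value


add_fact("user1", "location", "London", 100)       # hypothetical earlier value
add_fact("user1", "location", "Manchester", 200)   # supersedes, does not overwrite
```

Because the older fact is kept, `value_as_of("user1", "location", 150)` still answers with the earlier location, while any timestamp from 200 onwards yields "Manchester".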

Advantages of fact-based data model

Queryable at any time

Human fault tolerant
- Easy to fall back to an older state

Handles partial information
- No need for Nulls

Data storage and query processing separate

Disadvantages of fact-based data model

Additional effort to collect all relevant information

Deleted information requires effort at query time
- As there is no deleted information, just superseded information

Normalisation of the data
- Time-consuming to assemble all necessary data

Timestamp information might not be necessary
- Additional storage costs

Data Schemas

Describe the model (facts)

Can be used to enforce the structure of the data
- In practice this enforcement is mandatory
- To detect errors early on

Typically data schemas are independent of the programming language

They are used as the basis for the serialisation of the data
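
As a rough illustration of enforcing a schema to detect errors early, the sketch below checks records against a field-to-type mapping before they are stored. The schema `LOCATION_FACT_SCHEMA` and the function `validate` are hypothetical; real schema languages (such as those of the serialisation frameworks) are considerably richer.

```python
# Hypothetical schema for a "location" fact: field name -> required Python type.
LOCATION_FACT_SCHEMA = {"user_id": int, "location": str, "timestamp": int}


def validate(record, schema):
    """Return `record` unchanged, or raise ValueError if it violates `schema`."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field}: expected {expected.__name__}")
    return record


good = validate(
    {"user_id": 1, "location": "Manchester", "timestamp": 200},
    LOCATION_FACT_SCHEMA,
)

# A record with a wrongly typed field is rejected at write time.
try:
    validate(
        {"user_id": "1", "location": "Manchester", "timestamp": 200},
        LOCATION_FACT_SCHEMA,
    )
    error = None
except ValueError as exc:
    error = str(exc)
```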

Graph Schemas

Nodes
- Typed entities

Edges
- Typed relationships between the entities

Properties
- Information about the entities

Advantages
Graph schemas should allow evolution of the data schema (as the application matures).
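
One possible way to picture typed nodes, typed edges and properties is the sketch below. All names (`Node`, `Edge`, `ALLOWED`, the types "Person" and "City") are assumptions made for this example, not part of the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str                                   # typed entity, e.g. "Person"
    properties: dict = field(default_factory=dict)   # information about the entity


@dataclass
class Edge:
    source: str
    target: str
    edge_type: str                                   # typed relationship


# Allowed (source type, edge type, target type) triples. Extending this set
# is how the schema can evolve as the application matures.
ALLOWED = {("Person", "LIVES_IN", "City")}

nodes = {}
edges = []


def add_node(node):
    nodes[node.node_id] = node


def add_edge(edge):
    triple = (
        nodes[edge.source].node_type,
        edge.edge_type,
        nodes[edge.target].node_type,
    )
    if triple not in ALLOWED:
        raise ValueError(f"edge {triple} not allowed by the schema")
    edges.append(edge)


add_node(Node("u1", "Person", {"name": "User 1"}))
add_node(Node("u2", "Person", {"name": "User 2"}))
add_node(Node("c1", "City", {"name": "Manchester"}))
add_edge(Edge("u1", "c1", "LIVES_IN"))

# Schema evolution: a new relationship type is added without touching old data.
ALLOWED.add(("Person", "KNOWS", "Person"))
add_edge(Edge("u1", "u2", "KNOWS"))
```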

Figure: Graph schema, consisting of nodes (green, yellow), edges (relations between nodes) and properties (blue)

Serialisation Frameworks

Responsible for mapping between storage and run-time
- i.e. between the programming-language-specific representation and a serialised representation (often JSON)

Examples
- Apache Thrift
- Google Protocol Buffers
- Avro
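
The mapping between a run-time object and a serialised form can be illustrated with the standard library alone. This sketch does not use Thrift, Protocol Buffers or Avro; the `LocationFact` class and both helper functions are hypothetical stand-ins for what such a framework would generate from a schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LocationFact:
    user_id: int
    location: str
    timestamp: int


def serialise(fact):
    """Run-time object -> serialised (JSON) representation."""
    return json.dumps(asdict(fact), sort_keys=True)


def deserialise(text):
    """Serialised (JSON) representation -> run-time object."""
    return LocationFact(**json.loads(text))


original = LocationFact(user_id=1, location="Manchester", timestamp=200)
wire = serialise(original)
restored = deserialise(wire)
```

A round trip through `serialise` and `deserialise` yields an object equal to the original, which is the basic property any such mapping has to provide.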

Master dataset

Constantly growing
- → Scalability

No need to support update
- Just adding of new information

No random access necessary
- “Write once, bulk read many times”

Parallel readers

Sidenote
Even a distributed file system satisfies these requirements

Best practices

Store data in big files
- One job per block

Vertical partitioning
- E.g. by date

Compress data if possible
- Trade-off between speed and size

Consolidation
- Maintenance tasks (e.g. compaction)
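
Partitioning by date combined with compression might look like the following sketch on a local file system. The `date=...` directory naming, the file name `part-0.jsonl.gz` and both functions are assumptions for illustration; a real deployment would write to a distributed file system instead.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def write_partitioned(records, root):
    """Write records into one gzip-compressed JSON-lines file per date."""
    by_date = defaultdict(list)
    for record in records:
        by_date[record["date"]].append(record)
    for date, group in by_date.items():
        partition = os.path.join(root, f"date={date}")
        os.makedirs(partition, exist_ok=True)
        with gzip.open(os.path.join(partition, "part-0.jsonl.gz"), "wt") as f:
            for record in group:
                f.write(json.dumps(record) + "\n")


def read_partition(root, date):
    """Read only the partition for `date`; other partitions are never opened."""
    path = os.path.join(root, f"date={date}", "part-0.jsonl.gz")
    with gzip.open(path, "rt") as f:
        return [json.loads(line) for line in f]


root = tempfile.mkdtemp()
write_partitioned(
    [
        {"date": "2018-12-01", "event": "login"},
        {"date": "2018-12-02", "event": "click"},
        {"date": "2018-12-02", "event": "logout"},
    ],
    root,
)
december_2nd = read_partition(root, "2018-12-02")
```

A job that only needs one day reads a single directory, which is the benefit of partitioning; gzip illustrates the size side of the speed versus size trade-off.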

Map/Reduce Workflows

Multiple map/reduce jobs in sequence

Flow

Sequence of manipulations on the data
… that can be mapped to one or more map/reduce jobs

Tools exist to model these workflows
… instead of manually creating the jobs

Especially important for ETL processes
… Extract, Transform, Load
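
A flow that maps to two map/reduce jobs in sequence can be sketched in memory as follows. The `run_job` helper is a hypothetical single-process stand-in for a map/reduce engine; the example counts words first and then groups the words by their count.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map/reduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: word count.
def split_words(line):
    return [(word, 1) for word in line.split()]


def sum_counts(word, ones):
    return (word, sum(ones))


# Job 2: group the words by their count (consumes the output of job 1).
def key_by_count(pair):
    word, count = pair
    return [(count, word)]


def collect_words(count, words):
    return (count, sorted(words))


lines = ["a b a", "b c"]
word_counts = run_job(lines, split_words, sum_counts)
words_by_count = run_job(word_counts, key_by_count, collect_words)
```

Workflow tools let the author describe the whole sequence once and derive the individual jobs, rather than wiring the output of one job into the next by hand as done here.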

Cascading

Model the whole process

… framework to optimise the mapping

https://github.com/Cascading/CoPA/wiki

Stream Processing

Stream processing instead of batch processing?

Stream Data Processing

Two basic stream processing types

One-at-a-time
- Each event is processed individually

Micro-batch
- Multiple events are combined into a single batch
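
The difference between the two types can be shown with a toy event loop. Both functions are hypothetical, and the batches here are cut by a fixed size; real micro-batch systems often cut them by time interval instead.

```python
from itertools import islice


def one_at_a_time(events, handle):
    for event in events:
        handle([event])          # each event is processed individually


def micro_batch(events, handle, batch_size):
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        handle(batch)            # several events are processed together


calls_single, calls_batched = [], []
one_at_a_time(range(5), calls_single.append)
micro_batch(range(5), calls_batched.append, batch_size=2)
```

Five events lead to five handler calls in the first case and three in the second, which hints at the latency versus throughput trade-off.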

Basic execution semantics

At-least-once
- Each event is guaranteed to be executed
- But might be processed more often (thus the same result might be reported multiple times)
- Might introduce inaccuracies

At-most-once
- Each event is optionally executed, but no more than once

Exactly-once
- Each event is guaranteed to be executed only once
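
The inaccuracy that at-least-once delivery can introduce is easy to reproduce. In this sketch the list `deliveries` is invented test data in which one event is replayed; remembering the ids that were already applied is one common way (an assumption here, not something the slides prescribe) to get an exactly-once effect on top of at-least-once delivery.

```python
# Each event carries a unique id; event 2 is delivered twice, as can happen
# when an acknowledgement is lost and the source replays the event.
deliveries = [(1, 10), (2, 20), (2, 20), (3, 30)]   # (event_id, amount)

# At-least-once with a naive consumer: the replayed event is counted twice.
naive_total = sum(amount for _, amount in deliveries)

# Same deliveries, but the consumer remembers which ids it has already applied.
seen = set()
deduplicated_total = 0
for event_id, amount in deliveries:
    if event_id in seen:
        continue                 # duplicate delivery, already applied
    seen.add(event_id)
    deduplicated_total += amount
```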

Trade-off between the stream processing types

                            One-at-a-time   Micro-Batch
Lower Latency               X
Higher Throughput                           X
At-least-once semantic      X               X
Exactly-once semantic       (sometimes)     X
Simpler programming model   X

Implications

Exactly-once can be achieved via strictly ordered processing
- Fully process an event before continuing

More efficient for micro-batches
- Multiple batches in parallel
- Need to store the batch-id of the last successfully processed batch
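
Storing the batch-id of the last successfully processed batch can be sketched as below. The `BatchCounter` class is hypothetical and assumes that batches are applied to the state in increasing id order.

```python
class BatchCounter:
    """Keeps a running total and the id of the last batch applied to it."""

    def __init__(self):
        self.total = 0
        self.last_batch_id = -1

    def apply(self, batch_id, values):
        # A replayed batch (id not larger than the stored one) is skipped,
        # so retries after a failure do not change the result.
        if batch_id <= self.last_batch_id:
            return False
        # In a real system both fields would have to be updated atomically.
        self.total += sum(values)
        self.last_batch_id = batch_id
        return True


counter = BatchCounter()
applied = [
    counter.apply(0, [1, 2]),
    counter.apply(1, [3]),
    counter.apply(1, [3]),   # replay of batch 1 after a simulated failure
    counter.apply(2, [4]),
]
```

The replayed batch is recognised by its id and ignored, so the total reflects every event exactly once even though one batch arrived twice.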

                Strength          Examples
Batch           High throughput   Hadoop, Spark
One-at-a-time   Low latency       Storm
Micro-batch     Trade-off         Trident, Spark

Table: Comparison of the main distributed architectures

Asynchronous Architectures - Storm Stream Processing

Storm architecture

Basis of the Apache Storm software

Developed for stream processing

Improvement over the queues and workers architecture

Does not need intermediary queues
- Instead, tracks the data using a DAG
- An efficient implementation does exist

Storm model

Tuples: the data items

Streams: consist of a sequence of tuples

Spouts: produce streams

Bolts: process streams (i.e. consume and produce tuples)
- Any number of in or out streams

Tasks: spouts or bolts
- Inherently parallel

Topology: how the spouts and bolts are connected
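
The vocabulary above can be mimicked with plain Python generators. This is a single-process toy and does not use the Apache Storm API; the function names and the sentences are invented for illustration.

```python
def sentence_spout():
    """Spout: the source of a stream of tuples."""
    yield ("the quick fox",)
    yield ("the lazy dog",)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: consumes word tuples and emits a running (word, count) tuple."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt.
output = list(count_bolt(split_bolt(sentence_spout())))
```

In a real topology each spout and bolt would run as many parallel tasks, so the way tuples are routed between them becomes a design decision.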

Strategies to partition the streams (stream grouping)

Randomly
- Called shuffle grouping
- Splits the tuples evenly

By data value
- Called fields grouping

Specialised: All grouping, specific bolts, …
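
The two main strategies can be sketched as functions that assign each tuple to a task index. Both helpers are hypothetical and not taken from Storm; the shuffle variant walks round-robin over a randomly ordered task list, which is one simple way to obtain an even random split.

```python
import random
import zlib
from collections import Counter


def shuffle_grouping(tuples, num_tasks, seed=0):
    """Spread tuples evenly over the tasks, in a randomised task order."""
    order = list(range(num_tasks))
    random.Random(seed).shuffle(order)
    return [(order[i % num_tasks], t) for i, t in enumerate(tuples)]


def fields_grouping(tuples, num_tasks, field_index):
    """Send tuples with the same field value to the same task."""
    return [
        (zlib.crc32(str(t[field_index]).encode()) % num_tasks, t)
        for t in tuples
    ]


words = [("a",), ("b",), ("a",), ("c",), ("a",), ("b",)]
shuffled = shuffle_grouping(words, num_tasks=3)
grouped = fields_grouping(words, num_tasks=3, field_index=0)

load_per_task = Counter(task for task, _ in shuffled)
tasks_for_a = {task for task, t in grouped if t[0] == "a"}
```

Grouping by field value matters for stateful bolts such as a word counter: all tuples for the same word must reach the same task, otherwise the counts would be split across tasks.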

The End

Next: Discussion of student projects (January)
