TRANSCRIPT
Beyond Map-Reduce
Roman Kern
ISDS, TU Graz
2018-12-03
Roman Kern (ISDS, TU Graz) Dbase2 2018-12-03 1 / 31
Scalable Data Management

Optimised for distributed computing
Properties of data - best practice

Raw
- Unstructured data is more raw than structured data
- For example, normalisation of data may throw away information
- → Store the derived data instead of the raw data iff the extraction algorithm is simple and accurate

Immutable
- Data is only added, never modified → simple
- → fault-tolerant (data cannot be lost)
- In practice, data items may be associated with a time-stamp

Perpetuity
- The eternal trueness of data (i.e. a fact)
- Data items should be assigned a time-stamp (the date when the fact was true)
Data here denotes information that cannot be derived
When should one delete data?

Never
… unless no longer needed
- e.g. historical data
- To free up space and speed up the processing
… unless there are ethical reasons or regulations

When deleting data
- Make a copy in which the to-be-deleted data is filtered out
- Run analytics jobs on the copy for testing purposes
- Finally, swap the data sets

Rolling granularity (the older the data, the more coarse-grained)
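The copy, filter and swap procedure can be sketched in a few lines; the file format (JSON lines) and the function name are illustrative assumptions, not from the slides:

```python
# Hypothetical sketch of copy-filter-swap deletion: write a filtered copy
# next to the original, test against the copy, then swap it in.
import json
import os

def delete_by_filter(dataset_path, keep):
    """Write a filtered copy of the dataset, then atomically swap it in."""
    copy_path = dataset_path + ".filtered"
    with open(dataset_path) as src, open(copy_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if keep(record):          # drop only the records to be deleted
                dst.write(line)
    # ... run analytics jobs against copy_path for testing here ...
    os.replace(copy_path, dataset_path)   # finally, swap the data sets

# Hypothetical usage: delete all records of user 42
# delete_by_filter("events.jsonl", keep=lambda r: r["user"] != 42)
```

The swap via `os.replace` is atomic on POSIX file systems, so readers see either the old or the new dataset, never a half-written one.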
Star Schema

Denormalised
- Simpler queries, (often) faster
- Data integrity is not enforced

Two different types of tables:
- Fact tables
- Dimension tables
Star Schema - Comparison of table types
             Fact Tables                     Dimension Tables
# Rows       Many                            Few
# Columns    Few                             Many
Granularity  Low                             High
Examples     Events, Snapshots, Aggregates   Product, Time, Geography
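As a concrete illustration of the two table types, here is a minimal star schema in SQLite; all table and column names are invented for this sketch:

```python
# Minimal star-schema sketch: one narrow, many-row fact table referencing
# two wide, few-row dimension tables (names are illustrative).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'widget');
INSERT INTO dim_time    VALUES (10, '2018-12-03');
INSERT INTO fact_sales  VALUES (1, 10, 9.5), (1, 10, 2.5);
""")

# The typical query joins the fact table with its dimension tables.
row = con.execute("""
    SELECT p.name, t.day, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.name, t.day
""").fetchone()
```

Note that nothing stops us from inserting a `fact_sales` row with a dangling `product_id`: as the slide says, data integrity is not enforced.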
Star Schema (Fact-based data model)

Deconstruct the data into fundamental units

Facts
- Atomic
- Timestamped

Optionally
- Identifiable
Data is just added, by default not deleted or updated
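A minimal sketch of such an append-only fact store; the names and the tuple layout are illustrative, not an implementation from the slides:

```python
# Fact-based model sketch: each fact is atomic, timestamped, and only
# appended; the current value is the fact with the newest timestamp.
facts = []  # append-only log of (timestamp, entity, attribute, value)

def add_fact(ts, entity, attribute, value):
    facts.append((ts, entity, attribute, value))

def current_value(entity, attribute):
    matching = [f for f in facts if f[1] == entity and f[2] == attribute]
    return max(matching)[3] if matching else None  # newest timestamp wins

add_fact(1, "user-1", "location", "London")
add_fact(2, "user-1", "location", "Manchester")  # supersedes, never overwrites
```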
Figure: Traditional data model, each row represents a single data item
Figure: Fact-based data model, where data items are separated into multiple facts, each stored in its own table together with a timestamp
Figure: Fact-based data model, where one fact has changed by adding new information that supersedes existing information (User 1 is now located in Manchester)
Advantages of fact-based data model

Queryable at any time

Human fault tolerant
- Easy to fall back to an older state

Handles partial information
- No need for Nulls

Data storage and query processing are kept separate
Disadvantages of fact-based data model

Additional effort to collect all relevant information

Deleted information requires effort at query time
- As there is no deleted information, just superseded information

Normalisation of the data
- Time-consuming to assemble all necessary data

Timestamp information might not be necessary
- Additional storage costs
Data Schemas

Describe the model (facts)

Can be used to enforce the structure of the data
- In practice this enforcement is mandatory
- To detect errors early on

Typically, data schemas are independent of the programming language

Used as the basis for the serialisation of the data
Graph Schemas

Nodes
- Typed entities

Edges
- Typed relationships between the entities

Properties
- Information about the entities

Advantages
Graph schemas should allow evolution of the data schema (as the application matures).
Figure: Graph schema, consisting of nodes (green, yellow), edges (relations between nodes) and properties (blue)
Serialisation Frameworks

Responsible for mapping between storage and run-time
- i.e. between a programming-language-specific representation and a serialised representation (often JSON)

Examples
- Apache Thrift
- Google Protocol Buffers
- Avro
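These frameworks generate the (de)serialisation code from a schema definition. As a rough stand-in using only the Python standard library, a dataclass can play the role of the schema; this is an illustrative sketch, not how Thrift, Protocol Buffers or Avro actually work internally:

```python
# Hand-rolled stand-in for a serialisation framework: the dataclass acts as
# the schema, and the two functions map between the run-time representation
# and a serialised (here: JSON) representation.
import json
from dataclasses import dataclass, asdict

@dataclass
class Fact:                 # "schema": field names are fixed here
    entity: str
    attribute: str
    value: str
    timestamp: int

def serialise(fact: Fact) -> str:
    return json.dumps(asdict(fact))

def deserialise(payload: str) -> Fact:
    # raises early on missing or unknown fields -> errors detected early on
    return Fact(**json.loads(payload))
```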
Master dataset

Constantly growing
- → Scalability

No need to support updates
- Just adding of new information

No random access necessary
- "Write once, bulk read many times"

Parallel readers

Sidenote
Even a distributed file system satisfies these requirements
Best practices

Store data in big files
- One job per block

Vertical partitioning
- E.g. by date

Compress data if possible
- Trade-off between speed and size

Consolidation
- Maintenance tasks (e.g. compaction)
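Partitioning by date can be sketched as routing each record to a per-day file, so a job reads only the partitions it needs; the path layout is an assumption for illustration:

```python
# Sketch of date-based partitioning: one partition (file) per day.
from collections import defaultdict

def partition_by_date(records):
    """Group records into one partition path per day."""
    partitions = defaultdict(list)
    for record in records:
        # e.g. events/2018-12-03.jsonl holds everything from that day
        partitions[f"events/{record['date']}.jsonl"].append(record)
    return partitions
```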
Map/Reduce Workflows

Multiple map/reduce jobs in sequence
Flow

Sequence of manipulations on the data
… that can be mapped to one or more map/reduce jobs

Tools exist to model these workflows
… instead of manually creating the jobs

Especially important for ETL processes
… Extract, Transform, Load
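A toy flow, sketched as two map/reduce jobs in sequence; the `map_reduce` helper is a hypothetical single-machine stand-in for a framework, shown only to illustrate chaining:

```python
# Two map/reduce jobs in sequence: word count, then a job that inverts the
# result to group words by their count, consuming job 1's output.
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Single-machine stand-in: map, shuffle (sort+group), reduce."""
    pairs = sorted(kv for r in records for kv in mapper(r))
    return [reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

# Job 1: word count
counts = map_reduce(["a b a", "b a"],
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda word, ones: (word, sum(ones)))

# Job 2: invert to count -> words
by_count = map_reduce(counts,
                      mapper=lambda kv: [(kv[1], kv[0])],
                      reducer=lambda n, words: (n, sorted(words)))
```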
Cascading

Model the whole process
… a framework to optimise the mapping
https://github.com/Cascading/CoPA/wiki
Stream Processing

Stream processing instead of batch processing?
Stream Data Processing
Two basic stream processing types

One-at-a-time
- Each event is processed individually

Micro-batch
- Multiple events are combined into a single batch
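The two consumption styles can be contrasted in a few lines; `process` stands in for arbitrary downstream logic, and the batch size is an arbitrary choice for the sketch:

```python
# One-at-a-time vs micro-batch consumption, side by side.
def one_at_a_time(events, process):
    for event in events:               # each event handled individually
        process([event])

def micro_batch(events, process, batch_size=3):
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:   # hand over a whole batch at once
            process(batch)
            batch = []
    if batch:                          # flush the final partial batch
        process(batch)
```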
Basic execution semantics

At-least-once
- Each event is guaranteed to be processed
- But it might be processed more than once (thus the same result might be reported multiple times)
- Might introduce inaccuracies

At-most-once
- Each event is processed at most once, but possibly not at all

Exactly-once
- Each event is guaranteed to be processed exactly once
Trade-off between the stream processing types

                            One-at-a-time   Micro-Batch
Lower Latency               X
Higher Throughput                           X
At-least-once semantic      X               X
Exactly-once semantic       (sometimes)     X
Simpler programming model   X
Implications

Exactly-once can be achieved via strictly ordered processing
- Fully process an event before continuing

More efficient for micro-batches
- Multiple batches in parallel
- Need to store the batch-id of the last successfully processed batch
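The batch-id technique can be sketched as follows. In a real system the state and the batch-id must be committed together atomically; this in-memory version only illustrates why a replayed batch does no harm:

```python
# Exactly-once sketch for micro-batches: storing the id of the last
# successfully processed batch with the state lets a replayed batch
# (at-least-once delivery) be detected and skipped, not double-counted.
state = {"total": 0, "last_batch_id": -1}

def process_batch(batch_id, events):
    if batch_id <= state["last_batch_id"]:
        return                            # replay after a failure: skip
    state["total"] += sum(events)
    state["last_batch_id"] = batch_id     # commit state and id together

process_batch(0, [1, 2])
process_batch(0, [1, 2])   # duplicate delivery is ignored
process_batch(1, [3])
```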
                 Strength          Examples
Batch            High throughput   Hadoop, Spark
One-at-a-time    Low latency       Storm
Micro-batch      Trade-off         Trident, Spark

Table: Comparison of the main distributed architectures
Asynchronous Architectures - Storm Stream Processing
Storm architecture

Basis of the Apache Storm software

Developed for stream processing

Improvement over the queues-and-workers architecture

Does not need intermediary queues
- Instead, tracks the data using a DAG
- An efficient implementation does exist
Storm model

Tuples: the data items

Streams: consist of a sequence of tuples

Spouts: produce streams

Bolts: process streams (i.e. consume and produce tuples)
- Any number of in or out streams

Tasks: spouts or bolts
- Inherently parallel
Topology: how the spouts and bolts are connected
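The vocabulary above can be illustrated with plain Python generators; this is not the Storm API, only the concepts, and all names are invented for the sketch:

```python
# Storm concepts in plain Python: a spout produces a stream of tuples,
# bolts consume and produce, and the topology wires them together.
def sentence_spout():
    yield from ["stream of tuples", "tuples of data"]   # produces a stream

def split_bolt(stream):
    for sentence in stream:        # consumes tuples, produces tuples
        yield from sentence.split()

def count_bolt(stream):
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: how the spout and the bolts are connected
counts = count_bolt(split_bolt(sentence_spout()))
```

In Storm each of these stages would run as many parallel tasks across the cluster; here they run as a single pipeline.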
Strategies to partition the streams (stream grouping)

Randomly
- Called shuffle grouping
- Splits the tuples evenly

By data value
- Called field grouping
Specialised: All grouping, specific bolts, …
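Shuffle and field grouping can be sketched for a fixed number of parallel bolt tasks; these are hypothetical helper functions, not Storm's implementation:

```python
# Shuffle grouping spreads tuples evenly across tasks; field grouping routes
# all tuples with the same value of the grouping field to the same task.
def shuffle_grouping(tuples, n_tasks):
    return [tuples[i::n_tasks] for i in range(n_tasks)]   # round-robin split

def field_grouping(tuples, n_tasks, field):
    tasks = [[] for _ in range(n_tasks)]
    for t in tuples:
        tasks[hash(t[field]) % n_tasks].append(t)  # same value -> same task
    return tasks
```

Field grouping is what makes stateful bolts (e.g. per-user counters) correct: every tuple for a given value is seen by exactly one task.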
The End

Next: Discussion of student projects (January)