TRANSCRIPT
Beyond Map-Reduce
Roman Kern
ISDS, TU Graz
2018-12-03
Roman Kern (ISDS, TU Graz) Dbase2 2018-12-03 1 / 31
Scalable Data Management

Optimised for distributed computing
Properties of data - best practice

Raw
- Unstructured data is more raw than structured data
- For example, normalisation of data may throw away information
- → Store the derived data instead of the raw data iff the extraction algorithm is simple and accurate

Immutable
- Data is only added, never modified → simple
- → fault-tolerant (data cannot be lost)
- In practice, data items may be associated with a time-stamp

Perpetuity
- The eternal trueness of data (i.e. a fact)
- Data items should be assigned a time-stamp (the date when the fact was true)
Data here denotes information that cannot be derived
When should one delete data?

Never
… unless no longer needed
- e.g. historical data
- To free up space and speed up the processing
… unless there are ethical reasons or regulations

When deleting data
- Make a copy in which the to-be-deleted data is filtered out
- Run analytics jobs on the copy for testing purposes
- Finally, swap the data sets

Rolling granularity (the older the data, the more coarse-grained)
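The copy, filter and swap procedure can be sketched in a few lines; the file format (JSON lines) and the function name are illustrative assumptions, not from the slides:

```python
# Hypothetical sketch of copy-filter-swap deletion: write a filtered copy
# next to the original, test against the copy, then swap it in.
import json
import os

def delete_by_filter(dataset_path, keep):
    """Write a filtered copy of the dataset, then atomically swap it in."""
    copy_path = dataset_path + ".filtered"
    with open(dataset_path) as src, open(copy_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if keep(record):          # drop only the records to be deleted
                dst.write(line)
    # ... run analytics jobs against copy_path for testing here ...
    os.replace(copy_path, dataset_path)   # finally, swap the data sets

# Hypothetical usage: delete all records of user 42
# delete_by_filter("events.jsonl", keep=lambda r: r["user"] != 42)
```

The swap via `os.replace` is atomic on POSIX file systems, so readers see either the old or the new dataset, never a half-written one.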
Star Schema

Denormalised
- Simpler queries, (often) faster
- Data integrity is not enforced

Two different types of tables:
- Fact tables
- Dimension tables
Star Schema - Comparison of table types
             Fact Tables                     Dimension Tables
# Rows       Many                            Few
# Columns    Few                             Many
Granularity  Low                             High
Examples     Events, Snapshots, Aggregates   Product, Time, Geography
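As a concrete illustration of the two table types, here is a minimal star schema in SQLite; all table and column names are invented for this sketch:

```python
# Minimal star-schema sketch: one narrow, many-row fact table referencing
# two wide, few-row dimension tables (names are illustrative).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'widget');
INSERT INTO dim_time    VALUES (10, '2018-12-03');
INSERT INTO fact_sales  VALUES (1, 10, 9.5), (1, 10, 2.5);
""")

# The typical query joins the fact table with its dimension tables.
row = con.execute("""
    SELECT p.name, t.day, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.name, t.day
""").fetchone()
```

Note that nothing stops us from inserting a `fact_sales` row with a dangling `product_id`: as the slide says, data integrity is not enforced.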
Star Schema (Fact-based data model)

Deconstruct the data into fundamental units

Facts
- Atomic
- Timestamped

Optionally
- Identifiable
Data is just added, by default not deleted or updated
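A minimal sketch of such an append-only fact store; the names and the tuple layout are illustrative, not an implementation from the slides:

```python
# Fact-based model sketch: each fact is atomic, timestamped, and only
# appended; the current value is the fact with the newest timestamp.
facts = []  # append-only log of (timestamp, entity, attribute, value)

def add_fact(ts, entity, attribute, value):
    facts.append((ts, entity, attribute, value))

def current_value(entity, attribute):
    matching = [f for f in facts if f[1] == entity and f[2] == attribute]
    return max(matching)[3] if matching else None  # newest timestamp wins

add_fact(1, "user-1", "location", "London")
add_fact(2, "user-1", "location", "Manchester")  # supersedes, never overwrites
```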
Figure: Traditional data model, each row represents a single data item
Figure: Fact-based data model, where data items are separated into multiple facts, each stored in its own table together with a timestamp
Figure: Fact-based data model, where one fact has changed by adding new information that supersedes existing information (User 1 is now located in Manchester)
Advantages of fact-based data model

Queryable at any time

Human fault tolerant
- Easy to fall back to an older state

Handles partial information
- No need for Nulls

Data storage and query processing are kept separate
Disadvantages of fact-based data model

Additional effort to collect all relevant information

Deleted information requires effort at query time
- As there is no deleted information, just superseded information

Normalisation of the data
- Time-consuming to assemble all necessary data

Timestamp information might not be necessary
- Additional storage costs
Data Schemas

Describe the model (facts)

Can be used to enforce the structure of the data
- In practice this enforcement is mandatory
- To detect errors early on

Typically, data schemas are independent of the programming language

Used as the basis for the serialisation of the data
Graph Schemas

Nodes
- Typed entities

Edges
- Typed relationships between the entities

Properties
- Information about the entities

Advantages
Graph schemas should allow evolution of the data schema (as the application matures).
Figure: Graph schema, consisting of nodes (green, yellow), edges (relations between nodes) and properties (blue)
Serialisation Frameworks

Responsible for mapping between storage and run-time
- i.e. between a programming-language-specific representation and a serialised representation (often JSON)

Examples
- Apache Thrift
- Google Protocol Buffers
- Avro
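These frameworks generate the (de)serialisation code from a schema definition. As a rough stand-in using only the Python standard library, a dataclass can play the role of the schema; this is an illustrative sketch, not how Thrift, Protocol Buffers or Avro actually work internally:

```python
# Hand-rolled stand-in for a serialisation framework: the dataclass acts as
# the schema, and the two functions map between the run-time representation
# and a serialised (here: JSON) representation.
import json
from dataclasses import dataclass, asdict

@dataclass
class Fact:                 # "schema": field names are fixed here
    entity: str
    attribute: str
    value: str
    timestamp: int

def serialise(fact: Fact) -> str:
    return json.dumps(asdict(fact))

def deserialise(payload: str) -> Fact:
    # raises early on missing or unknown fields -> errors detected early on
    return Fact(**json.loads(payload))
```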
Master dataset

Constantly growing
- → Scalability

No need to support updates
- Just adding of new information

No random access necessary
- "Write once, bulk read many times"

Parallel readers

Sidenote
Even a distributed file system satisfies these requirements
Best practices

Store data in big files
- One job per block

Vertical partitioning
- E.g. by date

Compress data if possible
- Trade-off between speed and size

Consolidation
- Maintenance tasks (e.g. compaction)
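Partitioning by date can be sketched as routing each record to a per-day file, so a job reads only the partitions it needs; the path layout is an assumption for illustration:

```python
# Sketch of date-based partitioning: one partition (file) per day.
from collections import defaultdict

def partition_by_date(records):
    """Group records into one partition path per day."""
    partitions = defaultdict(list)
    for record in records:
        # e.g. events/2018-12-03.jsonl holds everything from that day
        partitions[f"events/{record['date']}.jsonl"].append(record)
    return partitions
```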
Map/Reduce Workflows

Multiple map/reduce jobs in sequence
Flow

Sequence of manipulations on the data
… that can be mapped to one or more map/reduce jobs

Tools exist to model these workflows
… instead of manually creating the jobs

Especially important for ETL processes
… Extract, Transform, Load
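A toy flow, sketched as two map/reduce jobs in sequence; the `map_reduce` helper is a hypothetical single-machine stand-in for a framework, shown only to illustrate chaining:

```python
# Two map/reduce jobs in sequence: word count, then a job that inverts the
# result to group words by their count, consuming job 1's output.
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Single-machine stand-in: map, shuffle (sort+group), reduce."""
    pairs = sorted(kv for r in records for kv in mapper(r))
    return [reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

# Job 1: word count
counts = map_reduce(["a b a", "b a"],
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda word, ones: (word, sum(ones)))

# Job 2: invert to count -> words
by_count = map_reduce(counts,
                      mapper=lambda kv: [(kv[1], kv[0])],
                      reducer=lambda n, words: (n, sorted(words)))
```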
Cascading

Model the whole process
… a framework to optimise the mapping
https://github.com/Cascading/CoPA/wiki
Stream Processing

Stream processing instead of batch processing?
Stream Data Processing
Two basic stream processing types

One-at-a-time
- Each event is processed individually

Micro-batch
- Multiple events are combined into a single batch
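The two consumption styles can be contrasted in a few lines; `process` stands in for arbitrary downstream logic, and the batch size is an arbitrary choice for the sketch:

```python
# One-at-a-time vs micro-batch consumption, side by side.
def one_at_a_time(events, process):
    for event in events:               # each event handled individually
        process([event])

def micro_batch(events, process, batch_size=3):
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:   # hand over a whole batch at once
            process(batch)
            batch = []
    if batch:                          # flush the final partial batch
        process(batch)
```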
Basic execution semantics

At-least-once
- Each event is guaranteed to be processed
- But it might be processed more than once (thus the same result might be reported multiple times)
- Might introduce inaccuracies

At-most-once
- Each event is processed at most once, but possibly not at all

Exactly-once
- Each event is guaranteed to be processed exactly once
Trade-off between the stream processing types

                            One-at-a-time   Micro-Batch
Lower Latency               X
Higher Throughput                           X
At-least-once semantic      X               X
Exactly-once semantic       (sometimes)     X
Simpler programming model   X
Implications

Exactly-once can be achieved via strictly ordered processing
- Fully process an event before continuing

More efficient for micro-batches
- Multiple batches in parallel
- Need to store the batch-id of the last successfully processed batch
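The batch-id technique can be sketched as follows. In a real system the state and the batch-id must be committed together atomically; this in-memory version only illustrates why a replayed batch does no harm:

```python
# Exactly-once sketch for micro-batches: storing the id of the last
# successfully processed batch with the state lets a replayed batch
# (at-least-once delivery) be detected and skipped, not double-counted.
state = {"total": 0, "last_batch_id": -1}

def process_batch(batch_id, events):
    if batch_id <= state["last_batch_id"]:
        return                            # replay after a failure: skip
    state["total"] += sum(events)
    state["last_batch_id"] = batch_id     # commit state and id together

process_batch(0, [1, 2])
process_batch(0, [1, 2])   # duplicate delivery is ignored
process_batch(1, [3])
```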
                 Strength          Examples
Batch            High throughput   Hadoop, Spark
One-at-a-time    Low latency       Storm
Micro-batch      Trade-off         Trident, Spark

Table: Comparison of the main distributed architectures
Asynchronous Architectures - Storm Stream Processing
Storm architecture

Basis of the Apache Storm software

Developed for stream processing

Improvement over the queues-and-workers architecture

Does not need intermediary queues
- Instead, tracks the data using a DAG
- An efficient implementation does exist
Storm model

Tuples: the data items

Streams: consist of a sequence of tuples

Spouts: produce streams

Bolts: process streams (i.e. consume and produce tuples)
- Any number of in or out streams

Tasks: spouts or bolts
- Inherently parallel
Topology: how the spouts and bolts are connected
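The vocabulary above can be illustrated with plain Python generators; this is not the Storm API, only the concepts, and all names are invented for the sketch:

```python
# Storm concepts in plain Python: a spout produces a stream of tuples,
# bolts consume and produce, and the topology wires them together.
def sentence_spout():
    yield from ["stream of tuples", "tuples of data"]   # produces a stream

def split_bolt(stream):
    for sentence in stream:        # consumes tuples, produces tuples
        yield from sentence.split()

def count_bolt(stream):
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: how the spout and the bolts are connected
counts = count_bolt(split_bolt(sentence_spout()))
```

In Storm each of these stages would run as many parallel tasks across the cluster; here they run as a single pipeline.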
Strategies to partition the streams (stream grouping)

Randomly
- Called shuffle grouping
- Splits the tuples evenly

By data value
- Called field grouping
Specialised: All grouping, specific bolts, …
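Shuffle and field grouping can be sketched for a fixed number of parallel bolt tasks; these are hypothetical helper functions, not Storm's implementation:

```python
# Shuffle grouping spreads tuples evenly across tasks; field grouping routes
# all tuples with the same value of the grouping field to the same task.
def shuffle_grouping(tuples, n_tasks):
    return [tuples[i::n_tasks] for i in range(n_tasks)]   # round-robin split

def field_grouping(tuples, n_tasks, field):
    tasks = [[] for _ in range(n_tasks)]
    for t in tuples:
        tasks[hash(t[field]) % n_tasks].append(t)  # same value -> same task
    return tasks
```

Field grouping is what makes stateful bolts (e.g. per-user counters) correct: every tuple for a given value is seen by exactly one task.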
The End

Next: Discussion of student projects (January)