components of big dataeldawy/18fcs226/slides/cs... · components of big data 10/01/2018 25. storage...

Components of Big Data

10/01/2018 25

Storage of Big Data

Data is growing faster than Moore’s Law

Too much data to fit on a single machine

Partitioning

Replication

Fault-tolerance

Hadoop Distributed File System (HDFS)

The most widely used distributed file system

Fixed-sized partitioning

3-way replication

Write-once read-many

128MB 128MB 128MB 128MB 128MB 128MB …
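As a rough illustration of fixed-size partitioning with 3-way replication, the sketch below splits a file into 128 MB blocks and assigns each block to three distinct nodes. The function names and the round-robin placement are assumptions made for illustration only; HDFS's real placement policy is rack-aware and is not modeled here.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a hypothetical stand-in for a real policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    print(blocks)
    print(place_replicas(len(blocks), ["n1", "n2", "n3", "n4"]))
```

Because every block has three copies on different nodes, losing any single node still leaves two replicas of each block it held.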

Indexing

Data-aware organization

Global Index partitions the records into blocks

Local Indexes organize the records in a partition

Challenges:

Big volume

HDFS limitation

New programming paradigms

Ad-hoc indexes

Global index

Local indexes
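The global/local split can be sketched with a toy one-dimensional index. Here the global index is a sorted list of split keys that routes a key to a partition, and each local index is simply a sorted list searched with binary search. The class and method names are hypothetical, and real systems would use richer structures (for example B-trees or R-trees) inside each partition.

```python
import bisect


class TwoLevelIndex:
    """Global index routes keys to partitions; local indexes keep each
    partition's keys sorted so they can be binary-searched."""

    def __init__(self, split_keys):
        self.split_keys = sorted(split_keys)  # global index
        self.partitions = [[] for _ in range(len(self.split_keys) + 1)]

    def _partition_of(self, key):
        # Partition i holds keys in [split_keys[i-1], split_keys[i]).
        return bisect.bisect_right(self.split_keys, key)

    def insert(self, key):
        # Local index: keep the partition sorted on insertion.
        bisect.insort(self.partitions[self._partition_of(key)], key)

    def range_query(self, lo, hi):
        """Return all keys in [lo, hi] (lo <= hi), visiting only the
        partitions whose key range overlaps the query."""
        result = []
        for p in range(self._partition_of(lo), self._partition_of(hi) + 1):
            part = self.partitions[p]
            start = bisect.bisect_left(part, lo)
            end = bisect.bisect_right(part, hi)
            result.extend(part[start:end])
        return result
```

The global index prunes whole partitions before any data is read, and the local index avoids scanning a partition from start to end.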

Fault Tolerance

Replication

Redundancy

Multiple masters

Streaming

Sub-second latency for queries

One scan over the data

(Partial) preprocessing

Continuous queries

Eviction strategies

In-memory indexes

…1000100010101011101110101010110111010111011101110100…

Processing window
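A continuous query over a processing window can be sketched as a running sum that is updated in one pass and evicts items once they fall out of the window. This is a minimal sketch that assumes time-based eviction and timestamps arriving in non-decreasing order; the class name is hypothetical.

```python
from collections import deque


class SlidingWindowSum:
    """Running sum over the items of the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.items = deque()  # (timestamp, value) pairs in arrival order
        self.total = 0

    def add(self, timestamp, value):
        """Ingest one item and return the updated answer of the query."""
        self.items.append((timestamp, value))
        self.total += value
        self._evict(timestamp)
        return self.total

    def _evict(self, now):
        # Eviction strategy: drop items that are `window` or more units old.
        while self.items and self.items[0][0] <= now - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
```

Each item is added once and evicted at most once, so the answer is maintained without rescanning the stream.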

Task Execution

MapReduce

Map-Shuffle-Reduce

Resiliency through materialization

Resilient Distributed Datasets (RDD)

Directed-Acyclic-Graph (DAG)

In-memory processing

Resiliency through lineages
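Resiliency through lineage can be illustrated with a toy dataset that remembers how it was derived rather than relying on a stored copy. This is not Spark's API; the class and its methods are hypothetical, and the sketch only shows that a lost in-memory result can be recomputed from its parent.

```python
class LineageDataset:
    """A toy dataset that records its parent and the transformation
    applied to it, so a lost result can be recomputed."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data (root dataset only)
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's rows
        self.cache = None           # in-memory result; may be lost

    def map(self, fn):
        return LineageDataset(
            parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(
            parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self.cache is None:
            # Not in memory: rebuild by replaying the lineage.
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self.cache = None
```

In contrast to materializing intermediate results on disk, only the recipe is kept, and recomputation happens on demand.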

Hyracks

Stragglers

Load balance

M1 M2 … Mm

R1 R2 Rn
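The Map-Shuffle-Reduce flow can be sketched on a single machine with word count as the example job. The helper names are hypothetical, and the sketch leaves out distribution, materialization to disk, and straggler handling.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group all values by key, as the shuffle does between M and R tasks."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    return sum(values)


def run_word_count(lines):
    pairs = map_phase(lines, word_count_mapper)
    return reduce_phase(shuffle_phase(pairs), sum_reducer)


if __name__ == "__main__":
    print(run_word_count(["big data", "Big graphs"]))
```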

Query Optimization

Finding the most efficient query plan

e.g., grouped aggregation

Cost model (CPU – Disk – Network)

Partition
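To make the cost-model idea concrete, the sketch below compares two plans for a grouped aggregation: shuffling every record and then aggregating, versus aggregating partially in each partition before the shuffle. The cost weights, the plan names, and the size estimates are all assumptions chosen for illustration, not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    cpu_ops: float
    disk_bytes: float
    network_bytes: float


def cost(plan, cpu_w=1e-9, disk_w=1e-8, net_w=1e-7):
    """Weighted sum of CPU, disk, and network work (hypothetical weights)."""
    return (plan.cpu_ops * cpu_w
            + plan.disk_bytes * disk_w
            + plan.network_bytes * net_w)


def grouped_aggregation_plans(num_records, record_bytes, num_groups,
                              num_partitions):
    total_bytes = num_records * record_bytes
    # Plan 1: send every record over the network, then aggregate.
    shuffle_all = Plan("shuffle-then-aggregate",
                       num_records, total_bytes, total_bytes)
    # Plan 2: each partition emits at most one partial result per group.
    partials = min(num_records, num_groups * num_partitions)
    pre_aggregate = Plan("partial-aggregate-then-shuffle",
                         num_records + partials, total_bytes,
                         partials * record_bytes)
    return [shuffle_all, pre_aggregate]


def best_plan(plans):
    return min(plans, key=cost)
```

With few groups, partial aggregation shrinks the network term sharply; when nearly every record has its own group, the extra CPU work buys nothing and the simpler plan is cheaper under this model.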

Provenance

Debugging in distributed systems is painful

We need to keep track of transformations on each record
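One simple way to keep track of transformations per record is to carry a history alongside each value. The sketch below is a minimal, single-machine illustration with hypothetical names; it records only step names, not the inputs of each step.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedRecord:
    value: object
    history: list = field(default_factory=list)  # applied step names, in order


def apply_step(records, name, fn):
    """Apply fn to every record and append the step name to its history."""
    out = []
    for rec in records:
        out.append(TrackedRecord(fn(rec.value), rec.history + [name]))
    return out


if __name__ == "__main__":
    recs = [TrackedRecord(" 42 ")]
    recs = apply_step(recs, "strip", str.strip)
    recs = apply_step(recs, "to_int", int)
    print(recs[0])
```

When an output looks wrong, its history shows which steps produced it without rerunning the whole job.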

Big Graphs

Motivated by social networks

Billions of nodes and trillions of edges

Tens of thousands of insertions per second

Complex queries with graph traversals
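A typical traversal query on a social graph is "who is within k hops of this user?". The sketch below answers it with breadth-first search over an in-memory adjacency list that also accepts edge insertions. The class name is hypothetical, and a graph with billions of nodes would need to be partitioned across machines, which this sketch does not attempt.

```python
from collections import defaultdict, deque


class Graph:
    """Undirected graph stored as an adjacency list."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def within_hops(self, start, max_hops):
        """Return all nodes reachable from start in 1..max_hops hops."""
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if distance[node] == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbor in self.adj.get(node, ()):
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        return {n for n, d in distance.items() if d > 0}
```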

Hadoop Ecosystem

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

MapReduce Query Engine

Administration

Spark Ecosystem

Yet Another Resource Negotiator (YARN)

Resilient Distributed Dataset (RDD) a.k.a. Spark Core

Data Frames

MLlib

GraphX

SparkR

Spark Streaming

Spark SQL

Kubernetes

Hyracks Data-parallel Platform

Algebricks Algebra Layer

Hadoop MapReduce Compatibility

Pregelix

Hivesterix

AsterixDB

Other compilers

Hyracks

Pregel

MapReduce

PigLatin

HiveQL

AsterixQL

Impala

Query Executor

Query Planner

Query Parser

SpatialHadoop

Hadoop Distributed File System (HDFS) + Spatial Indexing

MapReduce Processing + Spatial Query Processing

Spatial Visualization

Pig Latin + Pigeon

Reading Material

“The Age of Analytics in a Data-driven World”

[Executive Summary]

by McKinsey & Company
