cs226 big-data managementeldawy/19fcs226/slides/cs226-01... · 2019-09-30 · project groups of 4-5...


CS226

Big-Data Management

Instructor: Ahmed Eldawy


Welcome (back) to UCR!


Class information

Classes: Monday, Wednesday, Friday 1:00 – 1:50 PM at Humanities and Social Sciences 1501

Instructor: Ahmed Eldawy

TA: Saheli Ghosh

Office hours: TBD

Website:

http://www.cs.ucr.edu/~eldawy/19FCS226/

iLearn (Any UCRX students?)

Email: [email protected]

Subject: “[CS226] …”


Course work

Active participation in the class (5%)

Reading and review tasks (10%)

Assignments (20%)

Mid-term (15%)

Project (50%)


Project

Groups of 4-5 students

Milestones

Group Selection

Project proposal (5%)

Literature survey (10%)

Report outline (5%)

Class presentation (5%)

Final report (15%)

Poster presentation (10%)


Course goals

What are your goals?

Understand what big data means

Identify the internal components of big data platforms

Recognize the differences between different big data platforms

Explain how a distributed query runs on big data


Big-data Expert

Understand how the big-data platforms really work

Control those thousands of processors efficiently to carry out your task


Syllabus

Overview of big data

Big-data storage

Big-data processing

Big-data indexing

Big-SQL processing

Programming packages


Introduction



Jan 2012: World Economic Forum Report


Interest in Big Data in the US

■ March 2012: Obama administration unveils BIG DATA initiative: $200 Million in R&D investment

■ June 2013: Washington Post is calling Obama “The Big Data President”


Interest in Big Data in Europe

March 2014: David Cameron and Angela Merkel talking about Big Data at a computer expo in Hannover, Germany


The Market of Big Data


Four (formerly Three) V’s of Big Data


Big Data vs. Big Computation

Full scans (e.g., log processing)

Range scans

Point lookups

Iterations

Joins (self, binary, or multiway)

Proximity queries

Closures and graph traversals


Big Data Applications

Web search

Marketing and advertising

Data cleaning

Knowledge base

Information retrieval

Internet of Things (IoT)

Visualization

Behavioral studies


Publicly Available Datasets

Data.gov

Data.gov.uk

Twitter Streaming API

Yahoo! Webscope [http://webscope.sandbox.yahoo.com/]

GDELT [http://www.gdeltproject.org/]

Instagram API


Big Data Landscape 2012

http://mattturck.com/2012/06/29/a-chart-of-the-big-data-ecosystem/


Big Data Landscape 2014

http://mattturck.com/2014/05/11/the-state-of-big-data-in-2014-a-chart/


Big Data Landscape 2016

http://mattturck.com/2016/02/01/big-data-landscape/


Big Data Landscape 2018


Components of Big Data


Storage of Big Data

Data is growing faster than Moore’s Law

Too much data to fit on a single machine

Partitioning

Replication

Fault-tolerance


Hadoop Distributed File System (HDFS)

The most widely used distributed file system

Fixed-sized partitioning

3-way replication

Write-once read-many

128MB 128MB 128MB 128MB 128MB 128MB …
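The fixed-size partitioning and 3-way replication above can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the function names, the node names, and the round-robin placement are assumptions made for the example (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for fixed-size blocks covering the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy chosen for simplicity; it only guarantees that
    the replicas of one block land on different nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(1_000_000_000)  # a hypothetical 1 GB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

With this placement, losing any single node still leaves two copies of every block, which is the point of replication.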


Indexing

Data-aware organization

Global Index partitions the records into blocks

Local Indexes organize the records in a partition

Challenges:

Big volume

HDFS limitation

New programming paradigms

Ad-hoc indexes

[Figure: global index and local indexes]
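A minimal sketch of the two-level idea, assuming sorted range partitioning as the global index and a sorted list per partition as the local index. The class and method names are hypothetical, and real systems use richer structures such as B-trees or R-trees.

```python
from bisect import bisect_left, bisect_right


class TwoLevelIndex:
    """Toy global + local index over a non-empty collection of keys."""

    def __init__(self, keys, num_partitions):
        ordered = sorted(keys)
        if not ordered:
            raise ValueError("keys must not be empty")
        size = -(-len(ordered) // num_partitions)  # ceiling division
        # Local indexes: each partition keeps its own records sorted.
        self.partitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # Global index: the smallest key stored in each partition.
        self.boundaries = [p[0] for p in self.partitions]

    def lookup(self, key):
        """Return (partition id, found) while searching only one partition."""
        # Global step: pick the last partition whose smallest key is <= key.
        pid = max(bisect_right(self.boundaries, key) - 1, 0)
        # Local step: binary search inside that partition only.
        part = self.partitions[pid]
        pos = bisect_left(part, key)
        return pid, pos < len(part) and part[pos] == key


index = TwoLevelIndex([15, 3, 42, 8, 23, 16, 4, 35], num_partitions=3)
hit = index.lookup(16)
miss = index.lookup(9)
```

The global step prunes every partition but one, so a point lookup reads a single block instead of scanning the whole dataset.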


Fault Tolerance

Replication

Redundancy

Multiple masters


Streaming

Sub-second latency for queries

One scan over the data

(Partial) preprocessing

Continuous queries

Eviction strategies

In-memory indexes

[Figure: a processing window over a continuous stream of bits]
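The one-scan, continuous-query, and eviction ideas can be sketched with a count-based window. This is a minimal example, assuming the query is "how many 1-bits are in the last N items"; the class name and the oldest-first eviction policy are choices made for illustration.

```python
from collections import deque


class WindowedCounter:
    """Continuous query: how many 1-bits are in the last `window` items?"""

    def __init__(self, window):
        self.window = window
        self.items = deque()  # in-memory state, bounded by the window size
        self.ones = 0         # running answer, updated incrementally

    def push(self, bit):
        """Consume one stream item and return the up-to-date answer."""
        self.items.append(bit)
        self.ones += bit
        if len(self.items) > self.window:
            # Eviction: the oldest item leaves the window.
            self.ones -= self.items.popleft()
        return self.ones


counter = WindowedCounter(window=3)
answers = [counter.push(int(b)) for b in "110111"]
```

Each item is seen exactly once, and the answer is maintained in constant time per item instead of being recomputed from scratch.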


Task Execution

MapReduce

Map-Shuffle-Reduce

Resiliency through materialization

Resilient Distributed Datasets (RDD)

Directed-Acyclic-Graph (DAG)

In-memory processing

Resiliency through lineages

Hyracks

Stragglers

Load balance

[Figure: map tasks M1, M2, …, Mm and reduce tasks R1, R2, …, Rn]
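The Map-Shuffle-Reduce flow can be simulated on a single machine. The sketch below is not Hadoop code: the function names are hypothetical, and word count is used only because it is the customary example.

```python
from collections import defaultdict


def map_phase(splits, mapper):
    """Run the mapper on every record of every input split (M1 ... Mm)."""
    return [[kv for record in split for kv in mapper(record)] for split in splits]


def shuffle_phase(map_outputs, num_reducers):
    """Route each key to exactly one reducer and group its values."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducer_inputs[hash(key) % num_reducers][key].append(value)
    return reducer_inputs


def reduce_phase(reducer_inputs, reducer):
    """Run the reducer once per key on each reducer (R1 ... Rn)."""
    result = {}
    for groups in reducer_inputs:
        for key, values in groups.items():
            result[key] = reducer(key, values)
    return result


def word_count(splits, num_reducers=2):
    mapped = map_phase(splits, lambda line: [(w, 1) for w in line.split()])
    shuffled = shuffle_phase(mapped, num_reducers)
    return reduce_phase(shuffled, lambda word, counts: sum(counts))


counts = word_count([["big data", "big graphs"], ["data data"]])
```

In a real cluster the map outputs are written to disk before the shuffle, which is the "resiliency through materialization" mentioned above: a failed reducer can re-read them without rerunning the mappers.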


Query Optimization

Finding the most efficient query plan

e.g., grouped aggregation

Cost model (CPU – Disk – Network)

[Figure: two alternative plans for grouped aggregation: Agg → Merge vs. Partition → Agg]
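To see why a cost model matters for grouped aggregation, the sketch below compares two plans that return the same answer but ship different amounts of data. It assumes a per-key sum and counts only the network term of the cost model (records shipped); the function names and the sample data are hypothetical.

```python
from collections import Counter


def plan_partition_then_agg(machines):
    """Ship every record to the machine owning its key, then aggregate."""
    shipped = sum(len(records) for records in machines)
    totals = Counter()
    for records in machines:
        for key, value in records:
            totals[key] += value
    return dict(totals), shipped


def plan_agg_then_merge(machines):
    """Aggregate locally first, then ship and merge the partial results."""
    partials = []
    for records in machines:
        local = Counter()
        for key, value in records:
            local[key] += value
        partials.append(local)
    # Only one partial sum per (machine, key) crosses the network.
    shipped = sum(len(p) for p in partials)
    totals = Counter()
    for p in partials:
        totals.update(p)  # Counter.update adds values for matching keys
    return dict(totals), shipped


machines = [
    [("a", 1), ("a", 2), ("b", 5)],
    [("a", 4), ("b", 1), ("b", 1), ("b", 1)],
]
result_1, cost_1 = plan_partition_then_agg(machines)
result_2, cost_2 = plan_agg_then_merge(machines)
```

With few distinct keys, local aggregation ships far less data. When almost every key is unique, it ships the same amount and only adds CPU work, so neither plan wins in all cases.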


Provenance

Debugging in distributed systems is painful

We need to keep track of transformations on each record
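One simple way to picture record-level provenance is to carry a history alongside each value. This is a minimal sketch with a hypothetical `Traced` wrapper; production systems typically store provenance far more compactly.

```python
class Traced:
    """A record value together with the steps that produced it."""

    def __init__(self, value, history=()):
        self.value = value
        self.history = tuple(history)

    def apply(self, name, fn):
        """Apply fn and remember the step name and the value it received."""
        return Traced(fn(self.value), self.history + ((name, self.value),))


record = (
    Traced(" 42 ")
    .apply("strip", str.strip)
    .apply("to_int", int)
    .apply("double", lambda x: x * 2)
)
```

When an output looks wrong, `record.history` shows each intermediate input, so the faulty step can be found without rerunning the whole job.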


Big Graphs

Motivated by social networks

Billions of nodes and trillions of edges

Tens of thousands of insertions per second

Complex queries with graph traversals
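A typical traversal query is "who is within k hops of this user?". The sketch below runs a breadth-first search on an in-memory adjacency map. The graph and the function name are made up for illustration, and a graph with billions of nodes would need a distributed version of the same idea.

```python
from collections import deque


def within_hops(adjacency, source, max_hops):
    """Return {node: hop distance} for nodes reachable within max_hops."""
    distance = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                frontier.append(neighbor)
    return distance


graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
}
nearby = within_hops(graph, "ann", 2)
```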


Hadoop Ecosystem

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

MapReduce Query Engine

Administration

Pig


Spark Ecosystem

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Resilient Distributed Dataset (RDD), a.k.a. Spark Core

Data Frames

MLlib

GraphX

SparkR

Spark Streaming

Spark SQL


Kubernetes


Hyracks Data-parallel Platform

Algebricks Algebra Layer

Hadoop MapReduce Compatibility

Pregelix

Hivesterix

AsterixDB

Other compilers

Hyracks jobs

Pregel jobs

MapReduce jobs

PigLatin

HiveQL

AsterixQL


Impala

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Query Executor

Query Planner

Query Parser


SpatialHadoop

Hadoop Distributed File System (HDFS) + Spatial Indexing

Yet Another Resource Negotiator (YARN)

MapReduce Processing + Spatial Query Processing

Spatial Visualization

Pig Latin + Pigeon


Reading Material

“The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company
