COSC 6397
Big Data Analytics
Fundamentals
Edgar Gabriel
Fall 2018
Overview
• Data Characteristics
• Performance Characteristics
• Platform Considerations
What makes large scale Data Analysis hard?
• Often summarized as VVVV
• Volume:
– 5 Exabytes of data created until 2003
– The same amount of data created in 2011 in two days
– Estimate for 2013: the same amount of data created in
10 minutes
– Example: a communication service provider with 100
million customers generates ~5 petabytes of location data
per day
From WWW to VVVV
• Velocity:
– Throughput: amount of data moved ‘through the pipes’
• Mobile data volumes growing at 78% per year
• Expected to reach 10.8 exabytes per month in 2016
– Latency:
• Analytics used to be ‘store and report’
– Data shown was from ‘yesterday’
• Real-time analytics gaining popularity
– Some services available which guarantee analysis in
10ms
From WWW to VVVV
• Variety
– Data comes from a variety of sources in different formats
– Example: call center which needs to integrate
information from
• Trouble ticket
• Conversation
• Social media blogs
From WWW to VVVV
• Veracity
– Data suffers from significant correctness and accuracy
problems
– Credibility: e.g. social media response to a campaign
should not be based on third party ‘likes’
• ‘likes’ can be purchased
• Response by disgruntled employees
– Audience Suitability
• Customer service identifying a problem in a product
has to share the information selectively
Analyzing large data volumes
• Large:
– More data than can be processed on a single ‘PC’
– Takes too long to be processed on a single ‘PC’
• Three questions
– How to utilize multiple processors
– How to evaluate whether we did a good job in using
multiple processors
– Administrative options for using multiple processors for
large scale analysis
Performance Metrics (I)
• Speedup: how much faster does a problem run on p
processors compared to 1 processor?
– Optimal: S(p) = p (linear speedup)
• Parallel Efficiency: Speedup normalized by the number
of processors
– Optimal: E(p) = 1.0
S(p) = T_total(1) / T_total(p)

E(p) = S(p) / p
Performance Metrics (II)
• Example: Application A takes 35 min. on a single
processor, 27 on two processors and 18 on 4 processors.
S(2) = 35 / 27 = 1.29    E(2) = 1.29 / 2 = 0.645

S(4) = 35 / 18 = 1.94    E(4) = 1.94 / 4 = 0.485
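The two metrics and the example above can be checked with a small helper (a sketch; the function names are mine, not from the slides):

```python
def speedup(t_serial, t_parallel):
    """Speedup S(p) = T_total(1) / T_total(p)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency E(p) = S(p) / p."""
    return speedup(t_serial, t_parallel) / p

# Application A: 35 min on 1 processor, 27 min on 2, 18 min on 4
for p, t in [(2, 27), (4, 18)]:
    print(f"S({p}) = {speedup(35, t):.2f}, E({p}) = {efficiency(35, t, p):.3f}")
```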
Amdahl’s Law (I)
• Basic idea: most applications have a (small) sequential
fraction, which limits the speedup
f: fraction of the code which can only be executed
sequentially
T_total = T_sequential + T_parallel = f * T_total + (1 - f) * T_total

S(p) = T_total(1) / T_total(p)
     = T_total / (f * T_total + (1 - f) * T_total / p)
     = 1 / (f + (1 - f) / p)
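The closed form above is easy to evaluate numerically; a minimal sketch (the function name is mine):

```python
def amdahl_speedup(f, p):
    """Amdahl's Law: S(p) = 1 / (f + (1 - f) / p).

    f: fraction of the code which can only be executed sequentially
    p: number of processors
    """
    return 1.0 / (f + (1.0 - f) / p)

# Even a small sequential fraction caps the achievable speedup:
# as p grows, S(p) approaches 1/f (here 1/0.05 = 20).
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
```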
Example for Amdahl’s Law
[Figure: speedup S(p) for p = 1 to 64 processors, plotted for f = 0, f = 0.05, f = 0.1, and f = 0.2]
Amdahl’s Law (II)
• Amdahl's Law assumes that the problem size is
constant
• In most applications, the sequential part is independent
of the problem size, while the part which can be
executed in parallel is not.
Performance Metrics (III)
• Scaleup: ratio of the execution time of a problem of
size n on 1 processor to the execution time of the same
problem of size n*p on p processors
– Optimally, the execution time remains constant, e.g.

S_c(p) = T_total(1, n) / T_total(p, n * p)

T_total(p, n) = T_total(2p, 2n)
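A sketch of the metric (names are mine); with perfect scaleup the two execution times are equal and the ratio stays at 1.0:

```python
def scaleup(t_1_n, t_p_pn):
    """Scaleup S_c(p) = T_total(1, n) / T_total(p, n * p).

    t_1_n:  time for a problem of size n on 1 processor
    t_p_pn: time for a problem of size n * p on p processors
    """
    return t_1_n / t_p_pn

# Ideal case: growing the problem and the machine together
# leaves the execution time unchanged, so S_c(p) = 1.0
print(scaleup(120.0, 120.0))   # 1.0
print(scaleup(120.0, 150.0))   # 0.8 -> sub-linear scaleup
```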
Cluster Computing
• Cluster: collection of individual PC’s (compute nodes)
connected by a (high performance) network interconnect
– Each compute node is an independent entity with its own
• Processor
• Main memory
• One or multiple networking cards
– All compute nodes typically have access to a shared file
system (e.g. Network File System (NFS) )
• Removes the necessity to replicate programs and data
on all compute nodes
• All accesses to files require communication over the
network
Conceptual View
[Figure: several compute nodes, each with multiple CPUs, main memory, a network card, and a hard drive, connected by a network interconnect]
Cluster Components (I)
• Compute nodes mostly based on regular PC technology
– Intel or AMD processors
– 1-4GB of main memory per core
• Operating Systems: typically Linux/UNIX
• Management of resources: cluster scheduler
– Manages allocation of compute nodes to users
Cluster Components (II)
• Networking metrics:
– Latency: minimal time to send a very short message from one communication endpoint to an other endpoint
• Unit: ms, μs
– Bandwidth: amount of data which can be transferred from one processor to another in a certain time frame
• Unit: Bytes/sec, …, GB/s; Bits/sec,…, Gb/s
• Off-the-shelf technology vs. high-end technology
– Gigabit-Ethernet, 10GE, 40 GE, InfiniBand, Omnipath,…
– Most clusters contain both a high-end and a low-end network
interconnect
• Network Topology of importance for large clusters
– If more than one switch is required, how are the nodes connected?
– Metric: Bisection bandwidth
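The two metrics combine into a common linear cost model for moving a message between nodes (a sketch; the model and the sample link numbers are illustrative assumptions, not values from the slides):

```python
def transfer_time(msg_bytes, latency, bandwidth):
    """Linear cost model: time = latency + message size / bandwidth.

    latency in seconds, bandwidth in bytes/sec.
    """
    return latency + msg_bytes / bandwidth

# Hypothetical Gigabit-Ethernet-like link: 50 us latency, 125 MB/s
lat, bw = 50e-6, 125e6
t_small = transfer_time(8, lat, bw)       # short message: latency dominates
t_large = transfer_time(100e6, lat, bw)   # 100 MB: bandwidth dominates
print(t_small, t_large)
```

This is why latency matters for many small messages while bandwidth matters for bulk transfers.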
Parallel Databases
• A parallel database system seeks to improve performance
through parallelization of various operations
– data is stored in a distributed fashion
– distribution is governed by performance considerations.
– improves processing and input/output speeds by using
multiple CPUs and disks in parallel.
• Parallel databases often use multiprocessor architecture
– Shared memory architecture: multiple processors share
the main memory space
– Shared disk architecture: each node has its own main
memory, all nodes share mass storage
– Shared nothing architecture: each node has its own mass
storage as well as main memory.
Advantages of Parallel Database Systems
• System uses an optimizer to translate e.g. SQL
commands into a query plan whose execution is divided
among compute nodes
• High level programming (SQL) does not require any
knowledge of underlying hardware
• A lot of data is already stored in database systems
• 20+ years of experience in parallel database systems
SQL vs. NoSQL database systems
• Relational database systems have certain requirements on the data
format
– Difficult to handle irregular, unstructured or incomplete data
sets
– Database systems not efficient in adding large data volumes
– Price of large scale commercial SQL database systems a major
factor
• NoSQL database systems: specialized database systems for various
application scenarios; they do not require a fixed schema
– Key-value stores
– Column-oriented databases
– Document databases
– Graph databases
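The schema-free point can be illustrated with a toy in-memory key-value store (just a Python dict; the keys and values are made up): every value can have a different structure, which a relational table could not accommodate directly.

```python
# Toy key-value store: no fixed schema, each value can differ in shape.
store = {}
store["user:42"] = {"name": "Alice", "likes": 17}
store["user:43"] = {"name": "Bob", "tickets": ["T-101", "T-102"]}
store["page:home"] = "<html>...</html>"   # not even a dictionary

# Lookup is always by key; interpreting the value is left
# to the application, not to the database schema.
print(store["user:42"]["name"])   # Alice
```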
Cloud Computing
• Cloud Computing: general term used to describe a
class of network-based computing
– a collection/group of integrated and networked
hardware, software and Internet infrastructure (called a
platform).
– uses the Internet for communication and transport to
provide hardware, software and networking services to
clients
• Hides the complexity and details of the underlying
infrastructure from users and applications by providing
very simple graphical interface or API
Cloud Computing (II)
• The platform provides on-demand services that are
always on, available anytime and anywhere.
– Pay for use and as needed
– Scale up and down in capacity and functionality
• The hardware and software services are available to
– general public, enterprises, corporations and businesses
markets
• Services or data are hosted on remote infrastructure
Cloud Service Models
• Software as a Service (SaaS):
– execute a specific application required for business /
research
• Platform as a Service (PaaS):
– deploy customer created applications
• Infrastructure as a Service (IaaS):
– rent processing and compute capacity, storage, etc.
Cloud Computing Summary
• Positive
– Reduces the need for local IT infrastructure
– Scalability
– Reliability not a major concern
– Implicit software updates
• Negative
– No performance guarantees – utilization of shared resources
– Privacy, security, compliance, trust
– Need to evaluate utilization/costs benefits
Comparison of the platforms

                                   Cluster      Parallel     Cloud
                                   computing    Database     Computing
Initial hardware investment costs  High         High         Zero
Initial software investment costs  Low          High         Zero
Maintenance costs                  Low          Low-Medium   Zero
Software development efforts       High         Low          Low…High
Software flexibility               High         Low          Low…High
Efficiency                         High         High         Low
Costs per job                      Low          Low          High