COSC 6397
Big Data Analytics
Fundamentals
Edgar Gabriel
Fall 2018
Overview
• Data Characteristics
• Performance Characteristics
• Platform Considerations
What makes large scale Data Analysis hard?
• Often summarized as VVVV
• Volume:
– 5 Exabytes of data created until 2003
– The same amount of data created in 2011 in two days
– Estimate for 2013: the same amount of data created in
10 minutes
– Example: a communication service provider with 100
million customers generates ~5 petabytes of location data
per day
From WWW to VVVV
• Velocity:
– Throughput: amount of data moved ‘through the pipes’
• Mobile data volumes growing at 78% per year
• Expected to reach 10.8 exabytes per month in 2016
– Latency:
• Analytics used to be ‘store and report’
– Data shown was from ‘yesterday’
• Real-time analytics gaining popularity
– Some services available which guarantee analysis in
10ms
From WWW to VVVV
• Variety
– Data comes from a variety of sources in different formats
– Example: call center which needs to integrate
information from
• Trouble ticket
• Conversation
• Social media blogs
From WWW to VVVV
• Veracity
– Data suffers from significant correctness and accuracy
problems
– Credibility: e.g. social media response to a campaign
should not be based on third party ‘likes’
• ‘likes’ can be purchased
• Response by disgruntled employees
– Audience Suitability
• Customer service identifying a problem in a product
has to share the information selectively
Analyzing large data volumes
• Large:
– More data than can be processed on a single ‘PC’
– Takes too long to be processed on a single ‘PC’
• Three questions
– How to utilize multiple processors
– How to evaluate whether we did a good job in using
multiple processors
– Administrative options for using multiple processors for
large scale analysis
Performance Metrics (I)
• Speedup: how much faster does a problem run on p
processors compared to 1 processor?
– Optimal: S(p) = p (linear speedup)
• Parallel Efficiency: Speedup normalized by the number
of processors
– Optimal: E(p) = 1.0
S(p) = T_total(1) / T_total(p)

E(p) = S(p) / p
Performance Metrics (II)
• Example: Application A takes 35 min. on a single
processor, 27 on two processors and 18 on 4 processors.
S(2) = 35 / 27 = 1.29    E(2) = 1.29 / 2 = 0.645

S(4) = 35 / 18 = 1.94    E(4) = 1.94 / 4 = 0.485
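The two metrics and the example above can be checked with a small helper (a sketch; the function names are mine, not from the slides):

```python
def speedup(t_serial, t_parallel):
    """Speedup S(p) = T_total(1) / T_total(p)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency E(p) = S(p) / p."""
    return speedup(t_serial, t_parallel) / p

# Application A: 35 min on 1 processor, 27 min on 2, 18 min on 4
for p, t in [(2, 27), (4, 18)]:
    print(f"S({p}) = {speedup(35, t):.2f}, E({p}) = {efficiency(35, t, p):.3f}")
```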
Amdahl’s Law (I)
• Basic idea: most applications have a (small) sequential
fraction, which limits the speedup
f: fraction of the code which can only be executed
sequentially
T_total = T_sequential + T_parallel = f * T_total + (1 - f) * T_total

S(p) = T_total(1) / T_total(p)
     = T_total / (f * T_total + (1 - f) * T_total / p)
     = 1 / (f + (1 - f) / p)
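The closed form above is easy to evaluate numerically; a minimal sketch (the function name is mine):

```python
def amdahl_speedup(f, p):
    """Amdahl's Law: S(p) = 1 / (f + (1 - f) / p).

    f: fraction of the code which can only be executed sequentially
    p: number of processors
    """
    return 1.0 / (f + (1.0 - f) / p)

# Even a small sequential fraction caps the achievable speedup:
# as p grows, S(p) approaches 1/f (here 1/0.05 = 20).
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
```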
Example for Amdahl’s Law
[Figure: speedup S(p) for p = 1 to 64 processors, plotted for f = 0, f = 0.05, f = 0.1, and f = 0.2]
Amdahl’s Law (II)
• Amdahl's Law assumes that the problem size is
constant
• In most applications, the sequential part is independent
of the problem size, while the part which can be
executed in parallel is not.
Performance Metrics (III)
• Scaleup: ratio of the execution time of a problem of
size n on 1 processor to the execution time of the same
problem of size n*p on p processors
– Optimally, the execution time remains constant, e.g.

S_c(p) = T_total(1, n) / T_total(p, n * p)

T_total(p, n) = T_total(2p, 2n)
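A sketch of the metric (names are mine); with perfect scaleup the two execution times are equal and the ratio stays at 1.0:

```python
def scaleup(t_1_n, t_p_pn):
    """Scaleup S_c(p) = T_total(1, n) / T_total(p, n * p).

    t_1_n:  time for a problem of size n on 1 processor
    t_p_pn: time for a problem of size n * p on p processors
    """
    return t_1_n / t_p_pn

# Ideal case: growing the problem and the machine together
# leaves the execution time unchanged, so S_c(p) = 1.0
print(scaleup(120.0, 120.0))   # 1.0
print(scaleup(120.0, 150.0))   # 0.8 -> sub-linear scaleup
```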
Cluster Computing
• Cluster: collection of individual PC’s (compute nodes)
connected by a (high performance) network interconnect
– Each compute node is an independent entity with its own
• Processor
• Main memory
• One or multiple networking cards
– All compute nodes typically have access to a shared file
system (e.g. Network File System (NFS) )
• Removes the necessity to replicate programs and data
on all compute nodes
• All accesses to files require communication over the
network
Conceptual View
[Figure: several compute nodes, each with multiple CPUs, main memory, a network card, and a hard drive, connected by a network interconnect]
Cluster Components (I)
• Compute nodes mostly based on regular PC technology
– Intel or AMD processors
– 1-4GB of main memory per core
• Operating Systems: typically Linux/UNIX
• Management of resources: cluster scheduler
– Manages allocation of compute nodes to users
Cluster Components (II)
• Networking metrics:
– Latency: minimal time to send a very short message from one communication endpoint to an other endpoint
• Unit: ms, μs
– Bandwidth: amount of data which can be transferred from one processor to another in a certain time frame
• Unit: Bytes/sec, …, GB/s; Bits/sec,…, Gb/s
• Off-the-shelf technology vs. high-end technology
– Gigabit-Ethernet, 10GE, 40 GE, InfiniBand, Omnipath,…
– Most clusters contain both a high-end and a low-end network
interconnect
• Network Topology of importance for large clusters
– If more than one switch is required, how are the nodes connected?
– Metric: Bisection bandwidth
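The two metrics combine into a common linear cost model for moving a message between nodes (a sketch; the model and the sample link numbers are illustrative assumptions, not values from the slides):

```python
def transfer_time(msg_bytes, latency, bandwidth):
    """Linear cost model: time = latency + message size / bandwidth.

    latency in seconds, bandwidth in bytes/sec.
    """
    return latency + msg_bytes / bandwidth

# Hypothetical Gigabit-Ethernet-like link: 50 us latency, 125 MB/s
lat, bw = 50e-6, 125e6
t_small = transfer_time(8, lat, bw)       # short message: latency dominates
t_large = transfer_time(100e6, lat, bw)   # 100 MB: bandwidth dominates
print(t_small, t_large)
```

This is why latency matters for many small messages while bandwidth matters for bulk transfers.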
Parallel Databases
• A parallel database system seeks to improve performance
through parallelization of various operations
– data is stored in a distributed fashion
– distribution is governed by performance considerations.
– improves processing and input/output speeds by using
multiple CPUs and disks in parallel.
• Parallel databases often use multiprocessor architecture
– Shared memory architecture: multiple processors share
the main memory space
– Shared disk architecture: each node has its own main
memory, all nodes share mass storage
– Shared nothing architecture: each node has its own mass
storage as well as main memory.
Advantages of Parallel Database Systems
• System uses an optimizer to translate e.g. SQL
commands into a query plan whose execution is divided
among compute nodes
• High level programming (SQL) does not require any
knowledge of underlying hardware
• A lot of data is already stored in database systems
• 20+ years of experience in parallel database systems
SQL vs. NoSQL database systems
• Relational database systems have certain requirements on the data
format
– Difficult to handle irregular, unstructured or incomplete data
sets
– Database systems not efficient in adding large data volumes
– Price of large scale commercial SQL database systems a major
factor
• NoSQL database systems: specialized database systems for various
application scenarios; they do not require a fixed schema
– Key-value stores
– Column-oriented databases
– Document databases
– Graph databases
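The schema-free point can be illustrated with a toy in-memory key-value store (just a Python dict; the keys and values are made up): every value can have a different structure, which a relational table could not accommodate directly.

```python
# Toy key-value store: no fixed schema, each value can differ in shape.
store = {}
store["user:42"] = {"name": "Alice", "likes": 17}
store["user:43"] = {"name": "Bob", "tickets": ["T-101", "T-102"]}
store["page:home"] = "<html>...</html>"   # not even a dictionary

# Lookup is always by key; interpreting the value is left
# to the application, not to the database schema.
print(store["user:42"]["name"])   # Alice
```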
Cloud Computing
• Cloud Computing: general term used to describe a
class of network-based computing
– a collection/group of integrated and networked
hardware, software and Internet infrastructure (called a
platform).
– uses the Internet for communication and transport to
provide hardware, software and networking services to
clients
• Hides the complexity and details of the underlying
infrastructure from users and applications by providing
very simple graphical interface or API
Cloud Computing (II)
• The platform provides on-demand services that are
always on, available anytime and anywhere.
– Pay for use and as needed
– Scale up and down in capacity and functionality
• The hardware and software services are available to
– general public, enterprises, corporations and businesses
markets
• Services or data are hosted on remote infrastructure
Cloud Service Models
• Software as a Service (SaaS):
– execute a specific application required for business /
research
• Platform as a Service (PaaS):
– deploy customer created applications
• Infrastructure as a Service (IaaS):
– rent processing and compute capacity, storage, etc.
Cloud Computing Summary
• Positive
– Reduces the need for local IT infrastructure
– Scalability
– Reliability not a major concern
– Implicit software updates
• Negative
– No performance guarantees – utilization of shared resources
– Privacy, security, compliance, trust
– Need to evaluate utilization/costs benefits
Comparison of the platforms

                                   Cluster      Parallel     Cloud
                                   computing    Database     Computing
Initial hardware investment costs  High         High         Zero
Initial software investment costs  Low          High         Zero
Maintenance costs                  Low          Low-Medium   Zero
Software development efforts       High         Low          Low…High
Software flexibility               High         Low          Low…High
Efficiency                         High         High         Low
Costs per job                      Low          Low          High