Source: …cssongguo/papers/bigdata14-ppt.pdf · 2014
TRANSCRIPT
COST MINIMIZATION FOR BIG
DATA PROCESSING IN GEO-
DISTRIBUTED DATA CENTERS
Song Guo
The University of Aizu
Homepage: http://www.u-aizu.ac.jp/~sguo
Email: [email protected]
System Model
• Topology:
– geo-distributed data centers (DCs) connected with switches
• Cost:
– Inter-DC cost “CR” vs Intra-DC cost “CL”
– Server cost when a server is turned on
What Are the Problems?
• Where to put the data and computation? (Data & task placement)
– Same server : “0”
– Same DC : “CL” (0 < CL < CR)
– Different DCs : “CR”
• How to utilize physical resources of servers?
– Server ON/OFF (DCR)
– To balance storage and computation resources
• How to route the data transmission?
– What is the transmission rate?
– What is the transmission path? (Data flow routing)
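The three-tier communication cost above (0 on the same server, CL within a DC, CR across DCs) can be sketched as a tiny cost function. The constants and the (dc, server) location encoding are illustrative assumptions, not values from the slides:

```python
# Sketch of the three-tier communication cost: 0 if data and task share
# a server, CL within a DC, CR across DCs. The numeric values and the
# (dc_id, server_id) location encoding are illustrative.

C_L, C_R = 1.0, 5.0  # example intra-DC and inter-DC unit costs (0 < CL < CR)

def comm_cost(data_loc, task_loc):
    """data_loc / task_loc are (dc_id, server_id) pairs."""
    if data_loc == task_loc:
        return 0.0            # same server: free
    if data_loc[0] == task_loc[0]:
        return C_L            # same DC, different servers
    return C_R                # different DCs

assert comm_cost((0, 1), (0, 1)) == 0.0
assert comm_cost((0, 1), (0, 2)) == C_L
assert comm_cost((0, 1), (1, 0)) == C_R
```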
General Problem Formulation
• What is our objective?
– To minimize the total cost: both server cost and network cost
• What are the constraints?
– Data and task placement
– Hadoop Distributed File System
– Data flow transmission
– QoS satisfaction
– 2D Markov Chain
Data and Task Placement
• Multiple copies of each data chunk, and at least one computation unit for each
task, must be placed on servers
• The resources required on a server (storage, computation, etc.) must not exceed
its capacity
• The total task rate over all servers must equal the original user task rate
• If a storage or computation unit is located on a server, that server must be
turned on
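The four placement constraints above can be checked mechanically. The sketch below assumes simple dictionary-based data structures; all names are illustrative, not the paper's notation:

```python
# Minimal feasibility check for the placement constraints, under assumed
# data structures (all names are illustrative):
#   replicas[c]     -> set of servers holding chunk c (need P copies)
#   rate[s]         -> task rate assigned to server s
#   used[s], cap[s] -> resource use and capacity per server
#   on[s]           -> whether server s is powered on

def feasible(replicas, rate, used, cap, on, P, lam):
    # every chunk must be stored on P servers, all of them powered on
    for servers in replicas.values():
        if len(servers) < P or not all(on[s] for s in servers):
            return False
    # resource demand must not exceed capacity; tasks only on powered-on servers
    for s in cap:
        if used[s] > cap[s] or (rate[s] > 0 and not on[s]):
            return False
    # total assigned task rate must equal the user task rate lambda
    return abs(sum(rate.values()) - lam) < 1e-9

on = {0: True, 1: True, 2: False}
ok = feasible({"c1": {0, 1}}, {0: 3.0, 1: 2.0, 2: 0.0},
              {0: 4, 1: 4, 2: 0}, {0: 8, 1: 8, 2: 8}, on, P=2, lam=5.0)
assert ok
```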
Hadoop Distributed File System
• P-copy storage policy
• HDFS data distribution example (P = 3)
[Figure: chunk replicas placed across Rack 1 through Rack 5]
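A P-copy placement in the spirit of HDFS's default policy can be sketched as follows: the first replica goes on the writer's rack and the remaining copies go on a single different rack. The rack/server naming is illustrative:

```python
# Sketch of a P-copy replica placement in the spirit of HDFS's default
# policy (P = 3): one replica on the writer's rack, the remaining copies
# on one other rack. Rack and server names are illustrative.
import random

def place_replicas(racks, writer_rack, P=3, rng=random):
    """racks: {rack_id: [server, ...]}. Returns P distinct servers."""
    local = rng.choice(racks[writer_rack])
    remote_rack = rng.choice([r for r in racks if r != writer_rack])
    remote = rng.sample(racks[remote_rack], P - 1)
    return [local] + remote

racks = {r: [f"r{r}s{i}" for i in range(4)] for r in range(1, 6)}
placement = place_replicas(racks, writer_rack=1)
assert len(set(placement)) == 3          # three distinct servers
assert placement[0].startswith("r1")     # first copy on the writer's rack
```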
Data Flow Transmission
[Figure: storage and computation units spread over racks in DC 1 and DC 2; intra-DC transfers cost CL, inter-DC transfers cost CR]
Data Flow Transmission
• Only servers with a data copy in residence can be flow source nodes
• The total outgoing flow from source nodes shall not exceed the user
request rate λ
• The destination receives data from others only when it does not hold a
copy of the data itself
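The three flow constraints above can be verified for a single data chunk as follows; the flow dictionary and server names are illustrative assumptions:

```python
# Quick check of the flow constraints for one data chunk.
# flows[(src, dst)] is the transmission rate from src to dst; "holders"
# is the set of servers storing the chunk. All structures are illustrative.

def flows_valid(flows, holders, dest, lam):
    # only servers holding the chunk may act as sources
    if any(src not in holders for (src, _), f in flows.items() if f > 0):
        return False
    # total outgoing flow must not exceed the user request rate lambda
    if sum(flows.values()) > lam + 1e-9:
        return False
    # a destination already holding a copy needs no incoming flow at all
    incoming = sum(f for (_, dst), f in flows.items() if dst == dest)
    return incoming == 0 if dest in holders else True

flows = {("s1", "s9"): 2.0, ("s2", "s9"): 1.0}
assert flows_valid(flows, holders={"s1", "s2"}, dest="s9", lam=5.0)
assert not flows_valid(flows, holders={"s2"}, dest="s9", lam=5.0)
```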
QoS Satisfaction
• Fluid flow model
– Pipelined transmission
– Computation process starts as soon as the first chunk arrives
[Figure: pipelined chunk transmission; the bottleneck stage limits the end-to-end rate]
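A back-of-the-envelope comparison shows why the fluid-flow model matters: with pipelining, the total time is roughly the first chunk's latency plus the data size divided by the bottleneck rate, rather than the sum of the transfer and computation times. The numbers below are illustrative:

```python
# Contrast between store-and-forward and the pipelined (fluid-flow)
# model: with pipelining, only the bottleneck stage dominates the total
# time. All sizes and rates are illustrative.

def sequential_time(size, net_rate, cpu_rate):
    # transfer everything, then compute on everything
    return size / net_rate + size / cpu_rate

def pipelined_time(size, chunk, net_rate, cpu_rate):
    bottleneck = min(net_rate, cpu_rate)       # slowest stage dominates
    return chunk / net_rate + size / bottleneck

size, chunk = 1000.0, 10.0                     # MB, illustrative numbers
seq = sequential_time(size, net_rate=100.0, cpu_rate=50.0)
pipe = pipelined_time(size, chunk, net_rate=100.0, cpu_rate=50.0)
assert pipe < seq   # pipelining hides the faster stage behind the bottleneck
```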
2D Markov Chain
• Step 1: User requests arrive with rate λ
• Step 2: Data is transmitted from storage to the computation unit with rate γ
• Step 3: Computation is executed with rate μ
[Figure: cloud-service pipeline from storage through computation to user results, annotated with rates λ, γ, and μ]
This process can be modeled by a 2D Markov chain
2D Markov Chain
• The chain is driven by three rates:
– User request rate λ
– Computation rate μ
– Data transmission rate γ
• Computation can start if and only if the data has arrived
– The total system delay T is affected by λ, μ, and γ
– The computation rate μ depends on how much computation resource is
allocated to each task
– The data transmission rate γ depends on the data flow path
– T shall not exceed the QoS requirement
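The chain above can be solved numerically as a sanity check. In the sketch below, state (p, q) counts tasks still waiting for data (p) and tasks whose data has arrived and that await computation (q); the buffer truncation B, the uniformization-based solver, and the tandem-queue reference value are assumptions of this sketch, not the paper's closed-form derivation:

```python
# Numerical sketch of the 2D Markov chain. Transitions: arrivals at rate
# lam, data transfer at rate gamma, computation at rate mu. Solved by
# power iteration on the uniformized chain, truncated at buffer size B.

def mean_delay(lam, gamma, mu, B=15, max_iters=20000, tol=1e-9):
    Lam = lam + gamma + mu                      # uniformization constant
    pi = {(0, 0): 1.0}                          # start from the empty system
    for _ in range(max_iters):
        nxt = {}
        def add(state, w):
            nxt[state] = nxt.get(state, 0.0) + w
        for (p, q), w in pi.items():
            stay = w
            if p < B:                           # Step 1: request arrival
                add((p + 1, q), w * lam / Lam); stay -= w * lam / Lam
            if p > 0 and q < B:                 # Step 2: data transfer done
                add((p - 1, q + 1), w * gamma / Lam); stay -= w * gamma / Lam
            if q > 0:                           # Step 3: computation done
                add((p, q - 1), w * mu / Lam); stay -= w * mu / Lam
            add((p, q), stay)                   # uniformization self-loop
        diff = sum(abs(nxt.get(s, 0.0) - pi.get(s, 0.0))
                   for s in set(pi) | set(nxt))
        pi = nxt
        if diff < tol:
            break
    N = sum((p + q) * w for (p, q), w in pi.items())
    return N / lam                              # Little's law: T = N / lambda

# Sanity check against the product-form tandem-queue value
# T = 1/(gamma - lam) + 1/(mu - lam), valid for infinite buffers.
T = mean_delay(lam=1.0, gamma=4.0, mu=2.0)
assert abs(T - (1 / 3 + 1 / 1)) < 0.05
```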
QoS Satisfaction
• By solving the ODEs, we can derive the state probability π_jk(p, q) [equation omitted from the transcript]
• When B goes to infinity, the mean number of tasks for chunk k on
server j, T_jk, is [equation omitted from the transcript]
• Finally, [equation omitted from the transcript]
Notations
[Notation table omitted from the transcript]
Formulation
• Objective: minimize the total server and network cost, subject to:
– Data & request placement
– Data flow transmission
– QoS satisfaction
Performance Evaluation
• Our proposal outperforms the traditional mechanism under all settings
• Our proposal saves approximately 20% of the overall cost compared with the
traditional “locate computation with data” mechanism
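The benefit of joint placement can be illustrated on a toy instance: a task that must read two distinct data chunks, four servers in two DCs, and a baseline that spreads the chunks across DCs and colocates the task with the first chunk. All numbers (server cost S, CL, CR, the rate λ) are illustrative, and the savings here are not the paper's 20% result:

```python
# Toy instance contrasting joint data-and-task placement with a fixed
# "spread data across DCs, colocate computation with data" baseline.
# All numbers are illustrative.
from itertools import permutations

servers = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}   # server -> DC id
S, CL, CR, lam = 10.0, 1.0, 5.0, 2.0             # costs and task rate

def comm(a, b):
    if a == b:
        return 0.0
    return CL if servers[a] == servers[b] else CR

def total_cost(chunk_locs, task_loc):
    on = set(chunk_locs) | {task_loc}            # powered-on servers
    return S * len(on) + lam * sum(comm(c, task_loc) for c in chunk_locs)

# Baseline: chunks spread over both DCs, task colocated with chunk 1.
baseline = total_cost(("s1", "s3"), "s1")

# Joint optimization: search over all chunk pairs and task locations.
joint = min(total_cost((c1, c2), t)
            for c1, c2 in permutations(servers, 2)
            for t in servers)

assert joint < baseline
```

The joint optimum places both chunks in the same DC, paying CL instead of CR for the second chunk while powering on the same number of servers, which is exactly the kind of saving the joint formulation captures.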
Contributions
• We propose a two-dimensional Markov chain and derive the
expected task completion time in closed form. We explore the big
data placement problem to answer the following questions:
– a) how to place these data chunks in the servers,
– b) how to distribute tasks onto servers without violating the resource
constraints, and
– c) how to resize data centers to achieve the operation cost minimization
goal.
Previous works focus ONLY on the “locate computation with data” policy, but
we show that jointly considering data and computation placement gives better
performance in cost minimization.