Status Update of COLO Project: Xiaowei Yang, Huawei and Will Auld, Intel

Status of COLO Project

Eddie Dong*, Xiaowei Yang#

*Intel Open Source Technology Center

#Huawei Technology Co.

Key Contributors: Jianshan Lai, Congyang Wen, Tao Hong

Notices and Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel may make changes to specifications, product descriptions, and plans at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

All dates provided are subject to change without notice.

Intel and Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2013, Intel Corporation. All rights reserved.

Agenda

Background

Status

Performance

Call for action

What is COLO?

COarse-grain LOck-stepping Virtual Machines for Non-stop Service

Solution for client/server applications, without application awareness

Dual VM based high availability solution

Relaxed constraints for higher performance

Replicated network

Copy each client request to both the primary VM (PVM) and the secondary VM (SVM)

Compare response packets from the PVM and SVM with the compare module

When both are the same, the response is sent to the client

When they are not the same, sync the PVM and SVM and then send the response
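The compare-and-release rule above can be sketched in a few lines. This is a minimal illustration, not the COLO compare module: `release_response` and `sync_vms` are hypothetical names, and the sketch assumes that the PVM's packet is the one released to the client.

```python
from typing import Callable


def release_response(pvm_response: bytes,
                     svm_response: bytes,
                     sync_vms: Callable[[], None]) -> bytes:
    """Decide what to send to the client for one pair of response packets."""
    if pvm_response != svm_response:
        # Outputs diverged: synchronize the PVM and SVM before releasing anything.
        sync_vms()
    # Assumption: the client always receives the PVM's response.
    return pvm_response


if __name__ == "__main__":
    sync_log = []
    release_response(b"200 OK", b"200 OK", lambda: sync_log.append("sync"))
    release_response(b"200 OK", b"500 ERR", lambda: sync_log.append("sync"))
    print(len(sync_log), "synchronization(s) triggered")
```

Only the mismatching pair triggers a synchronization, so identical responses pass through without checkpoint cost.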

Non-Stop Service with VM Replication

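[Figure: Primary node (Hardware, VMM, PVM with OS and APPs) and Secondary node (Hardware, VMM, SVM with OS and APPs), linked by Network and Storage. VM Replication runs between the two VMs; on a Hardware Failure the service Fails Over to the secondary. Compare w/ Remus.]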

Problems with existing approaches

Instruction level lock-stepping

Excessive overhead from maintaining the exact machine state

Memory access in an MP guest is non-deterministic

Periodic Check-pointing

Extra network latency

Excessive VM checkpoint overhead

Relaxed constraints help

Relaxing constraints tends to lower the rate of synchronization

With periodic check-pointing, the checkpoint period defines the rate of synchronization

Tying the rate of synchronization to dissimilar responses ties it to the application characteristics

In most cases this lowers the rate compared to the periodic method
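The difference between the two triggers can be made concrete with a small count. This is only an illustration: the function names are hypothetical, and the period and response pairs in the usage lines are invented inputs, not measurements.

```python
def periodic_syncs(duration_ms: int, period_ms: int) -> int:
    """Periodic check-pointing: one sync per period, whatever the workload does."""
    return duration_ms // period_ms


def on_demand_syncs(response_pairs) -> int:
    """Response-driven sync: one sync per PVM/SVM response pair that differs."""
    return sum(1 for pvm, svm in response_pairs if pvm != svm)


if __name__ == "__main__":
    # Invented inputs for illustration only.
    pairs = [(b"a", b"a"), (b"b", b"x"), (b"c", b"c")]
    print(periodic_syncs(duration_ms=1000, period_ms=25))
    print(on_demand_syncs(pairs))
```

In the second function the count depends only on how often the application's outputs diverge, which is the point the slide makes.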

Architecture of COLO

COarse-grain LOck-stepping Virtual Machine for Non-stop Service

Current Status

Patches for Xen have been sent to the mailing list

Academic paper published at the ACM Symposium on Cloud Computing (SOCC’13)

Refer to “COLO: COarse-grained LOck-stepping Virtual Machines for Non-stop Service” for details

http://www.socc2013.org/home/program

Industry announcement

Huawei FusionSphere uses COLO

http://enterprise.huawei.com/ilink/enenterprise/about/news/news-list/HW_308817?KeyTemps=

TCP/IP optimization

Per-Connection Comparison (no modification to TCP/IP)

Coarse-grain TCP Timestamp

Coarse-grain TCP Notification Window Size

Deterministic Algorithm to segment application data

Deterministic Algorithm to generate Initial Seq Number

Deterministic Algorithm to generate ID (IP packet header)

Immediate Acknowledgement

Use a separate packet to send FIN

…

EXAMPLE: Coarse-grain TCP Notification Window Size

Coarse-grain window size rules:

if original window < 256

round down to the nearest power of 2

else

mask off the 8 least significant bits

For example:

1. original window size = 172 (10101100b)

set window size to 128 (10000000b)

2. original window size = 283 (100011011b)

set window size to 256 (100000000b)

3. original window size = 789 (1100010101b)

set window size to 768 (1100000000b)
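The window rule can be written as a short function and checked against the three examples. This is a minimal sketch: `coarse_window` is a hypothetical name, and the handling of a zero window (returned unchanged) is an assumption, since the slide does not cover it.

```python
def coarse_window(window: int) -> int:
    """Coarsen a TCP window size so small PVM/SVM differences disappear."""
    if window < 0:
        raise ValueError("window size must be non-negative")
    if window == 0:
        # Assumption: a zero window has no power of 2 below it, so keep it as 0.
        return 0
    if window < 256:
        # Round down to the nearest power of 2: keep only the highest set bit.
        return 1 << (window.bit_length() - 1)
    # Otherwise mask off the 8 least significant bits.
    return window & ~0xFF


if __name__ == "__main__":
    for w in (172, 283, 789):
        print(w, "->", coarse_window(w))
```

Two VMs whose windows differ only in the low bits then advertise the same value, so the packet headers still compare equal.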


EXAMPLE: Deterministic segmentation

Application data to send at T1 and T2:

App data1 (time point 1): 3000 B

App data2 (time point 2): 2000 B

Method 1: Find the latest unsent skb and append app data2 to the unused tail of its payload

1360 B | 1360 B | 280 B + 1080 B | 920 B

Method 2: No unsent skb is found (skb == NULL), so a new skb is used to send app data2

1360 B | 1360 B | 280 B | 1360 B | 640 B

COLO deterministic method: do NOT check for the latest unsent skb; always use a new skb to send app data2

[Figure legend: TCP/IP packet header]
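The effect of the two segmentation policies can be reproduced with a toy model. This is a sketch, not kernel code: `segment` and its parameters are hypothetical, the 1360 B segment payload is taken from the figure, and whether the tail skb is still unsent is passed in as a flag because it depends on timing.

```python
MSS = 1360  # payload bytes per segment, as in the figure


def segment(write_sizes, tail_unsent, check_tail):
    """Return the payload size of each segment produced by a series of writes.

    write_sizes: bytes written by the application at each time point.
    tail_unsent: per write, whether the previous tail segment is still unsent.
    check_tail:  True for the stock behaviour, False for the deterministic one.
    """
    segments = []
    for size, unsent in zip(write_sizes, tail_unsent):
        if check_tail and unsent and segments and segments[-1] < MSS:
            # Stock behaviour: top up the unsent tail segment first.
            fill = min(MSS - segments[-1], size)
            segments[-1] += fill
            size -= fill
        while size > 0:
            # Remaining data always goes into new segments.
            chunk = min(MSS, size)
            segments.append(chunk)
            size -= chunk
    return segments


if __name__ == "__main__":
    writes = [3000, 2000]
    print(segment(writes, [False, True], check_tail=True))
    print(segment(writes, [False, False], check_tail=True))
    print(segment(writes, [False, True], check_tail=False))
```

With `check_tail=True` the result depends on whether the tail was already sent, which can differ between PVM and SVM; with `check_tail=False` the segmentation is the same regardless of timing.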

Storage process

Write

Pnode

DM sends the Write request (offset, len, data) to PVM cache in Snode

DM calls block driver to write to storage

Snode

DM saves Write request in SVM cache

Read

Snode

From SVM cache, or storage otherwise

Pnode

From storage

Checkpoint

DM calls block driver to flush PVM cache

Failover

DM calls block driver to flush SVM cache
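The storage process can be sketched with two in-memory caches on the secondary node. This is a simplified model, not the DM or block driver code: the class and method names are hypothetical, writes are keyed by offset only (len is ignored), and discarding the other cache at checkpoint and failover is an assumption the slide does not state.

```python
class SecondaryNode:
    def __init__(self):
        self.storage = {}    # offset -> data on the secondary's disk
        self.pvm_cache = {}  # writes forwarded from the primary node
        self.svm_cache = {}  # writes issued by the SVM itself

    def forward_pvm_write(self, offset, data):
        self.pvm_cache[offset] = data

    def svm_write(self, offset, data):
        self.svm_cache[offset] = data

    def svm_read(self, offset):
        # Read from the SVM cache, or from storage otherwise.
        if offset in self.svm_cache:
            return self.svm_cache[offset]
        return self.storage.get(offset)

    def checkpoint(self):
        # Flush the PVM cache to storage.
        self.storage.update(self.pvm_cache)
        self.pvm_cache.clear()
        self.svm_cache.clear()  # assumption: SVM-only writes are discarded

    def failover(self):
        # Flush the SVM cache to storage.
        self.storage.update(self.svm_cache)
        self.svm_cache.clear()
        self.pvm_cache.clear()  # assumption: unflushed PVM writes are discarded


class PrimaryNode:
    def __init__(self, secondary):
        self.storage = {}
        self.secondary = secondary

    def pvm_write(self, offset, data):
        # Send the write to the PVM cache in the Snode, then write locally.
        self.secondary.forward_pvm_write(offset, data)
        self.storage[offset] = data

    def pvm_read(self, offset):
        return self.storage.get(offset)
```

Until a checkpoint or failover happens, the secondary's disk is untouched, so either VM's view can still become the authoritative one.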

Memory sync

One of the most time-consuming steps

Asynchronously send dirty memory while the PVM/SVM are running

Less dirty memory to transmit during VM checkpoint

Less CPU pressure and latency

Critical for the case where VM checkpoints happen very rarely
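The benefit of sending dirty memory in the background can be shown with a toy page tracker. This is only a sketch of the idea: `DirtyTracker` and its methods are hypothetical, and real dirty-page tracking in Xen works quite differently.

```python
class DirtyTracker:
    def __init__(self):
        self.dirty = set()           # page numbers modified since last send
        self.sent_in_background = 0  # pages sent while the VM was running
        self.sent_at_checkpoint = 0  # pages sent while the VM was paused

    def guest_write(self, page):
        self.dirty.add(page)

    def background_send(self, budget):
        """Send up to `budget` dirty pages while the VM keeps running."""
        for _ in range(min(budget, len(self.dirty))):
            self.dirty.pop()
            self.sent_in_background += 1

    def checkpoint(self):
        """Send whatever is still dirty; this is the part that stalls the VM."""
        remaining = len(self.dirty)
        self.sent_at_checkpoint += remaining
        self.dirty.clear()
        return remaining


if __name__ == "__main__":
    tracker = DirtyTracker()
    for page in range(100):
        tracker.guest_write(page)
    tracker.background_send(budget=80)
    print("pages left for the checkpoint:", tracker.checkpoint())
```

The longer the interval between checkpoints, the more pages accumulate, so draining them in the background matters most when checkpoints are rare.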

Faster VBD/VIF frontend/backend suspend/resume

Old method:

Frontend and backend communicate through xenstored, which is inefficient

New method:

Use an event channel to speed up frontend/backend communication

*Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Web Server Performance - Web Bench

Source: Intel. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks

Web Server Performance - Web Bench (MP)

Source: Intel

PostgreSQL Performance - Pgbench

Source: Intel

PostgreSQL Performance - Pgbench (MP)

Source: Intel

Upstream

Initial patch series has been posted

More comments are welcome

Depends on the readiness of Remus on top of XL

COLO reuses Remus for VM checkpoint and heartbeat

Next Steps and Call for Action

Works well with an HVM Linux guest + PV drivers

Windows guest support is under development

Need more participants and faster turnaround on upstreaming
