
Scaling Cassandra up and down into containers with ZFS

Chris Burroughs

AddThis

2015-09-24


Hello!

Chris Burroughs [email protected] @csby54

Engineer at AddThis

Co-organizer of the Cassandra DC Meetup

Occasional contributor

Interrupt me!


Table of Contents

1 Cassandra at AddThis

2 ZFS

3 Scaling up

4 Scaling down


1 Cassandra at AddThis


AddThis by the Numbers

80,000 requests/second (3 billion views/day)

Tools on over 14 million domains

Mostly Java on Linux

Moving towards SOA / microservices

Multiple engineering “squads” with significant discretion


Cassandra at AddThis

Cassandra in production since 0.6

About a dozen clusters; a new one is created per use case or SLA

Primarily used for latency-sensitive, read-mostly storage

Every cluster is multi-DC

Virtually every page load with AddThis tools results in at least one read to Cassandra


2 ZFS

On Abstractions

Typical Storage:

Many moving parts: block devices, partitions, RAID, volume manager.

Big plan up front: changing partitions is either not done or painful.

Data integrity not warm and fuzzy: hope fsck works.

Typical Memory:

Virtual memory and malloc/free. Add more DRAM if needed.

Maybe worry about NUMA, at runtime.


ZFS

A storage sub-system (fs, RAID, volume manager):
- Always consistent on disk (no fsck)
- End-to-end data integrity
- Universal: file-system, block, NFS, SMB
- Concise, simple administrative tools
- Scalable data structures (2^78 byte max pool size)

Started by Jeff Bonwick and Matthew Ahrens at Sun around 2001.

Available for: Illumos, Solaris, FreeBSD, Linux, Mac OS X

Does for storage what virtual memory did for memory.


Timeline

2001: development started at Sun by Jeff Bonwick and Matthew Ahrens

2005: ZFS source code released

2008: ZFS released in FreeBSD 7.0

2010: Oracle proprietary fork, illumos project continues open-source development

2013: ZFS on (native) Linux GA

2013: Open-source ZFS bands together to form OpenZFS

2014: (new) OpenZFS for Mac OS X launch


Rampant Layering Violation


Universal Storage

Compression, snapshots, etc. are common features of all datasets.


MOOOOOOOOOOOOOOOOO


COW Bonus: Snapshots

A snapshot is a read-only copy of a dataset at a point in time.

                           create   delete   incremental
Traditional (rsync-esque)  O(n)     O(n)     O(n)
ZFS                        O(1)     O(Δ)     O(Δ)
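A minimal sketch of the snapshot lifecycle, using the tank/sstables dataset from the later examples (the snapshot name is illustrative):

# zfs snapshot tank/sstables@before-upgrade    # O(1): new pointer, no data copied
# zfs list -t snapshot                         # enumerate snapshots
# zfs rollback tank/sstables@before-upgrade    # revert the live dataset
# zfs destroy tank/sstables@before-upgrade     # O(delta): frees only diverged blocks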


Clones

“Clone” a snapshot to create a writeable dataset

Only pay for the difference in accumulated changes

Clones with no changes take up no space!
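A minimal sketch (dataset names are illustrative). A clone shares every block with its origin snapshot until one side diverges:

# zfs snapshot tank/sstables@golden
# zfs clone tank/sstables@golden tank/sstables-scratch   # writable, near-instant
# zfs list -o name,used,refer                            # clone USED starts near zero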


Hybrid Storage


Wait, there is more!

End-To-End data integrity

Online Everything: Expansion, scrubbing, resilvering, “fsck”, etc.

Copy On Write with linearized writes

Transparent compression

Dataset send/recv (sketched after this list)

Flexible mount points

Nested, property-based configuration
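Send/recv serializes a snapshot, or the delta between two snapshots, as a byte stream. A minimal sketch, assuming SSH access to a second host whose pool is also named tank:

# zfs snapshot tank/sstables@full
# zfs send tank/sstables@full | ssh otherhost zfs recv tank/sstables
# zfs snapshot tank/sstables@later
# zfs send -i @full tank/sstables@later | ssh otherhost zfs recv tank/sstables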


zpool: A collection of devices that provides physical storage and data replication.

vdev: A device (or collection of devices) with certain performance or fault-tolerance characteristics. The building blocks of a zpool.

dataset: The “data things” (usually filesystems) created on your zpool. Nested in a hierarchy.

property: Key-value pairs used for configuration or reporting of datasets. Inherited in the hierarchy.

ARC: Adaptive Replacement Cache. Like a ZFS-specific page cache.
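Because properties are inherited down the dataset hierarchy, configuration stays terse. A minimal sketch (pool and dataset names as in the example later in this deck):

# zfs set compression=lz4 tank
# zfs create tank/sstables
# zfs get -r compression tank    # -r walks the hierarchy
NAME           PROPERTY     VALUE  SOURCE
tank           compression  lz4    local
tank/sstables  compression  lz4    inherited from tank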


The ARC: your new best friend

$ arcstat.py -f time,read,miss,hit% 1

time read miss hit%

16:26:04 16 7 56

16:26:05 1.4K 671 52

16:26:06 1.9K 900 53

16:26:07 2.1K 972 53

16:26:08 1.5K 697 54

16:26:09 2.1K 906 56

16:26:10 1.9K 844 54

$ arcstat.py -f time,read,miss,hit% 1

time read miss hit%

16:25:47 5.1K 0 100

16:25:48 413K 0 100

16:25:49 403K 0 100

16:25:50 402K 0 100

16:25:51 452K 1 99

16:25:52 553K 0 100

16:25:53 354K 0 100
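arcstat.py is only formatting the kernel's kstats; on ZFS on Linux the raw ARC counters can also be read directly. A minimal sketch (field names as of the 0.6.x era):

$ awk '$1 ~ /^(size|c_max|hits|misses)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats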


dstat plugin

$ dstat --cpu --zfs-arc --zfs-l2arc

----total-cpu-usage---- -----------ZFS-ARC----------- -------------ZFS-L2ARC-------------

usr sys idl wai hiq siq| mem hit miss reads hit%| size hit miss hit% read write

1 1 94 4 0 0|15.0G 796B 372B 1167B 68.2B| 206G 343B 28.8B 92.3B 37.8M 137k

1 1 95 4 0 0|15.0G 557B 395B 952B 58.5B| 206G 376B 19.0B 95.2B 41.5M 0

1 1 95 3 0 0|15.0G 553B 358B 911B 60.7B| 206G 344B 14.0B 96.1B 38.2M 0

1 1 95 4 0 0|15.0G 686B 412B 1098B 62.5B| 206G 396B 16.0B 96.1B 43.7M 0

1 1 95 4 0 0|15.0G 712B 409B 1121B 63.5B| 206G 386B 23.0B 94.4B 42.5M 0

1 1 96 3 0 0|15.0G 446B 331B 777B 57.4B| 206G 307B 24.0B 92.7B 34.0M 0

1 1 86 13 0 0|15.0G 708B 332B 1040B 68.1B| 206G 310B 22.0B 93.4B 33.7M 0

2 1 93 4 0 0|15.0G 1094B 294B 1388B 78.8B| 206G 280B 14.0B 95.2B 31.0M 4608B

hat tip: @AlTobey


Intelligent Prefetch

Cassandra test: run cat */*.junk > /dev/null to stream junk files through the cache while Cassandra serves reads:

page cache: 98th-percentile reads reported by Cassandra increased 4-6x

ARC: 98th-percentile reads reported by Cassandra increased 2x

LSM trees (Cassandra, HBase, LevelDB, RocksDB, BerkeleyDB) mean linear scans are common on modern storage systems.
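On ZFS on Linux, prefetch can be toggled at runtime if it hurts a particular workload; a minimal sketch, assuming the 0.6.x module parameter:

# echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable   # disable prefetch
# echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable   # re-enable it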


Commands

Only two you are ever likely to use:

zpool Configure pools

zfs Configure file systems

zdb (Detailed debugging dump)


production-esque example

# zpool create -f tank mirror /dev/sdb /dev/sdc \

mirror /dev/sdd /dev/sde \

cache /dev/sdf

# zfs create tank/sstables

# zfs set mountpoint=/data/sstables tank/sstables

# zfs set compression=lz4 tank

# zfs set atime=off tank

# chown cassandra:cassandra /data/sstables/

# zfs list

NAME USED AVAIL REFER MOUNTPOINT

tank 374G 1.42T 30K /tank

tank/sstables 374G 1.42T 374G /data/sstables

# zfs get compressratio tank/sstables

NAME PROPERTY VALUE SOURCE

tank/sstables  compressratio  1.08x  -

ZFS on Linux History

In days gone by there was a FUSE project.

Port started by LLNL as a backend for their supercomputer.

Early 2013: “Ready for wide scale deployment on everything from desktops to supercomputers.”

Late 2013: Illumos/FreeBSD/Linux developers form OpenZFS group to coordinate development.


Today

0.6.5 released September 2015

Native packages for most distributions

Active user community:
- http://zfsonlinux.org
- #zfsonlinux on freenode
- [email protected]


Zero in front of the version number?

clusterhq.com/blog/state-zfs-on-linux/

Close to feature parity with Illumos and FreeBSD.

Key end-to-end data integrity features work on Linux like other platforms.

Performance is workload dependent.

ZFS on Linux may be better than other options today for your use cases. It is not better in all cases.


3 Scaling up

Initial Problem Statement

Large-ish Cassandra cluster serving ML-derived data about URLs that use AddThis.

Internal DC storage SLA: 98th percentile of 35ms

Data size and request volume growing; failing to meet the SLA even while throwing hardware at it.

Multiple revenue lines & products affected or launch-blocked on cluster performance.

Zipfian web traffic with a very long tail.


L2ARC


Setup

# zpool create -f tank mirror /dev/sdb /dev/sdc \

mirror /dev/sdd /dev/sde \

cache /dev/sdf

# zpool status tank

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
        cache
          sdf       ONLINE       0     0     0
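A cache vdev can also be attached to (or detached from) a live pool; a minimal sketch:

# zpool add tank cache /dev/sdf   # attach an L2ARC device online
# zpool iostat -v tank 5          # per-vdev I/O, including the cache device
# zpool remove tank sdf           # cache devices can be removed if they do not help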



Results

Twice the performance with half the physical nodes.

(Mileage will vary with workload and DRAM:SSD:working-set ratios.)

4 Scaling down

Cute Little Clusters

$ nodetool status

Datacenter: IAD

===============

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address Load Tokens Owns Host ID Rack

UN x.xx.xxx.125 154.14 MB 256 ? 87d41c52-2b25-466b-93c1-d65c72f5fc61 NOP

UN x.xx.xxx.124 154.17 MB 256 ? c1a44486-2133-40fc-ba9d-0e671c5b2fc7 NOP

UN x.xx.xxx.126 154.17 MB 256 ? 824f6018-eba6-4b44-b716-3b4eeaf69228 NOP



Constraints

Need more efficient hardware allocation for small clusters (multi-tenancy)

Among the most latency sensitive services

Non-trivial legacy network requirements

Application transparency

Infrastructure transparency (inventory, dns, dhcp, config management)

Chris Burroughs (AddThis) #CassandraSummit 2015-09-24 42 / 52

Container Spectrum


ZFS Enabling Containerization

Create from base image      → zfs clone

Backup running applications → zfs snapshot

Migrate to new host         → zfs send/recv

Manage quotas               → zfs properties
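Stitched together, provisioning looks roughly like this; a minimal sketch with illustrative names, not the actual AddThis tooling:

# zfs snapshot tank/lxc/base@golden                # frozen base image
# zfs clone tank/lxc/base@golden tank/lxc/web01    # instant rootfs, no space used yet
# zfs set quota=150G tank/lxc/web01                # per-container disk quota
# zfs snapshot tank/lxc/web01@backup               # backup while the container runs
# zfs send tank/lxc/web01@backup | ssh host2 zfs recv tank/lxc/web01   # migrate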


lxc-create(1)
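lxc-create can use ZFS as its backing store directly; a hedged sketch (the container name and template are illustrative, and the --zfsroot option is from the LXC 1.x era):

# lxc-create -n web01 -t centos -B zfs --zfsroot=tank/lxc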


Glue

$ ./bin/port register-container --hostname=HOSTNAME

$ ./bin/port build-container --tag=CONTAINER_TAG \

--resources=standard-small \

--host-tag=PHYSICAL_HOST_TAG

$ ./bin/cobbling-time add-chef-roles --tag=CONTAINER_TAG \

--roles='ROLES'

$ ./bin/cobbling-time signoff --tag=CONTAINER_TAG \

--next-status=Allocated


# lxc-ls -l

drwxrwx--- 3 root root 5 Apr 17 15:23 drydock-2015-04-17T15:23:02

drwxrwx--- 3 root root 5 Mar 12 2015 T5501a951541e62d5

drwxrwx--- 3 root root 5 Mar 17 2015 T55081d472f641f10

drwxrwx--- 3 root root 5 Apr 14 10:27 T552d238170fc63be

drwxrwx--- 3 root root 5 Apr 16 11:38 T552fd746af5be6e9

drwxrwx--- 3 root root 5 Sep 4 08:46 T55e9928f940e0155

# zfs list

NAME USED AVAIL REFER MOUNTPOINT

tank/lxc 52.8G 2.60T 79K /lxc

tank/lxc/T5501a951541e62d5 483M 150G 957M /lxc/T5501a951541e62d5/rootfs

tank/lxc/T55081d472f641f10 15.4G 585G 15.8G /lxc/T55081d472f641f10/rootfs

tank/lxc/T552d238170fc63be 22.1G 278G 22.5G /lxc/T552d238170fc63be/rootfs

tank/lxc/T552fd746af5be6e9 8.53G 591G 8.51G /lxc/T552fd746af5be6e9/rootfs

tank/lxc/T55e9928f940e0155 1.45G 124G 1.87G /lxc/T55e9928f940e0155/rootfs

tank/lxc/drydock-2015-04-17T15:23:02 734M 2.60T 734M /lxc/drydock-2015-04-17T15:23:02/rootfs


cadvisor


Results

> 2x consolidation

. . . able to defer hardware purchase for a year

Clean method for multiple Cassandra clusters per physical host. Can continue to break apart clusters by use case and SLA!

Virtually every view (3 billion/day) of AddThis tools involves Cassandra, in a container, on ZFS.


Summary & Future Work

ZFS allows us to significantly improve the efficiency of both large and small clusters

ZFS is fundamental to container storage

Future: Continued performance investigations (align block sizes?)

Future: Is this going to evolve into writing our own IaaS?


We Are Hiring

http://www.addthis.com/careers


Questions?
