Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo Clusters


DESCRIPTION

Speaker: Aaron Cordova. Most users of Accumulo start developing applications on a single machine and want to scale up to four orders of magnitude more machines without having to rewrite. In this talk we describe techniques for designing applications for scale, planning a large-scale cluster, tuning the cluster for high-speed ingest, dealing with a large amount of data over time, and unique features of Accumulo for taking advantage of up to ten thousand nodes in a single instance. We also include the largest public metrics gathered on Accumulo clusters to date and discuss overcoming practical limits to scaling in the future.

TRANSCRIPT

Four Orders of Magnitude: Running Large Scale Accumulo Clusters

Aaron Cordova Accumulo Summit, June 2014

Scale, Security, Schema

Scale

to scale (1) - (vt) to change the size of something

“let’s scale the cluster up to twice the original size”

to scale (2) - (vi) to function properly at a large scale

“Accumulo scales”

What is Large Scale?

Notebook Computer

• 16 GB DRAM

• 512 GB Flash Storage

• 2.3 GHz quad-core i7 CPU

Modern Server

• 100s of GB DRAM

• 10s of TB on disk

• 10s of cores

Large Scale

[Chart: data size from 10 GB to 100 PB, in RAM and on disk, for a laptop, a single server, and clusters of 10, 100, 1,000, and 10,000 nodes]

Data Composition

[Chart: data volume per month, January through April, broken into Original Raw, Derivative, QFDs, and Indexes]

Accumulo Scales

• From GB to PB, Accumulo keeps two things low:

• Administrative effort

• Scan latency

Scan Latency

[Chart: scan latency (seconds, 0 to 0.05) vs. number of nodes (0 to 1,000)]

Administrative Overhead

[Chart: failed machines and admin interventions (0 to 12) vs. number of nodes (0 to 1,000)]

Accumulo Scales

• From GB to PB three things grow linearly:

• Total storage size

• Ingest Rate

• Concurrent scans

Ingest Benchmark

[Chart: ingest rate in millions of entries per second (0 to 100) vs. number of machines (0 to 1,000)]

AWB Benchmark

http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf

1000 machines

100 M entries written per second

408 terabytes

7.56 trillion total entries

Graph Benchmark

http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

1200 machines

4.4 trillion vertices

70.4 trillion edges

149 M edges traversed per second

1 petabyte

Graph Analysis

[Chart: graph size in billions of edges, log scale: Twitter 1.5, Yahoo! 6.6, Facebook 1,000, Accumulo 70,000]

Accumulo is designed after Google’s BigTable

BigTable powers hundreds of applications at Google

BigTable serves 2+ exabytes

http://hbasecon.com/sessions/#session33

600 M queries per second organization wide

From 10 to 10,000

Starting with ten machines (10^1)

One rack

1 TB RAM

10-100 TB Disk

Hardware failures rare

Test Application Designs

Designing Applications for Scale

Keys to Scaling

1. Live writes go to all servers

2. User requests are satisfied by few scans

3. Turning updates into inserts

Keys to Scaling

Writes on all servers Few Scans

Hash / UUID Keys

RowID Col Value

af362de4 Bob

b23dc4be Annie

b98de2ff Joe

c48e2ade $30

c7e43fb2 $25

d938ff3d 32

e2e4dac4 59

e98f2eab3 43

Key Value

userA:name Bob

userA:age 43

userA:account $30

userB:name Annie

userB:age 32

userB:account $25

userC:name Joe

userC:age 59

Uniform writes
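A minimal sketch of writing with hashed row IDs using the Java client API; the instance name, ZooKeeper address, credentials, and table name are placeholders, and the MD5-prefix helper is just one way to generate well-spread row IDs like those above:

  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;
  import org.apache.accumulo.core.data.Mutation;

  public class HashedRowWriter {
    // derive a short hex row ID from the natural key so writes spread across tablets
    static String hashRow(String naturalKey) throws Exception {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
      return sb.substring(0, 8);   // short hex row ID, similar in shape to "af362de4"
    }

    public static void main(String[] args) throws Exception {
      Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")   // placeholder instance / ZooKeepers
          .getConnector("user", new PasswordToken("secret"));            // placeholder credentials
      BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());  // placeholder table
      Mutation m = new Mutation(hashRow("userA"));   // hashed row ID instead of "userA"
      m.put("name", "", "Bob");
      m.put("age", "", "43");
      m.put("account", "", "$30");
      bw.addMutation(m);
      bw.close();
    }
  }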

Monitor: Participating Tablet Servers

MyTable

Servers Hosted Tablets … Ingest

r1n1 1500 200k

r1n2 1501 210k

r2n1 1499 190k

r2n2 1500 200k

Hash / UUID Keys

RowID Col Value

af362de4 Bob

b23dc4be Annie

b98de2ff Joe

c48e2ade $30

c7e43fb2 $25

d938ff3d 32

e2e4dac4 59

e98f2eab3 43

3 x 1-entry scans on 3 servers

get(userA)

Keys to Scaling

Writes on all servers Few Scans

Hash / UUID Keys

Group for Locality

Key Value

userA:name Bob

userA:age 43

userB:name Annie

userB:age 32

userC:name Fred

userC:age 29

userD:name Joe

userD:age 59

Key Value

userA:name Bob

userA:age 43

userA:account $30

userB:name Annie

userB:age 32

userB:account $25

userC:name Joe

userC:age 59

RowID Col Value

af362de4 name Annie

af362de4 age 32

af362de4 account $25

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

Still fairly uniform writes

Group for Locality

RowID Col Value

af362de4 name Annie

af362de4 age 32

af362de4 account $25

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

1 x 3-entry scan on 1 server

get(userA)
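A sketch of the corresponding read path under the grouped layout (table name is a placeholder); a single Range over the user's row returns all of that user's columns in one scan:

  import java.util.Map;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.client.TableNotFoundException;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;

  public class GroupedRead {
    // fetch every column stored under one (hashed) row ID with a single scan
    static void printUser(Connector conn, String hashedRowId) throws TableNotFoundException {
      Scanner scanner = conn.createScanner("users", Authorizations.EMPTY);  // placeholder table
      scanner.setRange(new Range(hashedRowId));   // one row -> one scan, served by one tablet server
      for (Map.Entry<Key,Value> entry : scanner) {
        System.out.println(entry.getKey().getColumnFamily() + " = " + entry.getValue());
      }
    }
  }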

Keys to Scaling

Writes on all servers Few Scans

Grouped Keys

Temporal Keys

Key Value

userA:name Bob

userA:age 43

userB:name Annie

userB:age 32

userC:name Fred

userC:age 29

userD:name Joe

userD:age 59

Key Value

20140101 44

20140102 22

20140103 23

RowID Col Value

20140101 44

20140102 22

20140103 23

Temporal Keys

Key Value

userA:name Bob

userA:age 43

userB:name Annie

userB:age 32

userC:name Fred

userC:age 29

userD:name Joe

userD:age 59

Key Value

20140101 44

20140102 22

20140103 23

20140104 25

20140105 31

RowID Col Value

20140101 44

20140102 22

20140103 23

20140104 25

20140105 31

Temporal Keys

Key Value

userA:name Bob

userA:age 43

userB:name Annie

userB:age 32

userC:name Fred

userC:age 29

userD:name Joe

userD:age 59

Key Value

20140101 44

20140102 22

20140103 23

20140104 25

20140105 31

20140106 27

20140107 25

20140108 17

RowID Col Value

20140101 44

20140102 22

20140103 23

20140104 25

20140105 31

20140106 27

20140107 25

20140108 17

Always write to one server

No write parallelism

Temporal Keys

RowID Col Value

20140101 44

20140102 22

20140103 23

20140104 25

20140105 31

20140106 27

20140107 25

20140108 17

Fetching ranges uses few scans

get(20140101 to 201404)

Keys to Scaling

Writes on all servers Few Scans

Temporal Keys

Binned Temporal Keys

Key Value

userA:name Bob

userA:age 43

userB:name Annie

userB:age 32

userC:name Fred

userC:age 29

userD:name Joe

userD:age 59

Key Value

20140101 44

20140102 22

20140103 23

RowID Col Value

0_20140101 44

1_20140102 22

2_20140103 23

Uniform Writes

Binned Temporal Keys

Key Value

userA:name Bob

userA:age 43

userB:name Annie

userB:age 32

userC:name Fred

userC:age 29

userD:name Joe

userD:age 59

Key Value

20140101 44

20140102 22

20140103 23

20140104 25

20140105 31

20140106 27

RowID Col Value

0_20140101 44

0_20140104 25

1_20140102 22

1_20140105 31

2_20140103 23

2_20140106 27

Uniform Writes

Binned Temporal Keys

Key Value

userA:name Bob

userA:age 43

userB:name Annie

userB:age 32

userC:name Fred

userC:age 29

userD:name Joe

userD:age 59

Key Value

20140101 44

20140102 22

20140103 23

20140104 25

20140105 31

20140106 27

20140107 25

20140108 17

RowID Col Value

0_20140101 44

0_20140104 25

0_20140107 25

1_20140102 22

1_20140105 31

1_20140108 17

2_20140103 23

2_20140106 27

Uniform Writes

Binned Temporal Keys

RowID Col Value

0_20140101 44

0_20140104 25

0_20140107 25

1_20140102 22

1_20140105 31

1_20140108 17

2_20140103 23

2_20140106 27

One scan per bin

get(20140101 to 201404)
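A sketch of the binned read, assuming the three bins (prefixes 0_, 1_, 2_) shown above, date-string row suffixes, and a placeholder table name; a BatchScanner issues one range per bin and runs them in parallel:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;
  import org.apache.accumulo.core.client.BatchScanner;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.TableNotFoundException;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;

  public class BinnedRangeRead {
    static final int NUM_BINS = 3;   // assumed bin count, matching the 0_/1_/2_ prefixes above

    static void scanDates(Connector conn, String startDay, String endDay) throws TableNotFoundException {
      List<Range> ranges = new ArrayList<Range>();
      for (int bin = 0; bin < NUM_BINS; bin++) {
        // one range per bin, e.g. 0_20140101 .. 0_20140104, 1_20140101 .. 1_20140104, ...
        ranges.add(new Range(bin + "_" + startDay, bin + "_" + endDay));
      }
      BatchScanner bs = conn.createBatchScanner("metrics", Authorizations.EMPTY, NUM_BINS); // placeholder table
      bs.setRanges(ranges);
      for (Map.Entry<Key,Value> entry : bs) {
        System.out.println(entry.getKey().getRow() + " -> " + entry.getValue());
      }
      bs.close();
    }
  }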

Keys to Scaling

Writes on all servers Few Scans

Binned Temporal Keys

Keys to Scaling

• Key design is critical

• Group data under common row IDs to reduce scans

• Prepend bins to row IDs to increase write parallelism

Splits

• Pre-split or organic splits

• Going from dev to production, you can ingest a representative sample, obtain its split points, and use them to pre-split the larger system (see the sketch after this list)

• Hundreds or thousands of tablets per server are OK

• Want at least one tablet per server
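One way this pre-splitting step might look with the Java API; the table names are placeholders, and the split points observed on the smaller sample table are applied to the production table before heavy ingest starts:

  import java.util.Collection;
  import java.util.TreeSet;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.hadoop.io.Text;

  public class PreSplit {
    // copy split points from a table that ingested a representative sample
    // onto a freshly created production table
    static void preSplitFromSample(Connector conn) throws Exception {
      Collection<Text> sampleSplits = conn.tableOperations().listSplits("events_dev", 1000);
      conn.tableOperations().addSplits("events", new TreeSet<Text>(sampleSplits));
    }
  }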

Effect of Compression

• Similar sorted keys compress well

• May need more data than you think to auto-split

Inserts are fast: 10s of thousands per second per machine

Updates *can* be …

Update Types

• Overwrite

• Combine

• Complex

Update - Overwrite

• Performance same as insert

• Ignore (don’t read) existing value

• Accumulo’s VersioningIterator does the overwrite (sketch below)
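A minimal overwrite sketch (placeholder table and connection); the new value is simply inserted under the same row and column, and the default VersioningIterator (one version kept) returns only the latest value at read time:

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.data.Mutation;

  public class OverwriteUpdate {
    // "update" a user's age without reading the old value: just insert again
    static void setAge(Connector conn, String hashedRowId, String newAge) throws Exception {
      BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());  // placeholder table
      Mutation m = new Mutation(hashedRowId);   // e.g. "af362de4" from the slides
      m.put("age", "", newAge);                 // newer timestamp wins; the old value is dropped on compaction
      bw.addMutation(m);
      bw.close();
    }
  }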

Update - Overwrite

RowID Col Value

af362de4 name Annie

af362de4 age 32

af362de4 account $25

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

userB:age -> 34

Update - Overwrite

RowID Col Value

af362de4 name Annie

af362de4 age 34

af362de4 account $25

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

userB:age -> 34

Update - Combine

• Things like X = X + 1

• Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction time (see the combiner sketch below)

• Performance is the same as inserts
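A sketch of configuring this with the built-in SummingCombiner; the table name, iterator priority/name, and the choice to combine the "account" column family are assumptions for this example, and real values would need to be stored as plain numbers in the chosen encoding rather than strings like "$25":

  import java.util.Collections;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.IteratorSetting;
  import org.apache.accumulo.core.iterators.LongCombiner;
  import org.apache.accumulo.core.iterators.user.SummingCombiner;

  public class AccountCombiner {
    // attach a SummingCombiner so that "+10" style inserts are summed at scan and compaction time
    static void attach(Connector conn) throws Exception {
      IteratorSetting setting = new IteratorSetting(10, "acctsum", SummingCombiner.class);
      LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING);   // values stored as decimal strings
      SummingCombiner.setColumns(setting,
          Collections.singletonList(new IteratorSetting.Column("account")));
      conn.tableOperations().attachIterator("users", setting);           // placeholder table
    }
  }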

Update - Combine

RowID Col Value

af362de4 name Annie

af362de4 age 34

af362de4 account $25

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

userB:account -> +10

Update - Combine

RowID Col Value

af362de4 name Annie

af362de4 age 34

af362de4 account $25

af362de4 account $10

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

userB:account -> +10

Update - Combine

RowID Col Value

af362de4 name Annie

af362de4 age 34

af362de4 account $25

af362de4 account $10

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

getAccount(userB) $35

Update - Combine

After compaction

RowID Col Value

af362de4 name Annie

af362de4 age 34

af362de4 account $35

c48e2ade name Joe

c48e2ade age 59

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

Update - Complex

• Some updates require looking at more data than Iterators have access to - such as multiple rows

• These require reading the data out in order to write the new value (see the read-modify-write sketch below)

• Performance will be much slower
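A sketch of such a read-modify-write; the table name is a placeholder, the helper assumes balances are stored as plain numbers under an "account" column family, and note that nothing makes the two reads and the write atomic across rows:

  import java.util.Map;
  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;
  import org.apache.hadoop.io.Text;

  public class ComplexUpdate {
    // read a single account balance back out of the table
    static long getBalance(Connector conn, String row) throws Exception {
      Scanner s = conn.createScanner("users", Authorizations.EMPTY);   // placeholder table
      s.setRange(new Range(row));
      s.fetchColumnFamily(new Text("account"));
      long balance = 0;
      for (Map.Entry<Key,Value> e : s) balance = Long.parseLong(e.getValue().toString());
      return balance;
    }

    // userC's balance = userA's + userB's: requires reads, so much slower than plain inserts
    static void combineAccounts(Connector conn, String rowA, String rowB, String rowC) throws Exception {
      long total = getBalance(conn, rowA) + getBalance(conn, rowB);
      BatchWriter bw = conn.createBatchWriter("users", new BatchWriterConfig());
      Mutation m = new Mutation(rowC);
      m.put("account", "", Long.toString(total));
      bw.addMutation(m);
      bw.close();
    }
  }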

Update - Complex

userC:account = getBalance(userA) + getBalance(userB)

RowID Col Value

af362de4 name Annie

af362de4 age 34

af362de4 account $35

c48e2ade name Joe

c48e2ade age 59

c48e2ade account $40

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

35+30 = 65

Update - Complex

userC:account = getBalance(userA) + getBalance(userB)

RowID Col Value

af362de4 name Annie

af362de4 age 34

af362de4 account $35

c48e2ade name Joe

c48e2ade age 59

c48e2ade account $65

e2e4dac4 name Bob

e2e4dac4 age 43

e2e4dac4 account $30

35+30 = 65

Planning a Larger-Scale Cluster (10^2 - 10^4)

Storage vs Ingest

[Chart: storage (terabytes) and ingest rate (millions of entries per second) vs. number of machines (10 to 10,000), for 1x1TB and 12x3TB disk configurations]

Model for Ingest Rates

A = 0.85^(log2 N) * N * S

N - number of machines
S - single-server throughput (entries / second)
A - aggregate cluster throughput (entries / second)

Expect about 85% of linear scaling (roughly 1.7x the write rate) each time the cluster size doubles

Estimating Machines Required

N = 2^(log2(A / S) / 0.7655347)

N - number of machines
S - single-server throughput (entries / second)
A - target aggregate throughput (entries / second)

Expect about 85% of linear scaling (roughly 1.7x the write rate) each time the cluster size doubles
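A small sketch of both formulas with a worked example; the per-server rate of roughly 500,000 entries/second is an assumption chosen so the numbers line up with the AWB benchmark figure above (about 100 M entries per second on 1,000 machines):

  public class IngestModel {
    // A = 0.85^(log2 N) * N * S
    static double aggregateRate(double machines, double singleServerRate) {
      double log2N = Math.log(machines) / Math.log(2);
      return Math.pow(0.85, log2N) * machines * singleServerRate;
    }

    // N = 2^(log2(A / S) / 0.7655347)
    static double machinesRequired(double targetRate, double singleServerRate) {
      double log2Ratio = Math.log(targetRate / singleServerRate) / Math.log(2);
      return Math.pow(2, log2Ratio / 0.7655347);
    }

    public static void main(String[] args) {
      double s = 500_000;                                  // assumed single-server entries/second
      System.out.println(aggregateRate(1000, s));          // ~9.9e7, close to the 100 M/s AWB figure
      System.out.println(machinesRequired(1.0e8, s));      // ~1,000 machines to reach 100 M entries/second
    }
  }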

Predicted Cluster Sizes

[Chart: predicted number of machines (0 to 12,000) vs. target ingest rate in millions of entries per second (0 to 600)]

100 Machines (10^2)

Multiple racks

10 TB RAM

100 TB - 1PB Disk

Expect some hardware failures in the first week (burn-in)

Expect about 3 failed hard drives in the first 3 months, and another 4 within the first year

http://static.googleusercontent.com/media/research.google.com/en/us/archive/disk_failures.pdf

Can process the 1000 Genomes data set

260 TB

www.1000genomes.org

Can store and index the Common Crawl Corpus


2.8 Billion web pages 541 TB

commoncrawl.org

One year of Twitter: 182 billion tweets

483 TB

http://www.sec.gov/Archives/edgar/data/1418091/000119312513390321/d564001ds1.htm

Deploying an Application

[Diagram: Users -> Clients -> Tablet Servers]

May not see the effect of writing to disk for a while

1000 machines (10^3)

Multiple rows of racks

100 TB RAM

1-10 PB Disk

Hardware failure is a regular occurrence

Hard drive failure about every 5 days (average).

Failures will be skewed towards the beginning of the year

Can traverse the ‘brain graph’ 70 trillion edges, 1 PB

http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

Facebook Graph 1s of PB

http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf

Netflix Video Master Copies 3.14 PB

http://www.businessweek.com/articles/2013-05-09/netflix-reed-hastings-survive-missteps-to-join-silicon-valleys-elite

World of Warcraft Backend Storage 1.3 PB

http://www.datacenterknowledge.com/archives/2009/11/25/wows-back-end-10-data-centers-75000-cores/

Webpages, live on the Internet 14.3 Trillion

http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html

Things like the difference between two compression algorithms start to make a big difference

Use range compactions to effect changes on portions of a table

Lay off ZooKeeper

Watch Garbage Collector and NameNode ops

Garbage Collection > 5 minutes?

Start thinking about NameNode Federation

Accumulo 1.6

Multiple NameNodes

[Diagram: Accumulo over two NameNodes, each with its own DataNodes - multiple HDFS clusters]

Multiple NameNodes

[Diagram: Accumulo over multiple NameNodes sharing one set of DataNodes - Federation, requires Hadoop 2.0]

More NameNodes = higher risk of one going down

Can use HA NameNodes in conjunction with Federation

10,000 machines (10^4)

You, my friend, are here to kick a** and chew bubble gum

1 PB RAM

10-100 PB Disk

1 hardware failure every hour on average

Entire Internet Archive 15 PB

http://www.motherjones.com/media/2014/05/internet-archive-wayback-machine-brewster-kahle

A year’s worth of data from the Large Hadron Collider

15 PB

http://home.web.cern.ch/about/computing

0.1% of all Internet traffic in 2013 43.6 PB

http://www.factshunt.com/2014/01/total-number-of-websites-size-of.html

Facebook Messaging Data 10s of PB

http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf

Facebook Photos 240 billion

High 10s of PB

http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf

Must use multiple NameNodes

Can tune back heartbeats and the periodicity of central processes in general

Can combine multiple PB data sets

Up to 10 quadrillion entries in a single table

While maintaining sub-second lookup times

Only with Accumulo 1.6

Dealing with data over time

Data Over Time - Patterns

• Initial Load

• Increasing Velocity

• Focus on Recency

• Historical Summaries

Initial Load

• Get a pile of old data into Accumulo fast

• Latency not important (data is old)

• Throughput critical

Bulk Load RFiles

Bulk Loading

[Diagram: MapReduce job writes RFiles, which are bulk-imported into Accumulo]
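A sketch of the final import step, assuming RFiles have already been written to HDFS by a MapReduce job using AccumuloFileOutputFormat; the table name and directory paths here are placeholders:

  import org.apache.accumulo.core.client.Connector;

  public class BulkImport {
    // hand a directory of pre-sorted RFiles to the tablet servers; bypasses the write path entirely
    static void importRFiles(Connector conn) throws Exception {
      conn.tableOperations().importDirectory(
          "events",              // placeholder destination table
          "/bulk/rfiles",        // placeholder HDFS directory containing the RFiles
          "/bulk/failures",      // placeholder HDFS directory for files that could not be imported
          false);                // false: keep the timestamps already written into the RFiles
    }
  }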

Increasing velocity

If your data isn’t big today, wait a little while

Accumulo scales up dynamically, online. No downtime

This is the first sense of ‘scale’: to change the size of something

Scaling Up

[Diagram: Clients, Accumulo, HDFS]

3 physical servers, each running a Tablet Server process and a DataNode process

Scaling Up

Start 3 new Tablet Server processes and 3 new DataNode processes

Scaling Up

The Master immediately assigns tablets

Scaling Up

Clients immediately begin querying the new Tablet Servers

Scaling Up

New Tablet Servers read data from the old DataNodes

Scaling Up

New Tablet Servers write data to the new DataNodes

Never really seen anyone do this

Except myself

20 machines in Amazon EC2

to 400 machines

all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back

Scaled back down to 20 machines when done

Just killed Tablet Servers

Decommissioned DataNodes for safe data consolidation to the remaining 20 nodes

Other ways to go from 10^x to 10^(x+1)

Accumulo Table Export

followed by HDFS DistCP to new cluster

Maybe new replication feature

Newer Data is Read more Often

Accumulo keeps newly written data in memory

Block Cache can keep recently queried data in memory
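Data block caching is off by default and can be turned on per table; a minimal sketch, assuming the standard table property name and a placeholder table:

  import org.apache.accumulo.core.client.Connector;

  public class EnableBlockCache {
    // cache recently read data blocks in tablet server memory for this table
    static void enable(Connector conn) throws Exception {
      conn.tableOperations().setProperty("users", "table.cache.block.enable", "true");
    }
  }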

Combining Iterators make maintaining summaries of large amounts of raw events easy

Reduces storage burden

Historical Summaries

[Chart: unique entities stored vs. raw events processed, by month, April through July]

Age-off iterator can automatically remove data over a certain age
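A sketch of attaching the built-in AgeOffFilter; the 90-day TTL, iterator priority/name, and table name are assumptions for this example, and the "ttl" option is given in milliseconds:

  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.IteratorSetting;
  import org.apache.accumulo.core.iterators.user.AgeOffFilter;

  public class AgeOff {
    // drop entries older than ~90 days at scan and compaction time
    static void attach(Connector conn) throws Exception {
      IteratorSetting setting = new IteratorSetting(25, "ageoff", AgeOffFilter.class);
      setting.addOption("ttl", Long.toString(90L * 24 * 60 * 60 * 1000));   // TTL in milliseconds
      conn.tableOperations().attachIterator("events", setting);             // placeholder table
    }
  }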

IBM estimates 2.5 exabytes of data is created every day

http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

90% of available data was created in the last 2 years

http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

25 new 10k node Accumulo clusters per day

Accumulo is doing its part to get in front of the big data trend

Questions ?

@aaroncordova
