Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]

Building Bulk Import Eric Newton SW Complete, Inc.


Post on 15-Jul-2015


Page 1

Building Bulk Import
Eric Newton

SW Complete, Inc.

Page 2

Ingest 101

The life of a mutation:
○ Send to server
○ Write to Write-Ahead Log
○ Store in memory
○ Write memory to a file
○ Merge, re-write as needed

At least two writes, small in-memory sort
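
The write path above can be sketched as a toy model. This is not Accumulo code: the class name, the flush threshold, and the write counter are assumptions made for illustration. It only shows why every entry is written at least twice (log, then file) and why the sorts are small.

```python
class ToyWritePath:
    """Toy model of the mutation life cycle (hypothetical, not Accumulo's implementation)."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # write-ahead log: append-only, unsorted
        self.memory = {}       # in-memory map, sorted only when flushed
        self.files = []        # each file is a sorted list of (key, value)
        self.flush_threshold = flush_threshold
        self.writes = 0        # number of entry writes to "disk"

    def apply(self, key, value):
        self.wal.append((key, value))      # first write: the log
        self.writes += 1
        self.memory[key] = value           # store in memory
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        data = sorted(self.memory.items()) # small in-memory sort
        self.files.append(data)            # second write: the file
        self.writes += len(data)
        self.memory.clear()

    def compact(self):
        merged = {}
        for f in self.files:               # later files win on duplicate keys
            merged.update(f)
        data = sorted(merged.items())
        self.files = [data]                # merge and re-write
        self.writes += len(data)
```

With four mutations and a threshold of two, the model performs eight entry writes before any compaction, and twelve after one merge.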

Page 3

Bulk Ingest

● Accumulo: heavy ingest
● Use Map-Reduce efficiency
● Pre-sort incoming data
● Hand whole sorted files to Accumulo
● One write
● Larger sorts
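
A minimal sketch of the pre-sort step, assuming the incoming data is a list of (key, value) pairs and the table's split points are known up front. The function name is hypothetical, and a real job would do this in the Map-Reduce partitioner and reducers rather than in one process. It assumes each tablet range includes its end row.

```python
import bisect

def partition_sorted(records, split_points):
    """Sort records once, then cut them into one sorted run per tablet range."""
    runs = [[] for _ in range(len(split_points) + 1)]
    for key, value in sorted(records):
        # bisect_left sends a key equal to a split point to the range that ends at it
        runs[bisect.bisect_left(split_points, key)].append((key, value))
    return runs
```

Each run is already sorted and confined to one range, so it can be written once and handed over whole.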

Page 4

Version 1

importDirectory(String dir, String failDir)

Client computes the servers that need the file:
■ Analysis of file
■ Moves directory under Accumulo
■ Retry logic (to handle splits, failures)
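
The file analysis can be sketched as a range lookup, assuming the client knows a file's first and last key and the current split points. The function name and the tablet-index representation are assumptions for illustration, not the client's actual code.

```python
import bisect

def tablets_for_file(first_key, last_key, split_points):
    """Return the indexes of the tablets whose ranges overlap [first_key, last_key]."""
    lo = bisect.bisect_left(split_points, first_key)
    hi = bisect.bisect_left(split_points, last_key)
    return list(range(lo, hi + 1))
```

If a tablet splits between the analysis and the assignment, the answer changes, which is why the retry logic has to recompute it: a file covering "a" to "f" maps to one tablet with splits ["g", "p"], but to two tablets once "c" is added as a split.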

Page 5

Version 1: problems

● Limited to the client’s computational power
● Permission: client had to be all-knowing
● Files could be added to servers many times
● Defer file collection while bulk importing
● Clients can fail

Page 6

Version 1.1

● Clients hand bulk imports to the master
● Fixed permission problems
● Added a bulk import test to the Random Walk test suite

Page 7

The Test

Create a sorted file with a lot of 1’s:
12345678 -> 1

Create an identical file, with lots of -1’s:
12345678 -> -1

Add a summation iterator over the table
Verify: every entry should be zero:
12345678 -> 0
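
The check can be sketched without a cluster. The functions below stand in for the summation iterator and the verification scan; their names are hypothetical, and files are modeled as plain lists of (key, value) pairs.

```python
from collections import defaultdict

def summing_scan(files):
    """Sum the values of every key across all imported files, like a summing iterator would."""
    totals = defaultdict(int)
    for f in files:
        for key, value in f:
            totals[key] += value
    return dict(sorted(totals.items()))

def nonzero_keys(files):
    """Keys whose total is not zero: evidence of a lost or repeated import."""
    return [k for k, total in summing_scan(files).items() if total != 0]
```

A file that was imported twice, or never, leaves a non-zero total on every key it contains, so a single scan exposes the inconsistency.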

Page 8

Random Walk 101

Randomly:
○ Import files in random order
○ Split the files into random sizes
○ Split tablets at random points
○ Kill tablet servers (agitate)

Under load
At scale

Page 9

Version 2.0

● Master only coordinates file processing
● Distribute work to tablet servers
● Master distributes files to tablet servers
● Tablet servers:
○ Analyze files for assignments
○ Retry
○ Communicate with destination tablet servers

Page 10

Command Flow

[Diagram: Client, Master, and four TabletServer boxes; the label "Files" appears three times]

Page 11

Problems

Problems solved:
○ Permissions controlled by Master
○ File processing is distributed

Bulk import tested heavily

Consistency
○ Not so much

Page 12

Version 3.0

Problem: file imported more than once
○ Repeated reloading
○ RPC timeouts
○ Tablet migration
○ Tablet split

Add flags to metadata table to prevent:
○ file garbage collection
○ repeated imports
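
A minimal sketch of the marker idea, assuming one "loaded" flag per bulk file in a tablet's metadata. The class and method names are hypothetical and do not reflect the real metadata schema.

```python
class ToyTabletMetadata:
    """Toy metadata entry for one tablet (hypothetical layout)."""

    def __init__(self):
        self.files = []       # files the tablet currently reads
        self.loaded = set()   # markers: bulk files this tablet has already accepted

    def bulk_import(self, path):
        if path in self.loaded:   # marker present: refuse the repeat
            return False
        self.loaded.add(path)
        self.files.append(path)
        return True

    def compact(self, output):
        # Compaction replaces the data files but leaves the markers in place,
        # so a late retry is still recognized after the file itself is gone.
        self.files = [output]
```

The marker outlives the file reference, which is what makes a retry after a timeout, migration, or split harmless in this model.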

Page 13

Version 3.0

● Problem Solved
○ Reduced Name Node ops
○ Reduced trash lying around from failed imports
○ Imports not repeated

● Does it stand up to the Random Walk Test?
○ Not so much

Page 14

Death by Slow Thread

“Please take this file”
“OK… looks good, never saw this before”
Sleep

“Hey, Please take this file, again”
“OK… looks good, never saw this before”
“Thanks!”
Compact!
Wakeup! Time to import that file from the 1st request!
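
The exchange above is a check-then-act race. The sketch below replays it step by step in a single thread so the outcome is deterministic; the tablet class is a hypothetical toy, and the interleaving is written out by hand rather than produced by real threads.

```python
class RacyTablet:
    """Toy tablet with an unsafe check-then-act import (hypothetical)."""

    def __init__(self):
        self.files = []     # bulk files the tablet currently references
        self.entries = []   # data that scans would see after compaction

    def never_seen(self, path):
        return path not in self.files

    def add_file(self, path):
        self.files.append(path)

    def compact(self, contents):
        # Fold every referenced file into the tablet's data and drop the references.
        for path in self.files:
            self.entries.extend(contents[path])
        self.files = []


def slow_thread_scenario():
    contents = {"f1": [("row", 1)]}
    tablet = RacyTablet()
    first_ok = tablet.never_seen("f1")    # request 1 checks, then its thread stalls
    retry_ok = tablet.never_seen("f1")    # the retry runs the same check
    if retry_ok:
        tablet.add_file("f1")             # retry completes: "Thanks!"
    tablet.compact(contents)              # "Compact!"
    if first_ok:
        tablet.add_file("f1")             # "Wakeup!": acts on a stale check
    tablet.compact(contents)
    return tablet.entries
```

The scenario returns the same row twice: the slow thread's check was valid when it ran, but not when the thread finally acted on it.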

Page 15

ZooKeeper to the Rescue

Add a 3rd party negotiator
● Define a session
● Add a file only while session is active
● Store session in ZooKeeper
● Get agreement about the session at each critical point, including metadata table updates

Session guides clean-up of markers
Session closes only after all agree
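
A sketch of the session idea, with an in-memory object standing in for the state kept in ZooKeeper. All names here are hypothetical, and the real arbitration protocol has more steps. The point shown is only that a session cannot close while a worker is registered, and that no file is added once it has closed.

```python
class ToyArbitrator:
    """In-memory stand-in for session state stored in ZooKeeper (hypothetical)."""

    def __init__(self):
        self.active = set()
        self.working = {}   # session id -> number of workers still in flight

    def start(self, session):
        self.active.add(session)
        self.working[session] = 0

    def begin_work(self, session):
        if session not in self.active:
            return False                 # "What's session 123?"
        self.working[session] += 1
        return True

    def end_work(self, session):
        self.working[session] -= 1

    def is_active(self, session):
        return session in self.active

    def stop(self, session):
        if self.working.get(session, 0) > 0:
            return False                 # somebody is still processing
        self.active.discard(session)
        return True


def import_file(arbitrator, session, tablet_files, path):
    """Add a file only while the session is active; re-check right before the update."""
    if not arbitrator.begin_work(session):
        return False
    try:
        if path not in tablet_files and arbitrator.is_active(session):
            tablet_files.append(path)
        return True
    finally:
        arbitrator.end_work(session)
```

A stalled worker that called `begin_work` blocks `stop` until it finishes, and a request that arrives after `stop` succeeds is refused, so the late import from the slow-thread case cannot happen in this model.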

Page 16

Session

“Take this file, for session 123”
“OK, working on session 123”
“Never saw this file before”
Sleep

“Repeat, file for session 123”
“Never saw this file before”

“Anybody working on session 123?”

Page 17

Session

“Yes, I’m processing session 123!”
“OK, finish up.”

Double check on session 123, import file

“Anybody working on session 123?”
“What’s session 123?”

Remove markers in metadata

Page 18

Bulk Import

● Problem solved:
○ Distributed processing
○ Permissions
○ Files imported once, and only once
○ Markers are cleaned up in the face of failures

● Performance
○ Not so much

Page 19

Master as Bottleneck

Bulk import thousands of files
Every 15 minutes
Master renames files and puts them under /accumulo

Bottleneck
One master competes for NN ops with N tablet servers

Page 20

Master, Go Faster

Add configurable thread pool
Push more move requests to the NN
Compete more fiercely for resources
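
A rough sketch of the thread-pool idea, using local file renames as a stand-in for Name Node move requests. The function name and the `threads` parameter are assumptions; the real master issues HDFS operations and takes its pool size from configuration.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def move_files(paths, dest_dir, threads=4):
    """Rename each file into dest_dir, issuing up to `threads` moves at once."""
    def move(path):
        target = os.path.join(dest_dir, os.path.basename(path))
        os.rename(path, target)   # stand-in for a Name Node rename request
        return target

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order and re-raises the first failure, if any
        return list(pool.map(move, paths))
```

Raising `threads` keeps more requests in flight at once, which is the "compete more fiercely" part: the work per file is unchanged, only the concurrency grows.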

Page 21

Bulk Import

More efficient data ingest
Easy: just a file to the right tablets
Hard: consistency in a distributed system
Testing is your friend
Nothing prepares you for large-scale problems!