accumulo summit 2015: event-driven big data with accumulo - leveraging big data in motion...

Post on 15-Jul-2015


Accumulo Summit - 4/28/2015

Event-Driven Big Data with Accumulo

Leveraging Big Data in Motion…

John Hebeler, Lockheed Martin Inc.

john.w.hebeler@lmco.com

“It is a capital mistake to theorize before one has data.” Sherlock Holmes

Plan…

✴ Brief Event-Driven Overview

✴ Accumulo Event Management

✴ Demonstration/Access to EC2

Events

❖ Events drive our world - it is our context

❖ Data processing often reflects these events but with batch latency, poor resolution, longitudinal conflicts, and pull-type architectures

❖ If you don’t ask - no one hears…

❖ Event consequences are delayed and possibly lost

❖ Especially true “In Context” with related events

❖ Time plays a critical factor - before, after, simultaneous…

❖ Focus on Accumulo Role and Implementation


Event-Driven Architecture

❖ Events drive to consequences

❖ Multiple Levels/Iterations

❖ Clients (or downstream events) analyze the consequences in near real-time

❖ Stateless except for Big Data (Accumulo) which makes it possible!

❖ Resolution, Fidelity, Query, …

Accumulo Data Model

❖ Decomposable, Flexible Key

❖ Lexicographical Index (only) from Row ID

❖ Family and Qualifier can be “Columns” or Row/Key “Enrichment”

❖ Visibility controls flexible cell-level “security”

❖ Timestamp usually automatic and allows “versions”

❖ Value

❖ Anything but not really “searchable”

❖ Any above can be quite huge

❖ Atomic only at Row Level

Key: Row ID | Column (Family, Qualifier, Visibility) | Timestamp

Value
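The key layout above determines sort order, which is what makes the Row ID act as the index. As a rough illustration, the sketch below models a key as a plain Java class and sorts it field by field, with newer timestamps first. SketchKey and its comparator are made up for this example and are not the Accumulo client classes; real keys compare bytes rather than Java strings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an Accumulo key; not org.apache.accumulo.core.data.Key.
final class SketchKey {
    final String row, family, qualifier, visibility;
    final long timestamp;

    SketchKey(String row, String family, String qualifier, String visibility, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.visibility = visibility;
        this.timestamp = timestamp;
    }

    // Fields compare left to right; timestamps sort newest first,
    // so the latest version of a cell is the first one a reader meets.
    static final Comparator<SketchKey> ORDER = Comparator
            .comparing((SketchKey k) -> k.row)
            .thenComparing(k -> k.family)
            .thenComparing(k -> k.qualifier)
            .thenComparing(k -> k.visibility)
            .thenComparing((a, b) -> Long.compare(b.timestamp, a.timestamp));

    // Returns a sorted copy and leaves the input untouched.
    static List<SketchKey> sorted(List<SketchKey> keys) {
        List<SketchKey> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        return copy;
    }
}
```

Because the row comes first in the comparison, everything that shares a Row ID ends up adjacent, which is the property the later slides rely on.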

Events and Context

❖ Store events for easy retrieval

❖ Events continue to grow; Context reaches steady state

❖ Proper interpretation of an event within its context

❖ Idempotence


Categories

1. Direct Accumulo Operations

2. Event Programming

3. Event Management with Accumulo

Direct Accumulo Operations

Query

❖ Key constructs - packed fields vs. column based - your choice

❖ The lexicographical index is the only index (an additional index is another word for building a new table)

❖ A scan for “a” finds “a.a.a.b” (prefix match)

❖ Not usually practical to search in the Value

❖ Query for the past values (versions)

❖ Time

ArrayList<Range> ranges = new ArrayList<Range>();
// Populate ranges
BatchScanner bs = conn.createBatchScanner(table,… );
bs.setRanges(ranges);

TableOperations to = conn.tableOperations();
// N is the number of versions to keep, passed as a String
to.setProperty(tableName, "table.iterator.scan.vers.opt.maxVersions", N);
to.setProperty(tableName, "table.iterator.majc.vers.opt.maxVersions", N);
to.setProperty(tableName, "table.iterator.minc.vers.opt.maxVersions", N);

[Table: RowID | Family | Qualifier | Value]
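The prefix behaviour on this slide falls out of sorted keys: seek to the prefix, then read until keys stop matching. The sketch below illustrates that with a TreeMap standing in for a table. PrefixScanSketch and scanPrefix are names invented here; with the real client one would more likely build a Range for the prefix and hand it to a scanner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// A TreeMap stands in for a sorted table; this is an illustration, not Accumulo client code.
final class PrefixScanSketch {
    // Returns every row ID that starts with the prefix, in sorted order.
    static List<String> scanPrefix(TreeMap<String, String> table, String prefix) {
        List<String> hits = new ArrayList<>();
        // Seek to the first key >= prefix, then read forward.
        for (String rowId : table.tailMap(prefix, true).keySet()) {
            if (!rowId.startsWith(prefix)) {
                break; // sorted order guarantees no later key can match
            }
            hits.add(rowId);
        }
        return hits;
    }
}
```

Anything that is not a prefix of the Row ID (a suffix, a substring, a field inside the Value) cannot be found this way, which is why a second access path usually means a second table.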

Event Update

❖ Store events for easy retrieval

❖ Maintain context surrounding the event

❖ Write with same key - updates value

RowID | Family | Qualifier | Value
EventID1 | EventID2 | EventID3 | Event*

* JSON or Serialized Object
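A small model of "write with same key - updates value" may help show why this makes event writes idempotent. The sketch keeps a sorted map keyed by the three event IDs; EventStoreSketch and its methods are hypothetical, and it ignores timestamps and versions, which a real table would also track.

```java
import java.util.TreeMap;

// Hypothetical in-memory model of "write with same key updates value".
final class EventStoreSketch {
    private final TreeMap<String, String> table = new TreeMap<>();

    // Packs the three IDs into one key, mirroring RowID / Family / Qualifier.
    static String key(String id1, String id2, String id3) {
        return id1 + '\0' + id2 + '\0' + id3;
    }

    // Writing the same key again replaces the value instead of adding an entry.
    void write(String id1, String id2, String id3, String eventJson) {
        table.put(key(id1, id2, id3), eventJson);
    }

    String read(String id1, String id2, String id3) {
        return table.get(key(id1, id2, id3));
    }

    int size() {
        return table.size();
    }
}
```

Replaying the same event twice leaves one entry behind, so a retried or duplicated delivery does not need special handling.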

Event Cursor

❖ Accumulo Cursor automatically buffers responses to conserve memory

❖ Events constructed directly from an Accumulo row do not

❖ If not careful, out of memory exceptions (especially true in big data)

class EventCursor {
    private final Iterator<Map.Entry<Key, Value>> rowIterator;

    public EventCursor(Scanner s) {
        rowIterator = s.iterator();
    }

    // Convert one entry at a time so the full result set is never held in memory
    public Event next() {
        return row2Event(rowIterator.next());
    }
}

A Word About Accumulo Visibility…

❖ Different

❖ (part of the key)

Event Programming

Exception based Programming

❖ Don’t ask for permission but plan for exceptions…

❖ Faster and more efficient

❖ Program to expect that they won’t happen and if they do, handle it

❖ Watch out for thread contention - can use Lock

// Optional - openLock.lock();
while (true) {
    try {
        wr = aClient.createBatchWriter(EVENT_CONTEXT_TABLE, new BatchWriterConfig());
        break;
    } catch (TableNotFoundException e) {
        // Create Table and retry - also need to catch TableExistsException
        aClient.tableOperations().create(EVENT_CONTEXT_TABLE);
    }
}
// Optional - openLock.unlock();
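The comment in the slide code notes that the create call can itself fail if another thread wins the race. The sketch below shows one way the full try-first, repair-on-exception loop could look. RetrySketch, TableMissing and TableExists are stand-ins written for this example and are not Accumulo classes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for a table store; none of these are Accumulo classes.
final class RetrySketch {
    static final class TableMissing extends Exception {}
    static final class TableExists extends Exception {}

    private final Set<String> tables = ConcurrentHashMap.newKeySet();

    String openWriter(String table) throws TableMissing {
        if (!tables.contains(table)) throw new TableMissing();
        return "writer:" + table;
    }

    void create(String table) throws TableExists {
        if (!tables.add(table)) throw new TableExists();
    }

    // Try first; create the table only when the attempt fails, then retry.
    String openOrCreate(String table) {
        while (true) {
            try {
                return openWriter(table);
            } catch (TableMissing e) {
                try {
                    create(table);
                } catch (TableExists raced) {
                    // Another thread created it between our attempt and our create; just retry.
                }
            }
        }
    }
}
```

The common path costs a single call with no existence check, and the rare path repairs itself without a lock.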

Avoid Transactions

❖ Big data transactions expensive (and difficult)

❖ Make the need rare and solution lazy

❖ Distributed partial state dilemma

Appending to or updating a single row does not require formal transactions

Race conditions: recognize and repair lazily

Accumulo only ensures row-level atomicity (still valuable, since each field can hold a lot of data)

Event conclusions too close in time are just reprocessed or properly thread bundled


Progressive Provenience

❖ Retrieve origin of event combinations

❖ Maintain context surrounding the event

❖ Use same key in different tables for rapid traversal
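One possible reading of "same key in different tables" is sketched below: an events table and a sources table share the event ID, so walking back to the originating events is a chain of direct lookups. ProvenanceSketch, the two maps and the method names are assumptions made for illustration, not the speaker's schema.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ProvenanceSketch {
    // Both "tables" share the event ID as key, so one lookup hops from an event to its sources.
    final Map<String, String> events = new HashMap<>();
    final Map<String, List<String>> sources = new HashMap<>();

    void record(String id, String payload, List<String> derivedFrom) {
        events.put(id, payload);
        sources.put(id, new ArrayList<>(derivedFrom));
    }

    // Walks back through the sources table and returns the root events (those with no sources).
    Set<String> origins(String id) {
        Set<String> roots = new LinkedHashSet<>();
        collect(id, roots, new LinkedHashSet<>());
        return roots;
    }

    private void collect(String id, Set<String> roots, Set<String> seen) {
        if (!seen.add(id)) return; // guard against cycles and repeated visits
        List<String> parents = sources.getOrDefault(id, Collections.emptyList());
        if (parents.isEmpty()) {
            roots.add(id);
            return;
        }
        for (String parent : parents) collect(parent, roots, seen);
    }
}
```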

Test Events

❖ Test Flag allows In-Stream Test and Validation

❖ Availability

❖ Performance

❖ Quality

❖ What Ifs

❖ Flag indicates different storage table, queues, …
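As a rough sketch of the last bullet, a test flag can be reduced to a routing decision at write time, with the rest of the pipeline unchanged. TestFlagRouter and both table names below are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TestFlagRouter {
    // Table names are made up for this sketch.
    static final String LIVE_TABLE = "events";
    static final String TEST_TABLE = "events_test";

    private final Map<String, List<String>> tables = new HashMap<>();

    static String tableFor(boolean testFlag) {
        return testFlag ? TEST_TABLE : LIVE_TABLE;
    }

    // Same code path for both kinds of event; only the destination differs.
    void store(String event, boolean testFlag) {
        tables.computeIfAbsent(tableFor(testFlag), t -> new ArrayList<>()).add(event);
    }

    List<String> contents(String table) {
        return tables.getOrDefault(table, new ArrayList<>());
    }
}
```

Test events then exercise the same code as live traffic without ever landing in the live table.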

Event Management with Accumulo

Turning an Event Off

❖ Event assertion no longer supported (but was)


Forgetting an Event (Error)

❖ Store events for easy retrieval

❖ Maintain context surrounding the event


Time Travel

❖ Rerun (Time) Events due to corrupted data, out-of-order events, event error, event correction, or “what if” scenarios

❖ Develop context surrounding the event

❖ Remixing the cake

** Need to run Topic X again since last October due to an error back then

// Collect all events for Topic X since October (already in time order)

// Clear Topic X Context

// Rerun collected events in order (all corrected now!)

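The three commented steps above could be sketched roughly as follows, assuming events are keyed by time so a range read returns them in order. ReplaySketch, the upper-casing stand-in for "processing", and the choice to clear only context from the cut-off onward are all assumptions for this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class ReplaySketch {
    // Event log keyed by timestamp, so a range read returns events already in time order.
    final TreeMap<Long, String> eventLog = new TreeMap<>();
    // Context derived from events, keyed by the timestamp of the event that produced it.
    final TreeMap<Long, String> context = new TreeMap<>();

    // Stand-in for real event processing: derive some state from the event.
    void process(long timestamp, String event) {
        context.put(timestamp, event.toUpperCase());
    }

    // Clear context from 'since' onward, then rerun the stored events in order.
    void rerunSince(long since) {
        context.tailMap(since, true).clear();
        for (Map.Entry<Long, String> e : eventLog.tailMap(since, true).entrySet()) {
            process(e.getKey(), e.getValue());
        }
    }
}
```

Since the event log is the source of truth, correcting a stored event and rerunning is enough to bring the context back in line.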

Future Events

❖ Future Events (Expiring State, Travel Plans, …)

❖ May not happen or change…


❖ Store event as always

❖ Schedule timer (or interval timer) to ignite future events

❖ Events easily removed due to update, timer finds nothing

❖ Requires careful consideration of index/RowId
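A minimal sketch of the store-then-timer idea: the future event is written like any other, and the timer callback simply looks it up when the time comes. FutureEventSketch and its method names are hypothetical, and the timer itself is left out; something like a ScheduledExecutorService could call onTimer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FutureEventSketch {
    private final Map<String, String> store = new HashMap<>(); // stand-in for the event table
    final List<String> fired = new ArrayList<>();

    // Store the future event under its row ID, as with any other event.
    void schedule(String rowId, String event) {
        store.put(rowId, event);
    }

    // Plans changed: removing the row is all it takes to cancel.
    void cancel(String rowId) {
        store.remove(rowId);
    }

    // Called by whatever timer is in use when the due time arrives.
    void onTimer(String rowId) {
        String event = store.remove(rowId);
        if (event == null) return; // removed in the meantime: the timer finds nothing
        fired.add(event);
    }
}
```

The timer only needs to know the row ID, which is why the choice of index/RowId deserves the care the last bullet asks for.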

Extra Extra

❖ Analytics

❖ Events create a rich foundation for longitudinal analytics - but must consider the data model for efficient queries (proper indexing)

❖ Backup/Recovery

❖ Take advantage of Accumulo clone and pause processing

❖ Hybrid Systems

❖ Semantic Web

❖ Related NoSQL - MongoDB and Neo4J

❖ Map Reduce

❖ Gotcha

❖ Accumulo built upon Hadoop, Zookeeper…

Follow Up

❖ Email for EC2 Accumulo and event-driven prototype

❖ john.w.hebeler@lmco.com

❖ Questions any time

❖ Play - free micro computer one year
