Flume and HBase


Upload: alexander-alten-lorenz

Post on 11-May-2015


Page 1: Flume and HBase

Buzzwords Berlin HBase Hackathon, June 2012

Apache Flume and HBase
Alexander Alten-Lorenz | Customer Operations Engineer

Page 2: Flume and HBase

©2012 Cloudera, Inc. All Rights Reserved.

About Me

• COPS Engineer @ Cloudera
• Apache Flume Contributor
• Working with Hadoop since 2009
• Blogger (mapredit.blogspot.com)
• Speaker at Conferences / Meetups / Tooling Events

Page 3: Flume and HBase

Flume 1.x

• Mass event collector
• Stream data (events, not files) from clients to sinks
• Clients: files, syslog, avro, seq, exec
• Sinks: HDFS files, HBase, …
• Configurable routing / topology

Page 4: Flume and HBase

Architecture

Component  Function
Agent      The JVM running Flume. One per machine. Runs many sources and sinks.
Client     Produces data in the form of events. Runs in a separate thread.
Sink       Receives events from a channel. Runs in a separate thread.
Channel    Connects sources to sinks (like a queue). Implements the reliability semantics.
Event      A single datum: a log record, an Avro object, etc. Normally around 4 KB.
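Because the channel carries the reliability semantics, the choice of channel type decides what happens to queued events when an agent stops. A minimal sketch, assuming an agent named host1 as in the configuration example of this deck; the channel name ch2 and the directory paths are hypothetical:

```properties
host1.channels = ch1 ch2

# Memory channel: fast, but queued events are lost if the agent JVM stops.
host1.channels.ch1.type = memory

# File channel: events are persisted to local disk and survive an agent restart.
# Both directory paths are illustrative assumptions.
host1.channels.ch2.type = file
host1.channels.ch2.checkpointDir = /var/lib/flume/checkpoint
host1.channels.ch2.dataDirs = /var/lib/flume/data
```

A sink is bound to exactly one channel, so switching a sink from ch1 to ch2 is a one-line change in its channel property.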

Page 5: Flume and HBase

Agent

• Runs many clients and sinks
• Java properties-based configuration
• Low overhead (-Xmx20m)
  – Adding RAM increases performance
  – Setting -Xms avoids on-demand memory allocation
  – Batching increases performance dramatically
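The batching point can be illustrated with the properties that control how many events move per transaction. This is only a sketch: the agent, channel, and sink names are taken from the configuration example of this deck, and the numbers are illustrative assumptions, not tuned recommendations.

```properties
host1.channels.ch1.type = memory
# Maximum number of events held in the channel.
host1.channels.ch1.capacity = 10000
# Maximum number of events per put/take transaction.
host1.channels.ch1.transactionCapacity = 1000

# Number of events the HBase sink takes from the channel per transaction.
# Keep it at or below the channel's transactionCapacity.
host1.sinks.sink1.batchSize = 1000
```

Larger batches mean fewer transactions and fewer round trips to HBase, at the price of more events held in memory at once.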

Page 6: Flume and HBase

Sources

• Plugin interface
• Managed by a SourceRunner that controls threading and execution model (e.g. polling vs. event-based)
• Included: exec, avro, syslog, seq
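As a rough illustration of the included source types, the fragment below declares one source of each kind on a single agent. The source names, ports, and the log file path are hypothetical; the agent name host1 and channel ch1 follow the configuration example of this deck.

```properties
host1.sources = tail1 avro1 syslog1 seq1

# exec: runs a command and turns each line of its output into an event.
host1.sources.tail1.type = exec
host1.sources.tail1.command = tail -F /var/log/app.log
host1.sources.tail1.channels = ch1

# avro: listens for events sent over Avro RPC.
host1.sources.avro1.type = avro
host1.sources.avro1.bind = 0.0.0.0
host1.sources.avro1.port = 41414
host1.sources.avro1.channels = ch1

# syslogtcp: receives syslog messages over TCP.
host1.sources.syslog1.type = syslogtcp
host1.sources.syslog1.host = localhost
host1.sources.syslog1.port = 5140
host1.sources.syslog1.channels = ch1

# seq: sequence generator that emits a running counter, useful for testing.
host1.sources.seq1.type = seq
host1.sources.seq1.channels = ch1
```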

Page 7: Flume and HBase

HBase sink

ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/

HBaseSink.java
HbaseEventSerializer.java
SimpleHbaseEventSerializer.java
SimpleRowKeyGenerator.java

Page 8: Flume and HBase

HBaseSink.java

• Controls flush()
• Uses the serializer
• Controls the transaction
• Controls rollbacks (in case events could not be written)

Page 9: Flume and HBase

Configuration

• Source: seq interface
• Listening on a defined port @localhost
• Serializer needs some parameters
• Column family and column must be known
• Valid hbase-site.xml in $CLASSPATH

Page 10: Flume and HBase

Configuration Example


host1.sources = src1
host1.sinks = sink1
host1.channels = ch1

host1.sources.src1.type = seq
host1.sources.src1.port = 25001
host1.sources.src1.bind = localhost
host1.sources.src1.channels = ch1

host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.sinks.sink1.table = test3
host1.sinks.sink1.columnFamily = testing
host1.sinks.sink1.column = foo
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
host1.sinks.sink1.serializer.payloadColumn = pcol
host1.sinks.sink1.serializer.incrementColumn = icol

host1.channels.ch1.type = memory

Page 11: Flume and HBase

Take Away

• Flume collects events
• Source - Channel - Sink concept
• HBase sink needs a serializer interface
• Column family and column must be known
