Kafka for DBAs
1© Cloudera, Inc. All rights reserved.
Apache Kafka for Oracle DBAs
What is Kafka
Why should you care
How to learn Kafka
• Oracle DBA
• Turned Oracle Consultant
• Turned Hadoop Solutions Architect
• Turned Developer
Committer on Apache Sqoop
Contributor to Apache Kafka and Apache Flume
About me
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
An Optical Illusion
• Redo log as an abstraction
• How redo logs are useful
• Pub-sub message queues
• How message queues are useful
• What exactly is Kafka
• How do people use Kafka
• Where can you learn more
We’ll talk about:
Redo Log:
The most crucial structure for recovery operations … store all changes made to the database as they occur.
Important Point
The redo log is the only reliable source of information about the current state of the database.
Redo Log is used for
• Recover consistent state of a database
• Replicate the database (Dataguard, Streams, GoldenGate…)
• Update materialized logs (well, it’s a log anyway)
If you look far enough back into the archive logs, you can reconstruct the entire database.
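To make the "reconstruct the database from the log" idea concrete, here is a minimal sketch. It is not how Oracle applies redo: the change-record shape `(operation, key, value)` and the `replay` function are hypothetical, chosen only to show that replaying an ordered log of changes rebuilds the current state.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical change record: (operation, key, value). Not Oracle's redo format.
Change = Tuple[str, str, Optional[str]]


def replay(log: List[Change]) -> Dict[str, str]:
    """Rebuild the current state by applying every logged change in order."""
    state: Dict[str, str] = {}
    for op, key, value in log:
        if op in ("INSERT", "UPDATE"):
            state[key] = value
        elif op == "DELETE":
            state.pop(key, None)
    return state


log: List[Change] = [
    ("INSERT", "emp:1", "Alice"),
    ("INSERT", "emp:2", "Bob"),
    ("UPDATE", "emp:1", "Alice Smith"),
    ("DELETE", "emp:2", None),
]

print(replay(log))      # state after the whole log
print(replay(log[:2]))  # state as of an earlier point in the log
```

Replaying only a prefix of the log gives the state as of that point, which is the same idea as point-in-time recovery.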
What if…
You built an entire data storage system that is just a transaction log?
Kafka can log
• Transactions from any database
• Clicks from websites
• Application logs (ERROR, WARN, INFO…)
• Metrics: CPU, memory, I/O
• Audit events
• And any system can read those logs: Hadoop, alerts, dashboards, databases.
Only one thing is missing
Q: How do you query a redo log?
A: Not very efficiently
Sometimes we just need the events – no need to query.
Other times, we need to load the results into a database.
While messages are in transit – we can do all kinds of transformations.
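As a rough illustration of transforming messages in transit, the sketch below chains three small steps: parse raw bytes, keep only the events of interest, and shape them into rows ready to load into a database. The message layout (JSON with `host`, `level`, and `msg` fields) and all function names are assumptions made for this example. No Kafka client is involved; the input is just a list of byte strings.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse(raw_messages: Iterable[bytes]) -> Iterator[dict]:
    """Decode each raw message (assumed to be UTF-8 JSON) into a dict."""
    for raw in raw_messages:
        yield json.loads(raw.decode("utf-8"))


def errors_only(events: Iterable[dict]) -> Iterator[dict]:
    """Filter step: keep only ERROR events."""
    return (event for event in events if event["level"] == "ERROR")


def to_row(event: dict) -> Tuple[str, str, str]:
    """Shape an event into a row for a hypothetical (host, level, msg) table."""
    return (event["host"], event["level"], event["msg"].strip())


messages = [
    b'{"host": "db1", "level": "INFO", "msg": "started"}',
    b'{"host": "db2", "level": "ERROR", "msg": " ORA-00600 "}',
]

rows = [to_row(event) for event in errors_only(parse(messages))]
print(rows)
```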
Raise your hand if this sounds familiar
“My next project was to get a working Hadoop setup…
Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms.”
--Jay Kreps, Kafka PMC
Then we add consumers to the existing sources
[Diagram: four Clients, a Backend, and Another Backend]
Then it starts to look like this
[Diagram: four Clients, a Backend, and three boxes labeled Another Backend]
With maybe some of this
[Diagram: four Clients, a Backend, and three boxes labeled Another Backend]
This is where we are trying to get
Kafka decouples Data Pipelines
[Diagram: four Source Systems (Producers) feed Kafka (Brokers), which feeds the Consumers: Hadoop, Security Systems, Real-time monitoring, Data Warehouse]
Important notes:
• Producers and Consumers don’t need to know about each other
• Performance issues on Consumers don’t impact Producers
• Consumers are protected from herds of Producers
• Lots of flexibility in handling load
• Messages are available to anyone: lots of new use cases, such as monitoring, audit, and troubleshooting
http://www.slideshare.net/gwenshap/queues-pools-caches
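The decoupling described in these notes can be sketched with a toy in-memory log. This is not Kafka's implementation. The `Log` and `Consumer` classes are hypothetical and only illustrate one idea: because each consumer tracks its own read position (offset), a slow consumer never holds back the producer or the other consumers.

```python
from typing import List


class Log:
    """Toy append-only log. Producers append; nobody waits for readers."""

    def __init__(self) -> None:
        self._messages: List[str] = []

    def append(self, message: str) -> int:
        """Store the message and return the offset it was written at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int) -> List[str]:
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer keeps its own offset, independent of everyone else."""

    def __init__(self, log: Log) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_messages: int = 10) -> List[str]:
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = Log()
for i in range(5):
    log.append(f"event-{i}")

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll(10)   # reads everything available
slow_batch = slow.poll(2)    # reads only two messages
log.append("event-5")        # the producer keeps writing regardless
print(fast.offset, slow.offset)
```

A consumer added later starts at offset 0 and can still read every message, which is what makes new use cases cheap to add.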
Kafka provides a fast, distributed, highly scalable, highly available, publish-subscribe messaging system.
In turn this solves part of a much harder problem:
Communication and integration between components of large software systems
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
The Basics
Topics, Partitions and Logs
Each partition is a log
Each Broker has many partitions
[Diagram: Partition 0, Partition 1, and Partition 2, each appearing three times across the brokers]
Producers load balance between partitions
[Diagram: a Client sending to Partition 0, Partition 1, and Partition 2, each appearing three times across the brokers]
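One way a producer might spread messages across partitions is sketched below, under stated assumptions. Messages with a key go to a partition chosen by hashing the key, so the same key always lands in the same partition. Messages without a key are spread round-robin. The `Topic` class is hypothetical, and `zlib.crc32` is used here only because it is a deterministic hash in the standard library; it is not the hash function Kafka's own clients use. Each partition is modeled as a plain list, so a message's offset is simply its position in that list.

```python
import itertools
import zlib
from typing import List, Optional, Tuple


class Topic:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]
        self._round_robin = itertools.cycle(range(num_partitions))

    def send(self, value: bytes, key: Optional[bytes] = None) -> Tuple[int, int]:
        """Append value to a partition and return (partition, offset)."""
        if key is None:
            # No key: spread the load evenly across partitions.
            partition = next(self._round_robin)
        else:
            # Same key -> same partition, so per-key ordering is preserved.
            partition = zlib.crc32(key) % len(self.partitions)
        partition_log = self.partitions[partition]
        partition_log.append(value)
        return partition, len(partition_log) - 1


topic = Topic(num_partitions=3)
first = topic.send(b"login", key=b"user-1")
second = topic.send(b"logout", key=b"user-1")
print(first, second)
```

Ordering is only meaningful within one partition, which is why keyed messages that must stay in order are routed to the same partition.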
Why is Kafka better than other MQs?
• Can keep data forever
• Scales very well: high throughput, low latency, lots of storage
• Scales to any number of consumers
How do people use Kafka?
• As a message bus
• As a buffer for replication systems (like Advanced Queuing in Streams)
• As a reliable feed for event processing
• As a buffer for event processing
• To decouple apps from databases (both OLTP and DWH)
Need More Kafka?
• https://kafka.apache.org/documentation.html
• My video tutorial: http://shop.oreilly.com/product/0636920038603.do
• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
• Try with Cloudera Manager: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka_install.html
Schema is a MUST HAVE for data integration
Kafka only stores Bytes – So where’s the schema?
• People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”
• There’s utility code for reading/writing messages that everyone reuses
• Schema embedded in the message
• A centralized repository for schemas
• Each message has a Schema ID
• Each topic has a Schema ID
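The "centralized repository plus a Schema ID in each message" option could look roughly like the sketch below. Everything here is hypothetical: the `SchemaRegistry` class, the 4-byte big-endian ID prefix, and the JSON-encoded list of values are illustrative choices, not the wire format of any particular schema registry product.

```python
import json
import struct
from typing import Dict, List


class SchemaRegistry:
    """Toy central repository mapping schema IDs to ordered field names."""

    def __init__(self) -> None:
        self._schemas: Dict[int, List[str]] = {}

    def register(self, fields: List[str]) -> int:
        schema_id = len(self._schemas) + 1
        self._schemas[schema_id] = fields
        return schema_id

    def lookup(self, schema_id: int) -> List[str]:
        return self._schemas[schema_id]


def encode(schema_id: int, values: List[str]) -> bytes:
    """Prefix the payload with the schema ID (assumed 4 bytes, big-endian)."""
    return struct.pack(">I", schema_id) + json.dumps(values).encode("utf-8")


def decode(registry: SchemaRegistry, message: bytes) -> Dict[str, str]:
    """Read the schema ID, fetch the schema, and name each field."""
    (schema_id,) = struct.unpack(">I", message[:4])
    fields = registry.lookup(schema_id)
    values = json.loads(message[4:].decode("utf-8"))
    return dict(zip(fields, values))


registry = SchemaRegistry()
click_schema_id = registry.register(["user_id", "url", "ts"])
message = encode(click_schema_id, ["42", "/home", "2015-01-01"])
print(decode(registry, message))
```

With this arrangement nobody has to ask what the 5th field means: any reader can look the schema up from the ID carried in the message.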
I ♥ Avro
• Define Schema
• Generate code for objects
• Serialize / Deserialize into Bytes or JSON
• Embed schema in files / records… or not
• Support for our favorite languages… Except Go.
• Schema Evolution
• Add and remove fields without breaking anything
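Schema evolution can be illustrated without the Avro library. The sketch below imitates, in simplified form, two of Avro's schema-resolution rules: a field the reader expects but the writer did not send is filled from the reader's default, and a field the writer sent but the reader no longer declares is ignored. The `resolve` function and the `Click` schema are hypothetical. Real Avro performs this while decoding binary data and also checks types, which this sketch does not.

```python
from typing import Any, Dict


def resolve(record: Dict[str, Any], reader_schema: dict) -> Dict[str, Any]:
    """Project a written record onto the reader's schema."""
    result: Dict[str, Any] = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: use the default.
            result[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields present in the record but absent from the reader schema are dropped.
    return result


# Reader schema, version 2: "url" was removed, "referrer" was added with a default.
click_v2 = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "referrer", "type": "string", "default": ""},
    ],
}

old_record = {"user_id": "42", "url": "/home"}  # written with version 1
print(resolve(old_record, click_v2))
```

Adding a field without a default is the case that breaks old data, which is why the `ValueError` branch exists.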
Schemas are Agile
• Leave out MySQL and your favorite DBA for a second
• Schemas allow adding readers and writers easily
• Schemas allow modifying readers and writers independently
• Schemas can evolve as the system grows
• Allows validating data soon after it’s written
• No need to throw away data that doesn’t fit!