Singer, Pinterest's Logging Infrastructure

Upload: discover-pinterest

Post on 08-Sep-2014

Category: Technology


DESCRIPTION

Krishna Gade and Roger Wang talk about Pinterest and Singer, our Logging Infrastructure.

TRANSCRIPT

Page 1: Singer, Pinterest's Logging Infrastructure
Page 2: Singer, Pinterest's Logging Infrastructure

Krishna Gade, Data Engineering Manager

Discover Pinterest: Big Data and Apache Mesos

Page 3: Singer, Pinterest's Logging Infrastructure

Connor Doyle

Mesosphere

Roger Wang

Pinterest

Bernardo Gomez Palacio

Guavus

Page 4: Singer, Pinterest's Logging Infrastructure

Pinterest is a data product.

Page 5: Singer, Pinterest's Logging Infrastructure

A/B Experimentation

Promoted Pins

Product Insights

Spam Control

Related Pins

Home Feed

Search Quality

DATA

Page 6: Singer, Pinterest's Logging Infrastructure

Numbers

• > 30 billion Pins

• 10 billion messages-a-day logged to Kafka

• 10 petabytes of data in S3

• Ingest 20 terabytes of new data each day

• Petabyte-a-day processed in Hadoop

• 6 Hadoop clusters of 3000 nodes in AWS

• Over 100 regular users running over 2,000 jobs each day

Page 7: Singer, Pinterest's Logging Infrastructure

4x Data Growth

Page 8: Singer, Pinterest's Logging Infrastructure

Data Architecture Overview

pins

repins, likes

impressions

Kafka

App

Storm

Hadoop

Singer

HBase

Redshift

Insights

Features

Page 9: Singer, Pinterest's Logging Infrastructure

Roadmap

• Switch to Kafka 0.8 for all data streams

• Invest in scalable stream processing for realtime insights and products

• Migrate to a robust Hadoop 2.0 platform

• Experiment with Spark, especially for machine learning

• Unified batch and stream compute framework

Page 10: Singer, Pinterest's Logging Infrastructure

Roger Wang, Software Engineer

Singer: A High-Performance Logging Infrastructure

Page 11: Singer, Pinterest's Logging Infrastructure

Logging Infrastructure before Singer

Storm

kafka agent

app

app

kafka agent

app

app

Host

app

app

Kafka Consumer

S3

Kafka copier

Kafka Cluster

Hadoop cluster

Page 12: Singer, Pinterest's Logging Infrastructure

Logging Infrastructure with Singer

Storm

kafka agent

app

app

kafka agent

app

app

Host

singer agent

app

app

Kafka Consumer

S3

Secor

Kafka Cluster

Hadoop cluster

Page 13: Singer, Pinterest's Logging Infrastructure

Singer Logging Agent

• Simple logging mechanism for applications

  • Decouple applications from the log repository

  • Supports existing applications that log to disk

  • Isolate applications from Singer agent failure

  • Isolate applications from log repository failure

  • Avoid internal buffering and log loss

• Better resource usage

  • Connection consolidation

  • Flexible batching

Page 14: Singer, Pinterest's Logging Infrastructure

Singer Features

•At-least-once delivery

•Configurable adaptive log latency by periodic tailing

•Dynamically discover new log streams

•Dynamically pick up new log configuration

•Pluggable log stream reader

•Pluggable log stream writer

•Rich set of stats via Ostrich

Page 15: Singer, Pinterest's Logging Infrastructure

Singer Architecture

LogStream monitor

Configuration watcher

Reader Writer

Log repository

Reader Writer

Reader Writer

Reader Writer

Log configuration

LogStream processors

A - 1

A - 2

B - 1

C - 1

Page 16: Singer, Pinterest's Logging Infrastructure

Singer Concepts and Components

•LogStream/LogFile

•LogPosition

•LogStreamMonitor

•LogStreamProcessor

•LogStreamReader/LogFileReader

•LogStreamWriter

Page 17: Singer, Pinterest's Logging Infrastructure

Log Stream Monitor

LogStream monitor

Log Stream A-1 Processor Stats

Log Stream B-1 Processor Stats

Log Stream B-2

LogStream Registrar

empty log stream Processor Stats

Periodic Task

Page 18: Singer, Pinterest's Logging Infrastructure

Log Stream Processor

Reader

Writer

Commit position

Refresh LogStream

EOS

next batch

update stats

calculate next processing time

schedule next processing cycle

Abort on exception

No Yes

Load position and seek reader

Abort on exception

Process batch

Abort on exception

Processing a batch
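
One way to read the flowchart labels above is as a single processing cycle: load the committed position, seek the reader, then read a batch, write it, and commit the position only after the write succeeds. Committing after the write is what would give at-least-once delivery. The sketch below is a minimal illustration under that reading; `ListReader`, `ListWriter`, and `run_cycle` are hypothetical names for this example, not Singer's API.

```python
class ListReader:
    """Hypothetical in-memory reader; a position is just a list index."""

    def __init__(self, messages):
        self.messages = messages
        self.pos = 0

    def seek(self, position):
        self.pos = position

    def read_batch(self, max_size):
        batch = self.messages[self.pos:self.pos + max_size]
        self.pos += len(batch)
        return batch, self.pos  # messages plus the position after them


class ListWriter:
    """Hypothetical writer that collects messages and can fail on one call."""

    def __init__(self, fail_on_call=None):
        self.written = []
        self.calls = 0
        self.fail_on_call = fail_on_call

    def write(self, batch):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise IOError("log repository unavailable")
        self.written.extend(batch)


def run_cycle(reader, writer, committed, batch_size=2):
    """Run one processing cycle and return the new committed position.

    The position advances only after a successful write, so a failure
    aborts the cycle and the uncommitted batch is re-read next time.
    """
    reader.seek(committed)      # load position and seek reader
    while True:
        batch, next_pos = reader.read_batch(batch_size)
        if not batch:           # end of stream: the cycle is done
            return committed
        try:
            writer.write(batch)
        except IOError:
            return committed    # abort on exception, keep the old position
        committed = next_pos    # commit position after the write
```

In this toy writer a failed call writes nothing, so no duplicates appear. With a real log repository a write may partially succeed before failing, and re-reading from the last committed position would then resend some messages, which is the "at least once" part.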

Page 19: Singer, Pinterest's Logging Infrastructure

Adaptive Log Processing Interval

No message: next cycle = min(MaxInterval, 2 * current interval)

≥ 1 message

next cycle = MinInterval

[MinInterval, MaxInterval]
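
The rule on this slide can be sketched as a small function. This is a minimal illustration, assuming that any cycle that reads at least one message resets the interval to the minimum; the function and parameter names are hypothetical.

```python
def next_interval(current, message_count, min_interval, max_interval):
    """Return the delay before the next processing cycle.

    An idle cycle doubles the delay, capped at max_interval.
    A cycle that found messages drops back to min_interval.
    """
    if message_count == 0:
        return min(max_interval, 2 * current)
    return min_interval


def idle_schedule(cycles, min_interval, max_interval):
    """Delays chosen after each of `cycles` consecutive idle cycles."""
    delays = []
    current = min_interval
    for _ in range(cycles):
        current = next_interval(current, 0, min_interval, max_interval)
        delays.append(current)
    return delays
```

With `min_interval=1` and `max_interval=8`, an idle stream backs off through delays of 2, 4, 8, 8, and the first cycle that reads a message brings the delay back to 1. Quiet streams cost little to poll, while busy streams keep low latency.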

Page 20: Singer, Pinterest's Logging Infrastructure

Pluggable Log Stream Reader

LogFileReader LogMessage with LogPosition

LogMessage: {key: <binary>; timestamp: <timestamp>; message: <binary>}

LogPosition: inode + byte offset

Page 21: Singer, Pinterest's Logging Infrastructure

Log Message

Envelope Thrift message passed between Reader and Writer:

key (binary): Uninterpreted binary used to co-locate messages. Example: a session id, so that all log entries in a session are on the same partition. No serde cost.

timestamp: nanoseconds

message (binary): Uninterpreted binary data. Examples: a text log line, a Thrift message, or a file path. No serde cost.
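
As a rough illustration of the envelope, the sketch below models the three fields as a plain Python dataclass rather than a Thrift struct, and shows how an opaque key can keep related messages together. `partition_for` uses a CRC32 hash purely for illustration; it is an assumption of this example and not the partitioner Kafka or Singer uses.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LogMessage:
    key: bytes       # opaque bytes used only for co-location
    timestamp: int   # nanoseconds
    message: bytes   # opaque payload: log line, Thrift bytes, or file path


def partition_for(msg, num_partitions):
    """Map a message to a partition from its key bytes (illustrative hash)."""
    return zlib.crc32(msg.key) % num_partitions


first = LogMessage(b"session-42", 1, b"GET /pin/1")
second = LogMessage(b"session-42", 2, b"GET /pin/2")
same_partition = partition_for(first, 8) == partition_for(second, 8)
```

Because neither the key nor the payload is parsed, the agent never pays to serialize or deserialize application data; it only moves bytes.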

Page 22: Singer, Pinterest's Logging Infrastructure

Log Position

● Caching can give a wrong byte offset

● Implement a generic buffered Java InputStream that tracks byte offsets

● Restriction: the reader should not cache or read ahead

LogFile (inode): next log file to read from

byteOffset (byte offset from head of file): next byte to read from the file
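
The slide refers to a Java InputStream; the same idea can be sketched in Python. The point is that the recorded offset must count the bytes handed to the caller, not the bytes fetched from the file into a buffer. `OffsetTrackingReader` is a hypothetical name, and newline-delimited messages are an assumption of this example.

```python
import io


class OffsetTrackingReader:
    """Buffered line reader whose offset counts only consumed bytes."""

    def __init__(self, raw, offset=0, chunk_size=4096):
        self.raw = raw
        self.raw.seek(offset)
        self.buffer = b""
        self.offset = offset          # offset of the next unconsumed byte
        self.chunk_size = chunk_size

    def read_line(self):
        """Return the next complete line, or None if none is available yet."""
        while b"\n" not in self.buffer:
            chunk = self.raw.read(self.chunk_size)
            if not chunk:
                return None           # partial line stays buffered, offset unchanged
            self.buffer += chunk
        line, _, self.buffer = self.buffer.partition(b"\n")
        line += b"\n"
        self.offset += len(line)      # advance only past what was returned
        return line


data = b"first\nsecond\npart"
reader = OffsetTrackingReader(io.BytesIO(data))
lines = [reader.read_line(), reader.read_line(), reader.read_line()]
```

Here the underlying stream has been read to the end (17 bytes), but `reader.offset` is 13, the start of the incomplete third line. Committing 13 rather than 17 means a restart resumes exactly at the unfinished message. For a real file, the raw stream could be opened with `open(path, "rb", buffering=0)`.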

Page 23: Singer, Pinterest's Logging Infrastructure

Log Rotation

[Figure: log files log, log.1, ..., log.7 before and after a rotation. Inodes before: 10 to 17. Inodes after: 10 to 16, plus a new inode 18.]

1. Use the inode to identify a log file.

2. Check the inode<->filename mapping when opening a file by name.
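
The two rules on this slide can be illustrated with `os.stat`, which exposes a file's inode as `st_ino`. The helper names below are hypothetical, and the sketch assumes a POSIX file system where renaming a file keeps its inode.

```python
import os


def inode_of(path):
    """Return the inode number of the file currently at `path`."""
    return os.stat(path).st_ino


def find_by_inode(directory, inode):
    """Return the path in `directory` whose inode matches, or None."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            if os.stat(path).st_ino == inode:
                return path
        except FileNotFoundError:
            continue  # file was rotated away between listdir and stat
    return None


def open_checked(path, expected_inode):
    """Open `path` only if it still refers to the expected inode."""
    handle = open(path, "rb")
    if os.fstat(handle.fileno()).st_ino != expected_inode:
        handle.close()
        return None   # the name now points at a different file
    return handle
```

After a rotation renames `log` to `log.1`, a saved position that stores the inode still finds the right file through `find_by_inode`, while `open_checked` refuses to read the new `log` under the old position.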

Page 24: Singer, Pinterest's Logging Infrastructure

Duplicate inodes

[Figure: log files log, log.1, ..., log.7 and their inodes (10 to 18) before and after a rotation.]

Skip the cycle to wait for log rotation.

Page 25: Singer, Pinterest's Logging Infrastructure

Log File Reader Caveats

Corrupted block

Partial LogMessage

Log File Reader is kept open between processing cycles to avoid file-opening cost

Page 26: Singer, Pinterest's Logging Infrastructure

Pluggable Log Stream Writer

•Writer interprets LogMessage

•Examples:

• Log archiver interprets the message as a file path

• Kafka writer creates a Kafka message without deserializing the content in the envelope
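
The two examples above can be sketched as implementations of one small interface. This is an illustration only: `Writer`, `ArchiveWriter`, and `PassThroughWriter` are hypothetical names, and the `send` callback stands in for a real Kafka producer call.

```python
from abc import ABC, abstractmethod


class Writer(ABC):
    """Hypothetical writer interface: takes (key, payload) byte pairs."""

    @abstractmethod
    def write(self, messages):
        ...


class ArchiveWriter(Writer):
    """Interprets each payload as a file path to archive."""

    def __init__(self):
        self.paths = []

    def write(self, messages):
        for _key, payload in messages:
            self.paths.append(payload.decode("utf-8"))


class PassThroughWriter(Writer):
    """Forwards key and payload bytes untouched, without deserializing."""

    def __init__(self, send):
        self.send = send  # stand-in for a producer's send function

    def write(self, messages):
        for key, payload in messages:
            self.send(key, payload)
```

The reader side stays the same in both cases; only the writer decides what the payload bytes mean.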

Page 27: Singer, Pinterest's Logging Infrastructure

Log Configuration

Puppet master

Watcher

Restart Singer on change

puppet agent

Page 28: Singer, Pinterest's Logging Infrastructure

Singer Deployment

•Debian package: part of base image?

•Dynamic configuration update through Puppet

•Resource footprint enforced

•Rich stats exported through Ostrich to OpenTSDB

Page 29: Singer, Pinterest's Logging Infrastructure

Alternatives

•Scribe

•Logstash

•…

Page 30: Singer, Pinterest's Logging Infrastructure

What’s next?

•Resilient file format so that we can skip corrupted blocks

•Pluggable log processing policy

Page 31: Singer, Pinterest's Logging Infrastructure