REAL-TIME STREAM PROCESSING Yu Lele


Post on 28-Dec-2015



OUTLINE

• Why is real-time processing needed?
• Traditional stream methodology
• Current systems for stream processing
• Stream systems in deployment

START?

• Financial
  • Queries over real-time streaming financial data
  • Stock tickers and news feeds
• Network security
  • Operates over network packet streams
  • Provides firewall support and intrusion detection
  • URL filtering based on table lookups
• Web logs
  • Personalization, performance monitoring, load balancing
• Sensor monitoring
  • A large number of distributed sensors generate streams of data
  • Combine, monitor, and analyze

EXAMPLE: TWITTER SEARCH

• When each of these events happened, people instantly searched for it on Twitter.
• Trends

EXAMPLE: TWITTER SEARCH

• Challenges for search and advertising:
  1. Queries that have never been seen before
     I. #bindersfullofwomen: politics
     II. #horsesandbayonets: presidential debates
  2. Short-lived queries
     I. Real-time processing is needed

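EXAMPLE: ALIBABA (TAOBAO)

| Requirement category | Requirement characteristics |
|---|---|
| Compute key business metrics that objectively reflect current performance, e.g. website activity monitoring | Minute-level latency; no missed records; no miscounts; statistics window is the current day |
| Track trends in business metrics and raise intelligent alerts on abnormal fluctuations | Minute-level latency; no missed records; no miscounts; statistics window is the current day |
| Real-time data applications in closed-loop business operations, e.g. event marketing and triggered services | Second-level latency; missed records tolerated; no miscounts; complex computation (many rules) |
| Real-time recommendation | Second-level latency; missed records tolerated; no miscounts; frequent system interaction not recommended |
| Real-time data information services | Second-level latency; no missed records; no miscounts; complex computation (complex metric definitions, many metrics); some metrics have statistics windows spanning multiple days |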

EXAMPLE: ALIPAY PLATFORM (TRANSACTION)

• Real-time calculation of:
  • Trade quantity
  • Trade amount
  • Trading information for the top N sellers
  • User registration count
• More than 100 million messages
• Log processing: 6 TB of data per day

EXAMPLE: THE WEATHER CHANNEL

• Real-time ingestion and persistence of weather data
  I. Several Storm topologies
  II. Each fetches one dataset
  III. Reshapes the records
  IV. Persists the records to a relational database
• An automatic mechanism for repeatedly downloading and manipulating the data

http://www.weather.com/

OUTLINE

• Why is real-time processing needed?
• Traditional stream methodology
• Current systems for stream processing
• Stream systems in deployment

8 REQUIREMENTS-TRADITIONAL

• 1. Keep the Data Moving
  • Storage operations are too costly
    • They add unnecessary latency (disk writes)
  • Process messages “in-stream”, without any requirement to store them
  • Also use an active (i.e., non-polling) processing model
  • Requires larger primary memory capacity

8 REQUIREMENTS-TRADITIONAL

• 2. Support a High-Level “StreamSQL” Language
  • SQL is widely understood and promulgated
  • StreamSQL is designed for continuous data: it extends standard SQL by adding windowing constructs and stream-specific operators
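The windowing constructs mentioned above can be pictured with a small sketch. The Python below is not StreamSQL; it is only an assumed, minimal illustration of what a time-based sliding window aggregate (roughly "average price over the last 10 seconds") computes. The class name `SlidingWindowAvg`, the 10-second width, and the sample records are all hypothetical.

```python
from collections import deque


class SlidingWindowAvg:
    """Average of the values whose timestamps lie in the last `width` seconds.

    Assumes records arrive with non-decreasing timestamps.
    """

    def __init__(self, width):
        self.width = width
        self.items = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, ts, value):
        # Admit the new record, then evict records that have left the window.
        self.items.append((ts, value))
        self.total += value
        while self.items[0][0] <= ts - self.width:
            _, old = self.items.popleft()
            self.total -= old
        return self.total / len(self.items)


win = SlidingWindowAvg(width=10)
averages = [win.add(ts, price) for ts, price in [(0, 10.0), (5, 20.0), (12, 30.0)]]
```

The third record arrives at time 12, so the record from time 0 has left the 10-second window and no longer contributes to the average.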

8 REQUIREMENTS-TRADITIONAL

• 3. Handle Stream Imperfections
  • Data may be delayed or lost in transmission
  • Data may arrive out of order
• 4. Generate Predictable Outcomes
  • Deterministic and repeatable
  • The same input stream yields the same outcome
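One possible way to meet requirements 3 and 4 together is a bounded reorder buffer: hold records briefly, release them in timestamp order, and drop anything that arrives too late. This is a sketch under stated assumptions, not a technique taken from the slides; the function name `reorder` and the `max_delay` parameter are hypothetical.

```python
import heapq


def reorder(events, max_delay):
    """Yield (ts, payload) pairs in timestamp order.

    An event is held until a later event shows that `max_delay` time units
    have passed; events that arrive after their slot was emitted are dropped.
    """
    heap = []
    max_seen = float("-inf")      # largest timestamp observed so far
    emitted_upto = float("-inf")  # largest timestamp already emitted
    for ts, payload in events:
        if ts < emitted_upto:
            continue  # too late: output has already moved past this timestamp
        heapq.heappush(heap, (ts, payload))
        max_seen = max(max_seen, ts)
        while heap and heap[0][0] <= max_seen - max_delay:
            emitted_upto = heap[0][0]
            yield heapq.heappop(heap)
    while heap:  # end of input: flush what is still buffered
        yield heapq.heappop(heap)


arrivals = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (4, "d"), (0, "late")]
ordered = list(reorder(arrivals, max_delay=2))
```

Because the output depends only on the arrival sequence and `max_delay`, replaying the same input gives the same result, which is the repeatability that requirement 4 asks for.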

8 REQUIREMENTS-TRADITIONAL

• 5. Integrate Stored and Streaming Data
  • Often need to compare the present with the past
  • Credit card fraud detection requires monitoring “normal” activity and storing it as a signature, which can then be used to identify “unusual” or suspicious activity
  • State information needs to be stored and accessed efficiently to meet real-time requirements

8 REQUIREMENTS-TRADITIONAL

• 6. Guarantee Data Safety and Availability
  • Preserve the integrity of mission-critical information
  • Avoid disruptions in real-time processing (real-time failover)
  • Backup
• 7. Partition and Scale Applications Automatically
• 8. Process and Respond Instantaneously
  • High volumes
  • Low latency


TRADITIONAL TECHNOLOGIES FOR SP


OTHER CHALLENGES

• Provide a simple programming interface for users
• Guarantee no message (data) loss
• Transactional support

OUTLINE

• Why is real-time processing needed?
• Traditional stream methodology
• Current systems for stream processing
• Stream systems in deployment

SYSTEMS OVERVIEW

• Yahoo S4
  • Simple Scalable Streaming System
• Storm
  • Open-sourced by Twitter in 2011
• Spark Streaming
  • Large-scale near-real-time stream processing

YAHOO S4

Simple Scalable Streaming System

YAHOO S4-DESIGN

• Stream = sequence of elements (events) of the form (K, A)
  • K: tuple-valued keys
  • A: attributes
• Goal: create a flexible stream computing platform that
  • Consumes streams
  • Computes intermediate values
  • Emits streams in a distributed environment

YAHOO S4-DESIGN

• Data processing
  • The base unit is the Processing Element (PE)
    • Functionality
    • Event type
    • Key
    • Attribute
• Messages are transmitted between PEs
• The state of each PE is inaccessible to other PEs
• The platform provides the capability to route events to the appropriate PEs and to create new PEs

YAHOO S4-EXAMPLE-WORD COUNT

• Input event
  • A document with a quotation
• Goal
  • Continuously produce a sorted list of the top K words
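The word-count example can be sketched in plain Python to show the keyed-PE idea: one counting element per distinct word, created the first time that key is seen, feeding a single top-K element. This is an illustration only and does not use the S4 API; the class names `WordCountPE` and `TopKPE` and the sample quotations are hypothetical.

```python
class WordCountPE:
    """One instance per distinct word (the key); its state is private."""

    def __init__(self, word):
        self.word = word
        self.count = 0

    def on_event(self, _word):
        self.count += 1
        return (self.word, self.count)  # emitted downstream as a new event


class TopKPE:
    """Remembers the latest count per word and reports the K largest."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def on_event(self, event):
        word, count = event
        self.counts[word] = count

    def top(self):
        # Highest count first; ties broken alphabetically.
        ranked = sorted(self.counts.items(), key=lambda wc: (-wc[1], wc[0]))
        return ranked[: self.k]


def run(quotes, k):
    pes = {}  # key -> PE; a PE is created the first time its key appears
    top_k = TopKPE(k)
    for quote in quotes:
        for word in quote.lower().split():
            if word not in pes:
                pes[word] = WordCountPE(word)
            top_k.on_event(pes[word].on_event(word))
    return top_k.top()


result = run(["to be or not to be", "to do is to be"], k=2)
```

Routing by key means each `WordCountPE` only ever sees its own word, so no PE needs access to another PE's state.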

YAHOO S4-PN

• So many PEs
  • PEC: Processing Element Container
• Logical hosts: Processing Nodes (PN)
  • Listen for events
  • Execute operations on events
  • Emit output events
• Communication layer
  • Cluster management
  • Automatic failover

STORM

Open-sourced by Twitter in 2011

KEY CONCEPTS

• Tuples (ordered lists of elements)
  (“Saratov”, “Slukjanov”, “event1”, “10/3/12 16:20”)
• Streams (unbounded sequences of tuples)
• Spouts (sources of streams)
  • Talk with queues, logs, API calls, event data
• Bolts (process tuples and create new streams)
• Topologies (a directed graph of spouts and bolts)
• Cluster

[Figure: Storm cluster diagram; surviving labels: UI, Hadoop’s JobTracker, Hadoop’s TaskTracker]
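The concepts above fit together as follows. This sketch uses plain Python generators rather than the Storm API, so the function names (`sentence_spout`, `split_bolt`, `count_bolt`) and the sample sentences are assumptions made for illustration. A real spout would read from an unbounded source such as a queue.

```python
def sentence_spout():
    """Spout: the source of a stream (a short list stands in for a queue)."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)  # a tuple with a single field


def split_bolt(stream):
    """Bolt: consumes one stream and emits a new stream of (word,) tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt (a linear directed graph).
topology_output = list(count_bolt(split_bolt(sentence_spout())))
final_counts = dict(topology_output)  # the last emitted count per word wins
```

In a cluster, each spout and bolt would run as many parallel tasks spread across machines; the single-process chain here only shows how the data flows.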

STORM VS S4

| | Yahoo! S4 | Storm |
|---|---|---|
| Development language | Java | Clojure and Java |
| Architecture | Decentralized, peer-to-peer | Central node (Nimbus), non-critical |
| Routing | EventType + Key | Shuffle, Fields, All, Global; very flexible |
| Reliable processing | Not supported | Supported |
| Fault tolerance | Partially supported | Partially supported |
| Load balancing | Supported | Not supported |
| Web UI | None | Supported |
| Code maturity | Immature | Mature |
| Dynamic node addition/removal | Not supported | Supported |
| Activity | Low | Active |
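The four Storm routing options named in the table decide which downstream task receives each tuple. The functions below are a hedged sketch of the idea, not Storm code: the function names are hypothetical, round-robin stands in for Storm's random shuffle, and `zlib.crc32` is just one convenient stable hash.

```python
import itertools
import zlib


def fields_grouping(key, n_tasks):
    """Fields: the same key always reaches the same task (stable hash)."""
    return [zlib.crc32(key.encode("utf-8")) % n_tasks]


def all_grouping(n_tasks):
    """All: every task receives a copy of the tuple."""
    return list(range(n_tasks))


def global_grouping():
    """Global: the whole stream goes to one task (task 0 in this sketch)."""
    return [0]


def make_shuffle_grouping(n_tasks):
    """Shuffle: spread tuples evenly; round-robin stands in for random choice."""
    counter = itertools.cycle(range(n_tasks))
    return lambda: [next(counter)]


shuffle = make_shuffle_grouping(2)
shuffle_targets = [shuffle()[0] for _ in range(4)]
same_task = fields_grouping("alice", 4) == fields_grouping("alice", 4)
```

Fields grouping is what a per-key counter needs, since all tuples for one key must meet at the same task; shuffle grouping suits stateless work where only an even load matters.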

MORE ON STORM

• Guaranteeing message processing
  • Each message coming off a spout will be fully processed
  • “Fully processed” means the tuple tree has been exhausted and every message in the tree has been processed within a specified timeout

[Figure: a spout and three bolts]

https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing
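The Storm wiki page linked above describes tracking a whole tuple tree with a single value: tuple ids are XORed in when a tuple is created and XORed again when it is acked, so the value returns to zero exactly when every tuple has been acked. The class below is a simplified sketch of that idea, not Storm's implementation. It omits timeouts, and the small hand-picked ids are an assumption for readability (the real system uses large random ids so that accidental cancellation is negligible).

```python
class Acker:
    """Tracks one 'ack value' per spout tuple; zero means the tree is done."""

    def __init__(self):
        self.pending = {}  # root_id -> XOR of ids created but not yet acked

    def emit(self, root_id, tuple_id):
        # A tuple was created (anchored) somewhere in the tree of root_id.
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id, tuple_id):
        # A tuple finished processing; XORing its id again cancels it out.
        self.pending[root_id] ^= tuple_id
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True  # tree exhausted: the spout tuple is fully processed
        return False


acker = Acker()
acker.emit("msg-1", 0b0101)           # the spout emits the root tuple
acker.emit("msg-1", 0b0011)           # a bolt emits a child anchored to it
first = acker.ack("msg-1", 0b0101)    # root acked, child still outstanding
second = acker.ack("msg-1", 0b0011)   # child acked, tree complete
```

The appeal of this scheme is constant memory per spout tuple, however large the tuple tree grows.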

MORE ON STORM

• Transactional topologies
  • Many messages are processed simultaneously
  • Strict ordering between messages
  • Guarantee exactly-once messaging semantics for pretty much any computation

https://github.com/nathanmarz/storm/wiki/Transactional-topologies
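One way to get exactly-once results when batches may be replayed, in the spirit of the design on the linked wiki page, is to store the id of the last applied transaction next to the state and skip any batch that carries the same id. The sketch below assumes commits arrive in strict transaction order, so only the most recent batch can ever be replayed. `CountStore` and the sample batches are hypothetical.

```python
class CountStore:
    """Stores a count together with the id of the last transaction applied."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def commit(self, txid, batch_size):
        # A replayed batch carries the same txid, so it is skipped, not re-added.
        if txid == self.last_txid:
            return False
        self.count += batch_size
        self.last_txid = txid
        return True


store = CountStore()
# Transaction 2 is delivered twice, as it would be after a failure and replay.
applied = [store.commit(txid, n) for txid, n in [(1, 3), (2, 5), (2, 5), (3, 2)]]
```

The strict ordering of commits is what makes a single stored id sufficient; without it, the store would have to remember every transaction it had seen.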

STORM ON TAOBAO

• At Taobao, Storm is widely used for real-time log processing, in scenarios such as real-time statistics, real-time risk control, and real-time recommendation.
• Typically, we read real-time log messages from MetaQ (a Kafka-like system) or from the HBase-based TimeTunnel, run them through a series of processing steps, and finally write the results to distributed storage for applications to access.
• Our daily real-time message volume ranges from a few million to several billion, with total data volume at the TB level. For us, Storm is usually used together with a distributed storage service. In our ongoing real-time analysis project for personalized search, we use a TimeTunnel + HBase + Storm + UPS architecture that processes billions of user log records per day, with second-level latency from the user action to the completed analysis.

SPARK STREAMING SYSTEM

Large-scale near-real-time stream processing

STATEFUL STREAM PROCESSING

• Traditional streaming systems have an event-driven, record-at-a-time processing model
  • Each node has mutable state
  • For each record, update state and send new records
• State is lost if a node dies!
• Making stateful stream processing fault-tolerant is challenging

[Figure: input records flowing through nodes 1, 2, and 3, each holding mutable state]

REQUIREMENTS

• Scalable to large clusters
• Second-scale latencies
• Simple programming model
• Integrated with batch and interactive processing
• Efficient fault tolerance in stateful computations

TRADITIONAL STREAMING SYSTEMS

• Fault tolerance via replication or upstream backup
  • Replication: fast recovery, but 2x hardware cost
  • Upstream backup: only needs 1 standby, but slow to recover
• Neither approach tolerates stragglers

[Figure: left, nodes 1, 2, 3 with synchronized replicas 1’, 2’, 3’; right, nodes 1, 2, 3 with a single standby]

OBSERVATION

• Batch processing models such as MapReduce do provide fault tolerance efficiently
  • Divide the job into deterministic tasks
  • Rerun failed/slow tasks in parallel on other nodes

IDEA

• Run a streaming computation as a series of very small, deterministic batch jobs
  • E.g., process a stream of tweets in 1-second batches
  • Same recovery schemes, at a smaller timescale
• Try to make the batch size as small as possible
  • Smaller batch size -> lower end-to-end latency
• State between batches is kept in memory
  • Deterministic stateful operations -> fault tolerance
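The idea can be sketched without Spark: cut the stream into fixed-length intervals, run the same deterministic batch job on each interval, and carry the state forward between batches. The code below is an assumed, single-process illustration and not the Spark Streaming API; `discretize`, `run_stream`, and the sample tweets are hypothetical.

```python
from collections import Counter


def discretize(records, batch_seconds):
    """Group (timestamp, item) records into consecutive fixed-length batches.

    Yields (batch_index, items). Assumes non-decreasing timestamps;
    intervals that contain no records are skipped.
    """
    current, items = None, []
    for ts, item in records:
        index = int(ts // batch_seconds)
        if current is not None and index != current:
            yield current, items
            items = []
        current = index
        items.append(item)
    if items:
        yield current, items


def run_stream(records, batch_seconds):
    """Apply a deterministic batch job per interval, carrying state forward."""
    state = Counter()  # running hashtag counts, kept in memory between batches
    snapshots = []
    for index, items in discretize(records, batch_seconds):
        state = state + Counter(items)  # builds a new state, old one untouched
        snapshots.append((index, dict(state)))
    return snapshots


tweets = [(0.1, "#a"), (0.7, "#b"), (1.2, "#a"), (2.5, "#a")]
snapshots = run_stream(tweets, batch_seconds=1)
```

Because each batch job is a pure function of its input batch and the previous state, any batch can be rerun after a failure and will reproduce the same state.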

DISCRETIZED STREAM PROCESSING

[Figure: at each time step (t = 1, t = 2, ...), input is pulled from stream 1 and stream 2 into an immutable dataset (stored reliably); a batch operation then produces an immutable dataset (output or state) that is stored in memory without replication]

PARALLEL RECOVERY

• Checkpoint state datasets periodically
• If a node fails/straggles, recompute its dataset partitions in parallel on other nodes
• Faster recovery than upstream backup, without the cost of replication

[Figure: a map operation from an input dataset to an output dataset]
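Parallel recovery works because each output partition is a deterministic function of a reliably stored input partition. The sketch below simulates this in one process: threads stand in for other nodes, and the `word_lengths` task and the sample partitions are hypothetical. It is not Spark code.

```python
from concurrent.futures import ThreadPoolExecutor


def word_lengths(partition):
    """A deterministic map task: the same input always gives the same output."""
    return [len(word) for word in partition]


# Input partitions are assumed to be stored reliably.
input_partitions = [["spark", "storm"], ["s4"], ["stream", "batch"]]

# Normal run: each output partition lives in memory on some node.
output = {i: word_lengths(p) for i, p in enumerate(input_partitions)}

# Simulate a node failure that loses output partitions 0 and 2.
lost = [0, 2]
for i in lost:
    del output[i]

# Recovery: recompute only the lost partitions, in parallel on other workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    recomputed = pool.map(word_lengths, [input_partitions[i] for i in lost])
    output.update(zip(lost, recomputed))
```

Only the lost partitions are recomputed, and the work is spread across the surviving workers rather than replayed serially on one standby.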

COMPARISON WITH STORM AND S4

Higher throughput than Storm

• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
• Apache S4: 7.5k records/second/node

[Figure: throughput per node (MB/s) versus record size (bytes) for the WordCount and Grep benchmarks, comparing Spark and Storm]

OUTLINE

• Why is real-time processing needed?
• Traditional stream methodology
• Current systems for stream processing
• Stream systems in deployment

STREAM-SYSTEM IN DEPLOYMENT

1. Monitor which search queries are currently popular.
2. When a new popular search query is discovered, send it to human evaluators to categorize the query or provide other information.
3. Push the information received to the backend systems. The next time a user searches for the query, machine learning models will make use of the additional information.

STREAM-SYSTEM IN DEPLOYMENT

• Monitoring for popular queries
  • Tuple streams of data (search query, timestamp)
  • Spouts attach to the search logs
  • Bolts process the tuple streams: they update the count of the number of times we’ve seen a query, check whether the query is “currently popular”, and dispatch it to the human computation pipeline if so
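The bolt logic described above can be sketched as a sliding-window counter that flags a query once when it crosses a popularity threshold. This is an assumed illustration in plain Python, not Twitter's code or the Storm API; the 60-second window, the threshold of 3, and the name `PopularQueryBolt` are hypothetical.

```python
from collections import defaultdict, deque


class PopularQueryBolt:
    """Counts queries in a sliding time window and flags popular ones once."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.seen = defaultdict(deque)  # query -> timestamps inside the window
        self.flagged = set()

    def process(self, query, ts):
        times = self.seen[query]
        times.append(ts)
        # Forget occurrences that have fallen out of the window.
        while times[0] <= ts - self.window:
            times.popleft()
        if len(times) >= self.threshold and query not in self.flagged:
            self.flagged.add(query)
            return query  # would be dispatched to the human computation pipeline
        return None


bolt = PopularQueryBolt(window_seconds=60, threshold=3)
log = [("big bird", 1), ("weather", 2), ("big bird", 5), ("big bird", 9), ("big bird", 12)]
dispatched = [q for q in (bolt.process(query, ts) for query, ts in log) if q]
```

Flagging each query only once keeps the downstream human evaluation pipeline from receiving the same query on every subsequent search.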

STREAM-SYSTEM IN DEPLOYMENT

• Human evaluation of popular search queries
  • Suppose the Storm topology has detected the query “Big Bird”
  • Submit it to Mechanical Turk with questions such as:
    I. What category does the query belong to?
    II. Does the query refer to a person?
