Post on 28-Dec-2015
REAL-TIME STREAM PROCESSING
Yu Lele
OUTLINE
• Why is real-time processing needed?
• Traditional stream methodology
• Current systems for stream processing
• Stream systems in deployment
START?
• Financial
  • Queries over real-time streaming financial data, such as stock tickers and news feeds
• Network security
  • Operates over network packet streams
  • Provides firewall support and intrusion detection
  • URL filtering based on table lookups
• Web logs
  • Personalization, performance monitoring, load balancing
• Sensor monitoring
  • Large numbers of distributed sensors generate streams of data
  • Combine, monitor, and analyze
EXAMPLE: TWITTER SEARCH
• When each of these events happened, people instantly searched for it on Twitter.
• Trends
EXAMPLE: TWITTER SEARCH
• Challenges for search & advertising:
  1. Queries that have never been seen before
    I. #bindersfullofwomen: politics
    II. #horsesandbayonets: presidential debates
  2. Short-lived queries
    I. Real-time processing is needed
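EXAMPLE: ALI (TAOBAO)
Requirement category: requirement characteristics
• Computing key business metrics that objectively reflect current performance, e.g. website activity monitoring: minute-level latency; no missed records; no miscalculation; statistics cover the current day
• Tracking trends in business metrics, with intelligent alerts on abnormal fluctuations: minute-level latency; no missed records; no miscalculation; statistics cover the current day
• Real-time data applications in closed-loop business operations, e.g. event marketing and triggered services: second-level latency; missed records allowed; no miscalculation; complex computation (many rules)
• Real-time recommendation: second-level latency; missed records allowed; no miscalculation; frequent interaction with the recommendation system
• Real-time data information services: second-level latency; no missed records; no miscalculation; complex computation (complex metric definitions, many metrics); some metrics cover more than one day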
EXAMPLE: ALIPAY PLATFORM (TRANSACTION)
• Real-time calculation of:
  • Trade quantity
  • Trade amount
  • Trading information for the top N sellers
  • User registration count
• More than 100 million messages
• Log processing: 6 TB of data per day
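The metrics listed on this slide can be maintained incrementally, one message at a time. The following is only a minimal sketch of that idea, not Alipay's implementation; the class `TradeStats`, its method names, and the sample trades are all hypothetical.

```python
import heapq
from collections import defaultdict


class TradeStats:
    """Running trade metrics, updated once per incoming trade message.

    Hypothetical illustration; field and method names are assumptions.
    """

    def __init__(self):
        self.trade_count = 0
        self.trade_amount = 0.0
        self.amount_by_seller = defaultdict(float)

    def on_trade(self, seller_id, amount):
        # Constant-time update per message; nothing is re-scanned.
        self.trade_count += 1
        self.trade_amount += amount
        self.amount_by_seller[seller_id] += amount

    def top_sellers(self, n):
        # The n sellers with the largest accumulated trade amount.
        return heapq.nlargest(n, self.amount_by_seller.items(),
                              key=lambda kv: kv[1])


stats = TradeStats()
for seller, amount in [("s1", 10.0), ("s2", 25.0), ("s1", 30.0), ("s3", 5.0)]:
    stats.on_trade(seller, amount)
```

Each update touches only the counters for one seller, which is what keeps the per-message cost low at high message volumes.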
EXAMPLE: THE WEATHER CHANNEL
• Real-time ingestion and persistence of weather data
  I. Several Storm topologies
  II. Each fetches one dataset
  III. Reshapes the records
  IV. Persists the records to a relational database
• Automatic mechanism for repeatedly downloading and manipulating the data
http://www.weather.com/
8 REQUIREMENTS-TRADITIONAL
• 1 Keep the Data Moving
  • Storage operations are too costly
    • They add unnecessary latency (disk writes)
  • Process messages “in-stream” without any requirement to store them
  • Also use an active (i.e., non-polling) processing model
  • Requires larger primary memory capacity
8 REQUIREMENTS-TRADITIONAL
• 2 Support a high-level “StreamSQL” language
  • SQL is widely understood and promulgated
  • StreamSQL is designed for continuous data; it extends standard SQL by adding windowing constructs and stream-specific operators
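To make the idea of a windowing construct concrete, here is a small sketch of what a windowed aggregate computes: an average price per symbol over fixed, non-overlapping (tumbling) time windows. It is written in plain Python rather than in any StreamSQL dialect, and the function name, the tuple layout, and the sample ticks are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Average price per symbol over fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, symbol, price), assumed to be
    in timestamp order. Yields (window_start, symbol, average_price) each
    time a window closes.
    """
    current_window = None
    sums = defaultdict(lambda: [0.0, 0])  # symbol -> [total, count]
    for ts, symbol, price in events:
        window = ts - ts % window_seconds
        if current_window is not None and window != current_window:
            # The previous window is complete: emit its aggregates.
            for sym in sorted(sums):
                total, count = sums[sym]
                yield (current_window, sym, total / count)
            sums.clear()
        current_window = window
        sums[symbol][0] += price
        sums[symbol][1] += 1
    # Flush the last window; only needed because this example stream is finite.
    if current_window is not None:
        for sym in sorted(sums):
            total, count = sums[sym]
            yield (current_window, sym, total / count)


ticks = [(0, "A", 10.0), (3, "A", 20.0), (4, "B", 5.0), (12, "A", 30.0)]
averages = list(tumbling_window_avg(ticks, 10))
```

A window is what turns an unbounded stream into something an aggregate such as an average can be defined over.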
8 REQUIREMENTS-TRADITIONAL
• 3 Handle Stream Imperfections
  • Data may be delayed or lost in transmission
  • Data may arrive out of order
• 4 Generate Predictable Outcomes
  • Deterministic and repeatable
  • The same input stream yields the same outcome
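One common way to cope with out-of-order arrival is to buffer events briefly and release them in timestamp order. The sketch below assumes a fixed bound (`max_delay`) on how long to wait for late data and drops anything that arrives later than that. The function name and the drop policy are assumptions for illustration, not something the slides prescribe.

```python
import heapq


def reorder(events, max_delay):
    """Re-emit events in event-time order, tolerating bounded lateness.

    events: iterable of (event_time, payload) in arrival order.
    An event is released once a later event shows that max_delay time
    units have passed; events arriving after that point are dropped.
    """
    heap = []
    watermark = float("-inf")  # nothing older than this is accepted any more
    seq = 0                    # tie-breaker so payloads are never compared
    for event_time, payload in events:
        if event_time < watermark:
            continue  # too late: the output has already moved past it
        heapq.heappush(heap, (event_time, seq, payload))
        seq += 1
        watermark = max(watermark, event_time - max_delay)
        while heap and heap[0][0] <= watermark:
            t, _, p = heapq.heappop(heap)
            yield (t, p)
    # Flush whatever is left; only needed because this example is finite.
    while heap:
        t, _, p = heapq.heappop(heap)
        yield (t, p)


arrivals = [(1, "a"), (3, "b"), (2, "c"), (7, "d"), (4, "e"), (8, "f")]
ordered = list(reorder(arrivals, max_delay=2))
```

The output depends only on the input sequence, so replaying the same stream gives the same result, which is the predictability that requirement 4 asks for.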
8 REQUIREMENTS-TRADITIONAL
• 5 Integrate Stored and Streaming Data
  • Often need to compare the present with the past
    • Credit card fraud detection requires monitoring “normal” activity and storing it as a signature, which can then be used to identify “unusual” or suspicious activity
  • State information needs to be stored and accessed efficiently to meet real-time requirements
8 REQUIREMENTS-TRADITIONAL
• 6 Guarantee Data Safety and Availability
  • To preserve the integrity of mission-critical information
  • To avoid disruptions in real-time processing (real-time failover)
  • Backup
• 7 Partition and Scale Applications Automatically
• 8 Process and Respond Instantaneously
  • High volumes
  • Low latency
TRADITIONAL TECHNOLOGIES FOR SP
OTHER CHALLENGES
• Provide a simple programming interface for users
• Guarantee no message (data) loss
• Transactional support
SYSTEMS OVERVIEW
• Yahoo S4
  • Simple Scalable Streaming System
• Storm
  • Open sourced by Twitter in 2011
• Spark Streaming
  • Large-scale near-real-time stream processing
YAHOO S4
Simple Scalable Streaming System
YAHOO S4-DESIGN
• Stream = sequence of elements (events) of the form (K, A)
  • K: tuple-valued keys
  • A: attributes
• Goal:
  • Create a flexible stream computing platform
  • Consume streams
  • Compute intermediate values
  • Emit streams in a distributed environment
YAHOO S4-DESIGN
• Data processing
  • The base unit is the Processing Element (PE)
    • Functionality
    • Event type
    • Key
    • Attribute
  • Messages are transmitted between PEs
  • The state of each PE is inaccessible to other PEs
  • The framework provides the capability to route events to the appropriate PEs and to create new PEs
YAHOO S4-EXAMPLE-WORD COUNT
• Input event
  • Doc with quotation
• Goal:
  • Continuously produce a sorted list of the top K words
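The word-count example can be sketched with one processing element per key, in the spirit of the S4 design slides: events are routed by key, a PE is created the first time its key is seen, and each PE keeps private state. This is a single-process illustration only; the class names (`WordCountPE`, `TopKPE`, `Router`) are hypothetical and are not S4's API.

```python
import heapq


class WordCountPE:
    """One instance per distinct word (the key); its count is private state."""

    def __init__(self, word):
        self.word = word
        self.count = 0

    def on_event(self, _event):
        self.count += 1
        return (self.word, self.count)  # event emitted downstream


class TopKPE:
    """Receives (word, count) events and keeps the current top K."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def on_event(self, event):
        word, count = event
        self.counts[word] = count
        # Sort by count, breaking ties by word so the result is deterministic.
        return heapq.nlargest(self.k, self.counts.items(),
                              key=lambda kv: (kv[1], kv[0]))


class Router:
    """Routes each word to the PE for its key, creating the PE on first use."""

    def __init__(self, k):
        self.pes = {}
        self.top_k = TopKPE(k)

    def on_quote(self, text):
        result = []
        for word in text.lower().split():
            if word not in self.pes:
                self.pes[word] = WordCountPE(word)
            result = self.top_k.on_event(self.pes[word].on_event(word))
        return result


router = Router(k=2)
top = router.on_quote("to be or not to be")
```

PEs interact only through the events they emit, never by reading each other's state, which is what would let them be spread across nodes.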
YAHOO S4-PN
• So many PEs
  • PEC: Processing Element Container
• Logical hosts: Processing Nodes (PN)
  • Listen for events
  • Execute operations on events
  • Emit output events
• Communication layer
  • Cluster management
  • Automatic failover
STORM
Open sourced by Twitter in 2011
KEY CONCEPTS
• Tuples (ordered lists of elements)
  (“Saratov”, “Slukjanov”, “event1”, “10/3/12 16:20”)
KEY CONCEPTS
• Streams (unbounded sequences of tuples)
KEY CONCEPTS
• Spouts (sources of streams)
  • Talk with: queues, logs, API calls, event data
KEY CONCEPTS
• Bolts (process tuples and create new streams)
KEY CONCEPTS
• Topologies (a directed graph of spouts and bolts)
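The four concepts fit together as follows. This is a toy, single-process model in plain Python, not Storm's API: the function and class names are hypothetical, the spout reads from a fixed list instead of a queue, and the topology is restricted to a linear chain, whereas a real topology can be an arbitrary directed graph.

```python
from collections import defaultdict


def sentence_spout():
    """Spout: a source of tuples (a fixed list stands in for a queue or log)."""
    for sentence in ["the cow jumped", "the moon"]:
        yield (sentence,)


def split_bolt(tup):
    """Bolt: consumes one tuple and emits zero or more new tuples."""
    (sentence,) = tup
    for word in sentence.split():
        yield (word,)


class CountBolt:
    """Stateful bolt: emits (word, running_count) for every word tuple."""

    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, tup):
        (word,) = tup
        self.counts[word] += 1
        yield (word, self.counts[word])


def apply_bolt(bolt, stream):
    """Feeds every tuple of a stream through one bolt, producing a new stream."""
    for tup in stream:
        yield from bolt(tup)


def run_topology(spout, bolts):
    """Runs a linear topology: spout -> bolts[0] -> bolts[1] -> ..."""
    stream = spout()
    for bolt in bolts:
        stream = apply_bolt(bolt, stream)
    return list(stream)


counter = CountBolt()
output = run_topology(sentence_spout, [split_bolt, counter])
```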
KEY CONCEPTS
• Cluster
[Figure: Storm cluster diagram; labels: UI, Hadoop’s JobTracker, Hadoop’s TaskTracker]
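STORM VS S4
System | Yahoo! S4 | Storm
Language | Java | Clojure and Java
Architecture | Decentralized peer-to-peer structure | Central node (Nimbus), which is not critical
Routing | EventType + Key | Shuffle, Fields, All, Global; very flexible
Reliable processing | Not supported | Supported
Fault tolerance | Partially supported | Partially supported
Load balancing | Supported | Not supported
Web UI | None | Supported
Code maturity | Immature | Mature
Dynamically adding/removing nodes | Not supported | Supported
Community activity | Low | Active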
MORE ON STORM
• Guaranteeing message processing
  • Each message coming off a spout will be fully processed
  • “Fully processed” means the tuple tree has been exhausted and every message in the tree has been processed within a specified timeout
[Figure: a spout feeding a tree of bolts]
https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing
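The Storm wiki page linked above describes tracking each tuple tree with a single XOR checksum: every tuple id is XORed in once when the tuple is created and once when it is acked, so the value returns to zero exactly when the whole tree has been processed. The sketch below illustrates only that bookkeeping. The `Acker` class and its method names are hypothetical, the timeout is omitted, and the small constants stand in for the random 64-bit ids a real system would use.

```python
class Acker:
    """Tracks one XOR checksum per spout tuple (identified by root_id)."""

    def __init__(self):
        self.pending = {}  # root_id -> XOR of ids still outstanding

    def track(self, root_id, tuple_id):
        """Called when the spout emits the root tuple of a new tree."""
        self.pending[root_id] = tuple_id

    def ack(self, root_id, tuple_id, child_ids=()):
        """Called when a bolt finishes a tuple.

        The processed tuple's id and the ids of any child tuples it emitted
        are folded in together in one update, so the checksum cannot reach
        zero while children are still outstanding.
        Returns True once the whole tree has been processed.
        """
        value = self.pending[root_id] ^ tuple_id
        for child in child_ids:
            value ^= child
        if value == 0:
            del self.pending[root_id]
            return True
        self.pending[root_id] = value
        return False


acker = Acker()
a, b, c = 0x1, 0x2, 0x4           # stand-ins for random 64-bit tuple ids
acker.track("msg-1", a)
steps = [
    acker.ack("msg-1", a, [b, c]),  # a processed, emitted children b and c
    acker.ack("msg-1", b),          # b processed, no children
    acker.ack("msg-1", c),          # c processed: tree exhausted
]
```

The memory cost is one integer per spout tuple regardless of how large the tuple tree grows.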
MORE ON STORM
• Transactional topologies
  • Many messages are processed simultaneously
  • Strict ordering between messages
  • Guarantee exactly-once messaging semantics for pretty much any computation
https://github.com/nathanmarz/storm/wiki/Transactional-topologies
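STORM ON TAOBAO
• At Taobao, Storm is widely used for real-time log processing, in scenarios such as real-time statistics, real-time risk control, and real-time recommendation.
• Typically, we read real-time log messages from MetaQ (a Kafka-like system) or from TimeTunnel (built on HBase), run them through a series of processing steps, and finally write the results to a distributed store for applications to access.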
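STORM ON TAOBAO
• Our daily real-time message volume ranges from several million to several billion, and the total data volume reaches the terabyte level. For us, Storm is usually used together with distributed storage services. In our ongoing real-time analysis project for personalized search, we use a TimeTunnel + HBase + Storm + UPS architecture that processes billions of user log entries per day, with second-level latency from the moment a user action occurs to the completion of the analysis.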
SPARK STREAMING SYSTEM
Large-scale near-real-time stream processing
STATEFUL STREAM PROCESSING
• Traditional streaming systems have an event-driven, record-at-a-time processing model
  • Each node has mutable state
  • For each record, update state & send new records
• State is lost if a node dies!
• Making stateful stream processing fault-tolerant is challenging
[Figure: input records flowing through node 1, node 2, and node 3, each holding mutable state]
REQUIREMENTS
• Scalable to large clusters
• Second-scale latencies
• Simple programming model
• Integrated with batch & interactive processing
• Efficient fault tolerance in stateful computations
TRADITIONAL STREAMING SYSTEMS
• Fault tolerance via replication or upstream backup:
  • Replication (nodes 1 to 3 mirrored by synchronized replicas 1’ to 3’): fast recovery, but 2x hardware cost
  • Upstream backup (nodes 1 to 3 plus one standby): only needs 1 standby, but slow to recover
• Neither approach tolerates stragglers
OBSERVATION
• Batch processing models like MapReduce do provide fault tolerance efficiently
  • Divide the job into deterministic tasks
  • Rerun failed/slow tasks in parallel on other nodes
IDEA
• Idea: run a streaming computation as a series of very small, deterministic batch jobs
  • E.g., process a stream of tweets in 1-second batches
  • Same recovery schemes at a smaller timescale
• Try to make the batch size as small as possible
  • Smaller batch size -> lower end-to-end latency
• State between batches is kept in memory
  • Deterministic stateful ops -> fault tolerance
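The idea can be illustrated without any framework: cut the stream into small time-based batches, then fold each batch into the running state with a deterministic function. This is a sketch in plain Python, not Spark Streaming code; the function names, the millisecond timestamps, and the hashtag-count example are assumptions.

```python
from collections import Counter


def discretize(records, batch_ms):
    """Groups (timestamp_ms, value) records into consecutive time batches.

    records are assumed to arrive in timestamp order. Yields
    (batch_start_ms, [values]); intervals with no records are skipped.
    """
    batch_start, batch = None, []
    for ts, value in records:
        start = ts - ts % batch_ms
        if batch_start is not None and start != batch_start:
            yield batch_start, batch
            batch = []
        batch_start = start
        batch.append(value)
    if batch:
        yield batch_start, batch


def update_state(state, batch):
    """Deterministic batch job: the new state depends only on (state, batch).

    Returns a new Counter rather than mutating the old one, so earlier
    states stay intact and any step can be recomputed after a failure.
    """
    return state + Counter(batch)


def run(records, batch_ms):
    state = Counter()
    history = []
    for start, batch in discretize(records, batch_ms):
        state = update_state(state, batch)
        history.append((start, dict(state)))
    return history


tweets = [(200, "#a"), (700, "#b"), (1100, "#a"), (2500, "#a")]
history = run(tweets, batch_ms=1000)
```

Because `update_state` has no hidden inputs, rerunning it on the same batch and the same previous state always reproduces the same result, which is what makes recomputation a valid recovery strategy.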
DISCRETIZED STREAM PROCESSING
[Figure: at t = 1, t = 2, ..., input is pulled from stream 1 and stream 2 and a batch operation is applied. Each input is an immutable dataset (stored reliably); each output or state is an immutable dataset, stored in memory without replication.]
PARALLEL RECOVERY
• Checkpoint state datasets periodically
• If a node fails/straggles, recompute its dataset partitions in parallel on other nodes
• Faster recovery than upstream backup, without the cost of replication
[Figure: a map operation from the input dataset to the output dataset]
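A small sketch of the recovery idea, under the assumption that the input partitions are stored reliably and the operation is deterministic: only the lost output partitions are recomputed, and independent partitions can be recomputed concurrently. Threads stand in for "other nodes" here, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def word_lengths(partition):
    """A deterministic map operation applied to one partition."""
    return [len(word) for word in partition]


def compute(input_partitions, op):
    """Applies op to every input partition."""
    return {pid: op(data) for pid, data in input_partitions.items()}


def recover(surviving, lost_ids, input_partitions, op):
    """Recomputes only the lost partitions, in parallel, from stored input."""
    with ThreadPoolExecutor() as pool:
        recomputed = list(pool.map(lambda pid: op(input_partitions[pid]),
                                   lost_ids))
    repaired = dict(surviving)
    repaired.update(zip(lost_ids, recomputed))
    return repaired


inputs = {0: ["a", "bb"], 1: ["ccc"], 2: ["dddd", "e"]}
full = compute(inputs, word_lengths)
# Simulate losing the node that held output partition 1.
surviving = {pid: data for pid, data in full.items() if pid != 1}
repaired = recover(surviving, [1], inputs, word_lengths)
```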
COMPARISON WITH STORM AND S4
Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
• Apache S4: 7.5k records/second/node
[Figure: two panels, WordCount and Grep, plotting throughput per node (MB/s) against record size (bytes), with series labeled Spark and Storm]
STREAM-SYSTEM IN DEPLOYMENT
1. Monitor which search queries are currently popular.
2. When a new popular search query is discovered, send it to human evaluators to categorize the query or provide other information.
3. Push the information received to the backend systems. The next time a user searches for the query, machine learning models will make use of the additional information.
STREAM-SYSTEM IN DEPLOYMENT
• Monitoring for popular queries
  • Tuple streams of data (search query, timestamp)
  • Spouts attach to search logs
  • Bolts process tuple streams: they update the count of the number of times we’ve seen a query, check whether the query is “currently popular”, and dispatch it to the human computation pipeline if so.
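The bolt logic described on this slide can be sketched as a sliding-window counter. The slides do not say how “currently popular” is defined, so the rule used here (at least `threshold` occurrences within the last `window_seconds`) is an assumption, as are the class name and the sample stream.

```python
from collections import defaultdict, deque


class PopularQueryBolt:
    """Counts recent occurrences of each query and flags newly popular ones.

    Simplification: a query is flagged at most once and never un-flagged.
    """

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.recent = defaultdict(deque)  # query -> timestamps in the window
        self.flagged = set()

    def process(self, query, timestamp):
        """Handles one (query, timestamp) tuple.

        Returns the query if it has just become popular (so that it can be
        dispatched to human evaluation), otherwise None.
        """
        times = self.recent[query]
        times.append(timestamp)
        # Drop occurrences that have fallen out of the window.
        while times and times[0] <= timestamp - self.window_seconds:
            times.popleft()
        if len(times) >= self.threshold and query not in self.flagged:
            self.flagged.add(query)
            return query
        return None


bolt = PopularQueryBolt(window_seconds=60, threshold=3)
stream = [("big bird", 0), ("weather", 10), ("big bird", 20),
          ("big bird", 30), ("big bird", 40)]
dispatched = [bolt.process(query, ts) for query, ts in stream]
```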
STREAM-SYSTEM IN DEPLOYMENT
• Human evaluation of popular search queries
  • The Storm topology has detected the query “Big Bird”
  • Submit it to Mechanical Turk
    I. What category does the query belong to?
    II. Does the query refer to a person?
REFERENCE
• Storm: http://storm-project.net/
• Storm–Taobao: http://www.searchtb.com/2012/09/introduction-to-storm.html
• S4 vs Storm: http://demeter.inf.ed.ac.uk/cross/docs/s4vStorm.pdf
• Twitter Storm - Realtime distributed computations: http://cloud.berkeley.edu/data/storm-berkeley.pdf
• Yahoo! S4: http://www.cnblogs.com/aga-j/archive/2012/02/03/2337151.html
• Twitter engineering: http://engineering.twitter.com/2013/01/improving-twitter-search-with-real-time.html
• Spark Streaming: Large-scale near-real-time stream processing
• S4: Distributed Stream Computing Platform
• The 8 Requirements of Real-Time Stream Processing, 2005
• Pollux: Towards Scalable Distributed Real-time Search on Microblogs