REAL-TIME STREAM PROCESSING

Yu Lele

OUTLINE

• Why is real-time processing needed?
• Traditional stream methodology
• Current systems for stream processing
• Stream systems in deployment

START?

Preliminary

• Financial
  • Queries over real-time streaming financial data
  • Stock tickers and news feeds

• Network security
  • Operates over network packet streams
  • Provides firewall support and intrusion detection
  • URL filtering based on table lookups

• Web logs
  • Personalization, performance monitoring, load balancing

• Sensor monitoring
  • Large numbers of distributed sensors generate streams of data
  • Combine, monitor, and analyze

EXAMPLE: TWITTER SEARCH

• When each of these events happened, people instantly searched on Twitter.
• Trends

Preliminary Twitter

• Challenges for search & advertising:

1. Queries that have never been seen before
   I. #bindersfullofwomen: politics
   II. #horsesandbayonets: Presidential debates

2. Short-lived queries
   I. Real-time processing is needed

EXAMPLE: Ali

Preliminary Taobao

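Requirement categories

• Compute key business metrics that objectively reflect current performance, e.g. website activity monitoring
• Track trends in business metrics and raise intelligent alerts on abnormal fluctuations
• Real-time data applications in closed-loop business operations, e.g. event marketing and triggered services
• Real-time recommendation
• Real-time data and information services

Requirement characteristics

• Minute-level latency; no missed records; no miscalculation; statistics cover the current day
• Second-level latency; missed records tolerated; no miscalculation; complex computation (many rules)
• Second-level latency; missed records tolerated; no miscalculation; frequent system interaction is not recommended
• Second-level latency; no missed records; no miscalculation; complex computation (complex metric definitions, many metrics); some metrics are computed over periods spanning more than one day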

EXAMPLE: ALIPAY PLATFORM (TRANSACTION)

• Calculate in real time
  • Trade quantity
  • Trade amount
  • Trading information for the top N sellers
  • User registration count
• More than 100 million messages
• Log processing, 6 TB of data per day
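A minimal Python sketch of how running metrics like those listed above (trade quantity, trade amount, top N sellers) could be maintained one message at a time. The class name `TradeStats`, the message fields, and the sample values are illustrative assumptions, not Alipay's actual design.

```python
import heapq
from collections import defaultdict


class TradeStats:
    """Running trade metrics, updated in memory as each message arrives."""

    def __init__(self):
        self.trade_count = 0
        self.trade_amount = 0.0
        self.amount_by_seller = defaultdict(float)

    def on_trade(self, seller_id, amount):
        # Update every metric incrementally; no batch recomputation.
        self.trade_count += 1
        self.trade_amount += amount
        self.amount_by_seller[seller_id] += amount

    def top_sellers(self, n):
        # The n sellers with the largest running trade amount.
        return heapq.nlargest(
            n, self.amount_by_seller.items(), key=lambda kv: kv[1]
        )


# Hypothetical sample messages: (seller_id, amount).
stats = TradeStats()
for seller, amount in [("s1", 30.0), ("s2", 50.0), ("s1", 40.0), ("s3", 10.0)]:
    stats.on_trade(seller, amount)
```

At the volumes quoted on the slide, a single in-memory dictionary would not suffice; the state would have to be partitioned by seller across machines.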

Preliminary Operation Example

EXAMPLE: THE WEATHER CHANNEL

• Real-time ingest and persist weather data
  I. Several Storm topologies
  II. Each fetches one dataset
  III. Reshape the records
  IV. Persist the records to a relational database

• Automatic mechanism for repeatedly downloading and manipulating the data

http://www.weather.com/

8 REQUIREMENTS-TRADITIONAL

• 1 Keep the Data Moving
  • Storage operations are too costly
  • They add unnecessary latency (disk writes)
  • Process messages “in-stream”, without any requirement to store them
  • Also use an active (i.e., non-polling) processing model
  • Requires larger primary memory capacity

Preliminary Challenges

• 2 Support a high-level “StreamSQL” language
  • SQL is widely understood and promulgated
  • StreamSQL is designed for continuous data; it extends standard SQL with windowing constructs and stream-specific operators
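StreamSQL syntax differs between products, so rather than guess at a dialect, the sketch below illustrates in Python what a windowing construct does: it groups an unbounded stream into fixed, non-overlapping (tumbling) time windows and aggregates each one. The function name and the sample events are assumptions made for illustration.

```python
def tumbling_window_sum(events, window_seconds):
    """Sum values per fixed, non-overlapping time window.

    events: iterable of (timestamp_seconds, value), in timestamp order.
    Yields (window_start, total) each time a window closes.
    """
    current_start = None
    total = 0
    for ts, value in events:
        start = ts - (ts % window_seconds)  # window this event belongs to
        if current_start is not None and start != current_start:
            yield current_start, total      # previous window is complete
            total = 0
        current_start = start
        total += value
    if current_start is not None:
        yield current_start, total          # flush the last open window


# Hypothetical stock-tick volumes: (timestamp, volume).
ticks = [(0, 5), (3, 7), (10, 1), (12, 2), (25, 4)]
windows = list(tumbling_window_sum(ticks, 10))
```

With a 10-second window the five ticks fall into three windows, starting at 0, 10, and 20.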

• 3 Handle Stream Imperfections
  • Data may be delayed or lost in transmission
  • Data may arrive out of order
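One common way to cope with out-of-order arrival is a bounded reorder buffer: hold each event until something sufficiently newer has been seen, then release events in timestamp order and drop anything that arrives too late. The sketch below is one possible approach under that assumption; `max_delay` and the sample events are hypothetical.

```python
import heapq


def reorder(events, max_delay):
    """Re-emit (timestamp, payload) events in timestamp order.

    An event is released once an event at least max_delay newer has been
    seen. Events older than (newest timestamp - max_delay) are dropped.
    """
    buffer = []      # min-heap ordered by timestamp
    max_seen = None  # newest timestamp observed so far
    seq = 0          # tie-breaker so payloads are never compared
    for ts, payload in events:
        if max_seen is not None and ts < max_seen - max_delay:
            continue  # too late: output has already moved past this time
        heapq.heappush(buffer, (ts, seq, payload))
        seq += 1
        max_seen = ts if max_seen is None else max(max_seen, ts)
        while buffer and buffer[0][0] <= max_seen - max_delay:
            out_ts, _, out_payload = heapq.heappop(buffer)
            yield out_ts, out_payload
    while buffer:  # end of input: flush what is left, still in order
        out_ts, _, out_payload = heapq.heappop(buffer)
        yield out_ts, out_payload


arrivals = [(1, "a"), (3, "c"), (2, "b"), (6, "e"), (0, "late"), (5, "d")]
ordered = list(reorder(arrivals, max_delay=2))
```

The trade-off is explicit: a larger `max_delay` tolerates more disorder but adds latency and memory, which works against requirement 1.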

• 4 Generate Predictable Outcomes
  • Deterministic and repeatable
  • The same input stream yields the same outcome

• 5 Integrate Stored and Streaming Data
  • Often need to compare the present with the past
  • Credit card fraud detection requires monitoring “normal” activity and storing it as a signature, which can then be used to identify “unusual” or suspicious activity
  • State information needs to be stored and accessed efficiently to meet real-time requirements
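As a rough illustration of comparing present events against stored state, the sketch below keeps a per-card running average as the "normal" signature and flags a transaction that exceeds it by a fixed factor. The class name, the factor of 3, and the minimum history of 3 transactions are all hypothetical choices, not a real fraud model.

```python
from collections import defaultdict


class SpendingSignature:
    """Per-card running average used as a stored 'normal activity' signature."""

    def __init__(self, threshold=3.0, min_history=3):
        self.threshold = threshold      # assumed factor over the average
        self.min_history = min_history  # observations needed before flagging
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def check_and_update(self, card_id, amount):
        n = self.count[card_id]
        suspicious = (
            n >= self.min_history
            and amount > self.threshold * self.mean[card_id]
        )
        if not suspicious:
            # Fold only normal activity into the stored signature.
            self.count[card_id] = n + 1
            self.mean[card_id] += (amount - self.mean[card_id]) / (n + 1)
        return suspicious


sig = SpendingSignature()
flags = [
    sig.check_and_update("card-1", amount)
    for amount in [20.0, 25.0, 15.0, 500.0, 22.0]
]
```

The signature is updated incrementally in memory, so each check costs a constant amount of work regardless of how much history the card has.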

• 6 Guarantee Data Safety and Availability
  • Preserve the integrity of mission-critical information
  • Avoid disruptions in real-time processing (real-time failover)
  • Backup

• 7 Partition and Scale Application Automatically

• 8 Process and Respond Instantaneously
  • High volumes
  • Low latency

TRADITIONAL TECHNOLOGIES FOR SP

OTHER CHALLENGES

• Provide a simple programming interface for users
• Guarantee no message (data) loss
• Transactional support

SYSTEMS OVERVIEW

• Yahoo S4
  • Simple Scalable Streaming System

• Storm
  • Open sourced by Twitter in 2011

• Spark-Streaming
  • Large-scale near-real-time stream processing

Systems Overview

YAHOO S4: Simple Scalable Streaming System

YAHOO S4-DESIGN

• Stream = sequence of elements (events) of the form (K, A)
  • K – tuple-valued keys
  • A – attributes

• Goal:
  • Create a flexible stream computing platform that
    • Consumes streams
    • Computes intermediate values
    • Emits streams in a distributed environment

Systems Yahoo S4

• Data processing
  • The base unit is the Processing Element (PE), identified by
    • Functionality
    • Event type
    • Key
    • Attribute
  • Messages are transmitted between PEs
  • The state of each PE is inaccessible to other PEs
  • The framework provides the capability to route events to the appropriate PEs and to create new PEs

YAHOO S4-EXAMPLE-WORD COUNT

• Input event
  • Document with a quotation

• Goal:
  • Continuously produce a sorted list of the top K words
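The word-count goal can be sketched in plain Python to show the PE idea: one keyed element per distinct word holding private state, a router that creates elements on demand, and a downstream element that keeps the sorted top K. This is a single-process illustration and not the S4 API; the class names and event-type strings are assumptions.

```python
class WordCountPE:
    """One instance per distinct word (the key); its count is private state."""

    def __init__(self, word):
        self.word = word
        self.count = 0

    def process(self, event):
        self.count += 1
        return ("UpdatedCount", self.word, self.count)


class TopKPE:
    """Receives count updates and keeps a sorted top-K list."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def process(self, event):
        _, word, count = event
        self.counts[word] = count
        # Sort by descending count, then alphabetically for stable ties.
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.k]


class Router:
    """Routes each word event to the PE for its key, creating PEs on demand."""

    def __init__(self, k):
        self.word_pes = {}
        self.top_k = TopKPE(k)

    def on_quote(self, text):
        result = []
        for word in text.lower().split():
            pe = self.word_pes.get(word)
            if pe is None:
                pe = self.word_pes[word] = WordCountPE(word)
            result = self.top_k.process(pe.process(("WordEvent", word)))
        return result


router = Router(k=2)
router.on_quote("the quick fox")
top = router.on_quote("the lazy fox the end")
```

Because each `WordCountPE` only ever sees events for its own key, the instances could be spread across machines without sharing state.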

YAHOO S4-PN

• So many PEs
  • PEC: Processing Element Container

• Logical hosts: Processing Nodes (PN)
  • Listen to events
  • Execute operations on incoming events
  • Emit output events

• Communication Layer
  • Cluster management
  • Automatic failover

STORM: Open sourced by Twitter in 2011

KEY CONCEPTS

• Tuples (ordered list of elements)

(“Saratov”, “Slukjanov”, “event1”, “10/3/12 16:20”)

Systems Storm

• Streams (unbounded sequence of tuples)

• Spouts (source of streams)
  • Talk with: queues, logs, API calls, event data

• Bolts (process tuples and create new streams)

• Topologies (a directed graph of Spouts and Bolts)
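To make the relationship between tuples, spouts, bolts, and topologies concrete, here is a small single-process Python model. It is not the Storm API: the class names and the `run_topology` helper are hypothetical, and the topology is simplified to a linear chain, which is the simplest case of a directed graph.

```python
class SentenceSpout:
    """Source of a stream: emits one tuple per sentence."""

    def __init__(self, sentences):
        self.sentences = sentences

    def tuples(self):
        for sentence in self.sentences:
            yield (sentence,)


class SplitBolt:
    """Consumes sentence tuples and emits a new stream of word tuples."""

    def process(self, tup):
        for word in tup[0].split():
            yield (word,)


class CountBolt:
    """Consumes word tuples and emits (word, running_count) tuples."""

    def __init__(self):
        self.counts = {}

    def process(self, tup):
        word = tup[0]
        self.counts[word] = self.counts.get(word, 0) + 1
        yield (word, self.counts[word])


def run_topology(spout, bolts):
    """Push every spout tuple through a linear chain of bolts, in order."""
    outputs = []
    for tup in spout.tuples():
        stream = [tup]
        for bolt in bolts:
            stream = [out for t in stream for out in bolt.process(t)]
        outputs.extend(stream)
    return outputs


results = run_topology(SentenceSpout(["a b", "b c b"]), [SplitBolt(), CountBolt()])
```

In a real deployment the stream is unbounded and each spout and bolt runs as many parallel tasks across the cluster; this model only shows the data flow.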

• Cluster

[Figure: Storm cluster diagram. Surviving labels: “Hadoop’s JobTracker” and “Hadoop’s TaskTracker”.]

STORM VS S4

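System | Yahoo! S4 | Storm
Development language | Java | Clojure and Java
Architecture | Decentralized, peer-to-peer structure | Central node (Nimbus), not critical
Routing | EventType + Key | Shuffle, Fields, All, Global; very flexible
Reliable processing | Not supported | Supported
Fault tolerance | Partially supported | Partially supported
Load balancing | Supported | Not supported
Web UI | None | Supported
Code maturity | Immature | Mature
Dynamic addition and removal of nodes | Not supported | Supported
Community activity | Low | Active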
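The Shuffle, Fields, All, and Global routing options listed for Storm in the comparison above decide which downstream task receives each tuple. The sketch below models their semantics in plain Python; it is not Storm's implementation, `make_router` is a hypothetical helper, and round-robin stands in for Storm's random but even shuffle distribution.

```python
import itertools
import zlib


def make_router(grouping, num_tasks, field_index=0):
    """Return a function mapping a tuple to the list of target task ids."""
    if grouping == "shuffle":
        # Round-robin approximates an even, key-independent distribution.
        counter = itertools.count()
        return lambda tup: [next(counter) % num_tasks]
    if grouping == "fields":
        # Tuples with the same field value always reach the same task.
        return lambda tup: [
            zlib.crc32(str(tup[field_index]).encode()) % num_tasks
        ]
    if grouping == "all":
        # Every task receives a copy of every tuple.
        return lambda tup: list(range(num_tasks))
    if grouping == "global":
        # The whole stream goes to a single task (the lowest id).
        return lambda tup: [0]
    raise ValueError(f"unknown grouping: {grouping}")


fields_router = make_router("fields", num_tasks=4)
shuffle_router = make_router("shuffle", num_tasks=2)
shuffle_targets = [shuffle_router(("x",))[0] for _ in range(4)]
```

A fields grouping is what a per-word counter needs, since all occurrences of one word must reach the same task; S4's EventType + Key routing plays the corresponding role.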
