introduction to real-time data processing - meetupfiles.meetup.com/18978602/university prgoram -...

Introduction to Real-time data processing Yogi Devendra ([email protected])

Post on 22-May-2020

Agenda

● What is big data?
● Data at rest Vs Data in motion
● Batch processing Vs Real-time data processing (streaming)
● Examples
● When to use: Batch? Real-time?
● Current trends

Image ref [4]

Big data

Definition : big data

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. [1]

Exploding sizes of datasets

● Google
○ >100 PB of data every day [3]
● Large Hadron Collider:
○ 150M sensors, each sampled 40M times per sec; about 600M collisions per sec
○ >500 exabytes per day if all sensor data were recorded [2]
○ Only 0.0001% of the data is actually analysed

Questions

Image ref [16]

Data at rest Vs Data in motion

● At rest :
○ Dataset is fixed
○ a.k.a. bounded [15]
● In motion :
○ Continuously incoming data
○ a.k.a. unbounded

Data at rest Vs Data in motion (continued)

● Big data generally has velocity
○ continuous data
● The difference lies in when you analyze your data [5]
○ after the event occurs ⇒ at rest
○ as the event occurs ⇒ in motion
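
The bounded/unbounded distinction above can be illustrated with a minimal Python sketch. The sensor readings and the `sensor_stream` source are made up for illustration; they are not from the talk.

```python
import itertools
from typing import Iterator, List

# Data at rest: a fixed (bounded) dataset that exists in full before analysis.
readings_at_rest: List[int] = [21, 23, 22, 25]


def sensor_stream() -> Iterator[int]:
    """Data in motion: a hypothetical unbounded source that never ends."""
    for tick in itertools.count():
        yield 20 + tick % 5  # made-up reading, for illustration only


# Bounded data can be analysed after the event, in one pass over everything.
average_at_rest = sum(readings_at_rest) / len(readings_at_rest)

# Unbounded data has no "end", so we can only look at what has arrived so far.
first_six = list(itertools.islice(sensor_stream(), 6))
```

Calling `list(sensor_stream())` without `islice` would never return, which is the practical meaning of "unbounded".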

Examples

● Data at rest
○ Finding stats about a group in a closed room
○ Analyzing last month's sales data to make strategic decisions
● Data in motion
○ Finding stats about a group running a marathon
○ E-commerce order processing

Batch processing

● Problem statement :
○ Process this entire dataset
○ Give the answer for X at the end.
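
As a rough sketch of this problem statement, the Python function below reads a whole dataset and returns a single answer only after the last record. The `monthly_sales_summary` name and the sample orders are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def monthly_sales_summary(orders: List[Tuple[str, float]]) -> Dict[str, float]:
    """Batch job: read the entire dataset, then return one answer at the end."""
    totals: Dict[str, float] = defaultdict(float)
    for product, amount in orders:
        totals[product] += amount
    return dict(totals)


# Hypothetical "last month" dataset; it is complete before the job starts.
last_month = [("pen", 10.0), ("book", 250.0), ("pen", 15.0)]
summary = monthly_sales_summary(last_month)
```

No partial result is visible while the loop runs; the caller sees `summary` only once every order has been processed.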

Batch processing : Use-cases

● Sales summary for the previous month [5]
● Training a model to detect spam emails

Batch processing : Characteristics

● Access to the entire dataset
● How the data is split is decided at launch time
● Capable of complex analysis (e.g. model training) [6]
● Optimized for throughput (data processed per sec)
● Example frameworks : MapReduce, Apache Spark [6]
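
To make "split decided at launch time" concrete, here is a minimal single-process sketch, not how MapReduce or Spark are implemented. The helper names `split_at_launch` and `run_batch` and the worker count are assumptions for illustration.

```python
from typing import List


def split_at_launch(data: List[int], workers: int) -> List[List[int]]:
    """Partition the full dataset once, before any processing starts."""
    size = max(1, -(-len(data) // workers))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_batch(data: List[int], workers: int = 3) -> int:
    partitions = split_at_launch(data, workers)
    partial_sums = [sum(part) for part in partitions]  # work done per split
    return sum(partial_sums)  # combined into one final answer


total = run_batch(list(range(1, 11)))
```

The split is possible only because the whole dataset is known up front; a real framework would hand each partition to a separate worker.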

Real-time data processing

● a.k.a. Stream processing
● Problem statement :
○ Process the incoming stream of data
○ to give the answer for X at this moment.
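
A minimal sketch of this problem statement in Python: the answer is updated after every incoming record, so a current value is available at any moment. The `running_order_total` name and the sample amounts are assumptions for illustration.

```python
from typing import Iterable, Iterator


def running_order_total(amounts: Iterable[float]) -> Iterator[float]:
    """Emit an up-to-date answer after every incoming record."""
    total = 0.0
    for amount in amounts:
        total += amount
        yield total  # the answer for X "at this moment"


# A short list stands in for the stream here; any iterable source would work.
answers = list(running_order_total([10.0, 250.0, 15.0]))
```

Unlike the batch case, the function never needs to see the end of its input to produce a useful result.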

Stream processing : Use-cases

● E-commerce order processing
● Credit card fraud detection
● Labelling a given email as spam vs non-spam

Image ref [7]

Stream processing : Characteristics

● Results for X are based on the current data

● Computes a function over one record or a small window of records [6]

● Optimized for latency (avg. time taken to process a record)
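
Computing over "a small window" can be sketched as a sliding average that keeps only the most recent records. The window size of 3 and the sample values are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def windowed_average(values: Iterable[float], window: int = 3) -> Iterator[float]:
    """Average over the last `window` records, updated on every arrival."""
    recent: Deque[float] = deque(maxlen=window)  # old records fall out automatically
    for value in values:
        recent.append(value)
        yield sum(recent) / len(recent)


averages = list(windowed_average([3.0, 6.0, 9.0, 12.0], window=3))
```

Because only `window` records are held in memory, the work per record stays small, which is what keeps latency low.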

Stream processing : Characteristics (continued)

● Needs to complete the computation in near real-time

● Computes something relatively simple, e.g. using a pre-defined model to label a record.

● Example frameworks: Apache Apex, Apache Storm
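
"Using a pre-defined model to label a record" might look like the sketch below. The word weights, the threshold, and the function names are hypothetical; a real model would come from a separate training job and would be far richer.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical pre-defined model: word weights assumed to have been
# learned earlier by a separate batch training job.
SPAM_WEIGHTS: Dict[str, float] = {"lottery": 2.0, "winner": 1.5, "meeting": -1.0}
THRESHOLD = 1.0


def label(email: str) -> str:
    """Cheap per-record computation: score the words, compare to a threshold."""
    score = sum(SPAM_WEIGHTS.get(word, 0.0) for word in email.lower().split())
    return "spam" if score > THRESHOLD else "non-spam"


def label_stream(emails: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Label each email independently as it arrives."""
    for email in emails:
        yield email, label(email)


incoming = ["Lottery winner announced", "Project meeting at noon"]
labels = [lab for _, lab in label_stream(incoming)]
```

The expensive part (training) stays in batch; the stream only applies the finished model, one record at a time.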

Batch Vs Streaming

pani puri (served one piece at a time) ⇒ streaming, image ref [9]

wada (fried many at a time) ⇒ batch, image ref [8]

Micro-batch

● Create batches of small size
● Process each micro-batch separately
● Example frameworks: Spark Streaming

pani puri ⇒ micro-batch, image ref [10]
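
The micro-batch idea can be sketched by cutting a stream into small groups and processing each group as a tiny batch. This sketch groups by record count for simplicity, whereas Spark Streaming groups records by a time interval; the `micro_batches` helper is an assumption for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut an incoming stream into small batches of at most `batch_size`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Each micro-batch is processed separately, here by summing it.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
```

Smaller batches move the behaviour towards streaming (lower latency); larger ones move it towards batch (higher throughput).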

When to use : Batch Vs Streaming?

● Depends on the use-case
○ Some are suitable for batch
○ Some are suitable for streaming
○ Some can be solved by either one
○ Some might need a combination of the two.

When to use : Batch Vs Real time? (continued)

● Answers for the current snapshot ⇒ Real-time
○ Answers at the end ⇒ Open (either can work)
● Complex calculations, multiple iterations over the entire data ⇒ Batch
○ Simple computations ⇒ Open
● Low latency requirements (< 1s) ⇒ Real-time

When to use : Batch Vs Real time? (continued)

● Each record can be processed independently ⇒ Open
○ Independent processing not possible ⇒ Batch
● Depends on the use-case
○ Some use-cases can be solved by either one
○ Some others might need a combination of the two.
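
The rules of thumb from these "when to use" slides can be written down as a small helper. This is only one possible encoding; the function name, parameter names, and the choice to return "combination" when the rules conflict are assumptions, not part of the talk.

```python
def suggest_mode(needs_current_snapshot: bool,
                 needs_multiple_passes: bool,
                 max_latency_seconds: float,
                 records_independent: bool) -> str:
    """Return 'real-time', 'batch', 'combination' or 'open' (either works)."""
    # Current-snapshot answers or sub-second latency point to real-time.
    wants_realtime = needs_current_snapshot or max_latency_seconds < 1.0
    # Multiple passes over all data, or dependent records, point to batch.
    wants_batch = needs_multiple_passes or not records_independent
    if wants_realtime and wants_batch:
        return "combination"
    if wants_realtime:
        return "real-time"
    if wants_batch:
        return "batch"
    return "open"


fraud_check = suggest_mode(True, False, 0.2, True)
model_training = suggest_mode(False, True, 3600.0, False)
```

A real decision would weigh more factors (cost, existing infrastructure, team skills), so the output is a starting point rather than a verdict.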

Can one replace the other?

● Batch processing is designed for ‘data at rest’. ‘Data in motion’ becomes stale if processed in batch mode.

● Real-time processing is designed for ‘data in motion’, but it can also be used for ‘data at rest’ (in many cases).
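
The second point can be sketched by replaying a stored dataset through stream-style logic: the same record-at-a-time function works, and its last output is the batch answer. The `running_max` function and the sample data are assumptions for illustration.

```python
from typing import Iterable, Iterator, Optional


def running_max(values: Iterable[int]) -> Iterator[int]:
    """Stream-style logic: update the answer one record at a time."""
    best: Optional[int] = None
    for value in values:
        best = value if best is None else max(best, value)
        yield best


stored_dataset = [4, 9, 2, 11, 7]  # data at rest
# Replay the stored data as if it were a stream and keep the last answer.
*_, final_answer = running_max(stored_dataset)
```

The reverse does not hold as easily: a batch `max(stored_dataset)` needs the data to end before it can answer.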

Quiz : is this Batch or Real-time?

● Queue for roller coaster ride image ref [11]

● Queue at the petrol pump image ref [12]

Quiz : is this Batch or Real-time? (continued)

● Selecting a relevant ad to show for the requested page image ref [13]

● Courier dispatch from city A to B image ref [14]

Current trends

● Difficulty in expressing problems as MapReduce: alternative paradigms for expressing user intent.
● More and more use-cases demand faster insight into data (near real-time)
● ‘Data in motion’ is common.
● ‘Real-time data processing’ is gaining traction.

References

1. Big Data | Gartner IT Glossary http://www.gartner.com/it-glossary/big-data/
2. Big Data | Wikipedia https://en.wikipedia.org/wiki/Big_data
3. Data size estimates | Follow the data https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
4. Data Never Sleeps 2.0 | Domo https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
5. Data in motion vs. data at rest | Internap http://www.internap.com/2013/06/20/data-in-motion-vs-data-at-rest/
6. Difference between batch processing and stream processing | Quora https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems/answer/Sean-Owen?srid=O9ht
7. How FAST is Credit Card Fraud Detection | FICO http://www.fico.com/en/latest-thinking/infographic/how-fast-is-credit-card-fraud-detection
8. CULINARY TERMS | panjakhada http://panjakhada.com/the-basics/
9. Crispy Chaat ... | grabhouse http://grabhouse.com/urbancocktail/11-crispy-chaat-joints-food-lovers-hyderabad/
10. Paani puri stall | cityshor http://www.cityshor.com/pune/food/street-food/camp/murali-paani-puri-stall/
11. Great Inventions: The Roller Coaster | findingdulcinea http://www.findingdulcinea.com/features/science/innovations/great-inventions/the-roller-coaster.html
12. RIL petrol pump network | economictimes http://articles.economictimes.indiatimes.com/2015-05-24/news/62583419_1_petrol-and-diesel-fuel-retailing-ril
13. Publishers | Propellerads https://propellerads.com/publishers/
14. Michael Bishop Couriers | Google plus https://plus.google.com/110684176517668223067
15. The world beyond batch: Streaming 101 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
16. How to Answer the Question http://www.clipartpanda.com/clipart_images/how-to-answer-the-question-46954146
17. Thank You http://www.planwallpaper.com/thank-you
