Designing Real-Time Data Ingestion Pipeline Badar Ahmed

Upload: datascience

Post on 20-Mar-2017


Category:

Technology



TRANSCRIPT

Page 1: Designing a Real Time Data Ingestion Pipeline

Designing Real-Time Data Ingestion Pipeline
Badar Ahmed

Page 2: Designing a Real Time Data Ingestion Pipeline

About Us

DataScience Inc.
▪ Data Science as a service
▪ Customers from Sonos to Belkin
▪ Ranked #1 among "Best Places to Work in Los Angeles for 2015"
▪ Visit datascience.com!

Badar Ahmed
▪ Software Engineer
▪ Background in high performance computing & cloud computing
▪ Work across the stack on Big Data problems

Page 3: Designing a Real Time Data Ingestion Pipeline

Importance of Data Ingestion

▪ Data ingestion is a precursor to any analysis
▪ Characteristics:
✦ Reliability
✦ Correctness
✦ Speed
✦ Scalability

Page 4: Designing a Real Time Data Ingestion Pipeline

Types of Data Ingestion

▪ Broad topic with many different architectural patterns

▪ Real Time
▪ Batch

▪ Structured Data
▪ Unstructured Data

Page 5: Designing a Real Time Data Ingestion Pipeline

Ingestion Evolution @ DataScience

▪ Legacy API existed
▪ But ...
✦ Expensive
✦ Ops Heavy
✦ Hard to scale
✦ No batch interface

Page 6: Designing a Real Time Data Ingestion Pipeline


Page 7: Designing a Real Time Data Ingestion Pipeline

What was needed

▪ Scalable ingestion system
▪ Batch Ingest
▪ Lower Ops and $$$ Cost

Page 8: Designing a Real Time Data Ingestion Pipeline

Idea #1

▪ Asynchronous API
▪ Queue requests and process them later

Pros:
▪ Fast
▪ Scalable
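The queue-and-process-later idea can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the class name AsyncIngestor, the write_fn callable standing in for the datastore, and the status strings are all assumptions made for the example.

```python
import queue
import threading
import uuid


class AsyncIngestor:
    """Accepts records immediately and writes them later on a worker thread.

    Hypothetical sketch: `write_fn` stands in for whatever writes to the datastore.
    """

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._queue = queue.Queue()
        self._status = {}  # request_id -> "queued" | "done" | "failed"
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, record):
        """Enqueue a record and return a request id without waiting for the write."""
        request_id = str(uuid.uuid4())
        with self._lock:
            self._status[request_id] = "queued"
        self._queue.put((request_id, record))
        return request_id

    def status(self, request_id):
        """Callers must poll this to learn the outcome of an earlier submit."""
        with self._lock:
            return self._status.get(request_id, "unknown")

    def wait(self):
        """Block until every queued record has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            request_id, record = self._queue.get()
            try:
                self._write_fn(record)
                outcome = "done"
            except Exception:
                outcome = "failed"
            with self._lock:
                self._status[request_id] = outcome
            self._queue.task_done()
```

A caller gets a request id back from submit() right away, which is what makes the API fast, but it then has to call status() later to find out whether the write succeeded.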


Page 9: Designing a Real Time Data Ingestion Pipeline


Page 10: Designing a Real Time Data Ingestion Pipeline

Issues with Idea #1

▪ Failure introduces complexity
▪ Decoupled systems can be more difficult to debug
▪ UX is poorer if users need to keep track of async requests
▪ A lot of deviation from the simpler API model of ConnectHQ

Page 11: Designing a Real Time Data Ingestion Pipeline

Idea #2

▪ Synchronous Batch so ...
✦ UX remains the same
▪ Use concurrency to do parallel writes to datastore
✦ Caveat: Concurrent code is difficult to write & debug
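One way to picture a synchronous batch call that writes in parallel is the sketch below. It assumes a thread pool and a hypothetical write_fn callable for the datastore write; the talk does not say which language or concurrency primitive was actually used, and ingest_batch is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor


def ingest_batch(records, write_fn, max_workers=8):
    """Write all records concurrently and return only when every write has finished.

    Hypothetical sketch: `write_fn` stands in for a single-record datastore write.
    The caller blocks until the whole batch is done, so the API stays synchronous.
    """
    records = list(records)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(write_fn, record) for record in records]
        for record, future in zip(records, futures):
            error = future.exception()  # blocks until this write completes
            results.append({
                "record": record,
                "ok": error is None,
                "error": None if error is None else str(error),
            })
    return results
```

Because the function returns a per-record result in input order, the caller sees one request and one response, while the writes themselves overlap. If write_fn touches shared state, it must do so safely across threads, which is where the caveat about concurrent code applies.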


Page 12: Designing a Real Time Data Ingestion Pipeline

First Step: Prototype


Page 13: Designing a Real Time Data Ingestion Pipeline

First Step: Prototype


Page 14: Designing a Real Time Data Ingestion Pipeline


Page 15: Designing a Real Time Data Ingestion Pipeline


Page 16: Designing a Real Time Data Ingestion Pipeline

Integration Testing


Page 17: Designing a Real Time Data Ingestion Pipeline

Integration Testing


Page 18: Designing a Real Time Data Ingestion Pipeline

Unit Testing with Mocks
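The slide content for this step did not survive in the transcript. As a general illustration only, a unit test can replace the datastore with a mock so the ingestion logic is tested without a real backend. The ingest function and the datastore.write method below are hypothetical names, and Python's unittest.mock is an assumed choice of tooling.

```python
from unittest import mock


def ingest(records, datastore):
    """Hypothetical ingestion function: writes each record via datastore.write."""
    written = 0
    for record in records:
        datastore.write(record)
        written += 1
    return written


def test_ingest_writes_every_record():
    datastore = mock.Mock()  # stands in for the real datastore client
    count = ingest([{"id": 1}, {"id": 2}], datastore)
    assert count == 2
    datastore.write.assert_has_calls([mock.call({"id": 1}), mock.call({"id": 2})])


def test_ingest_propagates_datastore_failure():
    datastore = mock.Mock()
    datastore.write.side_effect = ConnectionError("datastore down")
    try:
        ingest([{"id": 1}], datastore)
    except ConnectionError:
        pass  # expected: the failure reaches the caller
    else:
        raise AssertionError("expected ConnectionError")
```

The second test uses side_effect to simulate a failing datastore, a case that is awkward to trigger against a live system.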


Page 19: Designing a Real Time Data Ingestion Pipeline

More Testing


Page 20: Designing a Real Time Data Ingestion Pipeline

More Testing


Page 21: Designing a Real Time Data Ingestion Pipeline

Test & Refactor Cycle


Page 22: Designing a Real Time Data Ingestion Pipeline

Test & Refactor Cycle


Page 23: Designing a Real Time Data Ingestion Pipeline


Page 24: Designing a Real Time Data Ingestion Pipeline

Questions?


Page 25: Designing a Real Time Data Ingestion Pipeline

Thank you.

Page 26: Designing a Real Time Data Ingestion Pipeline

Development


Page 27: Designing a Real Time Data Ingestion Pipeline

Operations & Monitoring


Page 28: Designing a Real Time Data Ingestion Pipeline

Operations & Monitoring


Page 29: Designing a Real Time Data Ingestion Pipeline

Batch Data Loading
