Traveloka's Data Journey — Traveloka Data Meetup #2

Stories and lessons learned on building a scalable data pipeline at Traveloka.

Uploaded by traveloka, 23-Jan-2018


Category: Data & Analytics



TRANSCRIPT

Page 1

Traveloka's Data Journey

Stories and lessons learned on building a scalable data pipeline at Traveloka.

Page 2

Very Early Days...

Page 3

Very Early Days

Components:
- Applications & Services
- Summarizer
- Internal Dashboard
- Report Scripts + Crontab
- Stores: Raw Activity, Key Value, Time Series
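At this stage, reporting was just scripts on a scheduler. As a rough illustration, a crontab-driven summarizer might have looked something like the sketch below; the MongoDB store, collection names, and schema are all assumptions (the later MongoDB-sharding slide only hints at the stack).

    # summarize_daily.py -- hypothetical sketch of a cron-driven summarizer.
    # Rolls yesterday's raw activity up into a time-series summary collection.
    import datetime
    import pymongo  # assumes the early stores were MongoDB collections

    db = pymongo.MongoClient("mongodb://localhost:27017")["tracking"]
    yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

    # Count raw activity events per event type for the day.
    groups = db.raw_activity.aggregate([
        {"$match": {"date": yesterday}},
        {"$group": {"_id": "$event_type", "count": {"$sum": 1}}},
    ])
    for row in groups:
        db.time_series.update_one(
            {"date": yesterday, "event_type": row["_id"]},
            {"$set": {"count": row["count"]}},
            upsert=True,
        )

    # crontab entry, running the report at 01:00 every day:
    # 0 1 * * * /usr/bin/python /opt/jobs/summarize_daily.py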

Page 4

Full... Split & Shard!
Raw, KV, and Time Series DBs

Components:
- Applications & Services
- Internal Dashboard
- Report Scripts + Crontab
- Raw Activity (Sharded)
- Key Value DB (Sharded)
- Time Series Summary
- Summarizer

Lessons Learned:
1. UNIX principle: "Do One Thing and Do It Well"
2. Split use cases based on SLA & query pattern
3. Scalable tech based on growth estimation
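Splitting and sharding like this usually means application-level routing: a stable hash of the entity key picks the shard, so each store scales horizontally. A tiny sketch with made-up shard URIs and key names:

    # Hypothetical application-level shard routing by user id.
    import hashlib

    SHARDS = ["mongodb://shard-0", "mongodb://shard-1", "mongodb://shard-2"]

    def shard_for(user_id: str) -> str:
        # A stable hash keeps the user -> shard mapping consistent
        # across processes and restarts.
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("user-42"))  # the same user always lands on the same shard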

Page 5

Throughput? Kafka comes to the rescue

Components:
- Applications & Services
- Kafka as Datahub
- Raw data consumer (insert / update)
- Raw Activity (Sharded)
- Key Value (Sharded)

Lessons Learned:
1. Use something that can handle higher throughput for cases with high write volume, like tracking
2. Decouple publish and consume
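Decoupling publish and consume means applications only fire events at Kafka, while a separate consumer drains the topic at its own pace and writes the sharded stores. A minimal sketch with kafka-python (topic, group, and broker names are illustrative):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Publisher side: the application sends the event and moves on.
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("raw-activity", {"event": "search", "user": "user-42"})
    producer.flush()

    # Consumer side: a separate process keeps up at its own pace and
    # does the inserts/updates into the sharded stores.
    consumer = KafkaConsumer(
        "raw-activity",
        bootstrap_servers="kafka:9092",
        group_id="raw-data-consumer",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        handle(message.value)  # hypothetical downstream insert/update helper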

Page 6

We need a Data Warehouse and a BI tool, and we need it fast!

Components:
- Raw Activity (Sharded)
- Other sources
- Python ETL (temporary solution)
- Star Schema DW on Postgres
- Periscope BI Tool

Lessons Learned:
1. Think about the DW from the beginning of the data pipeline
2. BI tools: do not reinvent the wheel
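A "temporary" Python ETL into a star schema typically upserts dimension rows and appends facts. A minimal sketch against Postgres; the table names and the extractor are hypothetical:

    import psycopg2

    conn = psycopg2.connect("dbname=dw host=postgres-dw")
    cur = conn.cursor()

    for event in extract_raw_activity():  # hypothetical extractor over the raw store
        # Upsert the dimension row, then append the fact referencing it.
        cur.execute(
            """INSERT INTO dim_user (user_id) VALUES (%s)
               ON CONFLICT (user_id) DO NOTHING""",
            (event["user_id"],),
        )
        cur.execute(
            """INSERT INTO fact_activity (user_id, event_type, occurred_at)
               VALUES (%s, %s, %s)""",
            (event["user_id"], event["type"], event["timestamp"]),
        )
    conn.commit()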

Page 7: Traveloka's data journey — Traveloka data meetup #2

“Have” to

adopt big data

Stories and lessons learned on building a scalable data pipeline at Traveloka.

Page 8

Postgres couldn't handle the load!

Components:
- Raw Activity (Sharded)
- Other sources
- Python ETL (temporary solution)
- Star Schema DW on Redshift
- Periscope BI Tool

Lesson Learned:
1. Choose the specific tech that best fits the use case
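One reason Redshift fits this use case: it speaks the Postgres wire protocol, so the ETL can keep using psycopg2, and bulk loads become COPY statements from S3. A sketch in which the cluster host, S3 path, and IAM role are made up:

    import psycopg2

    conn = psycopg2.connect(
        "dbname=dw host=redshift-cluster.example.com port=5439 user=etl")
    cur = conn.cursor()
    # Redshift bulk-loads directly from S3 instead of row-by-row inserts.
    cur.execute("""
        COPY fact_activity
        FROM 's3://raw-activity/2018/01/23/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        FORMAT AS JSON 'auto';
    """)
    conn.commit()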

Page 9

Scaling out MongoDB every so often is not manageable...

Components:
- Kafka as Datahub
- Gobblin as Consumer
- Raw Activity on S3

Lesson Learned:
1. MongoDB sharding: scalability needs to be tested!
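Gobblin is the production answer here; purely to show the shape of the Kafka-to-S3 hop it automates, a toy Python equivalent (bucket, topic, and batch size are made up):

    import boto3
    from kafka import KafkaConsumer

    s3 = boto3.client("s3")
    consumer = KafkaConsumer("raw-activity",
                             bootstrap_servers="kafka:9092",
                             group_id="s3-sink")
    batch = []
    for message in consumer:
        batch.append(message.value.decode("utf-8"))
        if len(batch) >= 10_000:  # flush fixed-size batches as S3 objects
            key = "raw-activity/offset=%d.json" % message.offset
            s3.put_object(Bucket="data-lake", Key=key, Body="\n".join(batch))
            batch = []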

Page 10

"Have" to adopt big data

Components:
- Kafka as Datahub
- Gobblin as Consumer
- Raw Activity on S3
- Processing on Spark
- Star Schema DW on Redshift

Lessons Learned:
1. Processing has to be easy to scale
2. Scale processing separately for day-to-day jobs and backfill jobs
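Scaling day-to-day and backfill jobs separately is straightforward when the job itself is parameterized by date: the daily run processes one partition, while a backfill submits many dates on its own cluster. A PySpark sketch with illustrative paths and schema:

    import sys
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("raw-to-dw").getOrCreate()
    date = sys.argv[1]  # e.g. "2018/01/23"; a backfill submits many dates

    raw = spark.read.json("s3a://data-lake/raw-activity/%s/" % date)
    daily = (raw.groupBy("user_id", "event_type")
                .agg(F.count("*").alias("events")))
    # Stage the result on S3; a separate loader COPYs it into Redshift.
    daily.write.mode("overwrite").parquet("s3a://dw-staging/daily/%s/" % date)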

Page 11

Near real time on big data is challenging

Components:
- Kafka as Datahub
- MemSQL for Near Real Time DB

Lesson Learned:
1. Dig into requirements until they are very specific; for data this relates to: 1) latency SLA, 2) query pattern, 3) accuracy, 4) processing requirements, 5) tools integration
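MemSQL speaks the MySQL wire protocol, so a near-real-time writer can be an ordinary Kafka consumer doing upserts. A sketch in which the table and columns are made up:

    import json
    import pymysql
    from kafka import KafkaConsumer

    db = pymysql.connect(host="memsql", user="etl", database="nrt")
    consumer = KafkaConsumer("raw-activity",
                             bootstrap_servers="kafka:9092",
                             group_id="nrt-writer")
    for message in consumer:
        event = json.loads(message.value)
        with db.cursor() as cur:
            # Keep a continuously updated aggregate for dashboards.
            cur.execute(
                """INSERT INTO booking_counts (route, bookings) VALUES (%s, 1)
                   ON DUPLICATE KEY UPDATE bookings = bookings + 1""",
                (event["route"],),
            )
        db.commit()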

Page 12

No OPS!!!

Page 13

Open your mind to any combination of tech!

Components:
- Pub/Sub as Datahub
- Dataflow for Stream Processing
- Key Value on DynamoDB

Lessons Learned:
1. A combination of cloud providers is possible, but be careful of latency concerns
2. During a research project, always prepare plans B & C, plus a proper buffer in the timeline
3. Autoscale!
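The cross-cloud combination here is a streaming Beam pipeline on Dataflow (GCP) whose sink is DynamoDB (AWS). A minimal sketch against the Beam Python SDK; the topic, table, and fields are made up, and the cross-cloud hop is exactly where the latency caveat bites:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteToDynamo(beam.DoFn):
        def start_bundle(self):
            import boto3  # AWS client created on the worker, not pickled
            self.table = boto3.resource("dynamodb").Table("user-profile")

        def process(self, element):
            event = json.loads(element)
            self.table.put_item(Item={"user_id": event["user_id"],
                                      "last_event": event["type"]})

    options = PipelineOptions(streaming=True, runner="DataflowRunner")
    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(topic="projects/my-proj/topics/raw-activity")
         | beam.ParDo(WriteToDynamo()))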

Page 14

More autoscale!

Components:
- Pub/Sub as Datahub
- BigQuery for Near Real Time DB

Lesson Learned:
1. Autoscale means cost monitoring

Caveat: autoscale != everything solved, e.g. Pub/Sub's default quota is 200 MB/s (it can be increased, but only via a manual request)
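BigQuery works as the near-real-time DB because streaming inserts make rows queryable within seconds. A sketch with a recent google-cloud-bigquery client; the dataset, table, and row are illustrative:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Streaming insert: the row is queryable within seconds, no load job.
    errors = client.insert_rows_json("analytics.raw_activity", [
        {"user_id": "user-42", "event_type": "search",
         "ts": "2018-01-23T10:00:00Z"},
    ])
    if errors:
        raise RuntimeError("streaming insert failed: %s" % errors)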

Page 15

More autoscale!

Components:
- Kafka as Datahub
- Gobblin as Consumer
- Raw Activity on S3
- Processing on Spark
- Hive & Presto on Qubole as Query Engine
- BI & Exploration Tools

Lessons Learned:
1. Make scalability as granular as possible; in this case, separate compute and storage scalability
2. Separate BI with a well-defined SLA from exploration use cases
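With compute and storage split, the data stays on S3 and analysts hit a Presto endpoint that Qubole scales independently of it. A minimal client sketch using PyHive; the endpoint and table are made up:

    from pyhive import presto

    conn = presto.connect(host="presto.qubole.example.com", port=8081)
    cur = conn.cursor()
    cur.execute("""
        SELECT event_type, count(*) AS events
        FROM raw_activity
        WHERE dt = '2018-01-23'
        GROUP BY event_type
    """)
    for row in cur.fetchall():
        print(row)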

Page 16

WRAP UP

Page 17

Page 18

Architecture overview (diagram):
- Consumers of data: streaming and batch
- Sources: Traveloka App (Android, iOS) and Traveloka Services
- Kafka → ETL → Data Warehouse; S3 Data Lake with batch ingest; NoSQL DB; Hive & Presto for query
- GCP path: ingest via Cloud Pub/Sub, storage on Cloud Storage, pipelines on Cloud Dataflow, analytics on BigQuery, plus Monitoring & Logging
- DOMO Analytics UI

Page 19

Key Lessons Learned

● Scalability in mind -- especially disk full... :)
● Scalability as granular as possible -- compute, storage
● Scalability needs to be tested (of course!)
● Do one thing, and do it well; dig into your requirements -- SLA, query pattern
● Decouple publish and consume -- publisher availability is very important!
● Choose tech that is specific to the use case
● Careful of gotchas! There's no silver bullet...

Page 20

THE FUTURE

Page 21

Future Roadmap

● In the past, we would see a problem or need, find a technology that could solve it, and plug it into the existing pipeline.
● It worked well.
● But after some time, we needed to maintain a lot of different components.
● Multiple clusters:
○ Kafka
○ Spark
○ Hive/Presto
○ Redshift
○ etc.
● Multiple data entry points for analysts:
○ BigQuery
○ Hive/Presto
○ Redshift

Page 22

Our Goal

● Simplify our data architecture.
● A single data entry point for data analysts/scientists, for both streaming and batch data.
● Without compromising what we can do now.
● Reliability, speed, and scale.
● Little or no ops.
● Make migration as simple and easy as possible.

Page 23

How will we achieve this?

● There are a few options that we are considering right now.
● Some of them introduce new technologies/components.
● Some of them make use of our existing technology to its maximum potential.
● We are trying exciting, relatively new technologies:
○ Google BigQuery
○ Google Dataprep on Dataflow
○ AWS Athena
○ AWS Redshift Spectrum
○ etc.

Page 24

Plan to Simplify

Planned architecture (diagram):
- Cloud Pub/Sub → Cloud Dataflow → BigQuery and Cloud Storage
- Kubernetes Cluster Collector
- Managed services
- BI & Analytics UI
- BigTable
- REST API
- ML Models
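One way to read this plan: a single streaming Dataflow pipeline becomes the entry point, with BigQuery as the query surface. Entirely a sketch under those assumptions; the project, topic, and table are made up, and the table is assumed to already exist:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True, runner="DataflowRunner",
                              project="my-proj",
                              temp_location="gs://my-tmp-bucket/bq")
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
               topic="projects/my-proj/topics/events")
         | "Parse" >> beam.Map(lambda b: {"payload": b.decode("utf-8")})
         # Assumes analytics.events already exists with a matching schema.
         | "ToBigQuery" >> beam.io.WriteToBigQuery(
               "my-proj:analytics.events",
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
    # The raw archive on Cloud Storage could be a second branch of the same
    # pipeline (or a scheduled BigQuery export); omitted to keep the sketch small.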

Page 25

Plan to Simplify

● Seems promising, but...
● It needs to be tested.
● Does it cover all the use cases we need?
● Query migration?
● Costs?
● Maintainability?
● Potential problems?

Page 26

See You at the Next Event!

Thank You