
Data: How Does It Work?

An overview of the Castle Global Analytics Pipeline (CGAP™)

1. General overview

2. Our current solution

3. Numbers and comparisons

Exhibit A: Generic Data Pipeline
Raw data aggregated from multiple sources → extract, transform, and load (ETL) → persist ingestible data to warehouse → analyze ingested data for insights
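Purely as an illustration, the four stages above can be read as four function calls. Everything in the sketch below (function names, the in-memory "warehouse", the toy events) is made up for this example and is not part of CGAP.

    # Illustrative sketch only: the four stages of Exhibit A as plain Python.
    from collections import Counter

    warehouse = []  # stand-in for the real data warehouse

    def aggregate_from_sources():
        # Raw data aggregated from multiple sources (e.g. client and server events).
        return [{"source": "client", "event": "app_open"},
                {"source": "server", "event": "push_sent"}]

    def etl(raw_events):
        # Extract, transform, and load: normalize records into an ingestible form.
        return [{**event, "ingested": True} for event in raw_events]

    def persist(records):
        # Persist ingestible data to the warehouse.
        warehouse.extend(records)

    def analyze():
        # Analyze ingested data for insights (here, just a count per source).
        return Counter(record["source"] for record in warehouse)

    persist(etl(aggregate_from_sources()))
    print(analyze())  # Counter({'client': 1, 'server': 1})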


CGAP™ - How We Obtain Raw Data

● Two main types of real-time events: client and server
● Challenges: client call site consistency, batching, QA
● Multiple persistent databases
● Real-time events temporarily stored in Kafka (revisited below)
● Databases always up to date, copied over at intervals

Kafka Diagram
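The Kafka diagram itself is not reproduced in this transcript. As a rough illustration of the ingestion step, here is a minimal sketch of how a server-side event might be published to a Kafka topic; the broker address, topic name, event fields, and the kafka-python client are assumptions for the example, not CGAP's actual setup.

    # Minimal sketch, not CGAP's actual producer. Assumes the kafka-python package
    # and hypothetical broker/topic/field names.
    import json
    import time

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["kafka-broker:9092"],  # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    event = {
        "event_type": "server.push_sent",         # hypothetical event name
        "user_id": 12345,
        "timestamp_ms": int(time.time() * 1000),
    }

    # Events sit in the topic until the batch ETL job reads them out.
    producer.send("server_events", value=event)
    producer.flush()

Client events would follow the same path, typically batched at the call site before being sent, which is where the consistency, batching, and QA challenges mentioned above come in.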


The ETL Job (Spark)

● Multiple batch jobs, one per Kafka topic, run at intervals
● Concurrently read out of multiple Kafka partitions
● Save offset state per job
● Parse, flatten, and write JSON (rough sketch after the diagrams)
● Takes care of compression and uploading to cloud storage
● Extra batch load jobs run for persistent relational DBs

Spark Diagram

Spark Diagram (cont.)
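As a rough sketch of one such batch job, assuming Spark's batch Kafka source and hypothetical topic, schema, and S3 path, the read/parse/flatten/write steps might look like this (offset state handling is reduced to a placeholder):

    # Minimal PySpark sketch, not the actual CGAP job. Topic, schema, and paths are
    # made up; in the real job, starting offsets come from saved per-job state.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("etl-client-events").getOrCreate()

    # Read one batch out of the Kafka topic, across all of its partitions.
    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "kafka-broker:9092")
           .option("subscribe", "client_events")
           .option("startingOffsets", "earliest")   # placeholder for saved offsets
           .option("endingOffsets", "latest")
           .load())

    # Parse the JSON payload and flatten the nested fields into columns.
    schema = StructType([
        StructField("event_type", StringType()),
        StructField("user_id", LongType()),
        StructField("timestamp_ms", LongType()),
    ])
    events = (raw
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.event_type", "e.user_id", "e.timestamp_ms"))

    # Write compressed JSON to cloud storage for the warehouse load step.
    (events.write.mode("append")
     .option("compression", "gzip")
     .json("s3a://example-bucket/etl/client_events/"))

The extra batch load jobs for the persistent relational DBs would presumably follow the same write path, reading from the databases at intervals instead of from Kafka.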


Data Warehousing (AWS Redshift)

● Load data from AWS S3 into Redshift (COPY sketch below)
● Two clusters, Kiwi/Plaza, 10 nodes total
● Scaling/upgrading requires some write downtime
● Support for full SQL joins; based on PostgreSQL 8.0
● Specialized proprietary hardware/software layers
● Scales up to petabytes
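The S3-to-Redshift load is the kind of step a COPY command handles. A minimal sketch follows, assuming a psycopg2 connection and hypothetical cluster, table, bucket, and IAM role names; none of these are CGAP's actual configuration.

    # Minimal sketch, not CGAP's actual loader. Connection details, table name,
    # bucket path, and IAM role are hypothetical placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="kiwi-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="loader",
        password="...",  # placeholder
    )

    # COPY pulls the gzip'd JSON the ETL job wrote to S3 into a Redshift table.
    copy_sql = """
        COPY client_events
        FROM 's3://example-bucket/etl/client_events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        FORMAT AS JSON 'auto'
        GZIP;
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)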


Business Intelligence

● Primarily using Mode as the UI to Redshift (so, so cheap)
● Allows for some visualizations from queries

Our Numbers

● Kiwi: half a billion events ingested daily
  ○ 200MM from clients, 280MM from server (170MM pushes)
  ○ Uncompressed size of JSON: 500GB/day
  ○ Compressed size, Spark output: 8GB/day
  ○ Compressed and indexed size, Redshift: 15GB/day
● Plaza: 14MM events per day, mostly server
  ○ Load speed: 4 hours of data in 53s (Plaza) vs. 8 minutes (Kiwi)

Our Numbers (cont.)

● Currently storing 1.2TB of compressed Redshift data
  ○ Roughly 2 months of full Kiwi data (14 billion rows)
  ○ Beyond 2 months, archive and coalesce to save space
● At Plaza's rate, can store over two years on just 256GB
● Costs:
  ○ S3 storage long-term: $5,000/year
  ○ Redshift, current status: $16,000/year
  ○ Machines for batch load: 1 MBP (MacBook Pro)
  ○ Mode: $40/month

Thanks to...

● The Apache Foundation for making open source systems
● AWS for providing us their economies of scale
● You, for listening
