Exstreamly Cheap - Insight Data Engineering 2016a project
Post on 14-Apr-2017
…where the best deals find you in real time.
Emmanuel Awa
For the love of deals, we all just love it.
Real world engineering challenge.
MOTIVATION
ONE platform: searches and shopping inspired by the user’s preferences.
MOTIVATION
Sqoot API. Scaled to all categories offered by the API.
Sample Data
User Interaction – Engineered 1B users
Current Data Source
Any trending deals?
Top selling providers
Categorize deals based on price and discount percentages.
Friends purchase pattern
Sample Queries.
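One of the sample queries above groups deals by price and discount percentage. A minimal sketch of such a grouping follows. The function name, the band labels, and every threshold are hypothetical placeholders chosen for illustration; the slides do not state the actual cut-offs.

```python
def categorize_deal(price, discount_percentage,
                    price_bounds=(25.0, 100.0),
                    discount_bounds=(20.0, 50.0)):
    """Return a (price_band, discount_band) pair for one deal.

    All bounds and labels are hypothetical; tune them to the data.
    """
    def band(value, bounds, labels):
        low, high = bounds
        if value < low:
            return labels[0]
        if value < high:
            return labels[1]
        return labels[2]

    return (
        band(price, price_bounds, ("cheap", "mid", "premium")),
        band(discount_percentage, discount_bounds, ("small", "medium", "large")),
    )


# A 19.99 deal at 60% off falls in the lowest price band and the top discount band.
example = categorize_deal(19.99, 60.0)
```

The returned pair could serve as a grouping key when aggregating deals in the batch layer.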
Complex queries? Real time response?
Sample Queries.
Current Pipeline
API
INGESTION
BATCH LAYER
SERVING LAYER
Hybrid Streaming
API Interaction and deals collection
API DESIGN Bad or Good?
Biggest Engineering Challenges
Pagination limits and constant API updates.
http://api.sqoot.com/v2/deals?api_key=xxxxxx;category_slug=home_goods;page=1;per_page=100
Freezing time for a real-time, non-fire-hose data source is hard.
Data Source Constraints
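The paginated request shown above has a regular shape, so page URLs can be generated rather than written by hand. A minimal sketch, assuming the parameter names and the `;` separator exactly as they appear in that URL; the function name `build_deals_url` is made up for this example, and values are not URL-encoded.

```python
def build_deals_url(api_key, category_slug, page, per_page=100,
                    base="http://api.sqoot.com/v2/deals"):
    """Build one page request in the same shape as the URL on the slide."""
    params = [
        ("api_key", api_key),
        ("category_slug", category_slug),
        ("page", page),
        ("per_page", per_page),
    ]
    # The slide joins parameters with ';' rather than '&', so mirror that here.
    query = ";".join(f"{key}={value}" for key, value in params)
    return f"{base}?{query}"


# Reproduces the request from the slide (the API key stays masked).
first_page = build_deals_url("xxxxxx", "home_goods", page=1)
```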
Biggest Project Challenge
Three queries issued at the same time return inconsistent results. Not fun. Pagination depends largely on the total.
New Page refresh New
ASYNC DISTRIBUTED QUERYING ENGINE
First Stage Master Producer (FSM)
Intermediate Hybrid Consumer-Producer
Final Stage Consumer
Design to solve this?
Architecture
FIRST STAGE MASTER
Compute page chunks.
Leaky bucket approach.
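The two ideas on this slide can be sketched together: split the page range into chunks, then hand the chunks out at a steady rate. This is one common reading of a leaky bucket (a constant drain rate), not necessarily what the project implemented. The names `compute_page_chunks`, `LeakyBucket`, and `release_chunks`, the chunk size, and the sample total of 2350 deals are all assumptions.

```python
import math
import time


def compute_page_chunks(total_deals, per_page=100, chunk_size=10):
    """Split the pages covering total_deals into (first_page, last_page) ranges."""
    total_pages = math.ceil(total_deals / per_page)
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


class LeakyBucket:
    """Let one item through at most every `interval` seconds."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.next_release = clock()

    def wait(self):
        """Block until the next item is allowed to leave the bucket."""
        delay = self.next_release - self.clock()
        if delay > 0:
            self.sleep(delay)
        self.next_release = max(self.clock(), self.next_release) + self.interval


def release_chunks(chunks, bucket):
    """Yield chunks no faster than the bucket's drain rate."""
    for chunk in chunks:
        bucket.wait()
        yield chunk


# 2350 deals at 100 per page need 24 pages, grouped here into chunks of 10 pages.
chunks = compute_page_chunks(total_deals=2350, per_page=100, chunk_size=10)
```

In use, a master process would iterate `release_chunks(chunks, LeakyBucket(interval=1.0))` and publish each chunk downstream, which keeps the request rate bounded regardless of how many pages exist.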
FIRST STAGE MASTER Cont’d
HYBRID CONSUMER-PRODUCER
Fetch and produce actual data.
FINAL STAGE CONSUMER
Persist data - HDFS
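The three stages above form a chain: the master produces page numbers, the hybrid stage consumes them and produces deals, and the final stage consumes deals and persists them. A minimal in-process sketch of that shape follows. It uses `queue.Queue` as a stand-in for the message topics and a plain list as a stand-in for HDFS, and `fake_fetch_page` is a hypothetical fetcher that returns made-up records instead of calling the API.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of a stream


def first_stage_master(page_numbers, pages_topic):
    """Produce the page numbers that need to be fetched."""
    for page in page_numbers:
        pages_topic.put(page)
    pages_topic.put(STOP)


def hybrid_consumer_producer(pages_topic, deals_topic, fetch_page):
    """Consume page numbers, fetch each page, and produce the deals found."""
    while True:
        page = pages_topic.get()
        if page is STOP:
            deals_topic.put(STOP)  # pass the end-of-stream marker downstream
            break
        for deal in fetch_page(page):
            deals_topic.put(deal)


def final_stage_consumer(deals_topic, sink):
    """Consume deals and persist them; `sink` stands in for HDFS."""
    while True:
        deal = deals_topic.get()
        if deal is STOP:
            break
        sink.append(deal)


def run_pipeline(page_numbers, fetch_page):
    """Run one worker per stage and return everything that was persisted."""
    pages_topic, deals_topic, sink = queue.Queue(), queue.Queue(), []
    stages = [
        threading.Thread(target=first_stage_master,
                         args=(page_numbers, pages_topic)),
        threading.Thread(target=hybrid_consumer_producer,
                         args=(pages_topic, deals_topic, fetch_page)),
        threading.Thread(target=final_stage_consumer,
                         args=(deals_topic, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink


def fake_fetch_page(page):
    """Hypothetical fetcher: two made-up deals per page."""
    return [{"page": page, "id": page * 100 + i} for i in range(2)]


stored = run_pipeline(range(1, 4), fake_fetch_page)
```

With one worker per stage the order of deals is preserved. A distributed version with several hybrid workers would need one end-of-stream marker per worker, or no marker at all for a stream that never ends.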
Nigerian. Master’s in Computer Science – Brandeis University, MA.
Software Engineer, 2 ½ years. Hobbyist photographer.
About Me.
PyKafka vs. Kafka-Python. Balanced consumer. Topic to partition assignment – Hash partitioning.
Engineering architecture to handle complex real world data source.
Deep dive. Tweak source code for use case.
DevOps
General learning curves.
Other Challenges
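Hash partitioning, listed among the challenges above, maps each message key to a fixed partition so that all messages with the same key land together. A minimal sketch of the idea, independent of either client library: `partition_for` is a made-up helper, CRC32 is an illustrative hash choice, and the slugs other than `home_goods` are hypothetical. The Kafka clients ship their own partitioners, which may hash differently.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition index using a stable hash."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # zlib.crc32 gives the same value on every run, unlike the built-in
    # hash() for strings, which is randomized per process by default.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# "home_goods" comes from the request URL; the other slugs are hypothetical.
slugs = ["home_goods", "hypothetical_slug_a", "hypothetical_slug_b"]
assignments = {slug: partition_for(slug, 4) for slug in slugs}
```

Because the mapping depends only on the key and the partition count, changing the number of partitions reshuffles keys, which is worth keeping in mind when a topic is resized.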
CREATE TABLE trending_categories_with_price (
    category text, created_at timestamp, updated_at timestamp, expires_at timestamp,
    description text, fine_print text, price float, discount_percentage float, id bigint,
    merchant_address text, merchant_country text, merchant_id bigint,
    merchant_latitude text, merchant_longitude text, merchant_locality text,
    merchant_name text, merchant_phone_number text, merchant_region text,
    number_sold float, online boolean, provider_name text, title text, url text,
    PRIMARY KEY ((price), category, discount_percentage)
) WITH CLUSTERING ORDER BY (category ASC, discount_percentage DESC);
Sample tables
Elasticsearch or Cassandra or Elasticsearch on Cassandra
Elasticsearch – Good at preserving indexed data. Great for read-heavy workloads (more reads than writes). Analytics. Search.
Cassandra – Good for fast writes. Preserves data schema. Uptime-critical workloads. Time series.
Elasticsearch vs. Cassandra
Benchmarking Pipeline
API
INGESTION
BATCH LAYER
SERVING LAYER
Hybrid Streaming
API Interaction and deals collection