Post on 11-Apr-2017
Real-time big data analytics based on product recommendations case study
IT Business Solutions B2B Conference October 2015
© deep.bi
We started as an ad network
The challenge was to recommend the best product (out of millions)
to the right person in a given moment (thousands of users within a second)
5 billion ad views delivered in 24 months
To put that scale in context:
If we served 1 ad per second, it would take
160 years to serve 5 billion ads
So we needed a solution
SQL databases did not work
Popular NoSQL databases did not work
Standard data warehouse approaches (pre-aggregations, creating schemas) - did not work
Re-thinking all the problems with huge data streams flowing to us every second
we have built a complete solution based on open-source technologies
and fresh, smart ideas from our engineering team
It is called deep.bi and now we make it available to other companies
DEEP.BI = BIG DATA FAST DATA SOLUTION
high velocity high volume
deep.bi lets high-growth companies solve fast data problems by providing
scalable, flexible and real-time data collection, enrichment and analytics
deep.bi – complete data processing flow
collect: unstructured, raw data from many sources (page views, IoT events, IP, URL, cookie, transactions, call detail records, etc.)
enrich: data enrichment, transformation and integration
analyze: find patterns, build models, predict behavior
How to predict the best offer based on online data – case study.
Collect website, campaigns and CRM data
Website: Google Analytics
Campaigns: Agency reports
Apps: Dedicated monitoring tools
Other systems: Call center IVR, emails
Instead of integrating the current reporting tools, we need to gather all the individual events that our customers generate.
Data is stored in silos, and reporting tools provide aggregated reports that are impossible to integrate around a single customer.
Collecting raw web data is not enough
2015-05-15T00:26:41.328Z,3,D,[ip_hidden],i1xszg0f-19hqrje,"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36","[url_hidden]",7279848891,@906,"https://www.google.pl/",vuser-history-allegro-1-hc20150509.1,"122_100003_Park@700:html_620x100_single_banner:See offer"
IP, URL, cookie, user-agent, timestamp
* Coming soon
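The raw line above is essentially a CSV record with quoted fields. A minimal parsing sketch, assuming a simplified field layout (not the exact deep.bi schema):

```python
import csv
import io

# A simplified raw event line (values are placeholders, not real data)
raw = ('2015-05-15T00:26:41.328Z,3,D,[ip_hidden],i1xszg0f-19hqrje,'
      '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/42.0.2311.152 Safari/537.36","[url_hidden]","https://www.google.pl/"')

# csv handles the commas inside the quoted user-agent field correctly
fields = next(csv.reader(io.StringIO(raw)))
timestamp, _, _, ip, cookie, user_agent, url, referrer = fields
print(timestamp, cookie, url)
```

Each parsed field (IP, URL, cookie, user-agent, timestamp) then becomes the input to the enrichment steps described next.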
Enrich raw web and mobile data
50+ attributes
from one interaction
Purchase intent
Device
Time
Location
ISP
Online context
Weather*
Demographics
We can learn quite a few things from the user's IP
Example use: • international travellers • townspeople • people in mountains • rainy day
• Country • Region • City • ZIP Code • Population • Latitude & Longitude • Time zone • IDD prefix to call the city from another country • Phone area code • Mobile Country Code (MCC) • Mobile Network Code (MNC) • Elevation • Weather at the moment of the event
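A toy sketch of IP-based enrichment; the lookup table below is an illustrative stub standing in for a real geo database (e.g. MaxMind), and all values in it are made up:

```python
# Toy IP-to-geo lookup; a real system would query a geo database.
# The table below is an illustrative stub, not real data.
GEO_DB = {
    "203.0.113": {"country": "PL", "city": "Warsaw", "timezone": "CET",
                  "population": 1_700_000, "elevation_m": 100},
}

def enrich_ip(ip):
    """Look up geo attributes by the /24 prefix of an IPv4 address."""
    prefix = ".".join(ip.split(".")[:3])
    return GEO_DB.get(prefix, {})

event = {"ip": "203.0.113.42"}
event.update(enrich_ip(event["ip"]))
print(event["city"])  # -> Warsaw
```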
The ISP tells us more than we might expect
Example use: • competitors' users -> acquisition • our users -> retention/up-selling/cross-selling • people from a particular company or company type
• ISP name or Organization name • Organization type:
• Commercial • Organization • Government • Military • University/College/School • Library • Content Delivery Network • Fixed Line ISP • Mobile ISP • Data Center/Web Hosting/Transit • Search Engine Spider • Reserved
• Mobile brand • Net speed
Detailed information about user device
Example use: • smartphone users • Apple users • Samsung Galaxy users • Google browser users
• Device Type • Device Brand • Device Model • Device Operating System • Operating System Producer • Browser • Browser Producer
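A rough sketch of extracting device attributes from a user-agent string. Real systems rely on maintained user-agent databases; the regexes here are simplistic assumptions for illustration:

```python
import re

def parse_user_agent(ua):
    """Very rough device/OS/browser extraction from a user-agent string."""
    info = {"device_type": "desktop", "os": "unknown", "browser": "unknown"}
    if "Android" in ua or "iPhone" in ua:
        info["device_type"] = "mobile"
    if m := re.search(r"Android [\d.]+|Windows NT [\d.]+|Mac OS X [\d_]+", ua):
        info["os"] = m.group(0)
    if m := re.search(r"(Chrome|Firefox|Safari)/[\d.]+", ua):
        info["browser"] = m.group(1)
    return info

ua = ("Mozilla/5.0 (Linux; Android 4.2.2; GT-S7580) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36")
print(parse_user_agent(ua))
```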
Besides user features, track user behavior too.
Deeper understanding of people’s behavior: • RFM Segmentation (Recency, Frequency, Monetary) • Shopping cart analysis • Purchase sequence analysis
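The RFM idea can be sketched in a few lines; the score thresholds below are illustrative assumptions, not deep.bi's actual segmentation:

```python
from datetime import date

def rfm_score(last_purchase, n_purchases, total_spend, today):
    """Score a user 1-3 on Recency, Frequency and Monetary value.
    Thresholds are illustrative, not production values."""
    recency_days = (today - last_purchase).days
    r = 3 if recency_days <= 30 else 2 if recency_days <= 90 else 1
    f = 3 if n_purchases >= 10 else 2 if n_purchases >= 3 else 1
    m = 3 if total_spend >= 500 else 2 if total_spend >= 100 else 1
    return r, f, m

score = rfm_score(date(2015, 9, 20), 5, 240.0, today=date(2015, 10, 1))
print(score)  # -> (3, 2, 2): recent, moderately frequent, mid spend
```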
User behavior and characteristics help predict the next best action/offer
What product should we recommend?
How could this purchase path end?
So, how to build tailored recommendations? Pick an algorithm that is suitable for the problem
Product [ feature_1, feature_2, …, feature_N]
User [ feature_1, feature_2, …, feature_N]
User [ product_1, product_2, …, product_N]
Simple rules: if a user has certain features, serve this group of products
Manual segment creation: analysts find segments of users and match them with product segments
Simple feature matching: get a weighted user feature vector and match it with product feature vectors
Manual / people managed rules
Find segments automatically (e.g. k-means) Product features based recommendations
User features based recommendations
Combined product- and user-based recommendations (collaborative filtering, deep learning)
Machine learning-supported recommendations
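The "simple feature matching" option above can be sketched as cosine similarity between a weighted user vector and product feature vectors. The feature names and weights below are made up for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Weighted user feature vector vs. product feature vectors
# (features and weights are illustrative assumptions)
user = [0.9, 0.1, 0.7]   # e.g. smartphone interest, gaming, premium brands
products = {
    "galaxy_s6": [1.0, 0.0, 0.8],
    "game_pad":  [0.1, 1.0, 0.2],
    "smart_tv":  [0.3, 0.4, 0.6],
}

best = max(products, key=lambda p: cosine(user, products[p]))
print(best)  # -> galaxy_s6
```

Collaborative filtering replaces these hand-made vectors with ones learned from user-product interaction history.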
Recommendations long tail phenomenon
[Chart: product popularity vs. products; the most interesting recommendations lie in the long tail]
Technology behind Deep BI
Complex data model for query optimization
split dimensions into several tables based on the reports to be made
cherry-pick, in advance, the dimensions we can aggregate, based on cardinality
indexing every dimension column is a must
Impossible to add high-cardinality dimensions
no way to analyze per user (millions of them)
no way to even add all of user-agent, url, geo-info, ...
Problems with SQL and NoSQL databases
Complex data loading process
needs to pre-aggregate in memory
non-trivial reliability issues
hard to parallelize
There is always latency
from in-memory pre-aggregation and loading jobs
Problems with SQL and NoSQL databases
Customer databases
Event sources*
Raw data stream
Transformed data stream
Real-time data ingestion Kafka
Data Transformation & Enrichment
Node.js, Spark Streaming
Real-time OLAP Store
Druid
Operational Store
Cassandra
High performance, multi-purpose storage
Web analytics dashboard
deep.bi API
ETL
Customer analytics
dashboard
*e.g., mobile apps, websites, marketing campaigns, IoT (beacons, wearables)
Raw Data Store
Hadoop, Parquet, Spark
deep.bi – real-time big data architecture
DEEP Data enrichment, storage & analytics
Client’s DEEP Data Space
End-user browser
Web Data Collection API (HTML or JS)
Trackers pass event data with
<DEEP tracker>
Ingestion API
Data Collection APIs
Mobile Data Collection API (HTML, JS or Native SDK)
Trackers pass event data with <DEEP tracker>
Events are represented with the full flexibility of JSON:
{
  "data": {
    "event_type": "CLICK",
    "ad_request_event": {
      "ctx": {
        "event_time": "2015-07-10T06:15:50.819Z",
        "ip_address": "XX.XX.XX.XX",
        "geo_info": {
          "country": "US",
          "region": "California",
          "city": "San Francisco",
          "timezone": "PST",
          "isp": "XXX",
          "population": 849774
        },
        "page": {
          "raw_url": "XXX",
          "standardized_domain": "XXX"
        },
        "page_info": {
          "page_raw_url": "XXX",
          "product_categories": [ { "id": 20585 }, { "id": 100126 } ]
        },
        "cookie": "ibx8axlw-17j287o",
        "user_agent": "Mozilla/5.0 (Linux; Android 4.2.2; GT-S7580 Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36"
      }
    }
  }
}
Publish-subscribe service: the nervous system of enterprise data
decouple producers from consumers
reliable buffer: send data now, process later
Scalable, distributed, replicated log system
Pause components, restart processing
Powered by web giants like LinkedIn, Twitter, Netflix, Uber, Spotify and Pinterest
>10M messages/second
Apache Kafka
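The decoupling idea ("send now, process later") can be illustrated with an in-memory queue; Kafka adds persistence, partitioning and replication on top of this pattern:

```python
from queue import Queue
from threading import Thread

# Minimal producer/consumer decoupling sketch; this is the pattern,
# not Kafka itself (no durability, partitions or replication here).
log = Queue()

def producer():
    for i in range(5):
        log.put({"event_id": i})   # send now...
    log.put(None)                  # sentinel: end of stream

processed = []

def consumer():
    while (event := log.get()) is not None:
        processed.append(event)    # ...process later, at its own pace

t1, t2 = Thread(target=producer), Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(processed))  # -> 5
```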
Scalable, fault-tolerant stream processing system
with a simple programming model, rich API & integrations
Powered by Yahoo, Netflix, eBay, NASA, Intel, Cisco
It is our fundamental technology for streaming applications: sessionize events, detect fraud, attribute purchases to clicks or views, load & read external stores like Druid, Hadoop, Cassandra
Apache Spark Streaming
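One of the uses listed above, sessionizing events, can be sketched with the common 30-minute inactivity heuristic (a standard convention, not necessarily deep.bi's exact logic):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Split a sorted list of event times into sessions separated by
    gaps longer than SESSION_GAP."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

events = [datetime(2015, 10, 1, 12, 0), datetime(2015, 10, 1, 12, 10),
          datetime(2015, 10, 1, 14, 0)]
print(len(sessionize(events)))  # -> 2 sessions (1h50m gap splits them)
```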
Open Source Streaming Data Store for Interactive Analytics at Scale
denormalized data: no more snowflake or star schemas!
Build real-time dashboards, analytic applications and exploratory tools on it.
It's FAST! aggregate, drill-down, slice-n-dice in sub-seconds
advanced column store with compression
sophisticated approximate algorithms
It's SCALABLE: horizontally scalable (just add more machines), replicated, highly available
Over 100 PBs of data, millions of events/second
Druid – Real-time OLAP Store
Ingest historical & real-time data
data available for exploration in milliseconds
can store years of data in very optimized storage
Powered by eBay, Netflix, PayPal, Yahoo, Cisco
It is our core data store of all events, historical and real-time data
Druid – Real-time OLAP Store
Apache Spark for batch processing: fast and general engine for large-scale data processing
Replaces Map-Reduce, being up to 10x-100x faster!
Number 1 open-source project in the big data space (contributors, commits)
In-memory processing (where possible)
Spark SQL for SQL processing
Apache Parquet - an optimized storage format
columnar: read only the columns you need
compressed: specialized compression per data type + generic compression
2x-4x: 600 GB data -> 150 GB data
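The columnar advantage can be demonstrated with generic compression alone: grouping similar values together compresses better than interleaving them row by row. The data below is synthetic, and real Parquet adds type-specialized encodings on top:

```python
import zlib

# Same table serialized row-major vs. column-major (synthetic data)
countries = ["PL", "DE", "US", "FR"]
devices = ["mobile", "desktop", "tablet"]
rows = [(f"user{i}", countries[i * 7 % 4], devices[i * 5 % 3])
        for i in range(10_000)]

row_major = "\n".join(",".join(r) for r in rows).encode()
col_major = "\n".join(",".join(col) for col in zip(*rows)).encode()

# Column-major keeps the low-cardinality country/device values adjacent,
# so generic compression finds long repeats
row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))
print(row_size, col_size)
```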
Hadoop can be optimized by two orders of magnitude: from hours to seconds!
Hadoop Optimized
Thank you!
Share your thoughts, challenges or case studies with us.
Or drop us a line: hello@deep.bi
Backup slides
Let’s assume we want to find users who:
Were interested in smartphones
Use a Samsung product
Live in cities with a population over 1M people
Are women
Were traveling abroad
Came from our display campaign
So, we have a combination of 6 (k) dimensions from 50 (n).
Complexity of multidimensional queries
Using the combination formula C(50, 6), we get 15,890,700 possible combinations, a number similar to Lotto odds (6 from 49: 13,983,816).
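The arithmetic can be checked directly with Python's math.comb:

```python
from math import comb

# 6 dimensions chosen out of 50 available ones
print(comb(50, 6))  # -> 15890700
# Compare with Lotto: 6 numbers drawn from 49
print(comb(49, 6))  # -> 13983816
```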