Post on 11-Apr-2017
Real-time big data analytics based on product recommendations case study
IT Business Solutions B2B Conference October 2015
© deep.bi
We started as an ad network
The challenge was to recommend the best product (out of millions)
to the right person in a given moment (thousands of users within a second)
5 billion ad views delivered in 24 months
To put that scale in context:
If we served 1 ad per second, it would take
160 years to serve 5 billion ads
So we needed a solution
SQL databases did not work
Popular NoSQL databases did not work
Standard data warehouse approaches (pre-aggregations, creating schemas) - did not work
Re-thinking all the problems with huge data streams flowing to us every second
we have built a complete solution based on open-source technologies
and fresh, smart ideas from our engineering team
It is called deep.bi and now we make it available to other companies
DEEP.BI = BIG DATA FAST DATA SOLUTION
high velocity high volume
deep.bi lets high-growth companies solve fast data problems by providing
scalable, flexible and real-time data collection, enrichment and analytics
deep.bi – complete data processing flow
collect: unstructured, raw data from many sources (page views, IoT events, IP, URL, cookie, transactions, call detail records, etc.)
enrich: data enrichment, transformation and integration
analyze: find patterns, build models, predict behavior
How to predict the best offer based on online data – case study.
Collect website, campaigns and CRM data
Website: Google Analytics
Campaigns: Agency reports
Apps: Dedicated monitoring tools
Other systems: Call center IVR, emails
Instead of integrating the current reporting tools, we need to gather all the individual events that our customers generate.
Data is stored in silos, and reporting tools provide aggregated reports that are impossible to integrate around a single customer.
Collecting raw web data is not enough
2015-05-15T00:26:41.328Z,3,D,[ip_hidden],i1xszg0f-19hqrje,"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36","[url_hidden]",7279848891,@906,"https://www.google.pl/",vuser-history-allegro-1-hc20150509.1,"122_100003_Park@700:html_620x100_single_banner:See offer"
IP, URL, cookie, user-agent, timestamp
* Coming soon
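The raw line above is essentially a CSV record with quoted fields. A minimal parsing sketch, assuming a simplified field layout (not the exact deep.bi schema):

```python
import csv
import io

# A simplified raw event line (values are placeholders, not real data)
raw = ('2015-05-15T00:26:41.328Z,3,D,[ip_hidden],i1xszg0f-19hqrje,'
      '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/42.0.2311.152 Safari/537.36","[url_hidden]","https://www.google.pl/"')

# csv handles the commas inside the quoted user-agent field correctly
fields = next(csv.reader(io.StringIO(raw)))
timestamp, _, _, ip, cookie, user_agent, url, referrer = fields
print(timestamp, cookie, url)
```

Each parsed field (IP, URL, cookie, user-agent, timestamp) then becomes the input to the enrichment steps described next.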
Enrich raw web and mobile data
50+ attributes
from one interaction
Purchase intent
Device
Time
Location
ISP
Online context
Weather*
Demographics
We can learn quite a few things from the user's IP
Example use: • international travellers • townspeople • people in mountains • rainy day
• Country • Region • City • ZIP Code • Population • Latitude & Longitude • Time zone • IDD prefix to call the city from another country • Phone area code • Mobile Country Code (MCC) • Mobile Network Code (MNC) • Elevation • Weather at the moment of the event
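A toy sketch of IP-based enrichment; the lookup table below is an illustrative stub standing in for a real geo database (e.g. MaxMind), and all values in it are made up:

```python
# Toy IP-to-geo lookup; a real system would query a geo database.
# The table below is an illustrative stub, not real data.
GEO_DB = {
    "203.0.113": {"country": "PL", "city": "Warsaw", "timezone": "CET",
                  "population": 1_700_000, "elevation_m": 100},
}

def enrich_ip(ip):
    """Look up geo attributes by the /24 prefix of an IPv4 address."""
    prefix = ".".join(ip.split(".")[:3])
    return GEO_DB.get(prefix, {})

event = {"ip": "203.0.113.42"}
event.update(enrich_ip(event["ip"]))
print(event["city"])  # -> Warsaw
```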
The ISP tells us more than we might expect
Example use: • competitors' users -> acquisition • our users -> retention/up-selling/cross-selling • people from a particular company or company type
• ISP name or Organization name • Organization type:
• Commercial • Organization • Government • Military • University/College/School • Library • Content Delivery Network • Fixed Line ISP • Mobile ISP • Data Center/Web Hosting/Transit • Search Engine Spider • Reserved
• Mobile brand • Net speed
Detailed information about user device
Example use: • smartphone users • Apple users • Samsung Galaxy users • Google browser users
• Device Type • Device Brand • Device Model • Device Operating System • Operating System Producer • Browser • Browser Producer
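A rough sketch of extracting device attributes from a user-agent string. Real systems rely on maintained user-agent databases; the regexes here are simplistic assumptions for illustration:

```python
import re

def parse_user_agent(ua):
    """Very rough device/OS/browser extraction from a user-agent string."""
    info = {"device_type": "desktop", "os": "unknown", "browser": "unknown"}
    if "Android" in ua or "iPhone" in ua:
        info["device_type"] = "mobile"
    if m := re.search(r"Android [\d.]+|Windows NT [\d.]+|Mac OS X [\d_]+", ua):
        info["os"] = m.group(0)
    if m := re.search(r"(Chrome|Firefox|Safari)/[\d.]+", ua):
        info["browser"] = m.group(1)
    return info

ua = ("Mozilla/5.0 (Linux; Android 4.2.2; GT-S7580) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36")
print(parse_user_agent(ua))
```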
Besides user features, track user behavior too.
Deeper understanding of people’s behavior: • RFM Segmentation (Recency, Frequency, Monetary) • Shopping cart analysis • Purchase sequence analysis
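The RFM idea can be sketched in a few lines; the score thresholds below are illustrative assumptions, not deep.bi's actual segmentation:

```python
from datetime import date

def rfm_score(last_purchase, n_purchases, total_spend, today):
    """Score a user 1-3 on Recency, Frequency and Monetary value.
    Thresholds are illustrative, not production values."""
    recency_days = (today - last_purchase).days
    r = 3 if recency_days <= 30 else 2 if recency_days <= 90 else 1
    f = 3 if n_purchases >= 10 else 2 if n_purchases >= 3 else 1
    m = 3 if total_spend >= 500 else 2 if total_spend >= 100 else 1
    return r, f, m

score = rfm_score(date(2015, 9, 20), 5, 240.0, today=date(2015, 10, 1))
print(score)  # -> (3, 2, 2): recent, moderately frequent, mid spend
```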
User behavior and characteristics help predict the next best action/offer
What product should we recommend?
How could this purchase path end?
So, how to build tailored recommendations? Pick an algorithm that is suitable for the problem
Product [ feature_1, feature_2, …, feature_N]
User [ feature_1, feature_2, …, feature_N]
User [ product_1, product_2, …, product_N]
Simple rules: if a user has certain features, serve this group of products
Manual segment creation: analysts find segments of users and match them with product segments
Simple feature matching: get a weighted user feature vector and match it with product feature vectors
Manual / people managed rules
Find segments automatically (e.g. k-means) Product features based recommendations
User features based recommendations
Combined product- and user-based recommendations (collaborative filtering, deep learning)
Machine learning-supported recommendations
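The "simple feature matching" option above can be sketched as cosine similarity between a weighted user vector and product feature vectors. The feature names and weights below are made up for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Weighted user feature vector vs. product feature vectors
# (features and weights are illustrative assumptions)
user = [0.9, 0.1, 0.7]   # e.g. smartphone interest, gaming, premium brands
products = {
    "galaxy_s6": [1.0, 0.0, 0.8],
    "game_pad":  [0.1, 1.0, 0.2],
    "smart_tv":  [0.3, 0.4, 0.6],
}

best = max(products, key=lambda p: cosine(user, products[p]))
print(best)  # -> galaxy_s6
```

Collaborative filtering replaces these hand-made vectors with ones learned from user-product interaction history.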
Recommendations long tail phenomenon
[Chart: product popularity vs. products; the most interesting recommendations lie in the long tail]
Technology behind Deep BI
Complex data model for query optimization
split dimensions into several tables based on the reports to be made
cherry-pick, in advance, the dimensions we can aggregate, based on cardinality
indexing every dimension column is a must
Impossible to add high-cardinality dimensions
no way to analyze per user (millions of them)
no way to even add all of user-agent, url, geo-info, ...
Problems with SQL and NoSQL databases
Complex data loading process
needs to pre-aggregate in memory
non-trivial reliability issues
hard to parallelize
There is always latency
from in-memory pre-aggregation and loading jobs
Problems with SQL and NoSQL databases
Customer databases
Event sources*
Raw data stream
Transformed data stream
Real-time data ingestion Kafka
Data Transformation & Enrichment
Node.js, Spark Streaming
Real-time OLAP Store
Druid
Operational Store
Cassandra
High performance, multi-purpose storage
Web analytics dashboard
deep.bi API
ETL
Customer analytics
dashboard
*e.g., mobile apps, websites, marketing campaigns, IoT (beacons, wearables)
Raw Data Store
Hadoop, Parquet, Spark
deep.bi – real-time big data architecture
DEEP Data enrichment, storage & analytics
Client’s DEEP Data Space
End-user browser
Web Data Collection API (HTML or JS)
Trackers pass event data with
<DEEP tracker>
Ingestion API
Data Collection APIs
Mobile Data Collection API (HTML, JS or Native SDK)
Trackers pass event data with <DEEP tracker>
Events are represented with the full flexibility of JSON:
{
  "data": {
    "event_type": "CLICK",
    "ad_request_event": {
      "ctx": {
        "event_time": "2015-07-10T06:15:50.819Z",
        "ip_address": "XX.XX.XX.XX",
        "geo_info": {
          "country": "US",
          "region": "California",
          "city": "San Francisco",
          "timezone": "PST",
          "isp": "XXX",
          "population": 849774
        },
        "page": {
          "raw_url": "XXX",
          "standardized_domain": "XXX"
        },
        "page_info": {
          "page_raw_url": "XXX",
          "product_categories": [ { "id": 20585 }, { "id": 100126 } ]
        },
        "cookie": "ibx8axlw-17j287o",
        "user_agent": "Mozilla/5.0 (Linux; Android 4.2.2; GT-S7580 Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.93 Mobile Safari/537.36"
      }
    }
  }
}
Publish-subscribe service: the nervous system of enterprise data
decouple producers from consumers
reliable buffer: send data now, process later
Scalable, distributed, replicated log system
Pause components, restart processing
Powered by web giants like LinkedIn, Twitter, Netflix, Uber, Spotify and Pinterest
>10M messages/second
Apache Kafka
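The decoupling idea ("send now, process later") can be illustrated with an in-memory queue; Kafka adds persistence, partitioning and replication on top of this pattern:

```python
from queue import Queue
from threading import Thread

# Minimal producer/consumer decoupling sketch; this is the pattern,
# not Kafka itself (no durability, partitions or replication here).
log = Queue()

def producer():
    for i in range(5):
        log.put({"event_id": i})   # send now...
    log.put(None)                  # sentinel: end of stream

processed = []

def consumer():
    while (event := log.get()) is not None:
        processed.append(event)    # ...process later, at its own pace

t1, t2 = Thread(target=producer), Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(processed))  # -> 5
```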
Scalable, fault-tolerant stream processing system
with a simple programming model, rich API & integrations
Powered by Yahoo, Netflix, eBay, NASA, Intel, Cisco
It is our fundamental technology for streaming applications: sessionize events, detect fraud, attribute purchases to clicks or views, load & read external stores like Druid, Hadoop, Cassandra
Apache Spark Streaming
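One of the uses listed above, sessionizing events, can be sketched with the common 30-minute inactivity heuristic (a standard convention, not necessarily deep.bi's exact logic):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Split a sorted list of event times into sessions separated by
    gaps longer than SESSION_GAP."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

events = [datetime(2015, 10, 1, 12, 0), datetime(2015, 10, 1, 12, 10),
          datetime(2015, 10, 1, 14, 0)]
print(len(sessionize(events)))  # -> 2 sessions (1h50m gap splits them)
```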
Open Source Streaming Data Store for Interactive Analytics at Scale
denormalized data: no more snowflake or star schemas!
Build real-time dashboards, analytic applications and exploratory tools on it.
It's FAST! aggregate, drill-down, slice-n-dice in sub-seconds
advanced column store with compression
sophisticated approximate algorithms
It's SCALABLE: horizontally scalable (just add more machines), replicated, highly available
Over 100 PBs of data, millions of events/second
Druid – Real-time OLAP Store
Ingest historical & real-time data
data available for exploration in milliseconds
can store years of data in very optimized storage
Powered by eBay, Netflix, PayPal, Yahoo, Cisco
It is our core data store of all events, historical and real-time data
Druid – Real-time OLAP Store
Apache Spark for batch processing: fast and general engine for large-scale data processing
Replaces Map-Reduce, being up to 10x-100x faster!
Number 1 open-source project in the big data space (contributors, commits)
In-memory processing (where possible)
Spark SQL for SQL processing
Apache Parquet - an optimized storage format
columnar: read only the columns you need
compressed: specialized compression per data type + generic compression
2x-4x: 600 GB data -> 150 GB data
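The columnar advantage can be demonstrated with generic compression alone: grouping similar values together compresses better than interleaving them row by row. The data below is synthetic, and real Parquet adds type-specialized encodings on top:

```python
import zlib

# Same table serialized row-major vs. column-major (synthetic data)
countries = ["PL", "DE", "US", "FR"]
devices = ["mobile", "desktop", "tablet"]
rows = [(f"user{i}", countries[i * 7 % 4], devices[i * 5 % 3])
        for i in range(10_000)]

row_major = "\n".join(",".join(r) for r in rows).encode()
col_major = "\n".join(",".join(col) for col in zip(*rows)).encode()

# Column-major keeps the low-cardinality country/device values adjacent,
# so generic compression finds long repeats
row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))
print(row_size, col_size)
```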
Hadoop can be optimized by two orders of magnitude: from hours to seconds!
Hadoop Optimized
Thank you!
Share your thoughts, challenges or case studies with us.
Or drop us a line: hello@deep.bi
Backup slides
Let’s assume we want to find users who:
Were interested in smartphones
Use a Samsung product
Live in cities with a population over 1M people
Are women
Were traveling abroad
Came from our display campaign
So, we have a combination of 6 (k) dimensions from 50 (n).
Complexity of multidimensional queries
Using the combination formula C(50, 6), we get 15,890,700 possible combinations, a number similar to Lotto odds (6 from 49: 13,983,816).
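The arithmetic can be checked directly with Python's math.comb:

```python
from math import comb

# 6 dimensions chosen out of 50 available ones
print(comb(50, 6))  # -> 15890700
# Compare with Lotto: 6 numbers drawn from 49
print(comb(49, 6))  # -> 13983816
```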