clickstream analysis with spark—understanding visitors in realtime by josef adersberger
TRANSCRIPT
CLICKSTREAM ANALYSIS WITH SPARK – UNDERSTANDING VISITORS IN REAL-TIME
Dr. Josef AdersbergerQAware GmbH, Germany
User Journey Analysis
C V VT VT VT C X
C V
V V V V V V V
C V V C V V V
VT VT V V V VT C
V X
Event stream: User journeys:
Web / Ad tracking
KPIs:§ Unique users§ Conversions§ Ad costs / conversion value§ …
V
X
VT
C Click
View
View Time
Conversion
„Larry & Friends“ Architecture
Collector Aggregation SQL DB
Runs not well for morethan 1 TB data in terms ofingestion speed, query timeand optimization efforts
„Hadoop & Friends“ Architecture
Collector Batch Processor[Hadoop]Collector Event Data Lake Batch Processor Analytics DB
JSON Stream
Aggregation takes too long
Cumbersomeprogramming model(can be solved withpig, cascading et al.)
Not interactiveenough
κ-Architecture
Collector Stream ProcessorAnalytics DB
Persistence
JSON Stream
Cumbersomeprogramming model
Over-engineered: We only need15min real-time ;-)
Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.
λ-Architecture
Collector
Event Processor
Event Data Lake Batch Processor
Analytics DBSpeed Layer
Batch Layer
JSON Stream
Cumbersomeprogramming model Complex
architecture
Redundant logic
Functional Architecture
Strange Events
IngestionRaw Event Stream
Collection Events Processing Analytics Warehouse
FactEntries
Atomic Event Frames
Data Lake
Master Data Integration
§ Buffers load peeks§ Ensures message
delivery (fire & forgetfor client)
§ Create user journeys andunique user sets
§ Enrich dimensions§ Aggregate events to KPIs§ Ability to replay for schema
evolution
§ The representation of truth§ Multidimensional data
model§ Interactive queries for
actions in realtime anddata exploration
§ Eternal memory for all events (even strangeones)
§ One schema per eventtype. Time partitioned.
class Analytics Model
´ factªWebFact
´ dimensionªZeit
´ dimensionªKampagne
Jahr
Quartal
Monat
Woche
Tag
Stunde
Minute
Kunde
+ Land: String
Partner
´ dimensionªTracking
Tracking Group
SensorTag
+ Typ: SensorTagType
Platzierung
+ Format: ImageSize+ Kostenmodell: KostenmodellArt
Werbemittel
+ AdGroup: String+ Format: ImageSize+ Grˆ fle: KiloBytes+ LandingPage: URL+ Motif: URL
Kampagne
´ dimensionªClient
KategorieDev ice
+ Bezeichner: String+ Hersteller: String+ Typ: String
Brow ser
+ Typ: String+ Version: int
´ dimensionªAusspielort
LandRegion
Stadt
´ dimensionªKanal
Kanal
´ dimensionªVermarktung
´ enumerationªSensorTagType
ORDER_TAG MASTER_TAG CUSTOM_TAG
Betriebssystem
+ Typ: String+ Version: Version
? Dimension :Unabh‰ng ig esPr‰dikataufMetrikenbeiderAnalyse("kannisoliertdar¸ bernachdenken/isoliertdazuAnalysenfahren")
? H ierarchie:Sub-Pr‰dikataufMetriken.Erzeug tmehralseine(zueinanderdiskunkte)TeilmengenderMetriken.Entsprichtdeng‰ng ig enDrill-Down-PfadenindenReports bzw.denBatch-Ag g reg ate-Up-PfadeninderAg g reg ationslog ik. SemantischeUnterstrukturen:"istTeilvon& kannnichtexistierenohne".
? Asssoz iation :Nichtverwendet.SeparatesStammdatenmodell. ? Attribut:Ermˆ g lichteineweitere(querschneidende)Einschr‰nkung derMetrikmengeerg‰nzendzudenHierarchien.
Domain
Website
Tracking Site
Vermarkter
Auslieferungs-Domain
Referral
´ enumerati...KostenmodellArt
CPC CPM CPO CPA
´ abstractªDimensionValue
+ id: int+ name: String+ sourceId: String
WebsiteFact
+ Bounces: int+ Verweildauer: float+ Visits: int
BasicAdFact
+ Clicks: int+ Sichtbare Views: int+ Validierte Clicks: int+ View (angefragt): int+ View (ausgeliefert): int+ View (gemessen): int
´ dimensionªProdukt
Shop
Produkt
+ Produktkategorie: String
´ dimensionªZeitfenster
Letzte X Tage
´ dimensionªUser
User Segment
´ dimensionªOrder
OrderStatus
+ Status: OrderStatus
´ enumerationªOrderStatus
IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG
UniquesFact
+ Unique Clicks: int+ Unique Users: int+ Unique Views: int
AdCostFact
+ CPC: int+ Kosten: float
Conv ersionFact
+ PC: int+ PR: int+ PV: int+ Umsatz PC: float+ Umsatz PR: float+ Umsatz PV: float
AdVisibilityFact
+ Sichtbarkeitsdauer: float
Activ atedOrderFact
+ Orders: int+ Umsatz: float
TrackingFact
+ Orders: int+ Page Impressions: int+ Umsatz: float
X= {7,14,28,30}
§ Fault tolerant message handling§ Event handling: Apply schema, time-partitioning, De-dup, sanity
checks, pre-aggregation, filtering, fraud detection§ Tolerates delayed events§ High throughput, moderate latency (~ 1min)
Series Connection of Streaming and Batching - all based on Spark.
IngestionRaw Event Stream
Collection Event Data Lake Processing Analytics Warehouse
FactEntries
SQL InterfaceAtomic Event
Frames
§ Cool programming model§ Uniform dev&ops
§ Simple solution§ High compression ratio due to
column-oriented storage§ High scan speed
§ Cool programming model§ Uniform dev&ops§ High performance§ Interface to R out-of-the-box§ Useful libs: MLlib, GraphX, NLP, …
§ Good connectivity (JDBC, ODBC, …)
§ Interactive queries§ Uniform ops§ Can easily be replaced
due to Hive Metastore
§ Obvious choice forcloud-scale messaging
§ Way the best throughputand scalability of all evaluated alternatives
Technology Mapping
https://github.com/qaware/big-data-landscape
User Interface
Data Lake
Data Warehouse
Ingestion
Processing
Data Science Interactive Analysis Reporting & Dashboards
Data Sources
Analytics
Micro Analytics Services instead of reportingservers.
Charting Libraries:Dashboards:
Analytics FrontendsAlgorithm Libraries
Structured Data Lake: The eternal memory.
Efficient data serialization formats:§ Integated compression§ Column-oriented storage§ Predicate pushdown
Distributed Filesystem or NoSQL DB
Data Workflows ETL Jobs Massive Parallelization
Pig Open Studio
Data Logicstics Stream Processing
NewSQL: SQL meets NoSQL.Polyglott Persistence
Index Machines: Fast aggregation and search.
In-Memory Databases: Fast access.
Time Series Databases
Atlas
Polyglott AnalyticsData Lake Analytics
Warehouse
SQL lane
Rlane
Timeserieslane
Reporting Data ExplorationData Science
No Retention ParanoiaData Lake
AnalyticsWarehouse
§ Eternal memory§ Close to raw events§ Allows replays and refills
into warehouse
Aggressive forgetting with clearly definedretention policy per aggregation level like:§ 15min:30d§ 1h:4m§ …
Events
Strange Events
Continuous Tuning
IngestionRaw Event Stream
Collection Event Data Lake Processing Analytics Warehouse
FactEntries
SQL InterfaceAtomic Event
Frames
Load Generator Throughput & latency probes
In NumbersOverall dev effort until the first release: 250 person days
Dimensions: 10 KPIs: 26
Integrated 3rd party systems: 7Inbound data volume per day: 80GB
New data in DWH per day: 2GB
Total price of cheapest cluster which is able to handle production load:
Bonus Topic: Roadmap
SparkKafka HDFS
IaaS
Simplify Ops with Mesos
Faster aggregation & easier updateswith Spark-on-Solr
http://qaware.blogspot.de/2015/06/solr-with-sparks-or-how-to-submit-spark.html
Bonus Topic: Smart Aggregation
Ingestion Event Data Lake Processing Analytics Warehouse
FactEntries
AnalyticsAtomic Event
Frames
1 2
3
Architecture follows requirementsclass Analytics Model
´ factªWebFact
´ dimensionªZeit
´ dimensionªKampagne
Jahr
Quartal
Monat
Woche
Tag
Stunde
Minute
Kunde
+ Land: String
Partner
´ dimensionªTracking
Tracking Group
SensorTag
+ Typ: SensorTagType
Platzierung
+ Format: ImageSize+ Kostenmodell: KostenmodellArt
Werbemittel
+ AdGroup: String+ Format: ImageSize+ Grˆ fle: KiloBytes+ LandingPage: URL+ Motif: URL
Kampagne
´ dimensionªClient
KategorieDev ice
+ Bezeichner: String+ Hersteller: String+ Typ: String
Brow ser
+ Typ: String+ Version: int
´ dimensionªAusspielort
LandRegion
Stadt
´ dimensionªKanal
Kanal
´ dimensionªVermarktung
´ enumerationªSensorTagType
ORDER_TAG MASTER_TAG CUSTOM_TAG
Betriebssystem
+ Typ: String+ Version: Version
? Dimension :Unabh‰ng ig esPr‰dikataufMetrikenbeiderAnalyse("kannisoliertdar¸ bernachdenken/isoliertdazuAnalysenfahren")
? H ierarchie:Sub-Pr‰dikataufMetriken.Erzeug tmehralseine(zueinanderdiskunkte)TeilmengenderMetriken.Entsprichtdeng‰ng ig enDrill-Down-PfadenindenReports bzw.denBatch-Ag g reg ate-Up-PfadeninderAg g reg ationslog ik. SemantischeUnterstrukturen:"istTeilvon& kannnichtexistierenohne".
? Asssoz iation :Nichtverwendet.SeparatesStammdatenmodell. ? Attribut:Ermˆ g lichteineweitere(querschneidende)Einschr‰nkung derMetrikmengeerg‰nzendzudenHierarchien.
Domain
Website
Tracking Site
Vermarkter
Auslieferungs-Domain
Referral
´ enumerati...KostenmodellArt
CPC CPM CPO CPA
´ abstractªDimensionValue
+ id: int+ name: String+ sourceId: String
WebsiteFact
+ Bounces: int+ Verweildauer: float+ Visits: int
BasicAdFact
+ Clicks: int+ Sichtbare Views: int+ Validierte Clicks: int+ View (angefragt): int+ View (ausgeliefert): int+ View (gemessen): int
´ dimensionªProdukt
Shop
Produkt
+ Produktkategorie: String
´ dimensionªZeitfenster
Letzte X Tage
´ dimensionªUser
User Segment
´ dimensionªOrder
OrderStatus
+ Status: OrderStatus
´ enumerationªOrderStatus
IN_BEARBEITUNG ERFOLGREICH (AKTIVIERT) ABGELEHNT NICHT_IN_BEARBEITUNG
UniquesFact
+ Unique Clicks: int+ Unique Users: int+ Unique Views: int
AdCostFact
+ CPC: int+ Kosten: float
Conv ersionFact
+ PC: int+ PR: int+ PV: int+ Umsatz PC: float+ Umsatz PR: float+ Umsatz PV: float
AdVisibilityFact
+ Sichtbarkeitsdauer: float
Activ atedOrderFact
+ Orders: int+ Umsatz: float
TrackingFact
+ Orders: int+ Page Impressions: int+ Umsatz: float
X= {7,14,28,30}
act Processing
Proc
essi
ng
Web
site
Ord
er A
ctiv
atio
nAd
Cos
tCo
nver
sion
Uniq
ues
+O
verla
pAd
Vis
ibili
tyBa
sic
Ads
Trac
king
Dimensionsv erzeichnis aktualisieren
Page Impressions und Orders
zählenAggregation
TrackingFact
Ad Impressions
zählenAggregation
BasicAdFactCTR und
Sichtbarkeitsrate berechnen
Inferenz der Sichtbarkeitsdauer
Aggregation
AdVisibilityFact
Dimensionsraum aufspannen
Pro Vektor die Ev ents und die dort enthaltenen UserIds
und Interaktionsarten ermitteln
Menge der UserIds pro Interaktionsart erzeugen und deren
Mächtigkeit bestimmen
UniquesFact
User Journeys erstellen (bzw. LV/LC pro User)
Attribution v on Conv ersion auf Basis v orkonfiguriertem
Conv ersion Modell
Conv ersions zählen (PV, PC, PR)
Warenkorbwert ermitteln bei Order-Conv ersion
(PV, PC, PR)Aggregation
Conv ersionFact
Kosten pro Ev ent
ermitteln Aggregation
Nicht zuweisbare Kosten ermitteln
CostFact
Inferenz der Visits
Visits zählen
Bounces zählen
Analyse der Verweildauer
AggregationWebsiteFact
Order Status ermitteln
Anzahl der nicht getrackten Orders ermitteln
Aggregation Activ atedOrderFact
:Analytics Warehouse:Event Data Lake
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
«flow»
Data Processing WorkflowMultidimensional Data Model