Identifying Returning Users with Spark
TRANSCRIPT
![Page 1: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/1.jpg)
Identifying Returning Users
![Page 2: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/2.jpg)
Simply Business
• Largest UK business insurance provider
• More than 370,000 policy holders
• Using BML, tech and data to disrupt the business insurance market
![Page 3: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/3.jpg)
Data ’n’ Analytics
• 4 Data Engineers
• 3 Business Intelligence Developers
• 1 Data Scientist
• 1 Director of Data Science
• And hiring! :-)
![Page 4: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/4.jpg)
Our Infrastructure
Event Enrichment (Spark Streaming)
Event Collector (REST Endpoint) Kinesis
MongoDB
Kinesis
Redshift
Downstream NRT (Spark Streaming)
![Page 5: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/5.jpg)
Returning Visitors
![Page 6: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/6.jpg)
The Challenge
Accurately identify users over time as they interact with us
![Page 7: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/7.jpg)
Why? Knowledge
To understand how our visitors behave:
• They have long buying cycles
• They use multiple devices
• They engage with us through multiple channels
![Page 8: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/8.jpg)
Why? Cost Analysis
To know how much it costs to acquire new customers:
• They visit us from different marketing channels
• Each of them has different costs
• We remunerate our partners depending on that
![Page 9: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/9.jpg)
Why? Personalization
To adapt their visit while they interact with us:
• We can trigger offers depending on the channel they come from
• We can prefill fields if they’re recognized
• We can change the look and feel of the website
![Page 10: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/10.jpg)
Cookies?
They work great when you want to identify people coming back to your website.
Challenges:
• Different devices
• Cleared cookies
• Private sessions
• What about telephone calls…?
![Page 11: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/11.jpg)
Use More IDs!
We can use other IDs provided by visitors to link sessions:
• Login details
• Email addresses
• Telephone numbers
• Credit card numbers
• Fingerprints
• …
![Page 12: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/12.jpg)
Streaming Solution
![Page 13: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/13.jpg)
How many visitors do we have here?
Example

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

![Page 14: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/14.jpg)
How many visitors do we have here? Only 2
Example

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

![Page 15: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/15.jpg)
Example: Streaming

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

Visitor A: 111
![Page 16: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/16.jpg)
Example: Streaming

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

Visitor A: 111, [email protected]
![Page 17: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/17.jpg)
Example: Streaming

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

Visitor A: 111, [email protected]
Visitor B: 555
![Page 18: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/18.jpg)
Example: Streaming

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

Visitor A: 111, [email protected]
Visitor B: 555
Visitor C: 888
![Page 19: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/19.jpg)
Example: Streaming

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

Visitor A: 111, 555, [email protected]
Visitor C: 888
Visitors A and B were merged!
![Page 20: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/20.jpg)
Example: Streaming

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

Visitor A: 111, 555, [email protected]
Visitor C: 888, [email protected]
![Page 21: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/21.jpg)
Streaming Algorithm

    for event in incoming_events:
        prev_visitors = [get_visitor(id) for id in ids(event)]
        if len(prev_visitors) == 0:
            # First time that we see that visitor
            create_visitor(event)
        elif len(prev_visitors) == 1:
            # Visitor coming back using a known ID
            update_visitor(prev_visitors[0], event)
        else:
            # Event IDs were thought to belong to different visitors
            merge_and_update_visitor(prev_visitors, event)

Store mapping: event_ids [1..N] ---> [1] visitor
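The streaming logic above can be sketched as a small runnable example. This is only an illustration under stated assumptions: the `VisitorStore` class and its in-memory dictionaries are hypothetical stand-ins for the real ID-to-visitor store, the `cookie:`/`email:` prefixes are an invented convention, and the email addresses are placeholders because the ones in the slides are obscured. Unlike the pseudocode, it explicitly skips unknown IDs and deduplicates visitors before counting them.

```python
from typing import Dict, List, Set


class VisitorStore:
    """Hypothetical in-memory stand-in for the ID -> visitor mapping."""

    def __init__(self) -> None:
        self.id_to_visitor: Dict[str, int] = {}
        self.visitor_ids: Dict[int, Set[str]] = {}
        self._next_visitor = 0

    def process(self, event_ids: List[str]) -> int:
        # Distinct visitors already known for any ID carried by this event.
        known = sorted({self.id_to_visitor[i]
                        for i in event_ids if i in self.id_to_visitor})
        if not known:
            # First time that we see that visitor.
            visitor = self._next_visitor
            self._next_visitor += 1
            self.visitor_ids[visitor] = set()
        else:
            # Keep the oldest visitor and merge any others into it.
            visitor = known[0]
            for other in known[1:]:
                for i in self.visitor_ids.pop(other):
                    self.id_to_visitor[i] = visitor
                    self.visitor_ids[visitor].add(i)
        # Record every ID of the event against the surviving visitor.
        for i in event_ids:
            self.id_to_visitor[i] = visitor
            self.visitor_ids[visitor].add(i)
        return visitor


# The six example events, reduced to the IDs they carry (placeholder emails).
events = [
    ["cookie:111"],
    ["cookie:111", "email:a@example.com"],
    ["cookie:555"],
    ["cookie:888"],
    ["cookie:555", "email:a@example.com"],
    ["cookie:888", "email:c@example.com"],
]

store = VisitorStore()
for event_ids in events:
    store.process(event_ids)

print(len(store.visitor_ids))  # 2
```

Merging into the oldest visitor is an arbitrary choice made for this sketch; any stable rule for picking the surviving visitor would do.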
![Page 22: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/22.jpg)
Batch Solution
![Page 23: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/23.jpg)
Batch
We also need to solve this problem in batch mode because every now and then we reprocess all our data.
We cannot do it with Spark Streaming because:
• Kinesis only stores up to 7 days of data
• It would take too long
• We risk overwhelming our MongoDB
![Page 24: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/24.jpg)
Example: Batch

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

![Page 25: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/25.jpg)
Batch Algorithm
• Find the connected components in the graph
• Already implemented in GraphX!
• Vertices: visitor IDs
• Edges: pairs of IDs occurring in the same event
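To make the graph formulation concrete, here is a minimal pure-Python sketch that finds connected components with a union-find structure. It is not GraphX and not necessarily how the batch job is written; the function name, the `cookie:`/`email:` prefixes, and the placeholder email addresses are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set


def connected_components(events: Iterable[List[str]]) -> List[Set[str]]:
    """Group IDs into visitors: IDs seen in the same event share an edge."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for ids in events:
        for i in ids:
            find(i)  # register the vertex even if it has no edge
        for a, b in combinations(ids, 2):
            union(a, b)  # edge: two IDs occurring in the same event

    groups: Dict[str, Set[str]] = defaultdict(set)
    for i in list(parent):
        groups[find(i)].add(i)
    return list(groups.values())


# Same six events as in the example table (placeholder emails).
events = [
    ["cookie:111"],
    ["cookie:111", "email:a@example.com"],
    ["cookie:555"],
    ["cookie:888"],
    ["cookie:555", "email:a@example.com"],
    ["cookie:888", "email:c@example.com"],
]

visitors = connected_components(events)
print(len(visitors))  # 2
```

Each resulting set is one visitor, which matches the "Only 2" answer from the earlier example.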
![Page 26: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/26.jpg)
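Example: Batch

| Timestamp | Event | Cookie ID | Email |
|---|---|---|---|
| 1 | page_view | 111 | |
| 2 | details_submitted | 111 | [email protected] |
| 3 | page_view | 555 | |
| 4 | page_view | 888 | |
| 5 | policy_sold | 555 | [email protected] |
| 6 | details_submitted | 888 | [email protected] |

[Figure: ID graph, vertex 555]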
![Page 27: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/27.jpg)
GraphX: Issues
• There is no Python API for GraphX, and GraphFrames’ performance is worse than GraphX’s
• GraphX requires vertices to have numeric IDs
• You’ll have to create a mapping between the real IDs and the numeric IDs
• Perform the connected components calculation
• Map back the numeric IDs to the real IDs
• Documentation is not as good as that of other Spark projects
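The three steps listed above (map real IDs to numeric IDs, compute the components, map back) can be sketched in plain Python. This is an illustration only and does not call GraphX: the function name is hypothetical, the emails are placeholders, and the middle step uses a simple lowest-label propagation loop, chosen because GraphX's connected components also label each vertex with the lowest vertex ID in its component.

```python
from typing import Dict, List, Tuple


def components_via_numeric_ids(edges: List[Tuple[str, str]]) -> Dict[str, str]:
    """Return, for each real ID, a representative real ID of its component."""
    # Step 1: map the real (string) IDs to numeric IDs.
    to_num: Dict[str, int] = {}
    for a, b in edges:
        for real in (a, b):
            to_num.setdefault(real, len(to_num))
    to_real = {num: real for real, num in to_num.items()}
    num_edges = [(to_num[a], to_num[b]) for a, b in edges]

    # Step 2: connected components on numeric IDs. Every vertex ends up
    # labelled with the lowest numeric ID found in its component.
    label = {num: num for num in to_real}
    changed = True
    while changed:
        changed = False
        for a, b in num_edges:
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True

    # Step 3: map the numeric labels back to the real IDs.
    return {to_real[num]: to_real[low] for num, low in label.items()}


# Edges from the example: pairs of IDs occurring in the same event.
edges = [
    ("cookie:111", "email:a@example.com"),
    ("cookie:555", "email:a@example.com"),
    ("cookie:888", "email:c@example.com"),
]

result = components_via_numeric_ids(edges)
print(result["cookie:555"])  # cookie:111
```

An ID that never shares an event with another ID has no edge here; one way to keep it in the result is to pass a self-loop such as `("cookie:999", "cookie:999")`.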
![Page 28: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/28.jpg)
Sum Up
![Page 29: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/29.jpg)
Sum Up

| | NRT: Spark Streaming | Batch: GraphX |
|---|---|---|
| Latency | Seconds | Minutes/Hours |
| Throughput | Low | High |
| Good for | Online | Analysis |
| Development | Your own | Algorithm available |

![Page 30: Identifying Returning Users with Spark](https://reader031.vdocument.in/reader031/viewer/2022030305/58745ee81a28abab198b4b29/html5/thumbnails/30.jpg)
Beware
The fact that two events share an ID does not necessarily mean that they were generated by the same person. Watch out for:
• Clashing IDs
• Fake IDs: [email protected], 07123456789, etc.
• Shared devices: couples, public computers, etc.
• People using your system on behalf of someone else
• Bugs
Analyze your data first!
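One possible way to start that analysis is to count how many other IDs each ID is linked to and inspect the outliers, since fake or shared IDs tend to connect to unusually many cookies. The sketch below is an assumption about how such a check could look, not the method used in the talk: the function name, the `max_linked` threshold, and the sample IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def suspicious_ids(events: Iterable[List[str]], max_linked: int) -> Dict[str, int]:
    """Return IDs linked to more than max_linked other distinct IDs."""
    linked: Dict[str, Set[str]] = defaultdict(set)
    for ids in events:
        for i in ids:
            # Every other ID in the same event counts as a link.
            linked[i].update(j for j in ids if j != i)
    return {i: len(others) for i, others in linked.items()
            if len(others) > max_linked}


# Hypothetical data: one placeholder email typed in from three browsers.
events = [
    ["cookie:1", "email:fake@example.com"],
    ["cookie:2", "email:fake@example.com"],
    ["cookie:3", "email:fake@example.com"],
    ["cookie:4", "email:real@example.com"],
]

flagged = suspicious_ids(events, max_linked=2)
print(flagged)  # {'email:fake@example.com': 3}
```

A sensible threshold depends on the data; flagged IDs could be excluded from linking or reviewed by hand before they merge unrelated visitors.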