Blazing Fast Analytics with MongoDB & Spark
TRANSCRIPT
![Page 1: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/1.jpg)
![Page 2: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/2.jpg)
Blazing Fast Analytics with MongoDB & Spark
![Page 4: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/4.jpg)
Agenda
• The data challenge
• Spark
• Use Cases
• Connectors
• Demo
![Page 5: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/5.jpg)
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
Eric Schmidt, 2010
![Page 6: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/6.jpg)
![Page 7: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/7.jpg)
“Apache Spark is the Taylor Swift of big data software.”
Derrick Harris, Fortune
![Page 8: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/8.jpg)
What is Spark?
Fast and general computing engine for clusters
• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph
• It’s fundamentally different to what’s come before
![Page 9: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/9.jpg)
Why not just use Hadoop?
• Spark is FAST
– Faster to write.
– Faster to run.
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
![Page 10: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/10.jpg)
A visual comparison
Hadoop Spark
![Page 11: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/11.jpg)
RDD Operations
Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey
Actions: reduce, collect, count, save, lookupKey, take, foreach
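The split between transformations and actions matters because Spark evaluates transformations lazily: they only describe work, and nothing runs until an action asks for a result. The sketch below is a plain-Python analogy using the built-in lazy `map` and `filter`, not the Spark API. The word list and the `tag` helper are made up for illustration.

```python
from functools import reduce

words = ["spark", "mongodb", "spark", "hadoop", "spark"]
evaluated = []  # records which elements have actually been processed


def tag(word):
    """Pair each word with a count of 1 and note that it was touched."""
    evaluated.append(word)
    return (word, 1)


# Transformation-like steps: lazy iterators, no element is processed yet.
pairs = map(tag, words)
spark_only = filter(lambda kv: kv[0] == "spark", pairs)
ran_before_action = list(evaluated)  # snapshot taken before any action

# Action-like step: reducing forces the whole chain to run.
total = reduce(lambda acc, kv: acc + kv[1], spark_only, 0)
```

After the last line, `evaluated` holds every input word, while `ran_before_action` is still empty, which mirrors how a chain of RDD transformations does no work until an action such as `reduce` or `count` is called.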
![Page 12: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/12.jpg)
Spark higher level libraries
Spark
Spark SQL, Spark Streaming, MLlib, GraphX
![Page 13: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/13.jpg)
Spark + MongoDB
![Page 14: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/14.jpg)
Data Management
OLTP: Applications, Fine grained operations, Low Latency
Offline Processing: Analytics, Data Warehousing, High Throughput
![Page 15: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/15.jpg)
Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection
![Page 16: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/16.jpg)
MongoDB and Spark
![Page 17: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/17.jpg)
Spark reading directly from MongoDB
![Page 18: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/18.jpg)
Aggregation pipeline to Pre-filter
Aggregation pipeline filter: $match
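The idea of pre-filtering is that a `$match` stage runs inside MongoDB, so only matching documents are shipped to Spark. The sketch below simulates that locally in plain Python to show the shape of the pipeline document. The `messages` data, the field names, and the equality-only `run_pipeline` helper are hypothetical and stand in for what the database would do.

```python
# A pipeline is an ordered list of stages; $match keeps documents that
# satisfy the given criteria. Only simple equality is simulated here.
pipeline = [{"$match": {"type": "message", "flagged": True}}]

# Hypothetical documents from a chat application collection.
messages = [
    {"_id": 1, "type": "message", "flagged": True, "user": "a"},
    {"_id": 2, "type": "login", "flagged": False, "user": "b"},
    {"_id": 3, "type": "message", "flagged": False, "user": "c"},
    {"_id": 4, "type": "message", "flagged": True, "user": "d"},
]


def matches(doc, criteria):
    """Return True when every field in criteria equals the document's value."""
    return all(doc.get(field) == value for field, value in criteria.items())


def run_pipeline(docs, stages):
    """Apply each stage in order; only $match is supported in this sketch."""
    for stage in stages:
        if "$match" not in stage:
            raise NotImplementedError("only $match is simulated here")
        docs = [doc for doc in docs if matches(doc, stage["$match"])]
    return docs


prefiltered = run_pipeline(messages, pipeline)
```

Here `prefiltered` contains only the two flagged messages, so a downstream job would process half of the collection instead of all of it.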
![Page 19: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/19.jpg)
Spark writing directly to MongoDB
![Page 20: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/20.jpg)
Fraud Detection
I'm so in love!
Me, too<3
Now send me your CC number
?
Ok, XXXX-123-zzz
$$$
![Page 21: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/21.jpg)
Fraud Detection
![Page 22: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/22.jpg)
Sharing Workloads
Chat App
HDFS, HDFS, HDFS: Archiving, Data Crunching
Login, User Profile, Contacts, Messages, …
Fraud Detection, Segmentation, Recommendations
Spark
![Page 23: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/23.jpg)
MongoDB + Spark Connector
![Page 24: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/24.jpg)
MongoDB Spark Connector
https://spark-packages.org/?q=official+mongodb
![Page 25: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/25.jpg)
[Diagram: Spark, MongoDB Spark Connector, MongoDB Shard]
MongoDB Spark Connector
https://github.com/mongodb/mongo-spark
![Page 26: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/26.jpg)
Spark Streaming
![Page 27: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/27.jpg)
Spark Streaming
Twitter Feed Spark
![Page 28: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/28.jpg)
Spark Streaming
Twitter Feed
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {"text": "freebandnames", "indices": [20, 34]}
        ],
        "user_mentions": []
      }
    }
  ]
}
![Page 29: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/29.jpg)
Spark Streaming
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {"text": "freebandnames", "indices": [20, 34]}
        ],
        "user_mentions": []
      }
    }
  ]
}
{"time": "Mon Sep 24 03:35", "freebandnames": 1}
[The same tweet payload appears three more times]
{"time": "Mon Sep 24 03:35", "freebandnames": 4}
Spark
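The slide shows tweets being rolled up into one document per minute with a count per hashtag. A minimal sketch of that roll-up in plain Python follows. It is a batch stand-in for what a streaming job would do per window, not Spark Streaming code. The `TWEET` payload is trimmed to the fields used, and `minute_bucket` and `count_hashtags` are hypothetical helper names.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Trimmed version of the tweet payload from the slide.
TWEET = {
    "statuses": [
        {
            "created_at": "Mon Sep 24 03:35:21 +0000 2012",
            "id_str": "250075927172759552",
            "entities": {
                "hashtags": [{"text": "freebandnames", "indices": [20, 34]}]
            },
        }
    ]
}


def minute_bucket(created_at):
    """Truncate a Twitter timestamp to the minute, e.g. 'Mon Sep 24 03:35'."""
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return parsed.strftime("%a %b %d %H:%M")


def count_hashtags(payloads):
    """Count hashtag occurrences per minute across all payloads."""
    buckets = defaultdict(Counter)
    for payload in payloads:
        for status in payload["statuses"]:
            bucket = minute_bucket(status["created_at"])
            for tag in status["entities"]["hashtags"]:
                buckets[bucket][tag["text"]] += 1
    return [{"time": bucket, **counts} for bucket, counts in buckets.items()]


summary = count_hashtags([TWEET] * 4)
```

Feeding the same payload four times yields a single summary document for the 03:35 minute with `freebandnames` counted four times, matching the second summary on the slide.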
![Page 30: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/30.jpg)
Capped Collection
MongoDB and Spark Streaming feature
{"time": "Mon Sep 24 03:35", "freebandnames": 4}
{"time": "Mon Nov 5 09:40", "mongoDBLondon": 400}
{"time": "Mon Nov 5 11:50", "spark": 7556}
{"time": "Mon Nov 24 12:50", "itshappening": 100}
Tailable Cursor
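A capped collection is a fixed-size collection that keeps insertion order and discards the oldest documents once full. A tailable cursor stays open and keeps returning documents as they are appended, much like `tail -f`. The sketch below is an in-memory analogy in plain Python, not the MongoDB driver API. `CappedBuffer` and `TailingReader` are hypothetical names.

```python
from collections import deque


class CappedBuffer:
    """Fixed-size, insertion-ordered store that evicts the oldest entry."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)
        self._next_seq = 0

    def insert(self, doc):
        self._docs.append((self._next_seq, doc))
        self._next_seq += 1

    def read_after(self, last_seq):
        """Return (seq, doc) pairs newer than last_seq that still exist."""
        return [(seq, doc) for seq, doc in self._docs if seq > last_seq]


class TailingReader:
    """Remembers its position and returns only unseen documents."""

    def __init__(self, buffer):
        self._buffer = buffer
        self._last_seq = -1

    def poll(self):
        fresh = self._buffer.read_after(self._last_seq)
        if fresh:
            self._last_seq = fresh[-1][0]
        return [doc for _, doc in fresh]


summaries = CappedBuffer(max_docs=2)
reader = TailingReader(summaries)

summaries.insert({"time": "Mon Sep 24 03:35", "freebandnames": 4})
first_batch = reader.poll()

summaries.insert({"time": "Mon Nov 5 09:40", "mongoDBLondon": 400})
summaries.insert({"time": "Mon Nov 5 11:50", "spark": 7556})
summaries.insert({"time": "Mon Nov 24 12:50", "itshappening": 100})
second_batch = reader.poll()
```

With room for only two documents, the second poll returns the two newest summaries. The `mongoDBLondon` document was evicted before the reader got to it, which is the trade-off of a capped store when the consumer falls behind.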
![Page 31: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/31.jpg)
MongoDB + Spark MLlib Demo
![Page 32: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/32.jpg)
Collaborative Filtering
• Two parts
– Collaborative: uses rating preferences from several users
– Filtering: recommends preferences
UserId / MovieId Star Wars Toy Story Frozen
Buzz 4 4 5
Woody 5 4
Jessie 5 ?
Movie Ratings as a matrix
![Page 33: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/33.jpg)
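MLlib ALS
• Approximate into User & Movie latent factor matrices
[Figure: the UserId / MovieId ratings matrix (users Buzz, Woody, Jessie; movies Frozen, Toy Story, Star Wars) is approximated by a user factor matrix with factors x, y per user and a movie factor matrix with factors x, y per movie; each rating rij is approximated from the factors f(i) and f(j).]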
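To make the factorization concrete, here is a small alternating least squares sketch in plain Python with two latent factors per user and per movie. It is a toy illustration of the technique, not MLlib's implementation. The placement of Woody's and Jessie's ratings in `RATINGS` is an assumption, since the slide's table layout did not survive, and the regularization value and helper names are hypothetical.

```python
import random

# Assumed cell placement; only Buzz's row is unambiguous on the slide.
RATINGS = {
    ("Buzz", "Star Wars"): 4.0, ("Buzz", "Toy Story"): 4.0, ("Buzz", "Frozen"): 5.0,
    ("Woody", "Star Wars"): 5.0, ("Woody", "Toy Story"): 4.0,
    ("Jessie", "Star Wars"): 5.0,
}


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]


def update_side(ratings, fixed, reg, side):
    """Least-squares update of one side (0 = users, 1 = movies)."""
    grouped = {}
    for key, rating in ratings.items():
        grouped.setdefault(key[side], []).append((fixed[key[1 - side]], rating))
    updated = {}
    for name, pairs in grouped.items():
        a = [[reg, 0.0], [0.0, reg]]  # regularized normal equations
        b = [0.0, 0.0]
        for factor, rating in pairs:
            for p in range(2):
                b[p] += factor[p] * rating
                for q in range(2):
                    a[p][q] += factor[p] * factor[q]
        updated[name] = solve_2x2(a, b)
    return updated


def train(ratings, reg=0.1, iterations=20, seed=0):
    """Alternate between solving for user factors and movie factors."""
    rng = random.Random(seed)
    movies = {m: [rng.random(), rng.random()]
              for m in sorted({m for _, m in ratings})}
    users = {}
    for _ in range(iterations):
        users = update_side(ratings, movies, reg, 0)
        movies = update_side(ratings, users, reg, 1)
    return users, movies


def objective(ratings, users, movies, reg):
    """Squared error on known ratings plus the regularization penalty."""
    error = sum((r - dot(users[u], movies[m])) ** 2
                for (u, m), r in ratings.items())
    factors = list(users.values()) + list(movies.values())
    return error + reg * sum(x * x for f in factors for x in f)


users, movies = train(RATINGS)
jessie_toy_story = dot(users["Jessie"], movies["Toy Story"])  # the "?" cell
```

Each half-step holds one factor matrix fixed and solves a small regularized least-squares problem for the other, so the objective never increases from one iteration to the next. The predicted value for the unknown cell depends on the assumed ratings and the seed.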
![Page 34: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/34.jpg)
Prediction Process
• Load movie ratings data from MongoDB
• Reflect and infer the input formats for the ALS algorithm
• Split the data
– 80% for training and 20% for validating the model
• Calculate the best model using the ALS algorithm
– Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the model
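The 80/20 split step can be sketched in a few lines of plain Python. In Spark the equivalent is usually done with `randomSplit`, which produces approximate proportions; the version below shuffles with a fixed seed and cuts exactly. The `ratings` rows and the `train_validation_split` helper are made up for illustration.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (user, movie, rating) rows standing in for data loaded from MongoDB.
ratings = [("user%d" % i, "movie%d" % (i % 3), float(1 + i % 5)) for i in range(10)]
training, validation = train_validation_split(ratings)
```

The model is fitted on `training` only, and `validation` is held back to compare candidate models before the final retraining step.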
![Page 35: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/35.jpg)
Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html
![Page 36: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/36.jpg)
MongoDB + Spark Case Study
![Page 37: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/37.jpg)
China Eastern Airlines – Fare Engine
130K seats, 180 million fares & 1.6 billion daily searches
![Page 38: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/38.jpg)
Spark and MongoDB
• An extremely powerful combination
• Many possible use cases
• Some operations are actually faster if performed using the Aggregation Framework
• Evolving all the time
![Page 40: Blazing Fast Analytics with MongoDB & Spark](https://reader033.vdocument.in/reader033/viewer/2022052706/58715f361a28ab8e5b8b72e5/html5/thumbnails/40.jpg)