MongoDB and Hadoop: Driving Business Insights
DESCRIPTION
MongoDB and Hadoop can work together to solve big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other with complex analyses and greater intelligence. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application to drive product recommendations.
TRANSCRIPT
![Page 1: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/1.jpg)
![Page 2: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/2.jpg)
MongoDB and Hadoop
Luke Lovett
Software Engineer, MongoDB
![Page 3: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/3.jpg)
Agenda
• Complementary Approaches to Data
• MongoDB & Hadoop Use Cases
• MongoDB Connector Overview and Features
• Demo
![Page 4: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/4.jpg)
Complementary Approaches to Data
![Page 5: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/5.jpg)
Operational: MongoDB
• Real-Time Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
![Page 6: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/6.jpg)
MongoDB
• Store and read data frequently
• Easy administration
• Built-in analytical tools
– aggregation framework
– JavaScript MapReduce
– Geo/text indexes
![Page 7: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/7.jpg)
Analytical: Hadoop
• Real-Time Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
![Page 8: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/8.jpg)
Hadoop
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics
![Page 9: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/9.jpg)
Operational vs. Analytical: Lifecycle
• Real-Time Analytics
• Product/Asset Catalogs
• Security & Fraud
• Internet of Things
• Mobile Apps
• Customer Data Mgmt
• Single View
• Social
• Churn Analysis
• Recommender
• Warehouse & ETL
• Risk Modeling
• Trade Surveillance
• Predictive Analytics
• Ad Targeting
• Sentiment Analysis
![Page 10: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/10.jpg)
MongoDB & Hadoop Use Cases
![Page 11: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/11.jpg)
Batch Aggregation
Applications powered by MongoDB
Analysis powered by Hadoop
● Need more than MongoDB aggregation
● Need offline processing
● Results sent back to MongoDB
● Can be left as BSON on HDFS for further analysis
MongoDB Connector for Hadoop
![Page 12: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/12.jpg)
Commerce
Applications powered by MongoDB
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis powered by Hadoop
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
MongoDB Connector for Hadoop
![Page 13: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/13.jpg)
Fraud Detection
[Diagram labels]
• Payments
• Online payments processing
• Fraud Detection
• Results Cache
• MongoDB Connector for Hadoop
• Fraud modeling
• Nightly Analysis
• 3rd Party Data Sources
• query only (two connections)
![Page 14: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/14.jpg)
MongoDB Connector for Hadoop
![Page 15: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/15.jpg)
Connector Overview
• Hadoop: MapReduce, Hive, Pig, Spark
• Distributions: Apache Hadoop / Cloudera CDH / Hortonworks HDP / Amazon EMR
• Hadoop Connector to MongoDB: Single Node, Replica Set, Cluster
• Hadoop Connector to HDFS / S3: Text Files, BSON Files
![Page 16: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/16.jpg)
Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries
– Work with the most recent data
– Put load on the operational database
• Snapshots
– Move load to Hadoop
– Add predictable load to MongoDB
![Page 17: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/17.jpg)
Connector Operation
1. Split according to given InputFormat
- many options available for reading from live cluster
- configure key pattern, split strategy
2. Write splits file
3. Output to BSON file or live MongoDB
- BSON file splits written automatically for future tasks
- Mongo insertion round-robin across collections
![Page 18: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/18.jpg)
Getting Splits
• Splits on a sharded cluster
– Split by chunk
– Split by shard
• Splits on replica set/standalone
– splitVector command
• BSON files
– specify max docs
– split per input file
[Diagram: the MongoDB Connector for Hadoop reading from a sharded cluster made up of config servers, a mongos, and three shards with three chunks each]
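The two sharded-cluster strategies can be pictured with a small Python sketch. This is not the connector's code: the chunk metadata layout and the function names below are hypothetical, simplified stand-ins for what the config servers hold, used only to show the difference between one split per chunk and one split per shard.

```python
from collections import defaultdict

# Hypothetical, simplified chunk metadata: (min_key, max_key, shard_name).
CHUNKS = [
    (0, 100, "shard0"),
    (100, 200, "shard1"),
    (200, 300, "shard0"),
    (300, 400, "shard2"),
]

def splits_by_chunk(chunks):
    """One input split per chunk: each split covers a single key range."""
    return [{"min": lo, "max": hi, "shard": shard} for lo, hi, shard in chunks]

def splits_by_shard(chunks):
    """One input split per shard: each split covers all of that shard's chunks."""
    ranges = defaultdict(list)
    for lo, hi, shard in chunks:
        ranges[shard].append((lo, hi))
    return [{"shard": shard, "ranges": r} for shard, r in sorted(ranges.items())]

print(len(splits_by_chunk(CHUNKS)))  # 4 splits, one per chunk
print(len(splits_by_shard(CHUNKS)))  # 3 splits, one per shard
```

Splitting by chunk gives more, smaller tasks; splitting by shard gives fewer, larger ones.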
![Page 20: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/20.jpg)
MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
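One way to hand these properties to a job is a Hadoop-style `<configuration>` XML file. The Python sketch below renders the four properties from this slide into that format. The helper name `to_hadoop_xml` is made up for illustration, and the class names are written with the connector's full `com.mongodb.hadoop` package.

```python
import xml.etree.ElementTree as ET

def to_hadoop_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Property names and URIs are the ones shown on the slide.
mongo_props = {
    "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
    "mongo.input.uri": "mongodb://mydb:27017/db1.collection1",
    "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
    "mongo.output.uri": "mongodb://mydb:27017/db1.collection2",
}

xml_text = to_hadoop_xml(mongo_props)
print(xml_text)
```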
![Page 21: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/21.jpg)
MapReduce Configuration
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
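A BSON dump file is a plain sequence of documents, each starting with a little-endian int32 that gives the document's total length. That makes it possible to cut a file into splits without parsing document contents. The Python sketch below illustrates the idea behind a "max docs per split" setting. It is not the connector's implementation, and `encode_doc` and `split_offsets` are hypothetical helpers.

```python
import io
import struct

def encode_doc(name, value):
    """Hand-encode a one-field BSON document {name: <int32 value>}."""
    element = b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)
    body = element + b"\x00"                        # document terminator
    return struct.pack("<i", len(body) + 4) + body  # length prefix counts itself

def split_offsets(stream, max_docs):
    """Return the byte offsets where each split of up to max_docs documents starts."""
    offsets, count, pos = [], 0, 0
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break
        if count % max_docs == 0:
            offsets.append(pos)
        (length,) = struct.unpack("<i", header)
        stream.seek(length - 4, io.SEEK_CUR)  # skip the rest of this document
        pos += length
        count += 1
    return offsets

# Five 12-byte documents, at most two documents per split.
data = b"".join(encode_doc("n", i) for i in range(5))
print(split_offsets(io.BytesIO(data), max_docs=2))  # [0, 24, 48]
```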
![Page 22: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/22.jpg)
Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
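In PySpark, the steps above might look like the following sketch. It assumes `sc` is a `SparkContext` and that the connector jar is on the classpath. The key and value class names follow examples published for the connector and should be treated as assumptions, and `mongo_read_conf` and `load_collection` are hypothetical helper names.

```python
def mongo_read_conf(uri):
    """Build the Hadoop configuration dict for reading a MongoDB collection."""
    return {
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.input.uri": uri,
    }

def load_collection(sc, uri):
    """Load a collection as an RDD through SparkContext's Hadoop API.

    `sc` is expected to be a pyspark SparkContext.
    """
    return sc.newAPIHadoopRDD(
        "com.mongodb.hadoop.MongoInputFormat",  # input format class
        "org.apache.hadoop.io.Text",            # key class (assumed)
        "org.apache.hadoop.io.MapWritable",     # value class (assumed)
        conf=mongo_read_conf(uri),
    )
```

A call such as `load_collection(sc, "mongodb://mydb:27017/db1.collection1")` would then return an RDD of key/value pairs, one per document.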
![Page 23: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/23.jpg)
Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONSerDe
![Page 24: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/24.jpg)
Hive Support
MongoDB → Hive
Primitive type (int, String, etc.) → Primitive type (int, float, etc.)
Document → Row
Sub-document → Struct, Map, or exploded field
Array → Array or exploded field
● Types given by schema
● May use structs to project fields out of documents and ease access
● Can explode nested fields to make them top-level: {"customer": {"name": "Bart"}} can be accessed with "customer.name".
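The effect of exploding nested fields can be shown with a short Python sketch. It is purely conceptual: in Hive the mapping is declared on the table rather than computed like this, and the `explode` helper is a made-up name.

```python
def explode(doc, prefix=""):
    """Flatten nested sub-documents into top-level, dot-separated field names."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(explode(value, path))  # recurse into the sub-document
        else:
            flat[path] = value
    return flat

print(explode({"customer": {"name": "Bart"}}))  # {'customer.name': 'Bart'}
```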
![Page 25: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/25.jpg)
Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
USING com.mongodb.hadoop.pig.BSONStorage;
![Page 26: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/26.jpg)
Pig Mappings
MongoDB → Pig
Primitive type (int, String, etc.) → Primitive type (int, chararray, etc.)
Document → Tuple (schema given)
Document → Tuple containing a Map (no schema)
Sub-document → Map
Array → Bag
● Organize and prune documents by specifying a schema
● Access full document in a Map without needing a schema
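The two document mappings can be mimicked in a few lines of Python. This only illustrates the idea of pruning by schema versus carrying the whole document in a map. The `to_tuple` helper is hypothetical and is not how the Pig loaders are implemented.

```python
def to_tuple(doc, schema=None):
    """Map a document to a tuple, with or without a schema.

    With a schema, keep only the named fields, in schema order.
    Without one, return a one-element tuple holding the whole document as a map.
    """
    if schema is None:
        return (dict(doc),)
    return tuple(doc.get(field) for field in schema)

doc = {"_id": 1, "name": "Bart", "age": 10, "notes": "unused"}
print(to_tuple(doc, schema=["name", "age"]))  # ('Bart', 10)
print(to_tuple(doc))                          # one-element tuple with the full map
```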
![Page 27: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/27.jpg)
Demo!
![Page 28: Mongo db and hadoop driving business insights - final](https://reader033.vdocument.in/reader033/viewer/2022052909/559828f41a28abf1308b45fc/html5/thumbnails/28.jpg)
Questions?