essential tools for your big data arsenal
DESCRIPTION
For some, Hadoop is synonymous with “Big Data,” but Hadoop is just one component of a successful Big Data architecture. Depending on one’s application, it may not even be the most important part. NoSQL solutions like MongoDB also play a dominant role for storage and real-time data processing, helping companies keep pace with the scale of their data requirements. But NoSQL figures even more prominently in helping enterprises consume a wide variety of data sources at speeds not currently possible in Hadoop. NoSQL, then, offers a useful complement to Hadoop, as well as to the transaction-based data of traditional RDBMSs. Tackling Big Data is not a one-tool job, and so the orchestration of the appropriate NoSQL database with Hadoop and RDBMS is essential. In this session, we’ll dig deep into the different types of NoSQL, identifying how they differ and the types of Big Data workloads for which they’re best suited. We’ll also explore the trade-offs one makes in choosing NoSQL databases like MongoDB or Neo4j over an RDBMS like MySQL, when it makes sense to use both Hadoop and NoSQL, and when it’s more appropriate to use NoSQL on its own.
TRANSCRIPT
Matt Asay (@mjasay), VP, Business Development & Strategy, MongoDB
Essential Tools For Your Big Data Arsenal
The Big Data Unknown
Top Big Data Challenges?
Translation? Most struggle to know what Big Data is, how to manage it and who can manage it
Source: Gartner
Understanding Big Data – It’s Not Very “Big”
from Big Data Executive Summary – 50+ top executives from Government and F500 firms
64% - Ingest diverse, new data in real-time
15% - More than 100TB of data
20% - Less than 100TB (average of all? <20TB)
Innovation As Iteration
“I have not failed. I've just found 10,000 ways that won't work.” ― Thomas A. Edison
Back in 1970…Cars Were Great!
So Were Computers!
Lots of Great Innovations Since 1970
Including the Relational Database
RDBMS Makes Development Hard
Relational Database
Object Relational Mapping
Application
Code XML Config DB Schema
And Even Harder To Iterate
New Table
New Table
New Column
Name Pet Phone Email
New Column
3 months later…
RDBMS
From Complexity to Simplicity
MongoDB
{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name : "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up : "Neray, Graham",
  pay_band : "C",
  benefits : [
    { type : "Health",
      plan : "PPO Plus" },
    { type : "Dental",
      plan : "Standard" }
  ]
}
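To illustrate the contrast this slide draws, here is a minimal Python sketch (not from the talk) that holds the document above as one nested record and then splits it into the two flat row sets a relational schema would typically need. The two-table layout and the `to_relational_rows` helper are hypothetical, chosen only for illustration.

```python
# Illustrative only: the employee document from the slide as one nested record.
employee = {
    "_id": "4c4ba5e5e8aabf3",  # the slide wraps this value in ObjectId(...)
    "employee_name": "Dunham, Justin",
    "department": "Marketing",
    "title": "Product Manager, Web",
    "report_up": "Neray, Graham",
    "pay_band": "C",
    "benefits": [
        {"type": "Health", "plan": "PPO Plus"},
        {"type": "Dental", "plan": "Standard"},
    ],
}


def to_relational_rows(doc):
    """Split one nested document into a parent row and its child rows.

    Assumes a hypothetical layout: employees(...) and benefits(employee_id, ...).
    """
    employee_row = {k: v for k, v in doc.items() if k != "benefits"}
    benefit_rows = [
        {"employee_id": doc["_id"], **benefit} for benefit in doc["benefits"]
    ]
    return employee_row, benefit_rows


employee_row, benefit_rows = to_relational_rows(employee)
print(employee_row["employee_name"], "has", len(benefit_rows), "benefit rows")
```

In the document form the benefits travel with the employee; in the relational form they become separate rows that have to be joined back on `employee_id`.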
So…Use Open Source
Big Data != Big Upfront Payment
RDBMS Is Expensive To Scale
“Clients can also opt to run zEC12 without a raised datacenter floor -- a first for high-end IBM mainframes.”
IBM Press Release 28 Aug, 2012
Spoiled for choice
Rank  DBMS                  Type                   Score    Change
1     Oracle                Relational DBMS        1583.84  54.23
2     MySQL                 Relational DBMS        1331.34  25.58
3     Microsoft SQL Server  Relational DBMS        1207     -106.78
4     PostgreSQL            Relational DBMS        177.01   -5.22
5     DB2                   Relational DBMS        175.83   3.58
6     MongoDB               NoSQL Document Store   149.48   -2.71
7     Microsoft Access      Relational DBMS        142.49   -4.21
8     SQLite                Relational DBMS        77.88    -4.9
9     Sybase                Relational DBMS        73.66    -1.68
10    Teradata              Relational DBMS        54.41    3.32
DB-Engines.com Database Ranking
Remember the Long Tail?
It Didn’t Work Out So Well
Use Popular, Well-Known Technologies
Source: Silicon Angle, 2012
Ask the Right Questions…
“Organizations already have people who know their own data better than mystical data scientists….Learning Hadoop [or MongoDB] is easier than learning the company’s business.”
(Gartner, 2012)
Leverage Existing Skills
Search as a Sign?
When To Use Hadoop, NoSQL
Enterprise Big Data Stack
Applications: CRM, ERP, Collaboration, Mobile, BI
Data Management: RDBMS, EDW, Hadoop, spanning Online Data and Offline Data
Infrastructure: OS & Virtualization, Compute, Storage, Network
Management & Monitoring
Security & Auditing
Consideration – Online vs. Offline
Online
• Real-time
• Low-latency
• High availability
Offline
• Long-running
• High-latency
• Availability is lower priority
Hadoop Is Good for…
• Risk Modeling
• Churn Analysis
• Recommendation Engine
• Ad Targeting
• Transaction Analysis
• Trade Surveillance
• Network Failure Prediction
• Search Quality
• Data Lake
MongoDB/NoSQL Is Good for…
• 360° View of the Customer
• Mobile & Social Apps
• Fraud Detection
• User Data Management
• Content Management & Delivery
• Reference Data
• Product Catalogs
• Machine to Machine Apps
• Data Hub
How To Use The Two Together?
Finding Waldo
Customer example: Online Travel
Travel
• Flights, hotels and cars
• Real-time offers
• User profiles, reviews
• User metadata (previous purchases, clicks, views)
Algorithms
• User segmentation
• Offer recommendation engine
• Ad serving engine
• Bundling engine
MongoDB Connector for Hadoop
Predictive Analytics
Government
• Predictive analytics system for crime, health issues
• Diverse, unstructured (incl. geospatial) data from 30+ agencies
• Correlate data in real-time
Algorithms
• Long-form trend analysis
• MongoDB data dumped into Hadoop, analyzed, re-inserted into MongoDB for better real-time response
MongoDB + Hadoop
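The dump, analyze, re-insert flow described on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the sample events, field names, and the three helper functions are hypothetical, an in-memory list and dict stand in for MongoDB collections, and a simple loop stands in for the Hadoop job.

```python
from collections import defaultdict

# Stand-in for an operational MongoDB collection (hypothetical sample events).
online_events = [
    {"agency": "health", "district": "north", "incidents": 3},
    {"agency": "police", "district": "north", "incidents": 5},
    {"agency": "health", "district": "south", "incidents": 1},
]


def export_batch(collection):
    """Step 1: dump a snapshot of the online data for offline processing."""
    return [dict(doc) for doc in collection]


def analyze_offline(batch):
    """Step 2: the long-running analysis; here, total incidents per district."""
    totals = defaultdict(int)
    for doc in batch:
        totals[doc["district"]] += doc["incidents"]
    return [{"_id": d, "total_incidents": n} for d, n in sorted(totals.items())]


def reinsert(summary_store, results):
    """Step 3: write results back so real-time queries can read them by key."""
    for doc in results:
        summary_store[doc["_id"]] = doc


district_summary = {}  # stand-in for a summary collection in the online store
reinsert(district_summary, analyze_offline(export_batch(online_events)))
print(district_summary["north"]["total_incidents"])
```

The point of the round trip is that the expensive aggregation runs offline, while the online store only has to serve a precomputed lookup.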
Data Hub
Insurance
• Insurance policies
• Demographic data
• Customer web data
• Call center data
• Real-time churn detection
Churn Analysis
• Customer action analysis
• Churn prediction algorithms
MongoDB Connector for Hadoop
Machine Learning
Ad-Serving
• Catalogs and products
• User profiles
• Clicks
• Views
• Transactions
Algorithms
• User segmentation
• Recommendation engine
• Prediction engine
MongoDB Connector for Hadoop
MongoDB + Hadoop Connector
• Makes MongoDB a Hadoop-enabled file system
• Read and write to live data, in-place
• Copy data between Hadoop and MongoDB
• Full support for data processing
  – Hive
  – MapReduce
  – Pig
  – Streaming
  – EMR
MongoDB Connector for Hadoop
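MapReduce is one of the processing modes listed above. As a rough illustration of what such a job does with document-shaped records, the following plain-Python sketch walks through the map, shuffle/sort, and reduce phases. It does not use Hadoop or the connector's API; the click records and the `mapper`, `reducer`, and `run_job` names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical click documents, shaped like records read from a collection.
clicks = [
    {"user": "ann", "product": "tent"},
    {"user": "bob", "product": "stove"},
    {"user": "ann", "product": "stove"},
]


def mapper(doc):
    """Map phase: emit one (key, value) pair per input document."""
    yield doc["user"], 1


def reducer(key, values):
    """Reduce phase: combine all values that share a key."""
    return key, sum(values)


def run_job(docs):
    """Run map, then shuffle/sort, then reduce, all in memory."""
    pairs = [pair for doc in docs for pair in mapper(doc)]
    pairs.sort(key=itemgetter(0))  # shuffle/sort: bring equal keys together
    return dict(
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )


clicks_per_user = run_job(clicks)
print(clicks_per_user)
```

In a real deployment the framework distributes these phases across a cluster; the sketch only shows the shape of the computation.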
@mjasay