Hadoop in the Wild
CMSC 491 Hadoop-Based Distributed Computing
Spring 2015
Adam Shook
Agenda
• Check out some use cases
• Discuss some architectures
USE CASES
Common Use Cases
• Log Processing
• Image Identification
• Extract, Transform, Load (ETL)
• Recommendation Engines
• Time-Series Storage and Processing
• Building Search Indexes
• Long-Term Archive
• Audit Logging
Non-Use Cases
• Data processing handled by one large server
• ACID Transactions
A Bank
• Problem
  – Need to analyze customer activity across multiple products to predict credit risk
  – Acquired a number of banks
• Solution
  – Set up a single Hadoop cluster with data from multiple EDWs
  – Bank added new sources of customer service data to get a clear picture of a customer’s financial situation
A Mobile Carrier
• Problem
  – Why are our customers terminating their service contracts?
• Solution
  – Combined transactional and event data with social network data
  – Combined coverage maps with account data
An Online Dating Service
• Problem
  – Surveys, demographic data, and web activity used to build a picture
  – Customers wanted better recommendations
  – Algorithms improved and number of users grew
• Solution
  – Moved data and analysis to Hadoop
  – Able to size system to meet needs of customers
Ad Targeting
• Problem
  – Advertising is a special kind of recommendation
  – Need to select best ad for a particular visitor, but each advertiser is paying to have its ad seen
• Solution
  – Collect stream of user activity with continuous analysis
  – Build sophisticated models of user behavior
POS Transaction Analysis
• Problem
  – Retailers able to collect much more data in stores and online
  – EDWs generally do not support the sophisticated analysis needed for better forecasting
• Solution
  – Loaded 20 years of sales transactions and used Hive to do the same analysis as before
  – Now able to use new algorithms with new data sets
Sensor Data
• Problem
  – Volume of sensor data from every generator across multiple grids is enormous
  – Clear picture depends on real-time and forensic analysis
• Solution
  – Capture and store all streaming sensor data
  – Built continuous analysis system to watch performance of generators
Threat Analysis
• Problem
  – How do we detect threats and fraudulent activity in an online world?
• Solution
  – Use of HBase to store virus signatures
  – Use of MapReduce to compare spam or malware
• Lambda Architecture
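The signature-matching idea on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the system the slide describes: a dict stands in for the HBase signature table, SHA-256 digests stand in for virus signatures, and the names `map_scan`, `reduce_hits`, and `scan` are hypothetical helpers that mimic a map phase and a reduce phase.

```python
import hashlib
from collections import defaultdict


def digest(payload: bytes) -> str:
    """Fingerprint a payload; a real scanner would use richer signatures."""
    return hashlib.sha256(payload).hexdigest()


# Assumption: a dict stands in for an HBase table keyed by signature.
SIGNATURES = {
    digest(b"malicious-payload-1"): "Trojan.Example.A",
    digest(b"malicious-payload-2"): "Worm.Example.B",
}


def map_scan(record):
    """Map phase: emit (threat_name, source) when a payload matches a signature."""
    source, payload = record
    name = SIGNATURES.get(digest(payload))
    if name is not None:
        yield name, source


def reduce_hits(pairs):
    """Reduce phase: group matching sources under each threat name."""
    hits = defaultdict(list)
    for name, source in pairs:
        hits[name].append(source)
    return {name: sorted(sources) for name, sources in hits.items()}


def scan(records):
    """Run the simulated map and reduce phases over all records."""
    pairs = (pair for record in records for pair in map_scan(record))
    return reduce_hits(pairs)


messages = [
    ("mail-001", b"hello"),
    ("mail-002", b"malicious-payload-1"),
    ("mail-003", b"malicious-payload-1"),
    ("mail-004", b"malicious-payload-2"),
]
report = scan(messages)
```

Keying the lookup table by signature is what makes a random-access store a plausible fit here: each map task needs only a point lookup per payload.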
Trade Surveillance
• Problem
  – Difficult to monitor trades for compliance, and impossible to catch rogue traders
• Solution
  – Store trade data and trading party data
  – Continuously monitor activity and build connections
  – Provides cheap storage for law-required auditing
Search
• Problem
  – Indexing stuff is pretty easy, until we went and had to index the Internet
  – User preferences make it harder
• Solution
  – MapReduce was designed for indexing
  – Online retailers depend on search for users finding and buying products
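To make the indexing point concrete, here is a minimal sketch of building an inverted index with map, shuffle, and reduce steps, simulated in a single Python process. The function names and the three sample documents are assumptions made for illustration; a real job would run the same logic across a cluster.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map phase: emit (word, doc_id) once per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(word, doc_ids):
    """Reduce phase: produce the sorted posting list for one word."""
    return word, sorted(doc_ids)


def build_index(docs):
    """Run the three simulated phases and return word -> posting list."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_document(doc_id, text)
    )
    return dict(
        reduce_postings(word, ids) for word, ids in shuffle(pairs).items()
    )


docs = {
    "d1": "Hadoop stores data",
    "d2": "Hadoop indexes data fast",
    "d3": "search the index",
}
index = build_index(docs)
```

Looking up `index["hadoop"]` returns the documents containing that word, which is the basic operation a search engine performs for each query term.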
Data Sandbox
• Problem
  – ???
• Solution
  – Simple storage mechanism with diverse tools for data analysis and exploration
ARCHITECTURES
Building your Data Lake
Lambda Architecture
[Diagram: a New Data Stream feeds two layers. BATCH LAYER (Hadoop): All Data, Precompute Views, batch recompute. SPEED LAYER (Storm): Process Stream, Increment Views, real-time increment. SERVING LAYER: batch views and real-time views (QFD 1, QFD 2, ... QFD N), which answer the Query. Storage labels shown on the diagram: Apache HBase, HDFS/SQL.]
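The interaction between the three layers can be sketched as a toy page-view counter. This is a simplified illustration, not a production design: the class `LambdaViews` and its methods are hypothetical, in-memory Python structures stand in for Hadoop, Storm, and the serving store, and the sketch ignores events that arrive while a batch run is in progress.

```python
from collections import Counter


class LambdaViews:
    """Toy Lambda Architecture: batch view plus real-time view, merged at query time."""

    def __init__(self):
        self.master = []                # batch layer: append-only log of all data
        self.batch_view = Counter()     # serving layer: view precomputed from master
        self.realtime_view = Counter()  # speed layer: increments since the last batch

    def ingest(self, page):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        """Recompute the batch view from all data, then drop the covered increments."""
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, page):
        """Answer a query by merging the batch view and the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]


views = LambdaViews()
views.ingest("/home")
views.ingest("/home")
views.ingest("/about")
before_batch = views.query("/home")   # served entirely by the real-time view

views.run_batch()
views.ingest("/home")
after_batch = views.query("/home")    # batch view plus one real-time increment
```

Because the batch view is always recomputed from the full master dataset, an error in the speed layer is corrected on the next batch run, which is the main robustness argument usually given for this design.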
• EDW (Oracle) was unable to scale and perform
• Investigated small Hadoop system
• Engineers loved it
• Began developing Hive

• Time-series summaries
• Ad hoc jobs over historical data
• Long-term archival store for logs
• Look up log events by specific attributes
Facebook Architecture
Facebook Messaging
• Needed a short set of temporal data
• A growing set of data that is rarely accessed
• HBase fit their needs more than other open-source technologies
Twitter Architecture
LinkedIn Architecture
LinkedIn Applications
LinkedIn Future
• MapReduce is not suited for large graph processing
• Batch-oriented nature is not suited for “breaking news”
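One commonly cited reason for the graph-processing limitation is that graph algorithms are iterative, and each iteration becomes a separate MapReduce job that rereads the graph. The sketch below simulates breadth-first search this way in plain Python; `bfs_round` and `bfs` are hypothetical names, and the four-node chain graph is made up for illustration.

```python
from collections import defaultdict


def bfs_round(graph, dist):
    """One simulated MapReduce job: extend known distances by one hop."""
    # Map: each reached node re-emits its own distance and proposes
    # distance + 1 to every neighbor.
    proposals = defaultdict(list)
    for node, d in dist.items():
        proposals[node].append(d)
        for neighbor in graph.get(node, []):
            proposals[neighbor].append(d + 1)
    # Reduce: keep the smallest proposed distance for each node.
    return {node: min(ds) for node, ds in proposals.items()}


def bfs(graph, source):
    """Repeat jobs until the distances stop changing; count the jobs run."""
    dist = {source: 0}
    jobs = 0
    while True:
        new_dist = bfs_round(graph, dist)
        jobs += 1
        if new_dist == dist:
            return dist, jobs
        dist = new_dist


graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
distances, jobs = bfs(graph, "a")
```

On this chain the search needs one job per hop plus a final job to detect convergence, so the number of full passes grows with the graph's diameter. On a cluster, each pass also pays job startup and disk I/O costs.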
References
• Hadoop: The Definitive Guide, Chapter 16.2
• http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853
• http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
• http://www.forbes.com/sites/edddumbill/2014/01/14/the-data-lake-dream/
• http://www.slideshare.net/brocknoland/common-and-unique-use-cases-for-apache-hadoop
• http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf