hadoop in the wild cmsc 491 hadoop-based distributed computing spring 2015 adam shook

Hadoop in the Wild

CMSC 491
Hadoop-Based Distributed Computing

Spring 2015
Adam Shook

Agenda

• Check out some use cases
• Discuss some architectures

USE CASES

Common Use Cases

• Log Processing
• Image Identification
• Extract Transform Load
• Recommendation Engines
• Time-Series Storage and Processing
• Building Search Indexes
• Long-Term Archive
• Audit Logging
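As a rough illustration of the log processing use case, the sketch below counts requests per HTTP status code using map, shuffle, and reduce steps written in plain Python. The log format (client, path, status separated by spaces) and all function names are assumptions made for this example; a real job would run on a Hadoop cluster rather than in a single process.

```python
from collections import defaultdict


def map_log_line(line):
    """Map step: emit (status_code, 1) for one log line."""
    parts = line.split()
    if len(parts) < 3:
        return []  # skip malformed lines instead of failing the job
    return [(parts[2], 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts for one status code."""
    return key, sum(values)


def count_by_status(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_log_line(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input in the assumed "client path status" format.
SAMPLE_LOG = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "garbage",
]
```

Calling `count_by_status(SAMPLE_LOG)` groups the three well-formed lines by status code and ignores the malformed one.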

Non-Use Cases

• Data processing handled by one large server
• ACID Transactions

A Bank

• Problem
  – Need to analyze customer activity across multiple products to predict credit risk
  – Acquired a number of banks
• Solution
  – Set up a single Hadoop cluster with data from multiple EDWs
  – Bank added new sources of customer service data to get a clear picture of a customer’s financial situation

A Mobile Carrier

• Problem
  – Why are our customers terminating their service contracts?
• Solution
  – Combined transactional and event data with social network data
  – Combined coverage maps with account data

An Online Dating Service

• Problem
  – Surveys, demographics, and web activity used to build a picture of each customer
  – Customers wanted better recommendations
  – Algorithms improved and number of users grew
• Solution
  – Moved data and analysis to Hadoop
  – Able to size system to meet needs of customers

Ad Targeting

• Problem
  – Advertising is a special kind of recommendation
  – Need to select best ad for a particular visitor, but each advertiser is paying to have its ad seen
• Solution
  – Collect stream of user activity with continuous analysis
  – Build sophisticated models of user behavior

POS Transaction Analysis

• Problem
  – Retailers able to collect much more data in stores and online
  – EDWs do not generally support sophisticated analysis to provide better forecasting
• Solution
  – Loaded 20 years of sales transactions and used Hive to do same analysis as before
  – Now able to use new algorithms with new data sets

Sensor Data

• Problem
  – Volume of sensor data from every generator across multiple grids is enormous
  – Clear picture depends on real-time and forensic analysis
• Solution
  – Capture and store all streaming sensor data
  – Built continuous analysis system to watch performance of generators

Threat Analysis

• Problem
  – How do we detect threats and fraudulent activity in an online world?
• Solution
  – Use of HBase to store virus signatures
  – Use of MapReduce to compare spam or malware
• Lambda Architecture

Trade Surveillance

• Problem
  – Difficult to monitor trades for compliance, and impossible to catch rogue traders
• Solution
  – Store trade data and trading party data
  – Continuously monitor activity and build connections
  – Provides cheap storage for law-required auditing

Search

• Problem
  – Indexing stuff is pretty easy, until we went and had to index the Internet
  – User preferences make it harder
• Solution
  – MapReduce was designed for indexing
  – Online retailers depend on search for users finding and buying products
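To make the indexing point concrete, here is a minimal single-process sketch of building an inverted index in map and reduce steps. The function names, the whitespace tokenizer, and the AND-style `search` helper are assumptions for illustration, not how any particular search engine works.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, doc_id) once per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def build_index(documents):
    """Group emitted pairs by term and reduce each group to a sorted posting list."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            postings[term].add(emitted_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

For a small hypothetical catalog such as `{"d1": "red running shoes", "d2": "blue running shorts"}`, `build_index` maps each term to the documents that contain it, and `search` intersects the posting lists of the query terms.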

Data Sandbox

• Problem
  – ???
• Solution
  – Simple storage mechanism with diverse tools for data analysis and exploration

ARCHITECTURES

Building your Data Lake

[Figure: four numbered panels (1 to 4); panel contents not recoverable]

Lambda Architecture

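[Figure: Lambda Architecture diagram. A New Data Stream feeds the BATCH LAYER (All Data, Precompute Views, Batch recompute, Batch views) and the SPEED LAYER (Process Stream, Increment Views, Real-Time Increment, Real-time views). The SERVING LAYER answers each Query from the views. Other labels: QFD 1, QFD 2, ... QFD N. Technologies named: Hadoop, Storm, Apache HBase, HDFS/SQL.]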
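The Lambda Architecture idea can be sketched in a few lines: a batch view is recomputed from all data, a real-time view is incremented per new event, and a query merges the two. The page-view counting scenario and every name below are assumptions chosen for illustration; in practice the layers would run on separate systems such as the ones named on the slide.

```python
from collections import Counter


def recompute_batch_view(all_events):
    """Batch layer: rebuild the view from scratch over the full data set."""
    return Counter(event["page"] for event in all_events)


def increment_realtime_view(realtime_view, event):
    """Speed layer: update the view in place for one newly arrived event."""
    realtime_view[event["page"]] += 1


def query(batch_view, realtime_view, page):
    """Serving side: answer a query by merging batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)
```

A typical flow is to call `recompute_batch_view` periodically over the stored events, start a fresh `Counter()` as the real-time view, call `increment_realtime_view` for each event that arrives after the last batch run, and use `query` to combine both counts.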

Facebook

• EDW (Oracle) was unable to scale and perform
• Investigated small Hadoop system
• Engineers loved it
• Began developing Hive

Facebook

• Time-series summaries
• Ad hoc jobs over historical data
• Long-term archival store for logs
• Look up log events by specific attributes

Facebook Architecture

Facebook Messaging

• Needed a short set of temporal data
• A growing set of data that is rarely accessed
• HBase fit their needs more than other open-source technologies

Twitter Architecture

LinkedIn Architecture

LinkedIn Applications

LinkedIn Future

• MapReduce is not suited for large graph processing

• Batch-oriented nature is not suited for “breaking news”
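One way to see why batch MapReduce can be awkward for graph work: many graph algorithms are iterative, and each iteration becomes a full pass over the data. The sketch below is a simplified single-process illustration, with hypothetical function names, that computes breadth-first distances in MapReduce-style rounds and reports how many rounds were needed.

```python
def bfs_round(graph, distances):
    """One MapReduce-style pass over the nodes reached so far."""
    # Map: every reached node proposes distance + 1 to each of its neighbors.
    proposals = []
    for node, dist in distances.items():
        for neighbor in graph.get(node, []):
            proposals.append((neighbor, dist + 1))
    # Reduce: keep the smallest distance seen for each node.
    updated = dict(distances)
    for node, dist in proposals:
        if node not in updated or dist < updated[node]:
            updated[node] = dist
    return updated


def bfs_distances(graph, source):
    """Repeat rounds until nothing changes; return (distances, rounds run)."""
    distances = {source: 0}
    rounds = 0
    while True:
        updated = bfs_round(graph, distances)
        rounds += 1
        if updated == distances:
            return distances, rounds
        distances = updated
```

On a chain of four nodes, the loop needs one round per hop plus a final round to detect that nothing changed, so the number of passes grows with the distance the search has to cover.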

References

• Hadoop: The Definitive Guide, Chapter 16.2
• http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853
• http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
• http://www.forbes.com/sites/edddumbill/2014/01/14/the-data-lake-dream/
• http://www.slideshare.net/brocknoland/common-and-unique-use-cases-for-apache-hadoop
• http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf
