big data: it's more than volume, paypal

BIG DATA: IT’S MORE THAN VOLUME Nachum Shacham PayPal Big Data Innovation Summit April 2013

Upload: innovation-enterprise

Post on 24-Jun-2015

DESCRIPTION

In this presentation, Nachum Shacham of PayPal discusses the uses and qualities of Big Data and how it is put to work at PayPal. He covers the ultimate goal of extracting business value, and how to unlock the value in your data with algorithms and with sufficient data further down the long tail.

TRANSCRIPT

Page 1: Big Data: It's More Than Volume, Paypal

BIG DATA: IT’S MORE THAN VOLUME

Nachum Shacham

PayPal

Big Data Innovation Summit

April 2013

Page 2: Big Data: It's More Than Volume, Paypal

IT’S BIG-DATA TIME!

Volume: big platforms

Variety: multiple data types

Velocity: fast response

Value: a treasure of patterns

Ultimate Goal: Extract business value

Page 3: Big Data: It's More Than Volume, Paypal

TECHNOLOGY HYPE CYCLE

DM Tech Forum

BIG DATA

Page 4: Big Data: It's More Than Volume, Paypal

MIXED SIGNALS FROM THE PUNDITS

• Data Lake
• “Needle in a haystack”
• “All hay no needles”
• “Yet another fad”
• “Noth’n new: we’ve been analyzing data for 30 years”
• “Store’em and they’ll come”
• “Don’t ever discard data”
• “$524.752MM ROI in 3 years”
• “Smart” …
• “Hadoop is free”
• “Just…”

Page 5: Big Data: It's More Than Volume, Paypal

USE YOUR OWN FILTER

• Sift facts from MBS
• Seek factual 1-liners
• See through metaphors
• Discount “Smart” (data, algorithms, systems)
• Be skeptical

Page 6: Big Data: It's More Than Volume, Paypal

UNLOCK THE VALUE IN BIG DATA

• Data trumps algorithms
• Sufficient data further down the long tail
• Wisdom of the crowd: effective recommendations
• Combine signals from different media

Page 7: Big Data: It's More Than Volume, Paypal

BUSINESS VALUE IN BIG DATA

RISK ANALYSIS

IDENTIFY INFLUENCERS IN SOCIAL GRAPH

ONLINE ADS

REVENUE OPTIMIZATION

FRAUD DETECTION AND PREVENTION

Page 8: Big Data: It's More Than Volume, Paypal

LET’S DIG INTO BIG DATA

• Define KPIs
• Explore
• Model & Measure
• Visualize signals
• Test
• Question test results
• Rinse and Repeat

Page 9: Big Data: It's More Than Volume, Paypal

BIG-DATA ANALYTICS: FROM SEMI-STRUCTURED DATA TO BUSINESS SIGNALS

MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)

• Similar goals, different challenges

• Leverage familiar tools for fast adoption

Page 10: Big Data: It's More Than Volume, Paypal

MPP PLATFORMS AS WORKBENCHES FOR BIG DATA AND THEIR TOOLS

Diagram labels, platforms: Cloud, RDBMS, Data Warehouse, Hadoop

Diagram labels, tools: Hive, Pig, Java, Scala, SQL, Oozie, Streaming, Python, R, HBase, SQL++, MapReduce

Page 11: Big Data: It's More Than Volume, Paypal

CLASSES OF ANALYTICS JOBS

Big Data

Data organization for BI

A few large models

Many small models

DATA MANIPULATION, GRAPHICS

MODEL BUILDING, CROSS VALIDATION

PROBLEM MR FORMULATION

Page 12: Big Data: It's More Than Volume, Paypal

MATCH THE JOB TO THE PLATFORM

Page 13: Big Data: It's More Than Volume, Paypal

R: THE TOOL FOR ALL ANALYTICS STEPS

• Data Sourcing
• Data Preparation
• Exploratory Data Analysis
• Predictive Models
• Visualization
• Reporting

Page 14: Big Data: It's More Than Volume, Paypal

BI-LINGUAL HADOOP STREAMING: LARGE SCALE PARALLEL PREDICTIVE MODELING

data files

Mapper.py (text processing)
• Process lines
• Set sorting key and value
• Output <key, value>

Shuffle sort

Reducer.R (model per segment)
• Collect segment data marked by key
• Process segment data
• Output processed segment data
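The mapper side of the streaming flow on this slide could look roughly like the sketch below. The comma-separated input layout, the choice of the rack column as the segment key, and all function names are assumptions made for illustration. The slide assigns the per-segment model to Reducer.R, so the Python `reduce_segment` here is only a local stand-in (a mean) that makes the map, shuffle-sort, reduce data flow visible on a small sample.

```python
import sys
from itertools import groupby

def map_line(line):
    """Process one input line and return (key, value), or None to skip it.

    Assumed layout: comma-separated fields where field 11 is the segment
    (rack) and fields 6 and 7 are start/finish times in milliseconds.
    """
    fields = line.rstrip("\n").split(",")
    if len(fields) < 12:
        return None
    try:
        duration_ms = int(fields[7]) - int(fields[6])
    except ValueError:
        return None
    return fields[11], duration_ms

def mapper(lines, out=sys.stdout):
    """Streaming-style mapper: emit tab-separated <key, value> lines."""
    for line in lines:
        pair = map_line(line)
        if pair is not None:
            out.write(f"{pair[0]}\t{pair[1]}\n")

def reduce_segment(values):
    """Hypothetical stand-in for the per-segment model: just the mean."""
    values = list(values)
    return sum(values) / len(values)

def simulate(lines):
    """Run map, shuffle-sort and reduce locally on a small sample."""
    # Sorting the pairs plays the role of the shuffle-sort stage.
    pairs = sorted(p for p in map(map_line, lines) if p is not None)
    return {key: reduce_segment(v for _, v in group)
            for key, group in groupby(pairs, key=lambda p: p[0])}
```

In an actual streaming job the script would call `mapper(sys.stdin)` and the framework would handle the sort and the hand-off to the reducer; `simulate` exists only so the logic can be checked on a laptop before it is submitted to a cluster.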

Page 15: Big Data: It's More Than Volume, Paypal

SEMI-STRUCTURED DATA

Meta VERSION="1" .
Job JOBID="job_201212150932_52151" JOBNAME="DataFilter" USER="user1234" SUBMIT_TIME="1355822133394" JOBCONF="hdfs://tmp/hadoop-hadoop/mapred/staging/user1234/\.staging/job_201212150932_52151/job\.xml" VIEW_JOB=" " MODIFY_JOB=" " JOB_QUEUE="B" .
Job JOBID="job_201212150932_52151" JOB_PRIORITY="NORMAL" .
Job JOBID="job_201212150932_52151" LAUNCH_TIME="1355822223576" TOTAL_MAPS="50" TOTAL_REDUCES="0" JOB_STATUS="PREP" .
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" START_TIME="1355822133148" SPLITS="" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" START_TIME="1355822133545" TRACKER_NAME="tracker_dn0492\.ebay\.com:localhost\.localdomain/127\.0\.0\.1:33613" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)(FILE_BYTES_WRITTEN)(27089)]}{(org\.apache\.hadoop\.mapred\.Task$Counter)(Map-Reduce Framework)[(SPILLED_RECORDS)(Spilled Records)(0)]}" .
Job JOBID="job_201212150932_52151" JOB_STATUS="RUNNING" .
Task TASKID="task_201212150932_52151_m_000001" TASK_TYPE="MAP" START_TIME="1355822133163"

TABULAR DATA

attempt,201212171719,248176,m,000013,0,1355499674337,1355499903213,MAP,SUCCESS,default,rack3,lvsaishdc3dn0109,0109
attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217
attempt,2012121771719,248176,m,000626,0,1355501212902,1355501366476,MAP,SUCCESS,default,rack17,lvsaishdc3dn0776,0776
attempt,2012121771719,248176,m,001193,0,1355499673762,1355499887662,MAP,SUCCESS,default,rack8,lvsaishdc3dn0366,036
attempt,2012121771719,248176,m,001355,0,1355499673545,1355499908182,MAP,SUCCESS,default,rack9,lvsaishdc3dn0386,0386
attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236
attempt,2012121771719,248176,m,001850,0,1355501303142,1355501486691,MAP,SUCCESS,default,rack5,lvsaishdc3dn0235,0235
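The step from semi-structured records to tabular rows can be sketched as a small parser. It assumes each record is a type name followed by KEY="value" pairs, with backslash-escaped characters inside values and an optional trailing " ." terminator, which is how the sample on this slide appears. The function names and the output column order are hypothetical, and the exact columns of the slide's own table are not reproduced.

```python
import re

# KEY="value" pairs; a value may contain backslash-escaped characters.
PAIR_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_record(line):
    """Turn one history record into (record_type, {key: value})."""
    line = line.strip()
    if line.endswith(" ."):
        line = line[:-2]  # drop the record terminator
    record_type, _, rest = line.partition(" ")
    # Undo the escaping, e.g. "job\.xml" becomes "job.xml".
    fields = {key: re.sub(r"\\(.)", r"\1", value)
              for key, value in PAIR_RE.findall(rest)}
    return record_type, fields

def to_row(record_type, fields, columns):
    """Flatten a parsed record into a comma-separated line.

    Missing keys become empty cells. No CSV quoting is applied, so this
    assumes the selected values contain no commas.
    """
    return ",".join([record_type] + [fields.get(c, "") for c in columns])
```

Keeping the parse step separate from the flattening step lets the same parser feed different tabular views, for example one row per task attempt or one row per job.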

Page 16: Big Data: It's More Than Volume, Paypal

FROM TABULAR DATA TO BI

Page 17: Big Data: It's More Than Volume, Paypal

PARALLEL SEGMENTED MODELING

MAPPERS

REDUCERS

R (shown as multiple parallel instances in the diagram)

Page 18: Big Data: It's More Than Volume, Paypal

MODELS BUILT ON LARGE DATASETS

Meta VERSION="1" .
Job JOBID="job_201112150932_52151" JOBNAME="DataFilter" USER="user1234" LAUNCH_TIME="1324801865576"

TIME INTERVAL DATA

CONCURRENCY

PERCENTILES, TIME SERIES, WORD COUNT

REPRESENTATION: AVOID RAM LIMITATIONS

R STAT PROCESSING
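One way to turn time-interval data into a concurrency signal is a sweep over sorted start and finish events. The sketch below assumes each interval is a (start, finish) pair in milliseconds, like the START_TIME and FINISH_TIME fields in the log, and treats a finish at time t as happening before a start at t. The sweep technique and the function names are illustrative choices, not taken from the slides. The result is a compact step representation: one entry per distinct event time rather than one per millisecond.

```python
def concurrency_profile(intervals):
    """Return [(time, running)] steps giving how many intervals are open.

    Each interval is a (start, finish) pair with start <= finish.
    """
    events = []
    for start, finish in intervals:
        events.append((start, 1))    # one more task running
        events.append((finish, -1))  # one task done
    # Tuple sorting puts -1 before +1 at equal times, so a task that
    # finishes at t is not counted as overlapping one that starts at t.
    events.sort()

    profile, running = [], 0
    for time, delta in events:
        running += delta
        if profile and profile[-1][0] == time:
            profile[-1] = (time, running)  # merge events at the same time
        else:
            profile.append((time, running))
    return profile

def peak_concurrency(intervals):
    """Largest number of simultaneously open intervals (0 if none)."""
    return max((n for _, n in concurrency_profile(intervals)), default=0)
```

The step profile can then be passed to a statistics tool for percentiles or time-series plots without materializing a dense timeline.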

Page 19: Big Data: It's More Than Volume, Paypal

Cloud

R LEVERAGING RDBMS POWER

teradataR, SciDB-R

Page 20: Big Data: It's More Than Volume, Paypal

TERADATAR FUNCTIONS (SAMPLE)

Function Name   What it does
td.zscore       Z-score transformation
td.t.paired     Paired t-test
td.cor          Correlation matrix
td.f.oneway     One-way F-test
td.factanal     Factor analysis
td.freq         Frequency analysis
td.hist         Histograms
td.kmeans       K-means clustering
td.ks           Kolmogorov-Smirnov test
td.mode         Mode value of a column
td.tapply       Apply a function over a database column
td.summary      Like R summary()
td.quantiles    Quantile values
td.rank         Rank

Page 21: Big Data: It's More Than Volume, Paypal

ANALYSIS OF A TABLE WITH > 1B ROWS

> library(RJDBC)
> library(teradataR)
> tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")
> system.time(myTbldf <- td.data.frame("myTbl"))
   user  system elapsed
  0.092   0.054 140.071
> dim(myTbldf)
[1] 1,131,670,269 9
> system.time(cor <- td.cor(myTbldf[3:9]))
   user  system elapsed
  0.021   0.003   6.722

          C         D          E          F          G         H          I
C 1.0000000 0.7096425 0.22154483 0.24186862 0.13354501 0.4954111 0.19577803
D 0.7096425 1.0000000 0.24272691 0.27590234 0.13358632 0.4279517 0.14634683
E 0.2215448 0.2427269 1.00000000 0.08940507 0.03734827 0.1631614 0.04401034
F 0.2418686 0.2759023 0.08940507 1.00000000 0.07664496 0.1686094 0.04744032
G 0.1335450 0.1335863 0.03734827 0.07664496 1.00000000 0.1247046 0.05837435
H 0.4954111 0.4279517 0.16316144 0.16860940 0.12470460 1.0000000 0.35395733
I 0.1957780 0.1463468 0.04401034 0.04744032 0.05837435 0.3539573 1.00000000
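Pearson correlation can be computed from a handful of running aggregates (row count, column sums, and sums of cross-products), which is the kind of work that can be delegated to a database instead of pulling a billion rows into R memory. The sketch below shows that arithmetic in plain Python on an in-memory iterable. It illustrates the idea only and does not describe how td.cor is implemented; the function name is hypothetical.

```python
import math

def correlation_matrix(rows):
    """Pearson correlation matrix from single-pass running sums.

    `rows` is an iterable of equal-length numeric tuples. Only O(k^2)
    aggregates are kept for k columns, never the rows themselves.
    A constant column raises ZeroDivisionError (its variance is zero).
    """
    n = 0
    sums = cross = None
    for row in rows:
        if sums is None:
            k = len(row)
            sums = [0.0] * k
            cross = [[0.0] * k for _ in range(k)]
        n += 1
        for i, x in enumerate(row):
            sums[i] += x
            for j, y in enumerate(row):
                cross[i][j] += x * y
    if n < 2:
        raise ValueError("need at least two rows")
    k = len(sums)
    # Scaled covariance terms: n * sum(xy) - sum(x) * sum(y)
    cov = [[n * cross[i][j] - sums[i] * sums[j] for j in range(k)]
           for i in range(k)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(k)]
            for i in range(k)]
```

Because the aggregates are plain sums, partial results from separate data partitions can be added together before the final division, which suits parallel platforms. The sum-of-products formula can lose precision on columns with large means, so a production version would likely use a more numerically stable update.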

Page 22: Big Data: It's More Than Volume, Paypal

CONCLUSION

• Big data is here. See through the hype
• Analyze big data to extract value
• Multiple technologies & analytics tools are out there
• Match platform, tools and approach
• Delegate massive processing to big clusters

Step Up, Dig In, & Have fun

Page 23: Big Data: It's More Than Volume, Paypal

QUESTIONS?

Page 24: Big Data: It's More Than Volume, Paypal

BIG DATA EMPOWERS ALGORITHMS

Banko & Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”