The Rise of Big Data Science

Post on 27-Jan-2015

Category: Technology

DESCRIPTION

This is an introductory lecture on today's buzziest technology domain. The domain encompasses many new concepts, keywords, theories and paradigm shifts, from computer science to business.

TRANSCRIPT

GILAD BARKAN

The Rise of Big Data Science

Big Data Science

Big Data

Data Science

Big Data Science

Big Data

Why ? What ? How ?

Why Big Data ?

We live in an era flooded with information

In a world where data is power, big data is big power

Why Big Data ?

Web 2.0

Why should we care about Big Data ?

The big business opportunities

Competitive, fast-moving marketplace

Capitalize on business opportunities before everyone else

Existing channels to every person on the planet

Maximizing revenues from customers

Segment-of-1 - more personal customer experiences

Big Data

Why ? What ? How ?

What is Big Data ?

Volume

Variety

Velocity

The 3 V’s

Big Data - Volume

Big Users: More Users, All the Time

Global Online Population: 2 Billion

Smartphone Users: 1 Billion+

Hours Spent Online: 35 Billion Hours

Big Data: More Data, More Users

What is Big Data ?

Volume

Variety

Velocity

The 3 V’s

Heterogeneous sources of data: Structured and Unstructured

Trillions of Gigabytes (Zettabytes)

Text, Log Files, Click Streams, Blogs, Tweets, Audio, Video, etc.

Big Data - Variety

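Data     Typical size       Kind of data               Storage
tables   5 KB / record      Structured Data            Traditional Structured SQL
text     50 KB / record     Un/Semi-Structured Data    Unstructured NoSQL
images   1000 KB / image    Un/Semi-Structured Data    Unstructured NoSQL
audio    5000 KB / song     Un/Semi-Structured Data    Unstructured NoSQL
video    700 MB / movie     Un/Semi-Structured Data    Unstructured NoSQL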

What is Big Data ?

Volume

Variety

Velocity

The 3 V’s

Big Data - Velocity

How the hell does Google return an answer in 0.28 seconds by looking at 4 Billion pages?

Big Data - Velocity

Online Advertisement - Real Time Bidding (RTB)

Big Data - Velocity

Recommendations

Big Data

Why ? What ? How ?

How is Big Data Handled ?

The challenge is huge

Store, analyze and serve a huge volume and variety of data at high velocity

We can’t achieve this using a single machine, no matter how strong it is. Why?

Expensive (stay tuned)

Load balancing requests

Outbrain serves 3,000 requests per second

DG (MediaMind) serves 500K requests per second!!!

Not fault tolerant

Distributing the Data

The Big Data Paradigm Shifts

Volume

Scale Up (Vertical): SQL Server

Scale Out (Horizontal): HDFS (GFS) across the nodes of a Hadoop Cluster
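As a rough illustration of the scale-out idea, the sketch below spreads records over several nodes by hashing each record key. This is a generic partitioning example and not how HDFS works internally (HDFS splits files into replicated blocks). The node count, key names and function names are all assumptions made up for the example.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size, chosen only for this example


def node_for_key(key, num_nodes=NUM_NODES):
    """Pick a node index from a stable hash of the record key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def distribute(keys, num_nodes=NUM_NODES):
    """Group record keys by the node that would store them."""
    nodes = {index: [] for index in range(num_nodes)}
    for key in keys:
        nodes[node_for_key(key, num_nodes)].append(key)
    return nodes


# Hypothetical record keys; real data would come from the application.
record_keys = [f"user_{i}" for i in range(1000)]
placement = distribute(record_keys)

for index, keys in placement.items():
    print(f"node {index}: {len(keys)} records")
```

Adding capacity then means adding nodes rather than buying a bigger machine, which is the contrast with scale up that the slide draws.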

Big Data – Reducing Costs

Hadoop is a 5 times cheaper infrastructure !!!

TCO (purchase + maintenance) for 3 years per 300 TB:

75-node cluster = 1 M$

DBMS server = 5 M$
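The arithmetic behind the "5 times cheaper" claim can be checked in a few lines. The figures are the ones quoted on the slide; the variable and function names are only illustrative.

```python
# Figures quoted on the slide: 3-year TCO (purchase + maintenance) for 300 TB.
CAPACITY_TB = 300
HADOOP_NODES = 75
HADOOP_TCO_USD = 1_000_000  # 75-node Hadoop cluster
DBMS_TCO_USD = 5_000_000    # DBMS server


def cost_per_tb(tco_usd, capacity_tb=CAPACITY_TB):
    """3-year cost of storing one terabyte."""
    return tco_usd / capacity_tb


cost_ratio = DBMS_TCO_USD / HADOOP_TCO_USD
print(f"DBMS / Hadoop cost ratio: {cost_ratio:.0f}x")
print(f"Hadoop: {cost_per_tb(HADOOP_TCO_USD):,.0f} $ per TB")
print(f"DBMS:   {cost_per_tb(DBMS_TCO_USD):,.0f} $ per TB")
print(f"Hadoop: {CAPACITY_TB / HADOOP_NODES:.0f} TB per node")
```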

Big Data Paradigm Shift - Computing

MapReduce Computing Paradigm

Exploiting the distributed architecture for large scale computations in parallel

MapReduce

“Hello MapReduce” – counting words

Word counts per page (C = count, W = word), one table for each of URL 1, URL 2 and URL 3:

C   W        C   W        C   W
5   the      7   the      9   the
0   Cow      1   Cow      1   Cow
2   quick    0   quick    3   quick

Combined counts:

C    W
21   the
2    Cow
5    quick

MapReduce + Hadoop Cluster

Master

Mappers: each sends its {w, c} pairs (word, count) to the Reducer

Reducer
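The word-count flow can be sketched in plain Python on a single machine. This is only a minimal illustration of the map and reduce steps, not Hadoop's API: the function names are assumptions, and on a real cluster each map call would run on a different node.

```python
from collections import defaultdict


def map_count_words(text):
    """Map step: count the words of one page and emit (word, count) pairs."""
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1
    return list(counts.items())


def reduce_sum_counts(pairs):
    """Reduce step: sum the counts emitted for each word by all mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# The three per-page tables from the slide, written as mapper outputs.
slide_mapper_outputs = [
    [("the", 5), ("Cow", 0), ("quick", 2)],
    [("the", 7), ("Cow", 1), ("quick", 0)],
    [("the", 9), ("Cow", 1), ("quick", 3)],
]

all_pairs = [pair for output in slide_mapper_outputs for pair in output]
slide_totals = reduce_sum_counts(all_pairs)
print(slide_totals)  # the: 21, Cow: 2, quick: 5, as in the combined table

# The map step applied to a made-up sentence.
print(map_count_words("the quick cow the"))
```

The reducer only needs the small {w, c} pairs, not the pages themselves, which is why the counting can run in parallel where the pages are stored.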

Big Data Paradigm Shift – NoSQL

Schema-less databases to support the variety of data

Complex SQL queries (joins, etc.) in a distributed data framework are extremely inefficient

Key-Value Store NoSQL

Key: user_id, url, image_id, video_id

Any key, not a single primary key as in SQL

Value: tables, text, images, video, any

Variety
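A key-value store can be pictured as a dictionary whose values have no fixed schema. The class below is a toy in-memory sketch under that assumption; the class name, the key format and the sample values are made up for illustration and do not correspond to any specific NoSQL product.

```python
class KeyValueStore:
    """Toy schema-less store: any hashable key maps to any kind of value."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        """Store or overwrite the value kept under key."""
        self._items[key] = value

    def get(self, key, default=None):
        """Return the value kept under key, or default if it is missing."""
        return self._items.get(key, default)


store = KeyValueStore()

# Hypothetical records: each value has a different shape, no schema needed.
store.put("user_id:42", {"name": "Dana", "plan": "prepaid"})
store.put("url:http://example.com/", "<html>hello</html>")
store.put("image_id:7", b"\x89PNG raw bytes")

print(store.get("user_id:42"))
print(store.get("video_id:1", default="not stored yet"))
```

Lookups go straight from key to value, so no join across tables is needed to fetch a record.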

Big Data Paradigm Shift –

RAM-based DBs instead of traditional disk-based DBs

Store critical data in memory (much more expensive)

If the data doesn't come to Alg, Alg will come to the data

Velocity

Traditional: Alg and Data are separate, and Alg reads and writes the Data

Today: Alg runs next to the Data it reads and writes

Big Data - Summary

BIG business opportunities

The 3 V’s: Volume, Variety, Velocity

Technological paradigm shifts

Big Data Technological Paradigm Shifts

Volume: Scale Up vs. Scale Out

Computing: MapReduce (Master, Mappers, Reducer)

Variety: NoSQL (Key, Value)

Velocity: Alg and Data

Big Data - Summary

BIG business opportunities

The 3 V’s: Volume, Variety, Velocity

Computing and DB paradigm shifts

Flood of new (open source) technologies

Flood of New Big Data Technologies

Open Source

Big Data - Summary

BIG business opportunities

The 3 V’s: Volume, Variety, Velocity

Computing and DB paradigm shifts

Flood of new (open source) technologies

It’s definitely not just a buzz

Big Buzz ?

Big Data - Summary

BIG business opportunities

The 3 V’s: Volume, Variety, Velocity

Computing and DB paradigm shifts

Flood of new (open source) technologies

It’s definitely not just a buzz

It’s a real response to the world’s hectic pace of evolution

Reducing costs by an order of magnitude

Still, it doesn’t mean every business today will / should transform its technology stack to support big data

Big Data Science

Big Data

Data Science

Big Data Science

Data Science

Why ? What ? How ?

data scientists

Why Data Science ?

Data has real value

Facebook acquires Onavo for ~150M$

Data Science

Why ? What ? How ?

Welcome to the Intelligent world

Data Science

Data Analysis

Data Mining

Automatic Decisioning

Predictive Analytics

Machine Learning

Data Analytics

Data Miners are the New Gold Miners

Search

Online Advertisement - Real Time Bidding (RTB)

Recommendations

Text Analysis

CRM – Customers Churn Prediction

Time Series Analysis

Machine Learning

Classification

Clustering

Regression

Recommendation

Classification

Amdocs Insight™ - why is the customer calling the Call Center ?

Third Party Charges

Pay Bill

Abnormal fee

Bill too high

Overage

Clustering

Market Segmentation

Social Network Analysis

Regression

Housing price prediction

[Plot: Price ($) in 1000’s against Size in m2; marked values: 130, 215, 280]
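Housing price prediction can be sketched as fitting a straight line by ordinary least squares. The training points below are made up for the example (they are deliberately exactly linear and are not the points of the slide's plot), and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(size_m2, slope, intercept):
    """Predicted price (in $1000's) for a house of the given size."""
    return intercept + slope * size_m2


# Made-up training data: size in m2 and price in $1000's.
sizes_m2 = [50, 100, 150, 200, 250]
prices_k = [110, 180, 250, 320, 390]

slope, intercept = fit_line(sizes_m2, prices_k)
print(f"price = {intercept:.1f} + {slope:.2f} * size")
print(f"predicted price for 130 m2: {predict(130, slope, intercept):.0f}")
```

Real price data is noisy, so the fitted line would pass near the points rather than through them, and the model would be evaluated on houses it has not seen.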

The Data Scientist

Data Scientist Skillset

Hands-on with tools, languages, technologies

MSc / PhD in Math, CS, Stats, Physics

Hands-on in the specific problem domain

Data Science ≠ BI

Apply advanced statistical machine learning algorithms to dig deeper and find patterns that traditional BI tools may not reveal

Much wider spectrum of domains / applications

Predictive Analytics ≠ Exploratory Analytics

Traditional BI (Business Intelligence): Exploratory Analytics

Vs.

Big Data Science (Data Science): Predictive Analytics

Academia Response to Data Science

Data Science

Why ? What ? How ?

The Art of Data Science

We need at least a one-semester course for it

Still…

Data Science Life Cycle

Understand Data

Prepare Data

Model

Evaluate

Deploy

Monitor

Offline Data Analysis

Run Time

Business Goal

Big Data

Data Science

Big Data Science

Closing the Loop

Technically speaking, what do you think? Is Big Data good or bad for Data Science ?

The Bad - Finding a Needle in a Haystack

It’s the same treasure that hides – the problem is that the pile is now huge

Big Data Big Noise

The Good - The Statistical View

Statistics is predictive analytics’ fuel !

The more data you have (Big Data), the better your predictive models will perform

Law of Large Numbers
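A small simulation shows the Law of Large Numbers at work: the share of heads in fair coin flips drifts toward 0.5 as the number of flips grows. The function name, the seed and the sample sizes are arbitrary choices made for this sketch.

```python
import random


def sample_mean_of_coin_flips(num_flips, seed=0):
    """Fraction of heads observed in num_flips simulated fair coin flips."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips


for num_flips in (10, 1_000, 100_000):
    mean = sample_mean_of_coin_flips(num_flips)
    print(f"{num_flips:>7} flips: share of heads = {mean:.4f}")
```

With only a handful of flips the estimate can be far from 0.5, while with many flips it settles close to it, which is the statistical argument for more data.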

Combining the Good & Bad

Data is a function of quality and quantity

[Chart: Quantity (Small to Big) against Quality (Low to High)]

Big Data Science - Summary

Big Data

Big Numbers

Big Opportunities

Big Data is the buzziest technology nowadays

Data Scientists

Are the ones who coax the treasures out of the big data for their companies

Are multi-discipline skilled

Are the new industry rock stars

Thank You for your attention
