big data with matlab and spark - matlabexpo.com · real-time dependence real-world example: sports...

19
1 © 2015 The MathWorks, Inc. Big Data with MATLAB and Spark Pierre Harouimi

Upload: others

Post on 08-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

1© 2015 The MathWorks, Inc.

Big Data

with MATLAB and Spark

Pierre Harouimi

Page 2: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

2

▪ Too much data to handle and

capture it

▪ Difficult to predict

▪ Real-Time dependence

Real-World Example: Sports Analytics

Page 3: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

3

Visualization

Preprocessing

Machine Learning

Big data workflow: from desktop to production

ACCESS DATA

PROCESS ON DESKTOP

SCALE PROBLEM SIZE

Page 4: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

4

▪ Standard tools won’t work

▪ Time-consuming

▪ Need to learn new

tools & rewrite algorithms

So, what’s the big (data) challenges?

Page 5: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

5

Prototype algorithms quickly

Run directly from MATLAB

with tall arrays

Use the same MATLAB code

▪ Standard tools won’t work

▪ Time-consuming

▪ Need to learn new

tools & rewrite algorithms

Solution!

Page 6: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

6

Datastore & tall arrays

Cluster of

Machines

Memory

Single

Machine

Memory

One or more files

datastore

1. Use datastore to define file-list

2. Create tall table from datastore

3. Act like ordinary table in parallel

4. Request on local machine

>> ds = datastore('*.csv')

>> tt = tall(ds)

>> model = fitlm(tt.Temp=...)

>> result = gather(tt.result)

tall array

Single

Machine

MemoryProcess

Single

Machine

MemoryProcess

Page 7: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

7

Tall arrays: very small changes

1 file 1000+ files

Page 8: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

8

Workflow Pattern

Access out of memory data

Work with subsets of your data

Develop functions for event

detection and calculation

Apply functions to all of your data

Aggregate, summarize, & visualize

datastore & tall

findgroups, splitapply

Normal MATLAB code

cellfun

table, histogram, heatmap,

boxplot, binScatterPlot

Page 9: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

9

MATLAB Distributed Computing Server (MDCS)

Local Parallel

Computing

Deployed Parallel Computing

Page 10: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

10

What is Hadoop/Spark?

Cluster-computing framework

Data parallelism

Fault toleranceMachine Learning

Page 11: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

11

Scaling with Spark: Very small changes too!

Desktop Code Spark + Hadoop Code

Page 12: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

12

Big Data with MATLAB & Spark

datastore

Data that don’t fit in memoryACCESS DATA

Enable experts domainsPROCESS ON DESKTOP

tall

Use the SAME MATLAB CodeSCALE PROBLEM SIZE

MDCS

Tall arrays

Subset of your data

Local parallel computing

Page 13: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

13

The MathWorks Fleet Data

Data collected over 1.5 years

21 unique vehicles

1300 trips log files

39 unique channels

CO2

Challenges

Big DataTesting Ideas

Event Detection

Needle in the Haystack

Objectives

Page 14: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

14

Example Setup at MathWorks

Data Warehouse

Server

Engineers

4G LTE

Bluetooth

Page 15: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

15

Analyze fleet data with MATLAB

Page 16: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

16

Access & Explore Data: MATLAB & Spark

MathWorks Vehicle Fleet

Challenge Develop and deploy Data Analytics to run on Spark against

vehicle fleet data stored on Hadoop

Solution Use MATLAB tall arrays to develop analytics on the

desktop and then scale out to the Spark cluster

Results Developed insight and understanding of over 1300 vehicle trips

Fuel efficiency performance under real-world driving conditions

Page 17: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

17

Image Processing

• Active Safety

Statistics

• Summary Statistics

• Regression, ANOVA, Machine Learning

Signal Processing

• Sound quality analysis

• LIDAR analysis

Analysis Domains

MPG Acceleration Displacement Weight Horsepow er

MP

GA

ccele

ratio

nD

ispla

cem

ent

Weig

ht

Hors

epow

er

50 1001502002000 4000200 40010 2020 40

50

100

150

200

2000

4000

200

400

10

20

20

40

Location/Mapping

• Analyzing GPS Data

• Custom Visualizations

Page 18: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

18

Key Takeaways

▪ Use the same MATLAB code

▪ Use new MATLAB data types datastore & tall arrays for out of

memory data sets

▪ Scale your work up with Parallel Computing Toolbox on the desktop or

the MATLAB Distributed Computing Server (MDCS) on Spark

Page 19: Big Data with MATLAB and Spark - matlabexpo.com · Real-Time dependence Real-World Example: Sports Analytics. 3 Visualization Preprocessing Machine Learning Big data workflow: from

19© 2018 The MathWorks, Inc.

© 2018 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for

a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.