big data with matlab and spark - matlabexpo.com · real-time dependence real-world example: sports...
TRANSCRIPT
1© 2015 The MathWorks, Inc.
Big Data
with MATLAB and Spark
Pierre Harouimi
2
▪ Too much data to handle and
capture it
▪ Difficult to predict
▪ Real-Time dependence
Real-World Example: Sports Analytics
3
Visualization
Preprocessing
Machine Learning
Big data workflow: from desktop to production
ACCESS DATA
PROCESS ON DESKTOP
SCALE PROBLEM SIZE
4
▪ Standard tools won’t work
▪ Time-consuming
▪ Need to learn new
tools & rewrite algorithms
So, what’s the big (data) challenges?
5
Prototype algorithms quickly
Run directly from MATLAB
with tall arrays
Use the same MATLAB code
▪ Standard tools won’t work
▪ Time-consuming
▪ Need to learn new
tools & rewrite algorithms
Solution!
6
Datastore & tall arrays
Cluster of
Machines
Memory
Single
Machine
Memory
One or more files
datastore
1. Use datastore to define file-list
2. Create tall table from datastore
3. Act like ordinary table in parallel
4. Request on local machine
>> ds = datastore('*.csv')
>> tt = tall(ds)
>> model = fitlm(tt.Temp=...)
>> result = gather(tt.result)
tall array
Single
Machine
MemoryProcess
Single
Machine
MemoryProcess
7
Tall arrays: very small changes
1 file 1000+ files
8
Workflow Pattern
Access out of memory data
Work with subsets of your data
Develop functions for event
detection and calculation
Apply functions to all of your data
Aggregate, summarize, & visualize
datastore & tall
findgroups, splitapply
Normal MATLAB code
cellfun
table, histogram, heatmap,
boxplot, binScatterPlot
9
MATLAB Distributed Computing Server (MDCS)
Local Parallel
Computing
Deployed Parallel Computing
10
What is Hadoop/Spark?
Cluster-computing framework
Data parallelism
Fault toleranceMachine Learning
11
Scaling with Spark: Very small changes too!
Desktop Code Spark + Hadoop Code
12
Big Data with MATLAB & Spark
datastore
Data that don’t fit in memoryACCESS DATA
Enable experts domainsPROCESS ON DESKTOP
tall
Use the SAME MATLAB CodeSCALE PROBLEM SIZE
MDCS
Tall arrays
Subset of your data
Local parallel computing
13
The MathWorks Fleet Data
Data collected over 1.5 years
21 unique vehicles
1300 trips log files
39 unique channels
CO2
Challenges
Big DataTesting Ideas
Event Detection
Needle in the Haystack
Objectives
14
Example Setup at MathWorks
Data Warehouse
Server
Engineers
4G LTE
Bluetooth
15
Analyze fleet data with MATLAB
16
Access & Explore Data: MATLAB & Spark
MathWorks Vehicle Fleet
Challenge Develop and deploy Data Analytics to run on Spark against
vehicle fleet data stored on Hadoop
Solution Use MATLAB tall arrays to develop analytics on the
desktop and then scale out to the Spark cluster
Results Developed insight and understanding of over 1300 vehicle trips
Fuel efficiency performance under real-world driving conditions
17
Image Processing
• Active Safety
Statistics
• Summary Statistics
• Regression, ANOVA, Machine Learning
Signal Processing
• Sound quality analysis
• LIDAR analysis
Analysis Domains
MPG Acceleration Displacement Weight Horsepow er
MP
GA
ccele
ratio
nD
ispla
cem
ent
Weig
ht
Hors
epow
er
50 1001502002000 4000200 40010 2020 40
50
100
150
200
2000
4000
200
400
10
20
20
40
Location/Mapping
• Analyzing GPS Data
• Custom Visualizations
18
Key Takeaways
▪ Use the same MATLAB code
▪ Use new MATLAB data types datastore & tall arrays for out of
memory data sets
▪ Scale your work up with Parallel Computing Toolbox on the desktop or
the MATLAB Distributed Computing Server (MDCS) on Spark
19© 2018 The MathWorks, Inc.
© 2018 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for
a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.