© 2014 The MathWorks, Inc.
Data Analytics with MATLAB
Tackling the Challenges of Big Data
Adrienne James, PhD
MathWorks
7th October 2014
2
Big Data in Industry
ENERGY: Asset Optimization
FINANCE: Market Risk, Regulatory
AUTO: Fleet Data Analysis
AERO: Maintenance, Reliability
MEDICAL DEVICES: Patient Outcomes
3
PROCESSING OPTIONS
• MATLAB RESTful interface to Cluster
• MATLAB Hadoop Streaming
• NoSQL connector (e.g. mongo)
• MATLAB / Java App accessing Cluster
• MATLAB Map-Reduce Components
4
Key takeaways
New functions for analysing data that does not fit in memory on your
desktop
– datastore
– mapreduce
& that can scale for use with Hadoop
Additional techniques for predictive modelling with large data
– Work with large data in memory on a cluster (spmd)
Deploy predictive models
– Bring MATLAB analytics to the Web
– Share analytics with a wider community of users
5
How big is big? What characterises “big” data?
Wikipedia
“Any collection of data sets so large and complex that it becomes difficult to
process using … traditional data processing applications.”
Volume : amount of data
Velocity : speed at which data is generated or needs to be analysed
Variety : range of data types/data sources
6
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is your data?
3. What hardware do you have access to?
4. Analysis Characteristics?
7
Example: Airline Delay Analysis
Data
– BTS/RITA Airline On-Time Statistics
– 123.5M records, 29 fields
Analysis Tasks
– Calculate delay patterns
– Visualize summaries
– Estimate & evaluate predictive models
8
Considerations: Large Data Analytics
Airline Data Characteristics
1. Size & type of data?
CSV Data
22 files
12GB
9
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is my data?
• Small subset available locally
• Entire data set stored elsewhere
10
Big Data Analysis with MATLAB – start on the desktop
Explore
Prototype
Scale
Access Share/Deploy
Work on your desktop
Start “simple”
Basic statistics
Explore data
11
Demo: Exploring departure delays using datastore
Explore approaches pre- & post-datastore
Start with a small subset …
What happens as the data size grows?
… until eventually it does not fit in memory on your desktop machine
12
Access & explore bigger data on the desktop more easily
Easily specify data set
– Single text file (or collection of text files)
– Database (using Database Toolbox)
Preview data structure and format
Customise data to import
using column names
Incrementally read
subsets of the data
airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
data = read(airdata);
datastore
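The incremental read on this slide can be extended into a loop over the whole data set. The following is a minimal sketch, not the demo code from the talk: the CSV file and its values are made up so that the example is self-contained, and ReadSize is set artificially small to force several reads. It assumes a MATLAB release in which the datastore properties are named SelectedVariableNames and ReadSize.

```matlab
% Write a tiny, made-up CSV file so the sketch runs on its own.
csvFile = [tempname '.csv'];
T = table([100; 250; 400; 1200], [5; -3; 42; 10], [1; 2; 3; 4], ...
    'VariableNames', {'Distance', 'ArrDelay', 'DayOfWeek'});
writetable(T, csvFile);

% Point a datastore at the file and import only the columns of interest.
airdata = datastore(csvFile);
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
airdata.ReadSize = 2;            % rows per read; deliberately tiny here

% Accumulate a running sum and count, one chunk at a time.
total = 0;
n = 0;
while hasdata(airdata)
    chunk = read(airdata);                 % table holding the next rows
    valid = ~isnan(chunk.ArrDelay);        % ignore missing delays
    total = total + sum(chunk.ArrDelay(valid));
    n = n + nnz(valid);
end
meanDelay = total / n;           % mean arrival delay over all chunks
```

Only one chunk is held in memory at a time, so the same loop should apply unchanged when the file pattern matches far more data than fits in memory.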
13
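datastore extends Data Access Landscape
[Figure: data access functions by data type, arranged from SMALL to increasing data size, with more programming required towards the large end. Text files: textscan, readtable, Import Tool, datastore. Databases: database, database.ODBCConnection. .MAT files: load, matfile. Binary files: fread, memmapfile. Images: imread, ImageAdapter. Streaming data: System objects. Labels "pre-" and "post-" mark the text file options before and after datastore.]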
14
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is your data?
• Small subset available locally
• Entire data set stored elsewhere
3. What hardware do you have access to?
4. Analysis characteristics? Initially, simple statistics & data exploration
15
Big Data Analysis with MATLAB
Explore
Prototype
Scale
Access Share/Deploy
Scale to a cluster
Start locally and then …
16
What is Hadoop?
A Big Data Platform
[Diagram: a datastore reads from HDFS, where the data is spread across several nodes; Map and Reduce tasks run on each node.]
17
A bit of audience participation – mapreduce ….
18
Introducing the mapreduce programming framework
Example: National popularity contest
Input files: newspaper pages
For each page, how many times do “Steve”, “Emily” and “David” get mentioned?
Intermediate files (local disk)
Total mentions
Output files: Steve 11%, Emily 58%, David 31%
19
mapreduce concept – group counts
Input files → Map → Intermediate files (local disk) → Reduce → Output files
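The group-count idea can be sketched with the mapreduce function. This is an illustration under stated assumptions, not the code shown in the demo: the CSV contents are made up, countMapper and sumReducer are hypothetical names, mapreducer(0) is used to force serial execution in the local session, and local functions in a script assume R2016b or later (in older releases, save each function in its own file).

```matlab
% Write a small, made-up CSV file: one row per mention of a name.
csvFile = [tempname '.csv'];
names = {'Steve'; 'Emily'; 'David'; 'Emily'; 'Emily'; 'David'};
pages = [1; 1; 2; 2; 3; 3];
writetable(table(names, pages, 'VariableNames', {'Name', 'Page'}), csvFile);

ds = datastore(csvFile);

% Run map and reduce serially in the local MATLAB session.
outds = mapreduce(ds, @countMapper, @sumReducer, mapreducer(0), ...
    'Display', 'off');
result = readall(outds);     % table with variables Key and Value

function countMapper(data, ~, intermKVStore)
    % Map: count how often each name occurs in this chunk of rows.
    [uniqueNames, ~, idx] = unique(data.Name);
    counts = accumarray(idx, 1);
    addmulti(intermKVStore, uniqueNames, num2cell(counts));
end

function sumReducer(key, intermValIter, outKVStore)
    % Reduce: add up the per-chunk counts collected for one name.
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end
```

The mapper only ever sees one chunk, and the reducer only ever sees the values for one key, which is what allows the same two functions to be run against a larger data set or a different execution environment.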
20
Demo: Exploring mapreduce
21
Explore and Analyze Data on Hadoop
[Diagram: a datastore points at data held in HDFS; MATLAB MapReduce code runs through MATLAB Distributed Computing Server on the Hadoop nodes, with Map and Reduce tasks next to the data on each node.]
ds = datastore('hdfs://myserver:7867/data/file1.txt');
22
Considerations: Large Data Analytics
Data Characteristics
1. Size & type of data?
2. Where is your data?
3. What hardware do you have access to? Cluster
4. Analysis characteristics? Explore predictive modelling
23
Big Data Analysis with MATLAB
Explore
Prototype
Scale
Access Share/Deploy
Scale to a cluster
Options for more involved algorithms …
• may require all data in memory
• multiple iterations …
24
Data Analytics Landscape
[Figure: algorithm complexity (SIMPLE to COMPLEX) plotted against data size (SMALL to increasing). Problem types: easily partitioned, independent tasks; iterative; all data needed in memory at once. Tools shown: built-in numerical & statistical algorithms, vectorisation, parfor, gpuArray, mapreduce, spmd, distributed arrays. More programming effort is required towards the complex end.]
25
Working with more “complex” algorithms with data in memory
on a cluster
[Diagram: a Client and MDCS workers holding data for 1987 through 1992; arrows labelled "Instructions" and "Reduced Data".]
26
Demo: Predictive Modelling
Logistic Regression & Neural Networks
10 busiest airport origins & 7 largest airline carriers
Explore & compare prediction quality of two models to predict flights delayed for more than
20 minutes
– Randomly partition data into test and training sets (cvpartition)
– Model #1: Logistic Regression
– Model #2: Neural Network
Predictor Variables: DayOfWeek, Origin, Airline, DepTime, Distance
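The hold-out split and the first model could look roughly like the sketch below. It is an assumption-laden illustration rather than the demo code: the data is synthetic (generated from a made-up rule, not the airline records), only two of the predictors are mimicked, and cvpartition, fitglm and predict require Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                   % reproducible synthetic data
n = 500;
depTime  = 24 * rand(n, 1);               % made-up departure hour, 0 to 24
distance = 0.1 + 2.4 * rand(n, 1);        % made-up distance, thousands of miles

% Made-up rule: later departures are more likely to be delayed.
pTrue   = 1 ./ (1 + exp(-(depTime - 14) / 3));
delayed = rand(n, 1) < pTrue;             % logical response

% Randomly partition the rows into training (70%) and test (30%) sets.
c = cvpartition(n, 'HoldOut', 0.3);
X = [depTime, distance];

% Model #1: logistic regression, fitted on the training rows only.
mdl = fitglm(X(training(c), :), double(delayed(training(c))), ...
    'Distribution', 'binomial');

% Predicted delay probabilities and accuracy on the held-out rows.
pHat      = predict(mdl, X(test(c), :));
predicted = pHat > 0.5;
accuracy  = mean(predicted == delayed(test(c)));
```

Evaluating both models on the same held-out rows is what makes their prediction quality comparable.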
27
Single Program, Multiple Data
Lab 1
>> mycode
Lab 2
>> mycode
Lab 3
>> mycode
Lab 4
>> mycode
28
Single Program, Multiple Data
Parallel Pool
Client:
spmd
    a = rand;
end
Cluster: Lab 1, Lab 2, Lab 3 and Lab 4 each execute
a = rand;
29
Explore Big Data
Explore
Prototype
Access Share/Deploy
Subset data by filtering or variable selection
and gain insight with visualization
Scale
Explore
Prototype
Scale
Access Share/Deploy
30
Highlights: Airline Delay Analysis
Start small
Scale up
Quick prototyping on large data
Interactive exploration
Interspersed visualizations
Predictive modelling with large data
31
Deploy
Explore
Prototype
Scale
Access Share/Deploy
Hadoop
Enterprise
Web
Desktop
32
Web Analytics: Analysis of traffic around Paris
http://rumeur.bruitparif.fr/
33
Predictive Data Analytics – Load Demand Forecasting
34
Demo
35
MATLAB on Hadoop
Two modes of operation
Execute mapreduce on Hadoop from your MATLAB desktop using
MATLAB Distributed Computing Server
– Extends your desktop environment for use with Hadoop
– Execute algorithms within Hadoop MapReduce on data stored in HDFS
Create standalone applications or libraries for deploying to production
instances of Hadoop
– Locked down package for use in production environments
– Integration of MATLAB analytics with operational systems
36
Key takeaways
New functions for analysing data that does not fit in memory on your
desktop
– datastore
– mapreduce
& that can scale for use with Hadoop
Additional techniques for predictive modelling with large data
– Work with large data in memory on a cluster (spmd)
Deploy predictive models
– Bring MATLAB analytics to the Web
– Share analytics with a wider community of users
37
New Big Data Capabilities in MATLAB
Memory and Data Access
64-bit processors
Memory Mapped Variables
Disk Variables
Databases
Datastores
Platforms
Desktop (Multicore, GPU)
Clusters
Cloud Computing (MDCS on EC2)
Hadoop
Programming Constructs
Streaming
Block Processing
Parallel-for loops
GPU Arrays
SPMD and Distributed Arrays
MapReduce
38
Additional Resources
MathWorks Web Site
Big Data With MATLAB: http://www.mathworks.com/discovery/big-data-matlab.html
MapReduce & Hadoop: http://www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
Machine Learning with MATLAB: http://www.mathworks.com/machine-learning/index.html
A selection of user stories
LiquidNet: Lean Data Analysis: The Awesome Data Dexterity of MATLAB Desktop
Ruukki Metals: Steel Manufacturing Process Analytics
CEESAR: Data Processing Framework Supporting Large Scale Driving Data Analysis
Daimler AG: Analyzing Test Data from a Worldwide Fleet of Fuel Cell Vehicles
39
Thank You