cleaning up and managing data in matlab · typical challenges in data cleaning, management drowning...

26
1 © 2017 The MathWorks, Inc. Cleaning Up and Managing Dirty Data in MATLAB Siddharth Sundar, Application Engineer

Upload: others

Post on 06-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

1© 2017 The MathWorks, Inc.

Cleaning Up and Managing Dirty Data in MATLAB

Siddharth Sundar, Application Engineer

Page 2: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

5

Agenda

▪ Typical Challenges with Data Handling and Management

▪ A Fundamental Valuation Example

▪ A Text Analytics Example

▪ What about Cleaning Large Datasets?

▪ Summary and Resources

Page 3: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

7

Typical Challenges in Data Cleaning, Management

▪ We are Drowning in Data

– Data Volume and Variety

– Different sources, types, sizes

– Garbage-in garbage-out

Page 4: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

8

So many Data Sources

Spark+Hadoop

Local disk

Shared folders

Databases

Flat files/Excel

DatafeedsWebpages

Page 5: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

9

So many kinds of Data

Page 6: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

10

So many kinds of Data

Page 7: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

11

Typical Challenges in Data Cleaning, Management

▪ Drowning in Data

– Data Volume and Variety

– Different sources, types, sizes

– Garbage-in garbage-out

▪ Poor Data Quality

Page 8: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

12

Poor Data Quality

Page 9: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

13

Typical Challenges in Data Cleaning, Management

▪ Drowning in Data

– Data Volume and Variety

– Different sources, types, sizes

– Garbage-in garbage-out

▪ Poor Data Quality

– Poorly formatted files

– Irregularly sampled data

– Redundant, Missing data, Outliers

▪ Need for more customized analytics

– No one size fits all

Page 10: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

14

Agenda

▪ Typical Challenges with Data Handling and Management

▪ A Fundamental Valuation Example

▪ A Text Analytics Example

▪ What about Cleaning Large Datasets?

▪ Summary and Resources

Page 11: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

15

Demo: Fundamental Valuation of S&P100 securities

Goal:

▪ Fundamental valuation for ranking stocks

based on historical EPS trends

Approach

▪ Access data from CSV files

▪ Preprocess to clean-up text (missing data and

outliers)

▪ Calculate strength of historical EPS trends

Page 12: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

16

How do we handle Missing Data?

Does missing data

have meaning?

Type of data

Replace value with value of preceding

instance

Large, temporarily ordered

data

Does the data follow a simple

distribution

Impute with simple ML

model

Remove instances with

missing data

Impute missing values with

column median

Impute missing values with column

mean

Numerical

Convert missing values to meaningful

number

Categorical

Missing values become their own

category

Yes

Dataset is big and little

data is missing at randomOtherwise

NoNo

Yes

Page 13: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

17

Summary: Fundamental Valuation Example

▪ Interactive tools to import, visualize

data

▪ Code generation from interactive

tools

▪ Built-in clean up functions

▪ Align and calculate group stats

▪ Save time

Page 14: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

18

Agenda

▪ Typical Challenges with Data Handling and Management

▪ A Fundamental Valuation Example

▪ A Text Analytics Example

▪ What about Cleaning Large Datasets?

▪ Summary and Resources

Page 15: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

19

Demo: Sentiment Analysis of SEC filings

Goal:

▪ Analyze the sentiment of SEC filings

for S&P 100 companies to use as a

stock picking/ranking indicator

Approach

▪ Access data directly from HTML/PDF

▪ Preprocess to clean-up text and deal

with domain-specific terms

▪ Predict sentiment

Page 16: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

20

Summary: Sentiment Analysis Example

▪ String Class in MATLAB

▪ Easy to use/read functions to do

text processing

▪ Text visualization functions

▪ Less regexp, more built-in

commands

▪ More processing, less time

Page 17: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

21

Agenda

▪ Typical Challenges with Data Handling and Management

▪ A Fundamental Valuation Example

▪ A Text Analytics Example

▪ What about Cleaning Large Datasets?

▪ Summary and Resources

Page 18: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

22

Demo: Technicals calculation to time the market

▪ Objective

– Calculate technical indicators on Big

intraday data

▪ Data

– Intraday tick data scraped from the web

– Missing data, outliers etc.

▪ Approach

– Preprocess data

– Explore data

– Calculate technicals

Page 19: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

23

How do you work with tall arrays in MATLAB?

▪ datastore

– Points to the data

▪ Tall array

– Variable representation of

the data in your workspace

▪ Functions

– Operate on tall arrays

>> fileLoc = ’.\datasets\*.csv';

>> ds = datastore(fileLoc);

>> tt = tall(ds);

>> tt = fillmissing(t,’nearest');

tall

Spark + Hadoop

Local diskShared folders

Databases

Page 20: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

24

Summary: Technicals Demo

▪ Big Data handled just like data that

fits in memory (Tall)

▪ No need for use of Mapreduce or

other Big Data

technologies/frameworks

▪ Easy Big Data visualization

▪ Scalability of MATLAB models

Page 21: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

25

Agenda

▪ Typical Challenges with Data Handling and Management

▪ A Fundamental Valuation Example

▪ A Text Analytics Example

▪ What about Cleaning Large Datasets?

▪ Summary and Resources

Page 22: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

26

Revisiting the Challenges with Handling Dirty Data

▪ Drowning in Data

– Data Volume and Variety

– Different sources, types, sizes

– Garbage-in garbage-out

▪ Poor Data Quality

– Poorly formatted files

– Irregularly sampled data

– Redundant, Missing data, Outliers

▪ No one size fits all solution for Data cleaning

Page 23: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

27

Get Training

Accelerate your learning curve:

- Customized curriculum

- Learn best practices

- Practice on real-world examples

Options to fit your needs:

- Self-paced (online)

- Instructor led (online and in-person)

- Customized curriculum (on-site)

CPE Approved

Provider

Page 24: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

28

Training Roadmap

MATLAB for Financial Applications

Programming Techniques

Interactive User Interfaces

Parallel Computing Time-Series Modeling (Econometrics)

Statistical Methods

Optimization Techniques

Data Analysis and Modeling Application Development

Risk Management

Machine Learning

Asset Allocation

Interfacing with Databases

Interfacing with Excel

Content for On-site Customization

Page 25: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

29

Want to Learn More?

Marshall Alphonso

Senior Finance Engineer, New York

[email protected]

Mike DeLucia

Senior Account Manager

[email protected]

Chuck Castricone

Account Manager

[email protected]

Account ManagersEngineers

Siddharth Sundar

Finance Application Engineer

[email protected]

Page 26: Cleaning Up and Managing Data in MATLAB · Typical Challenges in Data Cleaning, Management Drowning in Data –Data Volume and Variety –Different sources, types, sizes –Garbage-in

30© 2017 The MathWorks, Inc.