Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech


Post on 31-Aug-2020


Page 1: Class Website CX4242: Data Cleaning - Visualization · Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech. Data Cleaning

Class Website

CX4242: Data Cleaning

Mahdi Roozbahani

Lecturer, Computational Science and Engineering, Georgia Tech

Page 2

Data Cleaning

How dirty is real data?

Page 3

How dirty is real data?

Examples

• Jan 19, 2016

• January 19, 16

• 1/19/16

• 2006-01-19

• 19/1/16

http://blogs.verdantis.com/wp-content/uploads/2015/02/Data-cleansing.jpg
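One way to bring date strings like the examples above into a single representation is to try a list of candidate formats in turn. This is only a minimal sketch using Python's standard `datetime` module; the name `parse_date`, the format list, and the choice to try month/day/year before day/month/year are assumptions made for illustration, and month names are assumed to be in English.

```python
from datetime import date, datetime

# Candidate formats, tried in order. The order is an assumption: an ambiguous
# slash date such as "1/2/16" is read as month/day/year first.
CANDIDATE_FORMATS = [
    "%b %d, %Y",  # Jan 19, 2016
    "%B %d, %y",  # January 19, 16
    "%Y-%m-%d",   # 2006-01-19
    "%m/%d/%y",   # 1/19/16
    "%d/%m/%y",   # 19/1/16 (only reached when month/day/year fails)
]


def parse_date(text: str) -> date:
    """Return the first successful parse of `text`, or raise ValueError."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


if __name__ == "__main__":
    for raw in ["Jan 19, 2016", "January 19, 16", "1/19/16", "2006-01-19", "19/1/16"]:
        print(f"{raw!r:>18} -> {parse_date(raw).isoformat()}")
```

Note that a string such as "1/2/16" parses without error either way, so the format order silently decides its meaning. Real data usually needs a per-source rule rather than one global guess.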

Page 4

How dirty is real data?

Discuss with your neighbors (groups of 2-3)

60 seconds

Come up with 5+ kinds of “data dirtiness”

Page 5

How dirty is real data?

• Missing or corrupted (NaN, null)

• Numbers stored as strings (“1232”)

• Different units

• Spelling/typos

• Different string encodings

• Outliers (due to data recording)

• Geocoding, time zone offsets (missing +, -)

• Duplicate data

• Fake data (malicious)

• SQL injection

• Different software versions generating slightly different formats

• Caps lock

• Semi-colons

• Structure (JSON objects)

• Invisible characters

• Different delimiters

• Indentation
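A few of the kinds of dirtiness listed above (missing values, numbers stored as strings, caps lock, invisible characters, duplicates) can be handled with small normalization helpers. The following is a rough sketch with the Python standard library only. The function names, the `(name, amount)` record shape, and the set of strings treated as "missing" are all assumptions chosen for illustration.

```python
import math
import unicodedata

# Assumed spellings of "missing"; real datasets may use others.
MISSING_TOKENS = {"", "nan", "null", "none", "n/a"}


def clean_text(value):
    """Normalize one cell to text; return None when it counts as missing."""
    if value is None:
        return None
    text = unicodedata.normalize("NFKC", str(value))
    # Drop invisible format characters (category Cf), e.g. zero-width space.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    text = " ".join(text.split())  # trim and collapse whitespace
    return None if text.lower() in MISSING_TOKENS else text


def to_number(value):
    """Convert numbers stored as strings ("1232") to float; None if missing or unparseable."""
    text = clean_text(value)
    if text is None:
        return None
    try:
        number = float(text.replace(",", ""))
    except ValueError:
        return None
    return None if math.isnan(number) else number


def clean_records(rows):
    """Clean (name, amount) pairs and drop duplicates, keeping first occurrences."""
    seen, cleaned = set(), []
    for name, amount in rows:
        name = clean_text(name)
        record = (name.lower() if name else None, to_number(amount))
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    rows = [("  ALICE\u200b ", "1232"), ("alice", 1232), ("Bob", "NaN"), ("bob", None)]
    print(clean_records(rows))
```

Lowercasing names before deduplication is a deliberate simplification. It merges "ALICE" and "alice", but it would also merge values that differ meaningfully by case.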

Page 6

Importance of Data Cleaning

Page 7

“80%” Time Spent on Data Preparation

Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says [Forbes]

http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#73bf5b137f75

Page 8

Data Janitor

Page 9

Writing “Clean Code”

• Be careful with trailing whitespace

• Indent code (spaces vs tabs) following the coding practices in your team/company

https://google.github.io/styleguide/javaguide.html#s4.2-block-indentation

“Trailing whitespace is evil. Don't commit evil into your repo.”

http://codeimpossible.com/2012/04/02/trailing-whitespace-is-evil-don-t-commit-evil-into-your-repo/

“…there’s no way I'm going to be with someone who uses spaces over tabs…”

http://www.businessinsider.com/tabs-vs-spaces-from-silicon-valley-2016-5
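Trailing whitespace is easy to detect and remove automatically. Below is a small sketch in plain Python. The two function names are hypothetical, and many editors and linters already offer the same check, so treat this only as an illustration of the idea.

```python
def find_trailing_whitespace(text: str) -> list:
    """Return 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces and tabs from every line, keeping line endings."""
    cleaned = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")   # line content without its line ending
        ending = line[len(body):]    # the original line ending, if any
        cleaned.append(body.rstrip(" \t") + ending)
    return "".join(cleaned)


if __name__ == "__main__":
    sample = "def f():  \n\treturn 1\t\nok\n"
    print(find_trailing_whitespace(sample))
    print(repr(strip_trailing_whitespace(sample)))
```

Leading indentation is left untouched, so the spaces vs tabs choice stays with the team's style guide.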

Page 10

Both available free for GT students on http://safaribooksonline.com/

Page 11

Data Cleaners

Watch videos

• Data Wrangler (research at Stanford)

• Open Refine (previously Google Refine)

Write down

• Examples of data dirtiness

• Tools’ features demoed (or that you like)

We will collectively summarize similarities and differences afterwards

Open Refine: http://openrefine.org

Data Wrangler: http://vis.stanford.edu/wrangler/

Page 14

What can Open Refine and Wrangler do?

• [w,o] undo, redo

• [o,w] history of data

• [o] transform data (e.g., take log)

• [w] data editing/highlighting/interaction may be easier

• [o] clustering

• [w] transpose/pivot

• [w] fill in missing data

• [w] suggestions + preview

[o] = Open Refine

[w] = Data Wrangler
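The clustering feature can be illustrated with a key-collision approach: compute a normalized "fingerprint" for each value and group values that share one. The sketch below is loosely modeled on that idea and is not Open Refine's actual implementation. The function names and the exact normalization steps are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key that ignores case, punctuation, accents and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep groups with 2+ distinct spellings."""
    groups = defaultdict(list)
    for value in values:
        key = fingerprint(value)
        if value not in groups[key]:
            groups[key].append(value)
    return [members for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    names = ["Georgia Tech", "georgia tech.", "Tech, Georgia", "Stanford", "Georgia  Tech "]
    for members in cluster(names):
        print(members)
```

Each resulting group is a candidate for merging into one canonical spelling. A person should still review the groups, since a fingerprint collision does not prove that two values mean the same thing.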

Page 15

The videos only show some of the tools’ features.

Try them out.

Open Refine: http://openrefine.org

Data Wrangler: http://vis.stanford.edu/wrangler/
