data wrangling

Upload: gramener

Post on 07-Jan-2017

Category: Data & Analytics

TRANSCRIPT

Page 1: Data Wrangling

DATA WRANGLING

FIND LOAD CLEAN

Page 2: Data Wrangling

DATA WRANGLING

FIND LOAD CLEAN

WHERE CAN I GET DATA FROM?

Page 3: Data Wrangling

THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA

Client data isn't easy to get

Public data isn't relevant

Page 4: Data Wrangling

We have internal information. Getting information from outside is our challenge. There’s no way of doing that.

– Senior Editor, Leading Media Company

Page 5: Data Wrangling

INDIA’S RELIGIONS

If you search on google.co.in for "how do I convert to", here are the suggestions Google shows.

Popularity influences the order, so there's a good chance that the religions on top are searched for more often.

Page 6: Data Wrangling

AUSTRALIA’S RELIGIONS

But be careful how you interpret it. In Australia, PDF is not a religion. Unless you're a data scientist.

Page 7: Data Wrangling

Page 8: Data Wrangling

USE MULTIPLE APPROACHES TO FIND YOUR DATA

1. Public data catalogues
• https://github.com/caesar0301/awesome-public-datasets
• https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md

2. Govt data websites
• https://data.gov.in/
• https://data.gov/
• https://data.gov.uk/
• https://data.gov.sg/
• http://publicdata.eu/

3. or search on Google
• https://www.google.com/

4. or ask people
• Humans™

Page 9: Data Wrangling

EXERCISE

LET'S FIND SOME DATASETS

(YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)

Page 10: Data Wrangling

DATA WRANGLING

FIND LOAD CLEAN

HOW DO I STORE & PROCESS DATA?

Page 11: Data Wrangling

WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'

Files

• Delimited text: CSV, TSV, PSV
• Formatted text: TXT, PRN
• Marked up text: HTML, XML, JSON, JSON Lines, YAML, SQL
• Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
• Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
• Graph formats: GEXF, GDF, GML, GraphML, GraphViz DOT
• Unstructured: TXT, PDF, Images, Audio, Video, ...

Use CSV when sharing tabular data. Use JSON for hierarchical data.

Databases

• In-memory databases: DataFrames
• Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
• Document databases: MongoDB, CouchDB, ElasticSearch, Firebase
• Distributed databases: HDFS, Spark
• Cloud data stores: BigQuery, DynamoDB, Redshift, Azure SQL Database, DocumentDB, ...
• APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...

Use in-memory databases where you can, else relational databases. Don't analyse big data. Shrink it.
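As a minimal sketch of the two file recommendations above, the snippet below reads tabular data from CSV and hierarchical data from JSON using only Python's standard library. The sample text, column names and values are made up for illustration; real data would come from a file or URL.

```python
import csv
import io
import json

# Made-up sample data standing in for real files.
CSV_TEXT = "city,population\nAlpha,100\nBeta,250\n"
JSON_TEXT = '{"state": "Gamma", "districts": [{"name": "Delta", "seats": 5}]}'

def load_csv(text):
    """Parse delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def load_json(text):
    """Parse hierarchical data into nested dicts and lists."""
    return json.loads(text)

rows = load_csv(CSV_TEXT)
doc = load_json(JSON_TEXT)

# CSV values arrive as strings; convert them explicitly when needed.
total_population = sum(int(row["population"]) for row in rows)
print(total_population, doc["districts"][0]["name"])
```

CSV keeps every value as text, so type conversion is a separate cleaning step. JSON preserves nesting and basic types, which is why it suits hierarchical data.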

Page 12: Data Wrangling

EXERCISE

LET'S LOAD FROM A SITE
THE GOOGLE SEARCH DATA YOU SAW EARLIER

LET'S LOAD A BIG DATASET
A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY

LET'S LOAD AN UNSTRUCTURED TABLE
A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF

Page 13: Data Wrangling

DATA WRANGLING

FIND LOAD CLEAN

HOW DO I FIX THE DATA ISSUES?

Page 14: Data Wrangling

CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES

• Fix rows & columns
• Fix missing values
• Standardise values
• Fix invalid values
• Filter data

When we receive a dataset, we find a pattern of things that go wrong. These can be fixed in specific ways.

Here's a workflow / checklist of things to look out for and fix.

After this, check if the data is complete, and sufficient to solve the problem.

Page 15: Data Wrangling

FIX ROWS AND COLUMNS

Fix rows (with examples)
• Delete incorrect rows: header rows, footer rows
• Delete summary rows: total, subtotal rows
• Delete extra rows: column number indicators (1), (2), ...; blank rows

Fix columns (with examples)
• Add column names if missing: files with missing header row
• Rename columns consistently: abbreviations, encoded columns
• Delete unnecessary columns: unidentified columns, irrelevant columns
• Split columns for more data: split http://host:port/path into [Host, Port, Path]
• Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
• Align misaligned columns: dataset may have shifted columns
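A few of these row and column fixes can be sketched in plain Python. The rows, names and URLs below are made up, and the rules for spotting junk rows are assumptions that would need adjusting for a real dataset.

```python
from urllib.parse import urlsplit

# Made-up raw rows, as they might arrive from a spreadsheet export.
raw = [
    ["Firstname", "Lastname", "URL"],               # header row
    ["(1)", "(2)", "(3)"],                          # column number indicators
    ["Asha", "Rao", "http://example.com:8080/a"],
    ["", "", ""],                                   # blank row
    ["Ravi", "Iyer", "http://example.org:80/b/c"],
    ["Total", "", ""],                              # summary row
]

def is_junk(row):
    """True for blank rows, column-number indicator rows and total rows."""
    cells = [cell.strip() for cell in row if cell.strip()]
    if not cells:
        return True
    if all(c.startswith("(") and c.endswith(")") for c in cells):
        return True
    return cells[0].lower() in {"total", "subtotal"}

def clean(raw_rows):
    """Drop junk rows, merge the name columns and split the URL column."""
    header, *body = raw_rows
    cleaned = []
    for row in body:
        if is_junk(row):
            continue
        record = dict(zip(header, row))
        parts = urlsplit(record["URL"])
        cleaned.append({
            "Name": f"{record['Firstname']} {record['Lastname']}",
            "Host": parts.hostname,
            "Port": parts.port,
            "Path": parts.path,
        })
    return cleaned

for record in clean(raw):
    print(record)
```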

Page 16: Data Wrangling

FIX MISSING VALUES

• Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
• Fill missing values with: a constant (e.g. zero); another column (e.g. created date defaults to updated date); a function (e.g. average of rows/columns); external data
• Remove missing values: delete row, delete column
• Fill partial missing values: missing time zone, century, etc.
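A rough sketch of the first two steps, assuming the sentinel strings listed above mark missing values. The function names and sample values are hypothetical.

```python
from statistics import mean

# Strings treated as "missing"; extend this set for a real dataset.
MISSING = {"", "NA", "XX", "999"}

def to_missing(value):
    """Return None for blanks and sentinel codes, else the stripped value."""
    stripped = value.strip()
    return None if stripped in MISSING else stripped

def fill_with_mean(values):
    """Replace None with the average of the values that are present."""
    present = [float(v) for v in values if v is not None]
    average = mean(present)
    return [average if v is None else float(v) for v in values]

def fill_from_column(row, target, source):
    """Default a missing target column to another column's value."""
    if row.get(target) is None:
        return {**row, target: row[source]}
    return row

ages = [to_missing(v) for v in ["30", "NA", "40", "999", " "]]
filled = fill_with_mean(ages)
record = fill_from_column({"created": None, "updated": "2016-10-23"},
                          "created", "updated")
print(ages, filled, record)
```

Filling with an average changes the distribution of the column, so deleting the row or leaving the value missing may be the better choice for some analyses.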

Page 17: Data Wrangling

STANDARDISE VALUES

Standardise numbers (with examples)
• Remove outliers: removing high and low values
• Standardise units: lbs to kgs, m/s for speed
• Scale values if required: fit to percentage scale
• Standardise precision: 2.1 to 2.10

Standardise text (with examples)
• Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
• Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
• Standardise format: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"
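Several of these standardisations are one-liners in Python. This is a minimal sketch: the function names are hypothetical, and the date parser assumes the input is always day/month/two-digit year.

```python
from datetime import datetime

LB_TO_KG = 0.45359237  # one international pound in kilograms

def lbs_to_kg(pounds):
    """Standardise units: pounds to kilograms, rounded to 2 decimals."""
    return round(pounds * LB_TO_KG, 2)

def fixed_precision(number):
    """Standardise precision: always show two decimal places."""
    return f"{number:.2f}"

def clean_text(text):
    """Collapse leading/trailing/multiple spaces and apply Title Case."""
    return " ".join(text.split()).title()

def standardise_date(text):
    """Convert dd/mm/yy to yyyy/mm/dd (assumes day-first input)."""
    return datetime.strptime(text, "%d/%m/%y").strftime("%Y/%m/%d")

def flip_name(text):
    """Convert "Last, First" to "First Last"; leave other names as they are."""
    if "," not in text:
        return text.strip()
    last, first = (part.strip() for part in text.split(",", 1))
    return f"{first} {last}"

print(lbs_to_kg(10), fixed_precision(2.1), clean_text("  narendra   MODI "))
print(standardise_date("23/10/16"), flip_name("Modi, Narendra"))
```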

Page 18: Data Wrangling

FIX INVALID VALUES

• Encode unicode properly: CP1252 instead of UTF-8
• Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN Code 110001 to "110001")
• Correct values not in list: non-existent country, PIN code
• Correct wrong structure: phone number with over 10 digits
• Correct values beyond range: temperature less than -273° C (0 K)
• Validate internal rules: Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M"

In these cases, treat the value as "missing". Remove it, or fix it with a formula. The formula may involve the value, row, column, entire dataset, or external data.
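The type conversions and rule checks above might look like this in Python. It is a sketch under assumptions: the field names are hypothetical, invalid values become None (that is, "missing"), and the month parser assumes English month abbreviations.

```python
from datetime import date, datetime

ABSOLUTE_ZERO_C = -273.15

def parse_number(text):
    """Convert strings such as "12,300" to int; None if not numeric."""
    try:
        return int(text.replace(",", ""))
    except ValueError:
        return None

def parse_year_month(text):
    """Convert "2013-Aug" to the first day of that month; None if invalid."""
    try:
        # %b depends on the locale; English abbreviations are assumed here.
        return datetime.strptime(text, "%Y-%b").date()
    except ValueError:
        return None

def valid_temperature(celsius):
    """Treat temperatures below absolute zero as missing."""
    return celsius if celsius >= ABSOLUTE_ZERO_C else None

def broken_rules(row):
    """Return a description of every internal rule the row violates."""
    problems = []
    if row["gross_sales"] < row["net_sales"]:
        problems.append("gross sales below net sales")
    if row["delivered"] < row["ordered"]:
        problems.append("delivered before ordered")
    if row["title"] == "Mr" and row["gender"] != "M":
        problems.append("title and gender disagree")
    return problems

order = {
    "gross_sales": 90, "net_sales": 100,
    "ordered": date(2016, 10, 23), "delivered": date(2016, 10, 20),
    "title": "Mr", "gender": "M",
}
pin_code = str(110001)  # number to string, so it is not treated as a quantity
print(parse_number("12,300"), parse_year_month("2013-Aug"), broken_rules(order))
```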

Page 19: Data Wrangling

FILTER DATA

• Deduplicate data: remove identical rows; remove rows where some columns are identical
• Filter rows: filter by segments; filter by date period
• Filter columns: pick columns relevant to analysis
• Aggregate data: group by required keys, aggregate the rest
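Deduplication and aggregation can be sketched with plain dictionaries. The rows and column names below are made up for illustration.

```python
from collections import defaultdict

# Made-up rows; the second row is an exact duplicate of the first.
rows = [
    {"state": "A", "year": 2011, "votes": 10},
    {"state": "A", "year": 2011, "votes": 10},
    {"state": "A", "year": 2016, "votes": 7},
    {"state": "B", "year": 2011, "votes": 5},
]

def dedupe(rows, keys=None):
    """Keep the first row for each key; by default all columns form the key."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in (keys or sorted(row)))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def aggregate(rows, key, value):
    """Group by one column and sum another."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

unique_rows = dedupe(rows)
recent = [row for row in unique_rows if row["year"] >= 2016]  # filter rows
votes_by_state = aggregate(unique_rows, "state", "votes")
print(len(unique_rows), recent, votes_by_state)
```

Passing `keys=["state"]` to `dedupe` covers the case where rows count as duplicates when only some columns are identical.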

Page 20: Data Wrangling

EXERCISE

ASSEMBLY ELECTION DATA

SOMETHING WE DID A FEW YEARS AGO, AND IS WELL DOCUMENTED

Page 21: Data Wrangling

The ECI website has this data.

Page 22: Data Wrangling

… and most of the data is in PDFs

Page 23: Data Wrangling

The PDF files have a reasonably clear structure

Page 24: Data Wrangling

… that translates into text that can be parsed

Page 25: Data Wrangling

… which, with some effort, can be converted into a structured format

… and at this point, we need to start checking for errors.

Page 26: Data Wrangling

At this point, we start checking what’s gone wrong

Each row here is one constituency.

The number of candidates that have contested in each constituency in every year is shown as a table.

You can see that some patterns emerge here.
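One way to build such a table is sketched below. The records and years are made up, and the idea that a constituency with gaps in some years may be a misspelt name is an assumption, not something the slides state.

```python
from collections import Counter

# Made-up parsed records: (constituency, year, candidate).
records = [
    ("BHADRACHALAM", 2004, "Candidate A"),
    ("BHADRACHALAM", 2004, "Candidate B"),
    ("BHADRACHALAM", 2009, "Candidate C"),
    ("BHADRACHELAM", 2009, "Candidate D"),
]

def candidate_counts(records):
    """Count candidates for every (constituency, year) pair."""
    return Counter((name, year) for name, year, _ in records)

def as_table(counts):
    """Pivot counts into {constituency: {year: count}}, using 0 for gaps."""
    years = sorted({year for _, year in counts})
    names = sorted({name for name, _ in counts})
    return {n: {y: counts.get((n, y), 0) for y in years} for n in names}

table = as_table(candidate_counts(records))
for name, row in table.items():
    print(name, row)
```

Rows with zeros in some years are worth a second look: the constituency may be new, or its name may have been spelt differently in that year's file.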

Page 27: Data Wrangling

Not every spelling error is easily identifiable by the first letter

• Parties are mis-spelt: MADMK, MAMAK, MDMK
• Party names change: AIADMK, ADMK, ADK
• Parties restructure: INC(I), INC
• Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM

Page 28: Data Wrangling

Fortunately, large scale data itself can provide a solution

Page 29: Data Wrangling

… with modern tools that support machine learning
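As a rough illustration of letting the data suggest its own corrections, the sketch below treats the most frequent spelling as canonical and maps close variants onto it. It uses `difflib` from Python's standard library, which does approximate string matching rather than machine learning, and it is not the tool shown in the slides. The frequencies and the 0.85 cutoff are made up.

```python
from collections import Counter
from difflib import get_close_matches

# Made-up name frequencies; the correct spelling is assumed to dominate.
names = (["BHADRACHALAM"] * 5
         + ["BHADRACHELAM", "BHADRAHCALAM"]
         + ["KHAMMAM"] * 4)

def canonical_map(names, cutoff=0.85):
    """Map each spelling to the most frequent spelling that closely matches."""
    counts = Counter(names)
    canonical = []  # spellings accepted as correct, most frequent first
    mapping = {}
    for name, _ in counts.most_common():
        match = get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping

mapping = canonical_map(names)
for variant, target in mapping.items():
    print(variant, "->", target)
```

Short codes such as party abbreviations are harder: distinct parties can differ by a single letter, so suggested merges need a manual review before they are applied.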

Page 30: Data Wrangling

DATA WRANGLING

FIND LOAD CLEAN