data wrangling


DATA WRANGLING

FIND LOAD CLEAN


WHERE CAN I GET DATA FROM?


THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA

Client data isn't easy to get

Public data isn't relevant

“We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”

– Senior Editor, Leading Media Company


INDIA’S RELIGIONS

If you search on google.co.in for "how do I convert to", here are the suggestions Google shows.

Popularity influences the order. So there's a good chance that the religions at the top are searched for more often.


AUSTRALIA’S RELIGIONS

But be careful how you interpret it. In Australia, PDF is not a religion. Unless you're a data scientist.


USE MULTIPLE APPROACHES TO FIND YOUR DATA

1. Public data catalogues
   https://github.com/caesar0301/awesome-public-datasets
   https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md

2. Govt data websites
   https://data.gov.in/
   https://data.gov/
   https://data.gov.uk/
   https://data.gov.sg/
   http://publicdata.eu/

3. or search on Google
   https://www.google.com/

4. or ask people
   Humans™


EXERCISE

LET'S FIND SOME DATASETS

(YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)


HOW DO I STORE & PROCESS DATA?


WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'

Files
• Delimited text: CSV, TSV, PSV
• Formatted text: TXT, PRN
• Marked up text: HTML, XML, JSON, JSON Lines, YAML, SQL
• Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
• Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
• Graph formats: GEXF, GDF, GML, GraphML, GraphViz DOT
• Unstructured: TXT, PDF, Images, Audio, Video, ...

Databases
• In-memory databases: DataFrames
• Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
• Document databases: MongoDB, CouchDB, Elasticsearch, Firebase
• Distributed databases: HDFS, Spark
• Cloud data stores: BigQuery, DynamoDB, Redshift, Azure SQL Database, DocumentDB, ...
• APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...

Use CSV when sharing tabular data. Use JSON for hierarchical data.

Use in-memory databases where the data fits, else relational databases. Don't analyse big data. Shrink it.
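The CSV versus JSON advice can be sketched with Python's standard library. This is only an illustration: the rows, column names and nesting below are made up for the example and are not from the original deck.

```python
import csv
import io
import json

# Illustrative tabular data: every row has the same columns.
rows = [
    {"state": "Kerala", "year": 2011, "seats": 140},
    {"state": "Goa", "year": 2012, "seats": 40},
]

# CSV suits flat tables: a header row, then one line per record.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["state", "year", "seats"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# JSON suits hierarchies: each state holds a list of its own elections.
nested = {"Kerala": {"elections": [{"year": 2011, "seats": 140}]}}
json_text = json.dumps(nested, indent=2)

# Reading CSV back yields strings, so types must be restored by hand.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that CSV loses type information on the way back in, while JSON keeps numbers as numbers. That is one reason type conversion appears later in the cleaning checklist.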


EXERCISE

LET'S LOAD FROM A SITE
THE GOOGLE SEARCH DATA YOU SAW EARLIER

LET'S LOAD A BIG DATASET
A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY

LET'S LOAD AN UNSTRUCTURED TABLE
A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
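For the big-dataset exercise, one way to "shrink it" is to stream the file and keep only the columns and rows you need. The sketch below uses the standard-library csv module. The function name `load_columns`, the column names and the sample data are hypothetical, and the in-memory sample stands in for a real file.

```python
import csv
import io

def load_columns(lines, columns, limit=None):
    """Stream a CSV and keep only the named columns (at most `limit` rows)."""
    reader = csv.DictReader(lines)
    kept = []
    for i, row in enumerate(reader):
        if limit is not None and i >= limit:
            break
        kept.append({name: row[name] for name in columns})
    return kept

# Stand-in for a large file; in practice pass an open file object instead,
# for example open("survey.csv", newline="") (hypothetical file name).
sample = io.StringIO(
    "id,age,city,q1,q2\n"
    "1,31,Pune,yes,no\n"
    "2,45,Delhi,no,no\n"
    "3,28,Kochi,yes,yes\n"
)
subset = load_columns(sample, ["age", "city"], limit=2)
```

Because the reader works line by line, memory use depends on what is kept, not on the size of the file.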


HOW DO I FIX THE DATA ISSUES?


CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES

1. Fix rows & columns
2. Fix missing values
3. Standardise values
4. Fix invalid values
5. Filter data

When we receive a dataset, we find a pattern of things that go wrong. These can be fixed in specific ways.

Here's a workflow / checklist of things to look out for and fix.

After this, check if the data is complete, and sufficient to solve the problem.


FIX ROWS AND COLUMNS

Fix rows
• Delete incorrect rows: header rows, footer rows
• Delete summary rows: total, subtotal rows
• Delete extra rows: column number indicators (1), (2), ...; blank rows

Fix columns
• Add column names if missing: files with missing header row
• Rename columns consistently: abbreviations, encoded columns
• Delete unnecessary columns: unidentified columns, irrelevant columns
• Split columns for more data: split http://host:port/path into [Host, Port, Path]
• Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
• Align misaligned columns: dataset may have shifted columns
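A few of these row and column fixes can be sketched in plain Python. The raw rows, the `is_data_row` helper and the rule for spotting totals are assumptions made for illustration; real files will need their own rules.

```python
from urllib.parse import urlsplit

# Hypothetical raw rows as they might arrive from a spreadsheet export.
raw = [
    ["First", "Last", "URL"],                      # header row
    ["(1)", "(2)", "(3)"],                         # column number indicators
    ["Asha", "Rao", "http://example.com:8080/a"],
    ["", "", ""],                                  # blank row
    ["Ravi", "Iyer", "http://example.org:80/b/c"],
    ["Total", "", ""],                             # summary row
]

def is_data_row(row):
    """Keep rows that are not blank, not '(n)' indicators and not totals."""
    if not any(cell.strip() for cell in row):
        return False
    if all(cell.startswith("(") and cell.endswith(")") for cell in row):
        return False
    return row[0].strip().lower() not in {"total", "subtotal"}

header, *body = raw
records = []
for first, last, url in filter(is_data_row, body):
    parts = urlsplit(url)
    records.append({
        "Name": f"{first} {last}",   # merge two columns into one identifier
        "Host": parts.hostname,      # split one column into three
        "Port": parts.port,
        "Path": parts.path,
    })
```

Only the two genuine data rows survive, each with a merged Name and the URL split into Host, Port and Path.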


FIX MISSING VALUES

Fix missing values
• Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
• Fill missing values with...
  Constant (e.g. zero)
  Column (e.g. created date defaults to updated date)
  Function (e.g. average of rows/columns)
  External data
• Remove missing values: delete row, delete column
• Fill partial missing values: missing time zone, century, etc.
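The missing-value steps might look like this in plain Python. The marker set, field names and sample rows are assumptions for the sketch. Which markers mean "missing" depends on the dataset, and "999" in particular can be a real value.

```python
from statistics import mean

MISSING_MARKERS = {"", "NA", "XX", "999"}  # assumed markers; adjust per dataset

def to_missing(value):
    """Return None when the value is a known placeholder for 'missing'."""
    return None if value.strip() in MISSING_MARKERS else value

rows = [
    {"age": "34", "created": "2016-01-05", "updated": "2016-02-01"},
    {"age": "NA", "created": "", "updated": "2016-03-10"},
    {"age": "40", "created": "2016-04-02", "updated": "XX"},
]
rows = [{key: to_missing(val) for key, val in row.items()} for row in rows]

# Fill from a function: the mean of the ages that are present.
known_ages = [int(r["age"]) for r in rows if r["age"] is not None]
average_age = mean(known_ages)

for r in rows:
    r["age"] = int(r["age"]) if r["age"] is not None else average_age
    # Fill from another column: created date defaults to updated date.
    if r["created"] is None:
        r["created"] = r["updated"]

# Remove rows that still have a missing value.
complete = [r for r in rows if all(v is not None for v in r.values())]
```

The order matters: markers are converted to None first, so that the average is computed only from real values.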


STANDARDISE VALUES

Standardise numbers
• Remove outliers: removing high and low values
• Standardise units: lbs to kgs, m/s for speed
• Scale values if required: fit to percentage scale
• Standardise precision: 2.1 to 2.10

Standardise text
• Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
• Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
• Standardise format: 23/10/16 to 2016/10/23; “Modi, Narendra” to “Narendra Modi”
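Some of these standardisation steps can be sketched as small helper functions. The function names are hypothetical, and the date helper assumes day-first two-digit-year input, which the source example suggests but does not state.

```python
import re
from datetime import datetime

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

def lbs_to_kg(pounds, digits=2):
    """Standardise units and precision in one step."""
    return round(pounds * LB_TO_KG, digits)

def clean_text(text):
    """Collapse repeated whitespace, trim the ends and use Title Case."""
    return re.sub(r"\s+", " ", text).strip().title()

def standardise_date(text):
    """Convert DD/MM/YY to YYYY/MM/DD (assumes day-first input)."""
    return datetime.strptime(text, "%d/%m/%y").strftime("%Y/%m/%d")

def first_name_first(name):
    """Turn 'Last, First' into 'First Last'; leave other names alone."""
    if "," not in name:
        return name.strip()
    last, first = (part.strip() for part in name.split(",", 1))
    return f"{first} {last}"
```

For example, `standardise_date("23/10/16")` gives `"2016/10/23"` and `first_name_first("Modi, Narendra")` gives `"Narendra Modi"`.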


FIX INVALID VALUES

Fix invalid values
• Encode unicode properly: CP1252 instead of UTF-8
• Convert incorrect data types:
  String to number: "12,300"
  String to date: "2013-Aug"
  Number to string: PIN Code 110001 to "110001"
• Correct values not in list: non-existent country, PIN code
• Correct wrong structure: phone number with over 10 digits
• Correct values beyond range: temperature less than -273 °C (0 K)
• Validate internal rules:
  Gross sales > Net sales
  Date of delivery > Date of ordering
  If Title is "Mr" then Gender is "M"

In these cases, treat the value as "missing". Remove it, or fix it with a formula. The formula may involve the value, row, column, entire dataset, or external data.
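Type conversion and rule checks along these lines could be sketched as follows. All function and field names are hypothetical. The sketch returns None for values it cannot repair, in line with "treat the value as missing". It also assumes that equal gross and net sales, and same-day delivery, are acceptable, which is slightly looser than the strict ">" in the rules above.

```python
from datetime import datetime

def parse_number(text):
    """'12,300' -> 12300; returns None if it is not a whole number."""
    try:
        return int(text.replace(",", ""))
    except ValueError:
        return None

def parse_year_month(text):
    """'2013-Aug' -> date(2013, 8, 1); None if it does not parse.

    %b depends on the locale; English month names are assumed here.
    """
    try:
        return datetime.strptime(text, "%Y-%b").date()
    except ValueError:
        return None

def pin_to_text(pin):
    """Store a PIN code as a six-character string, not a number."""
    return f"{pin:06d}"

def valid_temperature(celsius):
    """Treat anything below absolute zero as missing."""
    return celsius if celsius >= -273.15 else None

def broken_rules(row):
    """Return the names of the internal rules that a row breaks."""
    problems = []
    if row["gross_sales"] < row["net_sales"]:
        problems.append("gross < net")
    if row["delivered"] < row["ordered"]:
        problems.append("delivered before ordered")
    if row["title"] == "Mr" and row["gender"] != "M":
        problems.append("title/gender mismatch")
    return problems
```

Returning a list of broken rules, rather than a single True or False, makes it easier to report why a row was rejected.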


FILTER DATA

Filter data
• Deduplicate data: remove identical rows; remove rows where some columns are identical
• Filter rows: filter by segments; filter by date period
• Filter columns: pick columns relevant to analysis
• Aggregate data: group by required keys, aggregate the rest
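The four filtering steps can be sketched together in plain Python. The rows and vote counts below are invented for illustration only.

```python
from collections import defaultdict

# Made-up rows; the vote counts are not real figures.
rows = [
    {"state": "TN", "year": 2011, "party": "AIADMK", "votes": 100},
    {"state": "TN", "year": 2011, "party": "AIADMK", "votes": 100},  # duplicate
    {"state": "TN", "year": 2011, "party": "DMK", "votes": 80},
    {"state": "KL", "year": 2006, "party": "INC", "votes": 60},
]

# Deduplicate: keep the first occurrence of each identical row.
seen = set()
unique = []
for row in rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique.append(row)

# Filter rows by date period, and keep only the columns the analysis needs.
recent = [
    {"state": r["state"], "votes": r["votes"]}
    for r in unique
    if r["year"] >= 2010
]

# Aggregate: group by the required key and sum the rest.
votes_by_state = defaultdict(int)
for r in recent:
    votes_by_state[r["state"]] += r["votes"]
```

To deduplicate on only some columns, build `key` from just those columns instead of the whole row.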


EXERCISE

ASSEMBLY ELECTION DATA

SOMETHING WE DID A FEW YEARS AGO, AND IS WELL DOCUMENTED


The ECI website has this data.


… and most of the data is in PDFs


The PDF files have a reasonably clear structure


… that translates into text that can be parsed


… which, with some effort, can be converted into a structured format

… and at this point, we need to start checking for errors.


At this point, we start checking what’s gone wrong

Each row here is one constituency.

The number of candidates that have contested in each constituency in every year is shown as a table.

You can see that some patterns emerge here.
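A table like the one described, with constituencies as rows, years as columns and candidate counts as cells, can be built with a Counter. The records below are hypothetical stand-ins for the parsed election data. The point of the sketch is that a misspelt name shows up as a row with gaps.

```python
from collections import Counter

# Hypothetical parsed records: (constituency, year, candidate).
records = [
    ("BHADRACHALAM", 1999, "A"), ("BHADRACHALAM", 1999, "B"),
    ("BHADRACHALAM", 2004, "C"), ("BHADRACHALAM", 2004, "D"),
    ("BHADRACHELAM", 2009, "E"), ("BHADRACHELAM", 2009, "F"),
]

counts = Counter((name, year) for name, year, _ in records)
years = sorted({year for _, year, _ in records})
names = sorted({name for name, _, _ in records})

# One row per constituency, one column per year; 0 marks a gap.
table = {name: [counts[(name, year)] for year in years] for name in names}

# Names that are missing in some years are candidates for spelling errors.
suspects = [name for name, row in table.items() if 0 in row]
```

Here both spellings are flagged, because each covers only some of the years. Seen side by side, the two rows make the underlying problem visible.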


Not every spelling error is easily identifiable by the first letter

Parties are mis-spelt: MADMK, MAMAK, MDMK

Party names change: AIADMK, ADMK, ADK

Parties restructure: INC(I), INC

Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM


Fortunately, large scale data itself can provide a solution


… with modern tools that support machine learning
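One simple way the data itself can suggest corrections is to treat the most frequent spelling as correct and map rare, similar spellings onto it. The sketch below uses string similarity from Python's standard-library difflib, not a machine learning tool, and is not necessarily the approach used in the original project. The function name `canonical_map` and the 0.8 cutoff are assumptions.

```python
from collections import Counter
from difflib import get_close_matches

def canonical_map(values, cutoff=0.8):
    """Map each spelling to the most frequent spelling that closely matches it."""
    counts = Counter(values)
    # Visit spellings from most to least frequent so that common
    # spellings are the ones that become canonical.
    ordered = [name for name, _ in counts.most_common()]
    canonical = []
    mapping = {}
    for name in ordered:
        match = get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping

names = ["BHADRACHALAM"] * 5 + ["BHADRACHELAM", "BHADRAHCALAM", "KHAMMAM"]
mapping = canonical_map(names)
```

With this sample, both rare variants map to BHADRACHALAM while KHAMMAM stays as it is. Short names such as party abbreviations are much easier to merge wrongly, so the resulting mapping is best reviewed by hand before it is applied.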
