data wrangling
1
DATA WRANGLING
FIND LOAD CLEAN
2
DATA WRANGLING
FIND LOAD CLEAN
WHERE CAN I GET DATA FROM?
3
THERE'S CLIENT DATA, AND THERE'S PUBLIC DATA
Client data isn't easy to get
Public data isn't relevant
“We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”
– Senior Editor, Leading Media Company
5
INDIA’S RELIGIONS
If you search on google.co.in for "how do I convert to", here are the suggestions Google shows.
Popularity influences the order, so there's a good chance that the religions on top are searched for more often.
6
AUSTRALIA’S RELIGIONS
But be careful how you interpret it. In Australia, PDF is not a religion. Unless you're a data scientist.
7
8
USE MULTIPLE APPROACHES TO FIND YOUR DATA
1. Public data catalogues
   https://github.com/caesar0301/awesome-public-datasets
   https://github.com/rasbt/pattern_classification/blob/master/resources/dataset_collections.md
2. Govt data websites
   https://data.gov.in/
   https://data.gov/
   https://data.gov.uk/
   https://data.gov.sg/
   http://publicdata.eu/
3. or search on Google
   https://www.google.com/
4. or ask people
   Humans™
9
EXERCISE
LET'S FIND SOME DATASETS
(YOU PICK WHAT YOU WANT TO FIND. WE WILL SEARCH FOR IT)
10
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I STORE & PROCESS DATA?
11
WE LOAD DATA INTO OUR PROGRAMS OR OTHERS'
Files
• Delimited text: CSV, TSV, PSV
• Formatted text: TXT, PRN
• Marked up text: HTML, XML, JSON, JSON Lines, YAML, SQL
• Spreadsheets: XLS*, ODS, MDB, ACCDB, DBF
• Specialised formats: HDF5, SQLite, DTA (Stata), C4.5, CDF
• Graph formats: GEXF, GDF, GML, GraphML, GraphViz DOT
• Unstructured: TXT, PDF, Images, Audio, Video, ...
Databases
• In-memory databases: DataFrames
• Relational databases: Oracle, MySQL, PostgreSQL, SQL Server, DB2, Sybase, Informix, ...
• Document databases: MongoDB, CouchDB, ElasticSearch, Firebase
• Distributed databases: HDFS, Spark
• Cloud data stores: BigQuery, DynamoDB, Redshift, Azure SQL Database, DocumentDB, ...
• APIs: Twitter, Facebook, Google, Wikipedia, YouTube, ...
Use CSV when sharing tabular data.
Use JSON for hierarchical data.
Use in-memory databases where the data fits, else relational databases.
Don't analyse big data. Shrink it.
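One way to "shrink" a large delimited file while loading it is to keep only the columns you need and, optionally, only a sample of the rows. The sketch below uses Python's standard csv module; the sample data, the column names, and the load_columns helper are all hypothetical and only illustrate the idea.

```python
import csv
import io

# Hypothetical sample standing in for a much larger survey file.
RAW = """id,age,city,q1,q2
1,34,Pune,yes,no
2,27,Delhi,no,no
3,45,Pune,yes,yes
4,19,Chennai,no,yes
"""

def load_columns(handle, keep, every=1):
    """Stream a CSV, keeping only the `keep` columns and every `every`-th row."""
    reader = csv.DictReader(handle)
    rows = []
    for i, row in enumerate(reader):
        if i % every == 0:
            rows.append({col: row[col] for col in keep})
    return rows

# Keep two of the five columns and every second row.
small = load_columns(io.StringIO(RAW), keep=["id", "city"], every=2)
print(small)
```

Because the reader streams one row at a time, the full file never has to fit in memory; with a real file you would pass an open file handle instead of the StringIO object.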
12
EXERCISE
LET'S LOAD FROM A SITE: THE GOOGLE SEARCH DATA YOU SAW EARLIER
LET'S LOAD A BIG DATASET: A FEW COLUMNS FROM A LEAKED OK CUPID SURVEY
LET'S LOAD AN UNSTRUCTURED TABLE: A TABLE FROM THE MEDICAL CERTIFICATION OF CAUSE OF DEATH 2013 PDF
13
DATA WRANGLING
FIND LOAD CLEAN
HOW DO I FIX THE DATA ISSUES?
14
CHECK FOR ALL THESE DATA CLEANSING ACTIVITIES
Fix rows & columns
Fix missing values
Standardise values
Fix invalid values
Filter data
When we receive a dataset, we find a pattern of things that go wrong. These can be fixed in specific ways.
Here's a workflow / checklist of things to look out for and fix.
After this, check if the data is complete, and sufficient to solve the problem.
15
FIX ROWS AND COLUMNS
Fix rows
• Delete incorrect rows: header rows, footer rows
• Delete summary rows: total, subtotal rows
• Delete extra rows: column number indicators (1), (2), ...; blank rows
Fix columns
• Add column names if missing: files with missing header row
• Rename columns consistently: abbreviations, encoded columns
• Delete unnecessary columns: unidentified columns, irrelevant columns
• Split columns for more data: split http://host:port/path into [Host, Port, Path]
• Merge columns for identifiers: merge Firstname, Lastname into Name; merge State, District into FullDistrict
• Align misaligned columns: dataset may have shifted columns
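A few of the row and column fixes above can be sketched with the standard library alone. The sample rows, the COLUMNS list, and the is_junk and fix_row helpers are assumptions made for illustration; real files will need their own rules for spotting junk rows.

```python
from urllib.parse import urlsplit

# Hypothetical raw rows from a file with no header row.
raw_rows = [
    ["(1)", "(2)", "(3)"],                          # column number indicators
    ["Asha", "Rao", "http://example.com:8080/a/b"],
    ["", "", ""],                                   # blank row
    ["Total", "", ""],                              # summary row
    ["Ravi", "Iyer", "http://example.org:80/index"],
]

# Column names added because the file has none.
COLUMNS = ["Firstname", "Lastname", "Url"]

def is_junk(row):
    """True for blank rows, summary rows, and column number indicator rows."""
    cells = [c.strip() for c in row]
    if not any(cells):
        return True
    if cells[0].lower() in {"total", "subtotal"}:
        return True
    return all(c.startswith("(") and c.endswith(")") for c in cells)

def fix_row(row):
    """Name the columns, split the URL, and merge the name columns."""
    rec = dict(zip(COLUMNS, (c.strip() for c in row)))
    parts = urlsplit(rec.pop("Url"))
    rec["Host"], rec["Port"], rec["Path"] = parts.hostname, parts.port, parts.path
    rec["Name"] = f"{rec['Firstname']} {rec['Lastname']}"
    return rec

clean = [fix_row(r) for r in raw_rows if not is_junk(r)]
for rec in clean:
    print(rec)
```

Keeping the junk-row test separate from the per-row fix makes it easy to log what was dropped before deleting anything for good.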
16
FIX MISSING VALUES
Fix missing values
• Set values as missing values: treat blanks, "NA", "XX", "999", etc. as missing
• Fill missing values with: a constant (e.g. zero); a column (e.g. created date defaults to updated date); a function (e.g. average of rows/columns); external data
• Remove missing values: delete row, delete column
• Fill partial missing values: missing time zone, century, etc.
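The first two steps, marking sentinel values as missing and then filling them, might look like the sketch below. The MISSING set mirrors the examples on this slide; the function names and sample values are hypothetical.

```python
from statistics import mean

# Markers treated as missing; adjust this set for each dataset.
MISSING = {"", "NA", "XX", "999"}

def to_number(cell):
    """Convert a text cell to float, or None if it is a missing-value marker."""
    cell = cell.strip()
    return None if cell in MISSING else float(cell)

def fill_with_mean(values):
    """Fill with a function: replace None with the average of the known values."""
    known = [v for v in values if v is not None]
    avg = mean(known)
    return [avg if v is None else v for v in values]

ages = [to_number(c) for c in ["34", "NA", "28", "999", ""]]
filled = fill_with_mean(ages)
print(ages)
print(filled)

# Fill with a column: the created date defaults to the updated date.
records = [{"created": None, "updated": "2016-10-23"}]
for rec in records:
    rec["created"] = rec["created"] or rec["updated"]
print(records)
```

Doing the "mark as missing" step first matters: if "999" were left in as a number, it would distort the average used to fill the gaps.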
17
STANDARDISE VALUES
Standardise numbers
• Remove outliers: removing high and low values
• Standardise units: lbs to kgs, m/s for speed
• Scale values if required: fit to percentage scale
• Standardise precision: 2.1 to 2.10
Standardise text
• Remove extra characters: common prefix/suffix, leading/trailing/multiple spaces
• Standardise case: uppercase, lowercase, Title Case, Sentence case, etc.
• Standardise format: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"
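A minimal sketch of some of these standardisation steps follows. The helper names are hypothetical, and the date helper assumes the input is day/month/two-digit-year, which has to be confirmed for each source.

```python
import re
from datetime import datetime

def clean_text(s):
    """Collapse runs of whitespace and trim leading/trailing spaces."""
    return re.sub(r"\s+", " ", s).strip()

def flip_name(s):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    if "," in s:
        last, first = (p.strip() for p in s.split(",", 1))
        return f"{first} {last}"
    return s

def to_standard_date(s, fmt="%d/%m/%y"):
    """Reformat a date; the default assumes day/month/two-digit year."""
    return datetime.strptime(s, fmt).strftime("%Y/%m/%d")

def lbs_to_kg(lbs, digits=2):
    """Standardise units and precision in one step."""
    return round(lbs * 0.45359237, digits)

print(clean_text("  narendra   MODI ").title())
print(flip_name("Modi, Narendra"))
print(to_standard_date("23/10/16"))
print(lbs_to_kg(150))
```

Each helper does one thing, so they can be applied column by column in whatever order the dataset needs.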
18
FIX INVALID VALUES
Fix invalid values
• Encode unicode properly: CP1252 instead of UTF-8
• Convert incorrect data types: string to number ("12,300"); string to date ("2013-Aug"); number to string (PIN code 110001 to "110001")
• Correct values not in list: non-existent country, PIN code
• Correct wrong structure: phone number with over 10 digits
• Correct values beyond range: temperature less than -273 °C (0 K)
• Validate internal rules: Gross sales > Net sales; Date of delivery > Date of ordering; if Title is "Mr" then Gender is "M"
In these cases, treat the value as "missing". Remove it, or fix it with a formula. The formula may involve the value, row, column, entire dataset, or external data.
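The type conversions and rule checks above could be sketched as follows. The field names in the sample record are hypothetical, the month parsing assumes an English locale, and this sketch flags only clear violations (equal values pass).

```python
from datetime import date, datetime

def parse_number(s):
    """String to number: '12,300' -> 12300."""
    return int(s.replace(",", ""))

def parse_year_month(s):
    """String to date: '2013-Aug' -> first day of that month (English locale assumed)."""
    return datetime.strptime(s, "%Y-%b").date()

def pin_to_text(n):
    """Number to string: PIN codes are identifiers, not quantities."""
    return f"{n:06d}"

def fix_mojibake(s):
    """Repair UTF-8 text that was wrongly decoded as CP1252."""
    return s.encode("cp1252").decode("utf-8")

def invalid_fields(rec):
    """Return the names of fields that break a range or internal rule."""
    bad = []
    if rec["temp_c"] < -273.15:
        bad.append("temp_c")
    if rec["gross_sales"] < rec["net_sales"]:
        bad.append("net_sales")
    if rec["delivered"] < rec["ordered"]:
        bad.append("delivered")
    if rec["title"] == "Mr" and rec["gender"] != "M":
        bad.append("gender")
    return bad

rec = {
    "temp_c": -300.0, "gross_sales": 100, "net_sales": 90,
    "ordered": date(2016, 10, 20), "delivered": date(2016, 10, 23),
    "title": "Mr", "gender": "F",
}
bad = invalid_fields(rec)
for field in bad:
    rec[field] = None  # treat invalid values as missing

print(parse_number("12,300"), parse_year_month("2013-Aug"), pin_to_text(110001))
print(fix_mojibake("caf\u00c3\u00a9"))
print(bad)
```

Once invalid values are set to None, the same fill or remove choices used for missing values apply to them.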
19
FILTER DATA
Filter data
• Deduplicate data: remove identical rows; remove rows where some columns are identical
• Filter rows: filter by segments, filter by date period
• Filter columns: pick columns relevant to analysis
• Aggregate data: group by required keys, aggregate the rest
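Deduplicating, filtering, and aggregating can be chained in a few lines. The sample rows and the dedupe helper below are hypothetical; passing keys to dedupe covers the "some columns are identical" case.

```python
from collections import defaultdict

# Hypothetical sample rows.
rows = [
    {"state": "TN", "year": 2011, "party": "A", "votes": 100},
    {"state": "TN", "year": 2011, "party": "A", "votes": 100},  # identical duplicate
    {"state": "TN", "year": 2011, "party": "B", "votes": 80},
    {"state": "AP", "year": 2009, "party": "A", "votes": 60},
    {"state": "TN", "year": 2006, "party": "A", "votes": 90},
]

def dedupe(rows, keys=None):
    """Keep the first row for each distinct combination of `keys` (default: all columns)."""
    seen, out = set(), []
    for row in rows:
        ident = tuple(row[k] for k in (keys or sorted(row)))
        if ident not in seen:
            seen.add(ident)
            out.append(row)
    return out

unique = dedupe(rows)                               # remove identical rows
recent = [r for r in unique if r["year"] >= 2009]   # filter by date period

# Group by state and aggregate the votes.
totals = defaultdict(int)
for r in recent:
    totals[r["state"]] += r["votes"]

print(dict(totals))
```

Deduplicating before aggregating matters here: the repeated row would otherwise inflate the TN total.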
20
EXERCISE
ASSEMBLY ELECTION DATA
SOMETHING WE DID A FEW YEARS AGO, AND IS WELL DOCUMENTED
21
The ECI website has this data.
22
… and, most of the data is in PDFs
23
The PDF files have a reasonably clear structure
24
… that translates into text that can be parsed
25
… which, with some effort, can be converted into a structured format
… and at this point, we need to start checking for errors.
26
At this point, we start checking what’s gone wrong
Each row here is one constituency.
The number of candidates that have contested in each constituency in every year is shown as a table.
You can see that some patterns emerge here.
27
Not every spelling error is easily identifiable by the first letter
Parties are mis-spelt: MADMK, MAMAK, MDMK
Party names change: AIADMK, ADMK, ADK
Parties restructure: INC(I), INC
Constituency names mis-spelt: BHADRACHALAM, BHADRACHELAM, BHADRAHCALAM
28
Fortunately, large scale data itself can provide a solution
29
… with modern tools that support machine learning
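The slides do not say which tools were used. As a simple stand-in for the idea that the data itself can suggest corrections, the sketch below treats the most frequent spelling as canonical and maps rare, similar spellings onto it using plain string similarity from Python's difflib (not machine learning). The sample counts, the cutoff value, and the canonical_map helper are assumptions.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical sample: the correct spelling is far more common than the typos.
names = (
    ["BHADRACHALAM"] * 5
    + ["BHADRACHELAM", "BHADRAHCALAM"]
    + ["KHAMMAM"] * 3
)

def canonical_map(values, cutoff=0.85):
    """Map each spelling to the most frequent spelling that closely resembles it."""
    counts = Counter(values)
    canon = []     # spellings accepted as canonical, most frequent first
    mapping = {}
    for name, _ in counts.most_common():
        match = get_close_matches(name, canon, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            canon.append(name)
            mapping[name] = name
    return mapping

mapping = canonical_map(names)
cleaned = [mapping[n] for n in names]
print(mapping)
print(Counter(cleaned))
```

This works best on long names such as constituencies. Short party abbreviations like MADMK and MDMK can differ by one letter and still be different parties, so those mappings are safer reviewed by hand.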
30
DATA WRANGLING
FIND LOAD CLEAN