Data Analytics Intermediate 1, Module 2: Fundamentals of Data in Excel
TRANSCRIPT
Data Cleaning
Garbage in. Garbage out.
Variable Type: ID, Text, Ordinal, Categorical, Numerical
5 types of data problems that need to be cleaned
Duplicate ID, Type, Missing, Out of range, Outlier
Data Cleaning Best Practice
Guided Practice for cleaning data
Copy the data separately before making any changes.
Make a note of what you've done.
Create a document to record conclusions:
Where does the data come from?
Explanation of the analysis results.
Summary of the analysis results.
Summary Statistics
Ordinal and categorical data
1. Central tendency
 a. Mean: The central value of a finite set of numbers; specifically, the sum of the values divided by the number of values.
 b. Median: The middle value of a list of data when it is arranged in order.
 c. Mode: The value that appears most often in a set of data values.
2. Dispersion
 a. Standard deviation: A measure of the amount of variation or dispersion in a set of values.
 b. Range: The difference between the largest and smallest values in a set of data.
3. Statistical dependency
 a. Correlation: Any statistical relationship, whether causal or not, between two random variables or bivariate data.
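As a rough illustration of these definitions, the sketch below computes them for the Spending column of the example customer table in this deck. It uses Python's standard statistics module as a stand-in for the Excel functions; the variable names are illustrative only.

```python
import statistics

# Spending column from the example customer table (IDs 1 to 7).
spending = [20000, 30000, 100000, 20000, 10000, 4000, 40000]

summary = {
    "mean": statistics.mean(spending),        # sum of values / number of values
    "median": statistics.median(spending),    # middle value after sorting
    "mode": statistics.mode(spending),        # most frequent value
    "stdev": statistics.stdev(spending),      # sample standard deviation
    "range": max(spending) - min(spending),   # largest minus smallest
}

for name, value in summary.items():
    print(f"{name}: {value:.0f}")
```

Note how the single large value (100000) pulls the mean well above the median, which is one reason to report both.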
Lookup_value is the value that will be used to match data. This is usually an identifier (an ID of some kind). It must exist in both worksheets.
Table_array is the table from which you want to retrieve data.
Col_index_num is the number of the column, counted from the left side of the table_array (the left-most column is 1), from which you want to retrieve data. Lock the column with $, e.g. $C4.
Range_lookup defines whether the lookup_value must be an exact match or may be an approximate match of the value in the left-most column of the table_array. TRUE: An approximate match is allowed. FALSE: An exact match is required.
Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known.
Numerical data
Numerical data is a data type expressed in numbers, rather than natural language description. Sometimes called quantitative data, numerical data is always collected in number form.
Data Referencing
VLOOKUP SYNTAX
The syntax is: =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])
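To make the arguments concrete, here is a minimal Python sketch of what an exact-match lookup (range_lookup = FALSE) does. It is not Excel's implementation; the function name vlookup and the list-of-rows layout are assumptions for illustration. The rows are the ID / Bill / Card type example table from this deck.

```python
# Each row is (ID, Bill, Card type); the ID is the left-most column.
bills = [
    (1, 80000, "Primo"),
    (3, 20000, "Super"),
    (5, 10000, "Platinum"),
    (6, 0, "Primo"),
    (6, 3000, "Super"),
    (6, 1000, "Platinum"),
]

def vlookup(lookup_value, table_array, col_index_num):
    """Exact-match lookup: return the requested column of the FIRST row
    whose left-most cell equals lookup_value, or "#N/A" if none matches."""
    for row in table_array:
        if row[0] == lookup_value:
            return row[col_index_num - 1]  # col_index_num counts from 1
    return "#N/A"

print(vlookup(3, bills, 2))  # Bill for ID 3
print(vlookup(6, bills, 3))  # Card type for ID 6 (first of three matching rows)
print(vlookup(2, bills, 2))  # ID 2 is not in the table
```

Because only the first match is returned, duplicate IDs such as the three rows for ID 6 silently hide data, which is why duplicates should be handled before looking values up.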
A NULL is any missing value in your data. There are four primary strategies for handling NULL values: Delete them (only with caution). Ignore them (some may have meaning). Impute values (e.g. median or zeros). Find missing values (using reference resources).
Guideline: If over 15% of a dataset is filled with NULL values, find new data!
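A small sketch of the impute strategy combined with the 15% guideline, in Python. The missing value for ID 2 is hypothetical (the example table has no gaps), and the helper names null_share and impute_median are made up for this illustration.

```python
from statistics import median

def null_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def impute_median(values):
    """Replace each missing entry with the median of the present values."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]

# Spending column with one hypothetical missing value.
spending = [20000, None, 100000, 20000, 10000, 4000, 40000]

share = null_share(spending)
if share > 0.15:
    cleaned = None  # guideline: too many NULLs, look for new data instead
else:
    cleaned = impute_median(spending)

print(f"NULL share: {share:.1%}")
print(cleaned)
```

Imputing with the median rather than zero keeps the filled value close to typical spending; whether that is appropriate depends on why the value is missing.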
ID | Age range | Region | Spending | Name
1 | 10-30 | North | 20000 | Anthony
2 | 31-50 | North | 30000 | Brittney
3 | 31-50 | Northeast | 100000 | Christina
4 | 51-90 | North | 20000 | Donald
5 | 51-90 | South | 10000 | Elaine
6 | 10-30 | Northeast | 4000 | Frank
7 | 10-30 | South | 40000 | Gary
ID | Bill | Card type
1 | 80000 | Primo
3 | 20000 | Super
5 | 10000 | Platinum
6 | 0 | Primo
6 | 3000 | Super
6 | 1000 | Platinum
ID | Age range | Region | Spending | Name
1 | 10-30 | North | east | Anthony
2 | 31-50 | North | 30000 | Brittney
3 | 31-50 | Northeast | -100000 | Christina
4 | 51-90 | North | 20000 | Donald
5 | 51-90 | South | 10000 | Elaine
6 | 10-30 | Northeast | 4000 | Frank
7 | 10-30 | South | 4000000 | Gary
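The type, out-of-range, and outlier problems in the Spending column above can be flagged with a sketch like the following. The rules are assumptions for illustration, not the course's method: spending must parse as a number, must not be negative, and is called an outlier when its modified z-score (based on the median absolute deviation) exceeds a cutoff of 3.5.

```python
from statistics import median

# Spending cells from the table above, keyed by ID, kept as raw text.
spending = {1: "east", 2: "30000", 3: "-100000", 4: "20000",
            5: "10000", 6: "4000", 7: "4000000"}

def check_spending(cells, cutoff=3.5):
    """Return {id: problem} for cells failing a missing, type, range, or outlier check."""
    problems, valid = {}, {}
    for row_id, text in cells.items():
        if text.strip() == "":
            problems[row_id] = "missing"
            continue
        try:
            value = float(text)
        except ValueError:
            problems[row_id] = "type"          # not a number at all
            continue
        if value < 0:
            problems[row_id] = "out of range"  # assumed rule: no negative spending
        else:
            valid[row_id] = value
    if valid:
        med = median(valid.values())
        mad = median(abs(v - med) for v in valid.values())
        for row_id, value in valid.items():
            # Modified z-score; skipped when mad is 0 to avoid dividing by zero.
            if mad and 0.6745 * abs(value - med) / mad > cutoff:
                problems[row_id] = "outlier"
    return problems

problems = check_spending(spending)
print(problems)
```

A median-based rule is used here because, with only a handful of rows, one extreme value inflates the mean and standard deviation enough to hide itself.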
Field | Notes | Action to take | Data Type
Accident_Index | Unique identifier | Check for and handle duplicates. | Alphanumeric
Accident_Severity | Slight, Serious, or Fatal | Check for and handle blanks. | Text
Date | Expect duplicates | None | Date
Region | Likely entered by a human; critical to our analysis | Check for incorrect and/or missing values. | Text
Number_of_Included_Parties | The spread of numbers looks concerning at first glance | Check for and handle outliers. | Numerical
Number_of_Vehicles | The spread of numbers looks concerning at first glance | Check for and handle outliers. | Numerical
2021 © TRUE DIGITAL ACADEMY Data Analytics Intermediate I Module 2: Fundamentals of Data in Excel
Data Analytics Intermediate 1: Fundamentals of Data in Excel
Module 2
Example: Thai Demographics and Income data
Histogram
Strong positive correlation Strong negative correlation
Correlation
grouping
aggregation
Positive Correlation: The greater the population density, the greater the income. The lower the population density, the lower the income.
Negative Correlation: The greater the population density, the lower the income. The lower the population density, the greater the income.
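The sign of the correlation coefficient can be seen in a small sketch. The density and income numbers below are hypothetical, chosen to be perfectly linear; the function correl is a plain Pearson correlation written out by hand for illustration, not Excel's code.

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, not taken from the Thai dataset.
density = [1, 2, 3, 4]
income_up = [2, 4, 6, 8]      # income rises with density
income_down = [8, 6, 4, 2]    # income falls as density rises

r_up = correl(density, income_up)
r_down = correl(density, income_down)
print(r_up, r_down)
```

A coefficient near +1 means the two variables rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.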
Statistics in Excel
Name Range: Highlight the selected table and fill in the name range.
Mean: =AVERAGE(name range)
Median: =MEDIAN(name range)
Standard Deviation: =STDEV(name range)
Correlation: =CORREL(name range 1, name range 2)
The Data Analysis add-in can help you do data analysis in many ways.
Example
How many customers are there in EACH region?
What's the average spending in EACH region?
What's the average spending in EACH age group?
1. Click ‘Insert’ > ‘Pivot Table’.
2. Select the table/data source.
3. Place variables into column, row, filter, and value.
Crosstabulation is a technique to find the relationship between multiple variables.
Categorical x Numerical = Aggregate statistics
Categorical x Categorical = Counts
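The same grouping and aggregation a pivot table performs can be sketched in Python on the example customer table. The helper average_by and the tuple layout are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# (ID, age range, region, spending) from the example customer table.
customers = [
    (1, "10-30", "North", 20000),
    (2, "31-50", "North", 30000),
    (3, "31-50", "Northeast", 100000),
    (4, "51-90", "North", 20000),
    (5, "51-90", "South", 10000),
    (6, "10-30", "Northeast", 4000),
    (7, "10-30", "South", 40000),
]

def average_by(rows, key_index, value_index=3):
    """Group rows by one column, then average another column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Categorical x Numerical = aggregate statistics
avg_by_region = average_by(customers, key_index=2)
avg_by_age = average_by(customers, key_index=1)

# Categorical x Categorical = counts
count_by_region = Counter(row[2] for row in customers)
region_by_age = Counter((row[2], row[1]) for row in customers)

for region, avg in avg_by_region.items():
    print(region, count_by_region[region], round(avg))
```

This answers the three example questions: customers per region come from count_by_region, and the two averages come from avg_by_region and avg_by_age.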
Statistic | Value
count | 77
mean | 23542
std | 6209
min | 13497
25% | 19590
50% | 22101
75% | 26376
max | 45572
Grouping: rows split by Region
Region | Spending
North | 20000
North | 30000
North | 20000
Northeast | 100000
Northeast | 4000
South | 10000
South | 40000
Aggregation: average Spending per Region
Region | Spending
North | 23333
Northeast | 52000
South | 25000
Figure: scatter plots of Income and Density. Panels: Positive Correlation = 0.6, Positive Correlation = 0.9, Negative Correlation = -0.6, Negative Correlation = -0.9.