nicholas tierney · whycare about missing data? tidy data and data shadow 1. identify missing...

1
Why care about missing data? Tidy Data and Data Shadow 1. Identify missing values [email protected] njtierney.com nj_tierney njtierney Tidy Principles for Missing Data Exploration and Analysis using naniar Nicholas Tierney 2. Explore why data is missing 3. Investigate imputation Highlight imputed values by referring to whether they were previously missing using _NA airquality %>% bind_shadow() %>% simputation::impute_lm(Ozone ~ Temp + Wind) %>% ggplot(aes(x = Temp, y = Ozone, colour = Ozone_NA)) + geom_point() geom_miss_point() Reveals missing values in ggplot2: airquality %>% ggplot(aes(x = Solar.R, y = Ozone)) + geom_miss_point() Refer to _NA variables to plot variables according to being missing: airquality %>% bind_shadow() %>% ggplot(aes(x = Temp, fill = Ozone_NA)) + geom_density() Summarise missing data: miss_var_summary(airquality) # A tibble: 6 x 4 variable n_miss pct_miss <chr> <int> <dbl> 1 Ozone 37 24.2 2 Solar.R 7 4.58 3 Wind 0 0 4 Temp 0 0 5 Month 0 0 6 Day 0 0 Get a sense of the missingness in the whole data frame vis_miss(airquality) Combine with dplyr::group_by() airquality %>% group_by(Month) %>% miss_var_summary() # A tibble: 25 x 5 Month variable n_miss pct_miss <int> <chr> <int> <dbl> 1 5 Ozone 5 16.1 2 5 Solar.R 4 12.9 3 5 Wind 0 0 4 5 Temp 0 0 5 5 Day 0 0 6 6 Ozone 21 70.0 7 6 Solar.R 0 0 What and why Learn more at: naniar.njtierney.com A B 2018 “Sam” A_NA B_NA !NA !NA Variables in columns Observations in rows NA NA Data Shadow NA NA Tidy Data Variables end in _NA Values are Missing (NA) or Not Missing (!NA) Bind data with shadow using bind_shadow() Tidy Missing Data A B 2018 “Sam” A_NA B_NA !NA !NA NA NA NA NA Dealing with missing data in real world analysis is usually unavoidable and almost always frustrating. naniar uses principles of tidy missing data, and provides tidyverse-friendly tools to clean up, summarise, and assess missing values, so you can easily: { naniar bind_shadow() geom_miss_point() visdat vis_dat() vis_miss() Identify missing values Explore why data is missing Investigate imputation

Upload: others

Post on 13-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Nicholas Tierney · Whycare about missing data? Tidy Data and Data Shadow 1. Identify missing values nicholas.tierney@gmail.com njtierney.com nj_tierney njtierney Tidy Principles

Why care about missing data? Tidy Data and Data Shadow

1. Identify missing values

[email protected] njtierney.com nj_tierney njtierney

Tidy Principles for Missing Data Exploration and Analysis using naniarNicholas Tierney

2. Explore why data is missing 3. Investigate imputation

Highlight imputed values by referring to whether they were previously missing using _NAairquality %>% bind_shadow() %>%simputation::impute_lm(Ozone ~ Temp + Wind) %>% ggplot(aes(x = Temp,

y = Ozone, colour = Ozone_NA)) +

geom_point()

geom_miss_point()Reveals missing values in ggplot2:

airquality %>% ggplot(aes(x = Solar.R,

y = Ozone)) + geom_miss_point()

Refer to _NA variables to plot variables according to being missing:airquality %>%bind_shadow() %>%ggplot(aes(x = Temp,

fill = Ozone_NA)) +geom_density()

Summarise missing data:miss_var_summary(airquality)

# A tibble: 6 x 4 variable n_miss pct_miss<chr> <int> <dbl>

1 Ozone 37 24.22 Solar.R 7 4.583 Wind 0 04 Temp 0 05 Month 0 06 Day 0 0

Get a sense of the missingness in the whole data framevis_miss(airquality)

Combine with dplyr::group_by()

airquality %>% group_by(Month) %>% miss_var_summary()

# A tibble: 25 x 5 Month variable n_miss pct_miss<int> <chr> <int> <dbl> 1 5 Ozone 5 16.1 2 5 Solar.R 4 12.9 3 5 Wind 0 0 4 5 Temp 0 0 5 5 Day 0 0 6 6 Ozone 21 70.07 6 Solar.R 0 0

What and why

Learn more at: naniar.njtierney.com

A B

2018

“Sam”

A_NA B_NA

!NA

!NA

Variables in columns

Observations in rows

NA

NA

Data Shadow

NA

NA

Tidy Data

Variables end in _NA

Values are Missing (NA) or Not Missing (!NA)

Bind data with shadow using bind_shadow()

Tidy Missing Data

A B

2018

“Sam”

A_NA B_NA

!NA

!NA

NA

NA

NA

NA

Dealing with missing data in real world analysis is usually unavoidable and almost always frustrating.

naniar uses principles of tidy missing data, and provides tidyverse-friendly tools to clean up, summarise, and assess missing values, so you can easily:

{ naniarbind_shadow()

geom_miss_point()

visdatvis_dat()vis_miss()

• Identify missing values• Explore why data is missing• Investigate imputation