data wrangling with dplyr - nhs-r community€¦ · wrangling reshaping or transforming “raw”...
TRANSCRIPT
data wrangling with dplyr
1
Cre
dit: Ju
stToo
Lazy. Lice
nse
d u
nd
er C
C B
Y 2
.0.
Andrew Jones | Strategy Unit
Wrangling
Reshaping or transforming “raw” data into a
format which is easier to work with.
(For later visualisation, computing of statistics,
and modelling.)
2
The dplyr package
Dplyr is a language for data manipulation.
Most wrangling puzzles can be solved with
knowledge of just 5 dplyr verbs (5 functions).
These will be the subject of this session.
3
Gap
min
der
4
Gapminder
Data from gapminder.org
install.packages(“gapminder”)
library(gapminder)
5
Q. How many variables here?Meaningful names?
What type?(more on this tomorrow)
6
dplyr
5 verbs
Will help us gain a deeper understanding of our
data sets.
7
arrange filter mutate summarisegroup_by
Newspaper puzzles
8
18 ÷3 x7 −6 ÷3Level 2
Visualising instructions
9
18 ÷3 x7 −6
Input
then then then
Output = new number
Visualising Instructions
10
18 then
÷3 then
x7
11
18 then
3 then
7
÷
x
12
18 then
3 then
7
divide by
multiply by
13
18 then
by=3 then
by=7
divide
multiply
verbs
Rules (for each verb)
14
18 then
by=3 then
by=7
divide( )
multiply( )
verbs
Rules (for each verb)
15
18 %>%
by=3 %>%
by=7
divide( )
multiply( )
Input (number)
Output (new number)
16
18 %>%
by=3 %>%
by=7
divide( )
multiply( )
Input (number)
What is the output at this step?
17
18 %>%
by=3 %>%
by=7
divide( )
multiply( )
Input (number)
What is the output at this step?
The next step builds on the previous one
data_frame then
do_this(rules) then
do_this(rules)
Tidyverse syntax
18
Input
dplyr verb
Output (new data frame)
data_frame %>%
do_this(rules) %>%
do_this(rules)
Tidyverse syntax
19
Combine simple pieces to solve complex puzzles
data_frame %>%
do_this(rules) %>%
do_this(rules)
The tidyverse
20
Gap
min
der
21
Q1. How many years does this Gapminder excerpt
cover?
22
1. arrangeReorder rows based on selected variable
gapminder %>%
arrange(year)
23
“then”Input data frame
dplyr verb variable to arrange by
1. arrangeReorder rows based on selected variable
gapminder %>%
arrange(year)
24
“then” – Ctrl + Shift + mInput data frame
dplyr verb variable to arrange by
1. arrangeIn descending order:
gapminder %>%
arrange(desc(year))
25
1. arrangeIn descending order (easy way):
gapminder %>%
arrange(-year)
26
“then” – Ctrl + Shift + m
Q2. Which 5 countries have the highest human
populations?(2007)
27
gapminder %>%
arrange(desc(pop))
28
2. filterpick observations by their value
gapminder %>%
filter( )
29
“then”Input data frame
dplyr verb
2. filterpick observations by their value
gapminder %>%
filter(year == 2007)
30
2. filterpick observations by their value
gapminder %>%
filter(year == 2007)
31
We are testing equality so ==
Here we are choosing rows
where this expression is
TRUE
The expression inside brackets
should be TRUE or FALSE
2. filter
gapminder %>%
filter(year == 2007) %>%
arrange(desc(pop))
32
“then”strings multiple
verbstogether
2. filter
gapminder %>%
filter(continent == “Africa”) %>%
arrange(desc(pop))
33
Use quotes if referring to text
(character) strings
‘single’ or “double”as you wish
34
Pic
ture
: W
hite, K. (2
011
). 1
01
Thin
gs
to L
earn
in A
rt S
cho
ol.
MIT
Pre
ss.Break / Quiz
What is this not?
Q3. Which 5 countries* have the lowest GDP?
(2007)
*Not all countries represented in data
35
Q3. Which 5 countries have the lowest GDP?
gapminder %>%
filter(year == 2007) %>%
arrange(gdpPercap)
36
This is per capita GDP(but we can get what we need from existing variables)
3. mutate
create new variables from existing ones
gapminder %>%
mutate(gdp = pop * gdpPercap)
37
3. mutate
gapminder %>%
mutate(gdp = pop*gdpPercap)
38
NOT a test of equality, so =
new column name
Usually a function of existing variable(s)
3. mutate
gapminder %>%
mutate(gdp = pop*gdpPercap) %>%
arrange(gdp)
39
We can refer to variables we’ve just created in the pipe chain
Q4. Which country has the highest population for each
year of data?
40
5. summarisecollapse many values into a single summary value
gapminder %>%
summarise(pop_high = max(pop))
41
(summary)function
new column name
Similar to mutate:
4. group_byFor each group…
… summarise (collapse into a single summary value)
gapminder %>%
group_by(year) %>%
summarise(pop_high = max(pop))
42
4. group_byUseful if we desire breakdowns by variable(s)
gapminder %>%
group_by(year) %>%
summarise(pop_high = max(pop))
43
4.group_by and 5.summarise
gapminder %>%
group_by(year) %>%
summarise(pop_high = max(pop))
44
year pop_high
A column is created for each grouping variable
A column for the summary
4.group_by and 5.summarise
gapminder %>%
group_by(year) %>%
summarise(pop_high = max(pop))
45
A row for each year group
year pop_high
1952
1957
4.group_by and 5.summarise
gapminder %>%
group_by(year, continent) %>%
summarise(mean_life = max(pop))
46
A column is created for each
grouping variable
year continent pop_high
1952 Africa
1957 Africa
4.group_by and 5.summarise
gapminder %>%
group_by(year, continent) %>%
summarise(mean_life = max(pop))
47
A row for each unique combo of
the grouping variables
year continent pop_high
1952 Africa
1957 Africa
Q5. How hasmean life expectancy
in Africachanged (1952–2007)?
48
Q5. How hasmean life expectancy
in Africachanged (1952–2007)?
49
A summary value…
Q5. How hasmean life expectancy
in Africachanged (1952–2007)?
50
A summary value…
for each year…
Q5. How hasmean life expectancy
in Africachanged (1952–2007)?
51
A summary value…
for each year…
but pick only the African continent
Q5. How hasmean life expectancy
in Africachanged (1952–2007)?
52
A summary value…
for each year…
but pick only the African continent
Over to you:
Q. How many countries from each continent?
53
Hint:
summarise(any_name = n())
Is a common pattern – it will count the number of rows in each group
Extension:
This work is licensed as
Creative Commons
Attribution-ShareAlike 4.0
International
To view a copy of this license, visit
https://creativecommons.org/licenses/by-sa/4.0/
For title photo:
https://creativecommons.org/licenses/by/2.0/
54
End
55
6. selectselect a subset of variables from existing data set
gapminder %>%
select(var1, var2)
56
6. selectselect a subset of variables from existing data set
gapminder %>%
select(-var2)
57
To remove a column
6. selectselect a subset of variables from existing data set
gapminder %>%
select(1:5)
58
You can also refer to columns by number.
Here 1:5 saves having to type: 1,2,3,4,5
6. selectselect a subset of variables from existing data set
gapminder %>%
select(var6, everything())
59
If you want this column at the start of your data frame