data wrangling with dplyr - NHS-R Community

Andrew Jones | Strategy Unit
Credit: JustTooLazy. Licensed under CC BY 2.0.

Post on 30-Jun-2020



Wrangling

Reshaping or transforming “raw” data into a format which is easier to work with.
(For later visualisation, computing of statistics, and modelling.)

The dplyr package

dplyr is an R package that provides a grammar for data manipulation.

Most wrangling puzzles can be solved with knowledge of just 5 dplyr verbs (5 functions).

These will be the subject of this session.

Gapminder

Data from gapminder.org

install.packages("gapminder")
library(gapminder)

Q. How many variables here? Meaningful names? What type? (more on this tomorrow)
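One way to explore these questions, as a minimal sketch that assumes the dplyr and gapminder packages are installed. glimpse() comes from dplyr; dim() and names() are base R. None of these calls appear on the slides themselves.

```r
library(dplyr)
library(gapminder)

# Number of rows and number of columns (variables)
dim(gapminder)

# The variable names
names(gapminder)

# One line per variable: name, type, and the first few values
glimpse(gapminder)
```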


dplyr: 5 verbs

arrange, filter, mutate, group_by, summarise

Will help us gain a deeper understanding of our data sets.

Newspaper puzzles

Level 2: 18 ÷ 3 × 7 − 6 ÷ 3

Visualising instructions

18 (input) then ÷ 3 then × 7 then − 6

Output = new number

18 then
÷ 3 then
× 7

18 then
divide by 3 then
multiply by 7

18 then
divide(by = 3) then
multiply(by = 7)

Here divide and multiply are the verbs; by = 3 and by = 7 are the rules (for each verb).

18 %>%
  divide(by = 3) %>%
  multiply(by = 7)

Input (number): 18
Output (new number)

What is the output at this step?


The next step builds on the previous one
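The puzzle chain can be run in R as a small sketch. divide() and multiply() are not dplyr functions; they are hypothetical helpers defined here purely to mirror the slides. The %>% pipe is made available by loading dplyr.

```r
library(dplyr)  # makes the %>% pipe available

# Hypothetical helper verbs, defined only for this illustration
divide <- function(x, by) x / by
multiply <- function(x, by) x * by

# Step by step: each output becomes the next input
step1 <- 18 %>% divide(by = 3)        # 18 divided by 3
step2 <- step1 %>% multiply(by = 7)   # result of step1 multiplied by 7

# The same instructions written as one chain
result <- 18 %>%
  divide(by = 3) %>%
  multiply(by = 7)

step1
step2
result
```

Writing the steps separately makes it easy to inspect the intermediate value; the chained form gives the same final number without naming the intermediate results.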

Tidyverse syntax

data_frame then
  do_this(rules) then
  do_this(rules)

The input is a data frame. Each dplyr verb takes a data frame and gives back an output (a new data frame).

data_frame %>%
  do_this(rules) %>%
  do_this(rules)

The tidyverse

Combine simple pieces to solve complex puzzles.

Gapminder

Q1. How many years does this Gapminder excerpt cover?

1. arrange
Reorder rows based on selected variable

gapminder %>%
  arrange(year)

Here gapminder is the input data frame, %>% means “then” (keyboard shortcut in RStudio: Ctrl + Shift + M), arrange is the dplyr verb, and year is the variable to arrange by.

In descending order:

gapminder %>%
  arrange(desc(year))

In descending order (easy way, numeric variables only):

gapminder %>%
  arrange(-year)
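Sorting in both directions gives one route to Q1. The sketch below assumes dplyr and gapminder are loaded; head() is base R and is an addition here, used only to keep the first row after sorting.

```r
library(dplyr)
library(gapminder)

# Earliest year: sort ascending, keep the first row
earliest <- gapminder %>%
  arrange(year) %>%
  head(1)

# Latest year: sort descending, keep the first row
latest <- gapminder %>%
  arrange(desc(year)) %>%
  head(1)

earliest$year
latest$year
```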


Q2. Which 5 countries have the highest human populations? (2007)

gapminder %>%
  arrange(desc(pop))

2. filter
Pick observations by their value

gapminder %>%
  filter(year == 2007)

We are testing equality so ==

The expression inside brackets should be TRUE or FALSE. Here we are choosing rows where this expression is TRUE.
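To see the TRUE/FALSE test on its own, it can be evaluated outside filter(). This is a small sketch assuming dplyr and gapminder are loaded; the names is_2007 and rows_2007 are made up for the illustration.

```r
library(dplyr)
library(gapminder)

# The test returns one TRUE or FALSE per row of the data frame
is_2007 <- gapminder$year == 2007
head(is_2007)

# TRUE counts as 1, so sum() gives the number of rows passing the test
sum(is_2007)

# filter() keeps exactly the rows where the test is TRUE
rows_2007 <- gapminder %>%
  filter(year == 2007)
nrow(rows_2007)
```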

gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(pop))

“then” strings multiple verbs together.
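For Q2, the chain can be finished by keeping only the first five rows. This is one possible sketch; head() is base R rather than one of the five verbs, and top5_pop is a name chosen for the example.

```r
library(dplyr)
library(gapminder)

top5_pop <- gapminder %>%
  filter(year == 2007) %>%   # keep the 2007 rows only
  arrange(desc(pop)) %>%     # largest population first
  head(5)                    # keep the first five rows

top5_pop
```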

gapminder %>%
  filter(continent == "Africa") %>%
  arrange(desc(pop))

Use quotes if referring to text (character) strings: 'single' or "double", as you wish.

Break / Quiz

What is this not?

Picture: White, K. (2011). 101 Things to Learn in Art School. MIT Press.

Q3. Which 5 countries* have the lowest GDP? (2007)

*Not all countries are represented in the data.

gapminder %>%
  filter(year == 2007) %>%
  arrange(gdpPercap)

This is per capita GDP (but we can get what we need from existing variables).

3. mutate
Create new variables from existing ones

gapminder %>%
  mutate(gdp = pop * gdpPercap)

In mutate(gdp = pop * gdpPercap), gdp is the new column name and pop * gdpPercap is its value, usually a function of existing variable(s). This is NOT a test of equality, so we use a single =.

gapminder %>%
  mutate(gdp = pop * gdpPercap) %>%
  arrange(gdp)

We can refer to variables we’ve just created in the pipe chain.
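Combining the verbs seen so far gives one possible answer to Q3 using total rather than per capita GDP. The sketch assumes dplyr and gapminder are loaded; head() is base R, and lowest5_gdp is a name invented for the example.

```r
library(dplyr)
library(gapminder)

lowest5_gdp <- gapminder %>%
  filter(year == 2007) %>%            # 2007 rows only
  mutate(gdp = pop * gdpPercap) %>%   # total GDP from existing variables
  arrange(gdp) %>%                    # smallest GDP first
  head(5)                             # keep the first five rows

lowest5_gdp
```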

Q4. Which country has the highest population for each year of data?

5. summarise
Collapse many values into a single summary value

gapminder %>%
  summarise(pop_high = max(pop))

Similar to mutate: pop_high is the new column name and max() is the (summary) function.

4. group_by
For each group… summarise (collapse into a single summary value)

gapminder %>%
  group_by(year) %>%
  summarise(pop_high = max(pop))

Useful if we desire breakdowns by variable(s).

4. group_by and 5. summarise

The result has a column for each grouping variable (year) and a column for the summary (pop_high), with a row for each year group:

year  pop_high
1952  …
1957  …
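The grouped summary gives the highest population in each year but drops the country it belongs to. One possible way to keep the country for Q4 is a grouped filter(), sketched below. This pattern is not shown on the slides and is offered as an assumption about how the question might be approached; largest_by_year is a made-up name.

```r
library(dplyr)
library(gapminder)

largest_by_year <- gapminder %>%
  group_by(year) %>%
  filter(pop == max(pop)) %>%   # within each year, keep the row(s) with the largest pop
  ungroup() %>%                 # drop the grouping before further steps
  arrange(year)

largest_by_year
```

Because filter() runs within each group here, max(pop) is recomputed per year rather than over the whole data frame.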

gapminder %>%
  group_by(year, continent) %>%
  summarise(pop_high = max(pop))

A column is created for each grouping variable, and there is a row for each unique combination of the grouping variables:

year  continent  pop_high
1952  Africa     …
1957  Africa     …

Q5. How has mean life expectancy in Africa changed (1952–2007)?

A summary value… for each year… but pick only the African continent.

Over to you:
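One possible answer, sketched with the verbs covered so far and assuming dplyr and gapminder are loaded. The column name mean_life and the object name africa_life are choices made for this example.

```r
library(dplyr)
library(gapminder)

africa_life <- gapminder %>%
  filter(continent == "Africa") %>%       # pick only the African continent
  group_by(year) %>%                      # for each year...
  summarise(mean_life = mean(lifeExp))    # ...a single summary value

africa_life
```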

Extension:

Q. How many countries from each continent?

Hint: summarise(any_name = n()) is a common pattern – it will count the number of rows in each group.
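A sketch of the n() pattern for this question, assuming dplyr and gapminder are loaded. Each country appears once per year in the data, so the example first filters to a single year (2007 is an arbitrary choice) so that each row counted is one country. n_countries is a made-up column name.

```r
library(dplyr)
library(gapminder)

countries_per_continent <- gapminder %>%
  filter(year == 2007) %>%          # one row per country
  group_by(continent) %>%           # for each continent...
  summarise(n_countries = n())      # ...count the rows in the group

countries_per_continent
```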

This work is licensed as Creative Commons Attribution-ShareAlike 4.0 International.
To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/

For title photo: https://creativecommons.org/licenses/by/2.0/

End


6. select
Select a subset of variables from existing data set

gapminder %>%
  select(var1, var2)

To remove a column:

gapminder %>%
  select(-var2)

You can also refer to columns by number. Here 1:5 saves having to type 1, 2, 3, 4, 5:

gapminder %>%
  select(1:5)

If you want this column at the start of your data frame:

gapminder %>%
  select(var6, everything())
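The select() patterns can be tried with real gapminder column names in place of the var1, var2 and var6 placeholders. This is a sketch assuming dplyr and gapminder are loaded; the object names and the choice of columns are for illustration only.

```r
library(dplyr)
library(gapminder)

# Keep a subset of variables by name
small <- gapminder %>%
  select(country, year, pop)

# Remove a single column
no_gdp <- gapminder %>%
  select(-gdpPercap)

# Refer to columns by position
first5 <- gapminder %>%
  select(1:5)

# Move one column to the start, keeping everything else after it
gdp_first <- gapminder %>%
  select(gdpPercap, everything())

names(gdp_first)
```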