@mammykins data scientist matthew gregory...the data best. gds these tests can be automated....

Matthew Gregory Data Scientist Government Digital Service @mammykins_

Post on 19-Jun-2020

TRANSCRIPT

Page 1:

Matthew Gregory
Data Scientist
Government Digital Service
@mammykins_

Page 2:

Reproducible Analytical Pipelines

Page 3:

The problem


Page 4:

The same kind of input data is used to output the same kind of report, periodically.

Page 5:

RAP came about by recognising two things:

Page 6:

1. Workflows with a large number of manual steps are time consuming and potentially error prone.

Page 7:

“After subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the [Value at Risk].”
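
The "factor of two" in this quote follows directly from the arithmetic: a sum of two rates is twice their average, so dividing by the sum halves the result. A minimal sketch in Python (the deck itself works in R; the function names and the two rates are made up for illustration):

```python
def relative_change(old_rate: float, new_rate: float) -> float:
    """Intended calculation: change divided by the average of the two rates."""
    return (new_rate - old_rate) / ((old_rate + new_rate) / 2)


def relative_change_buggy(old_rate: float, new_rate: float) -> float:
    """The spreadsheet error: change divided by the sum of the two rates."""
    return (new_rate - old_rate) / (old_rate + new_rate)


# Hypothetical rates, chosen only to illustrate the arithmetic.
old_rate, new_rate = 2.0, 3.0
intended = relative_change(old_rate, new_rate)
buggy = relative_change_buggy(old_rate, new_rate)

# The buggy figure is half the intended one, whatever the rates are.
print(intended, buggy, buggy / intended)
```

In a spreadsheet this kind of slip is hidden inside a cell formula; in code it sits in a named function that can be reviewed and tested.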

Page 8:

2. Statisticians increasingly have programming skills, and are using statistical programming tools.

Page 9:

Page 10:

Whilst the tools may have moved forward, in general, the way we manage our code has not.

Page 11:

There are many tools and techniques we can learn from software developers.

Page 12:

Transparency and Auditability

with version control

Page 13:

What, Who, Why, When

Page 14:

Page 15:

Quality assurance as you go, not just at the end

Page 16:

Quality Assurance

with automated testing

Page 17:

The aim is to formalise the informal tests already conducted by those who know the data best.

Page 18:

These tests can be automated
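
As a minimal sketch of what formalising such checks might look like, the Python below turns a few informal "eyeball" checks into a function that can run on every new data delivery. The deck's own tooling is R; the column names (`sector`, `year`, `value`), the expected sector list and the rules themselves are all hypothetical.

```python
EXPECTED_SECTORS = {"creative", "digital", "tourism"}  # hypothetical categories


def check_rows(rows):
    """Return a list of human-readable problems found in the data."""
    problems = []
    for i, row in enumerate(rows):
        if row["sector"] not in EXPECTED_SECTORS:
            problems.append(f"row {i}: unexpected sector {row['sector']!r}")
        if row["value"] is None:
            problems.append(f"row {i}: missing value")
        elif row["value"] < 0:
            problems.append(f"row {i}: negative value {row['value']}")
    return problems


good = [{"sector": "creative", "year": 2016, "value": 10.5}]
bad = [{"sector": "crative", "year": 2016, "value": -1.0}]

assert check_rows(good) == []  # clean data passes silently
print(check_rows(bad))         # lists both problems with the bad row
```

Once the checks live in code, they run the same way every time, rather than depending on who happens to look at the data.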

Page 19:

Efficiency

writing software instead of scripts

Page 20:

Everything is in one place.

This helps with institutional knowledge management.

eesectors

Code

Tests

Documentation

Data

Page 21:

Automating the repetitive processes can reduce production time by 75%

Page 22:

Where are we now?

Page 23:

We’re going to hear from our collaborators across Government.

Page 24:

Laura Selby
Statistician, DfE

Page 25:

Our aim when we started:

- To release more data to users in a machine-readable format
- To drive efficiencies through automation of production
- To use code effectively to store knowledge

Our SFR pipeline: Raw data → Machine-readable CSV → Output

Page 26:

What we wanted to avoid

Page 27:

Manually adding numbers

Page 28:

Moving across software

Page 29:

- Reports take a long time to produce
- Reports are hard to reproduce

Page 30:

Using rmarkdown

Page 31:

Page 32:

A basic example

Page 33:

Applying to National Statistics

Page 34:

Using functions

Page 35:

Add some style

You can create a style sheet in Word which forces R to output your document in the desired format.

Page 36:

Page 37:

Final output

Page 38:

Use in DfE

Page 39:

We were the first Government Department to publish Official Statistics in this way.

But this isn’t limited to Official Statistics. It also works for:

- Ad hoc pieces of analysis
- Quality assurance reports
- Bespoke reports for individual schools and local authorities

Page 40:

Christopher Fairbanks
Statistician, MoJ

Page 41:

Automated tables

Page 42:

Introducing Offender Management Statistics Quarterly (OMSQ)

● National Statistic publication

● Quarterly and Annual editions

● 1 quarterly bulletin (produced using RAP)

● 40 - 60 formatted tables (automated or soon to be automated)

Page 43:

What do the tables look like?
● Consistent formatting
● Quarters are appended on
● Percentage difference column
● Fonts, indentation

How are they manually produced?
● Outputting tables from SAS
● Copying and pasting values
● Using VLOOKUP()
● Manual formatting

Page 44:

What do we want to achieve?

A process that:

● Produces tables quickly
● Is easy to use and maintain
● Reads data from a central, quality-assured dataset
    ○ Easy to change the central dataset
● Outputs information consistently each quarter:
    ○ Calculated consistently
    ○ Consistent, user-friendly format
● Gives a final output that is indistinguishable from manually created tables

Page 45:

What did we come up with?

Page 46:

Process flow - producing automated tables

1. Canonical data (simplest data form): read in a .csv or .sas7bdat
2. Cross-tabulation: using reshape2::dcast
3. Apply styles: using xltabr and a ‘style sheet’
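
The first two steps of this flow (read canonical data, then cross-tabulate) can be sketched with the Python standard library. This is an illustration of the idea only, not the MoJ code and not the `reshape2::dcast` API; the inline CSV, its column names and the `cross_tabulate` helper are hypothetical, and the styling step is not shown.

```python
import csv
import io
from collections import defaultdict

# Hypothetical canonical (long-form) data: one row per observation.
CANONICAL_CSV = """quarter,category,count
2017Q1,A,120
2017Q1,B,80
2017Q2,A,130
2017Q2,B,70
"""


def cross_tabulate(rows, row_key, col_key, value_key):
    """Pivot long-form rows into {row label: {column label: summed value}}."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_key]][row[col_key]] += int(row[value_key])
    return {r: dict(cols) for r, cols in table.items()}


rows = list(csv.DictReader(io.StringIO(CANONICAL_CSV)))
table = cross_tabulate(rows, "category", "quarter", "count")

# Print one line per category, with one column per quarter.
quarters = sorted({row["quarter"] for row in rows})
print("category", *quarters)
for category in sorted(table):
    print(category, *(table[category].get(q, 0) for q in quarters))
```

Keeping the data in long form and pivoting only at the end means a new quarter is just more rows in the input; the table gains a column without any change to the code.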

Page 47:

Introducing ‘xltabr’

● R package, created in-house at MoJ

What can ‘xltabr’ do?
● Turns R dataframes into formatted .xlsx tables using a ‘style sheet’

Is this available to everyone?
● Yes, https://github.com/moj-analytical-services/xltabr

Page 48:

Style sheet

Can specify that a cell has any combination of the following styles:
● Fonts
● Bold/Italics/Underlined
● Justification
● Row height
● Column width
● Cell colour
● Borders
● ...

Page 49:

Final product

Page 50:

The Final Product

.xlsx workbook:

● Contents page updates when tables are created

● Publication dates update, including the next publication date

● Tab for each table

● Hyperlinks link to tables

Page 51:

The Final Product

Tables:

● Consistent formatting

● Automated suppression of percentage change based on values less than 50

● Specified row/column height/widths

● Specified borders and indentation
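
The suppression rule in the list above could be sketched as follows in Python. The threshold of 50 comes from the slide; everything else is an assumption made for illustration: the function name, the `".."` marker, rounding to one decimal place, and the choice to suppress when either value is below the threshold (the slide does not say which value the rule applies to).

```python
SUPPRESSION_THRESHOLD = 50  # from the slide: values less than 50
SUPPRESSED = ".."           # hypothetical marker for a suppressed cell


def percentage_change(previous, current, threshold=SUPPRESSION_THRESHOLD):
    """Percentage change from previous to current, or a marker if suppressed.

    Assumption: the change is suppressed when either value is below the
    threshold; the slide does not say which value the rule applies to.
    """
    if previous < threshold or current < threshold:
        return SUPPRESSED
    return round(100 * (current - previous) / previous, 1)


print(percentage_change(200, 230))  # counts large enough: a number
print(percentage_change(40, 60))    # small base: suppressed
```

Because the rule lives in one function, changing the threshold or the marker is a one-line edit that applies to every table.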

Page 52:

What are the benefits of automating tables?

Accuracy:
● Information held in central datasets feeds directly into the tables
    ○ Less room for human error (typos, copy + paste)

Efficiency:
● Once automation code has been written, tables can be created at the touch of a button
    ○ Alleviating time pressure during busy periods

Page 53:

Olivia Christopherson
Head of Profession

Page 54:

Automating DCMS Sector Economic Estimates

● New, high profile publication (various economic measures for 7 DCMS sectors & numerous sub-sectors)

● Very manual process, resource intensive, risk of errors & repetitive work

● Each year produced separately and then QA’ed

● Significant demands for additional breakdowns

● Risk that sectors will be redefined

Page 55:

Things to consider

● Software – access to R and RStudio
● Installation of git on developers’ laptops, needed for collaborative development
● Data scientist or equivalent with sufficient knowledge
● Dedicated resource (can’t be done alongside day job!)
● Importance of good documentation
● Knowing when to stop perfecting code
● Good way to secure data science resource
● Wider benefits - working with GDS; R skills

Page 56:

Vicky Hughes
Data Scientist

Page 57:

Culture of change

Page 58:

Choose where it adds most value

Senior buy-in

Make it less scary

Use the cross Government group

Create an internal group

Page 59:

Thanks! Questions?

https://ukgovdatascience.github.io/rap_companion/