@mammykins data scientist matthew gregory...the data best. gds these tests can be automated....
TRANSCRIPT
![Page 1: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/1.jpg)
Matthew GregoryData ScientistGovernment Digital Service@mammykins_
![Page 2: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/2.jpg)
Reproducible Analytical Pipelines
![Page 3: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/3.jpg)
The problem
GDS
![Page 4: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/4.jpg)
GDSGDS
The same kind of input data is used to output the same kind of report, periodically.
![Page 5: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/5.jpg)
GDSGDS
RAP came about by recognising two things:
![Page 6: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/6.jpg)
GDSGDS
1. Workflows with a large number of manual steps are time consuming and potentially error prone.
![Page 7: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/7.jpg)
GDSGDS
“After subtracting the old rate from the
new rate, the spreadsheet divided by
their sum instead of their average, as
the modeler had intended. This error
likely had the effect of muting volatility
by a factor of two and of lowering the
[Value at Risk].”
![Page 8: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/8.jpg)
GDSGDS
2. Statisticians increasingly have programming skills, and are using statistical programming tools.
![Page 9: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/9.jpg)
GDSGDS
![Page 10: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/10.jpg)
GDSGDS
Whilst the tools may have moved forward, in general, the way we manage our code has not.
![Page 11: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/11.jpg)
GDSGDS
There are many tools and techniques we can learn from software developers.
![Page 12: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/12.jpg)
Transparency andAuditability
with version control
![Page 13: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/13.jpg)
GDSGDS
What, Who, Why, When
![Page 14: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/14.jpg)
GDSGDS
![Page 15: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/15.jpg)
GDSGDS
Quality assurance as
you go, not just at the
end
![Page 16: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/16.jpg)
Quality Assurance
with automated testing
![Page 17: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/17.jpg)
GDSGDS
The aim is to formalise the informal tests already conducted by those who know the data best.
![Page 18: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/18.jpg)
GDSGDS
These tests can be automated
![Page 19: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/19.jpg)
Efficiency
writing software instead of scripts
![Page 20: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/20.jpg)
GDSGDS
Everything is in one place.
This helps with institutional knowledge management.
eesectors
Code
Tests
Documentation
Data
![Page 21: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/21.jpg)
GDSGDS
Automating the repetitive processes can reduce production time by 75%
![Page 22: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/22.jpg)
Where are we now?
![Page 23: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/23.jpg)
GDSGDS
We’re going to hear from our collaborators across Government.
![Page 24: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/24.jpg)
GDSDfE
Laura SelbyStatistician
![Page 25: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/25.jpg)
GDSDFE
Raw data Machine readable csv Output
Our aim when we started:
- To release more data to users via machine readable format
- Drive efficiencies through automation of production
- Use code effectively to store knowledge
Our SFR pipeline:
![Page 26: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/26.jpg)
What we wanted to avoid
![Page 27: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/27.jpg)
GDSDFE
Manually adding numbers
![Page 28: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/28.jpg)
GDSDFE
Moving across software
![Page 29: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/29.jpg)
GDSDFE
- Reports take a long time to produce- Reports are hard to reproduce
![Page 30: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/30.jpg)
Using rmarkdown
![Page 31: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/31.jpg)
GDSDFE
![Page 32: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/32.jpg)
GDSDFE
A basic example
![Page 33: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/33.jpg)
Applying to National Statistics
![Page 34: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/34.jpg)
GDSDFE
Using functions
![Page 35: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/35.jpg)
GDSDFE
Add some style
You can create a style
sheet in Word which forces
R to output your document
in the desired format.
![Page 36: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/36.jpg)
GDSDFE
![Page 37: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/37.jpg)
GDSDFE
Final output
![Page 38: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/38.jpg)
Use in DfE
![Page 39: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/39.jpg)
GDSDFE
We were the first Government Department to publish Official Statistics in this way.
But this isn’t limited to just Official Statistics -
- Adhoc pieces of analysis- Quality Assurance reports- Bespoke reports for individual schools and local
authorities
![Page 40: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/40.jpg)
GDSMoJ
Christopher FairbanksStatistician
![Page 41: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/41.jpg)
Automated tables
![Page 42: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/42.jpg)
GDSMoJ
Introducing Offender Management Statistics Quarterly (OMSQ)
● National Statistic publication
● Quarterly and Annual editions
● 1 quarterly bulletin (produced using RAP)
● 40 - 60 formatted tables (automated or soon to be automated)
![Page 43: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/43.jpg)
GDSMoJ
What do the tables look like?● Consistent formatting● Quarters are appended on● Percentage difference column● Fonts, indentation
How are they manually produced?● Outputting tables from SAS● Copying and pasting values● Using VLOOKUPS()● Manual formatting
![Page 44: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/44.jpg)
GDSMoJ
What do we want to achieve?
A process that:
● Produce tables quickly● Is easy to use and maintain● Reads data from a central, quality assured dataset
○ Easy to change the central dataset● Output information consistently each quarter:
○ Calculated consistently○ Consistent user friendly format
● Final output is Indestiguisable from manually creating tables
![Page 45: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/45.jpg)
What did we come up with?
![Page 46: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/46.jpg)
GDSMoJ
Process flow - producing automated tables
Canonical Data (simplest data form) Cross-tabulation Apply styles
Read in a .csv or .sas7bdat Using reshape2::dcast Using xltabr and ‘style sheet’
![Page 47: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/47.jpg)
GDSMoJ
Introducing ‘xltabr’
● R package, created in-house at MoJ
What can ‘xltabr’ do?● Turns R dataframes into .xlsx formatted tables using a
‘style sheet’
Is this available to everyone?● Yes, https://github.com/moj-analytical-services/xltabr
![Page 48: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/48.jpg)
GDSMoJ
Style sheet
Can specify that a cell has any combination of the following styles:● Fonts● Bold/Italics/Underlined● Justification● Row height● Column width● Cell colour● Borders● ...
![Page 49: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/49.jpg)
Final product
![Page 50: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/50.jpg)
GDSMoJ
The Final Product
.xlsx workbook:
● Contents page updates when tables are created
● Publication dates update, including the next publication date
● Tab for each table● Hyperlinks link to tables
![Page 51: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/51.jpg)
GDSMoJ
The Final Product
Tables:
● Consistent formatting● Automated suppression of
percentage change based on values less than 50
● Specified row/column height/widths
● Specified borders and indentation
![Page 52: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/52.jpg)
GDSMoJ
What are the benefits of automating tables?
Accuracy:● Information held in central datasets feed directly into
the tables○ Less room for human error (typos, copy + paste)
Efficiency:● Once automation code has been written, tables can be
created at the touch of a button○ Alleviating time pressure, during busy periods
![Page 53: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/53.jpg)
GDSMoJ
Olivia ChristophersonHead of Profession
![Page 54: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/54.jpg)
GDSGDS
Automating DCMS Sector Economic Estimates
● New, high profile publication (various economic measures for 7 DCMS sectors & numerous sub-sectors)
● Very manual process, resource intensive, risk of errors & repetitive work
● Each year produced separately and then QA’ed● Significant demands for additional breakdowns● Risk sectors will be redefined
![Page 55: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/55.jpg)
GDSGDS
Things to consider
● Software – access to R and Rstudio● Installation of git on developers’ laptops needed for
collaborative development● Data scientist or equivalent with sufficient knowledge● Dedicated resource (can’t be done alongside day job!)● Importance of good documentation● Knowing when to stop perfecting code● Good way to secure data science resource● Wider benefits - working with GDS; R skills
![Page 56: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/56.jpg)
GDSMoJ
Vicky HughesData Scientist
![Page 57: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/57.jpg)
Culture of change
![Page 58: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/58.jpg)
GDSGDS
Choose where adds most value
Senior buy-in
Make it less scary
Use the cross Government group
Create an internal group
![Page 59: @mammykins Data Scientist Matthew Gregory...the data best. GDS These tests can be automated. Efficiency writing software instead of scripts. GDS Everything is in one ... Cross-tabulation](https://reader035.vdocument.in/reader035/viewer/2022070810/5f08e2a07e708231d42431cf/html5/thumbnails/59.jpg)
Thanks! Questions?
https://ukgovdatascience.github.io/rap_companion/