Reproducible Statistics course for the future, from Stata to R
Kennedy Mwai1, Amos Thairu1, Tabitha Mwangi2, Greg Fegan1,3
1. KEMRI-Wellcome Trust Research Programme, Kilifi
2. Pwani University, Kenya
3. Nuffield Department of Clinical Medicine, Centre for Tropical Medicine & Global Health, University of Oxford
@kenniajin
URL: www.keniajin.com
Keywords: Training, Reproducibility, RStudio, Git, RMarkdown
Introduction
Because reproducibility is laudable and frequently called for, we should instil this
practice in students before they set out to do research [1]. The maturity and extensive
reproducibility features of Git, R and RStudio make materials based on them an excellent
choice for professional statistical skills training [2,3].
Creating reproducible courses for universities and training institutions can be
challenging with other major statistical software packages. Meanwhile, the
popularity of R and RStudio as tools for statistical programming is rising.
Approach
We used these capabilities to convert the Statistical Methodology for the Design
and Analysis of Epidemiological Studies (SMDAES) course from a STATA®-based course to an R-based reproducible course.
Conversion from STATA (*.do, *.dta, *.doc, *.ppt) to RStudio and R:
- *.do to *.R files
- *.dta to csv, via read.dta()
- *.ppt to *.RMD or *.RNW, via knitr/rmarkdown
- *.doc to *.RMD or *.RNW, via knitr/rmarkdown
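To make the *.do to *.R step concrete, here is a minimal sketch of how a few common Stata commands might be rewritten in base R. The data frame and its columns (age, sex, sbp) are hypothetical stand-ins and are not taken from the course materials; the Stata commands appear only as comments.

```r
# Hypothetical example data standing in for a course data set.
dat <- data.frame(
  age = c(23, 35, 41, 29, 52),
  sex = c("F", "M", "F", "F", "M"),
  sbp = c(110, 128, 135, 118, 142)
)

# Stata: summarize age
age_summary <- summary(dat$age)

# Stata: tabulate sex
sex_table <- table(dat$sex)

# Stata: regress sbp age
fit <- lm(sbp ~ age, data = dat)

print(age_summary)
print(sex_table)
print(coef(fit))
```

Keeping the original Stata command as a comment above each R line is one way to help facilitators who know the .do files check that the translation is faithful.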
Local Git repository
git add .
git commit
git push
Online GitHub course repository
git pull
Pros and Cons
Cons
- Initial effort is time-consuming
- Server setup requires resources
- Stata had a lot of background formatting
The course
The course mainly covered materials beneficial to students pursuing public-health-related
courses. The course materials were adapted from the STATA® workshops that
KEMRI-WTRP has been running in the programme for over five years. All the data sets
were converted from .dta to .csv format using the foreign package. The code used
in the presentations was converted from the original .do files (STATA®) to .R files.
The course was hosted on GitHub [4], where the facilitators could update the materials
from their local or RStudio server repositories.
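The .dta to .csv conversion described above could be scripted along the following lines. This is a sketch under stated assumptions rather than the script the authors used: the function name convert_dta_to_csv and the folder names in the usage note are hypothetical. Note that read.dta() from the foreign package reads Stata binary files only up to version 12.

```r
library(foreign)

# Convert every .dta file in in_dir to a .csv file of the same name in out_dir.
# convert_dta_to_csv is a hypothetical helper, not part of the foreign package.
convert_dta_to_csv <- function(in_dir, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  dta_files <- list.files(in_dir, pattern = "\\.dta$", full.names = TRUE)
  for (dta_path in dta_files) {
    dat <- read.dta(dta_path)                       # read the Stata data set
    csv_name <- sub("\\.dta$", ".csv", basename(dta_path))
    write.csv(dat, file.path(out_dir, csv_name), row.names = FALSE)
  }
  invisible(length(dta_files))                      # number of files converted
}
```

A call such as convert_dta_to_csv("data_stata", "data_csv") would then convert a whole folder in one step, and the resulting .csv files can be committed to the course repository alongside the .R files.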
Our RStudio server platform ran on Ubuntu 12.04 with 10 GB of RAM.
Students and facilitators were instructed to obtain a username/email before the start
of the course so that they could access the RStudio server platform, which was linked
to the university's Active Directory. We had approximately 30 active users during the
course period.
The course was spread over a period of 10 days: 9 days of coursework and
a final wrap-up day covering different R help forums. The facilitators were
encouraged to give 15-minute presentations and then engage the participants in
hands-on R exercises.
References
1. Stodden, V. & Miguez, S. Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research. (2014). At http://dx.doi.org/10.6084/m9.figshare.1027488
2. Muenchen, R. A. R for stats: the popularity of data analysis software. (2014, accessed March 27, 2015). At http://r4stats.com/articles/popularity
3. Wieczorek, J. Reproducible research, training wheels, and knitr. (2014, accessed March 30, 2015). At http://civilstat.com/2014/02/reproducible-research-training-wheels-and-knitr
4. Introduction to Statistical Methods using R and R Studio 2015 -- Pwani University. At https://github.com/Keniajin/PwaniR_Training_2015