
Computing for Research I

Spring 2012

Lecture 1: January 5

Primary Instructor: Elizabeth Garrett-Mayer

Introduction

• Description: Students learn to use the primary statistical software packages for data manipulation and analysis, including (but not limited to) R, R Bioconductor, SAS, SAS macro, and Stata. Additionally, students will learn how to use the division's high-speed cluster-computing environment, how to practice the principles of reproducible research using Sweave in R, how to use LaTeX and BibTeX for manuscript and presentation development, and how to create and maintain a website. This is a three-credit course.

• Course Organization: This course is given by the entire division. Instructors will take turns giving lectures in their areas of expertise.

• Textbooks: No textbook. Reading material (primarily found on the web) will be provided as necessary.

• Prerequisites: Biometry 700

Evaluation

• Grading: Instructors will give short exercises, to be completed and turned in to the primary instructor by the Wednesday of the week after they are assigned (e.g., assignments given on Monday Jan 16 and Wednesday Jan 18 are both due on Wednesday Jan 25). The assignments are weighted equally and together account for 75% of the course grade. A final project accounts for 20%, and the remaining 5% reflects class participation.

• Homework Policy: Homework is due by 5pm on the due date. All homework should be emailed to the primary instructor ([email protected]) or turned in at lecture time. Asking for extensions is strongly discouraged, but extenuating circumstances do arise on occasion. Each student may therefore request up to two extensions, each of no more than 2 days. After both are used, no further extensions will be granted except with a medical note.


Classroom Etiquette

• Attention to material: Laptops are permitted in class, but if they are used, it is expected to be for following along with the lecture. Email and web browsers should not be used during class time. The instructors are giving their time and expertise. Be respectful and give them your attention.

• Classroom disruptions: Many of us have small children or others who need to be able to reach us during lectures. It is acceptable to bring pagers or cell phones to class. Please be sure they are on silent mode. If you need to leave during lecture to take or make a phone call, please do so. However, this should be a relatively rare occurrence. Texting and emailing during lecture time are not acceptable.

• Violations of classroom etiquette policies will result in a 0 for class participation.

Contact

Primary Instructor: Elizabeth Garrett-Mayer

Website: http://people.musc.edu/~elg26/teaching/statcomputing.2012/statcomputingI.2012.htm

Contact Info: Hollings Cancer Center, Rm. [email protected] (preferred mode of contact is email), 792-7764

Time: Tuesdays and Thursdays, 2:00-3:30

Location: Cannon 301, Room 305V

Office Hours: The primary instructor will be available by appointment. However, given the nature of the course, the primary instructor may not be knowledgeable about all of the topics covered. As a result, you may need additional help from the lecturers to complete assignments. Be considerate and responsible in scheduling time with course instructors and recognize that they all have busy schedules.


Course Objectives

Upon successful completion of the course, the student will be able to:

• Import data, perform simple analyses and produce graphical displays in Stata, SAS and R
• Create new functions or commands in each of R, Stata and SAS
• Generate professional-quality scientific manuscripts and presentations using LaTeX along with statistical software
• Perform standard power and sample size calculations using available software and simulations
• Operate the division’s cluster computer with batch computing

Schedule, briefly

• R
• Data management
• Stata
• SAS
• Batch processing
• LaTeX + Sweave
• Power calculations
• Designing a website

Detailed Schedule

Date       Lecturer                Topic
Th Jan 5   E. Garrett-Mayer        Introduction; Overview and Principles
Tu Jan 10  E. Garrett-Mayer        R: introduction to object-oriented programming
Th Jan 12  Caitlyn Ellerbe         R: downloading packages/libraries; data input & output
Tu Jan 17  Cody Chiuzan            R: graphics
Th Jan 19  Georgiana Onicescu      R: basic language structure (ifelse, where, looping)
Tu Jan 24  Andrew Lawson           R: exploratory data analysis; writing commands
Th Jan 26  Bethany Wolf            R: Bioconductor
Tu Jan 31  Yanqui Weng             R: simulations; random number generation; sampling from distributions
Th Feb 2   Stacia DeStantis        R: regression commands
Tu Feb 7   Amy Wahlquist           Data management: RedCap
Th Feb 9   Annie Simpson           Data management principles & Excel
Tu Feb 14  E. Garrett-Mayer        Stata: introduction, “immediate” commands
Th Feb 16  E. Garrett-Mayer        Stata: graphical displays
Tu Feb 21  E. Garrett-Mayer        Stata: exploratory data analysis
Th Feb 23  E. Garrett-Mayer        Stata: regression commands
Tu Feb 28  E. Garrett-Mayer        Stata: programming and do files

Detailed Schedule (continued)

Date       Lecturer                Topic
Th Mar 1   Kyra Robinson           SAS: introduction
Tu Mar 6   Ramesh                  SAS: IML
Th Mar 8   Renee Martin            SAS: macros
Tu Mar 20  Valerie Durkalski       SAS: proc tabulate and proc report
Th Mar 22  Nate Baker              SAS: Gplot
Tu Mar 27  Katherine Nicholas      SAS: ODS
Th Mar 29  Jordan Elm              SAS: array processing
Tu Apr 3   Adrian Nida             Batch processing (using R) and cluster computing
Th Apr 5   Mulugeta Gebregziabher  LaTeX and BibTeX: manuscript production
Tu Apr 10  Emily Kistner-Griffin   LaTeX and BibTeX: presentations
Th Apr 12  Betsy Hill              Reproducible Research: Sweave
Tu Apr 17  Paul Nietert            Sample size calculation software packages
Th Apr 19  Sybil Prince-Nelson     Designing your own website
Tu Apr 24                          FINAL PROJECT DUE APRIL 30, 5PM

Housekeeping

• We are meeting in a regular classroom
• Bringing laptops is allowed and encouraged
• Data, code, etc. needed for class will be on the website prior to class
• For optimal interface, install packages ASAP
  – R (http://cran.r-project.org/)
  – Stata (DBE helpdesk request)
  – SAS (DBE helpdesk request)
  – WinEdt (http://www.winedt.com/)
• Create a bookmark to the course website: http://people.musc.edu/~elg26/teaching/statcomputing.2012/statcomputingI.2012.htm




Lecture Notes

• Every lecturer will have his/her own style
• Notes may be
  – Prepared ahead of time and posted
  – Prepared and posted after the lecture
  – Nonexistent

• Lecture notes will NOT be printed by the instructors prior to lecture.

• If they are available and you would like a paper copy, it is your responsibility to print them out.

Introduction

• 2012: to be a successful biostatistician/epidemiologist, you MUST be competent on the computer.

• Historically: students learned in labs from students

• Moving forward:
  – Many options for analysis and generation of results
  – Efficiency in computing is essential.
  – Your computer IS your lab!

Data analysis software

• In this course:
  – R
  – Stata
  – SAS

• Many other options:

SPSS         S, Splus   Epi Info   GraphPad
JMP          Matlab     JAGS       Systat
Minitab      EGRET      BMDP       MedCalc
Mathematica  WinBugs    GLIM       ….

SAS: History

• SAS was conceived by Anthony J. Barr in 1966. As a North Carolina State University graduate student from 1962 to 1964, Barr had created an analysis of variance modeling language. From 1966 to 1968, Barr developed the fundamental structure and language of SAS.

• In January 1968, Barr and James Goodnight collaborated, integrating new multiple regression and analysis of variance routines developed by Goodnight into Barr's framework.

• By 1971, SAS was gaining popularity within the academic community. One strength of the system was analyzing experiments with missing data, which was useful to the pharmaceutical and agricultural industries, among others.

• In 1976, SAS Institute, Inc. was incorporated.

• The latest version, SAS version 9.2, was released in March 2008.

SAS: functioning

• SAS consists of a number of components, which organizations separately license and install as required.

• Licenses expire! Software cannot be used after expiration (unless renewed)

Why (or why not) SAS?

• Most commonly used in pharma (although that may be changing!)
• FDA likes SAS
• Many jobs for MS statisticians and/or epidemiologists require SAS expertise
• The most common language
• Becoming less the choice of academia
  – Updates are less frequent than freeware
  – ‘pros’ of competitors are starting to outweigh the ‘pros’ of SAS
• Licensing costs
• Slow to add new functionality
• Lack of consistency with syntax
• Learning curve is slower than other programs that now have similar capability

Stata

• Stata is a general-purpose statistical software package created in 1985 by StataCorp.

• Most of its users work in research, especially in the fields of economics, sociology, political science, biomedicine and epidemiology.

• Relatively simple to learn yet powerful

• Latest version is Stata 11 (released 2009).

• Lots of add-ons for epi users

Why (or why not) Stata?

• Relatively inexpensive (especially as student or single-user)
• Biomedical focus: output and functions are tailored to medical research
• Fast and big: can handle and manipulate large datasets
• Sophisticated with wide range of tools
• Easy to learn language with consistent syntax
• Graphics are not as good as other packages (although that has improved)
• Programming (simulations, loops, etc.) is more challenging

R: History

• R is a programming language and software environment for statistical computing and graphics.
• The R language has become a de facto standard among statisticians for the development of statistical software, and is widely used for data analysis.
• R is an implementation of the S programming language. S was created by John Chambers while at Bell Labs. R was created by Ross Ihaka and Robert Gentleman, and is now developed by the R Development Core Team. R is named partly after the first names of the first two R authors, and partly as a play on the name of S.
• R source code is freely available under the GNU General Public License.
• The capabilities of R are extended through user-submitted packages, which allow specialized statistical techniques, graphical devices, as well as import/export capabilities to many external data formats. A core set of packages is included with the installation of R, with more than 2460 (as of July 2010) available at the Comprehensive R Archive Network (CRAN).

R: functionality

• Freeware: latest version can be installed anywhere at anytime

• Packages (a.k.a. libraries) that are user-contributed allow additional features/commands

• Relatively simple interface

Why (or why not) R?

• Great for programming and simulations
• Handles looping well
• Flexible language
• FREE!
• User-contributed packages included in real-time (i.e., no delay in their availability)
• Most PhD Biostatistics programs teach their students R and many/most academic statisticians in top programs use R.
• Interfaces nicely with other programs such as LaTeX (Sweave), WinBugs, C, Emacs.
• Can be clunky for data management.
• Memory is not as good as SAS and Stata
• Quality-control on user-contributed packages not evident

Overview

• Not a question of which one.
• Question is “for my current problem, which package makes the most sense to use?”
• Each has strengths and weaknesses

LaTeX and Sweave

• LaTeX is a document markup language and document preparation system for the TeX typesetting program.

• The term LaTeX refers only to the language in which documents are written, not to the editor used to write those documents. In order to create a document in LaTeX, a .tex file must be created using some form of text editor (e.g. WinEdt).

• LaTeX is most widely used by mathematicians, scientists, engineers, philosophers, lawyers, linguists, economists, researchers, and other scholars in academia.

• LaTeX is used because of the high quality of typesetting achievable by TeX. The typesetting system offers extensive facilities for automating most aspects of typesetting and desktop publishing, including numbering and cross-referencing, tables and figures, page layout and bibliographies.

http://en.wikipedia.org/wiki/TeX


LaTeX and Sweave

• Sweave is a function in R that enables integration of R code into LaTeX documents. The purpose is "to create dynamic reports, which can be updated automatically if data or analysis change".

• The data analysis is performed at the moment of writing the report, or more exactly, at the moment of compiling the Sweave code with Sweave (i.e., essentially with R) and subsequently with LaTeX. This can facilitate the creation of up-to-date reports for the author.

• Because the Sweave files, together with any external R files that might be sourced from them and the data files, contain all the information necessary to trace back all steps of the data analyses, Sweave also has the potential to make research more transparent and reproducible to others. However, this is only the case to the extent that the author makes the data and the R and Sweave code available.

http://en.wikipedia.org/wiki/LaTeX

Data management

• Analysis of clean data is easy!
• The real world: you will get messy data most of the time from your colleagues
• Data management tools will help you:
  – Deal with messy data
  – Set up data capture approaches for your colleagues to minimize messiness
• Excel, RedCap and general principles of data management for statistical analysis will be covered

Example

Patient #  Cycle #  Total ceramide levels  S1P levels  C18 ceramide  S1P/C18
1          0        743.6                  197.2       9.8           20.122449
           3        625.6                  177.9       9.9           17.969697
2          0        534.8                  148.4       9             16.4888889
CR         3        461.6                  182.8       10.8          16.9259259
           5        527.3                  151.4       11.5          13.1652174
3          0        760.5                  214.5       12            17.875
4          0        359                    167.3       4.3           38.9069767
           3        375.9                  125.3       4.6           27.2391304
           5        475.6                  116.2       4.4           26.4090909
5          0        394.1                  163.1       5.7           28.6140351
6          0        848.7                  132.5       10.8          12.2685185
           3        1083.6                 203.9       13.5          15.1037037
7          0        684.6                  191.4       8.1           23.6296296
8          0        822.7                  219.5       8.9           24.6629213
9          0        486.3                  198         5.7           34.7368421
CR                  581.3                  186.8       9.6           19.4583333
                    699.6                  42.3        11.4          3.71052632
                    561.7                  130.4       6.7           19.4626866
                    754                    320.6       14.4          22.2638889
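A minimal sketch, in Python, of the kind of cleanup a sheet like this needs before analysis. It rests on two assumptions that the slide does not state: that a blank patient cell means "same patient as the row above", and that "CR" is a free-text annotation typed into the patient column. The rows are the first five from the example; the function and field names are hypothetical.

```python
# Rows as they might arrive from a colleague's spreadsheet:
# (patient, cycle, total_ceramide, s1p, c18_ceramide). None marks a blank cell.
RAW_ROWS = [
    ("1", 0, 743.6, 197.2, 9.8),
    (None, 3, 625.6, 177.9, 9.9),
    ("2", 0, 534.8, 148.4, 9.0),
    ("CR", 3, 461.6, 182.8, 10.8),
    (None, 5, 527.3, 151.4, 11.5),
]


def tidy(rows):
    """Return one complete record per row, with the patient ID carried down."""
    cleaned = []
    current_patient = None
    for patient, cycle, total, s1p, c18 in rows:
        note = ""
        if patient is not None and patient.isdigit():
            current_patient = int(patient)  # a numeric cell starts a new patient
        elif patient is not None:
            note = patient  # assumption: non-numeric text (e.g. "CR") is an annotation
        cleaned.append({
            "patient": current_patient,
            "cycle": cycle,
            "total_ceramide": total,
            "s1p": s1p,
            "c18_ceramide": c18,
            "s1p_c18_ratio": round(s1p / c18, 4),  # recomputed rather than retyped
            "note": note,
        })
    return cleaned


if __name__ == "__main__":
    for record in tidy(RAW_ROWS):
        print(record)
```

Every record now carries its own patient ID and the annotation sits in its own field, so the rows can be sorted or filtered without losing track of who they belong to. The rows at the bottom of the sheet that lack a cycle number cannot be repaired this way; those need a question to the colleague who sent the file.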

Sample size and power

• We don’t really use textbook formulas anymore to do simple power calculations (just like we don’t really invert matrices by hand when we analyze data).

• There are a number of packages that quickly and easily perform simple power calculations

• R, SAS and Stata can do some.

• But, packages like Nquery, EAST and PASS do a lot more.

• In some non-standard settings, simulations are required to determine power.
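To make the simulation idea concrete, here is a minimal Python sketch that estimates power by repeating a made-up trial many times and counting how often the test rejects. Everything specific in it is an assumption chosen for illustration: two equal arms, normally distributed outcomes with a known standard deviation, and a two-sided z-test. The function names are hypothetical. A closed-form calculation for the same setting is included so the two can be compared.

```python
import random
from statistics import NormalDist, mean


def simulated_power(n_per_arm, effect, sd=1.0, alpha=0.05, n_sims=2000, seed=1):
    """Fraction of simulated two-arm trials in which a two-sided z-test rejects."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * (2 / n_per_arm) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_arm)]
        z = (mean(treated) - mean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


def analytic_power(n_per_arm, effect, sd=1.0, alpha=0.05):
    """Closed-form power of the same z-test, for comparison."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = effect / (sd * (2 / n_per_arm) ** 0.5)
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


if __name__ == "__main__":
    print("simulated:", simulated_power(64, 0.5))
    print("analytic: ", analytic_power(64, 0.5))
```

With 64 subjects per arm and a difference of half a standard deviation, both numbers should land near 0.8. The simulation earns its keep in the non-standard settings mentioned above, where no closed form exists: replace the data-generating lines and the test with whatever the real design calls for.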

Before getting started…

• Types of files involved in statistical computing
  – Data files
  – Results files
  – Command/batch files
  – Function files
  – Graphics files
  – + more(?)
• TIPS:
  – Develop a common nomenclature for naming files and folders
  – Organize projects within folders

Organization is key!

• DO NOT overwrite old files (especially data files)
• Save with a new name
  – Mousedata.xls (file sent from colleague)
  – Mousedata.clean.xls (your clean version of the data)
• Use a consistent approach, but think ahead
  – Naming files *.new.* is not a good idea. You may have a new ‘new’ next week
  – Numerics are good, but if you think you may need more than 9 versions, consider how data2 and data10 would be alphabetized.
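The data2 versus data10 point is easy to see by sorting the names. The short Python sketch below shows the problem and one common fix, zero-padding the version number; the helper name `versioned_name` is hypothetical.

```python
def versioned_name(stem, version, ext="csv", width=2):
    """Build a name such as data02.csv; zero-padding keeps versions in order."""
    return f"{stem}{version:0{width}d}.{ext}"


unpadded = [f"data{v}.csv" for v in (1, 2, 10)]
padded = [versioned_name("data", v) for v in (1, 2, 10)]

print(sorted(unpadded))  # data10.csv lands before data2.csv
print(sorted(padded))    # data01.csv, data02.csv, data10.csv
```

A width of 2 covers up to 99 versions; pick the width before the first file is saved, since renaming later defeats the purpose.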

Examples

• For each Principal Investigator I work with, I have a folder

• Within the PI folder, for each project, I have a folder

• For each time I get a new dataset (or work on a new grant) for that project, I have a folder named with month and year

• Example:
  I:\MUSC Oncology\Kraft, Andrew\VelcadeTrial\May2008
  I:\MUSC Oncology\Kraft, Andrew\R01 June 2007

Examples

• Within each folder of data analysis or grant development calculations, I use the same naming conventions for files:
  – Rbatch.R: a set of R commands that implement all of the computation or analyses
  – Rfunctions.R: a set of R functions that are used by the batch file
  – I always save the original data file from the investigator before making any changes
  – I add ‘clean’ to the datafile name and save it as a .csv before use (e.g. mousedata.clean.csv)
  – My Rbatch.R files always include a line sourcing in the data, including the folder where the data resides.
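The "keep the original, save a clean copy" convention can be enforced in code rather than by memory. The sketch below is in Python rather than R, purely as an illustration; the function names are hypothetical, and refusing to overwrite an existing clean file is a design choice of this sketch, not something the slides prescribe.

```python
from pathlib import Path


def clean_copy_path(original):
    """Map mousedata.xls to mousedata.clean.csv in the same folder."""
    original = Path(original)
    return original.with_name(f"{original.stem}.clean.csv")


def save_clean_copy(original, text):
    """Write cleaned data next to the original without touching the original."""
    target = clean_copy_path(original)
    # Mode "x" raises FileExistsError instead of silently overwriting.
    with open(target, "x", newline="") as handle:
        handle.write(text)
    return target


if __name__ == "__main__":
    print(clean_copy_path("mousedata.xls"))  # mousedata.clean.csv
```

Because the target name is derived from the original name, the two files always sit side by side, and a second attempt to save over an existing clean file stops with an error rather than quietly replacing it.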

Friends in Statistical Computing

1. Google is your friend
2. ‘Help’ functions and ‘see also’ links are your friends
3. ‘examples’ are your friends
4. Your fellow students are your friends

Friends help friends figure out statistical computing!

Using your noggin

• Example 1:
  – SPSS is not included in this curriculum.
  – Does that mean you cannot use it? NO!
  – Will you be able to learn it better and faster after having taken this course? YES!
• Example 2:
  – We will probably not cover the R package nnc (Nearest Neighbor Autocovariates)
  – Does that mean you need to find someone to teach it to you? NO!
  – Will you be able to teach it to yourself? YES!
• Example 3:
  – None of your instructors are computer scientists (except maybe Annie Simpson)
  – Does this mean that they are not qualified to teach you? NO!
  – Most of them are self-taught with regard to these techniques

Final Thoughts for Today

• THIS COURSE WILL POINT YOU IN THE RIGHT DIRECTION AND PROVIDE A SET OF TOOLS

• IT IS YOUR JOB TO MAKE THEM FIT TOGETHER AND USE THEM AS A LAUNCHING PAD TO SOLVE PROBLEMS

• Next up: Intro to R on Tuesday!

References

• Background info on R, SAS, Stata, Latex and Sweave was all pilfered from Wikipedia.
