STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Yang Li, Lin Liu

Jan 29, 2014

• Unix part slides courtesy: John Brunelle

• You can check out more details at:
  – https://software.rc.fas.harvard.edu/training/intro_unix/latest/#(1)

Sign up on Odyssey

• Very simple: go to http://rc.fas.harvard.edu/, click Account and Access Request Forms (in the Quick Links section at the top right of the site), then click RC Account form and fill it in as below. We will take care of the rest!

Basic Unix Command

• Log in:
  • ssh [email protected]

Basic Unix Command

• Upload or download files:
  Upload:
    scp dir/yourfilename username@host:dir/targetfilename
  Download:
    scp username@host:dir/targetfilename dir/yourfilename

CaSe SeNsItIvE

• In shell commands, abc will be different from ABC

Terminology and notation

• Folders are usually referred to as directories

• Locations in the filesystem, like /n/home00/cfest350, are called paths

• The directory and file names that make up a path are always separated by forward slashes

• The top of the hierarchy is /, i.e. the root directory

Navigating the system: ls

Download and unzip files

• wget https://software.rc.fas.harvard.edu/training/examples.tar.gz

• tar xvf examples.tar.gz
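
The tar flags are easier to remember after a round trip on files you create yourself. The following is a minimal sketch, assuming a POSIX shell with tar and gzip installed; the directory and file names (demo, a.txt, b.txt) are made up for illustration and are not part of the course examples.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small directory tree to archive (hypothetical file names).
mkdir demo
echo "hello" > demo/a.txt
echo "world" > demo/b.txt

# c = create, z = compress with gzip, f = name of the archive file
tar czf demo.tar.gz demo

# t = list the archive contents without extracting anything
listing=$(tar tzf demo.tar.gz)
echo "$listing"

# Remove the original, then x = extract it back from the archive
rm -r demo
tar xzf demo.tar.gz
restored=$(cat demo/a.txt)
echo "$restored"
```

Adding v (as in tar xvf) only makes tar print each file name as it is processed.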

What if you get confused

• man ls
  – Use the arrow keys, page up/down keys, or the SPACE to navigate
  – To search for a phrase of text, for example the word time, type /time and hit ENTER
    • Hit n to go to the next occurrence
    • Hit N to go to the previous occurrence
    • Hit q to quit

Kill process

• top
• kill
• killall
• Ctrl-c
• Exercise: Run the command ~/examples/bin/ticktock, and kill it once you've had enough
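
Here is one way the kill workflow might look in a script. It is a sketch, not the exercise solution: sleep stands in for ticktock, and the process is started in the background so the script can continue. At an interactive prompt, Ctrl-c does the same job for a foreground command.

```shell
# Start a long-running command in the background; sleep stands in for ticktock.
sleep 300 &
pid=$!    # $! holds the process ID of the most recent background job
echo "started process $pid"

# kill sends SIGTERM by default, which asks the process to exit.
kill "$pid"

# wait collects the exit status; a process ended by a signal
# reports a status above 128 (128 plus the signal number).
status=0
wait "$pid" || status=$?
echo "exit status: $status"
```

When you do not know the process ID, top (or ps) shows it, and killall selects processes by name instead of by ID.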

Copy files

• mkdir workshop
• cd workshop
• cp ~/examples/aaa .
• cp ~/examples/bbb ~/examples/ccc .
• cp aaa zzz
• rsync: a replacement for cp that can also copy files to/from remote computers
  – e.g. rsync -avz --progress mywork username@hostname:~/mywork
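
The cp steps above can be tried without the course files. This sketch recreates stand-ins for aaa, bbb, and ccc in a temporary directory (their contents are invented), then repeats the same three cp forms.

```shell
# Set up stand-in files in a throwaway directory (contents are made up).
workdir=$(mktemp -d)
cd "$workdir"
mkdir examples
echo "first" > examples/aaa
echo "second" > examples/bbb
echo "third" > examples/ccc

mkdir workshop
cd workshop

cp ../examples/aaa .                  # one file into the current directory (.)
cp ../examples/bbb ../examples/ccc .  # several files: the last argument is the destination
cp aaa zzz                            # copy a file under a new name

files=$(ls)
echo "$files"
```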

Moving and removing

File permissions

• The -rw-r--r-- displays the file mode bits
  – The first character is the type (- for files, d for directories, and other letters (b, c, l, s etc.) for special files)
  – Following that are three groups of three characters, for read, write, and execute permissions for user, group, and others

• r = 4, w = 2, x = 1, rwx = 7
• chmod 755
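
Each octal digit is the sum of the permission values for one group, in the order user, group, others. The sketch below applies two common modes to a scratch file (script.sh is a made-up name) and reads the resulting mode string back from ls -l.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch script.sh

# 755 = rwx (4+2+1) for user, r-x (4+0+1) for group, r-x (4+0+1) for others
chmod 755 script.sh
mode_755=$(ls -l script.sh | cut -c1-10)
echo "$mode_755"    # -rwxr-xr-x

# 644 = rw- (4+2+0) for user, r-- (4+0+0) for group, r-- (4+0+0) for others
chmod 644 script.sh
mode_644=$(ls -l script.sh | cut -c1-10)
echo "$mode_644"    # -rw-r--r--
```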

Hidden files

File manipulation

• cat ~/examples/gpl-3.0.txt
• less ~/examples/gpl-3.0.txt
• File editors: vim/emacs/nano

More shell commands

Piping your commands

• cat ~/examples/answers.out | awk '{print $3}'
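
To see what the pipe does without the course files, the sketch below builds a small three-column file and runs the same cat-into-awk pipeline on it. The file contents are invented for illustration; the real answers.out may look different.

```shell
workdir=$(mktemp -d)

# Hypothetical three-column data standing in for ~/examples/answers.out
printf '%s\n' 'q1 alice 42' 'q2 bob 17' 'q3 carol 99' > "$workdir/answers.out"

# cat writes the file to standard output; the pipe feeds it to awk,
# which prints the third whitespace-separated column of every line.
third=$(cat "$workdir/answers.out" | awk '{print $3}')
echo "$third"

# Pipes chain: sort the third column numerically and keep the largest value.
largest=$(cat "$workdir/answers.out" | awk '{print $3}' | sort -n | tail -n 1)
echo "$largest"    # 99
```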

Exercises

• List the last 5 files in /bin by combining the ls and tail commands with a pipe

• Count the number of lines that contain the word free in ~/examples/gpl-3.0.txt

The shell environment

• echo $PATH
• Change $PATH:
  • PATH=$PATH\:/dir/path ; export PATH
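
The effect of changing PATH is easiest to see with a command of your own. This is a sketch under a few assumptions: the tool name hello_tool is invented, and the temporary directory must allow executing files.

```shell
bindir=$(mktemp -d)

# A tiny executable script; the name hello_tool is made up for this example.
printf '%s\n' '#!/bin/sh' 'echo "hello from my tool"' > "$bindir/hello_tool"
chmod 755 "$bindir/hello_tool"

# Append the directory to PATH and export it so child processes see it too.
PATH=$PATH:$bindir
export PATH

# The shell now finds the script by name, without its full path.
found=$(command -v hello_tool)
output=$(hello_tool)
echo "$found"
echo "$output"
```

The change lasts only for the current shell session; to make it permanent, the same two lines usually go into a startup file such as ~/.bashrc.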

Submit a job

• bsub < yourscript.bsub
• yourscript.bsub:

#!/bin/sh
#BSUB -u linliu@harvard
#BSUB -J hello_world
#BSUB -o hello_world.out
#BSUB -e hello_world.err
#BSUB -q short_serial
python hello_world.py > hello_world.out

Load modules

• module load dir/software
  – http://oldrcwebsite.rc.fas.harvard.edu/faq/modulelist

Final tips

• Google is extremely helpful if you want to write some shell scripts

Getting started

[Screenshot of the R working environment, with its panes labeled:]

• Where the scripts/commands are executed
• Where plots/help are displayed, and packages installed
• Where the code is scripted
• Shows the variables/functions in memory

Workspace Management

• Before jumping into R, it is important to ask ourselves:
  – Where am I?
    > getwd()
  – I want to be there…
    > setwd("C://")
  – Who am I with?
    > dir() # lists all the files in the working directory
  – Who can I count on?
    > ls() # lists all the variables in the current session

Workspace Management (2)

Saving
> save(x, file="name.RData") # saves specific objects
> save.image("name.RData") # saves the whole workspace

Loading
> load("name.RData")

'?function' and '??function'
> ?  # gets the documentation of the function
> ?? # finds functions related to the query

R Objects

• Almost all things in R are OBJECTS!
  – Functions, datasets, results, etc… (but not graphs)
• OBJECTS are classified by two criteria
  – MODE: How objects are stored in R
    • Character, numeric, logical, list, function…
    • To obtain the mode of an object
      > mode(object)
  – CLASS: How objects are treated by functions
    • Vector, matrix, array, data.frame, factor,…
    • To obtain the class of an object
      > class(object)

R classes

character
> assembly = "hg19"
> assembly
> class(assembly)

numeric
> expression = 3.456
> expression
> class(expression)

integer
> nbases = 3000000L # the L suffix makes an integer; 3000000000L would exceed R's integer limit (2147483647)
> nbases
> class(nbases)

logical
> completed = FALSE
> completed
> class(completed)

R classes - Vector

vector
> x=c(10,5,3,6); x[3:4]; x[1]

Computations on vectors are performed on each entry of the vector
> y=c(log(x),x,x^2)

Vectors do not need to have the same length in operations!
> w=sqrt(x)+2
> z=c(pi,exp(1),sqrt(2))
> x+z # z is recycled, with a warning because 4 is not a multiple of 3

– Logical vectors
> aux=x<7

R Classes - List

list
A vector of values of possibly different classes and different lengths.

Creating it.
> x1 = 1:5
> x2 = c(T,T,F,T,F)
> y = list(question.number = x1, question.answer = x2)

Accessing it.
> y; class(y)
> y$question.answer[3]; y[[2]][3]; y[["question.answer"]][3]
> y$question.number[which(y$question.answer == T)]

R classes - Matrix

matrix
> x=1:8
> dim(x)=c(2,4)
> y=matrix(1:8,2,4,byrow=F)

Operations are applied on each element
> x*x; max(x)
> x=matrix(1:28,ncol=4); y=7:10 # so then x*y is…?
> y=matrix(1:8,ncol=2)
> y%*%t(y)

R classes - Matrix

matrix
Extracting info
> y[1,]; y[,1]

Extending matrices
> cbind(y,seq(101,104))
> rbind(y,c(102,109))

Apply is a useful function!
> apply(y,2,mean)
> apply(y,1,log)

R classes – Data Frame

data.frame
Creating it.
> policy.number = c("A00187", "A00300", "A00467", "A01226")
> issue.age = c(74,30,68,74)
> sex = c("F", "M", "M", "F")
> smoke = c("S","N","N","N")
> face.amount = c(420, 1560, 960, 1190)
> ins.df = data.frame(policy.number, issue.age, sex, smoke, face.amount)

Accessing it.
> ins.df[1,]; ins.df[,1] # access first row, access first column
> ins.df$policy.number # access policy number column
> rownames(ins.df); colnames(ins.df);
> index.smokers = which(ins.df$smoke == "S") # row index of smokers
> ins.df[index.smokers,] # access all smokers in the df (note the comma: rows, not columns)
> ins.df$policy.number[index.smokers] # policy number for smokers

R classes – Data Frame

data.frame
Manipulating it.
> ins.df = rbind(ins.df, data.frame(policy.number = "A01495", issue.age = 62, sex = "M", smoke = "N", face.amount = 1330)) # a data.frame row keeps numeric columns numeric
> sort.age = sort(ins.df$issue.age, index.return = TRUE)
> ins.df = ins.df[sort.age$ix,]
> ins.df$visits = c(0,4,2,1,1)
> drops = c("sex","visits")
> ins.df[,!(names(ins.df) %in% drops)]
> ins.df[,"visits"] = c(0,4,2,1,1)
> carins.df = data.frame(policy.number = c("A01495","A00232","A00187"), car.accident = c("Y","N","N"))
> ins.merged.df = merge(ins.df, carins.df, by = "policy.number")
> Etc…

R Classes - Factor

factor
Qualitative variables that can be included in models.
> smoke = c("yes","no","yes","no")
> smoke.factor = as.factor(smoke)
> smoke.factor
> class(smoke)
> class(smoke.factor)

Loops and Conditional Statements

if
Example
> a=9
> if(a<0){ print("Negative number") } else { print("Non-negative number") }

Loops and Conditional Statements

• for
> z=rep(1,10)
> for (i in 2:10) { z[i]=z[i]+exp(1)*z[i-1] }

• while
> n=0
> tmp=0
> while(tmp<100) { tmp=tmp+rbinom(1,10,0.5); n=n+1 }

Functions!

• My own functions
> function.name=function(arg1,arg2,…,argN) { Body of the function }
> fun.plot=function(y,z){ y=log(y)*z-z^3+z^2; plot(z,y) }
> z=seq(-11,10)
> y=seq(11,32)
> fun.plot(y,z)

Functions! (2)

• The '...' argument
  – Can be used to pass arguments from one function to another
    • Without the need to specify arguments in the header

fun.plot=function(y,z,...) { y=log(y)*z-z^3+z^2; plot(z,y,...) }
fun.plot(y,z,type="l",col="red")
fun.plot(y,z,type="l",col="red",lwd=4)

Handling data I/O

Reading from files to a data frame
> read.csv("filename.csv") # reads csv files into a data.frame
> read.table("filename.txt") # reads txt files in a table format into a data.frame

Writing from a data frame to a file
> write(x,filename) # writes the object x to filename
> write.table(x,filename) # writes the object x to filename in a table format

Note: keep in mind additional options such as header = TRUE, row.names = TRUE, col.names = TRUE, quote = TRUE, etc.

Plotting!

> x.data=rnorm(1000)
> y.data=x.data^3-10*x.data^2
> z.data=-0.5*y.data-90
> plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
> points(x.data,z.data,col="red")
> legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))

Plotting! (2)

You can export graphs in many formats
– To check the formats that are available in your R installation
> capabilities()

png
> png("Lab2_plot.png",width=520,height=440)
> plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
> points(x.data,z.data,col="red")
> legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
> dev.off()

eps
> postscript("Lab2_plot.eps",width=500,height=440)

Simulation

Sampling
> sample(x,replace=TRUE) # put it back into the bag!

Distributions

Libraries!!

Collection of R functions that together perform a specialized analysis.

Install packages from CRAN
> install.packages("PackageName")

Loading libraries
> library(LibraryName)

Getting the documentation of a library
> library(help=LibraryName)

Listing all the available packages
> library()

Libraries!!

• www.bioconductor.org
  – A suite of R packages for Bioinformatics.
  – To use only Core packages
    • > source("http://bioconductor.org/biocLite.R")
    • > biocLite()
  – To use Core and Other packages
    • > source("http://bioconductor.org/biocLite.R")
    • > biocLite(c("pkg1", "pkg2",…,"pkgN"))

Exercise 1 – The empire strikes back: GOOG versus BAIDU

Plot historical stock price time series using prices from Yahoo Finance.

(a) Download and install tseries package.

(b) Include tseries package as a library in your code.

(c) Use get.hist.quote to download GOOG and BAIDU historical data.

(d) Plot both time series in the same panel and add a legend to the plot.

Exercise 2 – Challenging Challenger

On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-ring seals on the booster rockets. The ambient temperature was 36 degrees on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.

Exercise 2 – Challenging Challenger

(a) Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance?

(b) Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance?

(c) What's your conclusion? What do you think the scientists plotted before making the decision to fly that day? Just out of historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?