Using GC content to distinguish Phytophthora sequences from tomato sequences

Post on 17-Dec-2015



Mission #1

Calculate the GC content of each sequence in the Phytophthora-tomato interactome

We will use a Perl script to accomplish this mission.

Preparation

• Download the Perl script (gc.pl) from the class web site and store it in the C:/BioDownload folder

• Open Cygwin, Command Prompt (Vista users), or Terminal (Mac users)

• Change directory (cd) to the BioDownload folder

Running the script

perl gc.pl PhytophSeq1.txt phytoph_gc.out

Type a single space between the items in each command.

In Cygwin (Windows users) or Terminal (Mac users), check the output by counting the lines that contain a tab in the output file and the header lines that contain ">" in the input file:

grep --perl-regexp "\t" -c phytoph_gc.out

grep ">" -c PhytophSeq1.txt

Both commands should print the same number: 3921.
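The two counts above can also be compared automatically. The following is a minimal sketch, not part of the class material: the function name check_counts is hypothetical, and it searches for a literal tab instead of using --perl-regexp, on the assumption that not every grep supports that option.

```shell
# check_counts (hypothetical helper): compare the number of ">" header lines
# in the input file with the number of tab-containing lines in the output file.
check_counts() {
  fasta=$1
  table=$2
  tab=$(printf '\t')                      # a literal tab character
  headers=$(grep -c '>' "$fasta")         # one header line per sequence
  rows=$(grep -c "$tab" "$table")         # one tab-separated line per sequence
  if [ "$headers" -eq "$rows" ]; then
    echo "OK: $rows sequences in both files"
  else
    echo "MISMATCH: $headers headers but $rows output rows"
    return 1
  fi
}

# Usage, after the function has been defined in the current shell:
#   check_counts PhytophSeq1.txt phytoph_gc.out
```

If the counts differ, the function returns a non-zero status, so it can be used as a quick check before moving on.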

Results

The output file has two tab-separated columns: the name column and the GC content column.
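For readers curious about what such a calculation involves, here is a minimal sketch in shell with awk. It is not the class's gc.pl, whose source is not shown here. The function name gc_table, the one-decimal percentage format, the use of the first word of each header as the name, and the choice to count every letter toward the sequence length are all assumptions.

```shell
# gc_table (hypothetical helper, NOT the class's gc.pl):
# print "name<TAB>GC percent" for every record in a FASTA file.
gc_table() {
  awk '
    # Print the record collected so far, if there is one.
    function emit() {
      if (seen) {
        pct = (len > 0) ? 100 * gc / len : 0
        printf "%s\t%.1f\n", name, pct
      }
    }
    { sub(/\r$/, "") }                   # tolerate Windows line endings
    /^>/ {                               # a header line starts a new record
      emit()
      name = substr($1, 2)               # assumption: name = first word after ">"
      gc = 0; len = 0; seen = 1
      next
    }
    {
      seq = toupper($0)
      gsub(/[^A-Z]/, "", seq)            # keep letters only
      len += length(seq)                 # assumption: every letter counts toward length
      gc  += gsub(/[GC]/, "", seq)       # gsub returns the number of G/C letters removed
    }
    END { emit() }
  ' "$1"
}

# Usage, after the function has been defined in the current shell:
#   gc_table PhytophSeq1.txt > my_gc.out
```

Because the sketch makes its own formatting choices, its output may not match the file produced by gc.pl character for character.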

Mission #2

Build a histogram of the GC content values

We will use the R program to accomplish this mission.

http://www.r-project.org

[Screenshots: R instructions for Mac users, all Windows users, XP users, and Vista users]

getwd()

to find out which folder you are in now

setwd("c:/BioDownload")

to change the working directory to C:/BioDownload

setwd("/path/to/biodownload")

for Mac users

data<-read.table("phytoph_gc.out",sep="\t",header=FALSE)

to read in the data from the file phytoph_gc.out (your file name may be different)

data[1:10,]

to see the first 10 rows of the data frame “data”

gc<-data[,2]

to assign the values from the 2nd column of “data” to a new vector “gc”

summary(gc)

to get the summary of the values in the vector “gc”

hist(gc,breaks=58)

to draw a histogram of the values in the “gc” vector

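The breaks argument indicates how many cells (bins) you want for the histogram. The value 58 was calculated as 78.7 (max) - 21.2 (min) = 57.5, rounded up, so each bin of the histogram is about 1 GC content unit wide. R treats a single number given for breaks as a suggestion, so the histogram it draws may have a slightly different number of bins.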
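The arithmetic behind the breaks value can also be checked from the shell, outside R. The sketch below is an illustration under assumptions: the function name suggest_breaks is hypothetical, and it assumes the GC values sit in the second tab-separated column, as in the output file described earlier.

```shell
# suggest_breaks (hypothetical helper): report the smallest and largest value
# in column 2 of a tab-separated file, and their difference rounded up.
suggest_breaks() {
  awk -F '\t' '
    NF < 2 { next }                          # skip lines without a second column
    {
      v = $2 + 0                             # force a numeric value
      n++
      if (n == 1 || v < min) min = v
      if (n == 1 || v > max) max = v
    }
    END {
      span = max - min
      # round up to a whole number of 1-unit bins (span is never negative)
      bins = (span == int(span)) ? span : int(span) + 1
      printf "min=%.1f max=%.1f breaks=%d\n", min, max, bins
    }
  ' "$1"
}

# Usage, after the function has been defined in the current shell:
#   suggest_breaks phytoph_gc.out
```

The minimum and maximum it reports should agree with the output of summary(gc) in R.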

hist(gc,breaks=58,xlab="GC content",ylim=range(c(0,400)),main="Histogram of GC content of sequences\nin the Phytophthora-tomato interactome")

to make the histogram look better

pdf("gc_histogram.pdf")

hist(gc,breaks=58,xlab="GC content",ylim=range(c(0,400)),main="Histogram of GC content of sequences\nin the Phytophthora-tomato interactome")

dev.off()

to write the histogram to a PDF file, which is saved in the current working directory
