
The seqinr Package

September 27, 2007

Version 1.1-2

Date 2007-09-26

Title Biological Sequences Retrieval and Analysis

Author Delphine Charif and Jean Lobry and Anamaria Necsulea and Leonor Palmeira

Maintainer Simon Penel <[email protected]>

Depends R (>= 2.4.0)

Suggests ade4, segmented

Description Exploratory data analysis and data visualization for biological sequence (DNA and protein) data. Also includes utilities for sequence data management under the ACNUC system.

License GPL version 2 or newer

URL http://pbil.univ-lyon1.fr/software/SeqinR/seqinr_home.php, Mailing list: http://pbil.univ-lyon1.fr/software/SeqinR/seqinr3_angl.php

ZipData no

R topics documented:

AAstat, AnoukResult, EXP, G+C Content, GetFromSequence, SEQINR.UTIL, SeqAcnucWeb, SeqFastaAA, SeqFastadna, SeqFrag, a, aaa, aacost, aaindex, alllistranks, amb, c2s, chargaff, choosebank, closebank, comp, computePI, count, crelistfromclientdata, dia.bactgensize, dinucl, dist.alignment, dotPlot, dotchart.uco, draw.oriloc, draw.rearranged.oriloc, ec999, extract.breakpoints, extractseqs, gb2fasta, gbk2g2, gbk2g2.euk, get.db.growth, get.ncbi, getType, getlistrank, kaks, lseqinr, n2s, oriloc, permutation, plot.SeqAcnucWeb, pmw, prochlo, query, read.alignment, read.fasta, readfirstrec, readsmj, rearranged.oriloc, reverse.align, rot13, s2c, s2n, seqinr-package, splitseq, stresc, syncodons, synsequence, tablecode, toyaa, toycodon, translate, trimSpace, uco, ucoweight, words, words.pos, write.fasta, dinucleotides

AAstat To Get Some Protein Statistics

Description

Returns simple protein sequence information including the number of residues, the percentage of each physico-chemical class and the theoretical isoelectric point.

Usage

AAstat(seq, plot = TRUE)

Arguments

seq a protein sequence as a vector of upper-case chars

plot if TRUE, plots the presence of residues split by physico-chemical classes along the sequence.

Value

A list with the three following components:

Compo A factor giving the amino acid counts.

Prop A list giving the percentage of each physico-chemical class (Tiny, Small, Aliphatic, Aromatic, Non-polar, Polar, Charged, Positive, Negative).

Pi The theoretical isoelectric point
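
The Prop component can be pictured with a minimal base-R sketch. The helper class_percent and the class memberships below are assumptions made for illustration; they are not taken from AAstat itself, which should be consulted for the actual definitions.

```r
# Hypothetical helper, not part of seqinr: percentage of residues that
# belong to each class. Class memberships are illustrative assumptions.
class_percent <- function(prot, classes) {
  sapply(classes, function(members) 100 * mean(prot %in% members))
}

classes <- list(Tiny = c("A", "C", "G", "S", "T"),
                Charged = c("D", "E", "H", "K", "R"))

# A protein as a vector of upper-case chars, the input form AAstat expects
protein <- c("A", "C", "D", "E")
res <- class_percent(protein, classes)
print(res)
```

Because classes overlap in general, the percentages need not sum to 100.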

Author(s)

D. Charif, J.R. Lobry


References

citation("seqinr")

See Also

computePI, SEQINR.UTIL, SeqFastaAA

Examples

seqAA <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
AAstat(seqAA[[1]])

AnoukResult Expected numeric results for Ka and Ks computation

Description

This data set is what should be obtained when running kaks() on the test file Anouk.fasta in the sequences directory of the seqinR package.

Usage

data(AnoukResult)

Format

A list with 4 components of class dist.

ka Ka

ks Ks

vka variance for Ka

vks variance for Ks

Details

See the example in kaks.

Source

The fasta test file was provided by Anamaria Necsulea.

References

citation("seqinr")


EXP Vectors of coefficients to compute linear forms.

Description

This dataset is used to compute linear forms on codon frequencies: if freq is a vector of codon frequencies then drop(freq %*% EXP$CG3) will return, for instance, the G+C content in third codon positions. Base order is the lexical order: a, c, g, t (or u).

Usage

data(EXP)

Format

List of 24 vectors of coefficients

A num [1:4] 1 0 0 0

A3 num [1:64] 1 0 0 0 1 0 0 0 1 0 ...

AGZ num [1:64] 0 0 0 0 0 0 0 0 1 0 ...

ARG num [1:64] 0 0 0 0 0 0 0 0 1 0 ...

AU3 num [1:64] 1 0 0 1 1 0 0 1 1 0 ...

BC num [1:64] 0 1 0 0 0 0 0 0 0 0 ...

C num [1:4] 0 1 0 0

C3 num [1:64] 0 1 0 0 0 1 0 0 0 1 ...

CAI num [1:64] 0.00 0.00 -1.37 -2.98 -2.58 ...

CG num [1:4] 0 1 1 0

CG1 num [1:64] 0 0 0 0 0 0 0 0 0 0 ...

CG12 num [1:64] 0 0 0 0 0.5 0.5 0.5 0.5 0.5 0.5 ...

CG2 num [1:64] 0 0 0 0 1 1 1 1 1 1 ...

CG3 num [1:64] 0 1 1 0 0 1 1 0 0 1 ...

CGN num [1:64] 0 0 0 0 0 0 0 0 0 0 ...

F1 num [1:64] 1.026 0.239 1.026 0.239 -0.097 ...

G num [1:4] 0 0 1 0

G3 num [1:64] 0 0 1 0 0 0 1 0 0 0 ...

KD num [1:64] -3.9 -3.5 -3.9 -3.5 -0.7 -0.7 -0.7 -0.7 -4.5 -0.8 ...

Q num [1:64] 0 0 0 0 1 1 1 1 0 0 ...

QA3 num [1:64] 0 0 0 0 1 0 0 0 0 0 ...

QC3 num [1:64] 0 0 0 0 0 1 0 0 0 0 ...

U num [1:4] 0 0 0 1

U3 num [1:64] 0 0 0 1 0 0 0 1 0 0 ...


Details

It’s better to work directly at the amino-acid level when computing linear forms on amino-acid frequencies so as to have a single coefficient vector. For instance EXP$KD to compute the Kyte and Doolittle hydropathy index from codon frequencies is valid only for the standard genetic code.

An alternative for drop(freq %*% EXP$CG3) is sum(freq * EXP$CG3), but this is less efficient in terms of CPU time. The advantage of the latter, however, is that thanks to recycling rules you can use either sum(freq * EXP$A) or sum(freq * EXP$A3). To do the same with the %*% operator you have to make the recycling rule explicit, as in drop(freq %*% rep(EXP$A, 16)).
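
The equivalence of the two forms can be checked with a small base-R sketch. The vectors CG3, A and A3 below are hand-built stand-ins that follow the patterns printed in the Format section; they are assumptions, not the EXP data set itself, and the codon frequencies are random illustrative values.

```r
# Illustrative codon frequencies for the 64 codons in lexical order
set.seed(1)
freq <- runif(64)
freq <- freq / sum(freq)

# Assumed stand-ins for EXP$CG3, EXP$A and EXP$A3:
# the third codon position cycles through a, c, g, t
CG3 <- rep(c(0, 1, 1, 0), 16)
A   <- c(1, 0, 0, 0)
A3  <- rep(A, 16)

gc3_matrix <- drop(freq %*% CG3)   # linear form as a matrix product
gc3_sum    <- sum(freq * CG3)      # same value, element-wise product

a3_recycled <- sum(freq * A)              # A is recycled 16 times
a3_explicit <- drop(freq %*% rep(A, 16))  # recycling written out for %*%
```

Both pairs agree up to floating-point rounding; only the element-wise form benefits from implicit recycling.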

Source

ANALSEQ EXPFILEs for command EXP. http://biomserv.univ-lyon1.fr/doclogi/docanals/manuel.html

References

citation("seqinr")

A content in A nucleotide

A3 content in A nucleotide in third position of codon

AGZ Arg content (aga and agg codons)

ARG Arg content

AU3 content in A and U nucleotides in third position of codon

BC Good choice (Bon choix). Gouy M., Gautier C. (1982) Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Research, 10(22):7055-7074.

C content in C nucleotides

C3 content in C nucleotides in third position of codon

CAI Codon adaptation index for E. coli. Sharp, P.M., Li, W.-H. (1987) The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15:1281-1295.

CG content in G + C nucleotides

CG1 content in G + C nucleotides in first position of codon

CG12 content in G + C nucleotides in first and second position of codon

CG2 content in G + C nucleotides in second position of codon

CG3 content in G + C nucleotides in third position of codon

CGN content in CGA + CGC + CGG + CGU

F1 From Table 2 in Lobry, J.R., Gautier, C. (1994) Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Research, 22:3174-3180.

G3 content in G nucleotides in third position of codon

KD Kyte, J., Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157:105-132.


Q content in quartet

QA3 content in quartet with the A nucleotide in third position

QC3 content in quartet with the C nucleotide in third position

U content in U nucleotide

U3 content in U nucleotides in third position of codon

Examples

data(EXP)

G+C Content Calculates the fractional G+C content of nucleic acid sequences.

Description

Calculates the fraction of G+C bases of the input nucleic acid sequence(s). It reads in nucleic acid sequences, sums the number of ’g’ and ’c’ bases and writes out the result as the fraction (in the interval 0.0 to 1.0) to the total number of ’a’, ’c’, ’g’ and ’t’ bases. Global G+C content GC, G+C in the first position of the codon bases GC1, G+C in the second position of the codon bases GC2, and G+C in the third position of the codon bases GC3 can be computed. All functions can take ambiguous bases into account when requested.

Usage

GC(seq, forceToLower = TRUE, exact = FALSE, oldGC = FALSE)
GC1(seq, ...)
GC2(seq, ...)
GC3(seq, ...)

Arguments

seq a nucleic acid sequence as a vector of single characters

forceToLower logical defaulting to TRUE: force sequence characters to lower-case. Turn this to FALSE to save time if your sequence is already in lower-case

exact logical defaulting to FALSE: should ambiguous bases be taken into account when computing the G+C content (see details)

oldGC logical defaulting to FALSE: should the GC content be computed as in seqinR <= 1.0-6, that is as the sum of ’g’ and ’c’ bases divided by the length of the sequence

... arguments passed to the function GC


Details

When exact is set to TRUE the G+C content is estimated with ambiguous bases taken into account. Note that this is time expensive. A first pass is made on non-ambiguous bases to estimate the probabilities of the four bases in the sequence. They are then used to weight the contributions of ambiguous bases to the G+C content. Let nx denote the total number of bases ’x’ in the sequence. For instance, suppose that there are nb bases ’b’. ’b’ stands for "not a", that is for ’c’, ’g’ or ’t’. The contribution of ’b’ bases to the GC base count will be:

nb*(nc + ng)/(nc + ng + nt)

The contribution of ’b’ bases to the AT base count will be:

nb*nt/(nc + ng + nt)

All ambiguous base contributions to the AT and GC counts are weighted in a similar way, and then the G+C content is computed as ngc/(nat + ngc).
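
As a worked illustration of the weighting, here is a minimal base-R sketch restricted to the single ambiguous symbol ’b’. The helper gc_with_b is hypothetical and is not the code used by GC(); it only applies the two formulas above.

```r
# Hypothetical helper: G+C fraction when the only ambiguous symbol is 'b'.
# Assumes at least one of c, g, t is present (otherwise division by zero).
gc_with_b <- function(dna) {
  n <- sapply(c("a", "c", "g", "t", "b"), function(x) sum(dna == x))
  not_a <- n[["c"]] + n[["g"]] + n[["t"]]
  b_to_gc <- n[["b"]] * (n[["c"]] + n[["g"]]) / not_a  # share of 'b' counted as GC
  b_to_at <- n[["b"]] * n[["t"]] / not_a               # share of 'b' counted as AT
  ngc <- n[["c"]] + n[["g"]] + b_to_gc
  nat <- n[["a"]] + n[["t"]] + b_to_at
  ngc / (nat + ngc)
}

# One of each plain base plus one 'b': ngc = 2 + 2/3, nat = 2 + 1/3
gc_b <- gc_with_b(strsplit("acgtb", "")[[1]])
```

Here gc_b equals 8/15, slightly above the 0.5 obtained when the ’b’ is ignored, because two of the three bases that ’b’ can stand for are G or C.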

Value

GC returns the fraction of G+C as a numeric vector, GC1 returns the fraction of G+C at first codon position as a numeric vector, GC2 returns the fraction of G+C at second codon position as a numeric vector, GC3 returns the fraction of G+C at third codon position as a numeric vector.
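
The relation between the four values can be sketched in base R by subsetting every third base. The helpers gc_fraction and gc_at_position are hypothetical names introduced here, and the sketch assumes a lower-case sequence without ambiguous bases; it is not the package implementation.

```r
# Hypothetical helpers; assume lower-case input with only a, c, g, t
gc_fraction <- function(dna) mean(dna %in% c("g", "c"))

gc_at_position <- function(dna, pos) {
  # keep bases pos, pos + 3, pos + 6, ... (one codon position)
  gc_fraction(dna[seq(pos, length(dna), by = 3)])
}

dna <- strsplit("agtctggggggccccttttaagtagatagatagctagtcgta", "")[[1]]
gc_all <- gc_fraction(dna)        # 20 of 42 bases
gc_p1  <- gc_at_position(dna, 1)  # 9 of 14 first positions
gc_p2  <- gc_at_position(dna, 2)  # 5 of 14 second positions
gc_p3  <- gc_at_position(dna, 3)  # 6 of 14 third positions
```

On this sequence the sketch gives the same fractions as the values quoted in the Examples section for GC, GC1, GC2 and GC3.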

Author(s)

D. Charif and L. Palmeira and J.R. Lobry

References

citation("seqinr").

The program codonW used here for comparison is available at http://codonw.sourceforge.net/.

See Also

You can use s2c to convert a string into a vector of single characters and tolower to convert upper-case characters into lower-case characters. Do not confuse with gc for garbage collection.

Examples

mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence) # 0.4761905
GC1(mysequence) # 0.6428571
GC2(mysequence) # 0.3571429
GC3(mysequence) # 0.4285714

## With upper-case characters:
#
myUCsequence <- s2c("GGGGGGGGGA")
GC(myUCsequence) # 0.9

## With ambiguous bases:
#
GC(s2c("acgt")) # 0.5
GC(s2c("acgtssss")) # 0.5
GC(s2c("acgtssss"), exact = TRUE) # 0.75

## How to reproduce the results obtained with the C program codonW
# version 1.4.4 written by John Peden. We use here the "input.dat"
# test file from codonW (there are no ambiguous bases in these
# sequences).
#
inputdatfile <- system.file("sequences/input.dat", package = "seqinr")
input <- read.fasta(file = inputdatfile) # read the FASTA file
inputoutfile <- system.file("sequences/input.out", package = "seqinr")
input.res <- read.table(inputoutfile, header = TRUE) # read codonW result file

# remove stop codon before computing G+C content (as in codonW)

GC.codonW <- function(dnaseq, ...){
  GC(dnaseq[seq_len(length(dnaseq) - 3)], ...)
}
input.gc <- sapply(input, GC.codonW, forceToLower = FALSE)
max(abs(input.gc - input.res$GC)) # 0.0004946237

plot(x = input.gc, y = input.res$GC, las = 1,
  xlab = "Results with GC()", ylab = "Results from codonW",
  main = "Comparison of G+C content results")
abline(c(0, 1), col = "red")
legend("topleft", inset = 0.01, legend = "y = x", lty = 1, col = "red")

GetFromSequence Generic Functions to obtain annotation, fragment, associated keyword(s), length, location, name, sequence, or translation for a sequence

Description

All methods apply on sequences of class SeqAcnucWeb.

getAnnot, getFrag, getLength, getName, getSequence, and getTrans can also apply on sequences of classes SeqFastadna and SeqFastaAA.

getFrag, getLength, getName, getSequence and getTrans can moreover apply on sequences of class SeqFrag.

Usage

getAnnot(object, nbl = 10000)
getFrag(object, begin, end)
getKeyword(object)
getLength(object)
getLocation(object)
getName(object)
getSequence(object, as.string = FALSE)
getTrans(object, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)

Arguments

object an object of the class SeqAcnucWeb, of the classes SeqFastadna or SeqFastaAA, or of the class SeqFrag

nbl the maximum number of lines of annotation to read. Reading of lines stops when nbl lines have been transmitted or at the last annotation line of the sequence (SQ or ORIGIN line).

begin First base

end Last base

as.string if TRUE sequences are returned as a string instead of a vector of chars.

frame Frame(s) (0,1,2) to translate. By default the frame 0 is used.

sens Sense to translate: F for forward sense and R for reverse sense.

numcode The number of the code to use. By default the standard genetic code is used.

NAstring How to translate amino-acids when there are ambiguous bases in codons.

ambiguous If TRUE, ambiguous bases are taken into account so that for instance GGN is translated to Gly in the standard genetic code.

Value

getAnnot returns a vector of character strings containing the annotation.

getFrag returns an object of class SeqFrag, which is a vector of chars with many attributes (see SeqFrag).

getKeyword returns a vector of strings containing the keyword(s) associated with a sequence.

getLength returns a numeric vector giving the length of the sequence.

getLocation returns a list giving the positions of the sequence on the parent sequence. If the sequence is a subsequence (e.g. coding sequence), the function will return the position of each exon on the parent sequence.

getName returns a character string containing the name of the sequence.

getSequence returns a vector of chars containing the sequence (default) or a string when as.string is set to TRUE.

getTrans returns a vector of chars containing the translated sequence.
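
The two return forms (vector of chars versus single string) and the begin/end convention can be pictured with base-R stand-ins. This is only a sketch of the data shapes under the assumption that begin and end are 1-based and inclusive; it does not call the seqinr methods and does not reproduce the attributes of a SeqFrag object.

```r
# Base-R stand-ins, not the seqinr generics themselves
chars <- strsplit("acgttgatgctagc", "")[[1]]  # sequence as a vector of chars
frag <- chars[1:10]                           # analogous to begin = 1, end = 10
frag_string <- paste(frag, collapse = "")     # analogous to as.string = TRUE
```

With a real sequence object the equivalent calls would be getFrag followed by getSequence, as in the Examples section.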

Author(s)

D. Charif and J.R. Lobry and L. Palmeira

References

citation("seqinr")


See Also

query, SeqAcnucWeb, c2s, translate

Examples

###
### List all available methods for getSequence generic function:
###
methods(getSequence)

## SeqAcnucWeb class example:
#
## Not run:
# Need internet connection for getSequence.SeqAcnucWeb()
choosebank("emblTP")
query("fc", "sp=felis catus et t=cds")
getSequence(fc$req[[1]])
getSequence(fc$req[[1]], as.string = TRUE)

## End(Not run)
## SeqFastaAA class example:
#
aafile <- system.file("sequences/seqAA.fasta", package = "seqinr")
sfaa <- read.fasta(aafile, seqtype = "AA")
getSequence(sfaa[[1]])
getSequence(sfaa[[1]], as.string = TRUE)

## SeqFastadna class example:
#
dnafile <- system.file("sequences/malM.fasta", package = "seqinr")
sfdna <- read.fasta(file = dnafile)
getSequence(sfdna[[1]])
getSequence(sfdna[[1]], as.string = TRUE)

## SeqFrag class example:
#
sfrag <- getFrag(object = sfdna[[1]], begin = 1, end = 10)
getSequence(sfrag)
getSequence(sfrag, as.string = TRUE)

## Default getSequence method example:
#
getSequence(letters)
getSequence(letters, as.string = TRUE)

###
### List all available methods for getAnnot generic function:
###
methods(getAnnot)

## SeqAcnucWeb class example:
#
## Not run:
# Need internet connection for getAnnot.SeqAcnucWeb()
choosebank("emblTP")
query("fc", "sp=felis catus et t=cds")
annots <- getAnnot(fc$req[[1]])
cat(annots, sep = "\n")

## End(Not run)
## SeqFastaAA class example:
#
getAnnot(sfaa[[1]])

## SeqFastadna class example:
#
getAnnot(sfdna[[1]])

## Default getAnnot method example:
#
## Not run:
# An error is produced because there are no annotations by default
getAnnot(letters)

## End(Not run)

###
### List all available methods for getKeyword generic function:
###
methods(getKeyword)

## SeqAcnucWeb class example:
#
## Not run:
# Need internet connection for getKeyword.SeqAcnucWeb()
choosebank("emblTP")
query("fc", "sp=felis catus et t=cds")
getKeyword(fc$req[[1]])
# Should be: [1] "DIVISION ORG" "RELEASE 62" "CYTOCHROME B" "SOURCE" "CDS"

## End(Not run)
## Default getKeyword method example:
#
## Not run:
# An error is produced because there are no keywords by default
getKeyword(letters)

## End(Not run)

###
### List all available methods for getFrag generic function:
###
methods(getFrag)

###
### List all available methods for getLength generic function:
###
methods(getLength)

###
### List all available methods for getLocation generic function:
###
methods(getLocation)

###
### List all available methods for getName generic function:
###
methods(getName)

## SeqAcnucWeb class example:
#
## Not run:
# Need internet connection for getName.SeqAcnucWeb()
choosebank("emblTP")
query("fc", "sp=felis catus et t=cds")
sapply(fc$req, getName)

## End(Not run)
###
### List all available methods for getTrans generic function:
###
methods(getTrans)

SEQINR.UTIL utility data for seqinr

Description

This data set gives the genetic code, the name of each codon, the IUPAC one-letter code for amino acids, the physico-chemical class of each amino acid and the pK values of amino acids described in Bjellqvist et al. (1993).

Usage

data(SEQINR.UTIL)

Format

SEQINR.UTIL is a list containing the following 4 objects:

CODES.NCBI is a data frame containing the genetic codes: the standard (’Universal’) genetic code with a selection of non-standard codes.

CODON.AA is a three-column data frame. The first column is a factor containing the codon. The second column is a factor giving the amino acid name for each codon. The last column is a factor giving the IUPAC one-letter code for amino acids.

AA.PROPERTY is a list giving the physico-chemical class of amino acids. The different classes are the following: Tiny, Small, Aliphatic, Aromatic, Non.polar, Polar, Charged, Basic, Acidic.

pK is a data frame. It gives the pK values of amino acids described in Bjellqvist et al. (1993), which were defined by examining polypeptide migration between pH 4.5 and 7.3 in an immobilised pH gradient gel environment with 9.2M and 9.8M urea at 15 or 25 degrees.

Source

Data prepared by D. Charif 〈[email protected]〉.

The genetic codes have been taken from the NCBI taxonomy database: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c. Last update October 05, 2000.

The IUPAC one-letter code for amino acids is described at: http://www.chem.qmul.ac.uk/iupac/AminoAcid/.

pK values of amino acids were taken from Bjellqvist et al.

Bjellqvist, B., Hughes, G.J., Pasquali, Ch., Paquet, N., Ravier, F., Sanchez, J.-Ch., Frutiger, S. & Hochstrasser, D.F. (1993) The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis, 14, 1023-1031.

References

citation("seqinr")

Examples

data(SEQINR.UTIL)

SeqAcnucWeb Sequence coming from an ACNUC database located on the web

Description

as.SeqAcnucWeb is called by many functions, for instance by query.SeqAcnucWeb, and should not be called directly by the user. It creates an object of class SeqAcnucWeb. is.SeqAcnucWeb returns TRUE if the object is of class SeqAcnucWeb.

Usage

as.SeqAcnucWeb(object, length, frame, ncbigc, socket = FALSE)
is.SeqAcnucWeb(object)


Arguments

object a string giving the name of a sequence present in the database

length a string giving the length of the sequence present in the database

frame a string giving the frame of the sequence present in the database

ncbigc a string giving the NCBI genetic code of the sequence present in the database

socket an object of class socket

Value

as.SeqAcnucWeb returns a sequence object of class SeqAcnucWeb

Author(s)

D. Charif

References

citation("seqinr")

Examples

## Not run: s = choosebank.socket("genbank")
## Not run: query.socket(s$socket,"felis","sp=felis catus et t=cds et o=mitochondrion")
## Not run: is.SeqAcnucWeb(felis$req[[1]])

SeqFastaAA AA sequence in Fasta Format

Description

as.SeqFastaAA is called by functions such as read.fasta. It creates an object of class SeqFastaAA. is.SeqFastaAA returns TRUE if the object is of class SeqFastaAA. summary.SeqFastaAA gives the AA composition of an object of class SeqFastaAA.

Usage

as.SeqFastaAA(object, name = NULL, Annot = NULL)
is.SeqFastaAA(object)
## S3 method for class 'SeqFastaAA':
summary(object,...)

Arguments

object a vector of chars representing a biological sequence

name NULL or a character string specifying a name for the sequence

Annot NULL or a character string specifying some annotations for the sequence

... additional arguments affecting the summary produced


Value

as.SeqFastaAA returns a sequence object of class SeqFastaAA. summary.SeqFastaAA returns a list with the following components:

composition the AA counts of the sequence

AA.Property the percentage of each group of amino acids in the sequence. For example, the groups are small, tiny, aliphatic, aromatic ...

Author(s)

D. Charif

References

citation("seqinr")

Examples

s <- read.fasta(File = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype="AA")
is.SeqFastaAA(s[[1]])
summary(s[[1]])
myseq <- s2c("MSPTAYRRGSPAFLV*")
as.SeqFastaAA(myseq, name = "myseq", Annot = "blablabla")
myseq

SeqFastadna Class for DNA sequence in Fasta Format

Description

as.SeqFastadna is called by many functions such as read.fasta. It creates an object of class SeqFastadna. is.SeqFastadna returns TRUE if the object is of class SeqFastadna. summary.SeqFastadna gives the base composition of an object of class SeqFastadna.

Usage

as.SeqFastadna(object, name = NULL, Annot = NULL)
is.SeqFastadna(object)
## S3 method for class 'SeqFastadna':
summary(object, alphabet = s2c("acgt"), ...)

Arguments

object a vector of chars representing a biological sequence

name NULL or a character string specifying a name for the sequence

Annot NULL or a character string specifying some annotations for the sequence

... additional arguments affecting the summary produced

alphabet a vector of single characters


Value

as.SeqFastadna returns a sequence object of class SeqFastadna. summary.SeqFastadna returns a list with the following components:

length the length of the sequence

compo the base counts of the sequence

GC the percentage of G+C in the sequence

Author(s)

D. Charif

References

citation("seqinr")

Examples

s <- read.fasta(system.file("sequences/malM.fasta",package="seqinr"))
is.SeqFastadna(s[[1]])
summary(s[[1]])
myseq <- s2c("acgttgatgctagctagcatcgat")
as.SeqFastadna(myseq, name = "myseq", Annot = "blablabla")
myseq

SeqFrag Class for sub-sequences

Description

as.SeqFrag is called by all methods of getFrag, but not directly by users. It creates a sequence object of class SeqFrag.

Usage

as.SeqFrag(object, begin, end, compl = FALSE, name = "frag")
is.SeqFrag(object)

Arguments

object a sequence object of class SeqFastadna, SeqFastaAA, SeqAcnucWeb or SeqFrag

begin the first base of the fragment to get

end the last base of the fragment to get

compl if TRUE, you must give a name for the sequence

name the name of the fragment of the sequence


Value

as.SeqFrag returns a biological sequence represented by a vector of chars with the following attributes:

seqMother the name of the sequence from which the sequence comes

begin the position of the first base of the fragment on the mother sequence

end the position of the last base of the fragment on the mother sequence

class SeqFrag which is the newest class of the sequence

is.SeqFrag returns TRUE if the object is of class SeqFrag.

Author(s)

D. Charif

References

citation("seqinr")

See Also

getFrag

Examples

s = read.fasta(File=system.file("sequences/malM.fasta",package = "seqinr"))
getFrag(s[[1]],1,10)

a Converts amino-acid three-letter code into the one-letter one

Description

This is a vectorized function to convert the three-letter amino-acid code into the one-letter one, for instance "Ala" into "A".

Usage

a(aa)

Arguments

aa A vector of strings. All strings are 3 chars long.

Details

Allowed character values for aa are given by aaa(). All other values will generate a warning and return NA. Called without arguments, a() returns the list of all possible output values.
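
The behaviour described above (vectorized lookup, NA plus a warning for values that are not allowed) can be sketched in base R with a named vector. The function three_to_one and its four-entry table are hypothetical and deliberately partial; they are not the implementation of a().

```r
# Hypothetical sketch of a vectorized three-letter to one-letter lookup
three_to_one <- function(aa) {
  # Partial table for illustration only; the real table covers all codes
  code <- c(Ala = "A", Gly = "G", Ser = "S", Trp = "W")
  out <- unname(code[aa])  # unknown names index to NA
  if (anyNA(out)) warning("unknown three-letter code(s) converted to NA")
  out
}

known <- three_to_one(c("Ala", "Trp"))
unknown <- suppressWarnings(three_to_one("SOS"))  # NA, warning silenced here
```

Indexing a named vector keeps the function vectorized without an explicit loop, and anything absent from the table falls through to NA.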


Value

A vector of single characters.

Author(s)


References

The IUPAC one-letter code for amino acids is described at: http://www.chem.qmul.ac.uk/iupac/AminoAcid/

citation("seqinr")

See Also

aaa, translate

Examples

## Show all possible input values:
#

aaa()

## Convert them into one-letter code:
#

a(aaa())

## Check consistency of results:
#

stopifnot( aaa(a(aaa())) == aaa())

## Show what happens with non-allowed values:
#

a("SOS") # should be NA and a warning is generated

aaa Converts amino-acid one-letter code into the three-letter one

Description

This is a vectorized function to convert one-letter amino-acid code into the three-letter one, for instance "A" into "Ala".




Usage

aaa(aa)

Arguments

aa A vector of single characters.

Details

Allowed character values for aa are given by a(). All other values will generate a warning and return NA. Called without arguments, aaa() returns the list of all possible output values.

Value

A vector of character strings. All strings are 3 chars long.

Author(s)

J.R. Lobry

References

The IUPAC one-letter code for amino acids is described at: http://www.chem.qmul.ac.uk/iupac/AminoAcid/

citation("seqinr")

See Also

a, translate

Examples

## Show all possible input values:
#

a()

## Convert them into three-letter code:
#

aaa(a())

## Check consistency of results:
#

stopifnot(a(aaa(a())) == a())

## Show what happens with non-allowed values:
#

aaa("Z") # should be NA and a warning is generated

aacost Aerobic cost of amino-acids in Escherichia coli and G+C classes

Description

The metabolic cost of amino-acid biosynthesis in E. coli under aerobic conditions from table 1 in Akashi and Gojobori (2002). The G+C classes are from Lobry (1997).

Usage

data(aacost)

Format

A data frame with 20 rows for the amino-acids and the following 7 columns:

aaa amino-acid (three-letters code).

a amino-acid (one-letter code).

prec precursor metabolites (see details).

p number of high-energy phosphate bonds contained in ATP and GTP molecules.

h number of available hydrogen atoms carried in NADH, NADPH, and FADH2 molecules.

tot total metabolic cost assuming 2 high-energy phosphate bonds per hydrogen atom.

gc an ordered factor (l<m<h) for the G+C class of the amino-acid (see details)
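The description of the tot column implies a simple conversion: each available hydrogen atom is counted as 2 high-energy phosphate bonds and added to p. A minimal sketch of that arithmetic follows; the helper total_cost and its input values are hypothetical and are not taken from the aacost table.

```r
# Hypothetical helper: total cost in high-energy phosphate bond equivalents,
# assuming each available hydrogen atom is worth 2 phosphate bonds.
total_cost <- function(p, h, bonds_per_h = 2) {
  p + bonds_per_h * h
}

# Illustrative values only (not taken from the aacost table).
total_cost(p = c(1, 3), h = c(5, 4))
```

With the package loaded, the same expression could be compared against the stored column, for example with(aacost, p + 2 * h), although small rounding differences from tot may be possible.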

Details

Precursor metabolites: penP, ribose 5-phosphate; PRPP, 5-phosphoribosyl pyrophosphate; eryP, erythrose 4-phosphate; 3pg, 3-phosphoglycerate; pep, phosphoenolpyruvate; pyr, pyruvate; acCoA, acetyl-CoA; akg, alpha-ketoglutarate; oaa, oxaloacetate. Negative signs on precursor metabolites indicate chemicals gained through biosynthetic pathways. Costs of precursors reflect averages for growth on glucose, acetate, and malate (see Table 6 in the supporting information from Akashi and Gojobori 2002).

The levels l<m<h for the gc ordered factor stand for Low G+C, Middle G+C, and High G+C amino-acids, respectively. The frequencies of Low G+C amino-acids decrease monotonically with G+C content. The frequencies of High G+C amino-acids increase monotonically with G+C content. The frequencies of Middle G+C amino-acids first increase and then decrease with G+C content. These G+C classes are from Lobry (1997).

example(aacost) reproduces figure 2 from Lobry (2004).
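For readers less familiar with ordered factors, the sketch below shows how an l<m<h factor behaves in base R. The vector gc_class and its class labels are illustrative placeholders; the real assignments are those stored in aacost$gc.

```r
# Illustrative class labels only; the real assignments are in aacost$gc.
gc_class <- factor(c("l", "h", "m", "l"),
                   levels = c("l", "m", "h"), ordered = TRUE)

# Comparisons follow the level order l < m < h.
below_high <- gc_class < "h"

# Counts are reported in level order, not alphabetical order.
counts <- table(gc_class)

# Levels can be relabelled for plotting without changing the order.
levels(gc_class) <- c("low G+C", "mid G+C", "high G+C")
```

Because the factor is ordered, functions such as stripchart and boxplot lay out the groups from low to high G+C rather than alphabetically.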

Source

Akashi, H., Gojobori, T. (2002) Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proceedings of the National Academy of Sciences of the United States of America, 99:3695-3700.

Lobry, J.R. (1997) Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene, 205:309-316.

Lobry, J.R. (2004) Life history traits and genome structure: aerobiosis and G+C content in bacteria. Lecture Notes in Computer Sciences, 3039:679-686.

References

citation("seqinr")

Examples

data(aacost)
levels(aacost$gc) <- c("low G+C", "mid G+C", "high G+C")
stripchart(aacost$tot ~ aacost$gc, pch = 19, ylim = c(0.5, 3.5),
  xlim = c(0, max(aacost$tot)),
  xlab = "Metabolic cost (high-energy phosphate bonds equivalent)",
  main = "Metabolic cost of the 20 amino-acids\nas function of their G+C class")
boxplot(aacost$tot ~ aacost$gc, horizontal = TRUE, add = TRUE)

aaindex List of 544 physicochemical and biological properties for the 20 amino-acids

Description

Data were imported from release 9.1 (AUG 2006) of the aaindex1 database. See the reference section to cite this database in a publication.

Usage

data(aaindex)

Format

A named list with 544 elements having each the following components:

H String: Accession number in the aaindex database.

D String: Data description.

R String: LITDB entry number.

A String: Author(s).

T String: Title of the article.

J String: Journal reference and comments.

C String: Accession numbers of similar entries with the correlation coefficients of 0.8 (-0.8) or more (less). Notice: The correlation coefficient is calculated with zeros filled for missing values.

I Numeric named vector: amino acid index data.
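To make the structure above concrete, here is a small sketch built on a mock list that mimics the component names H, D and I. The object mock, the entry names ENTRY1 and ENTRY2, the helper zero_fill, and all values are hypothetical placeholders, not data from the aaindex database.

```r
# Hypothetical two-entry mock with the same component names as aaindex;
# accession numbers, descriptions and values are placeholders.
mock <- list(
  ENTRY1 = list(H = "ENTRY1", D = "First placeholder index",
                I = c(Ala = 1.0, Arg = NA, Asn = 3.0)),
  ENTRY2 = list(H = "ENTRY2", D = "Second placeholder index",
                I = c(Ala = 2.0, Arg = 5.0, Asn = 6.0))
)

# One description per entry, named by list element.
descriptions <- vapply(mock, function(x) x$D, character(1))

# Correlation with missing values replaced by zeros, following the
# convention stated in the note on the C component.
zero_fill <- function(v) {
  v[is.na(v)] <- 0
  v
}
r <- cor(zero_fill(mock$ENTRY1$I), zero_fill(mock$ENTRY2$I))
```

With seqinr loaded and data(aaindex) run, the same vapply call applied to aaindex would be expected to return one description per entry. Note that zero filling can pull the coefficient away from the value computed on complete cases only.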

Details

A short description of each entry is available under the D component:

alpha-CH chemical shifts (Andersen et al., 1992)
Hydrophobicity index (Argos et al., 1982)
Signal sequence helical potential (Argos et al., 1982)
Membrane-buried preference parameters (Argos et al., 1982)
Conformational parameter of inner helix (Beghin-Dirkx, 1975)
Conformational parameter of beta-structure (Beghin-Dirkx, 1975)
Conformational parameter of beta-turn (Beghin-Dirkx, 1975)
Average flexibility indices (Bhaskaran-Ponnuswamy, 1988)
Residue volume (Bigelow, 1967)
Information value for accessibility; average fraction 35
Information value for accessibility; average fraction 23
Retention coefficient in TFA (Browne et al., 1982)
Retention coefficient in HFBA (Browne et al., 1982)
Transfer free energy to surface (Bull-Breese, 1974)
Apparent partial specific volume (Bull-Breese, 1974)
alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
alpha-CH chemical shifts (Bundi-Wuthrich, 1979)
Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
Normalized frequency of alpha-helix (Burgess et al., 1974)
Normalized frequency of extended structure (Burgess et al., 1974)
Steric parameter (Charton, 1981)
Polarizability parameter (Charton-Charton, 1982)
Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
The Chou-Fasman parameter of the coil conformation (Charton-Charton, 1983)
A parameter defined from the residuals obtained from the best correlation of the Chou-Fasman parameter of beta-sheet (Charton-Charton, 1983)
The number of atoms in the side chain labelled 1+1 (Charton-Charton, 1983)
The number of atoms in the side chain labelled 2+1 (Charton-Charton, 1983)
The number of atoms in the side chain labelled 3+1 (Charton-Charton, 1983)
The number of bonds in the longest chain (Charton-Charton, 1983)
A parameter of charge transfer capability (Charton-Charton, 1983)
A parameter of charge transfer donor capability (Charton-Charton, 1983)
Average volume of buried residue (Chothia, 1975)
Residue accessible surface area in tripeptide (Chothia, 1976)
Residue accessible surface area in folded protein (Chothia, 1976)
Proportion of residues 95
Proportion of residues 100
Normalized frequency of beta-turn (Chou-Fasman, 1978a)
Normalized frequency of alpha-helix (Chou-Fasman, 1978b)
Normalized frequency of beta-sheet (Chou-Fasman, 1978b)
Normalized frequency of beta-turn (Chou-Fasman, 1978b)
Normalized frequency of N-terminal helix (Chou-Fasman, 1978b)
Normalized frequency of C-terminal helix (Chou-Fasman, 1978b)
Normalized frequency of N-terminal non helical region (Chou-Fasman, 1978b)
Normalized frequency of C-terminal non helical region (Chou-Fasman, 1978b)
Normalized frequency of N-terminal beta-sheet (Chou-Fasman, 1978b)
Normalized frequency of C-terminal beta-sheet (Chou-Fasman, 1978b)
Normalized frequency of N-terminal non beta region (Chou-Fasman, 1978b)
Normalized frequency of C-terminal non beta region (Chou-Fasman, 1978b)
Frequency of the 1st residue in turn (Chou-Fasman, 1978b)
Frequency of the 2nd residue in turn (Chou-Fasman, 1978b)
Frequency of the 3rd residue in turn (Chou-Fasman, 1978b)
Frequency of the 4th residue in turn (Chou-Fasman, 1978b)
Normalized frequency of the 2nd and 3rd residues in turn (Chou-Fasman, 1978b)
Normalized hydrophobicity scales for alpha-proteins (Cid et al., 1992)
Normalized hydrophobicity scales for beta-proteins (Cid et al., 1992)
Normalized hydrophobicity scales for alpha+beta-proteins (Cid et al., 1992)
Normalized hydrophobicity scales for alpha/beta-proteins (Cid et al., 1992)
Normalized average hydrophobicity scales (Cid et al., 1992)
Partial specific volume (Cohn-Edsall, 1943)
Normalized frequency of middle helix (Crawford et al., 1973)
Normalized frequency of beta-sheet (Crawford et al., 1973)
Normalized frequency of turn (Crawford et al., 1973)
Size (Dawson, 1972)
Amino acid composition (Dayhoff et al., 1978a)
Relative mutability (Dayhoff et al., 1978b)
Membrane preference for cytochrome b: MPH89 (Degli Esposti et al., 1990)
Average membrane preference: AMP07 (Degli Esposti et al., 1990)
Consensus normalized hydrophobicity scale (Eisenberg, 1984)
Solvation free energy (Eisenberg-McLachlan, 1986)
Atom-based hydrophobic moment (Eisenberg-McLachlan, 1986)
Direction of hydrophobic moment (Eisenberg-McLachlan, 1986)
Molecular weight (Fasman, 1976)
Melting point (Fasman, 1976)
Optical rotation (Fasman, 1976)
pK-N (Fasman, 1976)
pK-C (Fasman, 1976)
Hydrophobic parameter pi (Fauchere-Pliska, 1983)
Graph shape index (Fauchere et al., 1988)
Smoothed upsilon steric parameter (Fauchere et al., 1988)
Normalized van der Waals volume (Fauchere et al., 1988)
STERIMOL length of the side chain (Fauchere et al., 1988)
STERIMOL minimum width of the side chain (Fauchere et al., 1988)
STERIMOL maximum width of the side chain (Fauchere et al., 1988)
N.m.r. chemical shift of alpha-carbon (Fauchere et al., 1988)
Localized electrical effect (Fauchere et al., 1988)
Number of hydrogen bond donors (Fauchere et al., 1988)
Number of full nonbonding orbitals (Fauchere et al., 1988)
Positive charge (Fauchere et al., 1988)
Negative charge (Fauchere et al., 1988)
pK-a(RCOOH) (Fauchere et al., 1988)
Helix-coil equilibrium constant (Finkelstein-Ptitsyn, 1977)
Helix initiation parameter at posision i-1 (Finkelstein et al., 1991)
Helix initiation parameter at posision i,i+1,i+2 (Finkelstein et al., 1991)
Helix termination parameter at posision j-2,j-1,j (Finkelstein et al., 1991)
Helix termination parameter at posision j+1 (Finkelstein et al., 1991)
Partition coefficient (Garel et al., 1973)
Alpha-helix indices (Geisow-Roberts, 1980)
Alpha-helix indices for alpha-proteins (Geisow-Roberts, 1980)
Alpha-helix indices for beta-proteins (Geisow-Roberts, 1980)
Alpha-helix indices for alpha/beta-proteins (Geisow-Roberts, 1980)
Beta-strand indices (Geisow-Roberts, 1980)
Beta-strand indices for beta-proteins (Geisow-Roberts, 1980)
Beta-strand indices for alpha/beta-proteins (Geisow-Roberts, 1980)
Aperiodic indices (Geisow-Roberts, 1980)
Aperiodic indices for alpha-proteins (Geisow-Roberts, 1980)
Aperiodic indices for beta-proteins (Geisow-Roberts, 1980)
Aperiodic indices for alpha/beta-proteins (Geisow-Roberts, 1980)
Hydrophobicity factor (Goldsack-Chalifoux, 1973)
Residue volume (Goldsack-Chalifoux, 1973)
Composition (Grantham, 1974)
Polarity (Grantham, 1974)
Volume (Grantham, 1974)
Partition energy (Guy, 1985)
Hydration number (Hopfinger, 1971), Cited by Charton-Charton (1982)
Hydrophilicity value (Hopp-Woods, 1981)
Heat capacity (Hutchens, 1970)
Absolute entropy (Hutchens, 1970)
Entropy of formation (Hutchens, 1970)
Normalized relative frequency of alpha-helix (Isogai et al., 1980)
Normalized relative frequency of extended structure (Isogai et al., 1980)
Normalized relative frequency of bend (Isogai et al., 1980)
Normalized relative frequency of bend R (Isogai et al., 1980)
Normalized relative frequency of bend S (Isogai et al., 1980)
Normalized relative frequency of helix end (Isogai et al., 1980)
Normalized relative frequency of double bend (Isogai et al., 1980)
Normalized relative frequency of coil (Isogai et al., 1980)
Average accessible surface area (Janin et al., 1978)
Percentage of buried residues (Janin et al., 1978)
Percentage of exposed residues (Janin et al., 1978)
Ratio of buried and accessible molar fractions (Janin, 1979)
Transfer free energy (Janin, 1979)
Hydrophobicity (Jones, 1975)
pK (-COOH) (Jones, 1975)
Relative frequency of occurrence (Jones et al., 1992)
Relative mutability (Jones et al., 1992)
Amino acid distribution (Jukes et al., 1975)
Sequence frequency (Jungck, 1978)
Average relative probability of helix (Kanehisa-Tsong, 1980)
Average relative probability of beta-sheet (Kanehisa-Tsong, 1980)
Average relative probability of inner helix (Kanehisa-Tsong, 1980)
Average relative probability of inner beta-sheet (Kanehisa-Tsong, 1980)
Flexibility parameter for no rigid neighbors (Karplus-Schulz, 1985)
Flexibility parameter for one rigid neighbor (Karplus-Schulz, 1985)
Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
The Kerr-constant increments (Khanarian-Moore, 1980)
Net charge (Klein et al., 1984)
Side chain interaction parameter (Krigbaum-Rubin, 1971)
Side chain interaction parameter (Krigbaum-Komoriya, 1979)
Fraction of site occupied by water (Krigbaum-Komoriya, 1979)
Side chain volume (Krigbaum-Komoriya, 1979)
Hydropathy index (Kyte-Doolittle, 1982)
Transfer free energy, CHP/water (Lawson et al., 1984)
Hydrophobic parameter (Levitt, 1976)
Distance between C-alpha and centroid of side chain (Levitt, 1976)
Side chain angle theta(AAR) (Levitt, 1976)
Side chain torsion angle phi(AAAR) (Levitt, 1976)
Radius of gyration of side chain (Levitt, 1976)
van der Waals parameter R0 (Levitt, 1976)
van der Waals parameter epsilon (Levitt, 1976)
Normalized frequency of alpha-helix, with weights (Levitt, 1978)
Normalized frequency of beta-sheet, with weights (Levitt, 1978)
Normalized frequency of reverse turn, with weights (Levitt, 1978)
Normalized frequency of alpha-helix, unweighted (Levitt, 1978)
Normalized frequency of beta-sheet, unweighted (Levitt, 1978)
Normalized frequency of reverse turn, unweighted (Levitt, 1978)
Frequency of occurrence in beta-bends (Lewis et al., 1971)
Conformational preference for all beta-strands (Lifson-Sander, 1979)
Conformational preference for parallel beta-strands (Lifson-Sander, 1979)
Conformational preference for antiparallel beta-strands (Lifson-Sander, 1979)
Average surrounding hydrophobicity (Manavalan-Ponnuswamy, 1978)
Normalized frequency of alpha-helix (Maxfield-Scheraga, 1976)
Normalized frequency of extended structure (Maxfield-Scheraga, 1976)
Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
Normalized frequency of left-handed alpha-helix (Maxfield-Scheraga, 1976)
Normalized frequency of zeta L (Maxfield-Scheraga, 1976)
Normalized frequency of alpha region (Maxfield-Scheraga, 1976)
Refractivity (McMeekin et al., 1964), Cited by Jones (1975)
Retention coefficient in HPLC, pH7.4 (Meek, 1980)
Retention coefficient in HPLC, pH2.1 (Meek, 1980)
Retention coefficient in NaClO4 (Meek-Rossetti, 1981)
Retention coefficient in NaH2PO4 (Meek-Rossetti, 1981)
Average reduced distance for C-alpha (Meirovitch et al., 1980)
Average reduced distance for side chain (Meirovitch et al., 1980)
Average side chain orientation angle (Meirovitch et al., 1980)
Effective partition energy (Miyazawa-Jernigan, 1985)
Normalized frequency of alpha-helix (Nagano, 1973)
Normalized frequency of bata-structure (Nagano, 1973)
Normalized frequency of coil (Nagano, 1973)
AA composition of total proteins (Nakashima et al., 1990)
SD of AA composition of total proteins (Nakashima et al., 1990)
AA composition of mt-proteins (Nakashima et al., 1990)
Normalized composition of mt-proteins (Nakashima et al., 1990)
AA composition of mt-proteins from animal (Nakashima et al., 1990)
Normalized composition from animal (Nakashima et al., 1990)
AA composition of mt-proteins from fungi and plant (Nakashima et al., 1990)
Normalized composition from fungi and plant (Nakashima et al., 1990)
AA composition of membrane proteins (Nakashima et al., 1990)
Normalized composition of membrane proteins (Nakashima et al., 1990)
Transmembrane regions of non-mt-proteins (Nakashima et al., 1990)
Transmembrane regions of mt-proteins (Nakashima et al., 1990)
Ratio of average and computed composition (Nakashima et al., 1990)
AA composition of CYT of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of CYT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of EXT of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of EXT2 of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of MEM of single-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of CYT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of EXT of multi-spanning proteins (Nakashima-Nishikawa, 1992)
AA composition of MEM of multi-spanning proteins (Nakashima-Nishikawa, 1992)
8 A contact number (Nishikawa-Ooi, 1980)
14 A contact number (Nishikawa-Ooi, 1986)
Transfer energy, organic solvent/water (Nozaki-Tanford, 1971)
Average non-bonded energy per atom (Oobatake-Ooi, 1977)
Short and medium range non-bonded energy per atom (Oobatake-Ooi, 1977)
Long range non-bonded energy per atom (Oobatake-Ooi, 1977)
Average non-bonded energy per residue (Oobatake-Ooi, 1977)
Short and medium range non-bonded energy per residue (Oobatake-Ooi, 1977)
Optimized beta-structure-coil equilibrium constant (Oobatake et al., 1985)
Optimized propensity to form reverse turn (Oobatake et al., 1985)
Optimized transfer energy parameter (Oobatake et al., 1985)
Optimized average non-bonded energy per atom (Oobatake et al., 1985)
Optimized side chain interaction parameter (Oobatake et al., 1985)
Normalized frequency of alpha-helix from LG (Palau et al., 1981)
Normalized frequency of alpha-helix from CF (Palau et al., 1981)
Normalized frequency of beta-sheet from LG (Palau et al., 1981)
Normalized frequency of beta-sheet from CF (Palau et al., 1981)
Normalized frequency of turn from LG (Palau et al., 1981)
Normalized frequency of turn from CF (Palau et al., 1981)
Normalized frequency of alpha-helix in all-alpha class (Palau et al., 1981)
Normalized frequency of alpha-helix in alpha+beta class (Palau et al., 1981)
Normalized frequency of alpha-helix in alpha/beta class (Palau et al., 1981)
Normalized frequency of beta-sheet in all-beta class (Palau et al., 1981)
Normalized frequency of beta-sheet in alpha+beta class (Palau et al., 1981)
Normalized frequency of beta-sheet in alpha/beta class (Palau et al., 1981)
Normalized frequency of turn in all-alpha class (Palau et al., 1981)
Normalized frequency of turn in all-beta class (Palau et al., 1981)
Normalized frequency of turn in alpha+beta class (Palau et al., 1981)
Normalized frequency of turn in alpha/beta class (Palau et al., 1981)
HPLC parameter (Parker et al., 1986)
Partition coefficient (Pliska et al., 1981)
Surrounding hydrophobicity in folded form (Ponnuswamy et al., 1980)
Average gain in surrounding hydrophobicity (Ponnuswamy et al., 1980)
Average gain ratio in surrounding hydrophobicity (Ponnuswamy et al., 1980)
Surrounding hydrophobicity in alpha-helix (Ponnuswamy et al., 1980)
Surrounding hydrophobicity in beta-sheet (Ponnuswamy et al., 1980)
Surrounding hydrophobicity in turn (Ponnuswamy et al., 1980)
Accessibility reduction ratio (Ponnuswamy et al., 1980)
Average number of surrounding residues (Ponnuswamy et al., 1980)
Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
Slope in regression analysis x 1.0E1 (Prabhakaran-Ponnuswamy, 1982)
Correlation coefficient in regression analysis (Prabhakaran-Ponnuswamy, 1982)
Hydrophobicity (Prabhakaran, 1990)
Relative frequency in alpha-helix (Prabhakaran, 1990)
Relative frequency in beta-sheet (Prabhakaran, 1990)
Relative frequency in reverse-turn (Prabhakaran, 1990)
Helix-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
Beta-coil equilibrium constant (Ptitsyn-Finkelstein, 1983)
Weights for alpha-helix at the window position of -6 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -5 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -4 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -3 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -2 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of -1 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 0 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 1 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 2 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 3 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 4 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 5 (Qian-Sejnowski, 1988)
Weights for alpha-helix at the window position of 6 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -6 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -5 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -4 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -3 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -2 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of -1 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 0 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 1 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 2 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 3 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 4 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 5 (Qian-Sejnowski, 1988)
Weights for beta-sheet at the window position of 6 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -6 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -5 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -4 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -3 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -2 (Qian-Sejnowski, 1988)
Weights for coil at the window position of -1 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 0 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 1 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 2 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 3 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 4 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 5 (Qian-Sejnowski, 1988)
Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
Average reduced distance for C-alpha (Rackovsky-Scheraga, 1977)
Average reduced distance for side chain (Rackovsky-Scheraga, 1977)
Side chain orientational preference (Rackovsky-Scheraga, 1977)
Average relative fractional occurrence in A0(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AR(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AL(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in EL(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in E0(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in ER(i) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in A0(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AR(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in AL(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in EL(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in E0(i-1) (Rackovsky-Scheraga, 1982)
Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
Value of theta(i) (Rackovsky-Scheraga, 1982)
Value of theta(i-1) (Rackovsky-Scheraga, 1982)
Transfer free energy from chx to wat (Radzicka-Wolfenden, 1988)
Transfer free energy from oct to wat (Radzicka-Wolfenden, 1988)
Transfer free energy from vap to chx (Radzicka-Wolfenden, 1988)
Transfer free energy from chx to oct (Radzicka-Wolfenden, 1988)
Transfer free energy from vap to oct (Radzicka-Wolfenden, 1988)
Accessible surface area (Radzicka-Wolfenden, 1988)
Energy transfer from out to in(95
Mean polarity (Radzicka-Wolfenden, 1988)
Relative preference value at N" (Richardson-Richardson, 1988)
Relative preference value at N’ (Richardson-Richardson, 1988)
Relative preference value at N-cap (Richardson-Richardson, 1988)
Relative preference value at N1 (Richardson-Richardson, 1988)
Relative preference value at N2 (Richardson-Richardson, 1988)
Relative preference value at N3 (Richardson-Richardson, 1988)
Relative preference value at N4 (Richardson-Richardson, 1988)
Relative preference value at N5 (Richardson-Richardson, 1988)
Relative preference value at Mid (Richardson-Richardson, 1988)
Relative preference value at C5 (Richardson-Richardson, 1988)
Relative preference value at C4 (Richardson-Richardson, 1988)
Relative preference value at C3 (Richardson-Richardson, 1988)
Relative preference value at C2 (Richardson-Richardson, 1988)
Relative preference value at C1 (Richardson-Richardson, 1988)
Relative preference value at C-cap (Richardson-Richardson, 1988)
Relative preference value at C’ (Richardson-Richardson, 1988)
Relative preference value at C" (Richardson-Richardson, 1988)
Information measure for alpha-helix (Robson-Suzuki, 1976)
Information measure for N-terminal helix (Robson-Suzuki, 1976)
Information measure for middle helix (Robson-Suzuki, 1976)
Information measure for C-terminal helix (Robson-Suzuki, 1976)
Information measure for extended (Robson-Suzuki, 1976)
Information measure for pleated-sheet (Robson-Suzuki, 1976)
Information measure for extended without H-bond (Robson-Suzuki, 1976)
Information measure for turn (Robson-Suzuki, 1976)
Information measure for N-terminal turn (Robson-Suzuki, 1976)
Information measure for middle turn (Robson-Suzuki, 1976)
Information measure for C-terminal turn (Robson-Suzuki, 1976)
Information measure for coil (Robson-Suzuki, 1976)
Information measure for loop (Robson-Suzuki, 1976)
Hydration free energy (Robson-Osguthorpe, 1979)
Mean area buried on transfer (Rose et al., 1985)
Mean fractional area loss (Rose et al., 1985)
Side chain hydropathy, uncorrected for solvation (Roseman, 1988)
Side chain hydropathy, corrected for solvation (Roseman, 1988)
Loss of Side chain hydropathy by helix formation (Roseman, 1988)
Transfer free energy (Simon, 1976), Cited by Charton-Charton (1982)
Principal component I (Sneath, 1966)
Principal component II (Sneath, 1966)
Principal component III (Sneath, 1966)
Principal component IV (Sneath, 1966)
Zimm-Bragg parameter s at 20 C (Sueki et al., 1984)
Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
Optimal matching hydrophobicity (Sweet-Eisenberg, 1983)
Normalized frequency of alpha-helix (Tanaka-Scheraga, 1977)
Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
Normalized frequency of extended structure (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal R (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal S (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal D (Tanaka-Scheraga, 1977)
Normalized frequency of left-handed helix (Tanaka-Scheraga, 1977)
Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
Normalized frequency of coil (Tanaka-Scheraga, 1977)
Normalized frequency of chain reversal (Tanaka-Scheraga, 1977)
Relative population of conformational state A (Vasquez et al., 1983)
Relative population of conformational state C (Vasquez et al., 1983)
Relative population of conformational state E (Vasquez et al., 1983)
Electron-ion interaction potential (Veljkovic et al., 1985)
Bitterness (Venanzi, 1984)
Transfer free energy to lipophilic phase (von Heijne-Blomberg, 1979)
Average interactions per side chain atom (Warme-Morgan, 1978)
RF value in high salt chromatography (Weber-Lacey, 1978)
Propensity to be buried inside (Wertz-Scheraga, 1978)
Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
Free energy change of alpha(Ri) to alpha(Rh) (Wertz-Scheraga, 1978)
Free energy change of epsilon(i) to alpha(Rh) (Wertz-Scheraga, 1978)
Polar requirement (Woese, 1973)
Hydration potential (Wolfenden et al., 1981)
Principal property value z1 (Wold et al., 1987)
Principal property value z2 (Wold et al., 1987)
Principal property value z3 (Wold et al., 1987)
Unfolding Gibbs energy in water, pH7.0 (Yutani et al., 1987)
Unfolding Gibbs energy in water, pH9.0 (Yutani et al., 1987)
Activation Gibbs energy of unfolding, pH7.0 (Yutani et al., 1987)
Activation Gibbs energy of unfolding, pH9.0 (Yutani et al., 1987)
Dependence of partition coefficient on ionic strength (Zaslavsky et al., 1982)
Hydrophobicity (Zimmerman et al., 1968)
Bulkiness (Zimmerman et al., 1968)
Polarity (Zimmerman et al., 1968)
Isoelectric point (Zimmerman et al., 1968)
RF rank (Zimmerman et al., 1968)
Normalized positional residue frequency at helix termini N4’ (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N"’ (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N" (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N’ (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini Nc (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N1 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N2 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N3 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N4 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini N5 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C5 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C4 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C3 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C2 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C1 (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini Cc (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C’ (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C" (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C"’ (Aurora-Rose, 1998)
Normalized positional residue frequency at helix termini C4’ (Aurora-Rose, 1998)
Delta G values for the peptides extrapolated to 0 M urea (O’Neil-DeGrado, 1990)
Helix formation parameters (delta delta G) (O’Neil-DeGrado, 1990)
Normalized flexibility parameters (B-values), average (Vihinen et al., 1994)
Normalized flexibility parameters (B-values) for each residue surrounded by none rigid neighbours (Vihinen et al., 1994)
Normalized flexibility parameters (B-values) for each residue surrounded by one rigid neighbours (Vihinen et al., 1994)
Normalized flexibility parameters (B-values) for each residue surrounded by two rigid neighbours (Vihinen et al., 1994)
Free energy in alpha-helical conformation (Munoz-Serrano, 1994)
Free energy in alpha-helical region (Munoz-Serrano, 1994)
Free energy in beta-strand conformation (Munoz-Serrano, 1994)
Free energy in beta-strand region (Munoz-Serrano, 1994)
Free energy in beta-strand region (Munoz-Serrano, 1994)
Free energies of transfer of AcWl-X-LL peptides from bilayer interface to water (Wimley-White, 1996)
Thermodynamic beta sheet propensity (Kim-Berg, 1993)
Turn propensity scale for transmembrane helices (Monne et al., 1999)
Alpha helix propensity of position 44 in T4 lysozyme (Blaber et al., 1993)
p-Values of mesophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
p-Values of thermophilic proteins based on the distributions of B values (Parthasarathy-Murthy, 2000)
Distribution of amino acid residues in the 18 non-redundant families of thermophilic proteins (Kumar et al., 2000)
Distribution of amino acid residues in the 18 non-redundant families of mesophilic proteins (Kumar et al., 2000)
Distribution of amino acid residues in the alpha-helices in thermophilic proteins (Kumar et al., 2000)
Distribution of amino acid residues in the alpha-helices in mesophilic proteins (Kumar et al., 2000)
Side-chain contribution to protein stability (kJ/mol) (Takano-Yutani, 2001)
Propensity of amino acids within pi-helices (Fodje-Al-Karadaghi, 2002)
Hydropathy scale based on self-information values in the two-state model (5
Hydropathy scale based on self-information values in the two-state model (9
Hydropathy scale based on self-information values in the two-state model (16
Hydropathy scale based on self-information values in the two-state model (20
Hydropathy scale based on self-information values in the two-state model (25
Hydropathy scale based on self-information values in the two-state model (36
Hydropathy scale based on self-information values in the two-state model (50
Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
Alpha-helix propensity derived from designed sequences (Koehl-Levitt, 1999)
Beta-sheet propensity derived from designed sequences (Koehl-Levitt, 1999)
Composition of amino acids in extracellular proteins (percent) (Cedano et al., 1997)
Composition of amino acids in anchored proteins (percent) (Cedano et al., 1997)
Composition of amino acids in membrane proteins (percent) (Cedano et al., 1997)
Composition of amino acids in intracellular proteins (percent) (Cedano et al., 1997)
Composition of amino acids in nuclear proteins (percent) (Cedano et al., 1997)
Surface composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
Surface composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Surface composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)

Surface composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Interior composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in intracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in extracellular proteins of mesophiles (percent) (Fukuchi-Nishikawa, 2001)
Entire chain composition of amino acids in nuclear proteins (percent) (Fukuchi-Nishikawa, 2001)
Screening coefficients gamma, local (Avbelj, 2000)
Screening coefficients gamma, non-local (Avbelj, 2000)
Slopes tripeptide, FDPB VFF neutral (Avbelj, 2000)
Slopes tripeptides, LD VFF neutral (Avbelj, 2000)
Slopes tripeptide, FDPB VFF noside (Avbelj, 2000)
Slopes tripeptide FDPB VFF all (Avbelj, 2000)
Slopes tripeptide FDPB PARSE neutral (Avbelj, 2000)
Slopes dekapeptide, FDPB VFF neutral (Avbelj, 2000)
Slopes proteins, FDPB VFF neutral (Avbelj, 2000)
Side-chain conformation by gaussian evolutionary method (Yang et al., 2002)
Amphiphilicity index (Mitaku et al., 2002)
Volumes including the crystallographic waters using the ProtOr (Tsai et al., 1999)
Volumes not including the crystallographic waters using the ProtOr (Tsai et al., 1999)
Electron-ion interaction potential values (Cosic, 1994)
Hydrophobicity scales (Ponnuswamy, 1993)
Hydrophobicity coefficient in RP-HPLC, C18 with 0.1
Hydrophobicity coefficient in RP-HPLC, C8 with 0.1
Hydrophobicity coefficient in RP-HPLC, C4 with 0.1
Hydrophobicity coefficient in RP-HPLC, C18 with 0.1
Hydrophilicity scale (Kuhn et al., 1995)
Retention coefficient at pH 2 (Guo et al., 1986)
Modified Kyte-Doolittle hydrophobicity scale (Juretic et al., 1998)
Interactivity scale obtained from the contact matrix (Bastolla et al., 2005)
Interactivity scale obtained by maximizing the mean of correlation coefficient over single-domain globular proteins (Bastolla et al., 2005)
Interactivity scale obtained by maximizing the mean of correlation coefficient over pairs of sequences sharing the TIM barrel fold (Bastolla et al., 2005)
Linker propensity index (Suyama-Ohara, 2003)
Knowledge-based membrane-propensity scale from 1D_Helix in MPtopo databases (Punta-Maritan, 2003)
Knowledge-based membrane-propensity scale from 3D_Helix in MPtopo databases (Punta-Maritan, 2003)
Linker propensity from all dataset (George-Heringa, 2003)
Linker propensity from 1-linker dataset (George-Heringa, 2003)
Linker propensity from 2-linker dataset (George-Heringa, 2003)


Linker propensity from 3-linker dataset (George-Heringa, 2003)
Linker propensity from small dataset (linker length is less than six residues) (George-Heringa, 2003)
Linker propensity from medium dataset (linker length is between six and 14 residues) (George-Heringa, 2003)
Linker propensity from long dataset (linker length is greater than 14 residues) (George-Heringa, 2003)
Linker propensity from helical (annotated by DSSP) dataset (George-Heringa, 2003)
Linker propensity from non-helical (annotated by DSSP) dataset (George-Heringa, 2003)
The stability scale from the knowledge-based atom-atom potential (Zhou-Zhou, 2004)
The relative stability scale extracted from mutation experiments (Zhou-Zhou, 2004)
Buriability (Zhou-Zhou, 2004)
Linker index (Bae et al., 2005)
Mean volumes of residues buried in protein interiors (Harpaz et al., 1994)
Average volumes of residues (Pontius et al., 1996)
Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
Hydrophobicity index (Wolfenden et al., 1979)
Average internal preferences (Olsen, 1980)
Hydrophobicity-related index (Kidera et al., 1985)
Apparent partition energies calculated from Wertz-Scheraga index (Guy, 1985)
Apparent partition energies calculated from Robson-Osguthorpe index (Guy, 1985)
Apparent partition energies calculated from Janin index (Guy, 1985)
Apparent partition energies calculated from Chothia index (Guy, 1985)
Hydropathies of amino acid side chains, neutral form (Roseman, 1988)
Hydropathies of amino acid side chains, pi-values in pH 7.0 (Roseman, 1988)
Weights from the IFH scale (Jacobs-White, 1989)
Hydrophobicity index, 3.0 pH (Cowan-Whittaker, 1990)
Scaled side chain hydrophobicity values (Black-Mould, 1991)
Hydrophobicity scale from native protein structures (Casari-Sippl, 1992)
NNEIG index (Cornette et al., 1987)
SWEIG index (Cornette et al., 1987)
PRIFT index (Cornette et al., 1987)
PRILS index (Cornette et al., 1987)
ALTFT index (Cornette et al., 1987)
ALTLS index (Cornette et al., 1987)
TOTFT index (Cornette et al., 1987)
TOTLS index (Cornette et al., 1987)
Relative partition energies derived by the Bethe approximation (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method A (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method B (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method C (Miyazawa-Jernigan, 1999)
Optimized relative partition energies - method D (Miyazawa-Jernigan, 1999)
Hydrophobicity index (Engelman et al., 1986)
Hydrophobicity index (Fasman, 1989)

Source

http://www.genome.jp/aaindex



References

From the original aaindex documentation:

Please cite the following references when making use of the database:

Kawashima, S. and Kanehisa, M. (2000) AAindex: amino acid index database. Nucleic Acids Res., 28:374.

Tomii, K. and Kanehisa, M. (1996) Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng., 9:27-36.

Nakai, K., Kidera, A., and Kanehisa, M. (1988) Cluster analysis of amino acid indices for prediction of protein structure and function. Protein Eng., 2:93-100.

Examples

#
# Load data:
#
data(aaindex)
#
# Suppose that we need the Kyte & Doolittle hydropathy index. We first look
# at the entries with Kyte as author:
#
which(sapply(aaindex, function(x) length(grep("Kyte", x$A)) != 0))
#
# This should return that entry number 151 named KYTJ820101 is the only
# one that fits our request. We can access it by position or by name,
# for instance:
#
aaindex[[151]]$I
aaindex[["KYTJ820101"]]$I
aaindex$KYTJ820101$I
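The search in the example relies only on each entry being a list with an author field A and an index vector I. The same lookup pattern can be tried without the package on a toy list; the sketch below assumes that layout, and all entry names and values in it are invented for illustration (they are not AAindex records).

```r
# Toy stand-in for the aaindex list. Entry names and numeric values are
# invented placeholders, NOT real AAindex data.
toy_index <- list(
  DEMO000001 = list(A = "Doe, J. and Roe, R.", I = c(Ala = 1.0, Arg = -2.5)),
  DEMO000002 = list(A = "Kyte, J. and Doolittle, R.F.", I = c(Ala = 0.5, Arg = -1.5)),
  DEMO000003 = list(A = "Smith, A.", I = c(Ala = 0.1, Arg = 0.2))
)

# Flag the entries whose author field mentions "Kyte".
is_hit <- vapply(toy_index, function(x) grepl("Kyte", x$A, fixed = TRUE), logical(1))
hits <- which(is_hit)

# The matching entry can then be reached by position or by name.
by_position <- toy_index[[hits[1]]]$I
by_name <- toy_index[[names(hits)[1]]]$I
```

Accessing by name is usually the safer choice, since positions may change between database releases.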

alllistranks To get the count of existing lists and all their ranks on server

Description

This is a low level function to get the total number of lists and all their ranks in an opened database.


Usage

alllistranks(socket = "auto", verbose = FALSE)
alr(socket = "auto", verbose = FALSE)

Arguments

socket a socket of class connection and sockconn returned by choosebank. Default value (auto) means that the socket will be set to the socket component of the banknameSocket variable.

verbose if TRUE, verbose mode is on

Details

This low level function is usually not used directly by the user.

Value

A list with two components:

count count of existing lists

rank their rank

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

choosebank, query

Examples

## Not run:
# Need internet connection
choosebank("genbank")
query("tmp1", "sp=Borrelia burgdorferi", virtual = TRUE)
query("tmp2", "sp=Borrelia burgdorferi", virtual = TRUE)
query("tmp3", "sp=Borrelia burgdorferi", virtual = TRUE)
alr()
#
# Should be:
#
# $count
# [1] 3
#
# $ranks
# [1] 2 3 4
#
## End(Not run)

amb Expansion of IUPAC nucleotide symbols

Description

This function returns the list of nucleotides matching a given IUPAC nucleotide symbol, for instance c("c", "g") for "s".

Usage

amb(base, forceToLower = TRUE, checkBase = TRUE,
  IUPAC = s2c("acgturymkswbdhvn"), u2t = TRUE)

Arguments

base an IUPAC symbol for a nucleotide as a single character

forceToLower if TRUE the base is forced to lower case

checkBase if TRUE the character is checked to belong to the allowed IUPAC symbol list

IUPAC the list of allowed IUPAC symbols

u2t if TRUE, "u" for uracil in RNA is changed into "t" for thymine in DNA

Details

Non ambiguous bases are returned unchanged (except for "u" when u2t is TRUE).

Value

When base is missing, the list of IUPAC symbols is returned, otherwise a vector with expanded symbols.

Author(s)

J.R. Lobry

References

The nomenclature for incompletely specified bases in nucleic acid sequences at: http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

citation("seqinr")


See Also

Use tolower to change upper case letters into lower case letters.

Examples

#
# The list of IUPAC symbols:
#
amb()
#
# And their expansion:
#
sapply(amb(), amb)
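To make the expansion concrete, the standard IUPAC ambiguity codes can be written out as a plain lookup table. This is only a base R sketch of the idea: expand_iupac() is a hypothetical helper introduced here, not the package's amb(), and its argument handling is a simplification.

```r
# Standard IUPAC nucleotide ambiguity codes as a lookup table.
iupac_table <- list(
  a = "a", c = "c", g = "g", t = "t",
  r = c("a", "g"), y = c("c", "t"),
  m = c("a", "c"), k = c("g", "t"),
  s = c("c", "g"), w = c("a", "t"),
  b = c("c", "g", "t"), d = c("a", "g", "t"),
  h = c("a", "c", "t"), v = c("a", "c", "g"),
  n = c("a", "c", "g", "t")
)

# Hypothetical helper (NOT the package's amb()): expand one symbol.
expand_iupac <- function(base, u2t = TRUE) {
  base <- tolower(base)
  # Uracil is optionally rewritten as thymine, mirroring the u2t idea.
  if (base == "u") return(if (u2t) "t" else "u")
  if (!base %in% names(iupac_table)) {
    stop("not an IUPAC nucleotide symbol: ", base)
  }
  iupac_table[[base]]
}

expanded_s <- expand_iupac("s")
expanded_u <- expand_iupac("U")
```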

c2s conversion of a vector of chars into a string

Description

This is a simple utility function to convert a vector of chars such as c("m", "e", "r", "g", "e", "d") into a single string such as "merged".

Usage

c2s(chars = c("m", "e", "r", "g", "e", "d"))

Arguments

chars a vector of chars

Value

a string

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

s2c


Examples

c2s( c("m","e","r","g","e","d") )
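For readers without the package at hand, the same conversion and its inverse can be sketched in base R. The helper names below are hypothetical, and whether c2s and s2c are implemented this way internally is an assumption.

```r
# Base R equivalents (hypothetical helper names, not the package functions).
chars_to_string <- function(chars) paste(chars, collapse = "")
string_to_chars <- function(string) strsplit(string, split = "")[[1]]

merged <- chars_to_string(c("m", "e", "r", "g", "e", "d"))
split_again <- string_to_chars(merged)
```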

chargaff Base composition in ssDNA for 7 bacterial DNA

Description

Long before the genomic era, it was possible to get some data for the global composition of single-stranded DNA chromosomes by direct chemical analyses. These data are from Chargaff’s lab and give the base composition of the L (Light) strand for 7 bacterial chromosomes.

Usage

data(chargaff)

Format

A data frame with 7 observations on the following 4 variables.

[A] frequencies of A bases in percent

[G] frequencies of G bases in percent

[C] frequencies of C bases in percent

[T] frequencies of T bases in percent

Details

Data are from Table 2 in Rudner et al. (1969) for the L-strand. Data for Bacillus subtilis were taken from a previous paper: Rudner et al. (1968). This is in fact the average value observed for two different strains of B. subtilis: strain W23 and strain Mu8u5u16.

Denatured chromosomes can be separated by a technique of intermittent gradient elution from a column of methylated albumin kieselguhr (MAK), into two fractions, designated, by virtue of their buoyant densities, as L (light) and H (heavy). The fractions can be hydrolyzed and subjected to chromatography to determine their global base composition.

The surprising result is that we have almost exactly A=T and C=G in single-stranded DNAs. The second paragraph page 157 in Rudner et al. (1969) says: "Our previous work on the complementary strands of B. subtilis DNA suggested an additional, entirely unexpected regularity, namely, the equality in either strand of 6-amino and 6-keto nucleotides (A + C = G + T). This relationship, which would normally have been regarded merely as the consequence of base-pairing in DNA duplex and would not have been predicted as a likely property of a single strand, is shown here to apply to all strand specimens isolated from denaturated DNA of the AT type (Table 2, preps. 1-4). It cannot yet be said to be established for the DNA specimens from the equimolar and GC types (nos. 5-7)."

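The two regularities discussed in the Details (A=T and C=G within one strand, and A + C = G + T) are easy to check numerically. The sketch below does so on an invented data frame: the numbers are made up for illustration and are not the Rudner et al. measurements, and the plain column names A, G, C, T are an assumption of this example.

```r
# Invented single-strand base compositions in percent (NOT the published
# measurements); each row sums to 100.
toy <- data.frame(
  A = c(30.1, 24.8, 17.9),
  G = c(20.3, 25.1, 32.4),
  C = c(19.8, 24.9, 31.8),
  T = c(29.8, 25.2, 17.9)
)

# Deviations from the two intra-strand equalities, in percentage points.
pr2_dev <- data.frame(
  A_minus_T = toy$A - toy$T,
  C_minus_G = toy$C - toy$G
)

# 6-amino versus 6-keto balance from the quoted passage: (A + C) - (G + T).
amino_keto <- (toy$A + toy$C) - (toy$G + toy$T)
```

With the real dataset, the same differences could be computed on the columns of chargaff after data(chargaff); values close to zero would correspond to the blue lines described in the References below.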

Source

Rudner, R., Karkas, J.D., Chargaff, E. (1968) Separation of B. subtilis DNA into complementary strands, III. Direct Analysis. Proceedings of the National Academy of Sciences of the United States of America, 60:921-922.

Rudner, R., Karkas, J.D., Chargaff, E. (1969) Separation of microbial deoxyribonucleic acids into complementary strands. Proceedings of the National Academy of Sciences of the United States of America, 63:152-159.

References

Try example(chargaff) to mimic the figure on page 17 in http://pbil.univ-lyon1.fr/members/lobry/articles/HDR.pdf. The red areas correspond to non-allowed values because the sum of the four base frequencies cannot exceed 100%. The white areas correspond to possible values (more exactly to the projection from R^4 to the corresponding R^2 planes of the region of allowed values). The blue lines correspond to the very small subset of allowed values for which we have in addition the PR2 state, that is [A]=[T] and [C]=[G]. Remember, these data are for ssDNA!

citation("seqinr")

Examples

data(chargaff)
op <- par(no.readonly = TRUE)
par(mfrow = c(4,4), mai = rep(0,4), xaxs = "i", yaxs = "i")
xlim <- ylim <- c(0, 100)

for( i in 1:4 ){
  for( j in 1:4 ){
    if( i == j ){
      # Diagonal panel: grey box with tick marks and the variable name
      plot(chargaff[,i], chargaff[,j], type = "n", xlim = xlim, ylim = ylim,
        xlab = "", ylab = "", xaxt = "n", yaxt = "n")
      polygon(x = c(0, 0, 100, 100), y = c(0, 100, 100, 0), col = "lightgrey")
      for( k in seq(from = 0, to = 100, by = 10) ){
        lseg <- 3
        segments(k, 0, k, lseg)
        segments(k, 100 - lseg, k, 100)
        segments(0, k, lseg, k)
        segments(100 - lseg, k, 100, k)
      }
      string <- paste(names(chargaff)[i],"\n\n",xlim[1],"% -",xlim[2],"%")
      text(x=mean(xlim),y=mean(ylim), string, cex = 1.5)
    } else {
      # Off-diagonal panel: scatter plot with the PR2 line in blue
      plot(chargaff[,i], chargaff[,j], pch = 1, xlim = xlim, ylim = ylim,
        xlab = "", ylab = "", xaxt = "n", yaxt = "n", cex = 2)
      iname <- names(chargaff)[i]
      jname <- names(chargaff)[j]
      direct <- function() segments(0, 0, 50, 50, col="blue")
      invers <- function() segments(0, 50, 50, 0, col="blue")
      PR2 <- function(){
        if( iname == "[A]" & jname == "[T]" ) { direct(); return() }
        if( iname == "[T]" & jname == "[A]" ) { direct(); return() }
        if( iname == "[C]" & jname == "[G]" ) { direct(); return() }
        if( iname == "[G]" & jname == "[C]" ) { direct(); return() }
        invers()
      }
      PR2()
      polygon(x = c(0, 100, 100), y = c(100, 100, 0), col = "pink4")
      polygon(x = c(0, 0, 100), y = c(0, 100, 0))
    }
  }
}
# Clean up
par(op)

choosebank To select a database structured under ACNUC and located on the web

Description

This function selects one of the databases structured under ACNUC and located on the web. Called without arguments, choosebank() returns the list of available databases. You can then use query to make your query and get a list of sequence names. Remote access to ACNUC databases works by opening a socket connection on a port (for example on port number 5558 at pbil.univ-lyon1.fr) and by communicating on this socket following the protocol described in the References section.

Usage

choosebank(bank = NA, host = "pbil.univ-lyon1.fr", port = 5558, verbose = FALSE,
  timeout = 5, infobank = FALSE, tagbank = NA)

Arguments

bank String. The name of the bank. If NA, choosebank will return the names of all databases known by the server.

host String. Host name for port

port Integer. The TCP port number

verbose Logical. If TRUE, verbose mode is on

timeout Integer. The timeout in seconds for socketConnection. Default 5 seconds.

infobank Logical. If infobank is TRUE and bank is NA, a data.frame with information on all databases will be returned

tagbank String. If bank is NA and tagbank is specified, the names of special-purpose databases are returned. Currently allowed values are TP for frozen databases and TEST for test databases.

Details

When called without arguments, choosebank() returns the names of all the databases known by the server, as a vector of strings. When called with choosebank(infobank = TRUE), a data.frame with more information is returned.

Value

When called with a regular bank name, an (invisible) list with five components:

socket an object of class socket

bankname the name of the bank

totseqs the total number of sequences present in the opened database

totspecs the total number of species present in the opened database

totkeys the total number of keywords present in the opened database

When called with bank = NA:

A vector of all available bank names.

When called with bank = NA and infobank = TRUE, a data.frame with three columns:

bank The name of the bank.

status The bank status (on/off).

info Short description of bank with last release date.

Note

The object of class socket will be the first argument of the query function.

Author(s)

D. Charif

References

For more information about the socket communication protocol with ACNUC please see http://pbil.univ-lyon1.fr/databases/acnuc/remote_acnuc.html. To get the release date and content of all the databases located at the PBIL, please look at the following URL: http://pbil.univ-lyon1.fr/search/releases.php

Gouy, M., Milleret, F., Mugnier, C., Jacobzone, M., Gautier, C. (1984) ACNUC: a nucleic acid sequence data base and analysis system. Nucl. Acids Res., 12:121-127.

Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., Di Paola, G. (1985) ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Appl. Biosci., 3:167-172.

Gouy, M., Gautier, C., Milleret, F. (1985) System analysis and nucleic acid sequence banks. Biochimie, 67:433-436.

citation("seqinr")

See Also

query, connection, socketConnection

Examples

## Not run: mybank <- choosebank()
## Not run: choosebank(mybank[1])

closebank To close a remote ACNUC database

Description

This function tries to close a remote ACNUC database.

Usage

closebank(bank = NA, host = "pbil.univ-lyon1.fr", port = 5558, verbose = FALSE)

Arguments

bank String. If NA, the last opened database as given by banknameSocket in the global environment will be closed.

host String. Host name for port

port Integer. The TCP port number

verbose Logical. If TRUE, verbose mode is on

References

citation("seqinr")

See Also

choosebank

Examples

## Not run: choosebank("genbank")
## Not run: closebank()


comp complements a nucleic acid sequence

Description

Complements a sequence, for instance if the sequence is "a","c","g","t" it returns "t","g","c","a". This is not the reverse complementary strand. This function can handle ambiguous bases if required.

Usage

comp(seq, forceToLower = TRUE, ambiguous = FALSE)

Arguments

seq a DNA sequence as a vector of single chars

forceToLower if TRUE, characters in seq are forced to lower case

ambiguous if TRUE ambiguous bases in seq are handled

Value

a vector of characters which is the complement of the sequence, not the reverse complementary strand. Undefined values are returned as NA.

Author(s)


References

citation("seqinr")

See Also

Because ssDNA sequences are always written in the 5’->3’ direction, use rev(comp(seq)) to get the reverse complementary strand (see rev).

Examples

##
## Show that comp() does not return the reverse complementary strand:
##
c2s(comp(s2c("aaaattttggggcccc")))
##
## Show how to get the reverse complementary strand:
##
c2s(rev(comp(s2c("aaaattttggggcccc"))))
##
## Show what happens with non allowed values:
##
c2s(rev(comp(s2c("aaaaXttttYggggZcccc"))))
##
## Show what happens with ambiguous bases:
##
allbases <- s2c("abcdghkmstvwn")
comp(allbases) # NA are produced
comp(allbases, ambiguous = TRUE) # No more NA
##
## Routine sanity check:
##
stopifnot(identical(comp(allbases, ambiguous = TRUE), s2c("tvghcdmksabwn")))
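The difference between the complement and the reverse complement can also be seen with base R alone. The sketch below uses chartr on the four unambiguous bases; complement() is a hypothetical helper, and unlike comp() it leaves unknown characters unchanged instead of returning NA.

```r
# Hypothetical base R helper: complement of a vector of single characters.
# chartr maps a->t, c->g, g->c, t->a and leaves other characters untouched.
complement <- function(chars) chartr("acgt", "tgca", tolower(chars))

dna <- strsplit("aaaattttggggcccc", split = "")[[1]]

# Complement only: same reading direction as the input.
comp_only <- paste(complement(dna), collapse = "")

# Reverse complement: the other strand read 5'->3'.
rev_comp <- paste(rev(complement(dna)), collapse = "")
```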

computePI To Compute the Theoretical Isoelectric Point

Description

This function calculates the theoretical isoelectric point of a protein. The isoelectric point is the pH at which the protein has a neutral charge. This estimate does not account for post-translational modifications.

Usage

computePI(seq)

Arguments

seq Protein sequence as a vector of single chars in upper case

Value

The theoretical isoelectric point (pI) as a numerical vector of length one.

Note

Protein pI is calculated using pK values of amino acids described in Bjellqvist et al. See also SEQINR.UTIL for more details.

Author(s)

D. Charif and J.R. Lobry

References

The algorithm is the same as the one which is implemented at the following url: http://www.expasy.org/tools/pi_tool-doc.html but with many trials in case of convergence failure of the non-linear regression procedure.

citation("seqinr")


See Also

SEQINR.UTIL

Examples

#
# Simple sanity check with all 20 amino-acids in one-letter code alphabetical order:
#
prot <- s2c("ACDEFGHIKLMNPQRSTVWY")
stopifnot(all.equal(computePI(prot), 6.78454))
#
# Read a protein sequence in a FASTA file and then compute its pI :
#
myProts <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
computePI(myProts[[1]]) # Should be 8.534902
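The idea behind a theoretical pI can be sketched in a few lines: sum the Henderson-Hasselbalch charge contributions of the ionisable groups and search for the pH where the net charge is zero. The code below is an illustration only. The pK values are rough textbook-style numbers chosen for this sketch (an assumption, not the Bjellqvist et al. set), the root is found with uniroot rather than the package's procedure, and the helper names are hypothetical, so the result is not expected to reproduce computePI().

```r
# Rough illustrative pK values: an ASSUMPTION of this sketch, not the
# Bjellqvist et al. values used by the package.
pk_pos <- c(Nterm = 9.0, K = 10.5, R = 12.5, H = 6.0)
pk_neg <- c(Cterm = 2.0, D = 3.9, E = 4.1, C = 8.3, Y = 10.1)

# Number of occurrences of each residue of interest in the sequence.
count_of <- function(prot, residues) {
  vapply(residues, function(r) sum(prot == r), integer(1))
}

# Net charge at a given pH (one free N-terminus and one free C-terminus).
net_charge <- function(prot, pH) {
  n_pos <- c(1, count_of(prot, c("K", "R", "H")))
  n_neg <- c(1, count_of(prot, c("D", "E", "C", "Y")))
  sum(n_pos / (1 + 10^(pH - pk_pos))) - sum(n_neg / (1 + 10^(pk_neg - pH)))
}

# The pI estimate is the pH at which the net charge crosses zero.
toy_pi <- function(prot) {
  uniroot(function(pH) net_charge(prot, pH), interval = c(0, 14), tol = 1e-10)$root
}

prot <- strsplit("ACDEFGHIKLMNPQRSTVWY", split = "")[[1]]
pi_estimate <- toy_pi(prot)
```

Because the net charge decreases monotonically with pH, a bracketing root finder is sufficient here; the choice of pK table is what mostly drives differences between pI tools.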

count Composition of dimer/trimer/etc oligomers

Description

Counts the number of times each dimer/trimer/etc oligomer occurs in a sequence. Note that the oligomers are overlapping.

Usage

count(seq, word, frame = 0, freq = FALSE, alphabet = s2c("acgt"))

Arguments

seq a vector of single chars

word an integer giving the size of word (n-mer) to count

frame an integer (0, 1, 2,...) giving the frame (starting position)

freq if TRUE, word frequencies are computed instead of counts

alphabet a vector of single characters

Details

count counts the occurrences of all words by moving a window of length word. The window step is always one unit. frame controls the starting position in the sequence for the count.

Value

This function returns a factor whose levels are all the possible oligomers. All oligomers are returned, even if absent from the sequence.


Author(s)


References

citation("seqinr")

See Also

table

Examples

a <- s2c("acgggtacggtcccatcgaa")
##
## To count dinucleotide occurrences in sequence a:
##
count(a, 2)
##
## To count trinucleotide occurrences in sequence a, in frame 2:
##
count(a, 3, 2)
##
## To count dinucleotide frequencies in sequence a:
##
count(a, 2, freq = TRUE)
##
## Simple sanity check:
##
alldinucl <- "aattgtctaggcgacca"
stopifnot(all(count(s2c(alldinucl), 2) == 1))
alldiaa <- "aaxxzxbxvxyxwxtxsxpxfxmxkxlxixhxgxexqxcxdxnxrxazzbzvzyzwztzszpzfzmzkzlzizhzgzezqzczdznzrzabbvbybwbtbsbpbfbmbkblbibhbgbebqbcbdbnbrbavvyvwvtvsvpvfvmvkvlvivhvgvevqvcvdvnvrvayywytysypyfymykylyiyhygyeyqycydynyryawwtwswpwfwmwkwlwiwhwgwewqwcwdwnwrwattstptftmtktltithtgtetqtctdtntrtasspsfsmskslsishsgsesqscsdsnsrsappfpmpkplpiphpgpepqpcpdpnprpaffmfkflfifhfgfefqfcfdfnfrfammkmlmimhmgmemqmcmdmnmrmakklkikhkgkekqkckdknkrkallilhlglelqlcldlnlrlaiihigieiqicidiniriahhghehqhchdhnhrhaggegqgcgdgngrgaeeqecedenereaqqcqdqnqrqaccdcncrcaddndrdannrnarra"
stopifnot(all(count(s2c(alldiaa), 2, alphabet = s2c("arndcqeghilkmfpstwyvbzx")) == 1))
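The overlapping-window counting described in the Details can be reproduced in base R, which may help when checking what frame and the fixed set of levels mean. This is a sketch under the assumption that the window always moves by one position and that frame is a zero-based offset; kmer_levels() and count_kmers() are hypothetical names, not package functions.

```r
# All possible words of length k over an alphabet, first letter slowest.
kmer_levels <- function(alphabet, k) {
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  # expand.grid varies its first column fastest, so the columns are pasted
  # in reverse order to get "aa", "ac", "ag", ...
  do.call(paste0, unname(as.list(grid[rev(seq_len(k))])))
}

# Count overlapping words of length k, starting at offset `frame`.
count_kmers <- function(chars, k, frame = 0, alphabet = c("a", "c", "g", "t")) {
  starts <- seq_len(max(length(chars) - frame - k + 1, 0)) + frame
  words <- vapply(starts,
                  function(i) paste(chars[i:(i + k - 1)], collapse = ""),
                  character(1))
  # Using a factor with all levels keeps absent words in the table as zeros.
  table(factor(words, levels = kmer_levels(alphabet, k)))
}

dna <- strsplit("aattgtctaggcgacca", split = "")[[1]]
dinucl_counts <- count_kmers(dna, 2)
```

Dividing the table by its sum would give frequencies, in the spirit of the freq argument.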

crelistfromclientdata To create on server a bitlist from data lines sent by client

Description

This function is useful if you have a local file with sequence names (sequence ID), or sequence accession numbers, or species names, or keywords. This allows you to create on the server a list with the corresponding items.

Usage

crelistfromclientdata(listname, file, type, socket = "auto",
  invisible = TRUE, verbose = FALSE, virtual = FALSE)
clfcd(listname, file, type, socket = "auto",
  invisible = TRUE, verbose = FALSE, virtual = FALSE)


Arguments

listname The name of the list as a quoted string of chars

file The local file name

type Could be one of "SQ", "AC", "SP", "KW", see examples

socket a socket of class connection and sockconn returned by choosebank. Default value (auto) means that the socket will be set to the socket component of the banknameSocket variable.

invisible if FALSE, the result is returned visibly.


virtual if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA.

Details

Value

A list with the following components:

bank the name of the bank that has been chosen by choosebank.socket

call original call

name list name

nelem number of elements in the list on the server

typelist the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names.

req a list of sequence names that fit the required criteria, or NA when called with virtual = TRUE

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

choosebank, query


Examples

## Not run:
# Need internet connection
choosebank("emblTP")
#
# Example with a file that contains sequence names:
#
fileSQ <- system.file("sequences/bb.mne", package = "seqinr")
crelistfromclientdata("listSQ", file = fileSQ, type = "SQ")
sapply(listSQ$req, getName)
#
# Example with a file that contains sequence accession numbers:
#
fileAC <- system.file("sequences/bb.acc", package = "seqinr")
crelistfromclientdata("listAC", file = fileAC, type = "AC")
sapply(listAC$req, getName)
#
# Example with a file that contains species names:
#
fileSP <- system.file("sequences/bb.sp", package = "seqinr")
crelistfromclientdata("listSP", file = fileSP, type = "SP")
sapply(listSP$req, getName)
#
# Example with a file that contains keywords:
#
fileKW <- system.file("sequences/bb.kwd", package = "seqinr")
crelistfromclientdata("listKW", file = fileKW, type = "KW")
sapply(listKW$req, getName)

## End(Not run)

dia.bactgensize Distribution of bacterial genome size from GOLD

Description

This function tries to download the last update of the GOLD (Genomes OnLine Database) to extract bacterial genome sizes when available. The histogram and the default density() output are produced. Optionally, a maximum likelihood estimate of a superposition of two or three normal distributions is also represented.

Usage

dia.bactgensize(fit = 2, p = 0.5, m1 = 2000, sd1 = 600, m2 = 4500,
  sd2 = 1000, p3 = 0.05, m3 = 9000, sd3 = 1000)


Arguments

fit integer value. If fit == 0 no normal fit is produced, if fit == 2 try to fit a superposition of two normal distributions, if fit == 3 try to fit a superposition of three normal distributions.

p initial guess for the proportion of the first population.

m1 initial guess for the mean of the first population.

sd1 initial guess for the standard deviation of the first population.

m2 initial guess for the mean of the second population.

sd2 initial guess for the standard deviation of the second population.

p3 initial guess for the proportion of the third population.

m3 initial guess for the mean of the third population.

sd3 initial guess for the standard deviation of the third population.
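The role of the initial guesses is easier to see on a worked maximum likelihood fit of a two-component normal mixture. The sketch below is an assumption-laden illustration: it uses simulated sizes (invented, not GOLD data), a general-purpose optimiser (optim with its default Nelder-Mead method) and a logit/log reparametrisation chosen here for convenience; how the package performs its own fit is not stated in this page.

```r
set.seed(1)
# Simulated genome sizes in Kb: invented data, NOT GOLD records.
sizes <- c(rnorm(300, mean = 2000, sd = 600), rnorm(300, mean = 4500, sd = 1000))

# Negative log-likelihood of a two-component normal mixture. The proportion
# is handled on the logit scale and the standard deviations on the log
# scale, so that the search is unconstrained.
nll <- function(theta, x) {
  p <- plogis(theta[1])
  dens <- p * dnorm(x, theta[2], exp(theta[3])) +
    (1 - p) * dnorm(x, theta[4], exp(theta[5]))
  # Floor the density to avoid log(0) far from the data.
  -sum(log(pmax(dens, .Machine$double.xmin)))
}

# Start from the documented default guesses:
# p = 0.5, m1 = 2000, sd1 = 600, m2 = 4500, sd2 = 1000.
start <- c(qlogis(0.5), 2000, log(600), 4500, log(1000))
fit <- optim(start, nll, x = sizes,
             control = list(parscale = c(1, 1000, 1, 1000, 1), maxit = 5000))

# Back-transform to the natural scale.
estimate <- c(p = plogis(fit$par[1]), m1 = fit$par[2], sd1 = exp(fit$par[3]),
              m2 = fit$par[4], sd2 = exp(fit$par[5]))
```

Mixture likelihoods typically have several local optima, which is why reasonable starting values such as m1 and m2 matter in practice.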

Value

An invisible dataframe with three components:

comp1 genus name

comp2 species names

comp3 genome size in Kb

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

density

Examples

## Not run: dia.bactgensize()


dinucl Mean zscore on 242 complete bacterial chromosomes

Description

This dataset contains the mean zscores as computed on all intergenic sequences (intergenic) and on all CDS (coding) from 242 complete bacterial chromosomes (as retrieved from Genome Reviews database on June 16, 2005).

Usage

data(dinucl)

Format

List of two dataframes of 242 chromosomes and 16 dinucleotides: one for intergenic, one for coding sequences.

intergenic the mean of zscore computed with the base model on each intergenic sequence

coding the mean of zscore computed with the codon model on each coding sequence

References

Palmeira, L., Guéguen, L. and Lobry, J.R. (2006) UV-targeted dinucleotides are not depleted in light-exposed Prokaryotic genomes. Molecular Biology and Evolution, 23:2214-2219. http://mbe.oxfordjournals.org/cgi/reprint/23/11/2214

citation("seqinr")

See Also

zscore

Examples

data(dinucl)


dist.alignment Pairwise Distances from Aligned Protein or DNA/RNA Sequences

Description

This function computes a matrix of pairwise distances from aligned sequences using a similarity (Fitch matrix) or identity matrix.

Usage

dist.alignment(x, matrix = c("similarity", "identity"))

Arguments

x an object of class alignment, as returned by read.alignment for instance

matrix the matrix distance to be used, partial matching allowed

Value

The distance matrix, object of class dist, computed by using the specified distance measure.

Author(s)


References

The reference for the similarity matrix is:

Fitch, W.M. (1966) Mutation values for the interconversion of amino acid pair. J. Mol. Biol., 16:9-16.

citation("seqinr")

See Also

read.alignment

Examples

myseqs <- read.alignment(file = system.file("sequences/test.mase",
  package = "seqinr"), format = "mase")
dist.alignment(myseqs, matrix = "identity" )

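To see what a pairwise distance on an alignment looks like, the sketch below computes the fraction of mismatching columns between toy aligned sequences and stores it as a dist object. This is only one simple choice made for illustration: the exact definition of the identity distance in dist.alignment is not given on this page and may differ (for instance in its treatment of gaps or in a final transformation). The sequences and the helper name are invented.

```r
# Toy aligned sequences of equal length (invented for illustration).
aligned <- c(seq1 = "acgt-acgta", seq2 = "acgtaacgtt", seq3 = "ttgtaacgtt")

# Hypothetical helper: fraction of alignment columns at which two sequences
# differ. Gaps are treated as ordinary characters here (an assumption).
mismatch_fraction <- function(s1, s2) {
  a <- strsplit(s1, split = "")[[1]]
  b <- strsplit(s2, split = "")[[1]]
  stopifnot(length(a) == length(b))
  mean(a != b)
}

n <- length(aligned)
d <- matrix(0, n, n, dimnames = list(names(aligned), names(aligned)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mismatch_fraction(aligned[i], aligned[j])
  }
}

# Same class as the value documented above: an object of class dist.
toy_dist <- as.dist(d)
```

An object of class dist can be passed directly to functions such as hclust for clustering.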

dotPlot Dot Plot Comparison of two sequences

Description

Dot plots are most likely the oldest visual representation used to compare two sequences (see Maizel and Lenk 1981 and references therein). In its simplest form, a dot is produced at position (i,j) if and only if character number i in the first sequence is the same as character number j in the second sequence. More elaborate forms use sliding windows and a threshold value for two windows to be considered as matched.

Usage

dotPlot(seq1, seq2, wsize = 1, wstep = 1, nmatch = 1, col = c("white", "black"),
  xlab = deparse(substitute(seq1)), ylab = deparse(substitute(seq2)), ...)

Arguments

seq1 the first sequence (x-axis) as a vector of single chars.

seq2 the second sequence (y-axis) as a vector of single chars.

wsize the size in chars of the moving window.

wstep the size in chars for the steps of the moving window. Use wstep == wsize for non-overlapping windows.

nmatch if the number of matches per window is greater than or equal to nmatch then a dot is produced.

col color of points passed to image.

xlab label of x-axis passed to image.

ylab label of y-axis passed to image.

... further arguments passed to image.

Value

NULL.

Author(s)

J.R. Lobry

References

Maizel, J.V. and Lenk, R.P. (1981) Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein Sequences. Proceedings of the National Academy of Science USA, 78:7665-7669.

citation("seqinr")


See Also

image

Examples

#
# Identity is on the main diagonal:
#
dotPlot(letters, letters, main = "Direct repeat")
#
# Internal repeats are off the main diagonal:
#
dotPlot(rep(letters, 2), rep(letters, 2), main = "Internal repeats")
#
# Inversions are orthogonal to the main diagonal:
#
dotPlot(letters, rev(letters), main = "Inversion")
#
# Insertion in the second sequence yields a vertical jump:
#
dotPlot(letters, c(letters[1:10], s2c("insertion"), letters[11:26]),
  main = "Insertion in the second sequence", asp = 1)
#
# Insertion in the first sequence yields a horizontal jump:
#
dotPlot(c(letters[1:10], s2c("insertion"), letters[11:26]), letters,
  main = "Insertion in the first sequence", asp = 1)
#
# Protein sequences usually have a good signal/noise ratio because there
# are 20 possible amino-acids:
#
aafile <- system.file("sequences/seqAA.fasta", package = "seqinr")
protein <- read.fasta(aafile)[[1]]
dotPlot(protein, protein, main = "Dot plot of a protein\nwsize = 1, wstep = 1, nmatch = 1")
#
# Nucleic acid sequences usually have a poor signal/noise ratio because
# there are only 4 different bases:
#
dnafile <- system.file("sequences/malM.fasta", package = "seqinr")
dna <- read.fasta(dnafile)[[1]]
dotPlot(dna[1:200], dna[1:200], main = "Dot plot of a nucleic acid sequence\nwsize = 1, wstep = 1, nmatch = 1")
#
# Play with the wsize, wstep and nmatch arguments to increase the
# signal/noise ratio:
#
dotPlot(dna[1:200], dna[1:200], wsize = 3, wstep = 3, nmatch = 3,
  main = "Dot plot of a nucleic acid sequence\nwsize = 3, wstep = 3, nmatch = 3")
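The simplest form of the dot plot (window size 1) amounts to a logical matrix with TRUE at (i, j) when character i of the first sequence equals character j of the second. The base R sketch below builds that matrix with outer; dot_matrix() is a hypothetical helper that ignores the wsize, wstep and nmatch refinements, and it is not the package's implementation.

```r
# Hypothetical helper: match matrix for window size 1.
dot_matrix <- function(seq1, seq2) outer(seq1, seq2, FUN = "==")

# A sequence against itself: matches lie on the main diagonal.
m_direct <- dot_matrix(letters, letters)

# A sequence against its reverse: matches lie on the anti-diagonal.
m_invert <- dot_matrix(letters, rev(letters))

# To draw the matrix with base graphics (white = mismatch, black = match):
# image(x = seq_along(letters), y = seq_along(letters), z = m_direct * 1,
#       col = c("white", "black"))
```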

dotchart.uco Cleveland plot for codon usage tables


Description

Draw a Cleveland dot plot for codon usage tables

Usage

dotchart.uco(x, numcode = 1, aa3 = TRUE, cex = 0.7, alphabet = s2c("tcag"),
  pch = 21, gpch = 20, bg = par("bg"), color = par("fg"), gcolor = par("fg"),
  lcolor = "gray", xlim, ...)

Arguments

x table of codon usage as computed by uco.

numcode the number of the code to be used by translate.

aa3 logical. If TRUE use the three-letter code for amino acids. If FALSE use the one-letter code for amino acids.

cex the character size to be used.

alphabet character for codons labels

pch the plotting character or symbol to be used.

gpch the plotting character or symbol to be used for group values.

bg the background color to be used.

color the color(s) to be used for points and labels.

gcolor the single color to be used for group labels and values.

lcolor the color(s) to be used for the horizontal lines.

xlim horizontal range for the plot

... graphical parameters can also be specified as arguments

Value

An invisible list with components:

x table of codon usage

labels codon names

groups amino acid factor

gdata sums by amino acid

ypg the y-axis coordinates for amino acids

ypi the y-axis coordinates for codons

Author(s)

J.R. Lobry

References

Cleveland, W. S. (1985) The Elements of Graphing Data. Monterey, CA: Wadsworth.

citation("seqinr")


See Also

dotchart, uco, aaa, translate

Examples

# Load dataset:
data(ec999)
# Compute codon usage for all coding sequences:
ec999.uco <- lapply(ec999, uco, index="eff")
# Put it in a dataframe:
df <- as.data.frame(lapply(ec999.uco, as.vector))
# Add codon names:
row.names(df) <- names(ec999.uco[[1]])
# Compute global codon usage:
global <- rowSums(df)
# Choose a title for the graph:
title <- "Codon usage in 999 E. coli coding sequences"
# Plot data:
dotchart.uco(global, main = title)

draw.oriloc Graphical representation for nucleotide skews in prokaryotic chromosomes.

Description

Graphical representation for nucleotide skews in prokaryotic chromosomes.

Usage

draw.oriloc(ori, main = "Title",
    xlab = "Map position in Kb",
    ylab = "Cumulated combined skew in Kb", las = 1, las.right = 3,
    ta.mtext = "Cumul. T-A skew", ta.col = "pink", ta.lwd = 1,
    cg.mtext = "Cumul. C-G skew", cg.col = "lightblue", cg.lwd = 1,
    cds.mtext = "Cumul. CDS skew", cds.col = "lightgreen", cds.lwd = 1,
    sk.col = "black", sk.lwd = 2,
    add.grid = TRUE, ...)

Arguments

ori A data frame obtained with the oriloc function.

main The main title of the plot.

xlab The x-axis title.

ylab The y-axis title.

las The style of axis labels for the bottom and left axes.

las.right The style of axis labels for the right axis.


ta.mtext The marginal legend for the TA skew.

ta.col The color for the TA skew.

ta.lwd The line width for the TA skew.

cg.mtext The marginal legend for the CG skew.

cg.col The color for the CG skew.

cg.lwd The line width for the CG skew.

cds.mtext The marginal legend for the CDS skew.

cds.col The color for the CDS skew.

cds.lwd The line width for the CDS skew.

sk.col The color for the cumulated combined skew.

sk.lwd The line width for the cumulated combined skew.

add.grid Logical, if TRUE a vertical grid is added to the plot.

... Further arguments are passed to the function plot.

Details

Author(s)

Jean R. Lobry

References

citation("seqinr")

See Also

oriloc, rearranged.oriloc, extract.breakpoints

Examples

#
# Example with Chlamydia trachomatis complete genome
#
ori <- oriloc()
draw.oriloc(ori)
#
# The same, using more options from function draw.oriloc()
#
draw.oriloc(ori,
    main = expression(italic(Chlamydia~~trachomatis)~~complete~~genome),
    ta.mtext = "TA skew", ta.col = "red",
    cg.mtext = "CG skew", cg.col = "blue",
    cds.mtext = "CDS skew", cds.col = "seagreen",
    add.grid = FALSE)


draw.rearranged.oriloc Graphical representation for rearranged nucleotide skews in prokaryotic chromosomes.

Description

Graphical representation for rearranged nucleotide skews in prokaryotic chromosomes.

Usage

draw.rearranged.oriloc(rearr.ori, breaks.gcfw = NA, breaks.gcrev = NA, breaks.atfw = NA, breaks.atrev = NA)

Arguments

rearr.ori A data frame obtained with the rearranged.oriloc function.

breaks.gcfw The coordinates of the breakpoints in the GC-skew, for forward transcribed protein coding sequences. These coordinates can be obtained with the extract.breakpoints function.

breaks.gcrev The coordinates of the breakpoints in the GC-skew, for reverse transcribed protein coding sequences. These coordinates can be obtained with the extract.breakpoints function.

breaks.atfw The coordinates of the breakpoints in the AT-skew, for forward transcribed protein coding sequences. These coordinates can be obtained with the extract.breakpoints function.

breaks.atrev The coordinates of the breakpoints in the AT-skew, for reverse transcribed protein coding sequences. These coordinates can be obtained with the extract.breakpoints function.

Author(s)

Jean R. Lobry and A. Necsulea

References

Necsulea, A. and Lobry, J.R. (in prep) A novel method for assessing the effect of replication on DNA base composition asymmetry.

See Also

rearranged.oriloc, extract.breakpoints


Examples

### Example for Chlamydia trachomatis ####

### Rearrange the chromosome and compute the nucleotide skews ###

## Not run:
r.ori <- rearranged.oriloc(
    seq.fasta = system.file("sequences/ct.fasta", package = "seqinr"),
    g2.coord = system.file("sequences/ct.coord", package = "seqinr"))
## End(Not run)

### Extract the breakpoints for the rearranged nucleotide skews ###

## Not run: breaks <- extract.breakpoints(r.ori, type = c("gcfw", "gcrev"), nbreaks = c(2, 2), gridsize = 50, it.max = 100)

### Draw the rearranged nucleotide skews and place the position of the breakpoints on the graphics ###

## Not run: draw.rearranged.oriloc(r.ori, breaks.gcfw = breaks$gcfw$breaks, breaks.gcrev = breaks$gcrev$breaks)

ec999 999 coding sequences from E. coli

Description

This dataset contains 999 coding sequences from the Escherichia coli chromosome

Usage

data(ec999)

Format

List of 999 vectors of characters, one for each coding sequence.

ECFOLE.FOLE chr [1:672] "A" "T" "G" "C" ...

ECMSBAG.MSBA chr [1:1749] "A" "T" "G" "C" ...

ECNARZYW-C.NARV chr [1:681] "A" "T" "G" "A" ...

... \... TRUNCATED ...\

XYLEECOM.MALK chr [1:1116] "A" "T" "G" "G" ...

XYLEECOM.LAMB chr [1:1341] "A" "T" "G" "A" ...

XYLEECOM.MALM chr [1:921] "A" "T" "G" "A" ...


References

Lobry, J.R., Gautier, C. (1994) Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Research, 22:3174-3180.

citation("seqinr")

Examples

data(ec999)

extract.breakpoints Extraction of breakpoint positions on the rearranged nucleotide skews.

Description

Extraction of breakpoint positions on the rearranged nucleotide skews.

Usage

extract.breakpoints(rearr.ori, type = c("atfw", "atrev", "gcfw", "gcrev"),
    nbreaks, gridsize = 100, it.max = 500)

Arguments

rearr.ori A data frame obtained with the rearranged.oriloc function.

type The type of skew for which to extract the breakpoints; must be a subset of c("atfw","atrev","gcfw","gcrev").

nbreaks The number of breakpoints to extract for each type of skew. Provide a vector of the same length as type.

gridsize To make sure that the best breakpoints are found, and to avoid finding only a local extremum of the likelihood and residual sum of squares functions, a grid search is performed. The search for breakpoints is repeated gridsize times, with different starting values for the breakpoints.

it.max The maximum number of iterations to be performed when searching for the breakpoints. This argument corresponds to the it.max argument in segmented.

Details

This method uses the segmented function in the segmented package to extract the breakpoints positions in the rearranged nucleotide skews obtained with the rearranged.oriloc function. To make sure that the best breakpoints are found, and to avoid finding only a local extremum of the likelihood and residual sum of squares functions, a grid search is performed. The search for breakpoints is repeated gridsize times, with different starting values for the breakpoints.
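The idea of a grid search over breakpoint positions can be illustrated with a minimal base R sketch. This is a simplification and not the package implementation: it fits a single breakpoint on invented, noise-free data with lm() and scans every candidate position, whereas extract.breakpoints restarts the iterative segmented fit from different starting values. The names x, y, rss_at, grid and best are hypothetical.

```r
# Synthetic "skew" (invented data): slope +1 up to x = 40, slope -1 afterwards.
x <- 1:100
y <- ifelse(x <= 40, x, 80 - x)

# Residual sum of squares of a continuous two-segment fit with a break at psi.
rss_at <- function(psi) {
  fit <- lm(y ~ x + pmax(x - psi, 0))
  sum(residuals(fit)^2)
}

# Try every candidate on the grid and keep the one with the smallest RSS.
grid <- seq(5, 95, by = 1)
rss <- vapply(grid, rss_at, numeric(1))
best <- grid[which.min(rss)]
best
```

Scanning many candidates (or many starting values) reduces the risk of reporting a breakpoint that is only locally optimal, at the cost of one model fit per candidate.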


Value

This function returns a list, with as many elements as the type argument (for example $gcfw will contain the results for the rearranged GC-skew, for forward-encoded genes). Each element of this list is also a list, containing the following information: in $breaks the position of the breakpoints on the rearranged chromosome; in $slopes.left the slopes of the segments on the left side of each breakpoint; in $slopes.right the slopes of the segments on the right side of each breakpoint; in $real.coord, the coordinates of the breakpoints on the real chromosome (before rearrangement).

Author(s)

A. Necsulea

References

citation("segmented")


See Also

oriloc, draw.rearranged.oriloc, rearranged.oriloc

Examples






## Not run: breaks <- extract.breakpoints(r.ori, type = c("gcfw", "gcrev"), nbreaks = c(2, 2), gridsize = 50, it.max = 100)




extractseqs To extract the sequence information of a sequence or a list of sequences in different formats

Description

This function extracts large amounts of data, such as whole genome sequences, using different output formats and types of extraction. This function is not yet available for Windows.

Usage

extractseqs(listname, socket = "auto", format = "fasta", operation = "simple",
    feature = "xx", bounds = "xx", minbounds = "xx", verbose = FALSE, nzlines = 1000)
exseq(listname, socket = "auto", format = "fasta", operation = "simple",
    feature = "xx", bounds = "xx", minbounds = "xx", verbose = FALSE, nzlines = 1000)

Arguments

listname the name of list on server (may be a virtual list)


format the format of output. Can be acnuc, fasta, flat or coordinates

operation the type of extraction. Can be simple, translate, fragment, feature or region

feature -optional- the feature to be extracted (for operations "feature" or "region"): a feature table item (CDS, mRNA,...)

bounds -optional- the bounds for extraction (for operations "fragment" or "region")

minbounds -optional- the minimal bounds for extraction (for operations "fragment" or "region")


nzlines number of lines in zlib mode

Details

To extract a list of sequences (lrank argument) or a single sequence (seqnum argument) using different output formats and types of extraction. All formats except "coordinates" extract sequence data. Format "coordinates" extracts coordinate data; start > end indicates the complementary strand.

listname sequence list name.

socket a socket of class connection and sockconn returned by choosebank. Default value (auto) means that the socket will be set to the socket component of the banknameSocket variable.

format acnuc, fasta, flat or coordinates

operation simple, translate, fragment, feature or region

feature (for operations "feature" or "region") a feature table item (CDS, mRNA,...).


simple each sequence or subsequence is extracted.

translate meaningful only for protein-coding (sub)sequences that are extracted as protein sequences. Nothing is extracted for non-protein coding sequences.

fragment extracts any part of the sequence(s) in list. Such part is specified by the bounds and minbounds arguments according to the syntax suggested by these examples:

132,1600 to extract from nucl. 132 to nucl. 1600 of the sequence. If applied to a subsequence, coordinates are in the parent seq relatively to the subsequence start point.

-10,10 to extract from 10 nucl. BEFORE the 5’ end of the sequence to nucl. 10 of it. Useful only for subsequences, and produces a fragment extracted from its parent sequence.

e-20,e+10 to extract from 20 nucl. BEFORE the 3’ end of the sequence to 10 nucl. AFTER its 3’ end. Useful only for subsequences, and produces a fragment extracted from its parent sequence.

-20,e+5 to extract from 20 nucl. BEFORE the 5’ end of the sequence to 5 nucl. AFTER its 3’ end.

bounds (for operations "fragment" or "region") see syntax above.

minbounds same syntax as bounds. When the sequence data is too short for this quantity to be extracted, nothing is extracted. When the sequence data is between minbounds and bounds, extracted sequence data is extended by N’s to the desired length.

Value

Sequence data.

Author(s)

S.Penel

References

citation("seqinr")

See Also

choosebank, query, getlistrank

Examples

## Not run:
# Need internet connection
choosebank("swissprot")
query("MyListName", "k=globin", virtual = TRUE)
MyList.fasta <- exseq("MyListName", verbose = TRUE)
## End(Not run)


gb2fasta conversion of GenBank file into fasta file

Description

Converts a single entry in GenBank format into a fasta file.

Usage

gb2fasta(source.file = "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Agrobacterium_tumefaciens_C58_Cereon/NC_003065.gbk",
    destination.file = "Agrobacterium_tumefaciens_C58_Cereon.fasta")

Arguments

source.file GenBank file

destination.file Fasta file

Details

Multiple entries in GenBank file are not supported.

Value

none

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

oriloc

Examples

## Not run: gb2fasta()


gbk2g2 Conversion of a GenBank format file into a glimmer-like one

Description

This function reads a file in GenBank format and converts the features corresponding to CDS (Coding Sequences) into a format similar to glimmer program output.

Usage

gbk2g2(gbkfile = "ftp://pbil.univ-lyon1.fr/pub/logiciel/oriloc/ct.gbk",
    g2.coord = "g2.coord")

Arguments

gbkfile The name of the GenBank file

g2.coord The name of the glimmer-like file

Details

Partial CDS (either 5’ or 3’) and join in features are discarded.

Value

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

oriloc

Examples

## Not run: gbk2g2()


gbk2g2.euk Conversion of a GenBank format file into a glimmer-like one. Eukaryotic version.

Description

This function reads a file in GenBank format and converts the features corresponding to CDS (Coding Sequences) into a format similar to glimmer program output. This function is specifically made for eukaryotic sequences, i.e. with introns.

Usage

gbk2g2.euk(gbkfile = system.file("sequences/ame1.gbk", package = "seqinr"),
    g2.coord = "g2.coord")

Arguments

gbkfile The name of the GenBank file

g2.coord The name of the output file

Details

This function returns the coordinates of the exons annotated in the GenBank format file.

Value

A data frame with three columns will be written to the g2.coord file. The first column corresponds to the name of the gene, given in the GenBank file through the /gene feature. The second and third columns contain the start and the stop position of the exon.

Author(s)

J.R. Lobry and A. Necsulea

References

citation("seqinr")

See Also

oriloc, gbk2g2

Examples

## Not run: gbk2g2.euk()


get.db.growth Get the exponential growth of nucleic acid database content

Description

Connects to the embl database to read the last release note about the number of nucleotides in the DDBJ/EMBL/Genbank database content. A log-linear fit is represented by dia.db.growth() with an estimate of the doubling time in months.
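How a doubling time follows from a log-linear fit can be sketched in base R. The data below are invented (a size that doubles every 18 months) and the variable names are hypothetical; this is not the code used by dia.db.growth(), only an illustration of the relation doubling time = log(2) / slope.

```r
# Hypothetical data: a database size that doubles every 18 months.
months <- seq(0, 120, by = 12)
size <- 1e6 * 2^(months / 18)

# Log-linear fit: log(size) is modelled as a straight line in time.
fit <- lm(log(size) ~ months)

# The slope is the exponential growth rate per month,
# so the doubling time in months is log(2) divided by the slope.
doubling_time <- log(2) / unname(coef(fit)["months"])
doubling_time
```

With real release-note statistics the points would scatter around the line, and the estimate would carry the uncertainty of the fitted slope.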

Usage

get.db.growth(where = "http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.txt")
dia.db.growth(get.db.growth.out = get.db.growth(), Moore = TRUE, ...)

Arguments

where the file containing the database growth table.

get.db.growth.out the output from get.db.growth()

Moore logical, if TRUE add lines corresponding to an exponential growth rate with a doubling time of 18 months, that is Moore’s law.

... further arguments to plot

Value

A dataframe with the statistics from the embl site.

Author(s)

J.R. Lobry

References

http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.txt

citation("seqinr")

Examples

## Not run: data <- get.db.growth()
## Not run: dia.db.growth(data)




get.ncbi Bacterial complete genome data from ncbi ftp site

Description

Try to connect to ncbi ftp site to get a list of complete bacterial genomes.

Usage

get.ncbi(repository = "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/")

Arguments

repository Where to look for data. The default value is the location of the complete bacterial genome sequences at ncbi ftp repository.

Value

Returns a data frame which contains the following columns:

species The species name as given by the corresponding folder name in the repository (e.g. Yersinia_pestis_KIM).

accession The accession number as given by the common prefix of file names in the repository (e.g. NC_004088).

size.bp The size of the sequence in bp (e.g. 4600755).

type A factor with two levels (plasmid or chromosome) tentatively deduced from the description of the sequence.

WARNING

This function is highly dependent on ncbi ftp site conventions over which we have no control. The ftp connection apparently does not work when there is a proxy; this problem is circumvented here in a rather crude way.

Author(s)

J.R. Lobry

References

citation("seqinr")

Examples

## Not run: bacteria <- get.ncbi()
## Not run: summary(bacteria)


getType To get available subsequence types in an opened ACNUC database

Description

This function returns all subsequence types (e.g. CDS, TRNA) present in an opened ACNUC database, using default database if no socket is provided.

Usage

getType(socket = "auto")

Arguments

socket An object of class connection returned by choosebank. The default bank is used when socket == "auto".

Value

It returns a list containing a short description for each sequence type.

Author(s)


References

citation("seqinr")

See Also

choosebank, query

Examples

## Not run:
# Need internet connection
choosebank("genbank")
getType()
## End(Not run)


getlistrank To get the rank of a list from its name

Description

This is a low level function to get the rank of a list on server from its name.

Usage

getlistrank(listname, socket = "auto", verbose = FALSE)
glr(listname, socket = "auto", verbose = FALSE)

Arguments

listname the name of list on server



Details

This low level function is usually not used directly by the user.

Value

The rank of list named listname on server, or 0 if no list with this name exists.

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

choosebank, query

Examples

## Not run:
# Need internet connection
choosebank("genbank")
query("MyListName", "sp=Borrelia burgdorferi", virtual = TRUE)
glr("MyListName")
#
# Should be:
#
# [1] 2
#
## End(Not run)

kaks to Get an Estimation of Ka and Ks

Description

Ks and Ka are respectively the number of substitutions per synonymous site and per nonsynonymous site between two protein-coding genes. The ratio of nonsynonymous (Ka) to synonymous (Ks) nucleotide substitution rates is an indicator of selective pressures on genes. A ratio significantly greater than 1 indicates positive selective pressure. A ratio around 1 indicates either neutral evolution at the protein level or an averaging of sites under positive and negative selective pressures. A ratio less than 1 indicates pressures to conserve protein sequence (i.e. purifying selection). This function estimates the Ka and Ks values for a set of aligned sequences using the method published by Li (1993) and gives the associated variance matrix.

Usage

kaks(x, debug = FALSE, forceUpperCase = TRUE)

Arguments

x An object of class alignment

debug If TRUE turns debug mode on

forceUpperCase If TRUE, the default value, all characters in sequences are forced to the upper case if at least one ’a’, ’c’, ’g’, or ’t’ is found in the sequences. Turning it to FALSE if the sequences are already in upper case will save time.

Value

ks matrix of Ks values

ka matrix of Ka values

vks variance matrix of Ks

vka variance matrix of Ka

Note

When the alignment does not contain enough information (i.e. we approach saturation), the Ka and Ks values take the value 10. Negative values indicate that Ka and Ks cannot be computed. Codons with ambiguous bases are treated as gaps. Codons with gaps are not used for computations.
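When a Ka/Ks ratio is derived from the returned estimates, the special values described in this note have to be screened out first. The base R sketch below uses invented numbers in plain named vectors (the objects ka, ks, usable and ratio are hypothetical, and the real return values are matrices), so it only illustrates the filtering logic under those assumptions.

```r
# Hypothetical Ka and Ks estimates for three sequence pairs (invented values).
ka <- c(pair1 = 0.02, pair2 = 0.30, pair3 = 10)
ks <- c(pair1 = 0.20, pair2 = 0.25, pair3 = -1)

# Keep only pairs where both estimates are usable:
# 10 flags saturation and negative values flag a failed computation.
usable <- ka >= 0 & ks > 0 & ka < 10 & ks < 10

# Ratio for usable pairs, NA otherwise.
ratio <- ifelse(usable, ka / ks, NA_real_)
ratio
```

A ratio computed this way is only a point estimate; deciding whether it differs significantly from 1 would also need the variances (vka and vks).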


Author(s)


References

Li, W.-H. (1993) Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol., 36:96-99.

Hurst, L.D. (2002) The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet., 18:486-486.

The C program implementing this method was provided by Manolo Gouy. More info is needed here to trace back the original C source so as to credit correct source. The original FORTRAN-77 code by Chung-I Wu modified by Ken Wolfe is available here http://wolfe.gen.tcd.ie/lab/pub/li93/.

For a recent discussion about the estimation of Ka and Ks see:
Tzeng, Y.H., Pan, R., Li, W.-H. (2004) Comparison of three methods for estimating rates of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol, 21:2290-2298.

The method implemented here is noted LWL85 in the above paper.

citation("seqinr")

See Also

read.alignment

Examples

#
# Simple Toy example:
#
s <- read.alignment(File = system.file("sequences/test.phylip", package = "seqinr"), format = "phylip")
kaks(s)
#
# Check numeric results on a simple test example:
#
data(AnoukResult)
Anouk <- read.alignment(File = system.file("sequences/Anouk.fasta", package = "seqinr"), format = "fasta")
# all.equal() returns a character vector when objects differ, hence isTRUE()
if (!isTRUE(all.equal(kaks(Anouk), AnoukResult))) {
  warning("Poor numeric results with Anouk test file")
} else {
  print("Results are OK with Anouk test file")
}

lseqinr To see what’s inside the package seqinr

Description

This is just a shortcut for ls("package:seqinr")


Usage

lseqinr()

Value

The list of objects in the package seqinr

Note

Use library(help=seqinr) to have a summary of the functions available in the package.

Author(s)

J.R. Lobry

References

citation("seqinr")

Examples

lseqinr()

n2s function to convert the numeric encoding of a DNA sequence into a vector of characters

Description

By default, if no ‘levels’ argument is provided, this function will just transform your vector of integers into a DNA sequence according to the lexical order: 0 -> "a", 1 -> "c", 2 -> "g", 3 -> "t", others -> NA.

Usage

n2s(nseq, levels = c("a", "c", "g", "t"), base4 = TRUE)

Arguments

nseq A vector of integers

levels the translation vector

base4 when this logical is true, the numerical encoding of levels starts at 0, when it is false the numerical encoding of levels starts at 1.

Value

a vector of characters
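The mapping described above can be mimicked in a few lines of base R. The function n2s_sketch below is a hypothetical re-implementation written for illustration only; it is not the package source, and it assumes that out-of-range codes and NA both map to NA, as stated in the Description.

```r
# Hypothetical re-implementation for illustration only; not the package source.
n2s_sketch <- function(nseq, levels = c("a", "c", "g", "t"), base4 = TRUE) {
  # With base4 = TRUE codes start at 0, so shift them to R's 1-based indexing.
  idx <- if (base4) nseq + 1 else nseq
  # Anything that is missing or outside the translation vector becomes NA.
  idx[is.na(idx) | idx < 1 | idx > length(levels)] <- NA
  levels[idx]
}

n2s_sketch(c(0, 1, 2, 3, 777, NA))
n2s_sketch(1:4, base4 = FALSE)
```

Changing the levels argument, for instance replacing "t" with "u", gives an RNA alphabet with the same numeric encoding.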


Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

s2n

Examples

##example of the default behaviour:
nseq <- sample(x = 0:3, size = 100, replace = TRUE)
n2s(nseq)
# Show what happens with out-of-range and NA values:
nseq[1] <- NA
nseq[2] <- 777
n2s(nseq)[1:10]
# How to get an RNA instead:
n2s(nseq, levels = c("a", "c", "g", "u"))

oriloc Prediction of origin and terminus of replication in bacteria.

Description

This program finds the putative origin and terminus of replication in prokaryotic genomes. The program discriminates between codon positions.

Usage

oriloc(seq.fasta = system.file("sequences/ct.fasta", package = "seqinr"),
    g2.coord = system.file("sequences/ct.predict", package = "seqinr"),
    glimmer.version = 3,
    oldoriloc = FALSE, gbk = NULL, clean.tmp.files = TRUE, rot = 0)

Arguments

seq.fasta the name of a file which contains the DNA sequence of a bacterial chromosome in fasta format

g2.coord the name of file which contains the output of glimmer program (*.predict in glimmer version 3)

glimmer.version glimmer version used, could be 2 or 3

oldoriloc logical to be set at TRUE to reproduce the (deprecated) outputs of previous (publication date: 2000) version of the oriloc program

gbk the URL of a file in GenBank format

clean.tmp.files Logical, if TRUE temporary files are removed

rot Integer, with zero default value, used to permute circularly the genome.

Details

The method builds on the fact that there are compositional asymmetries between the leading and the lagging strand for replication. The program works only with third codon positions so as to increase the signal/noise ratio. To discriminate between codon positions, the program uses as input either an annotated GenBank file, or a fasta file and a glimmer2.0 (or glimmer3.0) output file.

Value

A data.frame with seven columns: g2num for the CDS number in the g2.coord file, start.kb for the start position of CDS expressed in Kb (this is the position of the first occurrence of a nucleotide in a CDS regardless of its orientation), end.kb for the last position of a CDS, CDS.excess for the DNA walk for gene orientation (+1 for a CDS in the direct strand, -1 for a CDS in the reverse strand) cumulated over genes, skew for the cumulated composite skew in third codon positions, x for the cumulated T - A skew in third codon positions, y for the cumulated C - G skew in third codon positions.

Note

The method works only for genomes having a single origin of replication from which the replication is bidirectional. To detect the composition changes, a DNA-walk is performed. In a 2-dimensional DNA walk, a C in the sequence corresponds to a movement in the positive y-direction and G to a movement in the negative y-direction. T and A are mapped by analogous steps along the x-axis. When there is a strand asymmetry, this will form a trajectory that turns at the origin and terminus of replication. Each step is the sum of nucleotides in a gene in third codon positions. Then orthogonal regression is used to find a line through this trajectory. Each point in the trajectory will have a corresponding point on the line, and the coordinates of each are calculated. Thereafter, the distances from each of these points to the origin (of the plane) are calculated. These distances will represent a form of cumulative skew. This permits us to make a plot with the gene position (gene number, start or end position) on the x-axis and the cumulative skew (distance) on the y-axis. Depending on where the sequence starts, such a plot will display one or two peaks. Positive peak means origin, and negative means terminus. In the case of only one peak, the sequence starts at the origin or terminus site.
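The 2-dimensional DNA walk itself is easy to reproduce in base R. The sketch below works on an invented toy sequence, takes one step per base rather than one step per gene, and does not restrict itself to third codon positions; it stops before the orthogonal regression step. The names s, dx, dy and walk are hypothetical.

```r
# Toy sequence (invented); a real analysis would use third codon positions only.
s <- strsplit("ccctttggaaagc", "")[[1]]

# One unit step per base: C = +y, G = -y, T = +x, A = -x.
dx <- (s == "t") - (s == "a")
dy <- (s == "c") - (s == "g")

# Cumulated coordinates of the walk, starting from the origin of the plane.
walk <- data.frame(x = cumsum(dx), y = cumsum(dy))
walk
```

On a real chromosome with strand asymmetry, plotting walk$y against walk$x would show the trajectory changing direction near the origin and terminus of replication.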

Author(s)

J.R. Lobry and A.C. Frank


References

More illustrated explanations to help understand oriloc outputs are available there: http://pbil.univ-lyon1.fr/software/Oriloc/howto.html.

Examples of oriloc outputs on real sequence data are there: http://pbil.univ-lyon1.fr/software/Oriloc/index.html.

The original paper for oriloc:
Frank, A.C., Lobry, J.R. (2000) Oriloc: prediction of replication boundaries in unannotated bacterial chromosomes. Bioinformatics, 16:560-561.
http://bioinformatics.oupjournals.org/cgi/reprint/16/6/560

A simple informal introduction to DNA-walks:
Lobry, J.R. (1999) Genomic landscapes. Microbiology Today, 26:164-165.
http://www.socgenmicrobiol.org.uk/QUA/049906.pdf

An early and somewhat historical application of DNA-walks:
Lobry, J.R. (1996) A simple vectorial representation of DNA sequences for the detection of replication origins in bacteria. Biochimie, 78:323-326.

Glimmer, a very efficient open source software for the prediction of CDS from scratch in prokaryotic genome, is described at http://www.cbcb.umd.edu/software/glimmer/.
For a description of Glimmer 1.0 and 2.0 see:

Delcher, A.L., Harmon, D., Kasif, S., White, O., Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER, Nucleic Acids Research, 27:4636-4641.

Salzberg, S., Delcher, A., Kasif, S., White, O. (1998) Microbial gene identification using interpolated Markov models, Nucleic Acids Research, 26:544-548.

citation("seqinr")

See Also

draw.oriloc, rearranged.oriloc

Examples

## Not run:
#
# A little bit too long for routine checks because oriloc() is already
# called in draw.oriloc.Rd documentation file. Try example(draw.oriloc)
# instead, or copy/paste the following code:
#
out <- oriloc()
plot(out$start.kb, out$skew, type = "l", xlab = "Map position in Kb",
    ylab = "Cumulated composite skew",
    main = expression(italic(Chlamydia~~trachomatis)~~complete~~genome))
## End(Not run)

permutation Sequence permutation according to several different models

Description

Generates a random permutation of a given sequence, according to a given model. Available models are: base, position, codon, syncodon.

Usage

permutation(sequence,modele='base',frame=0,replace=FALSE,prot=FALSE,numcode=1,ucoweight = NULL)

Arguments

sequence A nucleic acids sequence

modele A string of characters describing the model chosen for the random generation

frame Only active for the position, codon, syncodon models: starting position of CDS as in splitseq

replace This option is not active for the syncodon model: if TRUE, sampling is done with replacement

prot Only available for the codon model: if TRUE, the first and last codons are preserved, and only internal codons are shuffled

numcode Only available for the syncodon model: the genetic code number as in translate.

ucoweight A list of weights containing the desired codon usage bias as generated by ucoweight. If none is specified, the codon usage of the given sequence is used.

Details

The base model allows for random sequence generation by shuffling (with/without replacement) of all bases in the sequence.

The position model allows for random sequence generation by shuffling (with/without replacement) of bases within their position in the codon (bases in position I, II or III stay in position I, II or III in the new sequence).

The codon model allows for random sequence generation by shuffling (with/without replacement) of codons.

The syncodon model allows for random sequence generation by shuffling (with/without replacement) of synonymous codons.
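As an illustration of what the position model preserves, the base R sketch below shuffles an invented 15-base sequence without replacement, separately within each codon position, assuming frame 0. It is not the package implementation; the names s, pos and shuffled are hypothetical.

```r
set.seed(1)  # only to make the sketch reproducible
s <- strsplit("atggctaaatgcgat", "")[[1]]   # invented 15-base sequence

# Codon position (1, 2 or 3) of every base, assuming frame 0.
pos <- rep(1:3, length.out = length(s))

# Shuffle bases without replacement, separately within each codon position.
shuffled <- s
for (p in 1:3) {
  i <- which(pos == p)
  shuffled[i] <- s[i][sample.int(length(i))]
}
paste(shuffled, collapse = "")
```

Because each base stays in its codon position, the base composition at positions I, II and III is unchanged, which is why GC, GC2 and GC3 are expected to be identical before and after this kind of permutation.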


Value

a sequence generated from the original one by a given model

Author(s)

Leonor Palmeira

References

citation("seqinr")

See Also

synsequence

Examples

data(ec999)
sequence=ec999[1][[1]]

new=permutation(sequence,modele='base')
identical(all.equal(count(new,1),count(sequence,1)),TRUE)

new=permutation(sequence,modele='position')
identical(all.equal(GC(new),GC(sequence)),TRUE)
identical(all.equal(GC2(new),GC2(sequence)),TRUE)
identical(all.equal(GC3(new),GC3(sequence)),TRUE)

new=permutation(sequence,modele='codon')
identical(all.equal(uco(new),uco(sequence)),TRUE)

new=permutation(sequence,modele='syncodon',numcode=1)
identical(all.equal(translate(new),translate(sequence)),TRUE)

plot.SeqAcnucWeb To Plot Subsequences on the Parent Sequence

Description

This function plots all the types of subsequences on a parent sequence. Subsequences are represented by colored rectangles on the parent sequence. As an example, types could be CDS, TRNA, RRNA .... In order to get all the types that are available for the selected database, please run getType.

Usage

## S3 method for class 'SeqAcnucWeb':
plot(x, type = "all", ...)


Arguments

x A sequence of class SeqAcnucWeb

type The type of subsequences to plot. If all, all the types of subsequences will be drawn.

... graphical parameters can also be specified as arguments

Value

The plot. A list giving, for each subsequence, its position on the parent sequence.

Author(s)

D. Charif

References

http://pbil.univ-lyon1.fr/databases/acnuc.html

citation("seqinr")

See Also

getType, query

Examples

## Not run:

### Need internet connection
choosebank("hovernucl")
query("list","sp=homo sapiens et k=globin@")
plot.SeqAcnucWeb(list$req[[22]])
plot.SeqAcnucWeb(list$req[[22]],type=c("CDS","5'NCR"))

## End(Not run)

pmw Protein Molecular Weight

Description

With default parameter values, returns the apparent molecular weight of one mole (6.0221415 e+23) of the input protein expressed in gram at sea level on Earth with terrestrial isotopic composition.

Usage

pmw(seqaa, Ar = c(C = 12.0107, H = 1.00794, O = 15.9994,
    N = 14.0067, P = 30.973762, S = 32.065), gravity = 9.81,
    unit = "gram", checkseqaa = TRUE)


Arguments

seqaa a protein sequence as a vector of single chars. Allowed values are "*ACDEFGHIKLMNPQRSTVWY"; non-allowed values are ignored.

Ar a named vector for the mean relative atomic masses of CHONPS atoms. Default values are those of natural terrestrial sources according to the 43rd IUPAC General Assembly in Beijing, China in August 2005 (see http://www.iupac.org/reports/periodic_table/ for updates).

gravity gravitational field constant in standard units. Defaults to 9.81 m/s2, that is, the average value at sea level on Earth. Negative values are not allowed.

unit a string that could be "gram" to get the result in grams (1 g = 0.001 kg) or "N"to get the result in Newton units (1 N = 1 kg.m/s2).

checkseqaa if TRUE pmw() warns if a non-allowed character in seqaa is found.

Details

Algorithm Computing the molecular mass of a protein is close to a linear form on amino-acid frequencies, but not exactly, since we have to remove n - 1 water molecules for peptide bond formation.

Cysteine All cysteines are supposed to be in reduced (-SH) form.

Methionine All methionines are supposed to be not oxidized.

Modifications No post-translational modifications (such as phosphorylations) are taken into account.

Rare Rare amino-acids (pyrrolysine and selenocysteine) are not handled.

Warning Do not use default values for Ar to compute the molecular mass of alien proteins: the isotopic composition for CHONPS atoms could be different from terrestrial data in a xenobiotic context. Some aliens are easily offended, make sure not to initiate one more galactic war by reporting wrong results.
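The algorithm described above can be sketched in base R for a two-residue peptide. This is only an illustration under stated assumptions, not the code of pmw(): the helper names (formula_mass, peptide_mass) and the two-entry composition table are hypothetical, and only the atomic masses are taken from the default value of Ar.

```r
# Mean relative atomic masses, copied from the default value of Ar in Usage.
Ar <- c(C = 12.0107, H = 1.00794, O = 15.9994,
        N = 14.0067, P = 30.973762, S = 32.065)

# Elemental composition of two free amino acids (illustrative subset only).
composition <- list(
  G = c(C = 2, H = 5, O = 2, N = 1),  # glycine, C2H5NO2
  A = c(C = 3, H = 7, O = 2, N = 1)   # alanine, C3H7NO2
)
water <- c(H = 2, O = 1)

# Hypothetical helper: molar mass of a formula given as named atom counts.
formula_mass <- function(atoms, Ar) sum(atoms * Ar[names(atoms)])

# Hypothetical helper: sum the free amino-acid masses, then remove
# n - 1 water molecules, one per peptide bond.
peptide_mass <- function(seqaa, Ar) {
  residues <- vapply(composition[seqaa], formula_mass, numeric(1), Ar = Ar)
  sum(residues) - (length(seqaa) - 1) * formula_mass(water, Ar)
}

gly_ala <- peptide_mass(c("G", "A"), Ar)
gly_ala
```

The result is lower than the sum of the two free amino-acid masses by exactly one water molecule, which is the non-linear term mentioned in the Algorithm entry.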

Value

The protein molecular weight as a single numeric value.

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

s2c, c2s, aaa, a


Examples

allowed <- s2c("*ACDEFGHIKLMNPQRSTVWY") # All allowed chars in a protein
pmw(allowed)
all.equal(pmw(allowed), 2395.71366) # Should be true on most platforms
#
# Compute the apparent molecular weight on Moon surface:
#
pmw(allowed, g = 1.6)
#
# Compute the apparent molecular weight in absence of gravity:
#
pmw(allowed, g = 0) # should be zero
#
# Reports results in Newton units:
#
pmw(allowed, unit = "N")
#
# Compute the mass in kg of one mol of this protein:
#
pmw(allowed)/10^3
#
# Compute the mass for all amino-acids:
#
sapply(allowed[-1], pmw) -> aamw
names(aamw) <- aaa(names(aamw))
aamw

prochlo Zscore on three strains of Prochlorococcus marinus

Description

This dataset contains the zscores computed with the codon model on all CDS from 3 strains of Prochlorococcus marinus (as retrieved from Genome Reviews database on June 16, 2005).

Usage

data(prochlo)

Format

List of three dataframes of the zscore of each of the 16 dinucleotides on each CDS retrieved from the specific strain.

BX548174 strain adapted to living at a depth of 5 meters (high levels of UV exposure) base model on each intergenic sequence

AE017126 strain adapted to living at a depth of 120 meters (low levels of UV exposure)

BX548175 strain adapted to living at a depth of 135 meters (low levels of UV exposure)


References


citation("seqinr")

See Also

zscore

Examples

data(prochlo)

query To get a list of sequence names from an ACNUC data base located on the web

Description

This is a major command of the package. It executes all sequence retrievals using any selection criteria the data base allows. The sequences come from an ACNUC data base located on the web and are transferred by socket. The command produces the list of all sequence names that fit the required criteria. The sequence names belong to the class of sequence SeqAcnucWeb.

Usage

query(listname, query, socket = "auto", invisible = TRUE, verbose = FALSE, virtual = FALSE)

Arguments

listname The name of the list as a quoted string of chars

query A quoted string of chars containing the request with the syntax given in the details section

socket a socket of class connection and sockconn returned by choosebank. The default value (auto) means that the socket will be set to the socket component of the banknameSocket variable.

invisible if FALSE, the result is returned visibly.


virtual if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA.



Details

The query language defines several selection criteria and operations between lists of elements matching criteria. It creates mainly lists of sequences, but also lists of species (or, more generally, taxa) and of keywords. See http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html#QUERYLANGUAGE for the last update of the description of the query language.

Selection criteria (no space before the = sign) are:

SP=taxon seqs attached to taxon or any other below in tree; @ wildcard possible

TID=id seqs attached to given numerical NCBI’s taxon id

K=keyword seqs attached to keyword or any other below in tree; @ wildcard possible

T=type seqs of specified type

J=journalname seqs published in journal specified using defined journal code

R=refcode seqs from reference specified such as in jcode/volume/page (e.g., JMB/13/5432)

AU=name seqs from references having specified author (only last name, no initial)

AC=accessionno seqs attached to specified accession number

N=seqname seqs of given name (ID or LOCUS); @ wildcard possible

Y=year seqs published in specified year; > and < can be used instead of =

O=organelle seqs from specified organelle named following defined code (e.g., chloroplast)

M=molecule seqs from specified molecule as named in ID or LOCUS annotation records

ST=status seqs from specified data class (EMBL) or review level (UniProt)

F=filename seqs whose names are in given file, one name per line (unimplemented, use clfcd instead)

FA=filename seqs attached to accession numbers in given file, one number per line (unimplemented, use clfcd instead)

FK=filename produces the list of keywords named in given file, one keyword per line (unimplemented, use clfcd instead)

FS=filename produces the list of species named in given file, one species per line (unimplemented, use clfcd instead)

listname the named list that must have been previously constructed

Operators (always followed and preceded by blanks or parentheses) are:

AND intersection of the 2 list operands

OR union of the 2 list operands

NOT complementation of the single list operand

PAR compute the list of parent seqs of members of the single list operand

SUB add subsequences of members of the single list operand

PS project to species: list of species attached to member sequences of the operand list

PK project to keywords: list of keywords attached to member sequences of the operand list


UN unproject: list of seqs attached to members of the species or keywords list operand

SD compute the list of species placed in the tree below the members of the species list operand

KD compute the list of keywords placed in the tree below the members of the keywords list operand

The query language is case insensitive. Three operators (AND, OR, NOT) can be ambiguous because they can also occur within valid criterion values. Such ambiguities can be resolved by enclosing elementary selection criteria between escaped double quotes.
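As an illustration of the quoting rule, here is a minimal base R sketch that builds a request string in which each elementary criterion is enclosed in escaped double quotes. The helpers quote_criterion and combine are hypothetical (they are not part of the package), and the species name and type are only example values; the resulting string would be passed as the query argument.

```r
# Hypothetical helper: enclose one elementary criterion in double quotes,
# so that a word such as "and" inside the value is not read as an operator.
quote_criterion <- function(key, value) {
  paste0("\"", key, "=", value, "\"")
}

# Hypothetical helper: join criteria with an operator surrounded by blanks.
combine <- function(..., op = "AND") {
  paste(..., sep = paste0(" ", op, " "))
}

request <- combine(quote_criterion("sp", "Beak and feather disease virus"),
                   quote_criterion("t", "cds"))
cat(request, "\n")

## Not run:
# Need internet connection
# choosebank("genbank")
# query("virus", request)
## End(Not run)
```

Without the quotes, the word "and" inside the species name could be taken for the AND operator.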

Value

A list with the following components:

bank the name of the bank that has been chosen by choosebank.socket

call original call

name list name

nelem number of elements in the list on the server

typelist the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names.

req a list of sequence names that fit the required criteria, or NA when called with virtual = TRUE

Note

Most of the documentation was imported from ACNUC help files written by Manolo Gouy

Author(s)

J.R. Lobry & D. Charif

References

To get the release date and content of all the databases located at the pbil, please look at the following url: http://pbil.univ-lyon1.fr/search/releases.php

Gouy, M., Milleret, F., Mugnier, C., Jacobzone, M., Gautier, C. (1984) ACNUC: a nucleic acid sequence data base and analysis system. Nucl. Acids Res., 12:121-127.

Gouy, M., Gautier, C., Attimonelli, M., Lanave, C., Di Paola, G. (1985) ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Appl. Biosci., 3:167-172.

Gouy, M., Gautier, C., Milleret, F. (1985) System analysis and nucleic acid sequence banks. Biochimie, 67:433-436.

citation("seqinr")

See Also

choosebank, getSequence, getName, crelistfromclientdata



Examples

## Not run:
# Need internet connection
choosebank("genbank")
query("bb", "sp=Borrelia burgdorferi")
# To get the names of the 4 first sequences:
sapply(bb$req[1:4], getName)
# To get the 4 first sequences:
sapply(bb$req[1:4], getSequence, as.string = TRUE)

## End(Not run)

read.alignment Read aligned sequence files in mase, clustal, phylip, fasta or msf format

Description

Read a file in mase, clustal, phylip, fasta or msf format. These formats are used to store nucleotide or protein multiple alignments.

Usage

read.alignment(file, format, File = NULL)

Arguments

file the name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, getwd.

format a character string specifying the format of the file : mase, clustal, phylip,fasta or msf

File synonymous of file maintained for backward compatibility.

Details

"mase" The mase format is used to store nucleotide or protein multiple alignments. The beginning of the file must contain a header of at least one line (but the content of this header may be empty). The header lines must begin with ;;. The body of the file has the following structure: first, each entry must begin with one (or more) commentary lines. Commentary lines begin with the character ;. Again, this commentary line may be empty. After the commentaries, the name of the sequence is written on a separate line. Finally, the sequence itself is written on the following lines.

"clustal" The CLUSTAL format (*.aln) is the format of the ClustalW multialignment tool output. It can be described as follows. The word CLUSTAL is on the first line of the file. The alignment is displayed in blocks of a fixed length, each line in the block corresponding to one sequence. Each line of each block starts with the sequence name (maximum of 10 characters), followed by at least one space character. The sequence is then displayed in upper or lower case; '-' denotes gaps. The residue number may be displayed at the end of the first line of each block.

"msf" MSF is the multiple sequence alignment format of the GCG sequence analysis package. It begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences; this file type line is optional, but do not edit or delete it if it is present. Next comes an optional description line containing informative text describing what is in the file; you can add this information to the top of the MSF file using a text editor. A required dividing line follows, which contains the number of bases or residues in the sequence, when the file was created, and, importantly, two dots (..) which act as a divider between the descriptive information and the following sequence information. MSF files then contain the Name/Weight information, a required separating line which must include two slashes (//) to divide the name/weight information from the sequence alignment, and the multiple sequence alignment itself.

"phylip" PHYLIP is a tree construction program. The format is as follows: the number of sequences and their length (in characters) are on the first line of the file. The alignment is displayed in an interleaved or sequential format. The sequence names are limited to 10 characters and may contain blanks.

"fasta" A sequence in fasta format begins with a single-line description (distinguished by a greater-than (>) symbol), followed by sequence data on the next line.

Value

It returns an object of class alignment which is a list with the following components:

nb the number of aligned sequences

nam a vector of strings containing the names of the aligned sequences

seq a vector of strings containing the aligned sequences

com a vector of strings containing the commentaries for each sequence or NA if thereis no comments
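To make the shape of the returned list concrete, here is a minimal base R sketch that parses FASTA-formatted lines into a list with the nb, nam, seq and com components described above. The function parse_fasta_alignment is hypothetical and is not how read.alignment is implemented; it assumes that every header line is followed by at least one sequence line and it does not set the class attribute.

```r
# Hypothetical helper: turn FASTA lines into an alignment-like list.
parse_fasta_alignment <- function(lines) {
  lines <- lines[nzchar(lines)]              # drop empty lines
  is_header <- startsWith(lines, ">")        # description lines
  group <- cumsum(is_header)                 # entry index of each line
  nam <- sub("^>", "", lines[is_header])
  # Concatenate the sequence lines belonging to each entry.
  seq_strings <- vapply(split(lines[!is_header], group[!is_header]),
                        paste, character(1), collapse = "")
  list(nb = length(nam), nam = nam, seq = unname(seq_strings), com = NA)
}

toy <- c(">seq1", "ac-g", "tt", ">seq2", "acgg", "tt")
aln <- parse_fasta_alignment(toy)
str(aln)
```

Since fasta files carry no commentary lines, com is set to NA in this sketch, as stated for the case where there are no comments.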

Author(s)


References

citation("seqinr")

Examples

mase <- read.alignment(file = system.file("sequences/test.mase", package = "seqinr"), format = "mase")
clustal <- read.alignment(file = system.file("sequences/test.aln", package = "seqinr"), format="clustal")
phylip <- read.alignment(file = system.file("sequences/test.phylip", package = "seqinr"), format = "phylip")
msf <- read.alignment(file = system.file("sequences/test.msf", package = "seqinr"), format = "msf")
fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format = "fasta")


read.fasta read FASTA formatted files

Description

Read sequences from a file in FASTA format.

Usage

read.fasta(file = system.file("sequences/ct.fasta", package = "seqinr"),
    seqtype = "DNA", File = NULL, as.string = FALSE, forceDNAtolower = TRUE,
    set.attributes = TRUE)

Arguments

file The name of the file which the sequences in fasta format are to be read from

seqtype the nature of the sequence: DNA or AA

File Synonymous of file. Maintained for backward compatibility with code based on seqinR <= 1.0-4

as.string if TRUE sequences are returned as a string instead of a vector of single characters

forceDNAtolower whether sequences with seqtype == "DNA" should be returned as lowercase letters

set.attributes whether sequence attributes should be set

Details

FASTA is a widely used format in biology. Some FASTA files are distributed with the seqinr package, see the examples section below.

Value

By default read.fasta returns a list of vectors of chars. Each element is a sequence object of the class SeqFastadna or SeqFastaAA.
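The effect of the as.string and forceDNAtolower arguments on the shape of each returned element can be mimicked in base R. The function shape_sequence below is a hypothetical stand-in written for this illustration; it does not read a file and does not set any class or attribute.

```r
# Hypothetical helper: shape one raw sequence string the way the two
# arguments are documented to behave.
shape_sequence <- function(raw, as.string = FALSE, forceDNAtolower = TRUE) {
  if (forceDNAtolower) raw <- tolower(raw)
  if (as.string) raw else strsplit(raw, split = "")[[1]]
}

raw <- "ATGAAAATGAAT"
shape_sequence(raw)                      # vector of single lowercase chars
shape_sequence(raw, as.string = TRUE)    # one lowercase string
shape_sequence(raw, as.string = TRUE, forceDNAtolower = FALSE)
```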

Author(s)


References

citation("seqinr")

See Also

write.fasta, read.alignment


Examples

#
# Example of a DNA file in FASTA format:
#
dnafile <- system.file("sequences/malM.fasta", package = "seqinr")
#
# Read with defaults arguments, looks like:
#
# $XYLEECOM.MALM
# [1] "a" "t" "g" "a" "a" "a" "a" "t" "g" "a" "a" "t" "a" "a" "a" "a" "g" "t"
# ...
read.fasta(file = dnafile)
#
# The same but do not turn the sequence into a vector of single characters, looks like:
#
# $XYLEECOM.MALM
# [1] "atgaaaatgaataaaagtctcatcgtcctctgtttatcagcagggttactggcaagcgc
# ...
read.fasta(file = dnafile, as.string = TRUE)
#
# The same but do not force lower case letters, looks like:
#
# $XYLEECOM.MALM
# [1] "ATGAAAATGAATAAAAGTCTCATCGTCCTCTGTTTATCAGCAGGGTTACTGGCAAGC
# ...
read.fasta(file = dnafile, as.string = TRUE, forceDNAtolower = FALSE)
#
# Example of a protein file in FASTA format:
#
aafile <- system.file("sequences/seqAA.fasta", package = "seqinr")
#
# Read the protein sequence file, looks like:
#
# $A06852
# [1] "M" "P" "R" "L" "F" "S" "Y" "L" "L" "G" "V" "W" "L" "L" "L" "S" "Q" "L"
# ...
read.fasta(aafile, seqtype = "AA")
#
# The same, but as string and without attributes, looks like:
#
# $A06852
# [1] "MPRLFSYLLGVWLLLSQLPREIPGQSTNDFIKACGRELVRLWVEICGSVSWGRTALSLEEP
# QLETGPPAETMPSSITKDAEILKMMLEFVPNLPQELKATLSERQPSLRELQQSASKDSNLNFEEFK
# KIILNRQNEAEDKSLLELKNLGLDKHSRKKRLFRMTLSEKCCQVGCIRKDIARLC*"
read.fasta(aafile, seqtype = "AA", as.string = TRUE, set.attributes = FALSE)

readfirstrec Low level function to get the record count of the specified ACNUC index file

Description

Called without arguments, the list of available values for argument type is returned.

Usage

readfirstrec(socket = "auto", type)

Arguments


type the ACNUC index file

Details

Available index files are:

AUT AUTHOR one record for each author name (last name only, no initials)

BIB BIBLIO one record for each reference

ACC ACCESS one record for each accession number

SMJ SMJYT one record for each status, molecule, journal, year, type, organelle, division, and db structure information

SUB SUBSEQ one record for each parent or sub-sequence

LOC LOCUS one record for each parent sequence

KEY KEYWORDS one record for each keyword

SPEC SPECIES one record for each taxon

SHRT SHORTL mostly, one record for each element of a short list

LNG LONGL one record for each group of SUBINLNG elements of a long list

EXT EXTRACT (for nucleotide databases only) one record for each exon of each subsequence

TXT TEXT one lrtxt-character record for each label of a species, keyword, or SMJYT

Value

The record count of ACNUC index file, or NA if missing (typically when asking for type = EXT on a protein database).

Author(s)

J.R. Lobry

References

See ACNUC physical structure at http://pbil.univ-lyon1.fr/databases/acnuc/structure.html.

citation("seqinr")


See Also

choosebank

Examples

## Not run:
# Need internet connection
choosebank("genbank")
allowedtype <- readfirstrec()
sapply(allowedtype, function(x) readfirstrec(type = x))

## End(Not run)

readsmj Low level function to read ACNUC SMJYT index files

Description

Extract information from the SMJYT index file for status, molecule, journal, year, type, organelle, division, and db structure information.

Usage

readsmj(socket = "auto", num = 2, nl = 10, recnum.add = FALSE, nature.add = TRUE,
    plong.add = FALSE, libel.add = FALSE, sname.add = FALSE, all.add = FALSE)

Arguments


num rank number of first record.

nl number of records to read.

recnum.add to extract record numbers.

nature.add to extract as a factor with human understandable levels the nature of the name. Unordered levels are: status, molecule, journal, year, type, organelle, division and dbstrucinfo.

plong.add to extract the plong.

libel.add to extract the label of the name.

sname.add to extract the short version of the name, that is without the first two characters.

all.add to extract all (all flags set to TRUE).


Value

A data.frame with requested columns.


Author(s)

J.R. Lobry

References

See ACNUC physical structure at: http://pbil.univ-lyon1.fr/databases/acnuc/structure.html.

citation("seqinr")

See Also

choosebank to start a session and readfirstrec to get the total number of records.

rearranged.oriloc Detection of replication-associated effects on base composition asymmetry in prokaryotic chromosomes.

Description

Detection of replication-associated effects on base composition asymmetry in prokaryotic chromosomes.

Usage

rearranged.oriloc(seq.fasta = system.file("sequences/ct.fasta", package = "seqinr"),
    g2.coord = system.file("sequences/ct.coord", package = "seqinr"))

Arguments

seq.fasta The path of the file containing a FASTA-format sequence. Default value: system.file("sequences/ct.fasta", package = "seqinr") - the FASTA sequence of the Chlamydia trachomatis chromosome.

g2.coord The path of the file containing the coordinates of the protein coding genes found on this chromosome. This file can be obtained using the function gbk2g2. The format of the file is similar to the output of the Glimmer2 program. The first column contains the index or the name of the gene, the second one contains the start position and the third column contains the end position. For reverse transcribed genes, the start position is greater than the end position.




Details

The purpose of this method is to decouple replication-related and coding sequence-related effects on base composition asymmetry. In order to do so, the analyzed chromosome is artificially rearranged to obtain a perfect gene orientation bias - all forward transcribed genes on the first half of the chromosome, and all reverse transcribed genes on the other half. This rearrangement conserves the relative order of genes within each of the two groups - both forward-encoded and reverse-encoded genes are placed on the rearranged chromosome in increasing order of their coordinates on the real chromosome. If the replication mechanism has a significant effect on base composition asymmetry, this should be seen as a change of slope in the nucleotide skews computed on the rearranged chromosome; the change of slope should take place at the origin or the terminus of replication. Use extract.breakpoints to detect the position of the changes in slope on the rearranged nucleotide skews.
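The rearrangement step can be sketched in base R on a toy coordinate table. This is an illustration under stated assumptions rather than the function's own code: the gene names and coordinates are invented, and the column names of the toy table are chosen for the example. It relies only on the convention stated for g2.coord, namely that reverse transcribed genes have a start position greater than their end position.

```r
# Invented toy coordinates in the spirit of a g2.coord file.
genes <- data.frame(
  name  = c("g1", "g2", "g3", "g4", "g5"),
  start = c(100, 900, 1200, 2500, 3000),
  end   = c(400, 600, 1800, 2100, 3600),
  stringsAsFactors = FALSE
)

# Reverse transcribed genes have start > end.
genes$strand <- ifelse(genes$start > genes$end, "reverse", "forward")
genes$meancoord <- (genes$start + genes$end) / 2

# Forward genes first, then reverse genes; inside each group the genes
# keep the increasing order of their coordinates on the real chromosome.
ord <- order(genes$strand != "forward", genes$meancoord)
rearranged <- genes[ord, ]
rearranged$name
```

The vector ord plays the role of the permutation that yields a perfect gene orientation bias.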

Value

A data.frame with six columns: meancoord.rearr contains the gene index on the rearranged chromosome; gcskew.rearr contains the normalized GC-skew ((G-C)/(G+C)) computed on the third codon positions of protein coding genes, still on the rearranged chromosome; atskew.rearr contains the normalized AT-skew ((A-T)/(A+T)) computed on the third codon positions of protein coding genes; strand.rearr contains the transcription strand of the gene (either "forward" or "reverse"); order contains the permutation that was used to obtain a perfect gene orientation bias; meancoord.real contains the mid-coordinate of the genes on the real chromosome (before the rearrangement).
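The two skew formulas can be checked by hand on a short toy coding sequence. The sketch below is base R only; the helper names third_positions and skew are hypothetical and the sequence is invented for the example.

```r
# Hypothetical helper: keep the third position of each codon.
third_positions <- function(cds) cds[seq(3, length(cds), by = 3)]

# Hypothetical helper: normalized skew (x - y) / (x + y) between two bases.
skew <- function(bases, x, y) {
  nx <- sum(bases == x)
  ny <- sum(bases == y)
  (nx - ny) / (nx + ny)
}

cds <- strsplit("atgaagctatcc", split = "")[[1]]  # atg aag cta tcc
p3 <- third_positions(cds)                         # "g" "g" "a" "c"
gcskew <- skew(p3, "g", "c")                       # (2 - 1) / (2 + 1)
atskew <- skew(p3, "a", "t")                       # (1 - 0) / (1 + 0)
c(gcskew = gcskew, atskew = atskew)
```

Note that the skew is undefined (NaN) when neither base occurs at third codon positions, which can happen for very short genes.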

Author(s)

A. Necsulea

References


See Also

oriloc, draw.rearranged.oriloc, extract.breakpoints

Examples







## Not run:
breaks <- extract.breakpoints(r.ori, type = c("gcfw", "gcrev"), nbreaks = c(2, 2), gridsize = 50, it.max = 100)
## End(Not run)



reverse.align Reverse alignment - from protein sequence alignment to nucleic sequence alignment

Description

This function produces an alignment of nucleic protein-coding sequences, using as a guide the alignment of the corresponding protein sequences.

Usage

reverse.align(nucl.file, protaln.file, input.format = 'fasta', out.file,
    output.format = 'fasta', align.prot = FALSE, numcode = 1,
    clustal.path = NULL)

Arguments

nucl.file A character string specifying the name of the FASTA format file containing the nucleotide sequences.

protaln.file A character string specifying the name of the file containing the aligned protein sequences. This argument must be provided if align.prot is set to FALSE.

input.format A character string specifying the format of the protein alignment file: 'mase', 'clustal', 'phylip', 'fasta' or 'msf'.

out.file A character string specifying the name of the output file.

output.format A character string specifying the format of the output file. Currently the only implemented format is 'fasta'.

align.prot Boolean. If TRUE, the nucleic sequences are translated and then the protein sequences are aligned with the ClustalW program. The path of the ClustalW binary must also be given (clustal.path).

numcode The NCBI genetic code number for the translation of the nucleic sequences. By default the standard genetic code is used.

clustal.path The path of the ClustalW binary. This argument only needs to be set if align.prot is TRUE.


Details

This function produces an alignment of nucleic protein-coding sequences using as a guide the alignment of the corresponding protein sequences. The file containing the nucleic sequences is given in the compulsory argument 'nucl.file'; this file must be written in the FASTA format.

The alignment of the protein sequences can either be provided directly, through the 'protaln.file' parameter, or reconstructed with ClustalW, if the parameter 'align.prot' is set to TRUE. In the latter case, the path of the ClustalW binary must be given in the 'clustal.path' argument.

The protein and nucleic sequences must have the same name in the files nucl.file and protaln.file.

The reverse-aligned nucleotide sequences are written to the file specified in the compulsory 'out.file' argument. For now, the only output format implemented is FASTA.

Warning: the 'align.prot=TRUE' option has only been tested on LINUX operating systems. ClustalW must be installed on your system in order for this to work.
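The principle of reverse alignment, for a single sequence, can be sketched in base R: each residue of the aligned protein is replaced by the corresponding codon and each gap by three gap characters. The function reverse_align_one is a hypothetical illustration and not the package's implementation; it assumes that '-' is the gap character, that the nucleotide sequence is in frame, and that it is at least one codon long.

```r
# Hypothetical helper: thread the codons of one coding sequence onto the
# aligned version of its protein.
reverse_align_one <- function(nucl, prot_aln) {
  n <- nchar(nucl)
  codons <- substring(nucl, seq(1, n - 2, by = 3), seq(3, n, by = 3))
  aa <- strsplit(prot_aln, split = "")[[1]]
  is_gap <- aa == "-"
  stopifnot(sum(!is_gap) <= length(codons))   # enough codons for the residues
  out <- character(length(aa))
  out[is_gap] <- "---"                        # one gap = three gap characters
  out[!is_gap] <- codons[seq_len(sum(!is_gap))]
  paste(out, collapse = "")
}

aligned <- reverse_align_one("atgaaattt", "M-KF")
aligned
```

Because gaps are always inserted in multiples of three, the reading frame of the nucleotide alignment is preserved.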

Value

NULL

Author(s)

A. Necsulea

References

citation("seqinr")

See Also

read.alignment, read.fasta, write.fasta

Examples

#
# Read example 'bordetella.fasta': a triplet of orthologous genes from
# three bacterial species (Bordetella pertussis, B. parapertussis and
# B. bronchiseptica):
#

triplet <- read.fasta(system.file('sequences/bordetella.fasta',package='seqinr'))

#
# For this example, 'bordetella.pep.aln' contains the aligned protein
# sequences, in the Clustal format:
#

triplet.pep<- read.alignment(system.file('sequences/bordetella.pep.aln',package='seqinr'),format='clustal')

reverse.align(nucl.file=system.file('sequences/bordetella.fasta',package='seqinr'),
    protaln.file=system.file('sequences/bordetella.pep.aln',package='seqinr'),
    input.format='clustal', out.file='test.revalign')

#
# Alternatively, we can use ClustalW to align the translated nucleic
# sequences. Here the ClustalW program is accessible simply by the
# 'clustalw' name.
#

## Not run:
reverse.align(nucl.file=system.file('sequences/bordetella.fasta',package='seqinr'),
    out.file='test.revalign.clustal', align.prot=TRUE, clustal.path='clustalw')
## End(Not run)

rot13 Ergheaf gur EBG-13 pvcurevat bs n fgevat

Description

rot13 applied to the above title returns the string "Returns the ROT-13 ciphering of a string".
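For readers without the package at hand, the same ciphering can be reproduced with the base R function chartr listed in See Also. The function rot13_sketch below is a hypothetical stand-in written for this illustration; whether rot13 itself is implemented this way is an assumption.

```r
# Hypothetical stand-in: shift each ASCII letter by 13 places and leave
# every other character untouched.
rot13_sketch <- function(string) {
  chartr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
         "NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm",
         string)
}

title <- "Ergheaf gur EBG-13 pvcurevat bs n fgevat"
rot13_sketch(title)
```

Since 13 is half of 26, applying the function twice gives back the original string.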

Usage

rot13(string)

Arguments

string a string of characters.

Value

a string of characters.

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

chartr


Examples

##
## Simple ciphering of a string:
##
message <- "Hello, world!"
rot13(message) # "Uryyb, jbeyq!"
##
## Routine sanity check:
##
stopifnot(identical(rot13(rot13(message)), message))

s2c conversion of a string into a vector of chars

Description

This is a simple utility function to convert a string such as "BigBang" into a vector of chars such as "B" "i" "g" "B" "a" "n" "g"

Usage

s2c(string)

Arguments

string a string of chars

Value

a vector of chars

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

c2s

Examples

s2c( "BigBang" )


s2n simple numerical encoding of a DNA sequence.

Description

By default, if no levels argument is provided, this function will just code your DNA sequence in integer values following the lexical order (a > c > g > t), that is 0 for "a", 1 for "c", 2 for "g", 3 for "t" and NA for ambiguous bases.

Usage

s2n(seq, levels = s2c("acgt"), base4 = TRUE, forceToLower = TRUE)

Arguments

seq the sequence as a vector of single chars

levels allowed char values, by default a, c, g and t

base4 if TRUE the numerical encoding will start at 0, if FALSE at 1

forceToLower if TRUE the sequence is forced to lower case characters

Value

a vector of integers

Note

The idea of starting numbering at 0 by default is that it enforces a kind of isomorphism between the paste operator on DNA chars and the + operator on integer coding for DNA chars. This way, you can work either in the char set or in the integer set, depending on what is more convenient for your purpose, and then switch from one set to the other as you like.
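One way to read this note is that, with a 0-based coding, a word of DNA chars can be treated as a number written in base 4. The sketch below uses only base R; s2n_sketch is a hypothetical stand-in for the default behaviour described above, not the package's code, and the codon index is an illustration added here.

```r
# Hypothetical stand-in for the default encoding: position in levels,
# shifted to start at 0 when base4 is TRUE, NA for anything else.
s2n_sketch <- function(seq, levels = c("a", "c", "g", "t"), base4 = TRUE) {
  codes <- match(tolower(seq), levels)
  if (base4) codes - 1L else codes
}

s2n_sketch(c("a", "c", "g", "t", "n"))   # the ambiguous base gives NA

# Pasting three chars into a codon corresponds to adding the digits
# weighted by powers of 4, which gives an index between 0 and 63.
codon <- c("g", "a", "t")
digits <- s2n_sketch(codon)              # 2 0 3
index <- sum(digits * 4^(2:0))           # 2 * 16 + 0 * 4 + 3
index
```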

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

n2s, factor, unclass


Examples

##
## Example of default behaviour:
##
urndna <- s2c("acgt")
seq <- sample( urndna, 100, replace = TRUE ) ; seq
s2n(seq)
##
## How to deal with RNA:
##
urnrna <- s2c("acgu")
seq <- sample( urnrna, 100, replace = TRUE ) ; seq
s2n(seq, levels = urnrna)
##
## what happens with unknown characters:
##
urnmess <- c(urndna,"n")
seq <- sample( urnmess, 100, replace = TRUE ) ; seq
s2n(seq)
##
## How to change the encoding for unknown characters:
##
tmp <- s2n(seq) ; tmp[is.na(tmp)] <- -1; tmp
##
## Simple sanity check:
##
stopifnot(all(s2n(s2c("acgt")) == 0:3))

seqinr-package Biological Sequences Retrieval and Analysis

Description

Exploratory data analysis and data visualization for biological sequence (DNA and protein) data. Also includes utilities for sequence data management under the ACNUC system.

Author(s)

Delphine Charif and Jean Lobry and Leonor Palmeira

Maintainer: Simon Penel <[email protected]>

References

citation("seqinr")


splitseq split a sequence into sub-sequences

Description

Split a sequence into sub-sequences of 3 (the default size) with no overlap between the sub-sequences.

Usage

splitseq(seq, frame = 0, word = 3)

Arguments

seq a vector of chars

frame an integer (0, 1, 2) giving the starting position to split the sequence

word an integer giving the size of the sub-sequences

Value

This function returns a vector which contains the sub-sequences.
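The behaviour described above can be reproduced in a few lines of base R. The function splitseq_sketch is a hypothetical stand-in written for this illustration (the package's own implementation may differ); incomplete words at the end of the sequence are dropped.

```r
# Hypothetical stand-in: cut a vector of chars into non-overlapping
# words of size `word`, starting after `frame` skipped chars.
splitseq_sketch <- function(x, frame = 0, word = 3) {
  last_start <- length(x) - word + 1
  if (last_start < frame + 1) return(character(0))
  starts <- seq(from = frame + 1, to = last_start, by = word)
  vapply(starts,
         function(i) paste(x[i:(i + word - 1)], collapse = ""),
         character(1))
}

tenchar <- strsplit("1234567890", split = "")[[1]]
splitseq_sketch(tenchar, frame = 1)
```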

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

split

Examples

cds <- s2c("aacgttgcaggtcgctcgctacgtagctactgttt")
#
# To obtain the codon sequence in frame 0:
#
stopifnot(identical(splitseq(cds),
    c("aac", "gtt", "gca", "ggt", "cgc", "tcg", "cta", "cgt", "agc", "tac", "tgt")))

#
# Show the effect of frame and word with a ten char sequence:
#
(tenchar <- s2c("1234567890"))
splitseq(tenchar, frame = 0)
splitseq(tenchar, frame = 1)
splitseq(tenchar, frame = 2)
splitseq(tenchar, frame = 0, word = 2)
splitseq(tenchar, frame = 0, word = 1)

stresc Utility function to escape LaTeX special characters present in a string

Description

This function returns a vector of strings in which LaTeX special characters are escaped. This is useful in conjunction with xtable.

Usage

stresc(strings)

Arguments

strings A vector of strings to deal with.


Value

Returns a vector of strings with escaped characters within each string.

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

s2c

Examples

stresc("MISC_RNA")
stresc(c("BB_0001","BB_0002"))


syncodons Synonymous codons

Description

Returns all synonymous codons for each codon given

Usage

syncodons(codons, numcode = 1)

Arguments

codons A sequence of codons as generated by splitseq

numcode The genetic code number as in translate

Value

a list containing, for each codon given (list tags), all synonymous codons (including the original one)
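The structure of the returned list can be illustrated in base R with a small fragment of the standard genetic code. The function syncodons_sketch and the six-codon table are hypothetical and only meant to show the shape of the result; the real function works with the complete code selected by numcode.

```r
# Illustrative fragment of the standard genetic code (6 of 64 codons).
code <- c(aaa = "K", aag = "K", gat = "D", gac = "D", atg = "M", tgg = "W")

# Hypothetical stand-in: for each codon, all codons of the table that
# translate to the same amino acid, the original one included.
syncodons_sketch <- function(codons, code) {
  out <- lapply(codons, function(cod) sort(names(code)[code == code[[cod]]]))
  names(out) <- codons
  out
}

result <- syncodons_sketch(c("aaa", "atg"), code)
result
```

The list tags are the codons given as input, and a codon without synonyms, such as atg here, is returned alone.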

Author(s)

Leonor Palmeira, J.R. Lobry

References

citation("seqinr")

See Also

synsequence

Examples

## The four synonymous codons for Alanine in the standard genetic code:#syncodons("ggg")## With a sequence:#toycds <- s2c("tctgagcaaataaatcgg")syncodons(splitseq(toycds))## Sanity check with the standard genetic code:#stdgencode <- structure(list(ttt = c("ttc", "ttt"),

102 syncodons

ttc = c("ttc", "ttt"),tta = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),ttg = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),tct = c("agc", "agt", "tca", "tcc", "tcg", "tct"),tcc = c("agc", "agt", "tca", "tcc", "tcg", "tct"),tca = c("agc", "agt", "tca", "tcc", "tcg", "tct"),tcg = c("agc", "agt", "tca", "tcc", "tcg", "tct"),tat = c("tac", "tat"),tac = c("tac", "tat"),taa = c("taa", "tag", "tga"),tag = c("taa", "tag", "tga"),tgt = c("tgc", "tgt"),tgc = c("tgc", "tgt"),tga = c("taa", "tag", "tga"),tgg = "tgg",ctt = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),ctc = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),cta = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),ctg = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),cct = c("cca", "ccc", "ccg", "cct"),ccc = c("cca", "ccc", "ccg", "cct"),cca = c("cca", "ccc", "ccg", "cct"),ccg = c("cca", "ccc", "ccg", "cct"),cat = c("cac", "cat"),cac = c("cac", "cat"),caa = c("caa", "cag"),cag = c("caa", "cag"),cgt = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),cgc = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),cga = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),cgg = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),att = c("ata", "atc", "att"),atc = c("ata", "atc", "att"),ata = c("ata", "atc", "att"),atg = "atg",act = c("aca", "acc", "acg", "act"),acc = c("aca", "acc", "acg", "act"),aca = c("aca", "acc", "acg", "act"),acg = c("aca", "acc", "acg", "act"),aat = c("aac", "aat"),aac = c("aac", "aat"),aaa = c("aaa", "aag"),aag = c("aaa", "aag"),agt = c("agc", "agt", "tca", "tcc", "tcg", "tct"),agc = c("agc", "agt", "tca", "tcc", "tcg", "tct"),aga = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),agg = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),gtt = c("gta", "gtc", "gtg", "gtt"),gtc = c("gta", "gtc", "gtg", "gtt"),gta = c("gta", "gtc", "gtg", "gtt"),gtg = c("gta", "gtc", "gtg", "gtt"),gct = c("gca", "gcc", "gcg", "gct"),gcc = c("gca", "gcc", "gcg", "gct"),


gca = c("gca", "gcc", "gcg", "gct"),
gcg = c("gca", "gcc", "gcg", "gct"),
gat = c("gac", "gat"),
gac = c("gac", "gat"),
gaa = c("gaa", "gag"),
gag = c("gaa", "gag"),
ggt = c("gga", "ggc", "ggg", "ggt"),
ggc = c("gga", "ggc", "ggg", "ggt"),
gga = c("gga", "ggc", "ggg", "ggt"),
ggg = c("gga", "ggc", "ggg", "ggt")),

.Names = c("ttt", "ttc", "tta", "ttg", "tct", "tcc", "tca", "tcg",
"tat", "tac", "taa", "tag", "tgt", "tgc", "tga", "tgg",
"ctt", "ctc", "cta", "ctg", "cct", "ccc", "cca", "ccg",
"cat", "cac", "caa", "cag", "cgt", "cgc", "cga", "cgg",
"att", "atc", "ata", "atg", "act", "acc", "aca", "acg",
"aat", "aac", "aaa", "aag", "agt", "agc", "aga", "agg",
"gtt", "gtc", "gta", "gtg", "gct", "gcc", "gca", "gcg",
"gat", "gac", "gaa", "gag", "ggt", "ggc", "gga", "ggg"))
## Now the check:
currentresult <- syncodons(words(alphabet = s2c("tcag")))
stopifnot(identical(stdgencode, currentresult))

synsequence Random synonymous coding sequence generation

Description

Generates a random synonymous coding sequence, according to a certain codon usage bias

Usage

synsequence(sequence, numcode = 1, ucoweight = NULL)

Arguments

sequence A nucleic acids sequence

numcode The genetic code number as in translate



ucoweight A list of weights containing the desired codon usage bias as generated by ucoweight

Value

a sequence translating to the same protein sequence as the original one (cf. translate), but containing synonymous codons
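
The idea of drawing synonymous codons according to a set of weights can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the `weights` list, the partial `codon2aa` lookup and the function `synsequence_sketch` are hypothetical names with made-up values, and real weights would come from ucoweight.

```r
# Hypothetical weights: for each amino acid, one weight per synonymous codon.
# (Illustrative values, not output of ucoweight().)
weights <- list(K = c(aaa = 3, aag = 1),
                F = c(ttc = 1, ttt = 1))

# Hypothetical partial codon-to-amino-acid lookup covering only K and F.
codon2aa <- c(aaa = "K", aag = "K", ttc = "F", ttt = "F")

synsequence_sketch <- function(codons, weights, codon2aa) {
  vapply(codons, function(cod) {
    w <- weights[[codon2aa[[cod]]]]
    # Draw one synonymous codon with probability proportional to its weight.
    sample(names(w), size = 1, prob = w)
  }, character(1), USE.NAMES = FALSE)
}

set.seed(1)
original <- c("aaa", "ttt", "aag", "ttc")
resampled <- synsequence_sketch(original, weights, codon2aa)

# The codons may differ, but the encoded peptide is unchanged.
identical(unname(codon2aa[resampled]), unname(codon2aa[original]))
```

Whatever codons are drawn, each one is taken from the synonymous family of the original codon, so the translated peptide stays the same.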

Author(s)

Leonor Palmeira


References

citation("seqinr")

See Also

ucoweight

Examples

data(ec999)
sequence <- ec999[1][[1]]
synsequence(sequence, 1, ucoweight(sequence))

tablecode to plot genetic code as in textbooks

Description

This function plots a genetic code table as in textbooks, that is following the order T > C > A > G so that synonymous codons are almost always in the same boxes.

Usage

tablecode(numcode = 1, urn.rna = s2c("TCAG"), dia = FALSE, latexfile = NULL,
    label = latexfile, size = "normalsize", caption = NULL,
    preaa = rep("", 64), postaa = rep("", 64),
    precodon = preaa, postcodon = postaa)

Arguments

numcode The genetic code number as in translate


urn.rna The letters to display codons, use s2c("UCAG") if you want the code in terms of RNA sequence

latexfile The name of a LaTeX file if you want to redirect the output

label The label for the LaTeX table

size The LaTeX size of characters for the LaTeX table

preaa A string to insert before the amino-acid in the LaTeX table

postaa A string to insert after the amino-acid in the LaTeX table

precodon A string to insert before the codon in the LaTeX table

postcodon A string to insert after the codon in the LaTeX table

caption The caption of the LaTeX table

dia to produce a yellow/blue plot for slides


Details

The codon order for preaa, postaa, precodon, and postcodon should be the same as in paste(paste(rep(s2c("tcag"), each = 16), s2c("tcag"), sep = ""), rep(s2c("tcag"), each = 4), sep = "")
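
To see which order this expression produces, it can be evaluated in base R. The sketch below replaces s2c("tcag") with an equivalent strsplit() call so that it runs without the package; the variable names `bases` and `codon_order` are only illustrative.

```r
# Base-R equivalent of s2c("tcag"): split a string into single characters.
bases <- strsplit("tcag", "")[[1]]

# Same expression as in the Details paragraph, with s2c("tcag") replaced by `bases`.
codon_order <- paste(paste(rep(bases, each = 16), bases, sep = ""),
                     rep(bases, each = 4), sep = "")

length(codon_order)    # 64
head(codon_order, 8)   # "ttt" "tct" "tat" "tgt" "ttc" "tcc" "tac" "tgc"
```

The first base varies slowest, the third base next, and the second base fastest, which corresponds to reading a textbook table row by row.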

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

translate, syncodons

Examples

## Show me the standard genetic code:

tablecode()

toyaa A toy example of amino-acid counts in three proteins

Description

This is a toy data set to illustrate the importance of metric choice.

Usage

data(toyaa)

Format

A data frame with 3 observations on the following 3 variables:

Ala Alanine counts

Val Valine counts

Cys Cysteine counts

Source

This toy example was inspired by Gautier, C: Analyses statistiques et évolution des séquences d’acides nucléiques. PhD thesis (1987), Université Claude Bernard - Lyon I.


References

citation("seqinr")

Examples

data(toyaa)

toycodon A toy example of codon counts in three coding sequences

Description

This is a toy data set to illustrate synonymous and non-synonymous codon usage analyses.

Usage

data(toycodon)

Format

A data frame with 3 observations (coding sequences) for 10 codons.

Source

Created for release 1.0-4 of seqinr’s vignette.

References

citation("seqinr")

Examples

data(toycodon)


translate Translate nucleic acid sequences into proteins

Description

This function translates nucleic acid sequences into the corresponding peptide sequence. It can translate in any of the three forward or three reverse sense frames. In the case of reverse sense, the reverse-complement of the sequence is taken. It can translate using the standard (universal) genetic code and also with non-standard codes. Ambiguous bases can also be handled.

Usage

translate(seq, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)

Arguments

seq the sequence to translate as a vector of single characters in lower case letters.

frame Frame(s) (0,1,2) to translate. By default the frame 0 is used.

sens Sense to translate: F for forward sense and R for reverse sense.

numcode The NCBI genetic code number for translation. By default the standard genetic code is used.

NAstring How to translate amino-acids when there are ambiguous bases in codons.

ambiguous If TRUE, ambiguous bases are taken into account so that for instance GGN is translated to Gly in the standard genetic code.

Details

The following genetic codes are described here. The number preceding each code corresponds to numcode.

1 standard

2 vertebrate.mitochondrial

3 yeast.mitochondrial

4 protozoan.mitochondrial+mycoplasma

5 invertebrate.mitochondrial

6 ciliate+dasycladaceal

9 echinoderm+flatworm.mitochondrial

10 euplotid

11 bacterial+plantplastid

12 alternativeyeast

13 ascidian.mitochondrial

14 alternativeflatworm.mitochondrial


15 blepharism

16 chlorophycean.mitochondrial

21 trematode.mitochondrial

22 scenedesmus.mitochondrial

23 thraustochytrium.mitochondria

Value

translate returns a vector of single characters containing the peptide sequence in the standard one-letter IUPAC code. Termination (STOP) codons are translated by the character '*'.
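
How the frame and sens arguments interact can be sketched in base R. This is a simplified illustration, not the package's code: `translate_sketch` and `toy_code` are hypothetical names, the codon table covers only the six codons of the toy CDS used in the Examples, and the order of operations (reverse-complement first, then frame offset) is an assumption. Any codon missing from the partial table is returned as NAstring.

```r
# Partial codon table: only the six codons of the toy CDS (standard code).
toy_code <- c(tct = "S", gag = "E", caa = "Q", ata = "I", aat = "N", cgg = "R")

translate_sketch <- function(dna, frame = 0, sens = "F",
                             code = toy_code, NAstring = "X") {
  if (sens == "R") {
    # Reverse-complement: complement every base, then reverse the order.
    dna <- rev(strsplit(chartr("acgt", "tgca", paste(dna, collapse = "")), "")[[1]])
  }
  # Skip `frame` leading bases, then read complete codons only.
  if (frame > 0) dna <- dna[-seq_len(frame)]
  ncodon <- length(dna) %/% 3
  if (ncodon == 0) return(character(0))
  codons <- vapply(seq_len(ncodon),
                   function(i) paste(dna[(3 * i - 2):(3 * i)], collapse = ""),
                   character(1))
  aa <- unname(code[codons])
  aa[is.na(aa)] <- NAstring   # codons missing from the partial table
  aa
}

toycds <- strsplit("tctgagcaaataaatcgg", "")[[1]]
translate_sketch(toycds)   # "S" "E" "Q" "I" "N" "R"
```

Translating the reverse-complement of the toy CDS with sens = "R" gives back the same peptide, which is a quick way to check the reverse-sense logic.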

Author(s)


References

The genetic codes have been taken from the NCBI taxonomy database: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c. Last update October 05, 2000.

The IUPAC one-letter code for amino acids is described at: http://www.chem.qmul.ac.uk/iupac/AminoAcid/

citation("seqinr")

See Also

Use tolower to change upper case letters into lower case letters. For coding sequences obtained from an ACNUC server with query it's better to use the function getTrans so that the relevant genetic code and the relevant frame are automatically used. The genetic codes are given in the object SEQINR.UTIL, a more human readable form is given by the function tablecode. Use aaa to get the three-letter code for amino-acids.

Examples

##
## Toy CDS example invented by Leonor Palmeira:
##
toycds <- s2c("tctgagcaaataaatcgg")
translate(seq = toycds) # should be c("S", "E", "Q", "I", "N", "R")
##
## Toy CDS example with ambiguous bases:
##
toycds2 <- s2c("tcngarcarathaaycgn")
translate(toycds2) # should be c("X", "X", "X", "X", "X", "X")
translate(toycds2, ambiguous = TRUE) # should be c("S", "E", "Q", "I", "N", "R")
translate(toycds2, ambiguous = TRUE, numcode = 2) # should be c("S", "E", "Q", "X", "N", "R")
##
## Real CDS example:
##
realcds <- read.fasta(file = system.file("sequences/malM.fasta", package = "seqinr"))[[1]]
translate(seq = realcds)
# Biologically correct, only one stop codon at the end
translate(seq = realcds, frame = 3, sens = "R", numcode = 6)
# Biologically meaningless, note the in-frame stop codons

## Not run:
## Need internet connection.
## Translation of the following EMBL entry:
##
## FT CDS join(complement(153944..154157),complement(153727..153866),
## FT complement(152185..153037),138523..138735,138795..138955)
## FT /codon_start=1
## FT /db_xref="FLYBASE:FBgn0002781"
## FT /db_xref="GOA:Q86B86"
## FT /db_xref="TrEMBL:Q86B86"
## FT /note="mod(mdg4) gene product from transcript CG32491-RZ;
## FT trans splicing"
## FT /gene="mod(mdg4)"
## FT /product="CG32491-PZ"
## FT /locus_tag="CG32491"
## FT /protein_id="AAO41581.1"
## FT /translation="MADDEQFSLCWNNFNTNLSAGFHESLCRGDLVDVSLAAEGQIVKA
## FT HRLVLSVCSPFFRKMFTQMPSNTHAIVFLNNVSHSALKDLIQFMYCGEVNVKQDALPAF
## FT ISTAESLQIKGLTDNDPAPQPPQESSPPPAAPHVQQQQIPAQRVQRQQPRASARYKIET
## FT VDDGLGDEKQSTTQIVIQTTAAPQATIVQQQQPQQAAQQIQSQQLQTGTTTTATLVSTN
## FT KRSAQRSSLTPASSSAGVKRSKTSTSANVMDPLDSTTETGATTTAQLVPQQITVQTSVV
## FT SAAEAKLHQQSPQQVRQEEAEYIDLPMELPTKSEPDYSEDHGDAAGDAEGTYVEDDTYG
## FT DMRYDDSYFTENEDAGNQTAANTSGGGVTATTSKAVVKQQSQNYSESSFVDTSGDQGNT
## FT EAQVTQHVRNCGPQMFLISRKGGTLLTINNFVYRSNLKFFGKSNNILYWECVQNRSVKC
## FT RSRLKTIGDDLYVTNDVHNHMGDNKRIEAAKAAGMLIHKKLSSLTAADKIQGSWKMDTE
## FT GNPDHLPKM"
choosebank("emblTP")
query("trans", "N=AE003734.PE35")
getTrans(trans$req[[1]])
## Complex transsplicing operations, the correct frame and the correct
## genetic code are automatically used for translation into protein.
## End(Not run)

trimSpace Trim leading and/or trailing spaces in strings

Description

This function removes from a character vector the longest successive run of space characters starting at the beginning of the strings (leading space), or the longest successive run of space characters at the end of the strings (trailing space), or both (this is the default behaviour).

Usage

trimSpace(x, leading = TRUE, trailing = TRUE, space = "[:space:]")


Arguments

x a character vector

leading logical defaulting to TRUE: should leading spaces be trimmed off?

trailing logical defaulting to TRUE: should trailing spaces be trimmed off?

space an extended regular expression defining space characters

Details

The default value for the space character definition is large: in addition to the usual space, other characters such as the tabulation and newline characters are considered as space characters. See extended regular expression for a complete list.
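
One way to obtain this behaviour is to wrap the space definition in a bracket expression and anchor it at either end of the string. The function below is a minimal base-R sketch under that assumption; `trim_space_sketch` is a hypothetical name and the package's own implementation may differ.

```r
trim_space_sketch <- function(x, leading = TRUE, trailing = TRUE,
                              space = "[:space:]") {
  if (leading) {
    # Remove the longest run of space characters at the start of each string.
    x <- sub(paste0("^[", space, "]+"), "", x)
  }
  if (trailing) {
    # Remove the longest run of space characters at the end of each string.
    x <- sub(paste0("[", space, "]+$"), "", x)
  }
  x
}

trim_space_sketch("  seqinR \t")                   # "seqinR"
trim_space_sketch("  seqinR  ", trailing = FALSE)  # "seqinR  "
```

Because the space argument is pasted inside the brackets, a value such as "\t\n" restricts trimming to tabulations and newlines only.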

Value

a character vector with the same length as x.

Author(s)

J.R. Lobry

References

citation("seqinr").

See Also

Extended regular expressions are described in regular expression (aka regexp).

Examples

## Simple use:
stopifnot( trimSpace(" seqinR ") == "seqinR" )

## Basic use, remove space at both ends:
testspace <- c(" with leading space", "with trailing space ", " with both ")
stopifnot(all( trimSpace(testspace) == c("with leading space", "with trailing space", "with both")))

## Remove only leading space:
stopifnot(all( trimSpace(testspace, trailing = FALSE) == c("with leading space", "with trailing space ", "with both ")))

## Remove only trailing space:
stopifnot(all( trimSpace(testspace, leading = FALSE) == c(" with leading space", "with trailing space", " with both")))

## This should do nothing:
stopifnot(all( trimSpace(testspace, leading = FALSE, trailing = FALSE) == testspace))

## How to use alternative space characters:
allspaces <- "\t\n\v\f\r seqinR \t\n\v\f\r"
stopifnot(trimSpace(allspaces) == "seqinR")
stopifnot(trimSpace(allspaces, space = "\t\n") == "\v\f\r seqinR \t\n\v\f\r")

uco Codon usage indices

Description

uco calculates some codon usage indices: the codon counts eff, the relative frequencies freq or the Relative Synonymous Codon Usage rscu.

Usage

uco(seq, frame = 0, index = c("eff", "freq", "rscu"), as.data.frame = FALSE,
    NA.rscu = NA)

Arguments

seq a coding sequence as a vector of chars

frame an integer (0, 1, 2) giving the frame of the coding sequence

index codon usage index choice, partial matching is allowed. eff for codon counts, freq for codon relative frequencies, and rscu for the RSCU index

as.data.frame logical. If TRUE, all indices are returned in a data frame.

NA.rscu when an amino-acid is missing, RSCU values are no longer defined and are reported as missing values (NA). You can force them to another value (typically 0 or 1) with this argument.

Details

Codons with ambiguous bases are ignored.

RSCU is a simple measure of non-uniform usage of synonymous codons in a coding sequence (Sharp et al. 1986). RSCU values are the number of times a particular codon is observed, relative to the number of times that the codon would be observed for a uniform synonymous codon usage (i.e. all the codons for a given amino-acid have the same probability). In the absence of any codon usage bias, the RSCU values would be 1.00 (this is the case for sequence cds in the examples below). A codon that is used less frequently than expected will have an RSCU value of less than 1.00 and vice versa for a codon that is used more frequently than expected.
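
The definition can be worked through by hand for a single amino-acid. The sketch below uses made-up counts for the four glycine codons (the `gly_counts` values are hypothetical, not taken from any real sequence) and plain base R rather than uco.

```r
# Hypothetical counts for the four glycine codons in some coding sequence.
gly_counts <- c(gga = 10, ggc = 30, ggg = 5, ggt = 15)

# Expected count per codon under uniform synonymous usage:
# total count for the amino-acid divided by its number of synonymous codons.
expected <- sum(gly_counts) / length(gly_counts)

# RSCU = observed count / expected count.
rscu <- gly_counts / expected
round(rscu, 2)   # gga 0.67, ggc 2.00, ggg 0.33, ggt 1.00
```

Within one amino-acid the RSCU values always average to 1, so values above 1 flag preferred codons and values below 1 flag avoided ones.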

Do not use correspondence analysis on RSCU tables as this is a source of artifacts (Perriere and Thioulouse 2002). Within-aminoacid correspondence analysis is a simple way to study synonymous codon usage (Charif et al. 2005).

If as.data.frame is FALSE, uco returns one of these:

eff a table of codon counts

freq a table of codon relative frequencies

rscu a numeric vector of relative synonymous codon usage values

If as.data.frame is TRUE, uco returns a data frame with five columns:

aa a vector containing the name of amino-acid

codon a vector containing the corresponding codon

eff a numeric vector of codon counts

freq a numeric vector of codon relative frequencies

rscu a numeric vector of RSCU index

Value

If as.data.frame is FALSE, the default, a table for eff and freq and a numeric vector for rscu. If as.data.frame is TRUE, a data frame with all indices is returned.

Author(s)

D. Charif, J.R. Lobry, G. Perriere

References

citation("seqinr")

Sharp, P.M., Tuohy, T.M.F., Mosurski, K.R. (1986) Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucl. Acids. Res., 14:5125-5143.

Perriere, G., Thioulouse, J. (2002) Use and misuse of correspondence analysis in codon usage studies. Nucl. Acids. Res., 30:4548-4555.

Charif, D., Thioulouse, J., Lobry, J.R., Perriere, G. (2005) Online Synonymous Codon Usage Analyses with the ade4 and seqinR packages. Bioinformatics, 21:545-547. http://pbil.univ-lyon1.fr/members/lobry/repro/bioinfo04/.


Examples

## Show all possible codons:
words()

## Make a coding sequence from this:
(cds <- s2c(paste(words(), collapse = "")))

## Get codon counts:
uco(cds, index = "eff")

## Get codon relative frequencies:
uco(cds, index = "freq")

## Get RSCU values:
uco(cds, index = "rscu")

## Show what happens with ambiguous bases:
uco(s2c("aaannnttt"))

## Use a real coding sequence:
rcds <- read.fasta(file = system.file("sequences/malM.fasta", package = "seqinr"))[[1]]
uco( rcds, index = "freq")
uco( rcds, index = "eff")
uco( rcds, index = "rscu")
uco( rcds, as.data.frame = TRUE)

## Show what happens with RSCU when an amino-acid is missing:
ecolicgpe5 <- read.fasta(file = system.file("sequences/ecolicgpe5.fasta", package = "seqinr"))[[1]]
uco(ecolicgpe5, index = "rscu")

## Force NA to zero:
uco(ecolicgpe5, index = "rscu", NA.rscu = 0)

ucoweight Weight of each synonymous codon

Description

Returns a list containing, for each of the 20 amino acids + STOP codon, the codon usage bias of each synonymous codon according to a given coding sequence.

Usage

ucoweight(sequence, numcode = 1)

Arguments

sequence A nucleic acids sequence

numcode The genetic code number as in translate


Value

a list containing, for each of the 20 amino acids and STOP codon (list tags), the weight of each synonymous codon (including the original one).

Author(s)

Leonor Palmeira

References

citation("seqinr")

See Also

synsequence

Examples

data(ec999)
ucoweight(ec999[1][[1]])

words To get all words from an alphabet.

Description

Generates a vector of all the words from a given alphabet, with right positions varying faster. For instance, if the alphabet is c("0", "1") and the length is 2 you will obtain c("00", "01", "10", "11").
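
The ordering rule (rightmost position varying fastest) can be reproduced in base R with expand.grid. The function `words_sketch` below is a hypothetical stand-in written for illustration; it is not the package's implementation.

```r
words_sketch <- function(length = 3, alphabet = c("a", "c", "g", "t")) {
  # expand.grid() varies its FIRST column fastest, so the columns are
  # reversed before pasting to make the rightmost letter vary fastest.
  grid <- expand.grid(rep(list(alphabet), length), stringsAsFactors = FALSE)
  cols <- unname(as.list(rev(grid)))
  do.call(paste0, cols)
}

words_sketch(2, c("0", "1"))   # "00" "01" "10" "11"
head(words_sketch(), 4)        # "aaa" "aac" "aag" "aat"
```

With the default arguments the sketch returns the 64 codons in the same order as the check in the Examples section.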

Usage

words(length = 3, alphabet = s2c("acgt"))

Arguments

length the number of characters in the words

alphabet a vector of characters

Value

A vector of strings with length characters.

Author(s)

J.R. Lobry


References

citation("seqinr")

See Also

kronecker, outer

Examples

## Get all 64 codons:
stopifnot(all(words() ==
c("aaa", "aac", "aag", "aat", "aca", "acc", "acg", "act", "aga", "agc", "agg", "agt", "ata", "atc", "atg", "att",
"caa", "cac", "cag", "cat", "cca", "ccc", "ccg", "cct", "cga", "cgc", "cgg", "cgt", "cta", "ctc", "ctg", "ctt",
"gaa", "gac", "gag", "gat", "gca", "gcc", "gcg", "gct", "gga", "ggc", "ggg", "ggt", "gta", "gtc", "gtg", "gtt",
"taa", "tac", "tag", "tat", "tca", "tcc", "tcg", "tct", "tga", "tgc", "tgg", "tgt", "tta", "ttc", "ttg", "ttt")))

## Get all codons with u c a g for bases:
words(alphabet = s2c("ucag"))

## Get all tetranucleotides:
words(length = 4)

## Get all dipeptides:
words(length = 2, alphabet = a()[-1])

words.pos Positions of possibly degenerated motifs within sequences

Description

words.pos searches all the occurrences of the motif pattern within the sequence text and returns their positions. This function is based on regexp, thus allowing for complex motif searches.

Usage

words.pos(pattern, text, extended = TRUE, perl = FALSE)

Arguments

pattern character string containing a regular expression to be matched in the given character vector.

text a character vector where matches are sought.

extended if ‘TRUE’, extended regular expression matching is used, and if ‘FALSE’ basic regular expressions are used.

perl logical. Should perl-compatible regexps be used if available? Has priority over ‘extended’.

Details

The regular expressions used are those specified by POSIX 1003.2, either extended or basic, depending on the value of the ‘extended’ argument, unless ‘perl = TRUE’, in which case they are those of PCRE, ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/. ‘perl = TRUE’ will only be available if R was compiled against PCRE: this is detected at configure time. All Unix and Windows systems should have it.

Value

a vector of positions for which the motif pattern was found in the sequence text.
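
A regexp-based position search of this kind can be sketched with base R's gregexpr. The function `words_pos_sketch` is a hypothetical illustration, not the package's code: it omits the extended argument, finds non-overlapping matches only, and returning an empty vector when nothing matches is an assumption of this sketch.

```r
words_pos_sketch <- function(pattern, text, perl = FALSE) {
  # gregexpr() gives the start of every non-overlapping match, or -1 if none.
  pos <- as.integer(gregexpr(pattern, text, perl = perl)[[1]])
  if (length(pos) == 1 && pos == -1) integer(0) else pos
}

myseq <- "tatagaga"
words_pos_sketch("t", myseq)          # 1 3
words_pos_sketch("ga", myseq)         # 5 7
words_pos_sketch("[ct][ag]", myseq)   # 1 3
```

Bracket expressions such as "[ct][ag]" are what make degenerated motifs (here YpR) searchable with a single call.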

Author(s)

J.R. Lobry

References

citation("seqinr")

See Also

regexpr

Examples

myseq <- "tatagaga"
words.pos("t", myseq) # Should be 1 3
words.pos("tag", myseq) # Should be 3
words.pos("ga", myseq) # Should be 5 7
# How to specify ambiguous base ? Look for YpR motifs by
words.pos("[ct][ag]", myseq) # Should be 1 3

write.fasta Write file in fasta format

Description

Writes sequences to a file in FASTA format.

Usage

write.fasta(sequences, names, nbchar = 60, file.out, open = "w")


Arguments

sequences A DNA or protein sequence (i.e. a character vector) or a list of sequences (i.e. a list of character vectors).

names The name(s) of the sequences.

nbchar The number of characters per line (default: 60)

file.out The name of the output file.

open Mode to open the output file, use "w" to write into a new file, use "a" to append at the end of an already existing file.
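
The roles of nbchar and open can be illustrated with a small base-R writer. The function `write_fasta_sketch` is a hypothetical sketch that mirrors the argument names above; it assumes non-empty sequences given as vectors of single characters and is not the package's implementation.

```r
write_fasta_sketch <- function(sequences, names, nbchar = 60, file.out, open = "w") {
  # Accept a single sequence as well as a list of sequences.
  if (!is.list(sequences)) sequences <- list(sequences)
  con <- file(file.out, open = open)
  on.exit(close(con))
  for (i in seq_along(sequences)) {
    s <- paste(sequences[[i]], collapse = "")
    # Start position of each output line, nbchar characters apart.
    starts <- seq(1, nchar(s), by = nbchar)
    writeLines(paste0(">", names[i]), con)
    writeLines(substring(s, starts, starts + nbchar - 1), con)
  }
  invisible(NULL)
}

tmp <- tempfile(fileext = ".fasta")
write_fasta_sketch(strsplit("acgtacg", "")[[1]], names = "s1", nbchar = 3, file.out = tmp)
readLines(tmp)   # ">s1" "acg" "tac" "g"
```

Calling the function again on the same file with open = "a" adds further records after the existing ones instead of overwriting them.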

Value

NULL

Author(s)

A. Necsulea

References

citation("seqinr")

See Also

read.fasta

Examples

## Read sequences from a FASTA file:
ortho <- read.fasta(file = system.file("sequences/ortho.fasta", package = "seqinr"))

## Select only third codon positions:
ortho3 <- lapply(ortho, function(x) x[seq(from = 3, to = length(x), by = 3)])

## Write the modified sequences to a file:
write.fasta(sequences = ortho3, names = names(ortho3), nbchar = 80, file.out = "ortho3.fasta")

## Read it again from the same file and check that sequences are preserved:
ortho3bis <- read.fasta("ortho3.fasta", set.attributes = FALSE)
stopifnot(identical(ortho3bis, ortho3))


dinucleotides Statistical over- and under-representation of dinucleotides in a sequence

Description

These two functions compute two different types of statistics for the measure of statistical dinucleotide over- and under-representation: the rho statistic and the z-score, each computed for all 16 dinucleotides.

Usage

rho(sequence, alphabet = s2c("acgt"))
zscore(sequence, simulations = NULL, modele, exact = FALSE, alphabet = s2c("acgt"), ... )

Arguments


simulations If NULL, analytical solution is computed when available (models base and codon). Otherwise, it should be the number of permutations for the z-score computation

modele A string of characters describing the model chosen for the random generation

exact Whether exact analytical calculation or an approximation should be used

alphabet A vector of single characters.

... Optional parameters for specific model permutations are passed on to the permutation function.

Details

The rho statistic, as presented in Karlin S., Cardon LR. (1994), can be computed on each of the 16 dinucleotides. It is the frequency of dinucleotide xy divided by the product of the frequencies of nucleotide x and nucleotide y. It is equal to 1.00 when dinucleotide xy is formed by pure chance, and it is greater (respectively less) than 1.00 when dinucleotide xy is over- (respectively under-) represented.
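
The ratio described above can be computed directly in base R. The function `rho_sketch` below is a hypothetical illustration, not the package's code; it assumes that dinucleotides are counted in overlapping windows (positions 1-2, 2-3, and so on), and it returns NaN for dinucleotides built from a base that is absent from the sequence.

```r
rho_sketch <- function(sequence, alphabet = c("a", "c", "g", "t")) {
  n <- length(sequence)
  # Relative frequency of each single nucleotide.
  f1 <- as.vector(table(factor(sequence, levels = alphabet))) / n
  # Overlapping dinucleotides at positions (1,2), (2,3), ..., (n-1,n).
  di <- paste0(sequence[-n], sequence[-1])
  all_di <- as.vector(t(outer(alphabet, alphabet, paste0)))
  f2 <- as.vector(table(factor(di, levels = all_di))) / (n - 1)
  # rho(xy) = f(xy) / (f(x) * f(y)), in the same order as all_di.
  expected <- as.vector(t(outer(f1, f1)))
  setNames(f2 / expected, all_di)
}

set.seed(1)
s <- sample(c("a", "c", "g", "t"), 6000, replace = TRUE)
round(rho_sketch(s), 2)   # values near 1 are expected for a random sequence
```

On a strictly alternating sequence such as "acacac...", the same function gives a rho close to 2 for "ac" and exactly 0 for "aa", which matches the over- and under-representation reading given above.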

The zscore statistic is presented in Palmeira, L., Guéguen, L. and Lobry JR. (2006). The statistic is the normalization of the rho statistic by its expectation and variance according to a given random sequence generation model, and follows the standard normal distribution. This statistic can be computed with several models (cf. permutation for the description of each of the models). We provide analytical calculations for two of them: the base permutations model and the codon permutations model.

The base model allows for random sequence generation by shuffling (with/without replacement) of all bases in the sequence. Analytical computations are available for this model: either as an approximation for large sequences (cf. Palmeira, L., Guéguen, L. and Lobry JR. (2006)), or as the exact analytical formulae (cf. Schbath, S. (1995)).

The position model allows for random sequence generation by shuffling (with/without replacement) of bases within their position in the codon (bases in position I, II or III stay in position I, II or III in the new sequence).

The codon model allows for random sequence generation by shuffling (with/without replacement) of codons. Analytical computation is available for this model.

The syncodon model allows for random sequence generation by shuffling (with/without replacement) of synonymous codons.

Value

a table containing the computed statistic for each dinucleotide

Author(s)

Leonor Palmeira

References

citation("seqinr")

Karlin S. and Cardon LR. (1994) Computational DNA sequence analysis. Annu Rev Microbiol, 48:619-54.


Schbath, S. (1995) Étude asymptotique du nombre d’occurrences d’un mot dans une chaîne de Markov et application à la recherche de mots de fréquence exceptionnelle dans les séquences d’ADN. Thèse de l’Université René Descartes, Paris V

See Also

permutation

Examples

sequence <- sample(s2c('acgt'), 6000, replace = TRUE)
rho(sequence)
zscore(sequence, modele = 'base')
zscore(sequence, modele = 'base', exact = TRUE)
zscore(sequence, modele = 'codon')
zscore(sequence, 1000, modele = 'syncodon')


