Crunching Molecules and Numbers in R

Rajarshi Guha

NIH Chemical Genomics Center

238th ACS National Meeting, 17th August, 2009

Background

Molecules in R

Chemical Data

Parallel Paradigms


Outline

- Some background on R
- Doing cheminformatics in R


R History

S

  1976  S developed by John Chambers at Bell Labs
  1988  S rewritten in C
  1993  Licensed to Insightful Corp.
  2004  Bought by Insightful Corp. for $2M
  2008  Bought by TIBCO for $25M

R

  1991  Created by Ihaka & Gentleman
  1993  First public release
  1995  Released under GPL
  2000  R 1.0.0
  2009  R 2.9


An overview of R

- An environment for statistical computation
- Wide variety of standard and state-of-the-art statistical methods built in or accessible via packages
- But also a complete, interpreted programming language
- Well suited for manipulating and operating on datasets (numerical, categorical or a mixture) of varying shape
- Impressive visualization facilities (but not very interactive)


An overview of R

- Syntax is pretty much S-Plus
- Highly cross-platform
- Frequent and regular releases, active development by core group
- The dev and user community is extremely active
  - r-help is not just for learning R, you can get a decent statistics education from the list!
- Used by many top statisticians; many cutting-edge techniques first show up in R


Usability

- Default mode is a command-line-like prompt
- GUIs available
- But learning curve is steep
- Does force you to think about the analysis
- Not a great tool for casual, once-in-a-while usage


R primitives

- Numeric, character, list, matrix, data.frame

```r
> x <- 'Hello World'
> x <- 1
> x <- c(1,2,3,4,5,6)
> x
[1] 1 2 3 4 5 6
> x <- data.frame(MW=runif(5, 10, 50),
+                 hERG=sample(c('active','inactive'),
+                             5, TRUE))
> x
        MW     hERG
1 23.55435   active
2 42.90365 inactive
3 49.35149   active
4 26.85912   active
5 10.01877   active
```
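The session above covers character, numeric and data.frame values. For the two remaining primitives in the list, a minimal sketch might look like the following; the variable names and the compound record are made up for illustration.

```r
# A matrix: a rectangular block in which every element has the same type
m <- matrix(1:6, nrow = 2)
dim(m)        # number of rows, number of columns

# A list: elements may differ in type and length
cmpd <- list(name = "aspirin", mw = 180.16, rings = 1L)
cmpd$mw       # access an element by name
names(cmpd)
```

A data.frame is itself a list of equal-length columns, which is why `$` works on both.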


Matrix oriented programming

- Similar in style to Matlab
- Easily access (multiple) rows, columns
  - Vector/matrix indexing is very powerful and key to efficient R code
  - Perform operations on entire rows or columns
  - Makes subsetting a trivial operation
- Perfect for QSAR type analyses
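The indexing operations listed above can be sketched on a small descriptor matrix. The data are random and the descriptor names (MW, logP, TPSA) are only labels chosen for illustration.

```r
set.seed(1)
# Hypothetical descriptor matrix: 6 molecules (rows) x 3 descriptors (columns)
desc <- matrix(runif(18), nrow = 6,
               dimnames = list(paste0("mol", 1:6), c("MW", "logP", "TPSA")))

first2  <- desc[1:2, ]                  # multiple rows by position
twocols <- desc[, c("MW", "logP")]      # multiple columns by name
# rows selected by a logical condition; drop = FALSE keeps the matrix shape
highlogp <- desc[desc[, "logP"] > 0.5, , drop = FALSE]

colMeans(desc)                          # operate on every column at once
scaled <- scale(desc)                   # centre and scale all columns
```

Autoscaling a descriptor matrix before model building, as in the last line, is a single call rather than a loop over columns.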


Functional style

- R's functional paradigms are closely tied to matrix operations
- apply, lapply, sapply, tapply allow you to easily operate on groups of objects
  - Elements of a list
  - Rows and/or columns of a matrix
  - Subsets of data, using a grouping variable
- Anonymous functions are supported
- Use of these functional forms can lead to a speedup compared to traditional for loops
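A short sketch of the three cases listed above (list elements, a grouping variable, an anonymous function), using invented logP values and class labels:

```r
set.seed(42)
# Hypothetical data: logP values for compounds labelled toxic or not
logp  <- runif(20)
toxic <- rep(c("yes", "no"), each = 10)

# sapply over the elements of a list
groups <- split(logp, toxic)
group_means <- sapply(groups, mean)

# tapply gets the same numbers directly from the grouping variable
group_means2 <- tapply(logp, toxic, mean)

# an anonymous function: spread of each group
spread <- sapply(groups, function(v) max(v) - min(v))
```

`split()` followed by `sapply()` and a single `tapply()` call are interchangeable here; which one reads better depends on whether the grouped pieces are needed again later.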


Non-functional style

```r
# column std devs
m <- matrix(runif(100*100), ncol=100)
sds <- numeric(ncol(m))
for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

# mean logP of toxic, non-toxic classes
m <- data.frame(logp=runif(100),
                toxic=sample(c('yes','no'),
                             100, TRUE))
toxLogP <- 0
nontoxLogP <- 0
for (j in 1:nrow(m)) {
  if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
  else nontoxLogP <- nontoxLogP + m[j,1]
}
# turn the running sums into class means
toxLogP <- toxLogP / sum(m$toxic == 'yes')
nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
```


Functional style

```r
# column std devs
m <- matrix(runif(100*100), ncol=100)
apply(m, 2, sd)

# mean logP of toxic, non-toxic classes
m <- data.frame(logp=runif(100),
                toxic=sample(c('yes','no'),
                             100, TRUE))
by(m, m$toxic, function(x) mean(x$logp))
```


Object oriented style

- R supports multiple object oriented mechanisms
- Simplest is S3 classes
  - Object orientation is in terms of function names
  - Easy to work with, not always flexible enough
- S4 classes are much more powerful, but also more complex
- Many problems can ignore these, as R primitives provide sufficient support for attaching meta-data to objects (crude encapsulation)
- Becomes important/useful when writing packages, not for day to day code
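To make "object orientation in terms of function names" concrete, here is a minimal S3 sketch, followed by the attribute-based "crude encapsulation" mentioned above. The class name `molecule` and its fields are hypothetical and unrelated to any class in rcdk.

```r
# S3: the class is just an attribute; methods are located by function name
mol <- structure(list(smiles = "CCO", mw = 46.07), class = "molecule")

print.molecule <- function(x, ...) {
  cat("<molecule", x$smiles, "MW", x$mw, ">\n")
  invisible(x)
}
print(mol)    # dispatches to print.molecule because of the class attribute

# Crude encapsulation: attach meta-data to a plain vector as an attribute
ic50 <- c(1.2, 0.4, 7.9)
attr(ic50, "units") <- "uM"
attr(ic50, "units")
```

Nothing enforces that a `molecule` actually has a `smiles` field, which is the flexibility and the weakness of S3; S4 adds formal class definitions and validity checks at the cost of more ceremony.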


Interfacing with C & Fortran

- R is interpreted, functional forms help a bit
- Very useful to refactor inner loops into C (or Fortran)
- Also useful to provide an R interface to pre-existing C/Fortran code
- Can lead to dramatic speedups

[Figure: speedup (y-axis, 0 to 50) against bit length (1024, 166, 79) for 5000 pairwise Tanimoto similarity calculations; MacBook Pro, 2 GHz, 1 GB RAM]


Visualization

- R generates publication quality graphics in a variety of formats
- A huge number of statistical visualization methods (2D, 3D, OpenGL)
- Extremely powerful display specifications
  - core commands
  - lattice (a.k.a. trellis graphics)
- Based on sound statistical theories
- Standard plots are easy to make, but complex plots do have a learning curve
- Interactivity is limited, though some packages do alleviate this
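As a small illustration of the "core commands" route to a file-based figure, the sketch below writes a scatter plot to PDF using only base graphics. The data are simulated and the output file name is arbitrary.

```r
set.seed(7)
# Simulated data standing in for two descriptors and an activity class
d <- data.frame(mw    = runif(50, 150, 500),
                logp  = rnorm(50, 2.5, 1),
                class = sample(c("active", "inactive"), 50, TRUE))

out <- file.path(tempdir(), "mw_logp.pdf")
pdf(out, width = 5, height = 4)     # png() would give a bitmap instead
plot(logp ~ mw, data = d, pch = 19,
     col = ifelse(d$class == "active", "red", "grey40"),
     xlab = "Molecular weight", ylab = "logP")
legend("topright", legend = c("active", "inactive"),
       col = c("red", "grey40"), pch = 19)
invisible(dev.off())                # close the device so the file is written
```

Only the device call changes when a different output format is wanted; the plotting commands stay the same.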


Code quality

- It's not enough just to write code
- RUnit is a package that supports unit testing, analogous to JUnit
- R comes with a well defined package structure that can be automatically checked for various errors
- Packages can be uploaded to CRAN, which allows any R user to install them directly from R
- Extensive documentation format
- Sweave is an important feature which allows one to include R code and associated text in a single document (literate programming or reproducible research)


The downsides of R

- Memory bound (but can use as much memory as you have)
- Language inconsistencies
  - Indexing starts from 1, but no error if you use 0 as an index
  - See blog posts by Radford Neal (U Toronto)
- Debugging environment not so great (though ESS is good for Emacs users)
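The zero-index behaviour mentioned above is easy to see at the prompt. A minimal demonstration:

```r
x <- c(10, 20, 30)
x[1]           # the first element, since indexing starts at 1
x[0]           # no error: a zero-length numeric vector comes back
length(x[0])   # 0
x[c(0, 2)]     # the zero is silently dropped, leaving the second element
```

An off-by-one bug carried over from a 0-based language therefore tends to produce an empty result further down the line rather than an immediate error.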


Cheminformatics programming

- Fundamental requirement is support for core chemical concepts
- Representation and manipulation of these concepts
- Flexibility
- Could implement all of this directly in R, but lots of wheels would be reinvented
- We also want such functionality to be R-like
  - Writing Java or C in R is not R-like


The Chemistry Development Kit

- Open source Java library for cheminformatics
- Wide variety of functionality
  - Core chemical concepts (atoms, bonds, molecules)
  - SMARTS, pharmacophores
  - Molecular descriptors and fingerprints
  - 2D depictions
- Used in a variety of tools, applications and services

Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120


rcdk - CDK from R

[Diagram: package stack inside the R programming environment. Components shown: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]


rcdk Motivations

- Have access to cheminformatics functionality from within R
- Support processing of data from chemistry databases
- Not reimplement cheminformatics methods
- Have access to all of this in idiomatic R


Basic molecular operations - I/O

- Read in molecular file formats supported by the CDK
- Files can be local or remote
- Parse SMILES strings
- In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
- The resultant molecule objects are Java references and can be passed to a variety of rcdk functions

```r
library(rcdk)

mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
mol <- parse.smiles('c1ccccc1CC(=O)')
mols <- sapply(c('CC', 'CCCC', 'CCCNC'),
               parse.smiles)
```


Basic molecular operations

- Given a molecule, we can extract or add properties
- Get lists of atoms and bonds and then manipulate them
- Currently doesn't support a lot of molecular graph operations

```r
# get the atoms from a molecule
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)

# get the coordinate matrix of the molecule
coords <- do.call('rbind',
                  lapply(atoms, get.point3d))
```


Working with fingerprints

- rcdk will generate a variety of fingerprints via the CDK
- Other packages can generate fingerprints
- The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
- Provides an S4 class representing binary fingerprints

```r
m1 <- parse.smiles('c1ccccc1C(COC)N')
m2 <- parse.smiles('C1CCCCC1C(COC)N')

# Calculate fingerprints
fps <- lapply(list(m1, m2),
              get.fingerprint, type='maccs')
distance(fps[[1]], fps[[2]], method='tanimoto')

# Read fingerprints from a file and build the pairwise similarity matrix
fps <- fp.read('fp.txt', lf=moe.lf,
               size=166, header=TRUE)
fpsim <- fp.sim.matrix(fps, method='tanimoto')
```
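For readers unfamiliar with the Tanimoto measure used above, the arithmetic can be sketched in base R on plain logical vectors: the number of bits set in both fingerprints divided by the number set in either. This is only an illustration of the formula, not the fingerprint package's implementation, and the two 8-bit fingerprints are invented.

```r
# Tanimoto coefficient for two equal-length bit vectors:
# (bits set in both) / (bits set in either)
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  both   <- sum(a & b)
  either <- sum(a | b)
  if (either == 0) return(NA_real_)   # undefined when no bits are set at all
  both / either
}

# Two toy 8-bit fingerprints (invented values)
fp1 <- c(1, 1, 1, 1, 0, 0, 0, 0) == 1
fp2 <- c(0, 0, 1, 1, 1, 0, 0, 0) == 1
tanimoto(fp1, fp2)   # 2 shared bits / 5 bits set in either = 0.4
```

Returning `NA` for two empty fingerprints is one possible convention; other choices (0 or 1) are also seen in practice.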


rcdk and QSAR

[Diagram: Molecular Descriptors, Machine Learning, Property]


rcdk and QSAR

- Access to descriptors and fingerprints makes for very easy QSAR modeling within R
- Evaluate the descriptors (individually, by type or all)
- Get back a data.frame which can be used as input to pretty much any modeling method

```r
mols <- load.molecules('big.sdf')
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
str(descs)
'data.frame': 467 obs. of 180 variables:
 $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
 $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
 $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ...
 $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
```
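Once the descriptors are in a data.frame, the modeling step is ordinary R. The sketch below uses simulated data in place of real descriptors so that it runs without rcdk; the column names `d1` to `d3` and the activity values are invented, and a linear model stands in for whichever method is actually wanted.

```r
set.seed(123)
# Simulated stand-in for a descriptor data.frame; names and values are invented
descs <- data.frame(d1 = rnorm(100), d2 = rnorm(100), d3 = rnorm(100))
activity <- 1.5 * descs$d1 - 0.8 * descs$d2 + rnorm(100, sd = 0.3)

# Drop descriptors that are constant or contain missing values
keep  <- sapply(descs, function(v) !any(is.na(v)) && sd(v) > 0)
descs <- descs[, keep, drop = FALSE]

# Fit on a training split, then predict the held-out rows
train <- sample(nrow(descs), 70)
d     <- data.frame(activity = activity, descs)
fit   <- lm(activity ~ ., data = d[train, ])
pred  <- predict(fit, newdata = d[-train, ])
r2    <- cor(pred, activity[-train])^2
```

Swapping `lm` for another learner usually changes only the fitting line, since most R modeling functions accept the same formula and data.frame interface.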


Viewing molecules

- While numerical modeling is a fundamental task in this environment, visualization is also important
- Either view structures of individual molecules or tables of structure and data
- rcdk supports both (not very well on OS X)

```r
mol <- parse.smiles('c1ccccc1C(N)CC')
view.molecule.2d(mol)

smiles <- c("CCC", "CCN", "CCN(C)(C)",
            "c1ccccc1Cc1ccccc1",
            "C1CCC1CC(CN(C)(C))CC(=O)CC")
mols <- sapply(smiles, parse.smiles)
view.molecule.2d(mols)
```


Downsides to rcdk

- Can't save state of Java objects
- Doesn't take advantage of S4 classes to provide R-side representations of CDK classes
- Incomplete coverage of the CDK API: sometimes need to go down to rJava to perform an operation
- Big datasets are problematic (mainly due to R limitations)


Access to chemical databases

- Useful to be able to transparently access data from various public data sources
- PubChem compounds and assays are supported via rpubchem
- Compound access is primarily by CID, while assay data can be obtained from keyword searches
- End up with a data.frame containing all relevant assay information (along with meta-data as attributes)
- R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)


Access to PubChem

```r
> dat <- get.cids(1:30)
> str(dat)
 $ CID              : chr "1" "2" "3" "4" ...
 $ IUPACName        : chr "3-acetyloxy-4-(trimethylazaniumyl)butanoate" "(2-acetyloxy-4-hydroxy-4-oxobutyl)-trimethylazanium" "5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid" "1-aminopropan-2-ol" ...
 $ CanonicalSmile   : chr "CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C" "CC(=O)OC(CC(=O)O)C[N+](C)(C)C" "C1=CC(C(C(=C1)C(=O)O)O)O" "CC(CN)O" ...
 $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H8O4" "C3H9NO" ...
 $ MolecularWeight  : num 203.2 204.2 156.1 75.1 169.1 ...
> find.assay.id('LDR')
 [1]  990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
> adat <- get.assay(990)
> str(adat)
 $ PUBCHEM.SID : int 845800 848472 852502 857608 859584 4254770 4256513 4258900 7972131 7976210 ...
 $ PUBCHEM.CID : int 648162 6603466 655127 658956 660889
```


Bioinformatics in R

- While the focus is on cheminformatics, many problems involve bioinformatics to some degree
- The Bioconductor project provides a wide variety of packages
- A lot of it is focused on gene expression analysis
- A number of packages provide access to various biological databases, annotations etc.
- Protein structure analysis is supported in R via Bio3d
- Never have to leave the comfort of R

http://www.bioconductor.org/

Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696

http://mccammon.ucsd.edu/~bgrant/bio3d/


Long calculations, big data

- Many statistical methods require long running calculations
  - Bootstrap
  - Bayesian methods
- Many problems involve large datasets
- A common feature of both scenarios is that they can be trivially parallelized
- As opposed to requiring a parallel version of the underlying algorithm
- R has good support for both trivial and non-trivial parallelization methods
- See R/parallel for a package that will parallelize actual R code

Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
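The bootstrap is a convenient example of why such calculations parallelize trivially: every replicate is independent of the others. A serial base R sketch, with simulated data and a hypothetical helper `one_rep`, might look like this:

```r
set.seed(2009)
x <- rnorm(200, mean = 5)    # simulated sample

# One bootstrap replicate: resample with replacement, recompute the statistic
one_rep <- function(i) median(sample(x, replace = TRUE))

# Each call is independent of the rest, so this sapply() is the only line
# that would change if the replicates were farmed out to several workers
boot_medians <- sapply(1:1000, one_rep)

# Percentile confidence interval for the median
ci <- quantile(boot_medians, c(0.025, 0.975))
```

The statistic itself (`median` here) needs no parallel implementation; only the outer loop over replicates is distributed.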


Simple parallelization

- The snow package allows easy use of multiple cores on a single computer or a cluster of computers
- A simple wrapper over other parallel R libraries
- Can support PVM, MPI
- At the very least you can use all the cores on your own machine

http://cran.r-project.org/web/packages/snow/


Serial code - Feature Selection

- Rather than use GA, SA etc., just look at all combinations
- Inelegant, but no worries about missing the global optimum

```r
x <- matrix(runif(500*40), ncol=40)
y <- runif(500)

library(gtools)
combos <- combinations(40, 3)

# r^2 of a linear model for every 3-descriptor subset
apply(combos, 1, function(z) {
  d <- data.frame(y=y, x=x[,z])
  fit <- lm(y~., data=d)
  cor(y, fitted(fit))^2
})
```


Simple parallelization - Feature Selection

- Trivially parallelized

```r
x <- matrix(runif(500*40), ncol=40)
y <- runif(500)

library(gtools)
combos <- combinations(40, 3)

library(snow)
cl <- makeSOCKcluster(2)           # two local worker processes
clusterExport(cl, c("x", "y"))     # copy the data to each worker

parApply(cl, combos, 1, function(z) {
  d <- data.frame(y=y, x=x[,z])
  fit <- lm(y~., data=d)
  cor(y, fitted(fit))^2
})

stopCluster(cl)                    # shut the workers down when done
```


Big data scenarios

- The idea behind snow can also be used to handle very large datasets
- Simply chunk the data appropriately and papply over the list of filenames
- Still requires you to perform chunking and keep track of everything
- Hadoop is a nice way to avoid all this
  - Throw one or more (very) large files at it, let it deal with chunking and computation
  - For non-trivial file formats, you need to implement a chunker
- RHIPE provides access to a Hadoop cluster from within R

http://hadoop.apache.org/core/
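The manual chunk-and-apply pattern described above can be sketched in base R. Here a toy data.frame is written out as three chunk files and summarised one file at a time; the file names, the helper `summarise_chunk` and the data are all invented, and a serial `lapply()` stands in for the parallel apply.

```r
set.seed(1)
# Write a toy dataset out as three chunk files (stand-ins for pieces of a
# dataset too large to load in one go)
big    <- data.frame(id = 1:300, logp = runif(300))
chunks <- split(big, rep(1:3, each = 100))
files  <- file.path(tempdir(), sprintf("chunk%d.csv", seq_along(chunks)))
for (i in seq_along(chunks)) write.csv(chunks[[i]], files[i], row.names = FALSE)

# Per-chunk work: read one file and return only a small summary
summarise_chunk <- function(f) {
  d <- read.csv(f)
  c(n = nrow(d), total = sum(d$logp))
}

# lapply() over the file names is the step a parallel apply would take over
parts <- do.call(rbind, lapply(files, summarise_chunk))

# Combine the partial results into the overall answer
overall_mean <- sum(parts[, "total"]) / sum(parts[, "n"])
```

The bookkeeping visible here (splitting, naming files, recombining partial results) is exactly the part a framework such as Hadoop takes off your hands.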


Summary

- rcdk successfully integrates cheminformatics functionality into the R environment
- Related packages provide access to other forms of chemical data (fingerprints) and data sources
- An excellent environment for chemical and biological data mining
