crunching molecules and numbers in r
TRANSCRIPT
Crunching Molecules and Numbers in R
Rajarshi Guha
Background
Molecules in R
Chemical Data
Parallel Paradigms
Crunching Molecules and Numbers in R
Rajarshi Guha
NIH Chemical Genomics Center
238th ACS National Meeting, 17th August, 2009
Outline
- Some background on R
- Doing cheminformatics in R
R History
S
- 1976: S developed by John Chambers at Bell Labs
- 1988: S rewritten in C
- 1993: Licensed to Insightful Corp.
- 2004: Bought by Insightful Corp. for $2M
- 2008: Bought by TIBCO for $25M
R
- 1991: Created by Ihaka & Gentleman
- 1993: First public release
- 1995: Released under GPL
- 2000: R 1.0.0
- 2009: R 2.9
An overview of R
- An environment for statistical computation
- Wide variety of standard and state of the art statistical methods built in or accessible via packages
- But also a complete, interpreted programming language
- Well suited for manipulating and operating on datasets - numerical, categorical or a mixture - and of varying shape
- Impressive visualization facilities (but not very interactive)
An overview of R
- Syntax is pretty much S-Plus
- Highly cross-platform
- Frequent and regular releases, active development by core group
- The dev and user community is extremely active
  - r-help is not just for learning R, you can get a decent statistics education from the list!
- Used by many top statisticians, many cutting edge techniques first show up in R
Usability
- Default mode is a command-line-like prompt
- GUIs available
- But learning curve is steep
- Does force you to think about the analysis
- Not a great tool for casual, once-in-a-while usage
R primitives
- Numeric, character, list, matrix, data.frame

> x <- 'Hello World'
> x <- 1
> x <- c(1,2,3,4,5,6)
> x
[1] 1 2 3 4 5 6
> x <- data.frame(MW=runif(5, 10, 50),
+                 hERG=sample(c('active','inactive'),
+                             5, TRUE))
> x
        MW     hERG
1 23.55435   active
2 42.90365 inactive
3 49.35149   active
4 26.85912   active
5 10.01877   active
Matrix oriented programming
- Similar in style to Matlab
- Easily access (multiple) rows, columns
  - Vector/matrix indexing is very powerful and key to efficient R code
  - Perform operations on entire rows or columns
  - Makes subsetting a trivial operation
- Perfect for QSAR type analyses
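A minimal sketch of the indexing and subsetting described on this slide, using base R only. The descriptor matrix, its column names and the activity values are made up for illustration.

```r
set.seed(1)
# Hypothetical descriptor matrix: 6 molecules (rows) x 3 descriptors (columns)
desc <- matrix(runif(18), nrow = 6,
               dimnames = list(paste0("mol", 1:6), c("MW", "logP", "TPSA")))
activity <- c(5.1, 6.3, 4.8, 7.2, 6.9, 5.5)

first_two <- desc[1:2, ]                # rows 1 and 2, all columns
logp_tpsa <- desc[, c("logP", "TPSA")]  # two columns selected by name
actives   <- desc[activity > 6, ]       # logical indexing: rows with activity > 6
col_means <- colMeans(desc)             # operate on whole columns at once
scaled    <- scale(desc)                # centre and scale every column
```

The logical index `activity > 6` is the typical QSAR-style subset: one expression selects all matching rows without an explicit loop.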
Functional style
- R's functional paradigms are closely tied to matrix operations
- apply, lapply, sapply, tapply allow you to easily operate on groups of objects
  - Elements of a list
  - Rows and/or columns of a matrix
  - Subsets of data, using a grouping variable
- Anonymous functions are supported
- Use of these functional forms can lead to a speedup compared to traditional for loops
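A short sketch of the four functions named above, one call each, on made-up data (the logP values and toxicity labels are hypothetical).

```r
set.seed(42)
# Hypothetical data: logP values and a toxicity label for 20 compounds
logp  <- runif(20, -2, 6)
toxic <- rep(c("yes", "no"), each = 10)

# lapply: apply a function to each element of a list, returning a list
groups <- split(logp, toxic)
ranges <- lapply(groups, range)

# sapply: same idea, simplified to a vector; uses an anonymous function
n_high <- sapply(groups, function(v) sum(v > 3))

# tapply: apply a function to subsets defined by a grouping variable
mean_by_class <- tapply(logp, toxic, mean)

# apply: operate over rows (1) or columns (2) of a matrix
m <- matrix(logp, ncol = 4)
row_max <- apply(m, 1, max)
```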
Non-functional style

# column std devs
m <- matrix(runif(100*100), ncol=100)
sds <- numeric(ncol(m))
for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

# mean logP of toxic, non-toxic classes
m <- data.frame(logp=runif(100),
                toxic=sample(c('yes','no'),
                             100, TRUE))
toxLogP <- 0
nontoxLogP <- 0
for (j in 1:nrow(m)) {
  if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
  else nontoxLogP <- nontoxLogP + m[j,1]
}
# divide the running sums by the class counts to get the means
toxLogP <- toxLogP / sum(m$toxic == 'yes')
nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
Functional style

# column std devs
m <- matrix(runif(100*100), ncol=100)
apply(m, 2, sd)

# mean logP of toxic, non-toxic classes
m <- data.frame(logp=runif(100),
                toxic=sample(c('yes','no'),
                             100, TRUE))
by(m, m$toxic, function(x) mean(x$logp))
Object oriented style
- R supports multiple object oriented mechanisms
- Simplest is S3 classes
  - Object orientation is in terms of function names
  - Easy to work with, not always flexible enough
- S4 classes are much more powerful, but also more complex
- Many problems can ignore these as R primitives provide sufficient support for attaching meta-data to objects (crude encapsulation)
- Becomes important/useful when writing packages, not for day to day code
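The three mechanisms on this slide can be compared side by side in a small sketch. All names here (the `assay` class, the `Fingerprint` class and its slots, the `popcount` generic) are hypothetical and chosen only for illustration; they are not taken from any package.

```r
# Crude encapsulation: attach meta-data to a plain vector as attributes
ic50 <- c(1.2, 0.4, 3.3)
attr(ic50, "units") <- "uM"

# S3: a class is just a label; dispatch is driven by the function name
# (generic.class), here summary.assay
assay <- structure(list(id = 990, values = ic50), class = "assay")
summary.assay <- function(object, ...) {
  cat("Assay", object$id, "with", length(object$values), "values\n")
  invisible(object)
}
summary(assay)

# S4: formal class definition with typed slots and a validity check
setClass("Fingerprint",
         representation(bits = "numeric", nbit = "numeric"),
         validity = function(object) {
           if (any(object@bits > object@nbit)) "bit position exceeds nbit"
           else TRUE
         })
setGeneric("popcount", function(x) standardGeneric("popcount"))
setMethod("popcount", "Fingerprint", function(x) length(x@bits))

fp <- new("Fingerprint", bits = c(1, 5, 9), nbit = 16)
popcount(fp)   # 3
```

The S4 version costs more code, but an invalid object (for example a bit position larger than `nbit`) is rejected at construction time, which the S3 version does not check.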
Interfacing with C & Fortran
- R is interpreted, functional forms help a bit
- Very useful to refactor inner loops into C (or Fortran)
- Also useful to provide an R interface to pre-existing C/Fortran code
- Can lead to dramatic speedups
[Figure: bar chart of speedup (axis 0 to 50) versus bit length (1024, 166, 79) for 5000 pairwise Tanimoto similarity calculations, MacBook Pro, 2 GHz, 1 GB RAM]
Visualization
- R generates publication quality graphics in a variety of formats
- A huge number of statistical visualization methods (2D, 3D, OpenGL)
- Extremely powerful display specifications
  - core commands
  - lattice (a.k.a. trellis graphics)
- Based on sound statistical theories
- Standard plots are easy to make, but complex plots do have a learning curve
- Interactivity is limited, though some packages do alleviate this
Code quality
- It's not enough to just write code
- RUnit is a package that supports unit testing, analogous to JUnit
- R comes with a well-defined package structure that can be automatically checked for various errors
- Packages can be uploaded to CRAN, which allows any R user to install them directly from R
- Extensive documentation format
- Sweave is an important feature which allows one to include R code and associated text in a single document - literate programming or reproducible research
The downsides of R
- Memory bound (but can use as much memory as you have)
- Language inconsistencies
  - Indexing starts from 1, but no error if you use 0 as an index
  - See blog posts by Radford Neal (U Toronto)
- Debugging environment not so great (though ESS is good for Emacs users)
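The zero-index behaviour mentioned above is easy to see at the prompt. The last three lines show a related pitfall that is not on the slide but follows the same pattern of silently accepted input.

```r
x <- c(10, 20, 30)

x[1]          # first element: indexing is 1-based
x[0]          # no error: returns an empty vector, numeric(0)
length(x[0])  # 0
x[c(0, 2)]    # the 0 is silently dropped, leaving the second element

# Related pitfall: 1:length(v) misbehaves for an empty vector,
# because 1:0 counts down (1, 0) instead of being empty
v <- numeric(0)
1:length(v)   # 1 0
seq_along(v)  # integer(0), the safer idiom
```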
Cheminformatics programming
- Fundamental requirement is support for core chemical concepts
- Representation and manipulation of these concepts
- Flexibility
- Could implement all of this directly in R - lots of wheels would be reinvented
- We also want such functionality to be R-like
  - Writing Java or C in R is not R-like
The Chemistry Development Kit
- Open source Java library for cheminformatics
- Wide variety of functionality
  - Core chemical concepts (atoms, bonds, molecules)
  - SMARTS, pharmacophores
  - Molecular descriptors and fingerprints
  - 2D depictions
- Used in a variety of tools, applications and services
Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
rcdk - CDK from R
[Diagram of components in the R programming environment: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]
rcdk Motivations
- Have access to cheminformatics functionality from within R
- Support processing of data from chemistry databases
- Not reimplement cheminformatics methods
- Have access to all of this in idiomatic R
Basic molecular operations - I/O
- Read in molecular file formats supported by the CDK
- Files can be local or remote
- Parse SMILES strings
- In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
- The resultant molecule objects are Java references and can be passed to a variety of rcdk functions

mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
mol <- parse.smiles('c1ccccc1CC(=O)')
mols <- sapply(c('CC', 'CCCC', 'CCCNC'),
               parse.smiles)
Basic molecular operations
- Given a molecule, we can extract or add properties
- Get lists of atoms and bonds and then manipulate them
- Currently doesn't support a lot of molecular graph operations

# get the atoms from a molecule
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)
# get the coordinate matrix of the molecule
coords <- do.call('rbind',
                  lapply(atoms, get.point3d))
Working with fingerprints
- rcdk will generate a variety of fingerprints via the CDK
- Other packages can generate fingerprints
- The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
- Provides an S4 class representing binary fingerprints

m1 <- parse.smiles('c1ccccc1C(COC)N')
m2 <- parse.smiles('C1CCCCC1C(COC)N')
# Calculate fingerprints
fps <- lapply(list(m1, m2),
              get.fingerprint, type='maccs')
distance(fps[[1]], fps[[2]], method='tanimoto')
# Read fingerprints from a file and compute the pairwise similarity matrix
fps <- fp.read('fp.txt', lf=moe.lf,
               size=166, header=TRUE)
fpsim <- fp.sim.matrix(fps, method='tanimoto')
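For reference, the Tanimoto similarity of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The base R sketch below illustrates that definition only; it is not the fingerprint package's implementation, and representing a fingerprint as a vector of "on" bit positions is an assumption made for the example.

```r
# Tanimoto similarity for two binary fingerprints, each given as a vector
# of "on" bit positions (an assumed representation for this sketch)
tanimoto <- function(a, b) {
  common <- length(intersect(a, b))
  total  <- length(union(a, b))
  if (total == 0) return(NA_real_)  # both fingerprints empty: undefined
  common / total
}

fp1 <- c(1, 4, 7, 12, 30)
fp2 <- c(1, 4, 9, 12)

tanimoto(fp1, fp2)   # 3 common bits / 6 distinct bits = 0.5

# Pairwise similarity matrix over a list of fingerprints
fps <- list(fp1, fp2, c(2, 4, 30))
sim <- outer(seq_along(fps), seq_along(fps),
             Vectorize(function(i, j) tanimoto(fps[[i]], fps[[j]])))
```

The resulting matrix is symmetric with ones on the diagonal, which is a quick sanity check for any similarity matrix.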
rcdk and QSAR
[Diagram: Molecular Descriptors, Machine Learning, Property]
rcdk and QSAR
- Access to descriptors and fingerprints makes for very easy QSAR modeling within R
- Evaluate the descriptors (individually, by type or all)
- Get back a data.frame which can be used as input to pretty much any modeling method

mols <- load.molecules('big.sdf')
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
str(descs)
'data.frame': 467 obs. of 180 variables:
 $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
 $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
 $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ...
 $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
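Once the descriptors are in a data.frame, the modeling step is ordinary R. The sketch below uses synthetic numbers in place of real descriptor output so that it runs without rcdk; the column names `d1` to `d4` and the activity values are made up.

```r
set.seed(7)
# Stand-in for the data.frame returned by descriptor evaluation:
# 50 hypothetical molecules x 4 hypothetical descriptors
descs <- data.frame(d1 = rnorm(50), d2 = rnorm(50),
                    d3 = rnorm(50), d4 = rnorm(50))
# Synthetic property that depends on d1 and d3 plus noise
activity <- 2 * descs$d1 - 1.5 * descs$d3 + rnorm(50, sd = 0.3)

# Drop constant or missing columns, a common clean-up step before modeling
keep  <- sapply(descs, function(v) !any(is.na(v)) && sd(v) > 0)
descs <- descs[, keep, drop = FALSE]

# Split into training and test sets, then fit a linear model
train <- sample(nrow(descs), 35)
d     <- cbind(descs, activity = activity)
fit   <- lm(activity ~ ., data = d[train, ])

# Predict the held-out molecules and compute the squared correlation
pred <- predict(fit, newdata = d[-train, ])
r2   <- cor(pred, activity[-train])^2
```

Swapping `lm` for another modeling function is usually a one-line change, since most of them accept the same formula and data.frame inputs.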
Viewing molecules
- While numerical modeling is a fundamental task in this environment, visualization is also important
- Either view structures of individual molecules or tables of structure and data
- rcdk supports both (not very well on OS X)

mol <- parse.smiles('c1ccccc1C(N)CC')
view.molecule.2d(mol)
smiles <- c("CCC", "CCN", "CCN(C)(C)",
            "c1ccccc1Cc1ccccc1",
            "C1CCC1CC(CN(C)(C))CC(=O)CC")
mols <- sapply(smiles, parse.smiles)
view.molecule.2d(mols)
Downsides to rcdk
- Can't save state of Java objects
- Doesn't take advantage of S4 classes to provide R-side representations of CDK classes
- Incomplete coverage of the CDK API - sometimes need to go down to rJava to perform an operation
- Big datasets are problematic (mainly due to R limitations)
Access to chemical databases
- Useful to be able to transparently access data from various public data sources
- PubChem compounds and assays are supported via rpubchem
- Compound access is primarily by CID, while assay data can be obtained from keyword searches
- End up with a data.frame containing all relevant assay information (along with meta-data as attributes)
- R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)
Access to PubChem

> dat <- get.cids(1:30)
> str(dat)
'data.frame': 30 obs. of 11 variables:
 $ CID : chr "1" "2" "3" "4" ...
 $ IUPACName : chr "3-acetyloxy-4-(trimethylazaniumyl)butanoate" "(2-acetyloxy-4-hydroxy-4-oxobutyl)-trimethylazanium" "5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid" "1-aminopropan-2-ol" ...
 $ CanonicalSmile : chr "CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C" "CC(=O)OC(CC(=O)O)C[N+](C)(C)C" "C1=CC(C(C(=C1)C(=O)O)O)O" "CC(CN)O" ...
 $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H8O4" "C3H9NO" ...
 $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.1 ...
> find.assay.id('LDR')
 [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
> adat <- get.assay(990)
> str(adat)
'data.frame': 51 obs. of 9 variables:
 $ PUBCHEM.SID : int 845800 848472 852502 857608 859584 4254770 4256513 4258900 7972131 7976210 ...
 $ PUBCHEM.CID : int 648162 6603466 655127 658956 660889
Bioinformatics in R
- While the focus is on cheminformatics, many problems involve bioinformatics to some degree
- The Bioconductor project provides a wide variety of packages
- A lot of it is focused on gene expression analysis
- A number of packages provide access to various biological databases, annotations etc.
- Protein structure analysis is supported in R via Bio3d
- Never have to leave the comfort of R
http://www.bioconductor.org/
Grant, B. et al, Bioinformatics, 2006, 22, 2695–2696
Long calculations, big data
- Many statistical methods require long running calculations
  - Bootstrap
  - Bayesian methods
- Many problems involve large datasets
- A common feature of both scenarios is that they can be trivially parallelized
  - As opposed to requiring a parallel version of the underlying algorithm
- R has good support for both trivial and non-trivial parallelization methods
- See R/parallel for a package that will parallelize actual R code
Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
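The bootstrap is a good example of a trivially parallel job: every replicate is independent, so the same function can be run serially or handed to worker processes unchanged. The sketch below uses the `parallel` package that ships with current R, not the R/parallel package cited above; that substitution, the two-worker cluster and the synthetic data are all assumptions made for illustration.

```r
library(parallel)   # assumption: base "parallel" package, not R/parallel

set.seed(1)
x <- rnorm(200, mean = 5)

# One bootstrap replicate: resample with replacement, compute the statistic
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

# Serial version
serial_res <- sapply(1:500, boot_mean, data = x)

# Parallel version: the same function, farmed out to two worker processes
cl <- makeCluster(2)
clusterSetRNGStream(cl, 123)   # reproducible, independent RNG streams
parallel_res <- parSapply(cl, 1:500, boot_mean, data = x)
stopCluster(cl)

# Both give a bootstrap estimate of the standard error of the mean
c(serial = sd(serial_res), parallel = sd(parallel_res))
```

Nothing in `boot_mean` had to change to run in parallel, which is what distinguishes this case from parallelizing the underlying algorithm itself.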
Simple parallelization
- The snow package allows easy use of multiple cores on a single computer or a cluster of computers
- A simple wrapper over other parallel R libraries
- Can support PVM, MPI
- At the very least you can use all the cores on your own machine
http://cran.r-project.org/web/packages/snow/
Serial code - Feature Selection
- Rather than use GA, SA etc., just look at all combinations
- Inelegant, but no worries about missing the global optimum

x <- matrix(runif(500*40), ncol=40)
y <- runif(500)
library(gtools)
combos <- combinations(40, 3)
apply(combos, 1, function(z) {
  d <- data.frame(y=y, x=x[,z])
  fit <- lm(y~., data=d)
  cor(y, fit$fitted)^2
})
Simple parallelization - Feature Selection
- Trivially parallelized

x <- matrix(runif(500*40), ncol=40)
y <- runif(500)
library(gtools)
combos <- combinations(40, 3)
library(snow)
cl <- makeSOCKcluster(2)
# make x and y visible on the worker processes
clusterExport(cl, "x")
clusterExport(cl, "y")
parApply(cl, combos, 1, function(z) {
  d <- data.frame(y=y, x=x[,z])
  fit <- lm(y~., data=d)
  cor(y, fit$fitted)^2
})
stopCluster(cl)  # shut down the workers when done
Big data scenarios
- The idea behind snow can also be used to handle very large datasets
- Simply chunk the data appropriately and papply over the list of filenames
- Still requires you to perform chunking and keep track of everything
- Hadoop is a nice way to avoid all this
  - Throw one or more (very) large files at it, let it deal with chunking and computation
  - For non-trivial file formats, you need to implement a chunker
- RHIPE provides access to a Hadoop cluster from within R
http://hadoop.apache.org/core/
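The chunk-and-apply approach from the first bullets can be sketched in base R. The chunk files, the `mw` column and the `summarise_chunk` helper are hypothetical; a serial `lapply` stands in for the parallel call so the example runs anywhere.

```r
set.seed(3)
# Create three hypothetical chunk files, standing in for a dataset
# too large to load in one go
chunk_files <- sapply(1:3, function(i) {
  f <- tempfile(fileext = ".csv")
  write.csv(data.frame(mw = runif(100, 100, 500)), f, row.names = FALSE)
  f
})

# Per-chunk work: read one file, return a small summary
summarise_chunk <- function(f) {
  d <- read.csv(f)
  c(n = nrow(d), total = sum(d$mw))
}

# Map over the list of filenames; with a snow cluster cl this lapply
# would become parLapply(cl, chunk_files, summarise_chunk)
parts <- lapply(chunk_files, summarise_chunk)

# Reduce: combine the per-chunk summaries into a global mean
totals  <- Reduce(`+`, parts)
mean_mw <- totals[["total"]] / totals[["n"]]

unlink(chunk_files)   # clean up the temporary files
```

Only the small per-chunk summaries are ever held together in memory, but the bookkeeping (creating chunks, tracking filenames, combining results) is still manual, which is the overhead Hadoop is meant to remove.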
Summary
- rcdk successfully integrates cheminformatics functionality into the R environment
- Related packages provide access to other forms of chemical data (fingerprints) and data sources
- An excellent environment for chemical and biological data mining