bioinformatics - · pdf filebioinformatics - lecture 10 bioinformatics software support...
TRANSCRIPT
Bioinformatics - Lecture 10
BioinformaticsSoftware support
Martin Saturka
http://www.bioplexity.org/lectures/
EBI version 0.4
Creative Commons Attribution-Share Alike 2.5 License
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Software for bioinformatics
Common computation tools and systems in bioinformatics.numerical, algebraic and statistical software.computation systems specific to bioinformatics.
Main topicsgeneral software- scripting, licenses- mpi, sse, gpgpuscientific tools- emboss, 3D structures- algebra, regression, graphsR system- syntax, statistics- packages, examples
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Open source
IP: copyrights, trademarks, patents
software licensespublic domainBSD, MITLGPL, GPL
multimedia, textsFDLCC - by, sa, nd, nc
open source licensingOpen source initiativewww.opensource.org/licenses/Creative commonscreativecommons.orgsciencecommons.org
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Programming
theoretical systems and actual languages
approachesimperative
most standard programming languagesdeclarative
functional, logic, constraint programming
languagescompiled
low level work: C/C++, Fortraninterpreted
Python, Tcl/Tk, Perl, Ruby, PHP, Lisp
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Floating point
precisionsingle, double, extended precision, double double,quadruple
hardwareFPU: x87, RISC, pipelinesSIMD: altivec, sse, gpgpu
inverted square ’magic’float InvSqrt(float x) {
float xhalf = 0.5f*x;
int i = *(int*)&x; // float → bitsi = 0x5f3759df - (i>>1); // guess on result valuex = *(float*)&i; // bits → floatx = x*(1.5f-xhalf*x*x); // result value adjustingreturn x;} // relative error below 0.002
Martin Saturka www.Bioplexity.org Bioinformatics - Software
MPI
parallel programming
methodsMPI, PVM, threadswww.open-mpi.org
# include <mpi.h>
...
MPI Init(&argc, &argv);
MPI Comm size(MPI COMM WORLD, &ntasks);
MPI Comm rank(MPI COMM WORLD, &id);
...
MPI Send(msg, ln, MPI INT, dest, tag, MPI COMM WORLD);
MPI Recv(msg, ln, MPI INT, MPI ANY SOURCE, tag, MPI COMM WORLD, &st);
...
MPI Finalize();
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Algebraic methods
non-trivial informatical / numerical methods
hashing - data storageperfect hashing
function from a given constant set of strings to an intervalcuckoo hashing
simple implementation, usage of two hash functions
direct minimizationlinear programming
used e.g. for robust (median) regressionquadratic programming
used e.g. for SVM - support vector machines
eigen problemseigen-vectors as linear data approximation
Martin Saturka www.Bioplexity.org Bioinformatics - Software
FFT NLLS
Fast Fourier transform with non-linear least squares fitting
usagebiological cycle / rhytm study
period determination, run description
stepsdetrending
arithmetic mean subtractingnormalization
variation unificationFFT
taking the greatest valueNLLS
(cosine) curve fitting to data
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Data transformations
initial data preparation peak shapes
interpolated original data
detrended and standardized data
adjusting after theFFT and perioddetermination aredone
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Non-linear fitting
regression with a class of non-linear functions
Levenberg-Marquardt methodsmall errors generally approximated by quadraticsiterative method - smooth interpolation betweenthe steepest descent and the inverse Hessian methodsecond derivatives give ’order’ information,first derivatives give the minimizing directiondamped iteration along the first derivatives
softwareimplemented in many packagesR system, GSL library, Octave, etc.
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Algebra
linear algebra, series, groups, symbolic manipulations
linear algebraOctave (www.octave.org), Scilab (www.scilab.org)arpack, lapack, scalapack, blas, atlaswww.netlib.org/lapack/math-atlas.sourceforge.net
algebraMaxima, GAP, Pari-GP, Axiom, R, GSL, NumPymaxima.sourceforge.netwww.gnu.org/software/gsl/
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Graphics
general visualization software
2D / graphs / diagramsGraphviz (www.graphviz.org)GD, Gnuplot, PLplotXFig, Dia
3D / OpenGL graphicsVTK (www.vtk.org)OpenSceneGraph (www.openscenegraph.org)Pov-Ray (www.povray.org)Blender, DataExplorer, Mayavi, OpenInventor, etc.
Martin Saturka www.Bioplexity.org Bioinformatics - Software
VTK / Tcl interface
#!/usr/bin/env wish8.4package require vtk
vtkSphereSource spheresphere SetRadius 1.0sphere SetCenter 1 1 1sphere SetThetaResolution 8sphere SetPhiResolution 8
vtkConeSource conecone SetHeight 3.0cone SetRadius 1.0cone SetResolution 10
vtkPolyDataMapper sphereMappersphereMapper SetInput [sphere GetOutput]vtkPolyDataMapper coneMapperconeMapper SetInput [cone GetOutput]
vtkActor sphereActorsphereActor SetMapper sphereMapper
vtkActor coneActorconeActor SetMapper coneMapper
vtkRenderer renren AddActor sphereActorren AddActor coneActorren SetBackground 0.9 0.9 0.9
vtkRenderWindow renWinrenWin AddRenderer renrenWin SetSize 300 300
vtkRenderWindowInteractor ireniren SetRenderWindow renWinvtkInteractorStyleTrackballCamera style
iren SetInteractorStyle styleiren AddObserver UserEvent \
wm deiconify .vtkInteractiren Initializewm withdraw .
Martin Saturka www.Bioplexity.org Bioinformatics - Software
VTK example
VTK usagesuitable for complex 3D data visualizationinteractive, screen export, many algorithmsC++ libs, interface to Python, Tcl/Tk, Java
Martin Saturka www.Bioplexity.org Bioinformatics - Software
MM/MD
software for molecular modelling
classical MM/MDGromacs (www.gromacs.org)Tinker, NAMD/VMD
molecular visualizationRasMol, Raster3D, VieMol, Garlic, PyMolwww.openrasmol.orgpymol.sourceforge.net
file toolsOpenBabel - data formats conversionopenbabel.sourceforge.net
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Sequences
standard sequence comparison / manipulation tools
Blastwww.ncbi.nlm.nih.gov/blast/blast.wustl.edu
Embossemboss.sourceforge.netthe European Molecular Biology Open Software Suiteusage for:sequence alignment, database search, motif identification,sequence patterns, presentation tools
Clustal X/W, Phylip, Molphy, fastDNAmlmultiple sequence alignment, phylogenies
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Sequence profiles
hidden-stochastic approaches
HMM methodshmmerhmmer.janelia.orgemboss.sourceforge.net/embassy/hmmer/build and calibrate modelsalign and extract sequences
CM methodsRfamrfam.janelia.orgwww.sanger.ac.uk/Software/Rfam/sequence alignments, covariant models
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Expression clustering
Clusterprogram, C library, Python/Perl interfacebonsai.ims.u-tokyo.ac.jp/˜mdehoon/
software/cluster/software.htm
clustering gene expression datak-means clusters, hierarchical clusteringoriginal software by Eisenrana.lbl.gov/EisenSoftware.htm
cluster visualization by treeViewjtreeview.sourceforge.net
LinguaR-system package for gene expression data-miningwww.bioplexity.orgrelation search and clustering
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Open source sites
bioinformatics open-source software sites
common / bioinformatics repositoriesbioinformatics.org
lists on bioinformatics software, databases, newssourceforge.net
general repository of open-source software
common / bioinformatics projectswww.r-project.org
general statistics and microarray analysis softwarewww.open-bio.org
biological sequences oriented scripting tools
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Statistics
statistics methods
branchesexplorative / descriptive statistics
data characteristics, as mean, variance, etc.confirmative / inferential statistics
comparing achieved p-values to α significance (0.05) level
parametric methodswhen we assume a known class of (usually normal)distributions of random errorsexample: Student’s t-test
robust methodstests without assumption of a distributionusually safe, but could be weak on distinguishingexample: quantile tests
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R system
the standard open-source statistics software
descriptionsystem for statistical computingwith many statistical tests, modelling, time-series, etc.graphics with suitable 2D/3D plotsdata models on matrices, arrays, data-framesspecific functionality by CRAN packages
aboutnot for string processing (use Perl/Python/Ruby),not for internal processing of large databases(use respective DBMS)originally S system, now R and S++ systemsused commonly for bioinformatics, biostatistics,econometrics
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R syntax
vectors, matrices, arrays for regular datadata frames: matrix like-structures for database-like tables,i.e. particular columns of possibly different types
z1 <- c(2.3, 3.5, 12.1, 4.9, 8.2)sum(z1)/length(z1); mean(z1); var(z1)z2 <- 2*z1 - 1z3 <- array (c(1:24), c(4,6))z3[1, 3:5] <- NAz3[is.na(z3)] <- 0
f <- function (x1, x2) {x3 <- (x1 * x2)ˆ0.5x3
}f(2,3)
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R statistical models
dependency description formulae
˜ operator for model definitionY ˜ X the Y response depends on XY ˜ X1 + X2 the Y depends on both X1 and X2Y ˜ X1 - X2 the Y depends on X1, not on X2
linear regression of y by x :x <- c(2.3, 3.5, 12.1, 4.9, 8.2)y <- c(4.3, 5.6, 30.0, 12.5, 20.7)y ˜ xclassification analysis of variance:av <- state <- c("one", "one", "two", "one", "two")A <- factor(av)y ˜ Aclassification analysis of covariance:y ˜ A + x
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R example
simple R usage on (statistics) problems
linear regressionlm(formula = y ˜ x)
analysis of varianceaov(formula = y ˜ A)
Student’s t-testt.test(c(0.1, 0.11, 0.9, 0.8), c(2.1, 2.0, 1.5))
graphicsplot(sin, 0, 7)
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R data
methods for reading / writing data
file content:col1 col2 col3 ...
row1 1.2 8.5 -2.0 ...row2 2.2 -6.1 3.2 ......
tabular dataread.table("file", header = TRUE, row.names = 1)
write.table(dataframe)data import
package foreign - for e.g. Octave datarelational databases
packages RPgSQL, RdbiPgSQL, RSQLite, PL/RBioConductor
for microarray data
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R extensions
R packages system
extending Rusage of the R language andcompiled languages C/C++, fortran
extern interfaceZ <- .Fortran("fncnam", ..., PACKAGE="pkg")Z <- .C("functionname", ..., PACKAGE="pkg")
subroutine fncnam(matrix, size1, size2, result)integer size1, size2double precision matrix1(size1,size2), result
...result = 3.14end
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R networking
parallel programming interface for R
R snowsimple network of workstationscan be used with MPI, PVM, sockets
functionslibrary(snow)
cl <- makeCluster(2, type = "MPI")clusterCall(cl, function() Sys.info())clusterEvalQ(cl, library(boot))clusterApply(cl, 1:2, get("+"), 3)stopCluster(cl)
Martin Saturka www.Bioplexity.org Bioinformatics - Software
R packages
the comprehensive R archive network
CRANarchive of packages for R
survival models, time-seriesbootstrapping, samplingvarious clustering methodsdatabase interfacesquadratic programming
bioinformaticsmicroarray processingbioconductor.orgbioplexity.org
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Python
the current standard open-source scripting language
www.python.org
characteristicscan be viewed as a usable Java replacementdynamic, object-oriented, extensible languagegluing tool with many usable packageshigh-level language, not for a low-level work
package interfacesuser interface: TkInter, wxPythondatabase: DBI, SQLAlchemy, SQLObjectalgorithms: Boost library interfacenumber cruncing: NumPy/SciPy, MPI, RPy
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Python example
sample python code - usage of VTK libraries
import vtk
def setImageWrite(self, ftyp, fname):if ("PS" == ftyp):
wobj = vtk.vtkGL2PSExporter()wobj.SetFilePrefix(fname)wobj.SetRenderWindow(self.renWin)wobj.Write()
else:wobj = vtk.vtkPNGWriter()ffname = fname + ".png"w2i = vtk.vtkWindowToImageFilter()w2i.SetInput(self.renWin)w2i.Update()wobj.SetFileName(ffname)wobj.SetInput(w2i.GetOutput())self.renWin.Render()wobj.Write()
return
Martin Saturka www.Bioplexity.org Bioinformatics - Software
RPy
R interface to Python
rpy.sourceforge.net
package descriptionsimple / robust interfaceR objects available in Pythonall the R functions availableR modules available as well
package usage
import rpyrpy.r.t test([0.1, 0.11, 0.9, 0.8], [2.1, 2.0, 1.5])rpy.r.plot(rpy.r.sin, 0, 7)
Martin Saturka www.Bioplexity.org Bioinformatics - Software
Items to remember
Nota bene:
programming approaches
Softwarescripting, algebra systemsmolecular, bioinformatic tools
R systemstatistics models, datapackages, python interface
Martin Saturka www.Bioplexity.org Bioinformatics - Software