introduction to bioinformatics · bioinformatics deals with algorithms, databases and information...

Introduction to Bioinformatics

Lukas Mueller, Boyce Thompson Institute

What is bioinformatics?

● Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the application of computer science and information technology to the fields of biology and medicine.

Bioinformatics deals with

● algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics,

● for generating new knowledge of biology and medicine, and improving and discovering new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm-computing, cellular-computing).

Bioinformatics can...

● Identify similar sequences
● Provide a putative function for a sequence
● Assemble sequences (genomes, transcriptomes)
● Annotate genomes
● Build networks of genes or metabolites
● Determine phylogenetic relationships
● Mine literature for biological information
● Uncover differences between two genomes
● Calculate how a protein folds

What can bioinformatics do for me?

● Speed up your research
● Enable you to ask new questions

● Majority of projects involve large datasets
● Basic knowledge of bioinformatics needed

● Extract information
● Transform information
● Run analyses
● Build hypotheses, etc.

● http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

200 GB / run

The digital revolution

Increase in sequence data

L. Stein, Genome Biology, 2010

Web-based bioinformatics

The next step: Running “locally”

● Perform analyses on large datasets
● Analyses run faster
● Output easier to handle
● Chain analyses
● More flexible
● Better control of parameters
● Needs more knowledge about your computer and tools!

Highly cited bioinfo tools

● 1. BLAST (Altschul SF. et al. 1990; 30,202 citations)
  ● Sequence search by homology/similarity.
● 2. CLUSTALW (Thompson JD. et al. 1994; 32,681 citations)
  ● Multiple sequence alignment.
● 3. PAML (Yang ZH, 1997; 2,642 citations)
  ● “Maximum Likelihood” phylogenetic analysis.
● 4. GBROWSE (Stein LD, et al. 2002; 428 citations)
  ● Genome visualization.
● 5. BLAST2GO (Conesa A. et al. 2005; 363 citations)
  ● Sequence functional multi-annotation.
● 6. VELVET (Zerbino DR, et al. 2008; 323 citations)
  ● Sequence assembly by de Bruijn graphs.
● 7. SAMTOOLS (Li H. et al. 2009; 172 citations)
  ● Multi-sequence alignment processing for NGS.
● 8. SOAP2 (Li RQ et al. 2009; 76 citations)
  ● Sequence assembly (short reads).
● 9. MAKER (Cantarel BL, et al. 2008; 23 citations)
  ● Genome annotation pipeline.
● 10. GALAXY (Goecks J. et al. 2010; 20 citations)
  ● Genomic analysis platform that integrates several scripts and tools.
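To make the "assembly by de Bruijn graphs" idea concrete, here is a minimal Python sketch of how such a graph can be built from reads. It is an illustration only, not Velvet's actual implementation; the function name `de_bruijn_graph`, the toy reads, and the choice of k = 3 are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two made-up overlapping reads (hypothetical data).
reads = ["ATGGC", "TGGCA"]
graph = de_bruijn_graph(reads, k=3)

for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Edges that appear more than once (here TG -> GG) reflect k-mers seen in several reads; a real assembler would go on to simplify the graph and walk paths through it to produce contigs.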

Running “pipelines”

Linux

● UNIX-like, free and open source operating system
● Very stable
● Adopted for most bioinformatics work
● Installed on laptops, clusters, supercomputers
● Can run on your computer!
  ● Virtualized or native

C, UNIX and Linux

• Ken Thompson and Dennis Ritchie, inventors of UNIX, at Bell Labs in front of a PDP-11, early 1970s.

• Linus Torvalds implemented an open source version of UNIX (Linux) while a student in Finland in the 1990s.

UNIX – the terminal

● Runs the “shell”
● Built-in scripting

shell commands

● Powerful, but text based (CLI)
● Automate tasks, combine commands
● Look like gobbledegook:

grep Niben /var/log/ftp | grep -i sca | sort -u | wc -l
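The pipeline above is less cryptic when read stage by stage: keep lines containing "Niben", keep those that also contain "sca" in any letter case, drop duplicate lines, and count what is left. A rough Python sketch of the same logic follows. The function name `count_unique_matches` and the sample log lines are made up for illustration; the real content of /var/log/ftp is not known here.

```python
import re


def count_unique_matches(lines, first="Niben", second="sca"):
    """Mimic: grep first | grep -i second | sort -u | wc -l"""
    second_re = re.compile(second, re.IGNORECASE)
    # A set keeps each distinct line once, which is what `sort -u` achieves.
    kept = {
        line
        for line in lines
        if first in line and second_re.search(line)
    }
    return len(kept)


# Hypothetical log lines, invented for this example.
log_lines = [
    "GET Niben.genome.scaffolds.fasta",
    "GET Niben.genome.scaffolds.fasta",
    "GET Niben.genome.SCAFFOLDS.gff",
    "GET Niben.genome.proteins.fasta",
    "GET Slyc.genome.scaffolds.fasta",
]

print(count_unique_matches(log_lines))
```

The shell version does this in one line because each command handles exactly one stage and the pipe symbol passes the output of one stage to the next.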

Scripting

● Scripts: Small programs written by the end-user that control the execution of other programs or perform a simple algorithm

● Extremely flexible
● Written in Shell, Perl, Python
● You can write them yourself!!!
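As an illustration of a script that "performs a simple algorithm", the following Python sketch reads FASTA-formatted text and reports the length and GC content of each sequence. The function names `read_fasta` and `gc_content` and the two example records are assumptions made for this sketch, not part of any particular toolkit.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up example data.
example = """>seq1 made-up example
ATGC
GGCC
>seq2
ATAT
"""

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq), round(gc_content(seq), 2))
```

For real projects, established libraries such as BioPerl or Biopython already provide robust FASTA parsers; writing one by hand is mainly useful as a learning exercise.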

Perl

● Versatile language
● Developed since 1980s by Larry Wall
● Useful for bioinformatics and web development
● Support for objects
● Excellent integration of regular expressions (text handling language)
● Vast open source code library (http://cpan.org/)
  ● BioPerl
● Easy to learn
● http://www.perl.org/
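Regular expressions are what make text handling in scripting languages so convenient for sequence data. The sketch below shows the idea in Python rather than Perl (a choice made for this illustration, not something taken from the slides): it finds every occurrence of the EcoRI recognition site GAATTC in a sequence. The function name `find_sites` and the test sequence are hypothetical.

```python
import re


def find_sites(sequence, pattern=r"GAATTC"):
    """Return 0-based start positions of every non-overlapping match."""
    return [m.start() for m in re.finditer(pattern, sequence.upper())]


# Made-up sequence containing two EcoRI sites.
seq = "ttGAATTCaaaccGAATTCg"
print(find_sites(seq))

# A character class matches alternatives at one position,
# e.g. GA followed by A or T, then TC.
print(find_sites("GATTC", pattern=r"GA[AT]TC"))
```

Perl offers the same pattern syntax built directly into the language, which is one reason it became popular for processing sequence files and program output.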


R

● Language designed for statistics
● Support for matrix calculations, graphics
● Expression analysis, Next-Gen sequence analysis, graphics, genome annotation statistics, phylogeny
● Interactive
● Bioconductor package

Databases

● Biological data is highly structured
● Relational database systems (PostgreSQL, MySQL)
● Database schemas - normalization
● SQL
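A small sketch can show what a normalized schema and an SQL query look like in practice. It uses Python's built-in `sqlite3` module with an in-memory SQLite database, swapped in here for PostgreSQL or MySQL so the example runs without a server. The table layout and the gene symbols (`geneA`, `geneB`, `geneC`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized design: organism names are stored once and
# referenced from the gene table by their id.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

conn.executemany(
    "INSERT INTO organism (organism_id, name) VALUES (?, ?)",
    [(1, "Solanum lycopersicum"), (2, "Nicotiana benthamiana")],
)
conn.executemany(
    "INSERT INTO gene (symbol, organism_id) VALUES (?, ?)",
    [("geneA", 1), ("geneB", 1), ("geneC", 2)],
)

# Count genes per organism by joining the two tables.
rows = conn.execute("""
    SELECT o.name, COUNT(*) AS n_genes
    FROM gene AS g
    JOIN organism AS o ON o.organism_id = g.organism_id
    GROUP BY o.name
    ORDER BY o.name
""").fetchall()

for name, n_genes in rows:
    print(name, n_genes)
```

Keeping the organism name in its own table means a correction has to be made in only one place, which is the practical benefit of normalization.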

Transcriptomics and sequence assembly

● RNA-Seq technology and genome sequencing using next generation sequencing
● Experimental design, multiplexing
● Special tools developed
  ● Sequence preprocessing
  ● Aligners such as bwa, novoalign
  ● Assemblers such as newbler, mira, velvet
  ● Viewers
  ● File conversions
● Evaluation of assemblies
● Structural and functional annotation
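One widely used statistic for evaluating assemblies is N50: the contig length at which contigs of that length or longer contain at least half of all assembled bases. The slides do not say which metrics the course uses, so the Python sketch below is just one common example; the function name `n50` and the contig lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total reaches half the assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths, 300 bases in total.
contigs = [100, 80, 50, 30, 20, 10, 10]
print(n50(contigs))
```

N50 describes contiguity only. It says nothing about whether the contigs are correct, so it is usually read alongside other checks such as mapping reads back to the assembly.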

Phylogenetics and comparative genomics

● How do sequences/genomes relate to each other?
● Align sequences
  ● ClustalW
  ● Muscle
● Build phylogenetic trees
  ● Parsimony
  ● Neighbor joining
  ● Maximum likelihood
● Analyses
  ● Orthology
  ● Modes of selection
  ● Identification of SNP patterns
  ● Genome duplications
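Distance-based tree methods such as neighbor joining start from a matrix of pairwise distances between aligned sequences. The Python sketch below computes the simplest such distance, the proportion of differing sites (p-distance), for a toy alignment. The function names, the taxon labels and the sequences are assumptions made for illustration; real analyses normally use substitution models rather than raw p-distances.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Pairwise p-distances for every pair of names in the alignment."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


# Made-up aligned sequences.
alignment = {
    "taxonA": "ATGCATGC",
    "taxonB": "ATGCATGA",
    "taxonC": "ATG-TTGA",
}

distances = distance_matrix(alignment)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A tree-building program would take a matrix like this as input and repeatedly join the closest pair of taxa.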

Beyond this course

● BTI Perl Club

● If you have a bioinformatics question, please let us know!

● http://btiplantbioinfocourse.wordpress.com/
