Introduction to Bioinformatics
Lukas Mueller, Boyce Thompson Institute
What is bioinformatics?
● Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the application of computer science and information technology to the fields of biology and medicine.
● Bioinformatics deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics,
● for generating new knowledge of biology and medicine, and for improving and discovering new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm computing, cellular computing).
Bioinformatics can...
● Identify similar sequences
● Provide a putative function for a sequence
● Assemble sequences (genomes, transcriptomes)
● Annotate genomes
● Build networks of genes or metabolites
● Determine phylogenetic relationships
● Mine literature for biological information
● Uncover differences between two genomes
● Calculate how a protein folds
What can bioinformatics do for me?
● Speed up your research
● Enable you to ask new questions
● Majority of projects involve large datasets
● Basic knowledge of bioinformatics needed
● Extract information
● Transform information
● Run analyses
● Build hypotheses, etc.
● http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
200 GB / run
The digital revolution
Increase in sequence data
L. Stein, Genome Biology, 2010
Web-based bioinformatics
The next step: Running “locally”
● Perform analyses on large datasets
● Analyses run faster
● Output easier to handle
● Chain analyses
● More flexible
● Better control of parameters
● Needs more knowledge about your computer and tools!
Highly cited bioinformatics tools
● 1. BLAST (Altschul SF et al. 1990; 30,202 citations)
  ● Sequence search by homology/similarity.
● 2. CLUSTALW (Thompson JD et al. 1994; 32,681 citations)
  ● Multiple sequence alignment.
● 3. PAML (Yang ZH, 1997; 2,642 citations)
  ● “Maximum Likelihood” phylogenetic analysis.
● 4. GBROWSE (Stein LD et al. 2002; 428 citations)
  ● Genome visualization.
● 5. BLAST2GO (Conesa A et al. 2005; 363 citations)
  ● Sequence functional multi-annotation.
● 6. VELVET (Zerbino DR et al. 2008; 323 citations)
  ● Sequence assembly by de Bruijn graphs.
● 7. SAMTOOLS (Li H et al. 2009; 172 citations)
  ● Multi-sequence alignment processing for NGS.
● 8. SOAP2 (Li RQ et al. 2009; 76 citations)
  ● Sequence assembly (short reads).
● 9. MAKER (Cantarel BL et al. 2008; 23 citations)
  ● Genome annotation pipeline.
● 10. GALAXY (Goecks J et al. 2010; 20 citations)
  ● Genomic analysis platform that integrates several scripts and tools.
Running “pipelines”
Linux
● UNIX-like, free and open source operating system
● Very stable
● Adopted for most bioinformatics work
● Installed on laptops, clusters, supercomputers
● Can run on your computer!
● Virtualized or native
C, UNIX and Linux
• Ken Thompson and Dennis Ritchie, inventors of UNIX at Bell Labs, in front of a PDP-11 in the early 1970s.
• Linus Torvalds implemented an open source version of UNIX (Linux) while a student in Finland in the early 1990s.
Linux
UNIX – the terminal
● Runs the “shell”
● Built-in scripting
shell commands
● Powerful, but text based (CLI)
● Automate tasks, combine commands
● Look like gobbledegook:
    grep Niben /var/log/ftp | grep -i sca | sort -u | wc -l
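To make the pipeline above less cryptic, here is a minimal Python sketch of what its four stages do: keep lines containing "Niben", keep those that also contain "sca" in any letter case, drop duplicates, and count what is left. The function name and the sample log lines are hypothetical and only stand in for the contents of /var/log/ftp.

```python
def count_unique_matches(lines):
    """Roughly mimic: grep Niben | grep -i sca | sort -u | wc -l"""
    unique = set()                          # sort -u: keep each line once
    for line in lines:
        line = line.rstrip("\n")
        if "Niben" not in line:             # grep Niben (case-sensitive)
            continue
        if "sca" not in line.lower():       # grep -i sca (case-insensitive)
            continue
        unique.add(line)
    return len(unique)                      # wc -l: count remaining lines


# Hypothetical log lines, invented for illustration only.
sample_log = [
    "user1 RETR Niben_scaffolds.fasta",
    "user1 RETR Niben_scaffolds.fasta",     # exact duplicate, counted once
    "user2 RETR Niben_SCAFFOLDS.gff",       # matches because of -i
    "user3 RETR Niben_proteins.fasta",      # no "sca"
    "user4 RETR tomato_scaffolds.fasta",    # no "Niben"
]

print(count_unique_matches(sample_log))
```

The shell version is shorter because each command does one job and the pipe symbol chains them together; the Python version spells out the same steps one by one.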
Scripting
● Scripts: Small programs written by the end-user that control the execution of other programs or perform a simple algorithm
● Extremely flexible
● Written in Shell, Perl, Python
● You can write them yourself!!!
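As an illustration of a script that "performs a simple algorithm", the following Python sketch computes the GC content and the reverse complement of a DNA sequence. The function names are chosen for this example and are not taken from any particular library; ambiguous bases such as N are simply left unchanged.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gc_content(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    dna = "ATGC"
    print(gc_content(dna))
    print(reverse_complement(dna))
```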
Perl
● Versatile language
● Developed since the 1980s by Larry Wall
● Useful for bioinformatics and web development
● Support for objects
● Excellent integration of regular expressions (text handling language)
● Vast open source code library (http://cpan.org/)
● BioPerl
● Easy to learn
● http://www.perl.org/
Example
.....
R
● Language designed for statistics
● Support for matrix calculations, graphics
● Expression analysis, next-gen sequence analysis, graphics, genome annotation statistics, phylogeny
● Interactive
● Bioconductor package
Databases
● Biological data is highly structured
● Relational database systems (postgres, mysql)
● Database schemas - normalization
● SQL
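The points above can be sketched with a tiny normalized schema. The example uses Python's built-in sqlite3 module purely so that it runs without a server; it is a stand-in for postgres or mysql, not a recommendation. The table names, column names and gene names are hypothetical.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Normalized schema: the species name is stored once in "organism",
# and each gene refers to it by key instead of repeating the text.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    species     TEXT NOT NULL UNIQUE
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

conn.execute(
    "INSERT INTO organism (organism_id, species) VALUES (1, 'Solanum lycopersicum')"
)
conn.executemany(
    "INSERT INTO gene (name, organism_id) VALUES (?, ?)",
    [("geneA", 1), ("geneB", 1)],           # placeholder gene names
)

# SQL join: recombine the two tables at query time.
rows = conn.execute("""
    SELECT g.name, o.species
    FROM gene AS g
    JOIN organism AS o ON g.organism_id = o.organism_id
    ORDER BY g.name
""").fetchall()

for name, species in rows:
    print(name, species)
```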
Transcriptomics and sequence assembly
● RNA-Seq technology and genome sequencing using next generation sequencing
● Experimental design, multiplexing
● Special tools developed
  ● Sequence preprocessing
  ● Aligners such as bwa, novoalign
  ● Assemblers such as newbler, mira, velvet
  ● Viewers
  ● File conversions
● Evaluation of assemblies
● Structural and functional annotation
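One common way to evaluate an assembly is the N50 statistic, which the slides do not name; it is used here as an assumed example. N50 is the contig length L such that contigs of length L or longer contain at least half of all assembled bases. A minimal Python sketch, with a hypothetical function name:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    if not contig_lengths:
        return 0
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths for illustration.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

N50 says nothing about whether the contigs are correct, so in practice it is read alongside other checks.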
Phylogenetics and comparative genomics
● How do sequences/genomes relate to each other?
● Align sequences
  ● ClustalW
  ● Muscle
● Build phylogenetic trees
  ● Parsimony
  ● Neighbor joining
  ● Maximum likelihood
● Analyses
  ● Orthology
  ● Modes of selection
  ● Identification of SNP patterns
  ● Genome duplications
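Distance-based tree methods such as neighbor joining start from a matrix of pairwise distances between aligned sequences. The sketch below computes the simplest such distance, the proportion of differing sites (p-distance), skipping gap positions. It assumes the sequences are already aligned (for instance by ClustalW or Muscle); the function names and the toy alignment are hypothetical, and real analyses usually apply a substitution model rather than raw p-distances.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(aligned):
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    return {
        (n1, n2): p_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


if __name__ == "__main__":
    # Toy alignment, invented for illustration.
    alignment = {
        "seqA": "ACGTACGT",
        "seqB": "ACGTACGA",
        "seqC": "AC-TTCGA",
    }
    for pair, dist in distance_matrix(alignment).items():
        print(pair, round(dist, 3))
```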
Beyond this course
● BTI Perl Club
● If you have a bioinformatics question, please let us know!
● http://btiplantbioinfocourse.wordpress.com/