Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique
LF-2005.02
Introduction to Bioinformatics
SIB and EMBnet Bioinformatics resources for biomedical scientists
The Swiss Institute of Bioinformatics
Founded in March 1998
Collaborative structure: Lausanne - Geneva - Basel
Groups at ISREC, Ludwig Institute, Unil, HUG, UniGe, UniBas and soon ETHZ
Several roles: teaching, services, research
Currently: ~160 employees
Projects at SIB
Databases
  SWISS-PROT, PROSITE, EPD, World-2DPAGE, SWISS-MODEL
  TrEST, TrGEN (predicted proteins), tromer (transcriptome)
Software
  Melanie, Deep View, proteomic tools, ESTScan, pftools, Java applets
Services
  Web servers: ExPASy, EMBnet, MyHits
  Teaching and helpdesk
Research
  Mostly sequence and expression analysis, 3D structure, and proteomics
Teaching
Master degrees in Bioinformatics (Bologna type): 90 ECTS credits at UniGe, Unil and UniBas
EMBnet courses: 4 x 1 week per year in Lausanne, Basel, Bern or Zürich
Undergraduate courses at the Universities of Geneva, Fribourg and Lausanne
Other courses at CHUV and EPFL
Courses in other countries: Colombia, Cambodia, Peru, …
Research
New algorithms (faster alignments…)
New technology (GRID or cluster computing)
New tools (protein analysis, microarrays, confocal microscopy)
New databases (microarrays, transcriptome, proteome)
Collaborations with lab researchers!
Three levels of services
Simple web access to software and databases
  Easy to use for basic, occasional research with few sequences
  Potentially insecure
Command-line access with a local Unix account
  More powerful (automation) and secure
  Requires an understanding of the Unix system and frequent practice
Collaboration with SIB
  Access to experts in the field (help desk)
  For projects requiring extensive programming or special hardware resources
Help [email protected] or http://www.expasy.org/contact.html
SIB’s important sites
Home: www.isb-sib.ch
ExPASy - Expert Protein Analysis System: www.expasy.org
MyHits database and tools: myhits.isb-sib.ch
EMBnet Switzerland: www.ch.embnet.org
Geneva Bioinformatics: www.genebio.ch
SIB home
Expert Protein Analysis System
MyHits http://myhits.isb-sib.ch
Swiss node http://www.ch.embnet.org
EMBnet organisation
Started as a European network in 1988, now spread worldwide
32 country nodes, 8 special nodes
Role
  Training, education (EMBER)
  Software development (EMBOSS, SRS)
  Computing resources (databases, websites, services)
  Helpdesk and technical support
  Publications (EMBnet.news, Briefings in Bioinformatics)
Access: www.embnet.org
  Each node is at “www.xx.embnet.org”, where xx is the country code (e.g., ch for Switzerland)
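The node naming rule above can be written as a one-line helper. This is only a sketch of the pattern quoted on the slide: the function name `embnet_node_host` is made up for illustration, and it is an assumption that a given node actually answers at the generated address.

```python
def embnet_node_host(country_code: str) -> str:
    """Return the host name of a national EMBnet node.

    Follows the "www.xx.embnet.org" pattern from the slide, where xx is a
    two-letter country code. Hypothetical helper, for illustration only.
    """
    code = country_code.strip().lower()
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"expected a two-letter country code, got {country_code!r}")
    return f"www.{code}.embnet.org"


print(embnet_node_host("ch"))  # www.ch.embnet.org
```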
EMBnet home
European Molecular Biology Open Software Suite
Free and open source (for most Unix platforms)
GCG successor (compatible with the GCG file format)
More than 150 programs (ver. 2.9.0)
Easy to install locally
  but no interface, requires local databases
  Unix command-line only
Interfaces
  Jemboss, wEMBOSS, www2gcg, w2h… (with account)
  Pise, EMBOSS-GUI, SRSWWW (no account)
  Staden, Kaptain, CoLiMate, Jemboss (local)
Access: www.emboss.org or emboss.sourceforge.net
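Since EMBOSS programs are driven from the Unix command line, a script usually just assembles the arguments and hands them to the shell. The sketch below builds a call to `needle`, the EMBOSS global pairwise alignment program, without running it. The file names are hypothetical, the gap penalties are example values, and a local EMBOSS installation is assumed for the commented-out execution step.

```python
import shlex


def needle_command(seq_a, seq_b, outfile, gapopen=10.0, gapextend=0.5):
    """Build the argument list for an EMBOSS 'needle' run (not executed here)."""
    return [
        "needle",
        "-asequence", seq_a,      # first input sequence file
        "-bsequence", seq_b,      # second input sequence file
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", outfile,      # where the alignment is written
    ]


# Hypothetical file names, for illustration only.
cmd = needle_command("a.fasta", "b.fasta", "a_vs_b.needle")
print(shlex.join(cmd))

# On a machine where EMBOSS is installed, the command could then be run with:
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than one string avoids quoting problems when file names contain spaces.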
Other important sites
ExPASy - Expert Protein Analysis System: www.expasy.org
EBI - European Bioinformatics Institute: www.ebi.ac.uk
NCBI - National Center for Biotechnology Information: www.ncbi.nlm.nih.gov
Sanger - The Sanger Institute: www.sanger.ac.uk
Bioinformatics: definition
Every application of computer science to biology
  Sequence analysis, image analysis, sample management, population modelling, …
Analysis of data coming from large-scale biological projects
  Genomes, transcriptomes, proteomes, metabolomes, etc.
The new biology
Traditional biology
  Small team working on a specialized topic
  Well-defined experiment to answer precise questions
New « high-throughput » biology
  Large international teams using cutting-edge technology defining the project
  Results are released raw to the scientific community, without any underlying hypothesis
Examples of « high-throughput »
Complete genome sequencing
Large-scale sampling of the transcriptome (EST)
Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE)
Large-scale sampling of the proteome
Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)
Large-scale 3D structure production (yeast)
Metabolism modelling
Simulations
Biodiversity
Role of bioinformatics
Control and management of the data
Analysis of primary data, e.g.
  Base calling from chromatograms
  Mass spectra analysis
  DNA microarray image analysis
Statistics
Database storage and access
Results analysis in a biological context
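To make "base calling from chromatograms" concrete, here is a deliberately naive sketch: at each peak position it reports the base whose trace is strongest. The trace values and peak positions are invented, and real base callers also detect the peaks themselves and attach a quality score to every call, none of which is attempted here.

```python
BASES = "ACGT"


def call_bases(traces, peak_positions):
    """Call one base per peak: the channel with the highest intensity wins.

    traces: dict mapping each base to a list of intensities along the run.
    peak_positions: indices into those lists where a peak was detected.
    """
    called = []
    for pos in peak_positions:
        best = max(BASES, key=lambda base: traces[base][pos])
        called.append(best)
    return "".join(called)


# Invented four-channel traces with three peaks (toy data).
traces = {
    "A": [0, 9, 1, 0, 0, 1, 0],
    "C": [0, 1, 0, 8, 0, 0, 0],
    "G": [0, 0, 0, 1, 0, 7, 1],
    "T": [0, 0, 0, 0, 0, 2, 0],
}
peaks = [1, 3, 5]
print(call_bases(traces, peaks))  # ACG
```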
First information: a sequence?
Nucleotide
  RNA (or cDNA)
  Genomic (intron-exon)
  Complete or incomplete?
    mRNA with 5’ and 3’ UTR regions
    Entire chromosome
Protein
  Pre/Pro or functional protein?
  Function prediction
  Post-translational modifications?
  Holy Grail: 3D structure?
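The "complete or incomplete?" question for an mRNA can be illustrated by trying to split a cDNA into 5’ UTR, coding sequence and 3’ UTR. The sketch below makes a strong simplifying assumption, namely that translation starts at the first ATG and ends at the first in-frame stop codon; the input sequence is invented, and real annotation uses far more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_mrna(cdna):
    """Split a cDNA into (5' UTR, CDS, 3' UTR).

    Simplification: the CDS runs from the first ATG to the first in-frame
    stop codon. Returns None when no start or no in-frame stop is found,
    which may indicate an incomplete sequence.
    """
    seq = cdna.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            end = i + 3  # include the stop codon in the CDS
            return seq[:start], seq[start:end], seq[end:]
    return None


# Invented example sequence.
utr5, cds, utr3 = split_mrna("GGCACCATGGCTTGGAAATAAGCTTTTAAA")
print(utr5, cds, utr3)  # GGCACC ATGGCTTGGAAATAA GCTTTTAAA
```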
Genomes in numbers
Sizes:
  virus: 10^3 to 10^5 nt
  bacteria: 10^5 to 10^7 nt
  yeast: 1.35 x 10^7 nt
  mammals: 10^8 to 10^10 nt
  plants: 10^10 to 10^11 nt
Gene number:
  virus: 3 to 100
  bacteria: ~1000
  yeast: ~7000
  mammals: ~30’000
  plants: 30’000-50’000?
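Dividing genome size by gene number gives a rough feel for how gene density differs between organisms. The short calculation below uses the yeast figures from this slide and, for a mammal, the 3 x 10^9 nt haploid human genome size quoted on the "Human genome" slide together with ~30’000 genes. The helper name `nt_per_gene` is just for illustration, and the result is an average spacing, not a gene length.

```python
def nt_per_gene(genome_size_nt, gene_count):
    """Average number of nucleotides of genome per gene."""
    return genome_size_nt / gene_count


genomes = {
    "yeast": (1.35e7, 7_000),
    "human": (3e9, 30_000),
}
for name, (size_nt, genes) in genomes.items():
    print(f"{name}: ~{nt_per_gene(size_nt, genes):,.0f} nt per gene")
# yeast: ~1,929 nt per gene
# human: ~100,000 nt per gene
```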
Sequencing projects
« small » genomes (<10^7 nt): bacteria, viruses
  Many already sequenced (industry excluded)
  More than 150 microbial genomes already in the public domain
  More to come! (one new every two weeks…)
« large » genomes (10^7-10^10 nt): eukaryotes
  >30 finished (S. cerevisiae, S. pombe, E. cuniculi, G. theta, C. elegans, D. melanogaster, A. gambiae, P. falciparum, P. yoelii, D. rerio, F. rubripes, A. thaliana, O. sativa (2x), M. musculus, Homo sapiens, P. troglodytes, R. norvegicus, C. familiaris, G. gallus…)
  Many more to come: cat, elephant, pig, cow, maize (and other plants), insects, fishes, many pathogenic parasites (Leishmania…)
EST sequencing
  Partial mRNA sequences
  ~40 x 10^6 sequences in the public domain
Human genome
Size: 3 x 10^9 nt for a haploid genome
Highly repetitive sequences 25%, moderately repetitive sequences 25-30%
Size of a gene: from 900 to >2’000’000 bases (introns included)
Proportion of the genome coding for proteins: 5-7%
Number of chromosomes: 22 autosomes, 1 sex chromosome
Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
[Figure: schematic of a chromosome, labelled with centromere, telomere, exons of a gene, regulatory elements, repetitive sequences and locus control region]
How to sequence the human genome?
Consortium « international » approach:
  Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences
  Generate a physical map based on large clones (BAC or PAC)
  Sequence enough large clones to cover the genome
« commercial » approach (Celera):
  Generate random libraries of fixed-length genomic clones (2 kb and 10 kb)
  Sequence both ends of enough clones to obtain 10x coverage
  Use computer techniques to reconstitute the chromosomal sequences, and check against the public project's physical map
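The "10x coverage" target translates into a read count through coverage = N x L / G, with N reads of length L over a genome of size G. The sketch below applies this to the 3 x 10^9 nt haploid human genome; the 500 nt read length is an assumption chosen for illustration, not a figure from the slides. The second function is the standard Poisson (Lander-Waterman style) estimate of the fraction of bases that no read touches, which idealises reads as uniformly random.

```python
import math


def reads_for_coverage(genome_size, read_length, coverage):
    """Smallest number of reads N with N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)


GENOME_SIZE = 3e9   # haploid human genome, in nt
READ_LENGTH = 500   # ASSUMPTION: illustrative read length, not from the slides

n_reads = reads_for_coverage(GENOME_SIZE, READ_LENGTH, 10)
print(n_reads)                                   # 60000000
print(f"{expected_uncovered_fraction(10):.2e}")  # 4.54e-05
```

Because both ends of each clone are sequenced, the number of clones needed is about half the number of reads.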
Interpretation of the human draft
All chromosomes considered as finished
Even a genomic sequence does not tell you where the genes are encoded. The genome is far from being « decoded »
One must combine genome and transcriptome to have a better idea
Last freeze: NCBI34, July 2003
The transcriptome
The set of all functional RNAs (tRNA, rRNA, mRNA etc…) that can potentially be transcribed from the genome
The documentation of the localization (cell type) and conditions under which these RNAs are expressed
The documentation of the biological function(s) of each RNA species
Public draft transcriptome
Information about the expression specificity and the function of mRNAs
  « full » cDNA sequences of known function
  « full » cDNA sequences (HTC), but « anonymous » (e.g. KIAA or DKFZ collections)
  EST sequences
    cDNA libraries derived from many different tissues
    Rapid random sequencing of the ends of all clones
  ORESTES sequences
Growing set of expression data (microarrays, SAGE etc…)
Increasing evidence for multiple alternative splicing and polyadenylation
Example mapping of ESTs and mRNAs
[Figure: tracks labelled ESTs, mRNAs and Computer prediction]
The proteome
Set of proteins present in a particular cell type under particular conditions
Set of proteins potentially expressed from the genome
Information about the specific expression and function of the proteins
Information on the proteome
Separation of a complex mixture of proteins
  2D PAGE (IEF + SDS PAGE)
  Capillary chromatography
Individual characterisation of proteins
  Tryptic peptide signature (MS)
  Sequencing by chemistry or MS/MS
All post-translational modifications (PTMs)!
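A tryptic peptide signature starts from the set of peptides that trypsin would produce. The sketch below performs an in-silico digest using the usual textbook rule: cleave after lysine (K) or arginine (R), except when the next residue is proline (P). It assumes complete digestion with no missed cleavages and ignores PTMs, and the input sequence is invented.

```python
def tryptic_digest(protein):
    """Return the peptides from an idealised tryptic digest.

    Rule used: cleave C-terminal to K or R unless the next residue is P.
    Assumes complete digestion (no missed cleavages).
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


# Invented sequence; the R followed by P is not cleaved.
print(tryptic_digest("MKWVTFRPLLKAG"))  # ['MK', 'WVTFRPLLK', 'AG']
```

Computing the mass of each peptide would then give the list of values to compare against the measured spectrum.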
Three-dimensional structures
Methods to determine structures
  X-ray crystallography
  NMR
Data format
  Atom coordinates (except H) in a Cartesian space
Databases
  For proteins and nucleic acids (RCSB, was PDB)
  Independent databases for sugars and small organic molecules
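The "atom coordinates in a Cartesian space" are stored in PDB files as fixed-column ATOM records, so they are read by column position rather than by splitting on spaces. The sketch below parses one record; the record itself is made up and is assembled field by field only to make the column layout visible. The function name `parse_atom_line` is illustrative.

```python
def parse_atom_line(line):
    """Extract (atom name, element, (x, y, z)) from a PDB ATOM/HETATM record.

    Uses the fixed columns of the PDB format: atom name in columns 13-16,
    x, y, z in columns 31-38, 39-46, 47-54, element symbol in columns 77-78.
    """
    if not line.startswith(("ATOM", "HETATM")):
        return None
    name = line[12:16].strip()
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])
    element = line[76:78].strip()
    return name, element, (x, y, z)


# Made-up record, written field by field to show the columns.
record = (
    "ATOM  "      # columns 1-6:   record name
    "    1 "      # columns 7-12:  atom serial number, blank
    " N  "        # columns 13-16: atom name
    " MET A"      # columns 17-22: altLoc, residue name, blank, chain
    "   1 "       # columns 23-27: residue number, insertion code
    "   "         # columns 28-30: blank
    "  27.340"    # columns 31-38: x
    "  24.430"    # columns 39-46: y
    "   2.614"    # columns 47-54: z
    "  1.00"      # columns 55-60: occupancy
    "  9.67"      # columns 61-66: temperature factor
    "          "  # columns 67-76: blank
    " N"          # columns 77-78: element symbol
)
print(parse_atom_line(record))  # ('N', 'N', (27.34, 24.43, 2.614))
```

Filtering on the element field is one way to skip hydrogens when a file does contain them.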
Visualisation of the structures
Secondary structure elements
  Alpha helices, beta sheets, other
Software
  Various representations (atoms, bonds, secondary…)
  Wide choice of commercial and free software (e.g., DeepView)
Sequence information, and so what?
How to store and organise?
  Databases (next lecture)
How to access, search, compare?
  Pairwise alignments, dot plots (Tuesday)
  BLAST searches in db (Tuesday)
  EST clustering (Wednesday)
  Multiple alignments (Wednesday)
  Patterns, PSI-BLAST, Profiles and HMMs (Thursday)
  Gene prediction (Thursday)
  Protein function prediction (Friday)
  Users' problems (Friday)
Thank you