Genome Annotation and Databases
Genomic DNA sequence
Genomic annotation
BIO520 Bioinformatics Jim Lund
Reading: Ch 9, Ch 10
Genome Annotation
• Find known repeats
• Search for new repeated sequences
• Predict genes
– BLASTX
– Genewise, Fgenes, Genscan…
• Integrate other data sources.
Accuracy is highest in the “high homology” class
Genome annotation servers
• Integrate information from several maps
– DNA sequence (contigs, quality).
– Physical (cytogenetic, STS content).
– Genes (show gene annotations and evidence).
  • Several prediction programs.
  • Expressed sequence tags (ESTs, UniGene clusters)
  • Evidence (predicted, confirmed)
  • Non-coding RNA (ncRNA) transcripts.
– Variation (e.g., SNPs)
– Regions of shared synteny.
Data Release
• Human genome sequence released under the 1996 Bermuda rules
– Assembled sequence greater than 1000 bp long is deposited in a public database (GenBank/EMBL/DDBJ) within 24 hours
– No patents are filed
• Bermuda principles reaffirmed at January 2003 WT/NIH meeting
– Pre-release of data for all “community projects”
– Nature 421, 875 (2003)
– NHGRI: http://www.genome.gov/page.cfm?pageID=10506376
– WT: http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTD002751.htm
• Benefits of Open Data Access supported by OECD report
– http://dataaccess.ucsd.edu
Accessing the Genome
• Genome sequences are becoming available very rapidly
– Large and difficult to handle computationally
– Everyone expects to be able to access them immediately
• Bench Biologists
– Has my gene been sequenced?
– What are the genes in this region?
– Where are all the GPCRs?
– Connect the genome to other resources.
• Research Bioinformatics
– Give me a dataset of human genomic DNA.
– Give me a protein dataset.
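The bench biologist's question "What are the genes in this region?" comes down to an interval-overlap query over annotated features. A minimal sketch in Python, assuming 1-based inclusive coordinates; the `Gene` record, the `genes_in_region` helper, and the gene names and positions are all hypothetical and only illustrate the idea:

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_in_region(genes, chrom, start, end):
    """Return genes on `chrom` that overlap the closed interval [start, end]."""
    return [
        g for g in genes
        if g.chrom == chrom and g.start <= end and g.end >= start
    ]


# Hypothetical annotations, for illustration only.
GENES = [
    Gene("geneA", "chr1", 1_000, 5_000),
    Gene("geneB", "chr1", 4_500, 9_000),
    Gene("geneC", "chr1", 20_000, 25_000),
    Gene("geneD", "chr2", 1_000, 3_000),
]

hits = genes_in_region(GENES, "chr1", 4_800, 10_000)
print([g.name for g in hits])
```

A linear scan like this is fine for a few thousand features; a genome browser would more likely rely on an indexed store, which is part of the engineering challenge of serving whole genomes.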
Getting information out
• Search/browse to find the gene or region.
• Export formats:
– Screen shot
– FASTA seq.
– GenBank file with features annotated
– Feature list (GFF, tab-delimited text)
– Pip (plot of sequence identity between organisms).
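Two of the export formats above are simple enough to handle with a few lines of code. The sketch below assumes the usual tab-delimited GFF layout (seqid, source, type, start, end, score, strand, phase, attributes) and plain FASTA with wrapped sequence lines. The `Feature` record, the helper names, and the sample records are hypothetical, not output from any particular server:

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    strand: str


def parse_gff(text):
    """Parse tab-delimited GFF lines, skipping comments and blank lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete feature line
        features.append(
            Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), cols[6])
        )
    return features


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [f">{name}"] + [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


# Hypothetical feature list, for illustration only.
GFF_TEXT = "\n".join([
    "##gff-version 2",
    "\t".join(["contig1", "Genscan", "exon", "1200", "1350", ".", "+", "0", "gene1"]),
    "\t".join(["contig1", "BLASTX", "similarity", "1180", "1400", "87.5", "+", ".", "hit1"]),
    "\t".join(["contig1", "Genscan", "exon", "2100", "2300", ".", "-", "2", "gene2"]),
])

features = parse_gff(GFF_TEXT)
exons = [f for f in features if f.feature_type == "exon"]
print(len(features), "features,", len(exons), "exons")
print(to_fasta("contig1_fragment", "ACGTAC", width=4), end="")
```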
Challenges
• Scale and data flow
– Presentation, ease of use.
– Engineering problems.
– User interface design.
• Algorithmic
– Partly engineering (pre-compute hard computations, etc.)
– Partly research.
NCBI sequence assembly (sequence chromosome)
• Remove contaminants
• Bin by chromosome arms
• Sequence Layout
• Sequence Building
• Place on chromosomes
http://www.ncbi.nlm.nih.gov/genome/guide/build.shtml
NCBI sequence assembly - a modified greedy approach
Sequence Layout
• Curated Finished Regions
• Curated assembly instructions
• MegaBLAST hits
• Consider clone order
• BAC chromosome assignment
– annotation
– STS markers
– personal communication
• Remove conflicting overlaps, redundant BACs
Sequence Building
• Consider fragment:fragment sequence overlaps for each BAC pair in layout
• Meld overlapping sequence
• Order and Orient (o+o):
– alignments (mRNA, EST)
– BAC annotation
– paired plasmid reads
[Figure labels: BAC Sequence, Fragments, Assemble, Order, NCBI Contig]
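To make the "greedy" idea concrete, here is a toy sketch of textbook greedy overlap merging: repeatedly meld the pair of fragments with the longest exact suffix-prefix overlap. This is not NCBI's modified procedure, which also uses curated instructions, clone order, and alignment evidence. The function names, the `min_len` threshold, and the example fragments are assumptions made for illustration:

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if n > 0 and a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Greedily merge the best-overlapping pair until no overlap >= min_len remains."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags  # nothing left to meld
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


# Hypothetical fragments, for illustration only.
contigs = greedy_assemble(["ATGCGTAC", "GTACCTGA", "CTGATTAG"])
print(contigs)
```

The sketch ignores sequencing errors, reverse-complement orientation, and fragments contained inside others, which is where the curated layout and order-and-orient evidence listed above come in.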
NCBI Genome Build Process
[Flowchart labels: Input Data (Sequences, Curated NTs, TPF, BLAST hits); Exclude problem accessions; Assembly; Contig Build & Release; Freeze; Annotation; STS; Clones; LocusLink; RefSeq; GenBank; GenomeScan; dbSNP; Collaboration; Curation; Update (Links, gi’s); Prepare for release; Sequences (contig, mRNA, protein); Analysis & Review; Corrections for next build; Resource Updates; Input Resources; Public Release; FTP; BLAST; Map Viewer]
What is being annotated?
Feature: Method
Genes: By alignment, by prediction
Markers: By ePCR
Clones/Cytogenetic location: By alignment (BAC ends)
Variation: By alignment
Phenotype (MIM): Via gene identification, associated markers
Cytogenetic Position: By annotated BAC-END sequenced clones; by FISH-mapped clones used in assembly
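Placing markers "by ePCR" means searching the sequence for a primer pair in the right orientation and at a plausible product size. A simplified sketch, assuming exact primer matches on the forward strand only (real e-PCR tools tolerate mismatches and search both strands). The function names, primers, and sequence are hypothetical:

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def epcr(sequence, fwd_primer, rev_primer, min_size, max_size):
    """Return (start, end) 0-based half-open coordinates of predicted PCR products."""
    rev_site = reverse_complement(rev_primer)  # how the reverse primer appears on this strand
    hits = []
    start = sequence.find(fwd_primer)
    while start != -1:
        site = sequence.find(rev_site, start + len(fwd_primer))
        while site != -1:
            end = site + len(rev_site)
            size = end - start
            if size > max_size:
                break  # later sites only give longer products
            if size >= min_size:
                hits.append((start, end))
            site = sequence.find(rev_site, site + 1)
        start = sequence.find(fwd_primer, start + 1)
    return hits


# Hypothetical marker and sequence, for illustration only.
SEQ = "TTT" + "ACGTAC" + "G" * 10 + "TTGACC" + "AAA"
print(epcr(SEQ, "ACGTAC", "GGTCAA", min_size=20, max_size=30))
```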
RefSeq: a reagent for Contig Annotation
[Figure labels: GenomeScan, ESTs, TBLASTN, RPSBLAST, RefSeq mRNAs, GenBank mRNAs, genome]
RefSeq Advantages:
• Separate Gene Families
• Not Partial
• Means to correct problem sequences
RefSeq process results in excluding problem GenBank sequences from annotation pipeline
Potential Problems With ESTs:
• Gene Families
• Partial
• Chimeric
• Intron read-through
• Linker
• Vector
• Wrong organism
NCBI: Products of annotation
• RefSeqs (transcripts, proteins)
• Gene id (LocusID)
• features in chromosome coordinates
• features in contig (NT accession) coordinates
Available in:
• Map Viewer
– Graphical display
– Tabular display
– Sequence downloads
• FTP
– RefSeqs (contigs, transcripts, proteins)
– Mapping Data
– LocusLink & Other resources
Query by sequence: Review the alignment
A click away:
• Alignments (BLAST hit)
• Gene Description (LocusLink)
• Report of all features in the region
• Contig sequence
• Sequence in the region
• Other mRNAs aligning in the region
• Define your own gene model based on alignments in the region
Quality Control - Genome review
• Is the sequence correct?
• Is the feature correctly placed?
• Is there a feature that should be placed?
• Are the attributes of the feature correct?
Approaches:
• In-house analysis & review (manual curation)
• Shared information (UCSC/Ensembl)
• Solicited review by experts in local regions
Ensembl Annotation pipeline
• Set of high quality gene predictions
– From known human mRNAs aligned against genome
– From similar protein and mRNAs aligned against genome
– From Genscan predictions confirmed via BLAST of protein, cDNA, and EST databases.
• Initial functional annotation from InterPro
• Integration with external resources (SNPs, SAGE, OMIM)
• Comparative analysis between mouse/human
– DNA sequence alignment
– Protein orthologs
Ensembl gene prediction pipeline
[Flowchart labels: DNA; RepeatMasker; Genscan; BLAST Genscan peptides vs. protein, UniGene, EST, vertebrate mRNA; Pmatch all human proteins and cDNAs; MiniGenewise; MiniEst2genome; Genes]
Genome Annotation
The generic structure of an automatic genome annotation pipeline and delivery system
Useful genomic annotation and browser URLs
EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/
NCBI Human Genome Browser: http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=hum_chr.inf&query
The Oak Ridge National Laboratories Genome Channel: http://compbio.ornl.gov/channel/
UCSC Human Genome Browser: http://genome.ucsc.edu/cgi-bin/hgGateway
The Institute for Genomic Research (TIGR): http://www.tigr.org/
Genome annotation - things still being worked out -
• Annotation servers.
– Pro: make genomics information accessible to biologists without expert bioinformatics skills.
– Con: makes it difficult to perform large-scale data mining.
– Solution: enable more experienced users to retrieve the data they require and to run analyses locally.
• Open annotation systems.
– Biologists need to have access to annotations available in the community and to share their own contributions with the community.
– A common protocol between systems that enables genome data to be freely exchanged
  • AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
  • Distributed Annotation System (DAS) projects
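As a rough illustration of what a common exchange protocol buys: in DAS, a client asks any compliant server for the features on a sequence segment using the same URL pattern. The sketch below only builds such a request URL, assuming the DAS 1.x style `.../das/<source>/features?segment=<ref>:<start>,<stop>`. The server name and data source are hypothetical, and the pattern should be checked against the DAS specification before use:

```python
def das_features_url(server, source, reference, start, stop):
    """Build a DAS-style 'features' request URL for one sequence segment.

    Assumes the DAS 1.x URL convention; `server` and `source` are placeholders.
    """
    base = server.rstrip("/")
    return f"{base}/das/{source}/features?segment={reference}:{start},{stop}"


# Hypothetical server and data source, for illustration only.
url = das_features_url("http://das.example.org", "human", "1", 1000, 2000)
print(url)
```

Because every server answers the same kind of request, a browser can overlay annotations from several groups on one reference sequence without each group adopting the others' database schema.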
Genome annotation servers
• Several ways to find information:– Search by clone, gene, EST, marker.
– Browse sequence.
– BLAST searches.
– Homology: start in one organism, jump to the syntenic region of another.