
Upload: dylan-wilkins

Post on 28-Dec-2015

Genome Annotation and Databases

Genomic DNA sequence

Genomic annotation

BIO520 Bioinformatics Jim Lund

Reading Ch 9, Ch10

Genome Annotation

• Find known repeats
• Search for new repeated sequences
• Predict genes
  – BLASTX
  – Genewise, Fgenes, Genscan…
• Integrate other data sources.

Accuracy is highest in the “high homology” class.

Genome annotation servers

• Integrate information from several maps
  – DNA sequence (contigs, quality).
  – Physical (cytogenetic, STS content).
  – Genes (show gene annotations and evidence).
    • Several prediction programs.
    • Expressed sequence tags (ESTs, UniGene clusters)
    • Evidence (predicted, confirmed)
    • Non-coding RNA (ncRNA) transcripts.
  – Variation (e.g., SNPs)
  – Regions of shared synteny.

Data Release

• Human genome sequence released under the 1996 Bermuda rules
  – Assembled sequence greater than 1000 bp long is deposited in a public database (GenBank/EMBL/DDBJ) every 24 hours
  – No patents are filed
• Bermuda principles reaffirmed at the January 2003 WT/NIH meeting
  – Pre-release of data for all “community projects”
  – Nature 421, 875 (2003)
  – NHGRI: http://www.genome.gov/page.cfm?pageID=10506376
  – WT: http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTD002751.htm
• Benefits of Open Data Access supported by OECD report
  – http://dataaccess.ucsd.edu

Accessing the Genome

• Genome sequences are becoming available very rapidly
  – Large and difficult to handle computationally
  – Everyone expects to be able to access them immediately
• Bench biologists
  – Has my gene been sequenced?
  – What are the genes in this region?
  – Where are all the GPCRs?
  – Connect the genome to other resources.
• Research bioinformatics
  – Give me a dataset of human genomic DNA.
  – Give me a protein dataset.

Getting information out

• Search/browse to find the gene or region.
• Export formats:
  – Screen shot
  – FASTA sequence
  – GenBank file with features annotated
  – Feature list (GFF, tab-delimited text)
  – Pip (plot of sequence identity between organisms).
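A feature list exported as GFF is plain tab-delimited text, so it can be read with a few lines of code. The sketch below is a minimal illustration and not any browser's export tool: the `Feature` type, the `parse_gff` helper, and the two example rows are hypothetical, and only a handful of the standard GFF columns are kept.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str


def parse_gff(text: str) -> List[Feature]:
    """Parse tab-delimited GFF rows, skipping comments and blank lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # skip rows that do not look like GFF feature lines
        features.append(
            Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), cols[6])
        )
    return features


# Made-up example rows, for illustration only.
EXAMPLE = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1\n"
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\tParent=gene1\n"
)

genes = [f for f in parse_gff(EXAMPLE) if f.type == "gene"]
```

Filtering on `type` (here `gene`) is the usual first step when answering "what are the genes in this region?" from an exported feature list.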

Challenges

• Scale and data flow
  – Presentation, ease of use.
  – Engineering problems.
  – User interface design.
• Algorithmic
  – Partly engineering (pre-compute hard computations, etc.)
  – Partly research.

NCBI sequence assembly (sequence to chromosome)

• Remove contaminants

• Bin by chromosome arms

• Sequence Layout

• Sequence Building

• Place on chromosomes

http://www.ncbi.nlm.nih.gov/genome/guide/build.shtml

NCBI sequence assembly - a modified greedy approach

Sequence Layout
• Curated finished regions
• Curated assembly instructions
• MegaBLAST hits
• Consider clone order
• BAC chromosome assignment
  – annotation
  – STS markers
  – personal communication
• Remove conflicting overlaps, redundant BACs

Sequence Building
• Consider fragment:fragment sequence overlaps for each BAC pair in layout
• Meld overlapping sequence
• Order and orient (o+o):
  – alignments (mRNA, EST)
  – BAC annotation
  – paired plasmid reads
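To make the idea of a greedy approach concrete, here is a toy sketch of greedy overlap merging. It is not NCBI's procedure, which relies on curated layouts, MegaBLAST hits, and clone information as listed above. The function names, the exact-match overlap test, and the `min_len` threshold are assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(fragments, min_len: int = 3):
    """Repeatedly meld the pair of fragments with the largest exact overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the result stays in several pieces
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


contigs = greedy_merge(["ATGCGT", "CGTTAC", "TACGGA"])
```

A real assembly has to tolerate sequencing errors and repeats, which is why the layout step above brings in curated instructions and clone order rather than trusting sequence overlap alone.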

[Figure: BAC sequence fragments are assembled and ordered into an NCBI contig.]

NCBI Genome Build Process

[Flowchart. Recoverable labels: Input Data (Sequences, Curated NTs, TPF, BLAST hits); Assembly; Contig Build & Release; Annotation; Analysis & Review; Corrections for next build; Freeze; Update (links, gi's, prepare for release); Public Release; FTP; BLAST; Map Viewer; Sequences (contig, mRNA, protein); Resource Updates; Input Resources; STS; Clones; LocusLink; RefSeq; GenBank; GenomeScan; dbSNP; Collaboration; Curation; Exclude problem accessions.]

What is being annotated?

Feature: Method

Genes: By alignment, by prediction
Markers: By ePCR
Clones/Cytogenetic location: By alignment (BAC ends)
Variation: By alignment
Phenotype (MIM): Via gene identification, associated markers
Cytogenetic position: By annotated BAC-end sequenced clones; by FISH-mapped clones used in assembly
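Placing markers by ePCR (electronic PCR) means searching the sequence for a primer pair that sits in the right orientation and at roughly the expected product size. The sketch below shows the idea under strong simplifying assumptions: exact primer matches only, forward strand only, and a hypothetical `epcr` function with a made-up `tolerance` parameter. It is not the NCBI e-PCR program.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def find_all(haystack: str, needle: str):
    """Yield every 0-based start position of needle in haystack."""
    pos = haystack.find(needle)
    while pos != -1:
        yield pos
        pos = haystack.find(needle, pos + 1)


def epcr(seq: str, fwd: str, rev: str, expected_size: int, tolerance: int = 50):
    """Return (start, end) of candidate products as 0-based, half-open intervals."""
    rev_rc = reverse_complement(rev)  # how the reverse primer site reads on the forward strand
    hits = []
    for start in find_all(seq, fwd):
        for rc_pos in find_all(seq, rev_rc):
            end = rc_pos + len(rev_rc)
            # Reverse primer site must lie downstream of the forward primer.
            if rc_pos >= start + len(fwd) and abs((end - start) - expected_size) <= tolerance:
                hits.append((start, end))
    return hits


# Made-up 36 bp sequence with one marker whose expected product is 32 bp.
sequence = "GG" + "ACGTAC" + "T" * 20 + "GATTCA" + "CC"
hits = epcr(sequence, fwd="ACGTAC", rev="TGAATC", expected_size=32, tolerance=5)
```

The size check is what lets a marker be placed unambiguously: primer sequences alone can match in several places, but few pairs match at the right spacing.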

RefSeq: a reagent for Contig Annotation

[Figure: evidence aligned to the genome. Labels: GenomeScan, ESTs, TBLASTN, RPSBLAST, RefSeq mRNAs, GenBank mRNAs.]

RefSeq advantages:
• Separate gene families
• Not partial
• Means to correct problem sequences

The RefSeq process results in excluding problem GenBank sequences from the annotation pipeline.

Potential problems with ESTs:
• Gene families
• Partial
• Chimeric
• Intron read-through
• Linker
• Vector
• Wrong organism

NCBI: Products of annotation
• RefSeqs (transcripts, proteins)
• Gene ID (LocusID)
• Features in chromosome coordinates
• Features in contig (NT accession) coordinates

Available in:
• Map Viewer
  – Graphical display
  – Tabular display
  – Sequence downloads
• FTP
  – RefSeqs (contigs, transcripts, proteins)
  – Mapping data
  – LocusLink & other resources

NCBI Map Viewer

NCBI Map Viewer: Tabular report

Genes in regions of conserved synteny

[Figure panels: anchored by human gene order; anchored by mouse gene order.]

Chromosomal segments in dog conserved with human and mouse

Dog: 38 autosomes + sex chr

Query by sequence: Review the alignment

A click away:
• Alignments (BLAST hit)
• Gene description (LocusLink)
• Report of all features in the region
• Contig sequence
• Sequence in the region
• Other mRNAs aligning in the region
• Define your own gene model based on alignments in the region

Quality Control - Genome review

• Is the sequence correct?
• Is the feature correctly placed?
• Is there a feature that should be placed?
• Are the attributes of the feature correct?

Approaches:
• In-house analysis & review (manual curation)
• Shared information (UCSC/Ensembl)
• Solicited review by experts in local regions

Ensembl Annotation pipeline

• Set of high-quality gene predictions
  – From known human mRNAs aligned against genome
  – From similar protein and mRNAs aligned against genome
  – From Genscan predictions confirmed via BLAST of protein, cDNA, EST databases.
• Initial functional annotation from InterPro
• Integration with external resources (SNPs, SAGE, OMIM)
• Comparative analysis between mouse/human
  – DNA sequence alignment
  – Protein orthologs
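Confirming ab initio predictions with database hits comes down to asking which predicted intervals are supported by an evidence alignment. The sketch below is a simplified illustration of that check, not Ensembl's actual criteria, which the slides do not spell out. The function names and coordinates are hypothetical, and "supported" is assumed to mean any overlap at all.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirmed_predictions(predictions: List[Interval],
                          evidence: List[Interval]) -> List[Interval]:
    """Keep predictions that overlap at least one evidence alignment."""
    return [p for p in predictions if any(overlaps(p, e) for e in evidence)]


# Made-up coordinates: three predicted exons and two BLAST hits.
predicted_exons = [(100, 200), (500, 600), (900, 950)]
blast_hits = [(150, 180), (940, 1000)]
supported = confirmed_predictions(predicted_exons, blast_hits)
```

A production pipeline would typically also require a minimum score or coverage for each hit; the any-overlap rule here is only the simplest possible stand-in.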

Ensembl gene prediction pipeline

[Flowchart. Recoverable labels: DNA; RepeatMasker; Genscan; BLAST Genscan peptides vs. protein, UniGene, EST, vertebrate mRNA; Pmatch all human proteins and cDNAs; MiniGenewise; MiniEst2genome; Genes.]

Genome Annotation

[Figure: the generic structure of an automatic genome annotation pipeline and delivery system.]

[Browser screenshot. Panels: Chromosome; Overview (genes and markers, 1 Mb); Detailed View (genes, ESTs, CpG etc., 100 kb); Configuration.]

Useful genomic annotation and browser URLs

EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/

NCBI Human Genome Browser: http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=hum_chr.inf&query

The Oak Ridge National Laboratory Genome Channel: http://compbio.ornl.gov/channel/

UCSC Human Genome Browser: http://genome.ucsc.edu/cgi-bin/hgGateway

The Institute for Genomic Research (TIGR): http://www.tigr.org/

Genome annotation: things still being worked out

• Annotation servers.
  – Pro: make genomics information accessible to biologists without expert bioinformatics skills.
  – Con: makes it difficult to perform large-scale data mining.
  – Solution: enable more experienced users to retrieve the data they require and to run analyses locally.
• Open annotation systems.
  – Biologists need to have access to annotations available in the community and to share their own contributions with the community.
  – A common protocol between systems that enables genome data to be freely exchanged
    • AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
    • Distributed Annotation System (DAS) projects
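A common exchange protocol lets a client ask any compliant server for the annotations in a region with the same kind of request. As a rough sketch, the helper below builds a DAS-style features request URL. The server address and data source name are hypothetical, and the `/das/<source>/features?segment=ref:start,stop` layout reflects the DAS 1.x convention as best recalled, so it should be checked against the DAS specification before use.

```python
from urllib.parse import urlencode


def das_features_url(server: str, data_source: str,
                     reference: str, start: int, stop: int) -> str:
    """Build a features request for one segment (URL layout is an assumption, see above)."""
    # Keep ':' and ',' unescaped so the segment reads as ref:start,stop.
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{data_source}/features?{query}"


# Hypothetical server and data source, for illustration only.
url = das_features_url("http://das.example.org", "hg_example", "chr1", 1000, 2000)
```

Because the request is just a URL, the same function can be pointed at several annotation servers and the responses merged locally, which is the kind of open exchange the slide describes.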

Genome annotation servers

• Several ways to find information:
  – Search by clone, gene, EST, marker.
  – Browse sequence.
  – BLAST searches.
  – Homology: start in one organism, jump to the syntenic region of another.

UCSC Genome Browser

http://genome.ucsc.edu/cgi-bin/hgGateway