genomic analysis. flowchart get genome sequence – genome assembly find genes translate genes all...

Genomic Analysis

Flowchart

• get genome sequence (genome assembly)

• find genes

• translate genes

• all against all, self-comparison

• all against all, interproteome

• functional classification

• synteny analysis

• microarrays

Contigs

• Sequences are obtained by cloning pieces of DNA into plasmids

• One sequencing reaction can resolve at most about 800 base pairs

• Overlapping fragments allow the complete sequence to be deduced

Fragment Assembly package in GCG

• This package of programs allows you to input fragment sequences, make the contigs, and then edit the final contigs.

Contigs: the algorithm

• First, find regions of overlap that contain a minimum number of identities (sliding window with an identity matrix)

• Second, save those overlaps whose identities/overlap ratio meets a threshold criterion (80% in GelMerge)

Identity/overlap ratio

• To save the overlaps that meet the threshold, they must first be aligned

• This is a global alignment that does not penalize overhanging ends

• So F(i,0) = 0 and F(0,j) = 0 (the top row and leftmost column are all 0, so the alignment can start anywhere along the top or left border)

• Start the traceback from the maximum value on the right or bottom border, i.e. the largest F(i,m) or F(n,j)

        H   E   A   G   A   W   H
    0   0   0   0   0   0   0   0
P   0
A   0
W   0                           n,m
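The boundary conditions and traceback start described above can be sketched in a few lines of Python. This is a minimal illustration, not GelMerge's implementation: the function name `overlap_align` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def overlap_align(a, b, match=1, mismatch=-1, gap=-2):
    """Fill F with free end gaps; return (best score, (i, j)) found on a border.

    Rows follow sequence a (length n), columns follow sequence b (length m).
    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # F[i][0] = 0 and F[0][j] = 0: overhanging ends are not penalized.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # The traceback would start from the maximum on the right or bottom border.
    border = [(F[i][m], (i, m)) for i in range(n + 1)]
    border += [(F[n][j], (n, j)) for j in range(m + 1)]
    return max(border)


print(overlap_align("PAW", "HEAGAWH"))
```

For the sequences in the matrix above, the best border cell is on the bottom row, where the A and W of PAW line up with the A and W near the end of HEAGAWH.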

• GelMerge then aligns the two pieces (contigs) with the longest overlap and assembles them into a single piece of DNA; this process is repeated until no overlaps remain in the fragment database being used
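The merge-the-longest-overlap loop can be sketched as a greedy procedure. This is a simplified stand-in, not GelMerge itself: it only accepts exact suffix/prefix overlaps (GelMerge tolerates mismatches down to its identity threshold), and the names `exact_overlap`, `greedy_assemble`, and `min_len` are hypothetical.

```python
def exact_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair with the longest overlap until none remain."""
    contigs = list(fragments)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = exact_overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # no remaining overlaps
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTA"]))
```

The three toy fragments are made up for the example; they collapse into a single contig because each shares a three-base overlap with the next.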

In-class exercise

• Open the file called fragments; this contains truncated regions of the file named geneseq.

• In the editor, select all the sequences.

• Select Functions --> Fragment Assembly --> GelStart; enter a project name and select Begin a new project; select Run

In-class exercise, cont.

• Go back to Fragment Assembly, select GelEnter; the green GelEnter box should say selected sequences from Editor (make sure all sequences in Editor are still selected); select Enter the selected sequences from main window; Run

In-class exercise, cont.

• Go back to Fragment Assembly again; select GelMerge; Run

• Go back to Fragment Assembly again; select GelView; Run

• Now go back and look at the options, especially in the GelMerge program; try changing them and see what happens.

Genome project programs

• PHRED: analyses raw sequence to produce a 'base call' with an associated 'quality score' for each sequence position

• Phred scores are reported as q = -10*log10(p), where p is the probability of the base call being wrong

• The error probability at q of 20 (1 in 100) is 10x that at q of 30 (1 in 1000)

• PHRAP: assembles raw sequence into sequence contigs and assigns each position in the sequence a 'quality score' based on the Phred scores of the raw sequence reads (same scale as Phred)

• GigAssembler: merges the information from individual sequenced clones into a draft genome sequence.
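The relationship between a Phred quality score and the error probability is a simple logarithm, so a short sketch may help when reading quality values. The function names below are hypothetical and are not part of PHRED or PHRAP.

```python
import math


def phred_quality(p_error):
    """Quality score q = -10 * log10(p), where p is the base-call error probability."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Invert the score: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

Each step of 10 in q corresponds to a tenfold change in the probability that the base call is wrong.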

Chromosomal Map from Mycobacterium tuberculosis (TIGR)

Gene and regulatory region finding

• Sequencing a million base pairs is relatively easy

• Identifying open reading frames (eukaryotic) in that million base pairs is quite difficult (because of intervening sequences (introns), etc.)

• Identifying regulatory sequences is very difficult: such sequences are short, and can be separated from the ORF by 50,000 base pairs

Gene finding by similarity

• Screen genomic sequence against known cDNA sequences in database; if you find a significant match, that’s probably an ORF! (usual first step with genomic sequence)

• This will miss lots of genes ...

Genomic DNA BLAST results

• Input: genomic DNA fragment from E. coli

• BLASTX against the nr protein database at NCBI

• Output follows

• This is a pretty trivial example, but you can see how this works for actual unknown genome sequences

Major methods of gene finding

• Pattern discrimination

  • Find metrics that correlate with usage in coding regions

  • Generate a way to separate coding/noncoding regions according to that metric

• Others (HMM, neural net, genetic algorithm, …)

ORF patterns

• 7 major metrics:

  • Frame bias: find the frame that matches the codon bias of that organism

  • Fickett algorithm: amalgam of several tests involving 3-periodicity of query DNA vs. known 3-periodicity of known coding DNA, and also overall base composition

  • Fractal dimension: common codons clustered with common codons, or uncommon with uncommon, gives a low fractal dimension, which is typical of exons

  • Coding 6-tuple word preferences: compare occurrence to known coding vs. noncoding regions in database

  • Coding 6-tuple in-frame preferences: compare occurrence to known in-frame vs. out-of-frame preferences

  • Word commonality: exons use rare, introns use common 6-tuples

  • Repetitive 6-tuple preferences

• Each of these metrics by itself is not very good at predicting ORFs; integrating all this information is much more likely to be successful

• Such integration is species specific, and also somewhat regionally specific within species; nonetheless very useful
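One of the metrics above, the coding 6-tuple word preference, can be sketched as a log-odds score: count how often each word occurs in known coding and known noncoding sequence, then sum the log ratio over the words of a query. This is an illustrative sketch only; the function names, the pseudocount, and the toy training sequences in the usage lines are assumptions, not taken from any particular gene finder.

```python
import math
from collections import Counter


def kmer_counts(seqs, k=6):
    """Count every overlapping k-tuple in a list of training sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def coding_score(query, coding, noncoding, k=6, pseudo=1.0):
    """Sum of per-word log-odds: > 0 leans coding, < 0 leans noncoding."""
    # Pseudocounts keep unseen words from producing log(0).
    c_total = sum(coding.values()) + pseudo * 4 ** k
    n_total = sum(noncoding.values()) + pseudo * 4 ** k
    score = 0.0
    for i in range(len(query) - k + 1):
        w = query[i:i + k]
        score += math.log((coding[w] + pseudo) / c_total)
        score -= math.log((noncoding[w] + pseudo) / n_total)
    return score


# Toy training data with k=3 so the example stays small (made-up sequences).
coding = kmer_counts(["GCTGCTGCTGCT"], k=3)
noncoding = kmer_counts(["ATATATATATAT"], k=3)
print(coding_score("GCTGCT", coding, noncoding, k=3))
print(coding_score("ATATAT", coding, noncoding, k=3))
```

A real application would train on large, species-specific sets of annotated sequence and combine this score with the other metrics rather than use it alone.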

Gene prediction in prokaryotes (and yeast)

• Little intergenic DNA, lack of introns, and highly conserved regulatory region patterns make gene prediction easier in prokaryotes

• Markov models (GeneMark) and hidden Markov models (GeneMark.hmm) work because predictable patterns give reasonable estimates of probabilities for transitions between coding and non-coding regions

In-class exercise: GeneMark and GeneMark.hmm

• Go to GeneMark website http://opal.biology.gatech.edu/GeneMark/

• Use text editor to open ecoli_lac_operon.txt file (Troy: local guest directory; Hartford: my directory); this contains genomic sequence from E. coli

• Use the GeneMark web server to get predicted ORFs using both GeneMark and GeneMark.hmm

• Compare outputs; how would you find out if these ORFs correspond to your results from exercise I?

Glimmer

• Higher-order HMMs

  • Instead of looking at just the previous state, use information from the previous n states (e.g., 5th order)

• Interpolated Markov models = IMMs

  • Incorporate the highest-order information possible that preserves statistical discrimination

• Glimmer is TIGR’s main gene finding tool
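The idea of using the longest context that the training data can still support can be sketched as follows. This is not Glimmer's actual weighting scheme: the rule that trusts a context in proportion to how often it was seen (`min_count`) is a hypothetical simplification, and the function names are made up for the example.

```python
from collections import defaultdict


def train(seqs, max_order=5):
    """counts[context][base] for every context length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i, base in enumerate(s):
            for k in range(min(i, max_order) + 1):
                counts[s[i - k:i]][base] += 1
    return counts


def interpolated_prob(counts, context, base, min_count=10):
    """Blend orders 0..len(context); rarely seen contexts get less weight."""
    prob = 0.25  # uniform prior over A, C, G, T
    for k in range(len(context) + 1):
        ctx = context[len(context) - k:]  # the last k bases of the context
        seen = counts.get(ctx)
        if not seen:
            break  # this context never occurred, so longer ones did not either
        total = sum(seen.values())
        lam = min(1.0, total / min_count)  # hypothetical weighting rule
        prob = lam * seen.get(base, 0) / total + (1 - lam) * prob
    return prob


counts = train(["ACGT" * 10], max_order=2)
print(interpolated_prob(counts, "AC", "G"))
print(interpolated_prob(counts, "AC", "G", min_count=20))
```

With the default threshold the second-order context is fully trusted; raising `min_count` pulls the estimate back toward the lower-order probabilities.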

Gene finding in eukaryotes

• Significant intergenic DNA, less conserved patterns for regulatory regions, significant numbers of introns, more complicated chromosome structure

• Gene finding in eukaryotes is significantly more difficult than in prokaryotes

Neural Net

• Attempts to mimic neural patterns of learning

• Set up network of inputs that give outputs only if threshold is reached (like neurons); thresholds can be reached in a variety of different ways

• The network is a set of “hidden layers” that provide the information for the final output

Simple neural net

[Figure: two sensors feed a single node, which produces the node output]

The node might only give output if both sensors are +; or only if both are -; or only if one is + and one is -
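The three behaviours of the simple node can be written as threshold units. This is a minimal sketch under assumptions: sensors are encoded as 1 for + and 0 for -, and the weights and thresholds are hand-picked for illustration rather than learned. A single threshold node cannot express the "one +, one -" case, so that one is built from two nodes feeding a third.

```python
def node(inputs, weights, threshold):
    """Fire (1) only when the weighted sum of the inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def both_plus(a, b):
    return node((a, b), (1, 1), 2)


def both_minus(a, b):
    return node((a, b), (-1, -1), 0)


def one_plus_one_minus(a, b):
    # Two hidden nodes ("either" and "both") feed an output node.
    either = node((a, b), (1, 1), 1)
    return node((either, both_plus(a, b)), (1, -1), 1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, both_plus(a, b), both_minus(a, b), one_plus_one_minus(a, b))
```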

More complex neural net

[Figure: network with hidden net layers leading to an output]

• Construct network of nodes and connections

• Train on sequences with known properties; adjust weights for connections to optimize for desired outcome on training set

• GRAIL works by combining 7 algorithms in a neural net trained on a large set of human sequences with known coding and noncoding regions

• GRAIL won’t work for every human sequence; won’t necessarily work for non-human sequences; nonetheless, works quite well
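Adjusting connection weights against a training set can be illustrated with the simplest possible case, a single node trained by the perceptron rule. This is a stand-in chosen for brevity and is not how GRAIL is trained; the function names, learning rate, and toy training set are assumptions.

```python
def predict(weights, bias, x):
    """Output 1 when the weighted sum plus bias is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(examples, epochs=20, rate=1.0):
    """Nudge weights toward the target whenever a training example is misclassified."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias


# Toy training set: output 1 only when both inputs are 1.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(examples)
print(weights, bias, [predict(weights, bias, x) for x, _ in examples])
```

Networks with hidden layers need a different training procedure, but the principle is the same: weights change until the outputs on the training set match the known answers.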

Exercise

• Human genomic DNA

• Use GRAIL EXP to find exons

• Compare to GeneMark.hmm

• Bayesian methods: compare sequences from fairly close species (mouse and human); look for regions that align, ignore the rest

• Based on the idea that regions that are conserved are likely to be coding or regulatory regions; those that are not conserved are likely not to be

Regulatory region finding

• Again use comparison, but this time look in regions outside the open reading frame

• This has been done successfully using Bayesian methods

All-against-all self-comparison of proteome

• Translate all identified ORFs

• BLAST each translated ORF against all other translated ORFs within that proteome

• Identify paralogs = separate genes within a genome that arose by duplication

• Identify gene families
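The first step, translating the identified ORFs, can be sketched with the standard genetic code. The compact table construction below is a common idiom; the function name `translate` and the choice to stop at the first stop codon are assumptions for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(orf):
    """Translate an in-frame DNA ORF, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))
```

The translated sequences are what would then be compared all against all.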


All-against-all interproteome comparison

• Like self-comparison, only between organisms

• Identify orthologs = genes in different species that descend from the same ancestral gene and usually retain the same function

• Identify conserved domains
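One common way to turn an interproteome comparison into candidate orthologs is the reciprocal best hit criterion, which the slides do not name; it is offered here as an illustrative technique. The sketch assumes similarity scores have already been computed (for example from BLAST), and the function names and toy scores are made up.

```python
def best_hits(scores):
    """scores[(query, subject)] -> similarity; return each query's top subject."""
    best = {}
    for (q, s), value in scores.items():
        if q not in best or value > best[q][1]:
            best[q] = (s, value)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd, rev = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Toy scores for two proteins in each proteome (hypothetical values).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b1"): 120, ("a2", "b2"): 40}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 120, ("b2", "a1"): 50, ("b2", "a2"): 40}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a2 also hits b1 best, but b1 prefers a1, so only the a1/b1 pair is reported.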

Functional classification

• Useful as a precursor to data mining for finding genes related by function, etc.

Synteny analysis

• Arrangement of genes (ORFs) on a chromosome is preserved to a greater or lesser extent depending on the relatedness of the organisms

• Computational analysis of synteny is very similar to sequence alignment methods

• Isochores = “long regions of homogeneous base composition”

  • 1M base pairs

  • GC content uniform throughout (the GC content of any sliding window differs by no more than 1% from the overall GC content of the isochore)

  • H = high density: rich in genes

  • L = low density: poor in genes
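The sliding-window GC criterion for isochores can be sketched directly. The function names are hypothetical, and the example assumes that "1%" means one percentage point of absolute GC content; the window size is left to the caller because the slides do not specify one.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def is_homogeneous(seq, window, tolerance=0.01):
    """True if every sliding window stays within tolerance of the overall GC content."""
    overall = gc_content(seq)
    return all(
        abs(gc_content(seq[i:i + window]) - overall) <= tolerance
        for i in range(len(seq) - window + 1)
    )


print(is_homogeneous("ACGT" * 50, window=4))
print(is_homogeneous("AT" * 50 + "GC" * 50, window=20))
```

The second toy sequence has the same overall GC content as the first but fails the test, because its composition changes sharply halfway along.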

Global gene regulation

• Microarray analysis

• Beyond scope of this class

• See discussion in text

Other molecular biology applications

• PCR primer finding
  – How do you think this algorithm works?

• Restriction enzyme mapping
  – How do you think this algorithm works?
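As one possible answer to the restriction mapping question, the core of such a program is a search for each enzyme's recognition site followed by cutting at a fixed offset within it. The sketch below handles a single enzyme on a linear sequence; the function names are hypothetical, and the example uses EcoRI, which recognizes GAATTC and cuts after the G.

```python
def find_sites(seq, site):
    """Start positions (0-based) of every occurrence of the recognition site."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, site, cut_offset):
    """Cut a linear sequence cut_offset bases into each site; return the fragments."""
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


seq = "AAGAATTCTTGAATTCGG"  # made-up sequence with two EcoRI sites
print(find_sites(seq, "GAATTC"))
print(digest(seq, "GAATTC", cut_offset=1))
```

A full mapping tool would repeat this for a table of enzymes and report fragment lengths, which is what a gel would show.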
