Finding Genes in the Rice Genome

Post on 29-Jan-2016


Page 1: Finding Genes in the Rice Genome

Finding Genes in the Rice Genome

Hao Bailin

T-Life Research Center, Fudan University

Beijing Genomics Institute, Academia Sinica

Institute of Theoretical Physics, Academia Sinica

(www.itp.ac.cn/~hao/)

On-going work by a team of 10-12 people since August 2001: Zheng Weimou, Xie Huimin, Liu Jinsong, Xu Zhao, Fang Lin, Li Heng, Gao Lei, Jin Jiao, et al. Nothing written yet.

Page 2: Finding Genes in the Rice Genome

Two Cultivars of Rice

• Oryza sativa ssp. indica (籼稻)

• Oryza sativa ssp. japonica (粳稻)

The difference was described in Xu Shen's Chinese dictionary (许慎《说文解字》) of the Eastern Han Dynasty (~2nd century AD).

J.H. Zhang et al., Rice cultivation of Jianhu Remains in Henan Province, Science J. (《科学》杂志), 53(4), 2002, 3 (in Chinese)

Page 3: Finding Genes in the Rice Genome

cccaatatcttgcttcagcaagatattgggtatttctagctttcctttcttcaaaaattgctatatgttagcagaaaagccttatccattaagagatggaacttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtgcattacttccataccaagattagcacggttgatgatatcagcccaagtattaataacgcgaccttggctatcaactacagattggttgaaattgaatccgtttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccctactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaaaactagcatattggaagattaatcggccaaaataaccatgagcggccacaatattataagtttcttcctcttgaccaaatctgtaaccctcattagcagattcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccatgcatagcactgaatagggaaccgccgaatacaccagctacacctaacatgtgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaagttgaaagtaccagatattcctaaaggcataccatcagagaaacttccttgaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagctgaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttcccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaattagctcataaggaccaccattgtataaccactcatcaacagatgcagcttcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataatggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggctcacgaataccatcaatatctactggaggggcagcgatgaaggcgataataaatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaaccatccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgaccccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaagatcttggtttattcaaattgcaaggactcccaagcacacgtattaactagaaagataatagaaggcttgttatttaacagtataatatagactatataccaatgtcaaccaagccagccccgacagttgtatatccatacaacaaaatttaccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcagattgctcctttctagtttccatatgggttgcccgggactcgaacccggaactagtcggatggagtagataattattccttgttacaatagagaaaaaacctctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgtagaaataggctatttctattccgaagaggaagtctactaatttttttagtagtaagttgattcacttactatttattatagtacagagaacatttcagaatggaaactgtgaaagttttaccttgatcatttatcaatcatttctagtttattagttttgtttaatgattaattaagaggattcaccagatcattgatacggagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaaagtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgtaaaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtaccgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagtatatactttagtcgatacaaagtcttcttttttgaagatccactgtgataatgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtacaaaattgggcttttgctaaagatccaatgagaggagtaacagggactttggtatcgaattttttcatttgagtatctattagaaatgaattctccagcatttgattccttactaacaaagaatttattggtacacttgaaaagtaccccagaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggttgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacatttccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttccttgatatcgaacataatgcataaggggatccataacgaaccatatggttttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctagaaaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaagaagattgtttacgaagaaacaacaagaaaaattcatattctgatacataagagttatataggaaccgaaatagtcttttattttcttttttcaaaataaaaatggatttcattgaagtaataaaactattccaattcgagtagtagttgagaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattgaaggagttgaagcaagatatccaaatggataggatagggtatttctatatgtgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattgaatgaatagatcgtaaattctgaaactttggtatttctttttcttccggacaagactgttctcgtagcgagaatgggatttctacaacgatcgcaaacccctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaatccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattctgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaatttcttgttattagaaccaataatttcgacaagttcggaaccatttaatccataatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacgaaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttccatttgtatttctacttgaatcagagagagagaaatatttctcggtttatcaaatggtgatacatagtacaatatggtcagaacagggtgttgcattttttaatacaaacccctggggaagaaaaggagtctaatccacggatctttttccgctccttttctatccaatttgtttatgtttgttctaattacaaaagagaacaaatcctttatttttgcaggccaattgctcttttgactttgggatacagtctctttatcaatatactgcttcttttacacattcaatccataacatccttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaaaatcaaaggtctactcataggaaaaccagcttttccctacatcaggcactaatctatttttaacgtctaattagatcagggagttcttccaattaagaagttaagctcgttgctttttgttttaccagaattggagccaggctctatccatttattcattagacccagaaaatcagaatttttttattccattccaaaaatccaaaataagaaattgattttattacgacatgctattttttccattcattacccttgaggatcagtcgcggtcttatagactctaccaagagtctggacgaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaagccgagtactctaccattgagttagcaacccagataaactaggatcttagatacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaaacactcagataccaaaaggaacgggtctggttaaatttcactaaggttaaaagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttatttaaataaataaataaatcttgtatgagagtacaaacaagagggacaaccctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggataaagagacttatccatctacaaattctagatgttcaatggacctttgtcaatggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaataaaggcttatgttggattggcacgacataaatccagtcaaaaataggattaagaaagaggcaaattatttctaaatagttagacaacaagggatactagtgagcctctcctagttttttattcatttagttcttcaattaactcaaagttctttctttttctttaaagaattccgccttccttaaaatatcagaaacggttcttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatttaaacaagtttgattctttatcggatcataaaaacctacttttcgaagatctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagacagcttattgggatagatgtagataaataaagccccccctagaaacgtataggaggttttctcctcatacggctcgagaatatgacttgcattaatttccgtacagaaaaaacaaatttcatttatactcatgactcaagttgactaattttgattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtctctaaactcttttctttgcctcatctcgaacaaattcacttttattccttattccggtccaattctattgttgagacagttgaaaatcgtgtttacttgttcgggaatcctttatctttgatttgtgaaatccttgggtttaaacattacttcgggaattcttattcttttttctttcaaaagagtagcaacatacccttttttcttatttccttcgataaagcatttccctcttctatagaaatcgaatatgagcgattgattctgatagactttaatcaaaagagttttcccatatcttccaaaattggactttcttcttattttaaccttttgatttctatattatttcgatttctatattaagggtagaatgacaaagttggcctaatttattagttttcactaaccctagattctttcccttgataaaaaataaattctgtcctctcgagctccatcgtgtactatttacttagcttacttacaaacaacccagcgaaaattcggttcgggacgaatagaacagactatgtcgagccaagagcattttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtccttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaacataacattcctctaatttcattgcaaagtgttatagggaattgatccaatatggatggaatcatgaatagtcattagtttcgttttttgtatactaattcaaacttgctttgctatctatggagaaatatgaataaaagaaattaagtatttatcgggaaagactccgcaaagagccaatttatttaaacccatattctatcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgcttaagacttatttattatggaatttccatcctcaacagaggactcgagatgatcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaataaactatcaacctcccgtttaattaatttaattaatatattagattagcaatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaattctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaaaaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttattttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattctatctaacgagcagttcttatcttatctttaccgggatggatcattctggatatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaaaaagaagaaggaaccttttttactaataaaatactataaaaaaaatttatctctatcataaatctatctctaccataaaggaataggtctcgttttttatacaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttcaatttgactggacttgacactggattatgttttctgagacagaaaatgaacgcattaggactgcatcgaatctaagagtttataagagaaaaaaattctctttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgtttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacgggaccaaaacccgctgccttaccacttggccacgccccatttcgggttttatgcgacactaataaacagtattatgtttatttcttattcgtcaatcctacttcaattacataaaaatggggggtattctcttggtaggattctagacatgcgaataatatagaatccaaaaaatgcattgatcattacatggaattctattaagatattatatgaaagtcgaatttcttccactctcatttgagagtgcgaatacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggattttgaatcctccttttcctttttcccttagaaaaataactcaatcaaaatccaattatctactctacaagaacgaaacgcttgttatgcctaatatacttagtttaacctgtatttgttttaattctgttatttatccgactagttttttcttcgccaaattgcccgaagcttatgccattttcaatccaatcgtggattttatgcctgtcatacctgtactcttttttctattagcctttgtttggcaagctgctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatcatgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtcttatattatgaaccttcgattctaaaattcaaattcttctacattgaatgtatagctgcagcaataaatttggatcagcctttctactccctgcatctacgttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattgataagagtgcttattataaatcaattcttgcaatttttttcaaaaattgatttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttgtgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggagattatgtaatgcttactctcaaactttttgtttatacagtagtgatattctttgtttccctctttatctttggattcttatctaatgatccaggacgtaatcctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggatttgtttcatacatttatctacgagaaaatccgggggtcagaattccttccaattcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaaccctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtccactcagccatctctccccgttccaaatcgaaaggtttccgtgatatgacagaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttcttaatgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaaaattggatattggctaaaagacaatcagatagattttctcttcagcaggcatttccatataggacttgttataataaaacaagcaggttatagaaaaaaactcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaaccaacccaccccataaaattggaaagaaagataaagtaagtggacctgactccttgaatgaggcctctatccgctattctgatatataaattcgatgtagatgaaattgtataagtggatttttttgtatttccttagacttagaccacgcaaggcaagaatttctcgctatttactatttcatattcttgttactagatgttctataggaataagaagaaatcgcaacccctttccgctacacataaaaatggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaatcctatttttgttcttatacccatgcaatagagagcgagtgggaaaagggaggttactttttttcattttttccttaaaaaataggctttcttggaaataggaatcatggaataatctgaattccaatgtttatttctatagtataagaaaaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccccatagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggcattgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagtgatataaggtgctcggaaatggttgaagtaattgaataggaggatcactatgactatagcccttggtagagttactaaagaagaaaatgatttatttgatattatggacgactggttacgaagggaccgttttgtttttgtaggatggtctggcctattgctttttccttgtgcttatttcgctttaggaggttggtttacagggacaacttttgtaacttcttggtatacccatggattggcgagttcctatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaatagtttagcacactctttgttgctactatggggcccggaagcacaaggggattttactcgttggtgtcaattaggtggtctgtggacttttgttgctctccatggggcttttgcactaataggtttcatgttacgtcaatttgaacttgctcggtctgttcaattgcggccttataatgcaatttcattctctggcccaatcgctgtttttgtttccgtattcctgatttat

ccactggggcaatccggttggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcctcttcttccaaggatttcataattggacgttgaacccatttcatatgatgggagttgccggagtattaggcgcggctctgctatgcgctattcatggggcaaccgtgga

Page 4: Finding Genes in the Rice Genome

Gene-Finding by Computer

Starting from the early 1980s:

• “Ab initio” or “de novo” algorithms: GeneMark, GenScan, FgeneSH, Genie, … based on gene-structure models and training data. (Our on-going project: BGF, the BGI Gene Finder)

• Homology methods based on sequence alignment with known genes in databases

• Mixed approach using both strategies: TwinScan

Page 5: Finding Genes in the Rice Genome

Different Stages of Gene-Finding

• Use all possible existing programs and services on the web with a public-domain or home-made genome viewer

• Write your own gene-finder, trained for the specific organism

• A dream for the time being: design a self-training and self-developing program “for any species” which would improve itself iteratively starting from a few available reads, cDNAs, and ESTs

Page 6: Finding Genes in the Rice Genome

Performance of Gene-Finders in Eukaryote Genomes

• M. Q. Zhang, Nature Reviews Genetics, 3 (2002) 698-710 (mostly for the human genome):

Nucleotide level: 80%; Exon level: 45%; Whole gene structure: 20%

• FgeneSH and BGF for rice (our tests on 128 cDNA-confirmed single-gene genomic sequences):

Nucleotide level: 90%; Exon level: 60%; Whole gene structure: 40%
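The slide does not say how the three accuracy levels were computed. A minimal sketch, assuming the usual gene-finding conventions (sensitivity = TP / annotated, specificity = TP / predicted, and an exon counted as correct only when both boundaries match); the function names and the toy coordinates are illustrative only:

```python
def nucleotide_accuracy(true_exons, pred_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    Exons are (start, end) pairs, 0-based, end exclusive.
    Specificity here is TP / (TP + FP), the convention common in gene finding.
    """
    true_pos = {i for s, e in true_exons for i in range(s, e)}
    pred_pos = {i for s, e in pred_exons for i in range(s, e)}
    tp = len(true_pos & pred_pos)
    sn = tp / len(true_pos) if true_pos else 0.0
    sp = tp / len(pred_pos) if pred_pos else 0.0
    return sn, sp


def exon_accuracy(true_exons, pred_exons):
    """An exon is correct only if both of its boundaries are exact."""
    correct = len(set(true_exons) & set(pred_exons))
    sn = correct / len(true_exons) if true_exons else 0.0
    sp = correct / len(pred_exons) if pred_exons else 0.0
    return sn, sp


def whole_gene_correct(true_exons, pred_exons):
    """The whole gene structure is correct only if every exon is exact."""
    return sorted(true_exons) == sorted(pred_exons)


# Toy example: the middle exon is predicted 10 bases too short.
true_exons = [(100, 200), (300, 400), (500, 650)]
pred_exons = [(100, 200), (310, 400), (500, 650)]
print(nucleotide_accuracy(true_exons, pred_exons))
print(exon_accuracy(true_exons, pred_exons))
print(whole_gene_correct(true_exons, pred_exons))
```

A single misplaced boundary barely changes the nucleotide-level figure but costs a full exon and the whole gene, which is one reason the three percentages fall so steeply from level to level.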

Page 7: Finding Genes in the Rice Genome

5' ... 3'

3' ... 5'

• Each strand carries the same amount of information, but different sets of genes.

• The two strands are equivalent in information content.

• The two strands are not equivalent in gene content.

• Biological processing (replication, transcription) goes from 5' to 3'.

• Genes can be found on one strand at a time or on both strands at the same time: one-pass or two-pass programs.
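A minimal sketch of what searching both strands means in practice: a motif on the reverse strand appears on the forward strand as its reverse complement. The helper names are illustrative, not taken from any of the gene-finders mentioned here.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(seq, motif):
    """Return the 0-based start of every (possibly overlapping) occurrence."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits


def find_on_both_strands(seq, motif):
    """Report hits as (strand, position) in forward-strand coordinates."""
    plus = [("+", i) for i in find_all(seq, motif)]
    minus = [("-", i) for i in find_all(seq, reverse_complement(motif))]
    return plus + minus


print(find_on_both_strands("ccatggatgcat", "atg"))
```

Searching for the reverse complement of the motif avoids building the second strand explicitly, so all positions stay in one coordinate system.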

Page 8: Finding Genes in the Rice Genome

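Figure: from genomic DNA to folded protein.

Genomic DNA → (transcribe: RNA Pol II + …) → Pre-mRNA → (splice: spliceosome, U1, U2, U4, U5, U6 snRNPs) → mRNA → (translate: ribosome, initiation + elongation factors, termination) → AA seq (protein primary seq) → (fold: chaperonins) → Protein fold

Labels along the transcript: 5', 5'-UTR, start, stop, 3'-UTR, 3'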

Page 9: Finding Genes in the Rice Genome

Three Scales of Search

• Local: signals with minimal signature (start, stop, splicing); movable signals (caps, promoters, polyAs, branching points, some very weak) --- clustering, discriminant analysis, various statistical models

• Intermediate: exons, introns, intergenic --- Markov, semi-Markov, Hidden-Markov models; intron length distribution

• Global: optimal combination of the above --- dynamic programming
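The global step can be illustrated with a deliberately simplified dynamic program: given candidate segments that already carry scores from the local and intermediate models, pick the non-overlapping subset with the highest total score. This is only a sketch of the idea; real gene-finders also enforce reading-frame and signal compatibility between neighbours, which is omitted here, and the scores below are made up.

```python
from bisect import bisect_right


def best_chain(candidates):
    """Pick non-overlapping (start, end, score) segments with maximal total score.

    Coordinates are 0-based with end exclusive. Returns (total, chosen_segments).
    """
    cands = sorted(candidates, key=lambda c: c[1])
    ends = [c[1] for c in cands]
    best = [0.0] * (len(cands) + 1)   # best[i]: optimum over the first i candidates
    took = [False] * (len(cands) + 1)
    prev = [0] * (len(cands) + 1)
    for i, (start, _end, score) in enumerate(cands, start=1):
        # j = how many of the first i-1 candidates end at or before `start`
        j = bisect_right(ends, start, 0, i - 1)
        if best[j] + score > best[i - 1]:
            best[i], took[i], prev[i] = best[j] + score, True, j
        else:
            best[i] = best[i - 1]
    chosen, i = [], len(cands)
    while i > 0:                      # trace back the optimal parse
        if took[i]:
            chosen.append(cands[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[-1], chosen[::-1]


candidates = [(0, 50, 3.0), (40, 90, 4.0), (60, 120, 2.5), (130, 200, 1.0)]
print(best_chain(candidates))
```

Note that the highest-scoring single candidate (40, 90) is dropped, because the two segments it overlaps are worth more together; a greedy choice would miss this.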

Page 10: Finding Genes in the Rice Genome

{()【( . )( . )( . )】()}

Signals:

• { transcription start (downstream of promoters)

• } transcription end (upstream of poly-A)

• 【 translation start (atg, 1/64 in a random seq.)

• 】 translation end (tag, tga, taa, 3/64)

• ( splicing donor site (minimal signal = gt, 1/16)

• ) splicing acceptor site (ag, 1/16)

• · branch point (very weak …a…)

Labels along the gene: transcription start, translation start, translation end, transcription end
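The chance frequencies quoted for the minimal signals follow from assuming equal, independent base frequencies: a given k-mer has probability 4^-k at any position. A small sketch (helper names are illustrative) that reproduces the fractions for the start codon ATG, the three stop codons, and the GT/AG dinucleotides:

```python
from fractions import Fraction


def chance_probability(patterns):
    """Probability that a random position matches any of the patterns.

    Assumes equal base frequencies and distinct patterns of the same length,
    so the matches are mutually exclusive and the probabilities add.
    """
    return sum(Fraction(1, 4 ** len(p)) for p in set(patterns))


def count_sites(seq, patterns):
    """Count positions in seq where any pattern starts (overlaps allowed)."""
    seq = seq.lower()
    return sum(seq.startswith(p, i) for i in range(len(seq)) for p in patterns)


SIGNALS = {
    "translation start": ["atg"],
    "translation end": ["tag", "tga", "taa"],
    "splicing donor": ["gt"],
    "splicing acceptor": ["ag"],
}

for name, patterns in SIGNALS.items():
    print(name, chance_probability(patterns))
```

With a donor candidate expected every 16 bases by chance, the minimal signals alone cannot locate genes; this is why the sequence context around each signal has to be scored as well.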

Page 11: Finding Genes in the Rice Genome

{()【( . )( . )( . )】()}

• 【( First exon

• )( Internal exon

• )】 Last exon

• {( Non-coding 5’ exon

• )【 Non-coding 5’ exon

• ( . ) Intron

• 】( Non-coding 3’ exon (rare)

• )} Non-coding 3’ exon (rare)

• }{ Intergenic region

Labels along the gene: transcription start, translation start, translation end, transcription end

Page 12: Finding Genes in the Rice Genome

Signal and Sequence Models

• eiid: equal probability independently and identically distributed

• niid: non-equal probability independently and identically distributed

• WWM: Windowed weight matrix, etc.

• MMn: Markov chain model of order n: homogeneous and period-3 MM5 are used in many gene-finders

• Consensus sequence
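To make the MMn entry concrete, here is a minimal sketch of a homogeneous order-n Markov chain trained by counting, with a log-odds comparison between a coding and a non-coding model. The pseudocount and the function names are assumptions for illustration. A period-3 model, as used for coding regions, would keep three such tables, one per codon position; that is not shown.

```python
import math
from collections import defaultdict


def train_markov(seqs, order, pseudocount=1.0):
    """Return prob(context, base) estimated from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order):
    """Log-probability of seq given its first `order` bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


def log_odds(seq, coding, noncoding, order):
    """Positive values favour the coding model."""
    return log_likelihood(seq, coding, order) - log_likelihood(seq, noncoding, order)


# Toy training data; real models use order 5 and large training sets.
coding = train_markov(["atggccgcgatggcc"], order=1)
noncoding = train_markov(["atatatttaaatata"], order=1)
print(log_odds("gccgcg", coding, noncoding, order=1))
```

Setting `order=0` reduces this to the niid model in the list above, and forcing all four probabilities to 1/4 gives eiid.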

Page 13: Finding Genes in the Rice Genome

Consensus Sequences

• TATAAT (Pribnow or -10 box):

T80 A95 T45 A60 A50 T96

• TTGACA (-35 box):

T82 T84 G78 A65 C54 A45

• CAAT (CAAT or -75 box):

GGYCAATCT

• TATA (TATA or Goldberg-Hogness box):

TATAWAW

• ATG (translation start point)

However, in Aful: ATG 76%, GTG 22%, TTG 2%
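The percentages attached to each consensus base can be turned into a simple weight matrix. The slide gives only the frequency of the consensus base at each position, so this sketch assumes the remainder is split equally among the other three bases; that split, the uniform background, and the function names are all assumptions made for illustration.

```python
import math

# Consensus base and its frequency at each position of the -10 box.
PRIBNOW = [("T", 0.80), ("A", 0.95), ("T", 0.45),
           ("A", 0.60), ("A", 0.50), ("T", 0.96)]


def build_matrix(consensus):
    """One column per position; non-consensus bases share the remainder equally."""
    matrix = []
    for base, p in consensus:
        column = {b: (1.0 - p) / 3 for b in "ACGT"}
        column[base] = p
        matrix.append(column)
    return matrix


def log_odds_score(site, matrix, background=0.25):
    """Sum of log2(p_position(base) / background) over the site."""
    return sum(math.log2(col[b] / background)
               for b, col in zip(site.upper(), matrix))


def best_site(seq, matrix):
    """Return (score, position) of the best-scoring window in seq."""
    n = len(matrix)
    return max((log_odds_score(seq[i:i + n], matrix), i)
               for i in range(len(seq) - n + 1))


matrix = build_matrix(PRIBNOW)
print(best_site("ggcgtataatgcc", matrix))
```

Unlike a bare consensus string, the matrix penalises a mismatch at a strongly conserved position (A95, T96) much more than one at a weak position (T45).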

Page 14: Finding Genes in the Rice Genome
Page 15: Finding Genes in the Rice Genome

GT-AG Rule for Introns

5' splicing donor site: exon …A64 G73 | G100 T100 A62 A68 G84 T63… intron

3' splicing acceptor site: intron …12Py N C65 A100 G100 | N… exon
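Using only the invariant part of the rule (G100 T100 at the donor, A100 G100 at the acceptor), every GT followed at a suitable distance by an AG is a candidate intron. A sketch that enumerates them; the length limits are hypothetical parameters, not values from the slides:

```python
def candidate_introns(seq, min_len=20, max_len=10000):
    """List (start, end) of every GT...AG span within the length limits.

    Coordinates are 0-based, end exclusive, so seq[start:end] begins with
    'gt' and ends with 'ag'. min_len and max_len are illustrative defaults.
    """
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "ag"]
    introns = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if min_len <= length <= max_len:
                introns.append((d, a + 2))
    return introns


seq = "cccgtccccccagccc"
print(candidate_introns(seq, min_len=4))
```

The number of candidates grows roughly with the product of donor and acceptor counts, so in practice each site is first scored against the flanking consensus and weak ones are discarded.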

Page 16: Finding Genes in the Rice Genome
Page 17: Finding Genes in the Rice Genome

Figure: Exon and intron size distribution (panels: exon, intron) for Arabidopsis, rice, and human.

Page 18: Finding Genes in the Rice Genome

Algorithms

• Sequence models and scores for signals

• Dynamic programming: optimal parse

• Hidden Markov Model: geometric distribution of intron lengths

• Semi-Hidden Markov Model: needs sequence-generating models and length probability for each node

• Language theory approach
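Why a plain HMM implies geometric lengths: a state that stays put with probability p at each step emits a run of length L with probability p^(L-1)(1-p). A short sketch of that arithmetic (the value of p is arbitrary):

```python
def geometric_length_pmf(p_stay, length):
    """P(run length = length) for an HMM state with self-transition p_stay."""
    return p_stay ** (length - 1) * (1.0 - p_stay)


def mean_length(p_stay):
    """Expected run length of the state."""
    return 1.0 / (1.0 - p_stay)


p = 0.99
print(mean_length(p))
print([round(geometric_length_pmf(p, n), 5) for n in (1, 10, 100, 1000)])
```

The probability is largest at length 1 and only decreases from there, whatever p is. Observed exon and intron length distributions do not look like this, which is the motivation for the semi-Markov variant with an explicit length probability per state.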

Page 19: Finding Genes in the Rice Genome

Flow Chart of GenScan

Chris Burge (1996): a 27-state semi-HMM

A simpler model: 19-state

A model taking UTR introns into account: 35-state

Page 20: Finding Genes in the Rice Genome

Figure: N, intergenic region; P, promoter; F, 5'UTR; $E_{\mathrm{sngl}}$, single-exon gene; $E_{\mathrm{init}}$, initial exon; $E_k$ $(0 \le k \le 2)$, phase k internal exon; $E_{\mathrm{term}}$, terminal exon; T, 3'UTR; A, polyadenylation signal; and $I_k$ $(0 \le k \le 2)$, phase k intron. … strand.

Page 21: Finding Genes in the Rice Genome

Problems: Minor and Major

• Ambiguity symbols (N, W, S, R, …)

• (1-p) at flanking D-type nodes

• Indels and frame-shifts

• Gradient effects in gene structure

• Introns in 5’-UTRs and 3’-UTRs: leading to 35-state Markov Models

• Alternative splicing and sub-optimal paths

• Limit of probabilistic models

• Deterministic approaches
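For the first item, the ambiguity symbols are the standard IUPAC nucleotide codes. The slides do not say how BGF treats them; one simple option, shown here purely as an assumption, is to average a model's base probabilities over the bases a symbol can stand for.

```python
# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(symbol, base):
    """True if the (possibly ambiguous) symbol can stand for the base."""
    return base.upper() in IUPAC[symbol.upper()]


def symbol_probability(symbol, base_probs):
    """Average the model's probabilities over the bases the symbol allows.

    This averaging is an illustrative choice, not a documented BGF rule.
    """
    bases = IUPAC[symbol.upper()]
    return sum(base_probs[b] for b in bases) / len(bases)


probs = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
print(symbol_probability("R", probs), symbol_probability("N", probs))
```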

Page 22: Finding Genes in the Rice Genome

Dyck language: A language of nested parentheses

• Many types of parentheses

• Finite depth of nesting

• Context-free language

Our case:

• Only 3 types of parentheses

• Shallow nesting

• Conjecture: may be regular language
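The conjecture can be illustrated with the bracket notation used earlier for signals: because the nesting depth is bounded, a regular expression is enough to check a signal string. The grammar below (a gene is "{", optional introns, "【", optional introns, "】", optional introns, "}", with an intron written "()" or "(.)") is a reading of the example string, not a definition taken from the slides.

```python
import re

INTRON = r"\(\.?\)"                       # ( donor, optional branch point, ) acceptor
GENE = r"\{(?:%s)*【(?:%s)*】(?:%s)*\}" % (INTRON, INTRON, INTRON)
GENES_RE = re.compile(r"(?:%s)+" % GENE)  # one or more genes in a row


def is_valid_structure(signals):
    """Check a signal string against the assumed gene grammar, ignoring spaces."""
    compact = "".join(signals.split())
    return GENES_RE.fullmatch(compact) is not None


print(is_valid_structure("{()【( . )( . )( . )】()}"))  # multi-exon gene with UTR introns
print(is_valid_structure("{【】}"))                     # single-exon gene
print(is_valid_structure("{【(】)}"))                   # intron crossing the stop codon
```

A general Dyck language needs a stack to track unbounded nesting; here the depth never exceeds three, so a finite-state recogniser suffices.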

Page 23: Finding Genes in the Rice Genome

Two Test Datasets for Rice Gene-Finders

• The 28469 japonica full-length cDNAs (Kikuchi et al., Science 301 (18 July 2003))

• Select a high-quality subset without overlaps with publicly available cDNAs

• A single-gene set: 500 sequences with one gene in each

• A multi-gene set: 46 sequences with 199 genes in total (at least 4 genes in a sequence)

Page 24: Finding Genes in the Rice Genome

Assessment of Gene-Finders

Test done between 22 July and 2 August 2003

• FgeneSH (trained on monocotyledons)

• GeneMark.hmm

• RiceHMM

• GlimmerR

• GenScan (trained on maize)

• BGF

Page 25: Finding Genes in the Rice Genome

Our Ultimate Goal

• An iterative, self-training, self-improving gene-finder “for any species”, starting from a small number of reads with or without EST or cDNA support

• Annotation and re-annotation of the rice genomes

• Plant comparative genomics, especially that of Gramineae and Crucifers

Page 26: Finding Genes in the Rice Genome

tRNA features

• tRNA gene → pre-tRNA → mature tRNA

• Mature tRNA: 75 – 95 bases

• Cloverleaf-like structure

• Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm

Page 27: Finding Genes in the Rice Genome

How many tRNA genes are present in an organism?

• Codon → tRNA → amino acid

• 61 coding (sense) codons

• 20 amino acids

• Are there 61 species of tRNA with all possible anticodons ?

• Met (M) has one codon but two tRNAs
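The counts on this slide can be checked directly from the standard genetic code. A short sketch; the compact table string is the usual UCAG-ordered encoding of the standard code:

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

sense_codons = [codon for codon, aa in CODE.items() if aa != "*"]
codons_per_aa = Counter(CODE[codon] for codon in sense_codons)

print(len(CODE), len(sense_codons), len(codons_per_aa))
print(codons_per_aa["M"], codons_per_aa["W"], codons_per_aa["L"])
```

Sixty-four codons minus three stops leave 61 sense codons for 20 amino acids, so most amino acids have several codons, while Met and Trp have one each.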

Page 28: Finding Genes in the Rice Genome

Wobble hypothesis (Crick, 1966)

• Many tRNAs recognize more than one codon

• Through non-Watson-Crick base pairings

• Fewer than 61 tRNAs are needed
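A sketch of Crick's original pairing rules: the first (5') base of the anticodon pairs with the third base of the codon, and at that position G reads C or U, U reads A or G, and inosine (I) reads U, C, or A. The modified rules discussed next are not encoded here, and the function names are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# First anticodon base -> codon third bases it can pair with (Crick, 1966).
WOBBLE = {"U": "AG", "C": "G", "A": "U", "G": "CU", "I": "UCA"}


def anticodon_for(codon):
    """Strict Watson-Crick anticodon (reverse complement), written 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(codon))


def codons_read_by(anticodon):
    """All codons an anticodon (5' to 3') can read under the wobble rules."""
    first_two = COMPLEMENT[anticodon[2]] + COMPLEMENT[anticodon[1]]
    return sorted(first_two + third for third in WOBBLE[anticodon[0]])


print(anticodon_for("UUC"))      # GAA
print(codons_read_by("GAA"))     # both Phe codons
print(codons_read_by("IGC"))     # three of the four Ala codons
```

One tRNA with anticodon GAA covers both Phe codons, so no tRNA with anticodon AAA is required.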

Page 29: Finding Genes in the Rice Genome

The Modified Wobble Hypothesis (Guthrie & Abelson 1982)

• In eukaryotes, 46 different tRNA species would be enough.

• The modified wobble hypothesis holds almost perfectly in H. sapiens, S. cerevisiae, A. thaliana, and C. elegans, whose complete collections of tRNAs are now known.

Page 30: Finding Genes in the Rice Genome

tRNA copies in Arabidopsis (A), C. elegans (C), and Human (H); "anti" is the anticodon

aa codon  A  C  H anti | aa codon  A  C  H anti | aa codon  A  C  H anti | aa codon  A  C  H anti
F  UUU    0  0  0 AAA  | S  UCU   37 14 10 AGA  | Y  UAU    0  0  1 AUA  | C  UGU    0  0  0 ACA
F  UUC   16 16 14 GAA  | S  UCC    1  0  0 GGA  | Y  UAC   76 19 11 GUA  | C  UGC   15 13 30 GCA
L  UUA    6  5  8 UAA  | S  UCA    9  7  5 UGA  | *  UAA    0  0  1 UUA  | *  UGA    0  0  0 UCA
L  UUG   10  7  6 CAA  | S  UCG    4  5  4 CGA  | *  UAG    0  0  1 CUA  | W  UGG   14 11  7 CCA
L  CUU   11 18 13 AAG  | P  CCU   16  6 11 AGG  | H  CAU    0  0  0 AUG  | R  CGU    9 18  9 ACG
L  CUC    1  0  0 GAG  | P  CCC    0  0  0 GGG  | H  CAC   10 17 12 GUG  | R  CGC    0  1  0 GCG
L  CUA   10  3  2 UAG  | P  CCA   39 34 10 UGG  | Q  CAA    8 18 11 UUG  | R  CGA    6 10  7 UCG
L  CUG    3  5  6 CAG  | P  CCG    5  3  4 CGG  | Q  CAG    9  7 21 CUG  | R  CGG    4  3  5 CCG
I  AUU   20 19 13 AAU  | T  ACU   10 17  8 AGU  | N  AAU    0  0  1 AUU  | S  AGU    0  0  0 ACU
I  AUC    0  0  1 GAU  | T  ACC    0  0  0 GGU  | N  AAC   16 20 33 GUU  | S  AGC   13  9  7 GCU
I  AUA    5  8  5 UAU  | T  ACA    8 11 10 UGU  | K  AAA   13 16 16 UUU  | R  AGA    9  7  5 UCU
M  AUG   23 20 17 CAU  | T  ACG    6  7  7 CGU  | K  AAG   18 33 22 CUU  | R  AGG    8  3  4 CCU
V  GUU   15 19 20 AAC  | A  GCU   16 21 25 AGC  | D  GAU    0  0  0 AUC  | G  GGU    1  0  0 ACC
V  GUC    0  0  0 GAC  | A  GCC    0  0  0 GGC  | D  GAC   23 22 10 GUC  | G  GGC   23 14 11 GCC
V  GUA    7  6  5 UAC  | A  GCA   10 10 10 UGC  | E  GAA   12 17 14 UUC  | G  GGA   12 33  5 UCC
V  GUG    8  5 19 CAC  | A  GCG    7  4  5 CGC  | E  GAG   13 20  8 CUC  | G  GGG    5  3  8 CCC

Page 31: Finding Genes in the Rice Genome

tRNA Genes in the Rice Genome (found by tRNAscan-SE + BLASTN)

Chromosome   Indica (BGI)        Japonica/Syngenta (IRGSP)
1            85                  71 (85)
2            57                  59
3            79                  68
4            45                  46 (41)
5            58                  56
6            38                  32
7            34                  35
8            45                  42
9            34                  32
10           28                  23 (28)
11           23                  24
12           38                  36
Total        564 (in 382 Mbp)    519 (in 360 Mbp)
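A quick arithmetic check on these counts, using only the numbers quoted on this slide: the indica per-chromosome figures add up to the stated total, and the two assemblies have similar tRNA gene densities. The japonica column is not re-summed here because of the alternative values in parentheses.

```python
# Per-chromosome tRNA gene counts for indica (chromosomes 1 to 12).
indica = [85, 57, 79, 45, 58, 38, 34, 45, 34, 28, 23, 38]

# Totals and assembled lengths (Mbp) as quoted on the slide.
totals = {"indica": (564, 382), "japonica": (519, 360)}

print(sum(indica))
for name, (genes, mbp) in totals.items():
    print(name, round(genes / mbp, 2), "tRNA genes per Mbp")
```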

Page 32: Finding Genes in the Rice Genome

Chloroplast tRNA genes in ssp. indica and japonica

• 33 tRNA genes were found in each of the indica and japonica chloroplast genomes.

• They are completely identical; no mutation was found (E. C. Kemmerer and Ray Wu found two tRNA genes perfectly conserved).

• It is remarkable that, in spite of more than 9000 years of separation, no mutation could be observed in the chloroplast tRNA genes of the two subspecies.