1
DNA Classifications with Self-Organizing Maps (SOMs)
Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress
May 2003 IEEE International Workshop on Soft Computing in Industrial Application
2
Presentation Outline
• Introduction to DNA Splice Junctions
• Data Collection
• Introduction to SOMs
• SOM for DNA Splice Junction Classification
• Results
• Conclusions
3
4
Human genome in a nutshell
• Human: 23 pairs of chromosomes
• Chromosomes: thousands of genes
• Gene: info = exons, comments = introns
– Splice junctions are like /* comment flags */ in C code
• Exons and introns: codons
• Codon: bases
5
DNA Splice Junctions
• DNA: billions of nucleotides (A, C, G, T)
• Genes: sequences of amino acids (exons) that are often interrupted by non-coding nucleotides (introns)
• <0.1% of human DNA is made up of exons
• 99% of splice junctions have the same motif:
– Exon to intron: GT
– Intron to exon: AG
…GTGAAGGTTAA AG ATGTAGAT GT ATTG…
Intron | Splice Junction (AG) | Exon | Splice Junction (GT) | Intron
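The GT and AG motifs can be located with a plain string scan. The sketch below is illustrative only: the function name `find_motif_positions` is hypothetical, and the fragment is the example sequence from this slide with the spaces removed. It also shows why the motif alone is not enough: most GT/AG hits are not real junctions, which is why a classifier is needed.

```python
def find_motif_positions(sequence, motif):
    """Return the 0-based start index of every occurrence of motif (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


# Example fragment from the slide, spaces removed.
fragment = "GTGAAGGTTAAAGATGTAGATGTATTG"

# Candidate exon-to-intron sites (GT) and intron-to-exon sites (AG).
donor_candidates = find_motif_positions(fragment, "GT")
acceptor_candidates = find_motif_positions(fragment, "AG")

print("GT candidates:", donor_candidates)
print("AG candidates:", acceptor_candidates)
```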
6
Data Collection: HTML Browser + Perl scripts
BioBrowser
Download HTML → ExtractLinks() → Download HTML data → ExtractData() → TranslateData()
7
8
DNA Splice Junction (Cont.)
• A complete gene is made up of different exons
• Splice junction identification aids in the discovery of new genes
• The dataset used for this study is made up of 1,424 sequences
• Data were created ab initio from GenBank
• Each sequence is 32 nucleotides long, with regions comprising -15 to +15 nucleotides from the splice junction

…TGTAAGG AG ACGAGTT…
Intron | Splice Junction | Exon

Left Region   Splice Junction   Right Region   Class
Intron        AG                Exon           A
Exon          GT                Intron         B
Unknown       AG or GT          Unknown        C
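A 32-nucleotide window (15 + 2 + 15) can be cut out and turned into numbers as sketched below. The names `extract_window` and `one_hot_encode` are hypothetical, and the encoding is an assumption: the slides do not say how nucleotides were converted to features. Five indicators per position (A, C, G, T, other) is used here only because it yields 32 × 5 = 160 values, the feature count quoted on the demo slide.

```python
ALPHABET = "ACGT"


def extract_window(sequence, junction_start, flank=15):
    """Return flank + 2 + flank nucleotides centered on the 2-base junction.

    junction_start is the 0-based index of the first junction base.
    Returns None when the window would run past either end of the sequence.
    """
    start = junction_start - flank
    end = junction_start + 2 + flank
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def one_hot_encode(window, alphabet=ALPHABET):
    """Assumed encoding: one indicator per known base plus one 'other' slot."""
    features = []
    for base in window:
        indicators = [1.0 if base == symbol else 0.0 for symbol in alphabet]
        indicators.append(0.0 if base in alphabet else 1.0)  # unknown symbol
        features.extend(indicators)
    return features


# Toy sequence: 15 A's, a GT junction, 15 C's.
toy = "A" * 15 + "GT" + "C" * 15
window = extract_window(toy, 15)
features = one_hot_encode(window)
print(len(window), len(features))
```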
9
Self-Organizing Maps (SOM) Network
• Unsupervised learning neural network
• Projects high-dimensional input data onto two-dimensional output map
• Preserves the topology of the input data
• Visualizes structures and clusters of the data
[Figure: SOM network. The input layer (Components 1 to 5) connects to the output layer; output neurons i and c have weights w_i1 … w_i5 and w_c1 … w_c5.]
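A minimal SOM training loop is sketched below in plain Python. It is not the code used in the study. The parameter names `num_its`, `alpha_max`, `alpha_min`, and `max_neighborhood` are borrowed from the demo slide, but the linear decay schedules, the Gaussian neighborhood function, and the random initialization are assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, w in enumerate(weights):
        dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def train_som(data, rows, cols, num_its=1000, alpha_max=0.9, alpha_min=0.05,
              max_neighborhood=None, seed=0):
    """Train a rows x cols map; returns one weight vector per neuron (row-major)."""
    rng = random.Random(seed)
    dim = len(data[0])
    if max_neighborhood is None:
        max_neighborhood = max(rows, cols) / 2.0
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for it in range(num_its):
        frac = it / max(1, num_its - 1)
        alpha = alpha_max + (alpha_min - alpha_max) * frac       # learning rate decays
        sigma = max(max_neighborhood * (1.0 - frac), 0.5)        # neighborhood shrinks
        x = rng.choice(data)
        c_row, c_col = divmod(best_matching_unit(weights, x), cols)
        for i, w in enumerate(weights):
            row, col = divmod(i, cols)
            grid_dist_sq = (row - c_row) ** 2 + (col - c_col) ** 2
            h = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))     # Gaussian neighborhood
            for k in range(dim):
                w[k] += alpha * h * (x[k] - w[k])                # pull neuron toward x
    return weights


toy_data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som_weights = train_som(toy_data, rows=2, cols=3, num_its=200)
```

Because neighbors on the grid are updated together with the winning neuron, nearby neurons end up with similar weight vectors, which is what gives the map its topology-preserving behavior.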
10
Use of SOM for DNA Splice Junction Classification Model
DNA training set → SOM → SOM Classification Map (U-Matrix Map with regions A, B, C)
DNA test set → SOM Classification Map → Classification
Neuron identification methods
- Highest frequency class
- Closest neuron
Classification
- Class A: intron to exon
- Class B: exon to intron
- Class C: no transition
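One possible reading of the two neuron identification methods is sketched below: each neuron takes the most frequent class among the training samples it wins, and a neuron that wins no samples borrows the label of the closest labeled neuron. This interpretation, and the names `label_neurons` and `classify`, are assumptions and not taken from the slides.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def winning_neuron(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i], x))


def label_neurons(weights, samples, labels):
    """Assign a class to every neuron of a trained map."""
    hits = {}
    for x, y in zip(samples, labels):
        hits.setdefault(winning_neuron(weights, x), Counter())[y] += 1
    # Method 1: highest frequency class among the samples mapped to the neuron.
    neuron_labels = {i: counts.most_common(1)[0][0] for i, counts in hits.items()}
    # Method 2 (fallback): neurons with no samples copy the closest labeled neuron.
    labeled = list(neuron_labels)
    for i in range(len(weights)):
        if i not in neuron_labels:
            nearest = min(labeled, key=lambda j: squared_distance(weights[i], weights[j]))
            neuron_labels[i] = neuron_labels[nearest]
    return neuron_labels


def classify(weights, neuron_labels, x):
    """Classify a new sequence vector by the label of its winning neuron."""
    return neuron_labels[winning_neuron(weights, x)]


# Toy map with three neurons and five labeled training vectors.
toy_weights = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.9]]
toy_samples = [[0.1, 0.0], [0.0, 0.1], [0.05, 0.05], [1.0, 1.0], [1.0, 1.1]]
toy_labels = ["A", "A", "C", "B", "B"]
toy_neuron_labels = label_neurons(toy_weights, toy_samples, toy_labels)
```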
11
The U-matrix of the DNA Training Set
12
SOM Results for DNA Splice Junction Data
[Figure: The U-matrix of the DNA training set, with regions labeled A, B, and C]

Confusion matrix of 424-DNA test set (rows: DNA sequences, columns: classified to)

DNA sequences   Class A     Class B    Class C     Total
Class A         102 (93%)   2 (2%)     6 (5%)      110
Class B         0 (0%)      90 (91%)   9 (9%)      99
Class C         4 (2%)      6 (3%)     205 (95%)   215
Total           106         98         220         424
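The percentages in the confusion matrix can be recomputed from the raw counts. The sketch below uses only the numbers from this slide; the helper names are hypothetical. The overall accuracy it prints, (102 + 90 + 205) / 424, is not stated on the slide and comes out to roughly 93.6%.

```python
# Rows: true class of the DNA sequence; columns: class assigned by the SOM.
confusion = {
    "A": {"A": 102, "B": 2, "C": 6},
    "B": {"A": 0, "B": 90, "C": 9},
    "C": {"A": 4, "B": 6, "C": 205},
}


def overall_accuracy(matrix):
    """Fraction of all sequences that land on the diagonal."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def per_class_rate(matrix):
    """Fraction of each true class that was classified correctly."""
    return {c: matrix[c][c] / sum(matrix[c].values()) for c in matrix}


accuracy = overall_accuracy(confusion)
rates = per_class_rate(confusion)
print(f"overall: {accuracy:.1%}")
for name, rate in rates.items():
    print(f"class {name}: {rate:.0%}")
```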
13
Conclusions
• SOM is effective in DNA splice junction classification
• SOM is a powerful visualization tool for high-dimensional data
14
Demo with Analyze Code
• 800 training data, 324 test data (160 features)
• 96% correct overall classification on test data

Confusion Matrix

        IE    FALSE   EI
IE      98    0       0
FALSE   5     111     3
EI      2     3       102
9186 // K, L, max_neighborhood
20000 // num_its
50000 // num_fine_its
0.9 // alpha_max
0.05 // alpha_min
1 // LVQ_flag
GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA
TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA
TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT
GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG
CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG
GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA
CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC
ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC
ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG
TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA
TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA
CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA
CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA
CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA
CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA
CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA
TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA
CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA
CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA
CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT
ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT
TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA
CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT
THE END
16
AAAAGCATTGGGAA
GGTTC
CCGTTGAAC
GGTCAGGTTAGACTA
EXTRACTING KNOWLEDGE
17
NUCLEOTIDES
[Figure: nucleotide letters A, T, C, G]
• DNA is double-stranded
• A & T are complements
• C & G are complements
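Because A pairs with T and C pairs with G, one strand determines the other. A short sketch of computing the complementary strand follows; the function names are hypothetical.

```python
# Watson-Crick pairing: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base complement, read in the same direction as the input."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand):
    """Opposite strand, read in its own 5' to 3' direction."""
    return complement(strand)[::-1]


print(complement("ACGT"))
print(reverse_complement("GTAAG"))
```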
18
AMINO ACIDS
• Sequences of three nucleotides –“CODONS” – code for amino acids
• There are 20 different amino acids
• Exons are the parts of DNA that code for amino acids
• Each amino acid is encoded by between 1 and 6 different codons
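The "1 to 6" range can be checked by counting codons per amino acid in the standard genetic code. The sketch below builds the table from the usual compact 64-letter layout (bases ordered T, C, A, G, with `*` marking a stop codon); the variable and function names are hypothetical.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codons_per_amino_acid(table):
    """Count how many codons encode each amino acid (stop codons excluded)."""
    counts = Counter(table.values())
    counts.pop("*", None)
    return dict(counts)


degeneracy = codons_per_amino_acid(CODON_TABLE)
print(len(degeneracy), "amino acids")
print("fewest codons:", min(degeneracy.values()))
print("most codons:", max(degeneracy.values()))
```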
19
PROTEINS
• Proteins are made up of sequences of amino acids
• Generally responsible for some biological function
• May have complicated folding patterns that are difficult to predict
20
GENES
• 30,000 – 100,000 genes exist in the human genome
• Most genes have not yet been discovered
• Genes code for sequences of amino acids
• Genes are interrupted by non-coding regions of DNA called "introns"
21
CHROMOSOMES
22
READING FRAMES
…ACG TAGAT…
• Reading frames may be difficult to determine
• Reading frames may be shifted by splice junctions
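The same stretch of DNA splits into different codons depending on where reading starts. The sketch below, using a hypothetical helper `codons_in_frame`, lists the three forward frames of the fragment shown on this slide.

```python
def codons_in_frame(sequence, frame):
    """Split sequence into complete codons, starting at offset frame (0, 1, or 2)."""
    usable = sequence[frame:]
    return [usable[i:i + 3] for i in range(0, len(usable) - 2, 3)]


# Fragment from the slide, space removed.
fragment = "ACGTAGAT"
frames = {frame: codons_in_frame(fragment, frame) for frame in range(3)}
for frame, codons in frames.items():
    print(frame, codons)
```

Only frame 0 contains TAG, a stop codon, so the choice of frame changes how the fragment would be translated.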
23
GENE STRUCTURE
Start Codon (ATG)
Exon sequence (amino acid string)
Intron sequence (junk DNA)
Stop Codon (3 possible)
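The start-to-stop layout can be searched for directly once introns have been removed. The sketch below is a simplification under that assumption: it looks for ATG followed by the first in-frame stop codon (TAA, TAG, or TGA) and ignores introns entirely. The function name `find_open_reading_frames` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_open_reading_frames(sequence):
    """Return (start, end) index pairs for each ATG ... first in-frame stop codon."""
    orfs = []
    start = sequence.find("ATG")
    while start != -1:
        # Walk codon by codon from the start codon until a stop codon appears.
        for pos in range(start + 3, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break
        start = sequence.find("ATG", start + 1)
    return orfs


toy_sequence = "CCATGGCCTGATT"
toy_orfs = find_open_reading_frames(toy_sequence)
for begin, end in toy_orfs:
    print(begin, end, toy_sequence[begin:end])
```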
24
SPLICE JUNCTIONS
• Segments of DNA that join coding and non-coding regions