Bioinformation Technology: Case Studies in Bioinformatics and Biocomputing with DNA Chips
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT)
Seoul National University
[email protected]
http://bi.snu.ac.kr/~btzhang
Outline
- Bioinformation Technology
- Bioinformatics
- DNA Chip Data Analysis: IT for BT
- DNA Computing: BT for IT
- DNA Computing with DNA Chips
- Outlook
Human Genome Project
Genome Health Implications
- A New Disease Encyclopedia
- New Genetic Fingerprints
- New Diagnostics
- New Treatments
Goals
• Identify the approximately 40,000 genes in human DNA
• Determine the sequences of the 3 billion bases that make up human DNA
• Store this information in databases
• Develop tools for data analysis
• Address the ethical, legal and social issues that arise from genome research
Bioinformatics vs. Biocomputing
[Fig.] Bioinformatics and Biocomputing as the two links between BT and IT.
Bioinformatics
What is Bioinformatics?
- Bioinformatics vs. Computational Biology
- Bioinformatik (in German): biology-based computer science as well as bioinformatics (in English)
- Informatics – computer science
- Bio – molecular biology
- Bioinformatics – solving problems arising from biology using methodology from computer science.
Molecular Biology: Flow of Information
DNA → RNA → Protein → Function
[Fig.] A DNA sequence fragment (ACTGGA AGCTTATC) and a protein drawn as a chain of amino acids (Phe, Cys, Lys, Cys, Asp, Cys, Arg, Ser, Ala, Leu).
DNA (Gene) → RNA → Protein
[Fig.] Structure of a gene and its expression. Labels on the gene: control statements, TATA (start), ribosome binding, termination (stop). Transcription (RNA polymerase) produces mRNA with 5’ UTR and 3’ UTR; translation (ribosome) produces the protein.
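The flow DNA → RNA → protein can be illustrated with a minimal Python sketch. It assumes the input is the coding strand read 5' to 3', uses the standard genetic code, and ignores promoters, UTRs, and splicing; the function names and the example sequence are hypothetical.

```python
# Standard genetic code, written compactly with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino = CODON_TABLE[codon]
        if amino == "*":  # stop codon
            break
        protein.append(amino)
    return "".join(protein)

# Hypothetical gene fragment: Met, Phe, Cys, Lys, stop.
mrna = transcribe("ATGTTTTGCAAATAA")
print(mrna, translate(mrna))
```

The one-letter codes F, C, and K correspond to the Phe, Cys, and Lys residues used in the figure on the previous slide.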
Nucleotide and Protein Sequence
aacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc
gcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgcaacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc gcgggcccgc cgcttgtcggccgccggggg ggcgcctctg
cgcttgtcgg ccgccgggggccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg aacctgcggaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcg
SQ sequence 1344 BP; 291 A; C; 401 G; 278 T; 0 other
DNA (Nucleotide) Sequence
CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 .. 1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAK KELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSAL EAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPS EIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC
Protein (Amino Acid) Sequence
Some Facts
- 10^14 cells in the human body.
- 3 × 10^9 letters in the DNA code in every cell in your body.
- DNA differs between humans by 0.2% (1 in 500 bases).
- Human DNA is 98% identical to that of chimpanzees.
- 97% of DNA in the human genome has no known function.
Topics in Bioinformatics
- Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
- Pathway analysis
  - Metabolic pathway
  - Regulatory networks
- Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
- Expression analysis
  - Gene expression analysis
  - Gene clustering
Extension of Bioinformatics Concept
- Genomics
  - Functional genomics
  - Structural genomics
- Proteomics: large scale analysis of the proteins of an organism
- Pharmacogenomics: developing new drugs that will target a particular disease
- Microarray: DNA chip, protein chip
Applications of Bioinformatics
- Drug design
- Identification of genetic risk factors
- Gene therapy
- Genetic modification of food crops and animals
- Biological warfare, crime etc.
- Personal Medicine? E-Doctor?
Bioinformatics as Information Technology
[Fig.] Bioinformatics at the center of IT fields. Labels: Information Retrieval, Database, Hardware, Agent, Machine Learning, Algorithm; GenBank, SWISS-PROT, Supercomputing, Information filtering, Monitoring agent, Clustering, Rule discovery, Pattern recognition, Sequence alignment, Biomedical text analysis.
Background of Bioinformatics
- Biological information infrastructure
  - Biological information management systems
  - Analysis software tools
  - Communication networks for biological research
- Massive biological databases
  - DNA/RNA sequences
  - Protein sequences
  - Genetic map linkage data
  - Biochemical reactions and pathways
- Need to integrate these resources to model biological reality and exploit the biological knowledge that is being gathered.
Areas and Workflow of Bioinformatics
[Fig.] Labels: a DNA sequence, Structural Genomics, Functional Genomics, Proteomics, Pharmacogenomics, Microarray (Biochip), Infrastructure of Bioinformatics.
DNA Chip Data Analysis: IT for BT
cDNA Microarray
[Fig.] cDNA microarray workflow: cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray; hybridize mRNA target to microarray → excitation (Laser 1, Laser 2) → emission → scanning → overlay images and normalize → analysis.
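The "overlay images and normalize" step can be sketched as follows. The slide does not say which normalization is used; this sketch assumes a global median normalization of log2 ratios between the two laser channels, and the intensity values are hypothetical.

```python
from math import log2
from statistics import median

def normalized_log_ratios(red, green):
    """log2(red/green) per spot, shifted so the median log ratio is zero.

    Global median normalization is an assumption made for this sketch,
    not necessarily the method behind the slide.
    """
    ratios = [log2(r / g) for r, g in zip(red, green)]
    shift = median(ratios)
    return [x - shift for x in ratios]

# Hypothetical background-corrected intensities for five spots.
red = [200.0, 400.0, 800.0, 1600.0, 100.0]
green = [100.0, 200.0, 100.0, 800.0, 50.0]
print(normalized_log_ratios(red, green))
```

After normalization, a spot with a value near zero is expressed about equally in both samples, and the third spot stands out as up-regulated in the red channel.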
The Complete Microarray Bioinformatics Solution
[Fig.] Components: Data Management, Databases, Statistical Analysis, Image Processing, Automation, Data Mining, Cluster Analysis.
DNA Chip Applications
- Gene discovery: gene/mutated gene
  - Growth, behavior, homeostasis …
- Disease diagnosis
  - Cancer classification
- Drug discovery: Pharmacogenomics
- Toxicological research: Toxicogenomics
Disease Diagnosis: Cancer Classification with DNA Microarray
- cDNA microarray data of 6567 gene expression levels [Khan ’01].
- Filter genes that are correlated to the classification of cancer using PCA and ANN learning.
- Hierarchical clustering of the DNA chip samples based on the filtered 96 genes.
- Disease diagnosis based on DNA chip.
[Fig.] Flowchart of the experimental procedure.
Disease Diagnosis: Hierarchical Clustering Based on Gene Expression Levels
- Hierarchical clustering of cancer by 96 gene expression levels.
- The relation between gene expression and cancer category.
- Four cancer diagnostic categories
[Fig.] The dendrogram of four cancer clusters and gene expression levels (row: genes, column: samples).
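Hierarchical clustering of samples by expression profile can be sketched in a few lines. This is a minimal illustration, assuming average linkage and a distance of 1 minus the Pearson correlation; the slides do not state the linkage or metric, and the four sample profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hierarchical_clusters(profiles, n_clusters):
    """Average-linkage agglomerative clustering with distance 1 - Pearson r."""
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression levels of four genes in four samples.
samples = {
    "s1": [2.0, 1.8, 0.1, 0.2],
    "s2": [2.1, 1.9, 0.0, 0.3],
    "s3": [0.1, 0.2, 1.9, 2.2],
    "s4": [0.0, 0.3, 2.0, 2.1],
}
print(hierarchical_clusters(samples, 2))
```

Recording the order of merges, instead of stopping at a fixed number of clusters, would give the dendrogram shown in the figure.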
AI Methods for DNA Chip Data Analysis
- Classification and prediction
  - ANNs, support vector machines, etc.
  - Disease diagnosis
- Cluster analysis
  - Hierarchical clustering, probabilistic clustering, etc.
  - Functional genomics
- Genetic network analysis
  - Differential models, relevance networks, Bayesian networks, etc.
  - Functional genomics, drug design, etc.
Cluster Analysis
[DNA microarray dataset]
[Gene Cluster 1]
[Gene Cluster 2]
[Gene Cluster 3]
[Gene Cluster 4]
Methods for Cluster Analysis
- Hierarchical clustering [Eisen ’98]
- Self-organizing maps [Tamayo ’99]
- Bayesian clustering [Barash ’01]
- Probabilistic clustering using latent variables [Shin ’00]
- Non-negative matrix factorization [Shin ’00]
- Generative topographic mapping [Shin ’00]
Clustering of Cell Cycle-regulated Genes in S. cerevisiae (the Yeast)
- Identify cell cycle-regulated genes by cluster analysis.
- 104 genes are already known to be cell cycle-regulated.
- Known genes are clustered into 6 clusters.
- Cluster the 104 known genes and other genes together.
- The same cluster → similar functional categories.
[Fig.] Expression levels of the 104 known genes over the cell cycle (row: time step, column: gene).
Probabilistic Clustering Using Latent Variables
g_i: ith gene
z_k: kth cluster
t_j: jth time step
p(g_i|z_k): generating probability of ith gene given kth cluster
v_k = p(t|z_k): prototype of kth cluster
$$p(z_k \mid \mathbf{g}_i) = \frac{p(\mathbf{g}_i \mid z_k)\,p(z_k)}{p(\mathbf{g}_i)}$$
$$\sum_i \sum_j \sum_k f(\mathbf{g}_i, t_j, z_k)\,\log\big(p(z_k)\,p(\mathbf{g}_i \mid z_k)\,p(t_j \mid z_k)\big)$$
: (*) objective function (maximized by EM)
$$\mathrm{similarity}(\mathbf{x}_i, \mathbf{v}_k) = \sum_j x_{ij}\,v_{kj}$$
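The similarity between a gene profile x_i and a cluster prototype v_k is an inner product over time steps, which can be sketched as below. The sketch only covers the final assignment of genes to prototypes, not the EM fitting itself; the prototypes and gene profiles are hypothetical.

```python
def similarity(x, v):
    """Inner product sum_j x_ij * v_kj of a gene profile and a prototype."""
    return sum(x_j * v_j for x_j, v_j in zip(x, v))

def nearest_prototype(x, prototypes):
    """Name of the cluster whose prototype is most similar to profile x."""
    return max(prototypes, key=lambda k: similarity(x, prototypes[k]))

# Hypothetical prototypes v_k = p(t | z_k) over three time steps.
prototypes = {"z1": [0.7, 0.2, 0.1], "z2": [0.1, 0.2, 0.7]}
# Hypothetical gene expression profiles over the same time steps.
genes = {"g1": [0.9, 0.4, 0.1], "g2": [0.0, 0.3, 0.8]}

assignment = {g: nearest_prototype(x, prototypes) for g, x in genes.items()}
print(assignment)
```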
Experimental Result: Identify Cell Cycle-Regulated Genes
Clustering result
[Table] Clustering result with α-factor arrest data. Genes with a high probability of being cell cycle-regulated were found in 4 clusters.
Experimental Result: Prototype Expression Levels of Found Clusters
[Fig.] Prototype expression levels of genes found to be cell cycle-regulated (4 clusters).
• The genes in the same cluster show similar expression patterns during the cell cycle.
• The genes with similar expression patterns are likely to have correlated functions.
Clustering Using Non-negative Matrix Factorization (NMF)
NMF (non-negative matrix factorization)
$$G \approx WH, \qquad (G)_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}$$
$$G_{i\mu},\ W_{ia},\ H_{a\mu} \ge 0$$
G : gene expression data matrix
W : basis matrix (prototypes)
H : encoding matrix (in low dimension)
NMF as a latent variable model
[Fig.] Hidden units h_1, h_2, …, h_r connected through W to visible units g_1, g_2, …, g_n, with g ≈ Wh.
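A minimal pure-Python sketch of NMF follows. It assumes the multiplicative update rules of Lee and Seung for the squared error between G and WH; the slides do not say which fitting procedure was used, and the matrix G and the helper names are hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(g, r, iterations=300, seed=0, eps=1e-9):
    """Factor a non-negative matrix g (n x m) into w (n x r) and h (r x m)."""
    rng = random.Random(seed)
    n, m = len(g), len(g[0])
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T G) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, g), matmul(matmul(wt, w), h)
        h = [[h[a][u] * num[a][u] / (den[a][u] + eps) for u in range(m)] for a in range(r)]
        # W <- W * (G H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(g, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)] for i in range(n)]
    return w, h

def frobenius_error(g, w, h):
    wh = matmul(w, h)
    return sum((g[i][u] - wh[i][u]) ** 2
               for i in range(len(g)) for u in range(len(g[0]))) ** 0.5

# Hypothetical 4 x 3 expression matrix built from two underlying patterns.
G = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 0.0, 4.0], [1.0, 1.0, 3.0]]
W, H = nmf(G, r=2)
print(frobenius_error(G, W, H))
```

Because the updates only multiply by non-negative factors, W and H stay non-negative throughout, which is what makes the factors readable as additive prototypes and encodings.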
Experimental Result: Five Clusters Found by NMF
5 prototype expression levels during the cell cycle.
[Fig.] Prototype expression levels of the five clusters (x-axis: time step in cell cycle, 1 to 18; y-axis: expression level, 0 to 0.18).
Clustering Using Generative Topographic Mapping (GTM)
• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.
[Fig.] The mapping y(x;W) from a grid in the latent space (axes x1, x2) to the data space (axes t1, t2, t3). Labels: Visualization, Generation.
Experimental Result: Clusters Found by GTM
[Table] Three cell cycle-regulated clusters found by GTM.
Cluster | Cluster center | No. of train data / no. in cluster | Correct no. / test data | Overall mean expression levels (Cln/b) of known genes
S/G2 | | 5 / 1 | / 2 | (.148 .184 -.367 -.044)
S | (0.111 -0.333) | 5 / 5 | 5 / 5 (100%) | (1.075 1.482 -.233 -.375)
M/G1 (c1, c2, c3) | (0.111 0.333), (-0.111 -0.111), (0.323 0.1) | 13 / 7 / 2 / 2 | 1 / 6, 0 / 6, 0 / 6 | (-.171 -.573 .091 .311)
G2/M (c1, c2) | (0.111 0.333), (0.111 0.111) | 10 / 5 / 3 | 0 / 5, 3 / 5 (80%) | (-.616 -1.01 1.832 1.596)
G1 (c1, c2) | (-0.111 0.333), (-0.111 0.111) | 35 / 18 / 7 | 10 / 16 (62%), 0 / 16 | (.894 .907 -.766 -.479)
Experimental Result: Comparison with other methods
[Table] Comparison of prototype expression levels.
Cluster | No. of selected genes by GTM | Mean expression levels by GTM | No. of selected genes by Spellman | Mean expression levels by Spellman
S/G2 | 92 | (.13 -.06 -.1 .01) | 121 | (.13 .05 -.16 .03)
S | 25 | (.84 .81 -.42 -.33) | 71 | (.46 .47 -.43 -.18)
M/G1 (c1, c2, c3) | 120, 34, 10 | (.82 .65 -.65 -.38), (-.04 -.37 -.01 -.11), (.32 .29 -.3 .05) | 113 | (-.21 -.61 -.04 .07)
G2/M (c1, c2) | 33, 60 | (-.59 -.96 1.34 1.29), (.08 -.30 .51 .57) | 195 | (-.32 -.62 .49 .54)
G1 (c1, c2) | 122, 74 | (.92 .74 -.62 -.33), (.79 .82 -.48 -.34) | 300 | (.66 .49 -.55 -.33)
Total | 570 | | 800 |
Genetic Network Analysis
- Discover the complex regulatory interaction among genes.
- Disease diagnosis, pharmacogenomics and toxicogenomics
- Boolean networks
- Differential equations
- Relevance networks [Butte ’97]
- Bayesian networks [Friedman ’00] [Hwang ’00]
[Fig.] Basin of attraction of 12-gene Boolean genetic network model [Somogyi ’96].
Bayesian Networks
Represent the joint probability distribution among random variables efficiently using the concept of conditional independence.
[Fig.] Example Bayes net over A, B, C, D, E with edges A → C, B → C, B → D, C → E.
$$P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D) \quad \text{(by the chain rule)}$$
$$= P(A)\,P(B)\,P(C \mid A,B)\,P(D \mid B)\,P(E \mid C) \quad \text{(by the example Bayes net)}$$
• D is independent of A and C given B.
• C asserts dependency between A and B.
• E is independent of A and B given C.
An edge denotes a possible causal relationship between nodes.
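The factorization P(A)P(B)P(C|A,B)P(D|B)P(E|C) can be checked numerically with a short sketch. All conditional probability values below are hypothetical; only the network structure comes from the slide.

```python
from itertools import product

# Hypothetical parameters for binary variables (1 = true, 0 = false).
P_A1 = 0.3                                                    # P(A=1)
P_B1 = 0.6                                                    # P(B=1)
P_C1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}   # P(C=1 | A, B)
P_D1 = {0: 0.2, 1: 0.7}                                       # P(D=1 | B)
P_E1 = {0: 0.1, 1: 0.8}                                       # P(E=1 | C)

def bern(p_true, value):
    """Probability that a binary variable takes `value` when P(1) = p_true."""
    return p_true if value == 1 else 1.0 - p_true

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) using the factorization of the example network."""
    return (bern(P_A1, a) * bern(P_B1, b) * bern(P_C1[(a, b)], c)
            * bern(P_D1[b], d) * bern(P_E1[c], e))

def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(d=1, b=1)."""
    names = "abcde"
    return sum(
        joint(*values)
        for values in product((0, 1), repeat=5)
        if all(values[names.index(k)] == v for k, v in fixed.items())
    )

print(prob())                                    # sums to 1 up to rounding
print(prob(d=1, a=1, b=1) / prob(a=1, b=1))      # P(D=1 | A=1, B=1)
print(prob(d=1, a=0, b=1) / prob(a=0, b=1))      # P(D=1 | A=0, B=1)
```

The network needs 1 + 1 + 4 + 2 + 2 = 10 parameters, whereas a full joint table over five binary variables needs 2^5 - 1 = 31, which is the efficiency gain the slide refers to. The last two printed values agree because D is independent of A given B.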
Bayesian Networks Learning
Dependence analysis [Margaritis ’00]
- Mutual information and χ² test
Score-based search
• D: data, S: Bayesian network structure
$$p(D, S) = p(S)\,p(D \mid S) = p(S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$
- NP-hard problem
- Greedy search
- Heuristics to find good massive network structures quickly (local to global search algorithm)
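The score p(D|S) decomposes into one factor per node, which a score-based search compares across candidate parent sets. Below is a minimal sketch of that per-node factor in log space, assuming the Bayesian Dirichlet form with the same hyperparameter alpha for every cell (alpha = 1 gives the K2 prior); the count tables are hypothetical.

```python
from math import lgamma

def log_bd_family(counts, alpha=1.0):
    """Log of one node's factor in p(D | S).

    counts[j][k] is N_ijk: how often the node takes its k-th value while its
    parents are in their j-th configuration. alpha is alpha_ijk, taken to be
    equal for every cell (an assumption of this sketch).
    """
    total = 0.0
    for row in counts:
        alpha_ij = alpha * len(row)   # alpha_ij = sum_k alpha_ijk
        n_ij = sum(row)               # N_ij = sum_k N_ijk
        total += lgamma(alpha_ij) - lgamma(alpha_ij + n_ij)
        total += sum(lgamma(alpha + n) - lgamma(alpha) for n in row)
    return total

# Hypothetical data: 20 samples in which binary Y always equals binary X.
y_without_parent = [[10, 10]]          # a single (empty) parent configuration
y_with_parent_x = [[10, 0], [0, 10]]   # rows: X = 0 and X = 1
print(log_bd_family(y_without_parent), log_bd_family(y_with_parent_x))
```

The structure in which X is a parent of Y gets the higher score on these data, so a greedy search that adds one edge at a time would accept the edge X → Y.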
The Small Bayesian Network for Classification of Cancer
[Fig.] Bayesian network over the nodes Zyxin, Leukemia class, MB-1, C-myb, and LTC4S.
Method | Training error | Test error
Bayes nets | 0/38 | 2/34
Neural trees | 0/38 | 1/34
RBF networks | 0/38 | 1.3/34
[Table] Comparison of the classification performance with other methods [Hwang ’00].
• The Bayesian network was learned by full search using BD (Bayesian Dirichlet) score with uninformative prior [Heckerman ’95] from the DNA microarray data for cancer classification (http://waldo.wi.mit.edu/MPR/).
Large-Scale Bayesian Network with 1171 Genes
- Genetic networks for understanding the regulatory interaction among genes and their derivatives
- Pharmacogenomics and Toxicogenomics
[Fig.] The Bayesian network structure constructed from DNA microarray data for cancer classification (partial view).
DNA Computing: BT for IT
DNA Computing: BioMolecules as Computer
[Fig.] A binary string (011001101010001) next to a DNA sequence (ATGCTCGAAGCT).
Why DNA Computing?
- 6.022 × 10^23 molecules / mole
- Immense, brute force search of all possibilities
  - Desktop: 10^9 operations / sec
  - Supercomputer: 10^12 operations / sec
  - 1 mol of DNA: 10^26 reactions
- Favorable energetics: Gibbs free energy, ΔG = -8 kcal mol^-1
  - 1 J for 2 × 10^19 operations
- Storage capacity: 1 bit per cubic nanometer
HPP
[Fig.] Flow of DNA Computing for the Hamiltonian path problem (HPP) on a graph with nodes 0 to 6: Encoding → Ligation → PCR (Polymerase Chain Reaction) → Gel Electrophoresis → Affinity Column → Decoding → Solution.
Node 0: ACG, Node 1: CGA, Node 2: GCA, Node 3: TAA, Node 4: ATG, Node 5: TGC, Node 6: CGT
Solution strand: ACGCGAGCATAAATGTGCCGT (path 0 → 1 → 2 → 3 → 4 → 5 → 6)
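The flow of DNA computing for the HPP can be imitated in software. The sketch below uses the node codes from the slide, but the edge list is hypothetical (the slide's graph did not survive as text), ligation is idealized as exhaustive enumeration rather than random assembly, and edges are not modeled as separate splint strands.

```python
NODE_CODE = {0: "ACG", 1: "CGA", 2: "GCA", 3: "TAA", 4: "ATG", 5: "TGC", 6: "CGT"}
CODE_NODE = {code: node for node, code in NODE_CODE.items()}
# Hypothetical directed edges, chosen only for illustration.
EDGES = [(0, 1), (0, 3), (0, 6), (1, 2), (1, 3), (2, 1), (2, 3),
         (3, 2), (3, 4), (4, 1), (4, 5), (5, 2), (5, 6)]

def encode(path):
    """Encoding: concatenate the 3-base codes of the nodes on a path."""
    return "".join(NODE_CODE[v] for v in path)

def decode(strand):
    """Decoding: read a strand back as a list of nodes, 3 bases at a time."""
    return [CODE_NODE[strand[i:i + 3]] for i in range(0, len(strand), 3)]

def ligate(max_nodes):
    """Idealized ligation: every walk along the edges with up to max_nodes nodes."""
    walks = [[v] for v in NODE_CODE]
    pool = list(walks)
    for _ in range(max_nodes - 1):
        walks = [w + [b] for w in walks for a, b in EDGES if a == w[-1]]
        pool.extend(walks)
    return [encode(w) for w in pool]

def solve(start, end):
    n = len(NODE_CODE)
    pool = ligate(n)
    # PCR: amplify strands that begin at `start` and finish at `end`.
    pool = [s for s in pool
            if s.startswith(NODE_CODE[start]) and s.endswith(NODE_CODE[end])]
    # Gel electrophoresis: keep strands whose length corresponds to n nodes.
    pool = [s for s in pool if len(s) == 3 * n]
    # Affinity column: keep strands that contain every node.
    pool = [s for s in pool if set(decode(s)) == set(NODE_CODE)]
    return pool

solutions = solve(0, 6)
print(solutions, [decode(s) for s in solutions])
```

In the wet-lab version every filter acts on all strands in the tube at once, which is where the massive parallelism comes from; the simulation has to visit candidate strands one by one.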
Biointelligence on a Chip?
[Fig.] Labels: Information Technology, Biotechnology, Bioinformation Technology, Biological Computer, Molecular Electronics, Biointelligence Chip.
- Computing Models: the limit of conventional computing models
- Computing Devices: the limit of silicon semiconductor technology
Intelligent Biomolecular Information Processing
[Fig.] Labels: Theoretical Models, Biocomputing, Bio-Memory (GFP, Cytochrome c), Bio-Processor (inputs, controller, reaction chamber (calculating), output).
Evolvable Biomolecular Hardware
Sequence programmable and evolvable molecular systems have been constructed as cell-free chemical systems using biomolecules such as DNA and proteins.
DNA Computers vs. Conventional Computers
DNA-based computers | Microchip-based computers
slow at individual operations | fast at individual operations
can do billions of operations simultaneously | can do substantially fewer operations simultaneously
can provide huge memory in small space | smaller memory
setting up a problem may involve considerable preparations | setting up only requires keyboard input
DNA is sensitive to chemical deterioration | electronic data are vulnerable but can be backed up easily
Molecular Operators for DNA Computing
• Hybridization: complementary pairing of two single-stranded polynucleotides
5’- AGCATCCA –3’ + 3’- TCGTAGGT –5’ →
5’- AGCATCCA –3’
3’- TCGTAGGT –5’
• Ligation: attaching sticky ends to a blunt-ended molecule
TGACTACGACTG
ATGCATGCTACG + ATGCATGCTGAC
TACGTACGTGAC
sticky end
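The hybridization operator can be expressed as a base-pairing check. This is a minimal sketch that only accepts perfect Watson-Crick matches between strands of equal length; real hybridization also tolerates partial matches depending on temperature, which is not modeled here. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Base-by-base Watson-Crick complement (orientation is not reversed)."""
    return "".join(COMPLEMENT[base] for base in strand)

def hybridizes(top_5to3, bottom_3to5):
    """True if the bottom strand, written 3'->5', pairs with every base on top."""
    return len(top_5to3) == len(bottom_3to5) and complement(top_5to3) == bottom_3to5

print(complement("AGCATCCA"))
print(hybridizes("AGCATCCA", "TCGTAGGT"))
print(hybridizes("AGCATCCA", "TGCTAGGT"))
```

The second call shows that swapping two adjacent bases in the lower strand is enough to break perfect pairing.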
Research Groups
- MIT, Caltech, Princeton University, Bell Labs
- EMCC (European Molecular Computing Consortium), composed of national groups from 11 European countries
- BioMIP Institute (BioMolecular Information Processing) at the German National Research Center for Information Technology (GMD)
- Molecular Computer Project (MCP) in Japan
- Leiden Center for Natural Computation (LCNC)
Applications of Biomolecular Computing
- Massively parallel problem solving
- Combinatorial optimization
- Molecular nano-memory with fast associative search
- AI problem solving
- Medical diagnosis
- Cryptography
- Drug discovery
- Further impact in biology and medicine:
  - Wet biological data bases
  - Processing of DNA labeled with digital data
  - Sequence comparison
  - Fingerprinting
NACST (Nucleotide Acid Computing Simulation Toolkit)
[Fig.] NACST components: GUI, NACST Engine Controller, DNA Sequence Generator, DNA Sequence Optimizer, Genetic Algorithm, Ligation Unit, PCR Unit, Electrophoresis Unit, Affinity Column Unit, Enzyme Unit.
NACST
[Fig.] NACST inputs and outputs.
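Combinatorial Problem Solver
TSP (Traveling Salesman Problem)
Representations
- City sequences: P1A = AGCT, P1B = TAGG; P2A = ATGG, P2B = CATG; P3A = CGAT, P3B = CGAA
- Edge 1 → 3 (P1B, W13, P3A): ATCC GCCT GCTA, where ATCC and GCTA are the complements of P1B and P3A, and W13 = GCCT
- Edge 1 → 2 (P1B, W12, P2A): ATCC ATCA TACC, where ATCC and TACC are the complements of P1B and P2A, and W12 = ATCA
[Fig.] Weighted graph with cities 0 to 6 (edge weights such as 3, 5, 7, 9, and 11) and the tour 0 1 2 3 4 5 6 0.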
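The edge representation can be reproduced from the city sequences: an edge strand is the complement of the departure city's B half, followed by a weight sequence, followed by the complement of the arrival city's A half. The sketch below assumes exactly this layout, inferred from the two example edges; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Base-by-base Watson-Crick complement."""
    return "".join(COMPLEMENT[base] for base in strand)

def edge_strand(from_b_half, weight, to_a_half):
    """Edge strand: complement of PiB, then the weight Wij, then complement of PjA."""
    return complement(from_b_half) + weight + complement(to_a_half)

# City half-sequences from the slide.
CITY = {1: ("AGCT", "TAGG"), 2: ("ATGG", "CATG"), 3: ("CGAT", "CGAA")}

edge_13 = edge_strand(CITY[1][1], "GCCT", CITY[3][0])   # W13 = GCCT
edge_12 = edge_strand(CITY[1][1], "ATCA", CITY[2][0])   # W12 = ATCA
print(edge_13, edge_12)
```

With this layout an edge strand can only bridge the B half of its departure city and the A half of its arrival city, so ligation builds strands that follow edges of the graph.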
Combinatorial Problem Solver
Weight representation methods
1. Molecules with high G-C content tend to hybridize easily.
2. Molecules with high G-C content tend to be denatured at higher temperature.
3. Molecules with a larger population in the tube are more likely to hybridize.
[Fig.] Procedure: Hybridization/Ligation, PCR/Gel electrophoresis, Affinity chromatography, PCR/Gel electrophoresis, Temperature Gradient Gel Electrophoresis, Graduate PCR.
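The link between G-C content and denaturation temperature can be made concrete with a rough estimate. The sketch below uses the Wallace rule of thumb for short oligonucleotides, Tm ≈ 2(A+T) + 4(G+C) in °C; this rule is an assumption brought in for illustration and is not taken from the slides. The two sequences are the weight sequences W13 and W12 from the TSP representation.

```python
def gc_content(strand):
    """Fraction of bases in the strand that are G or C."""
    return sum(base in "GC" for base in strand) / len(strand)

def wallace_tm(strand):
    """Rule-of-thumb melting temperature in degrees C for a short oligomer."""
    at = sum(base in "AT" for base in strand)
    gc = sum(base in "GC" for base in strand)
    return 2 * at + 4 * gc

for name, weight_seq in (("W13", "GCCT"), ("W12", "ATCA")):
    print(name, gc_content(weight_seq), wallace_tm(weight_seq))
```

The G-C rich weight sequence gets the higher estimate, which is the property that a temperature gradient can exploit to separate strands by encoded weight.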
Experimental Results for 4-TSP
- Hybridization (37°C)
- Ligation (16°C, 15 hr)
- PCR (36 cycles)
- Gel electrophoresis (10% polyacrylamide gel)
[Fig.] Gel image with lanes: 50 bp marker, oligomer mixture, ligation result, final PCR result (140 bp).
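Molecular Theorem Prover
Resolution refutation method
Problem under consideration: given P ∧ Q → R, S ∧ T → Q, S, T, P, is R true?
Turn A → B into ¬A ∨ B and add ¬R as the negated goal: ¬P ∨ ¬Q ∨ R, ¬S ∨ ¬T ∨ Q, S, T, P, ¬R
[Fig.] Resolution tree: ¬Q ∨ ¬P ∨ R with P gives ¬Q ∨ R; ¬T ∨ ¬S ∨ Q with S gives ¬T ∨ Q; ¬T ∨ Q with T gives Q; ¬Q ∨ R with Q gives R; R with ¬R gives nil.
R is true!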
Molecular Theorem ProverMolecular Theorem Prover(Abstract Implementation)(Abstract Implementation)
Implementation 1 Implementation 2
¬S ¬T Q
¬Q ¬P R
P ¬R
TS
¬S ¬T Q¬Q ¬P R
P ¬R
TS
¬S ¬T Q¬Q ¬P R
P ¬RTS
R
¬Q
Q
¬P¬S
¬T ¬R
T SP
Molecular Theorem Prover (Experiments for Method 1)
Experimental procedure
I. Mix the molecules (100 pmol each, total 20 µl)
II. Denaturation (95°C, 10 min)
III. Annealing (95°C for 1 min, then down to 15°C at 1°C/min)
IV. Polyacrylamide gel electrophoresis (20%) (PAGE)
V. Detection of solution: 75 bp ds DNA
Experimental results
[Fig.] Gel image with lanes 1 to 6 and the 20 bp and 200 bp bands marked. Labels: 20 bp DNA marker (Takara), Mixture, Reaction.
Solving Logic Problems by Molecular Computing
Satisfiability Problem
- Find Boolean values for variables that make the given formula true
3-SAT Problem
- Every NP problem can be seen as the search for a solution that simultaneously satisfies a number of logical clauses, each composed of three variables.
[Eq.] Example formulas over x1, …, x6 in clause form, such as (x1 or x2 or x3) AND (x4 or x5 or x6).
DNA Computing with DNA Chips
DNA Chips for DNA Computing
I. Make: oligomer synthesis
II. Attach (Immobilized): 5’HS-C6-T15-CCTTvvvvvvvvTTCG-3’
III. Mark: hybridization
IV. Destroy: enzyme reaction (e.g., EcoRI)
V. Unmark
* Remove all strands that do not satisfy the problem
VI. Readout: if solutions remain at the last step of the N cycles, amplify them by PCR and confirm!
Variable Sequences and the Encoding Scheme
Three-dimensional Plot and Histogram of the Fluorescence
- S3: w=0, x=0, y=1, z=1
- S7: w=0, x=1, y=1, z=1
- S8: w=1, x=0, y=0, z=0
- S9: w=1, x=0, y=0, z=1
- y=1: (w ∨ x ∨ y) satisfied
- z=1: (w ∨ ¬y ∨ z) satisfied
- x=0 or y=1: (¬x ∨ y) satisfied
- w=0: (¬w ∨ ¬y) satisfied
Four spots with high fluorescence intensity correspond to the four expected solutions.
DNA sequences identified in the readout step via addressed array hybridization.
OutlookOutlook
IT gets a growing importance in the advancement of BT. Bioinformatics DNA Microarray Data Mining
IT can benefit much from BT. Biocomputing and Biochips DNA Computing (with DNA Chips)
Bioinformation technology (BIT) is essential as a next-generation information technology. In Silico Biology vs. In Vivo Computing
References
[Barash ’01] Barash, Y. and Friedman, N., Context-specific Bayesian clustering for gene expression data, Proc. of RECOMB’01, 2001.
[Butte ’97] Butte, A.J. et al., Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks, Proc. Natl Acad. Sci. USA, 94, 1997.
[Eisen ’98] Eisen, M.B. et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl Acad. Sci. USA, 95, 1998.
[Friedman ’00] Friedman, N. et al., Using Bayesian networks to analyze expression data, Proc. of RECOMB’00, 2000.
[Heckerman ’95] Heckerman, D. et al., Learning Bayesian networks: the combination of knowledge and statistical data, Machine Learning, 20(3), 1995.
[Hwang ’00] Hwang, K.-B. et al., Applying machine learning techniques to analysis of gene expression data: cancer diagnosis, CAMDA’00, 2000.
[Khan ’01] Khan, J. et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, 7(6), 2001.
[Margaritis ’00] Margaritis, D. and Thrun, S., Bayesian network induction via local neighborhoods, Proc. of NIPS’00, 2000.
[Shin ’00] Shin, H.-J. et al., Probabilistic models for clustering cell cycle-regulated genes in the yeast, CAMDA’00, 2000.
[Somogyi ’96] Somogyi, R. and Sniegoski, C.A., Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation, Complexity, 1(6), 1996.
[Tamayo ’99] Tamayo, P. et al., Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. Natl Acad. Sci. USA, 96, 1999.
More information at
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/