
Upload: rebecca-cain

Post on 02-Jan-2016


CSCI 6900/4900 Special Topics in Computer Science

Automata and Formal Grammars for Bioinformatics

Bioinformatics problems

• sequence comparison

• pattern/structure search

• pattern/structure recognition

• relationship of sequences

Algorithm design

• optimal algorithms

• heuristic algorithms

• parallel algorithms

Probabilistic models

• stochastic finite state automata (HMMs)

• stochastic regular grammars

• stochastic context-free grammars

• more complex grammar models


Probabilistic modeling and algorithms

M: a model of a family of sequences (e.g., RNA) that captures certain properties Q1, Q2, …

(1) Each sequence x possesses a property Qk(x) with probability Pk(x)

(2) For each sequence x, the values Pk(x) form a probability distribution over the properties, i.e., ∑k Pk(x) = 1 for each given x

(3) The most likely property Q*(x) is the one with the highest probability, i.e., Q*(x) = arg maxk { Pk(x) }

(4) Algorithms are designed to find the most likely property for given sequences. But how?
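Points (2) and (3) can be illustrated with a minimal sketch. It assumes the model has already produced one nonnegative score per property for a given sequence x; how those scores are computed is the subject of the course and is not shown here. The function name and the property labels in the usage note are hypothetical.

```python
def most_likely_property(scores):
    """Normalize per-property scores for one sequence x and pick the arg max.

    `scores` maps a property label k to a nonnegative score for x.
    The labels and scores are illustrative assumptions, not course data.
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    # Point (2): P_k(x) sums to 1 over the properties k.
    probs = {k: s / total for k, s in scores.items()}
    # Point (3): Q*(x) = arg max_k P_k(x).
    best = max(probs, key=probs.get)
    return best, probs
```

For example, `most_likely_property({"stem-loop": 3.0, "pseudoknot": 1.0, "unstructured": 1.0})` selects "stem-loop" with probability 0.6. For realistic models the set of properties (e.g., all alignments or all structures) is far too large to enumerate this way, which is why dedicated algorithms are needed.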

Modeling mechanism M: computational linguistic systems can describe the desired properties of biological sequences

D (sample, training data): used for assigning probabilities


Outline for the course

• Part 0: molecular biology basics and review of probability theory

• Part 1: pairwise alignment, HMMs, profile-HMMs, gene finding, and multiple alignment (chapters 1-6)

potential research projects: efficient HMM algorithms, gene finding

• Part 2: RNA stem-loops, SCFG, secondary structure prediction, structural homology search (chapters 9-10)

potential research projects: efficient SCFG algorithms, pseudoknot prediction, protein secondary structure prediction

• Part 3: phylogeny reconstruction, probabilistic approaches (chapters 7-8)

potential research projects: grammar modeling of evolution


The ways this course is to be conducted

• To learn new concepts and techniques

Lectures (by the instructor and students)

• To apply learned knowledge to research

Research discussions (led by students and the instructor)

• To demonstrate learning effectiveness

Presentations of research results (by students)


The central dogma of molecular biology


Nucleotides

• Purines: adenine, guanine

• Pyrimidines: cytosine, thymine

Building blocks of DNA


Double helix of DNA


DNA replication


Genetic code


Mutations

(1) synonymous

(2) missense

(3) nonsense

(4) frame-shift
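The four mutation types can be told apart by comparing a reference coding sequence with its mutated version. The following is a minimal sketch, assuming uppercase DNA coding sequences that start in frame and the standard genetic code; the function names are hypothetical, and the extra labels "no change" and "in-frame indel" are added here only to cover inputs outside the four listed types.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate complete codons of an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def classify_mutation(ref_cds, alt_cds):
    """Classify a mutation by comparing reference and mutated coding sequences."""
    if ref_cds == alt_cds:
        return "no change"
    if len(ref_cds) != len(alt_cds):
        # An insertion or deletion whose length is not a multiple of 3
        # shifts the reading frame of everything downstream.
        shifted = (len(ref_cds) - len(alt_cds)) % 3 != 0
        return "frame-shift" if shifted else "in-frame indel"
    ref_protein, alt_protein = translate(ref_cds), translate(alt_cds)
    if ref_protein == alt_protein:
        return "synonymous"  # different codon, same amino acid
    for r, a in zip(ref_protein, alt_protein):
        if r != a:
            # The first changed residue decides: a new stop codon is nonsense,
            # any other amino-acid change is treated as missense (a simplification).
            return "nonsense" if a == "*" else "missense"
```

For example, GAA to GAG keeps glutamate (synonymous), GAA to GTA changes glutamate to valine (missense), TGG to TGA turns tryptophan into a stop (nonsense), and deleting one base from ATGGAA shifts the frame.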


RNA synthesis


RNA synthesis (cont'd)


RNA can fold back on itself


Protein synthesis


Biological information flow

Genome

AGACGCTGGTATCGCATTAACTAACGGGTTACTCGGATATTACCTTACTATAGGGCGCTATCGCGCGTTAATCTGGTATC

Introns, exons

Gene sequence

Protein sequence

Protein structure

Regulatory DNA sequence

Sequence family

Structure family

Protein-DNA interactions

Protein-protein interactions

Gene regulation

Gene expression

Protein function

Protein abundance

Cellular role


What bioinformatics is NOT:

• Not just using a computer to speed up biology

• Not just applying computer algorithms to biology

• Not just the accountant of genomic data

What bioinformatics is then:

• The creative use of computers to define and solve central biological puzzles

• The computer becomes a hypothesis machine, making predictions to be tested at the bench.