CSCI6904 Genomics and Biological Computing Seminar 1 – Sequencing a genome in one step. Projects and paper presentations

Upload: mckenzie-coleman

Post on 03-Jan-2016


Page 1: CSCI6904 Genomics and Biological Computing

CSCI6904

Genomics and Biological Computing

Seminar 1 – Sequencing a genome in one step.

Projects and paper presentations

Page 2: CSCI6904 Genomics and Biological Computing

Genetic engineering, four slides about it.

It is possible to cut and paste DNA into artificial constructs:

• Mutate on demand

• Make hybrids

• Neutralize toxic genes by expressing them in two parts

Michael Smith, UBC, Nobel Prize in Chemistry 1993

Page 3: CSCI6904 Genomics and Biological Computing

Ctrl-X Ctrl-V on DNA

Principle

• Palindromic sequence

• Cut and leave “sticky ends”

• Isolate on the basis of size

• Paste somewhere useful

Why do restriction enzymes exist?

To systematically destroy foreign DNA (viruses, parasites, etc.).
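The cut-and-size-sort principle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the EcoRI site GAATTC (cut after the first G) is chosen as an example and is not named on the slide, only the top strand is modelled, and the function names are hypothetical.

```python
# Assumption: EcoRI (GAATTC, cut between G and A) is used purely as an example enzyme.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic when it reads the same on both strands (5' to 3')."""
    return site == reverse_complement(site)


def digest(seq, site="GAATTC", cut_offset=1):
    """Cut the top strand cut_offset bases into every occurrence of site.

    Returns the fragments sorted by length, mimicking isolation by size.
    """
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return sorted(fragments, key=len)


if __name__ == "__main__":
    print(is_palindromic("GAATTC"))
    print(digest("AAGAATTCTTTTGAATTCC"))
```

Each internal fragment starts with AATT, the single-stranded overhang (the "sticky end") that lets it be pasted next to another fragment cut by the same enzyme.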

Page 4: CSCI6904 Genomics and Biological Computing

Genetic engineering

Plasmid

Artificial construct used to manipulate sequences.

Cloning

Make a copy of a segment of DNA.

Page 5: CSCI6904 Genomics and Biological Computing

Genetic engineering

Mutant

Sequence in which a copy error is introduced during DNA replication.

By the way…

Everybody is a mutant relative to their parents.

Page 6: CSCI6904 Genomics and Biological Computing

Genetic engineering

Site-directed mutagenesis

Mutations can be introduced on demand by using artificial DNA fragments with one or a few errors in them.
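As a rough illustration of the idea, the following Python sketch builds an artificial fragment (oligo) that matches a template everywhere except at one chosen position. The function names and the `flank` parameter are hypothetical, and real primer design involves many more constraints than shown here.

```python
def mutate(template, position, new_base):
    """Return a copy of template with new_base substituted at position (0-based)."""
    if not 0 <= position < len(template):
        raise IndexError("position outside the template")
    if new_base not in "ACGT":
        raise ValueError("new_base must be one of A, C, G, T")
    return template[:position] + new_base + template[position + 1:]


def mutagenic_oligo(template, position, new_base, flank=10):
    """Return an oligo carrying the mutation, with up to flank matching bases on each side."""
    mutant = mutate(template, position, new_base)
    start = max(0, position - flank)
    return mutant[start:position + flank + 1]


if __name__ == "__main__":
    template = "AAAAACCCCCGGGGGTTTTT"
    print(mutagenic_oligo(template, 10, "T", flank=3))
```

The flanking bases still pair with the template, so the fragment can anneal despite the single mismatch, and copies made from it carry the change.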

Page 7: CSCI6904 Genomics and Biological Computing

Genetic engineering

Examples

1. Environmental bioremediation
   1. PCB-degrading bacteria
   2. Oil-eating bacteria
   3. Plastic-degrading bacteria, maybe someday.

2. Genetically modified organisms
   1. Human insulin-producing pigs.
   2. Antifreeze produce.
   3. Long shelf-life tomatoes.
   4. High-fat or specialized-fat grains.

The bulk of the ethical issues may be largely due to a natural resistance to new technologies.

Page 8: CSCI6904 Genomics and Biological Computing

Polymerase Chain Reaction

Principle

Exponential amplification of a single piece of DNA.

•Forensic science -> DNA evidence.

•Rapid medical diagnostics

•Jurassic Park (dinosaur blood in amber samples)

Page 9: CSCI6904 Genomics and Biological Computing

DNA replication

Needs (in vitro):

- Template DNA

- Polymerase

- Primers

- Nucleotides in solution

Page 10: CSCI6904 Genomics and Biological Computing

Polymerase Chain Reaction

Principle

Exponential amplification

Taq
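The exponential principle can be made concrete with a small calculation: if every cycle copies a fraction `efficiency` of the molecules present, then after n cycles there are N0 × (1 + efficiency)^n copies. The sketch below assumes this idealized model (no plateau, constant efficiency); the function names are illustrative only.

```python
import math


def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles (efficiency 1.0 = perfect doubling)."""
    return initial_copies * (1.0 + efficiency) ** cycles


def cycles_needed(initial_copies, target_copies, efficiency=1.0):
    """Smallest whole number of cycles that reaches target_copies in the idealized model."""
    ratio = target_copies / initial_copies
    return math.ceil(math.log(ratio) / math.log(1.0 + efficiency))


if __name__ == "__main__":
    print(pcr_copies(1, 30))             # a single template molecule, 30 perfect cycles
    print(cycles_needed(1, 1_000_000))   # cycles to reach a million copies
```

With perfect doubling, a single molecule passes a billion copies in 30 cycles, which is why trace samples are enough for forensic or diagnostic work.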

Page 11: CSCI6904 Genomics and Biological Computing

Paper presentation

Problem

Why this is a problem of interest.

Abstraction

Translate your problem into something that has nothing to do with biology anymore. Setting the “specifications” of the problem.

How it was done

Methods, results and conclusions.

Validation

Page 12: CSCI6904 Genomics and Biological Computing

GenoFrag

Problem

Need to rapidly amplify a whole genome to quickly identify differences. This can be particularly useful in an outbreak whose cause is still unknown and that comes from a bacterium or virus not known to have such harmful characteristics.

Examples of this would be:

E. coli contamination of the water supply (Walkerton, May 2000).

Flesh-eating disease (S. aureus)

SARS

HIV

Influenza

General case of drug resistance.

Page 13: CSCI6904 Genomics and Biological Computing

GenoFrag

Abstraction

Need to be able to scan whole genomes very fast.

Assume that the target organism is very similar to one whose genomic sequence we already know.

Target of interest: S. aureus, a ubiquitous yet sometimes extremely harmful bacterium (mutations, inversions, deletions, transposons, …).

Technique: long-range PCR.

Problem: need a lot of oligonucleotide primer pairs to cover the entire sequence.

Page 14: CSCI6904 Genomics and Biological Computing

GenoFrag

Problem with primers

- 25 nt long

- Given G+C content

- No character repeat longer than N

- No “hairpins”

- Correct Tm

- Self- and inter-complementarity

- Unique (representative of all sub-species)

- Sufficiently informative

- Equally spaced

- Maximally covering all the dataset
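A few of the sequence-level constraints listed above can be expressed as simple filters. The sketch below is an illustration only, not the filter set of GenoFrag: the thresholds (G+C between 40% and 60%, runs of at most 4, a 4-base stem with a loop of at least 3) are assumptions, and Tm, uniqueness and spacing are left out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    return (primer.count("G") + primer.count("C")) / len(primer)


def longest_run(primer):
    """Length of the longest stretch of one repeated character."""
    best = run = 1
    for prev, cur in zip(primer, primer[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def has_hairpin(primer, stem=4, min_loop=3):
    """Crude hairpin test: a stem-long word whose reverse complement occurs
    at least min_loop bases downstream could fold back on itself."""
    for i in range(len(primer) - 2 * stem - min_loop + 1):
        word = primer[i:i + stem]
        if reverse_complement(word) in primer[i + stem + min_loop:]:
            return True
    return False


def passes_filters(primer, length=25, gc_range=(0.40, 0.60), max_run=4):
    """Apply the length, G+C, repeat and hairpin filters (assumed thresholds)."""
    return (len(primer) == length
            and gc_range[0] <= gc_content(primer) <= gc_range[1]
            and longest_run(primer) <= max_run
            and not has_hairpin(primer))


if __name__ == "__main__":
    print(passes_filters("ACACCAACCACACAACCACCAACAC"))
    print(has_hairpin("GGGGAAACCCC"))
```

Applying cheap filters like these to every 25-mer of a reference genome leaves a pool of candidate primers from which well-spaced pairs can then be selected.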

Page 15: CSCI6904 Genomics and Biological Computing

GenoFrag

Problem with coverage

Pairs must be roughly equally spaced (9-11 kbp).

Must overlap to some extent.

Minimize the number of pairs.

ALGORITHMS:

Shortest path problem.

Single Traverse Algorithm.
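To show how the coverage constraints become a shortest-path problem, here is a minimal sketch that is not the algorithm used in GenoFrag. It assumes each candidate primer pair is already reduced to a fragment `(start, end)`, links two fragments when the second starts inside the first (overlap) and extends further, and uses breadth-first search so that the first complete tiling found uses the fewest fragments. The function name and the `min_overlap` parameter are hypothetical.

```python
from collections import deque


def min_fragment_cover(fragments, genome_length, min_overlap=1):
    """Return the shortest chain of overlapping fragments covering
    [0, genome_length), or None if no such chain exists."""
    frags = sorted(set(fragments))
    previous = {}          # fragment index -> index of its predecessor in the chain
    queue = deque()
    for i, (start, _end) in enumerate(frags):
        if start == 0:     # chains must begin at the start of the sequence
            previous[i] = None
            queue.append(i)
    while queue:
        i = queue.popleft()
        start_i, end_i = frags[i]
        if end_i >= genome_length:
            chain = []
            while i is not None:
                chain.append(frags[i])
                i = previous[i]
            return chain[::-1]
        for j, (start_j, end_j) in enumerate(frags):
            if j in previous:
                continue
            # Edge i -> j: j starts inside i (with enough overlap) and extends beyond it.
            if start_i < start_j <= end_i - min_overlap and end_j > end_i:
                previous[j] = i
                queue.append(j)
    return None


if __name__ == "__main__":
    candidates = [(0, 10), (0, 6), (5, 12), (8, 20), (11, 20), (9, 15), (14, 20)]
    print(min_fragment_cover(candidates, 20))
```

Because every edge points toward a fragment that ends further along the sequence, the graph is acyclic and laid out along the genome, so the search never has to revisit a fragment.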

Page 17: CSCI6904 Genomics and Biological Computing

GenoFrag

The filtered potential primers are reasonably well distributed along the DNA sequence.

Page 18: CSCI6904 Genomics and Biological Computing

GenoFrag

Optimizing

SSP

- 10K fragments

- Maximal coverage

SITA

- Equally sized fragments

- Maximal coverage

Page 19: CSCI6904 Genomics and Biological Computing

GenoFrag

Graph abstraction of the problem

Page 20: CSCI6904 Genomics and Biological Computing

GenoFrag

Graph abstraction of the problem

Hence the linearity of the problem.

SSP graph

Page 21: CSCI6904 Genomics and Biological Computing

GenoFrag

Graph abstraction of the problem

Hence the linearity of the problem.

SITA graph

Page 22: CSCI6904 Genomics and Biological Computing

GenoFrag

Results

Performance

- 40 s for 2.8 Mb on a 1.6 GHz PC

Fails when…

- Large insertions are present**

- Use is less general if filter 7 is omitted.

But otherwise:

- 1-step genome (or part of one) amplification seems possible.

Page 23: CSCI6904 Genomics and Biological Computing

Paper presentation

What to choose:

Try searching for a method or field of your choice and add “bioinformatics” to the query. This will return a selection of papers related to something you will enjoy talking about.

Examples

Bioinformatic resources (database, interface, services)

Large-scale projects (Folding@home, BlueGene, HGP)

APIs for bioinformatics (BioPerl, Biopython, NCBI toolkit, etc.)

Machine learning applications (detection, prediction, method validation)

Parallel computing in bioinformatics

Algorithm and application to a specific question

Theoretical papers and simulators *

Methods and applications *

*would require a bit more background knowledge.

Page 24: CSCI6904 Genomics and Biological Computing

Project ideas

A protein contact map using Voronoi triangulation

Problem

Proteins have fairly complex 3D chain paths. Many structural bioinformatics methods require knowing which amino acids are in contact with each other.

Such contact maps are usually implemented by applying a cutoff filter to a distance matrix between arbitrarily chosen centroids.

A contact map based on whether two amino acids share a surface in a Voronoi diagram would be a nice, general-purpose bioinformatics tool.
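For comparison, the usual cutoff-based contact map mentioned above takes only a few lines. This sketch assumes one centroid per amino acid given as an (x, y, z) tuple in ångströms and an arbitrary 8 Å cutoff; both choices are illustrative assumptions, and the Voronoi-based map proposed for the project is not implemented here.

```python
import math


def contact_map(centroids, cutoff=8.0):
    """Return a symmetric boolean matrix: True where two centroids lie within cutoff."""
    n = len(centroids)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centroids[i], centroids[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


if __name__ == "__main__":
    toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
    for row in contact_map(toy_chain):
        print(row)
```

The result depends entirely on the chosen centroid and cutoff, which is the arbitrariness a shared-surface criterion would avoid.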

References

http://bioinformatics.oupjournals.org/cgi/content/abstract/bth365v1

Page 25: CSCI6904 Genomics and Biological Computing

Project ideas

A lattice-based protein folding simulator

Problem

Proteins are computationally expensive to model using computational chemistry methods. Further, these models are empirical and have a rather limited scope, which does NOT include protein folding.

A 2D/3D abstraction of the protein chain exists, in which each amino acid is fixed to one vertex of a cubic lattice. With this simulation environment, seriously cool experiments can be performed, especially if the environment is efficient.

This type of simulation is actually simple compared to the simulation of physical systems.
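A minimal 2D version of such a lattice model fits in a short sketch. The scoring used here, the two-letter hydrophobic/polar (HP) scheme where each non-bonded H-H contact counts -1, is an assumption chosen for illustration and is not specified on the slide; the move encoding and function names are likewise hypothetical.

```python
# Assumption: HP model on a 2D square lattice; each non-bonded H-H contact scores -1.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold(moves):
    """Turn a string of moves into lattice coordinates, or None if the chain overlaps itself."""
    x = y = 0
    coords = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        coords.append((x, y))
    if len(set(coords)) != len(coords):
        return None
    return coords


def energy(sequence, moves):
    """Energy of a conformation: minus the number of non-bonded H-H lattice contacts."""
    coords = fold(moves)
    if coords is None or len(coords) != len(sequence):
        raise ValueError("invalid conformation for this sequence")
    index_at = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


if __name__ == "__main__":
    print(energy("HPPH", "RUL"))   # chain bent into a square
    print(energy("HPPH", "RRR"))   # fully extended chain
```

Since a conformation is just a short string of moves, enumerating or randomly mutating conformations is cheap, which is what makes large-scale experiments feasible.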

References

Page 26: CSCI6904 Genomics and Biological Computing

Project ideas

Finding phylogenetically informative sites using matrix decomposition (PCA, ICA, …)

Problem

An in-house phylogenetic application is generating solution pools of phylogenetic trees. This creates matrices with trees as columns and sites (data points) as rows.

There is an interest in figuring out which rows are the most informative for discriminating between trees, or whether it is possible to identify clusters of data points that show some level of dependence on each other (they are assumed to be independent). Identifying such clusters could be used to find regions of recombination, for example.
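One possible starting point is the first principal component of the sites-by-trees matrix. The sketch below is an assumption-laden illustration: it supposes each entry is a numeric score of one site under one tree (for example a per-site log-likelihood, which the slide does not specify), centers each site across trees, and uses plain power iteration in place of a full PCA library. The function name is hypothetical.

```python
import math


def site_loadings(matrix, iterations=200):
    """Loadings of each site (row) on the first principal component.

    matrix[i][j] is the score of site i under tree j. Sites with large
    absolute loadings vary most, in a correlated way, across the trees.
    """
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    n_sites, n_trees = len(centered), len(centered[0])
    # Deterministic starting vector; an unlucky start could be orthogonal to the answer.
    v = [float(i + 1) for i in range(n_sites)]
    for _ in range(iterations):
        # Multiply by (Xc Xc^T) in two steps: t = Xc^T v, then w = Xc t.
        t = [sum(centered[i][j] * v[i] for i in range(n_sites)) for j in range(n_trees)]
        w = [sum(centered[i][j] * t[j] for j in range(n_trees)) for i in range(n_sites)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n_sites
        v = [x / norm for x in w]
    return v


if __name__ == "__main__":
    scores = [
        [1.0, 1.0, 1.0],    # identical under every tree: uninformative
        [0.0, 5.0, 10.0],   # strongly tree-dependent
        [2.0, 2.1, 1.9],    # almost constant
    ]
    print(site_loadings(scores))
```

Sites that load together on the same component are candidates for the dependent clusters described above; ICA or further components would be needed to separate several such clusters.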

References

Page 27: CSCI6904 Genomics and Biological Computing

Project ideas

Computing the shortest path of topological rearrangements between two binary tree topologies

Problem

Information about the optimization landscape in phylogeny is scarce. Drawing paths between solutions would make it possible to plot cross-sections of the search space and assess its shape in various locations.

References

Felsenstein, Inferring phylogenies, 2003, Sinauer Eds.

Page 28: CSCI6904 Genomics and Biological Computing

Project ideas

Sequence Harvesters

Problem

Gathering sequence information from GenBank is a time-consuming task. Typically, one starts with a sequence of interest, queries the database with BLAST, chooses the largest set of non-redundant sequences and pastes them into a file. The sequences have to be renamed one at a time and checked for duplicates. This could be automated and run in a few seconds (instead of a few hours of fingerwork).

A suggested platform for this project would be Biopython.
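The renaming and duplicate-checking part of this workflow can be sketched with the standard library alone. This is not a Biopython example and does not query BLAST or GenBank; it assumes the hits have already been saved as FASTA text, and the naming scheme (`seq_001`, `seq_002`, …) is an arbitrary choice.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def harvest(text, prefix="seq"):
    """Drop duplicate sequences (case-insensitive) and rename the rest consecutively.

    The original header is kept after the new name so each record stays traceable.
    """
    seen, kept = set(), []
    for header, sequence in parse_fasta(text):
        key = sequence.upper()
        if key in seen:
            continue
        seen.add(key)
        kept.append(f">{prefix}_{len(kept) + 1:03d} {header}\n{sequence}")
    return "\n".join(kept)


if __name__ == "__main__":
    raw = ">a\nACGT\n>b\nacgt\n>c\nGG\nCC\n"
    print(harvest(raw))
```

A full harvester would put a BLAST query and record download in front of this step, which is where a toolkit such as Biopython would come in.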

References

See www.biopython.org

Page 29: CSCI6904 Genomics and Biological Computing

Project ideas

Using XML for validating protein structures

Problem

Protein 3D structures are stored in files in the PDB format. This format is regularly abused. As a result, it is hard to parse the files directly. There are cases of missing information, duplicated atoms, omitted labels and non-standard labels. This project would explore using the new PDBML format as a data source instead of the PDB flat files.
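The kind of check this project aims at can be illustrated on a toy XML layout. The `<atom>` element and its attributes below are invented for this sketch and are not the real PDBML schema, which would have to be looked up before writing an actual validator; only the standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layout: one <atom> element per atom with these attributes.
REQUIRED = ("chain", "residue", "number", "name")


def validate_atoms(xml_text):
    """Return a list of problems: atoms with missing labels and duplicated atoms."""
    problems, seen = [], set()
    for atom in ET.fromstring(xml_text).iter("atom"):
        missing = [attr for attr in REQUIRED if not atom.get(attr)]
        if missing:
            problems.append(f"atom {atom.get('id')}: missing {', '.join(missing)}")
            continue
        key = tuple(atom.get(attr) for attr in REQUIRED)
        if key in seen:
            problems.append(f"atom {atom.get('id')}: duplicate of {'/'.join(key)}")
        seen.add(key)
    return problems


if __name__ == "__main__":
    sample = """
    <atoms>
      <atom id="1" chain="A" residue="GLY" number="1" name="N"/>
      <atom id="2" chain="A" residue="GLY" number="1" name="N"/>
      <atom id="3" chain="A" residue="GLY" number="1"/>
    </atoms>
    """
    for problem in validate_atoms(sample):
        print(problem)
```

With an XML source, the parser handles the low-level syntax, so the validation code can concentrate on content problems such as the duplicates and missing labels listed above.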

References

See PDBML from www.pdb.org

The main ref isn’t available yet to Dalhousie...

Page 30: CSCI6904 Genomics and Biological Computing

Project ideas

This is not a closed list!

Text Mining and conceptual Biology.

Gene expression and clustering.

Protein topologies and similarity detection.

Clustering 3D structures.

Sequence patterns near variable regions in protein sequences.