Post on 03-Jan-2016


Page 1: Filling in: Ioannis Pandis, PhD i.pandis@ic.ac.uk CO341: Introduction to Bioinformatics Prof. Yi-Ke Guo (yg@ic.ac.uk)

Filling in:

Ioannis Pandis, PhD

i.pandis@ic.ac.uk

CO341: Introduction to Bioinformatics

Prof. Yi-Ke Guo (yg@ic.ac.uk)

Page 2:

Sequencing and Genomics

DNA Sequencing

Sequencing Analysis

Gene Expression

Gene Expression Analysis

Functional Genomics

Page 3:

DNA Structure

Double Helix (Crick & Watson)
– 2 coiled matching strands
– Backbone of sugar-phosphate pairs

Nitrogenous Base Pairs
– Roughly 20 atoms in a base
– Adenine pairs with Thymine [A, T]
– Cytosine pairs with Guanine [C, G]
– Weak bonds (can be broken)
– Form long chains called polymers

Read the sequence on 1 strand
– GATTCATCATGGATCATACTAAC

Page 4:

Differences in DNA

[Figure: shared genetic material; labels: "tiny", "2%", "roughly 4%"]

DNA differentiates:
– Species/race/gender
– Individuals

We share DNA with
– Primates, mammals
– Fish, plants, bacteria

Genotype
– DNA of an individual
– Genetic constitution

Phenotype
– Characteristics of the resulting organism
– Nature and nurture

Page 5:

Genes

Chunks of DNA sequence
– Often quoted as between 600 and 1,200 bases long (actual lengths vary widely)
– Roughly 22,000 human genes, 100,000 genes in tulips

Large percentage of human genome
– Termed “junk”: does not code for proteins

“Simpler” organisms such as bacteria
– Are much more “evolved” (have hardly any junk)
– Viruses have overlapping genes (zipped/compressed)

Often the active part of a gene is split into exons
– Separated by introns

Page 6:

Transcription

Take one strand of DNA
Write out the counterparts to each base
– G becomes C (and vice versa)
– A becomes T (and vice versa)
Change Thymine [T] to Uracil [U]
You have transcribed DNA into messenger RNA

Example:
Start:        GGATGCCAATG
Intermediate: CCTACGGTTAC
Transcribed:  CCUACGGUUAC
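The transcription recipe on this slide can be written as a minimal Python sketch. The names `COMPLEMENT`, `complement` and `transcribe` are illustrative, not from the lecture. The sketch follows the slide's simplified procedure (complement each base in place, then swap T for U) and does not model strand direction.

```python
# Complement of each DNA base, as listed on the slide.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna: str) -> str:
    """Write out the counterpart of each base (no reversal, as on the slide)."""
    return "".join(COMPLEMENT[base] for base in dna.upper())


def transcribe(dna: str) -> str:
    """Complement one strand, then change thymine (T) to uracil (U)."""
    return complement(dna).replace("T", "U")


print(complement("GGATGCCAATG"))  # CCTACGGTTAC
print(transcribe("GGATGCCAATG"))  # CCUACGGUUAC
```

Both printed strings match the "Intermediate" and "Transcribed" lines of the slide's example.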

Page 7:

The Synthesis of Proteins

Instructions for generating Amino Acid sequences
– (i) DNA double helix is unzipped
– (ii) One strand is transcribed to messenger RNA
– (iii) RNA acts as a template: ribosomes translate the RNA into the sequence of amino acids

Amino acid sequences fold into a 3D molecule

Gene expression
– Every cell has every gene in it (has all chromosomes)
– Which ones produce proteins (are expressed) & when?

Page 8:

Genetic Code

How the translation occurs

Think of this as a function:
– Input: triples of base letters (codons)
– Output: amino acid
– Example: ACC becomes threonine (T)

Gene sequences end with:
– TAA, TAG or TGA (the stop codons)

Page 9:

Example Synthesis

TCGGTGAATCTGTTTGAT

Transcribed to:

AGCCACUUAGACAAACUA

Translated to:

SHLDKL
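The codon-to-amino-acid function can be sketched in Python as a lookup table. This is a minimal sketch assuming the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative, and `*` marks a stop codon (UAA, UAG and UGA in mRNA, i.e. TAA, TAG and TGA in DNA letters).

```python
from itertools import product

# Standard genetic code, written in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time; stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AGCCACUUAGACAAACUA"))  # SHLDKL
print(translate("ACC"))                 # T (threonine)
```

The first call reproduces the slide's example. Any trailing one or two letters that do not form a full codon are ignored.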

Page 10:

Evolution of Genes: Inheritance

Evolution of species
– Caused by reproduction and survival of the fittest

But actually, it is the genotype which evolves
– Organism has to live with it (or die before reproduction)
– Three mechanisms: inheritance, mutation and crossover

Inheritance: properties from parents
– Embryo has cells with 23 pairs of chromosomes (in humans)
– Each pair: 1 chromosome from father, 1 from mother
– Most important factor in offspring’s genetic makeup

Page 11:

Evolution of Genes: Mutation

Genes alter (slightly) during reproduction
– Caused by errors, from radiation, from toxicity
– 3 possibilities: deletion, insertion, substitution

Substitution: ACGTTGACTC → ACGATGACTT
Deletion:     ACGTTGACTC → ACGTGACTC
Insertion:    ACGTTGACTC → AGCGTTGACTC
– Frameshift: ACGTTGACTC → AGCGTTGACTC

Mutations are categorised into:
– Neutral or
– Deleterious

A single change can have a massive effect on translation
– Causes a different protein conformation
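The three mutation types can be sketched as small string operations. The function names and 0-based positions are illustrative assumptions, not from the lecture. The hypothetical helper `codons` shows why an insertion or deletion shifts the reading frame while a substitution does not.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq: str, pos: int) -> str:
    """Remove the base at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]


def insert(seq: str, pos: int, base: str) -> str:
    """Insert a base before 0-based position pos."""
    return seq[:pos] + base + seq[pos:]


def codons(seq: str) -> list:
    """Split a sequence into complete triplets, dropping any leftover bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


original = "ACGTTGACTC"
print(substitute(substitute(original, 3, "A"), 9, "T"))  # ACGATGACTT
print(delete(original, 3))                               # ACGTGACTC
print(insert(original, 1, "G"))                          # AGCGTTGACTC

# Frameshift: every codon after the inserted base changes.
print(codons(original))                  # ['ACG', 'TTG', 'ACT']
print(codons(insert(original, 1, "G")))  # ['AGC', 'GTT', 'GAC']
```

Note that the slide's substitution example changes two positions, so the sketch applies `substitute` twice to reproduce it.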

Page 12:

Evolution of Genes: Crossover (Recombination)

DNA sections are swapped
– From male and female genetic input to offspring DNA

Page 13:

Sequencing for Medical Study

Phenotype

Genotype

Hypothesis

Test Hypothesis by Genetic Manipulation

Page 14:

Typical Cycle of the Study

Phenotype
– Two groups: 1. Develop colorectal cancer at young age; 2. Do not

Genotype
– Mutation in APC gene

Hypothesis:
– APC is a tumor suppressor gene

Test Hypothesis by Genetic Manipulation
– Delete APC in mouse
– Control: isogenic APC+

Page 15:

Technologies Required

Phenotype
– Observation

Genotype
– ?Sequencing?
– In 2005: $9 million/genome
– Not feasible

Hypothesis
– Reading/Thinking

Test Hypothesis by Genetic Manipulation
– Gene Deletion/Replacement

Page 16:

Things are changing rapidly: bp per dollar increases exponentially with time

Adapted from Shendure et al 2004

In 1980, the sequencing cost per finished bp ≈ $1.00
In 2003, the sequencing cost per finished bp ≈ $0.01

=> a 100-fold reduction in 20-25 years
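To make "exponentially" concrete, the two cost figures on this slide can be turned into a halving time. This is a rough sketch that assumes a constant exponential decline between 1980 and 2003; the function name `halving_time` is illustrative.

```python
import math


def halving_time(cost_start: float, cost_end: float, years: float) -> float:
    """Years needed for the cost to halve, assuming a steady exponential decline."""
    fold_reduction = cost_start / cost_end
    return years / math.log2(fold_reduction)


# $1.00 per finished bp in 1980, $0.01 in 2003 (figures from the slide).
print(round(halving_time(1.00, 0.01, 2003 - 1980), 2))  # 3.46
```

Under that assumption, the cost per finished base pair halved roughly every three and a half years over this period.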

Page 17:

History of DNA Sequencing

[Timeline figure, 1870 to 2008, with sequencing efficiency in bp/person/year]

Years marked: 1870, 1928 (???), 1940, 1953, 1965, 1970, 1977, 1980, 1986, 1990, 2002, 2008

Milestones:
– Miescher: Discovers DNA
– Avery: Proposes DNA as ‘Genetic Material’
– Watson & Crick: Double Helix Structure of DNA
– Holley: Sequences Yeast tRNA-Ala
– Wu: Sequences Cohesive End DNA
– Sanger: Dideoxy Chain Termination
– Gilbert: Chemical Degradation
– Messing: M13 Cloning
– Hood et al.: Partial Automation
– Cycle Sequencing; Improved Sequencing Enzymes; Improved Fluorescent Detection Schemes
– Next Generation Sequencing; Improved enzymes and chemistry; Improved image processing

Efficiency (bp/person/year): 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000

Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)

Page 18:

History of DNA Sequencing

Griffith's experiment, reported in 1928 by Frederick Griffith

Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)


Page 20:

Sanger Sequencing (Chain-termination Methods)

DNA is fragmented
Cloned to a plasmid vector
Cyclic sequencing reaction
Separation by electrophoresis
Readout with fluorescent tags

Page 21:

Basics of the “old” technology

Clone the fragmented DNA.
Generate a ladder of labeled (colored) molecules that differ by 1 nucleotide.
Separate mixture on some matrix.
Detect fluorochrome by laser.
Interpret peaks as string of DNA.
Strings are 500 to 1,000 letters long.
Assemble all strings into a genome.

The Process Is Sequential

Page 22:

Genome size: 3 × 10^9 bp (1x coverage)

Time: (10x coverage × 3 × 10^9 bp) / (2 × 10^6 bp/day) = 15,000 days ≈ 40 years

Cost: 10x coverage × 3 × 10^9 bp × $0.001/bp = $30 million

That is what the old technology takes
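The estimate on this slide can be checked with a few lines of arithmetic. The constant and function names are illustrative; the numbers are the ones shown on the slide (10x coverage of a 3 × 10^9 bp genome, 2 × 10^6 bp/day, $0.001/bp).

```python
GENOME_BP = 3 * 10**9               # human genome size used on the slide
COVERAGE = 10                       # each base sequenced about 10 times
THROUGHPUT_BP_PER_DAY = 2 * 10**6   # throughput figure from the slide
COST_PER_BP = 0.001                 # dollars per sequenced base


def sequencing_days(genome_bp: int, coverage: int, bp_per_day: int) -> float:
    """Days needed to sequence the genome at the given coverage."""
    return genome_bp * coverage / bp_per_day


def sequencing_cost(genome_bp: int, coverage: int, cost_per_bp: float) -> float:
    """Total cost in dollars at the given coverage."""
    return genome_bp * coverage * cost_per_bp


days = sequencing_days(GENOME_BP, COVERAGE, THROUGHPUT_BP_PER_DAY)
print(days, "days, about", round(days / 365), "years")
print(round(sequencing_cost(GENOME_BP, COVERAGE, COST_PER_BP)), "dollars")
```

The result is 15,000 days (about 41 years, which the slide rounds to 40) and a cost of $30 million.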

Page 23:

New Generation Sequencing

Page 24:

Basics of the “new” technology

Get DNA and fragment it.
Attach all fragments to glass slides.
Perform amplification by some form of PCR.
Sequence ALL these fragments in PARALLEL using chain termination or other methods such as pyrosequencing.
Extend and amplify signal with some color scheme.
Detect fluorochrome by microscopy.
Interpret series of spots as short strings of DNA.
Strings are 30-300 letters long.
Multiple images are interpreted as 0.4 to 1.2 GB/run (1,200,000,000 letters/day).
Map or align strings to one or many genomes.

Making Millions of Short Sequence Reads in Parallel

Page 25:

Technology Overview: Solexa/Illumina Sequencing

http://www.illumina.com/

Page 26:

Immobilize DNA to Surface

Source: www.illumina.com

Page 27:

Sequence Colonies

Page 28:

Sequence Colonies

Page 29:

Call Sequence

Page 30:

From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4

Page 31:

Sequence Alignment

Meyerson et al, 2011

Page 32:

So, how fast is cost going down?

2006: $10 million
2008: $100,000
2009: $10,000
2010: $5,000
2012: $1,000
???: $100
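One way to answer the question is to compute the average fold reduction per year between consecutive data points. This is a sketch assuming the dollar figures are costs per genome and that the decline is steady within each interval. The names `COST_PER_GENOME` and `yearly_fold_reductions` are illustrative, and the undated $100 entry is left out.

```python
# Figures from the slide, assumed to be cost per genome in dollars.
COST_PER_GENOME = {
    2006: 10_000_000,
    2008: 100_000,
    2009: 10_000,
    2010: 5_000,
    2012: 1_000,
}


def yearly_fold_reductions(costs: dict) -> list:
    """Return (start_year, end_year, average fold reduction per year) per interval."""
    years = sorted(costs)
    result = []
    for earlier, later in zip(years, years[1:]):
        fold = costs[earlier] / costs[later]
        per_year = fold ** (1 / (later - earlier))
        result.append((earlier, later, per_year))
    return result


for earlier, later, per_year in yearly_fold_reductions(COST_PER_GENOME):
    print(f"{earlier}-{later}: about {per_year:.1f}x cheaper per year")
```

On these figures the cost fell about 10-fold per year between 2006 and 2009, then slowed to roughly 2-fold per year.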

Page 33:

Informatics

Informatics challenge: ample applications
– All genomics research can be done uniformly through sequencing (with the help of proper assay design)
– Bioinformatics turns the sequencer into a universal genomics interpreter
– Not a challenge, rather a big opportunity!!!

For Edison, the phonograph was not primarily designed for playing music, but ...

Page 34:

One Stone, Many Birds: NGS May Enable a Uniform Bioinformatics

Mapped Position: Structure/functionality (Mapping)

BP Variant: SNP & Mutation Pattern (Detecting)

Read Numbers: Quantified Abundance (Counting)

Page 35:

Match These Sequences

How do we match this sequence:

gattcagacctagct

With this sequence:

gtcagatcct

Page 36:

Possible Answers

1. (no indels)
gattcagacctagct
gtcagatcct

2. (with indels)
gattcaga-cctagct
g-t-cagatcct

3. (no overhang)
gattcagacctagc-t
gtcagatcct

4. (with overhang)
gattcagacctagct
gtcagatcct

Page 37:

Sequence Matching Algorithms #1

Without indels

Hamming distance

Scoring schemes
– Certain changes in sequence are more likely, due to chemical properties of the residues

BLAST algorithm
– Idea: match local regions and expand
– Seven part process
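Matching without indels can be sketched with the Hamming distance: count mismatching positions between two equal-length strings, and slide the shorter sequence along the longer one to find the best ungapped placement. The function names are illustrative, and every mismatch costs the same here (no scoring scheme).

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_offset(long_seq: str, short_seq: str):
    """Slide short_seq along long_seq; return (offset, mismatches) of the best window.

    Returns None if short_seq is longer than long_seq.
    """
    best = None
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        mismatches = hamming_distance(window, short_seq)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


print(hamming_distance("ACGTTGACTC", "ACGATGACTT"))           # 2
print(best_ungapped_offset("gattcagacctagct", "gtcagatcct"))  # (2, 4)
```

For the two sequences from the "Match These Sequences" slide, the best ungapped placement starts at offset 2 and still leaves 4 mismatches, which is why allowing indels can help.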

Page 38:

Sequence Matching Algorithms #2

With indels

Drawing of Dotplots

Dynamic Programming (getting from A to B)
– Quickest route to Z + Quickest route from Z

[Figure: dot plot labelled VPFLLMMVLGVPFMMLG; route graph with nodes A, B, C, D, E, F, G and Z]
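The dynamic-programming idea (the best route through an intermediate point is built from the best routes on either side of it) can be sketched with the edit distance. The cheapest alignment of two prefixes is built from the cheapest alignments of shorter prefixes. This is a minimal sketch with unit costs for substitutions and indels, not the specific scoring used in the lecture; the name `edit_distance` is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("ACGTTGACTC", "ACGTGACTC"))        # 1 (a single deletion)
print(edit_distance("gattcagacctagct", "gtcagatcct"))
```

Keeping only the previous row makes the memory use proportional to the shorter dimension; recovering the alignment itself would need the full table and a traceback.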

Page 39:

Searching Databases

We have ways to score how well 2 sequences match
Now want to use this in databases
– Given a known gene sequence
– Which genes in the database are closely related?

Have to worry about:
– Repeated subsequences biasing matches
– Accuracy and significance of matches
– Sensitivity and specificity (false positives and false negatives)
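Sensitivity and specificity follow directly from the counts of true and false positives and negatives in a search result. The sketch below uses the standard definitions; the counts in the usage lines are made-up toy numbers, not results from any real database search.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly related sequences that the search reports."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unrelated sequences that the search correctly leaves out."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 10 related genes (8 found), 90 unrelated (5 wrongly reported).
print(sensitivity(true_pos=8, false_neg=2))             # 0.8
print(round(specificity(true_neg=85, false_pos=5), 3))  # 0.944
```

Loosening the match threshold usually raises sensitivity at the expense of specificity, which is the trade-off the slide points at.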

Page 40:

Functional Genomics—Transcriptomics

Transcriptome – the complete set of coding and non-coding RNA molecules in a cell at a particular time: Varies between cell types

Transcriptomics – the study of the transcripts in a cell, cell type, organism, etc.

Page 41:

Methods for Transcriptomics

Microarray-based:
– High-throughput gene expression profiling
– Hybridization of labeled cDNAs to an array of complementary DNA probes
– Measurement of expression levels based on hybridization intensity

Sequence-based:
– Full-length cDNA (FLcDNA) sequencing: complete sequencing of cDNA clone
– Expressed sequence tag (EST) sequencing: single-pass sequencing of cDNA clone
– Serial Analysis of Gene Expression (SAGE): short sequence tags at 3’ end of transcript; tags concatenated and sequenced

NGS enables whole transcriptome sequencing: Sequence Census Method
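The sequence census idea is that expression is estimated by counting reads. This is a minimal sketch assuming each read has already been mapped to a gene; the gene names and the read list are made-up toy values, and the function names are illustrative. Real pipelines also normalise for transcript length and sequencing depth, which is omitted here.

```python
from collections import Counter


def count_reads(read_assignments: list) -> Counter:
    """Count how many reads were mapped to each gene."""
    return Counter(read_assignments)


def relative_abundance(counts: Counter) -> dict:
    """Convert raw read counts into fractions of all mapped reads."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical mapping result: one gene label per read.
reads = ["geneA", "geneB", "geneA", "geneC", "geneA", "geneB"]
counts = count_reads(reads)
print(counts["geneA"], counts["geneB"], counts["geneC"])  # 3 2 1
print(relative_abundance(counts)["geneA"])                # 0.5
```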

Page 42:

Machine Learning

Machine learning (inductive reasoning)
– Automatic proposing of hypotheses based on data
– Has many applications in bioinformatics, such as microarray analysis

Example: predictive toxicology
– Given: set of toxic drugs and a set of non-toxic drugs
– Given: background information (chemistry, etc.)
– Produces: hypothesis why drugs are toxic (toxic mechanism)

Overview of machine learning
– Aims, techniques, methodologies, representations

Artificial neural networks
Support vector machines, etc.

Page 43:

Machine Learning

Larrañaga et al. 2005

Page 44:

QUESTIONS?

The End!