gene expression analysis, dna chips and genetic networks · 3’ gtt cat atg gcg 5’ mrna: 5’...

57
Computational Genomics Ron Shamir & Roded Sharan Fall 2012-13

Upload: others

Post on 26-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Computational Genomics

Ron Shamir & Roded Sharan

Fall 2012-13

• The information science of biology: organize, store, analyze and visualize biological data

• Responds to the explosion of biological data, and builds on the IT revolution

Bioinformatics

Paradigm shift in biological research

Classical biology: focus on a single gene or sub-system. Hypothesis driven

Systems biology: model the behavior of an entire biological system. Hypothesis generating

Large-scale data;

Bioinformatics

Drug discovery via the connectivity map

Lamb et al., Science 2006

Personalized medicine

Administration

• ~5 home assignments as part of a home exam, to be done independently (50%)

• Final exam (50%)

• Classes: Tue 12:15-13:30; Thu 14:45-16

• TA: Yaron Orenstein (Thu 16-17).

Administration (cont.)

• Web page of course: http://www.cs.tau.ac.il/~rshamir/cg/12/

• Includes slides and scribes of previous years on each of the classes.

Lecture 1: Introduction

1. Basic biology

2. Basic biotechnology & computational challenges

1. Basic Biology

•Touches on Chapters 1-8 in ‘The Cell’.

The Cell

• Basic unit of life.

• Carries complete characteristics of the species.

• All cells store hereditary information in DNA.

• All cells transform DNA to proteins, which determine cell’s structure and function.

• Two classes: eukaryotes (with nucleus) and prokaryotes (without).

http://regentsprep.org/Regents/biology/units/organization/cell.gif

Nucleotide Chain Double helix

sugar

phosphate

Nucleotides/ Bases:

Adenine (A),

Guanine (G),

Cytosine (C),

Thymine (T).

Weak hydrogen

bonds between base

pairs

Strong covalent bonds

between sugars

Gregor Mendel

laws of inheritance, “gene”

1866

Watson and Crick

DNA structure 1953

DNA (Deoxy-Ribonucleic acid) • Bases:

– Adenine (A) – Guanine (G) – Cytosine (C) – Thymine (T)

• Bonds: – G - C – A - T

• Oriented from 5’ to 3’. • Located in the cell nucleus

Purines

pyrimidines

DNA and Chromosomes • DNA is packaged

(10000-fold)

• Chromatin: complex of

DNA and proteins that

pack it (histones)

• Chromosome: contiguous

stretch of DNA

• Diploid: two homologous

chromosomes, one from

each parent

• Genome: totality of DNA

material

Replication

Replication

fork

Genes

• Gene: a segment of DNA that specifies a protein.

• The transformation of a gene into a protein is called expression.

• Genes are 3% of human DNA

• The rest - non-coding “junk DNA”:

– RNA elements (62%)

– Regulatory regions

– Retrotransposons

– Pseudogenes

– and more…

Proteins

• Build the cell and drive

most of its functions.

• Polymers of amino-acids

(20 total), linked by

peptide bonds.

• Oriented (from amino to

carboxyl group).

• Fold into 3D structure of

lowest energy.

DNA RNA protein

transcription translation

The hard

disk

One

program

Its output

http://www.ornl.gov/hgmis/publicat/tko/index.htm

RNA (Ribonucleic acid) • Bases:

– Adenine (A) – Guanine (G) – Cytosine (C) – Uracil (U); replaces T

• Oriented from 5’ (phosphate) to 3’ (sugar). • Single-stranded => flexible backbone =>

secondary structure => catalytic role.

Transcription

http://www.iacr.bbsrc.ac.uk/notebook/courses/guide/words/transcriptiongif.htm

Template

The Genetic Code

• Codon - a triplet of bases, codes a specific amino acid (except the stop codons)

• Stop codons - signal termination of the protein synthesis process

• Different codons may code the same amino acid

http://ntri.tamuk.edu/cell/ribosomes.html

Translation

http://biology.kenyon.edu/courses/biol114/Chap05/Chapter05.html#Protein

DNA Protein

transcription translation

RNA

Expression and Regulation

Gene

Transcription factors (TFs) control

transcription by binding to specific DNA

sequence motifs.

The Human Genome: numbers

• 23 pairs of chromosomes • ~3,200,000,000 bases • ~21,000 genes • Gene length: 1000-3000 bases,

spanning 30-40K bases

Model Organisms

• Eukaryotes; increasing complexity • Easy to store, manipulate.

Budding yeast

• 1 cell

• 6K genes

Nematode worm

• 959 cells

• 19K genes

Fruit fly

• vertebrate-like

• 14K genes

mouse

• mammal

• 30K genes

27

Biotechnology - Restriction Enzymes

• Restriction Enzyme: – Natural role: break foreign DNA

entering the cell.

– Ability: • Breaks the phosphodiester bonds of

a DNA upon appearance of a certain cleavage sequence.

• Different cleavage for each enzyme

• Hundreds of different enzymes known.

– Digestion = application of restriction enzymes to a sequence.

Cloning vector (plasmids)

Foreign DNA

Recombinant DNA

Introduction into host cell

Use of antibiotics to

grow recombinant cells

Cloning

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’

5’

3’

3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

5’ 5’ 3’ 3’

5’

5’ 3’

5’ 3’

5’ 3’

3’

5’ 3’

5’ 3’

5’ 3’

5’ 3’

Denaturation

Annealing

Extension

Cycle 1

Cycle 2

Cycle 3

PCR

30

Biotechnology - Gel Electrophoresis

• Use: “race” digested DNA fragments through electrically charged gel

• Goals: – Separate a mixture of DNA fragments

– Measure length of DNA fragments

• How does it work: – smaller molecule travel faster than larger ones

– same size and shape the same movement speed

http://dlab.reed.edu/projects/vgm/vgm/V

GMProjectFolder/VGM/RED/RED.ISG/

mapping.html

31

32

The Double Digest Problem Given 3 sets of distances {Xi} {Yi}

{Zi}, reconstruct cut sites A1<…<An

and B1<…<Bm s.t. – {Ai-A i-1}={X}, {Bi-B i-1}={Y}

– for C=A U B (ordered), {Ci-C I-1}={Z}

Complexity: NP hard, many heuristics.

33

The Partial Digest Problem •Problem: Given a (multi-) set of distances {|Xi-Xj|} 1 i j n, reconstruct the original series X1,…,Xn

•Complexity: unknown (yet)

34

Sequencing • Sequencing: determining the nucleotide

sequence of a given molecule.

• Classical approach: gel electrophoresis

• Creating DNA strands of different lengths : catalyzing replication in environment with “terminator” A*.

• Repeat separately with C*, G*, T*

• Abilities: reconstructs sequences of 500-1000 nucleotides.

35

36

The Sequence Assembly Problem

• Problem: Given a set of sub- strings, find the shortest (super)string containing all the members of the set.

http://www.ornl.gov/hgmis/graphics/slides/images1.html

37

Gene Structure

38

The Gene Finding Problem

Given a DNA sequence, predict the location of genes (open reading frames) exons and introns.

39

Wild-Type

DNA: 5’ CAA GTA TAC CGC 3'

3’ GTT CAT ATG GCG 5'

mRNA: 5’ CAA GUA UAC CGC 3'

Polypeptide : Gln Val Tyr Arg

Substitution

DNA: 5’ CAC GTA TAC CGC 3'

3’ GTG CAT ATG GCG 5'

RNA: 5’ CAC GUA UAC CGC 3'

Polypeptide: His Val Tyr Arg

In this example of a base substitution, a "C-G" base pair has been substituted for the "A-T" in the original DNA strand. The resulting mRNA transcript has a "C" in place of the original "A". Thus the first three-letter codon is now "CAC" instead of "CAA". The polypeptide encoded by this mRNA is different from the original, because the codon "CAC" codes for the amino acid histidine, whereas the original codon "CAA" codes for glutamine.

Mutations -

Substitution

/MutationFrameset.html1/page2http://morgan.rutgers.edu/morganwebframes/level

Mutations are a

major source of

genetic variation

Mutation does not

always change the

polypeptide

Mutation types:

– substitution

– insertion

– deletion

– rearrangement

40

Wild-Type

DNA: 5’ CAA GTA TAC CGC 3'

3’ GTT CAT ATG GCG 5’

mRNA: 5’ CAA GUA UAC CGC 3'

Polypeptide : Gln Val Tyr Arg

Duplication

DNA: 5’ CAA GTA TAC TAC CGCbv 3’

3’ GTT CAT ATG ATG GCG 5'

RNA: 5’ CAA GUA UAC UAC CGC 3'

Polypeptide: Gln Val Tyr Tyr Arg

In this example, there has been an addition to the DNA molecule. A sequence of three base pairs has been duplicated. The mRNA transcript now contains another "UAC" codon, and the resulting polypeptide is longer by one amino acid -- a second tyrosine has been included.

Mutations -

Duplication

/MutationFrameset.html1/page2http://morgan.rutgers.edu/morganwebframes/level

Mutations are a

major source of

genetic variation

Mutation does not

always change the

polypeptide

Mutation types:

– substitution

– insertion

– deletion

– rearrangement

41

Wild-Type

DNA: 5’ CAA GTA TAC CGC 3'

3’ GTT CAT ATG GCG 5’

mRNA: 5’ CAA GUA UAC CGC 3'

Polypeptide :Gln Val Tyr Arg

Deletion

T-A

DNA: 5’ CAA GAT ACC GC 3'

3’ GTT CTA TGG CG 5’

mRNA: 5’ CAA GAU ACC GC 3'

polypeptide: Gln Asp Thr

In this example, one of the base pairs in the DNA molecule has been deleted. An "T-A" has been lost, as shown above. The mRNA is now missing the "U" that was in position 7. This has caused a shift in the reading frame of the mRNA; thus, a mutation of this type is often referred to as a "frameshift mutation". The second and third amino acids are different from those in the original polypeptide.

Mutations -

Deletion

/MutationFrameset.html1/page2http://morgan.rutgers.edu/morganwebframes/level

Mutations are a

major source of

genetic variation

Mutation does not

always change the

polypeptide

Mutation types:

– substitution

– insertion

– deletion

– rearrangement

42

Sequence Alignment problems

Given two sequences, find their best alignment: Match with insertion/deletion of min cost.

Many variants…

“Workhorse” of Bioinformatics! Key challenge: huge volume of data (more on this later)

43

Before

After

Rearrangement is a change in the order of complete segments along a chromosome.

Mutations -

Rearrangement

Mutations are a

major source of

genetic variation

Mutation does not

always change the

polypeptide

Mutation types:

– substitution

– insertion

– deletion

– rearrangement

44

Genome Rearrangements

M. Ozery-Flato

Challenges: •Reconstruct the evolutionary path of rearrangements •Shortest sequence of rearrangements between two permutations

45

More problems in sequencing data

Solve all the problems above (alignment, gene finding, rearrangements,…) on

really huge datasets

Need to handle practical problems of efficiency – time and space

Need to overcome large noise (false hits) due to data size

DNA Microarrays

Hybridization

• DNA double strands form by “gluing” of complementary single strands

• Complementarity rule: A-T, G-C

ACTCCG

TGAGGC | | | | | |

Gene Expression Arrays

• Assumption: transcription level indicates

gene’s importance in a specific condition.

• Variants enable measuring absolute or

relative levels of a transcript.

Given the expression profiles of normal vs. disease: -Build a classifier to recognize the disease -Cluster disease profiles into sub-classes; or cluster genes into functional groups

Breast cancer treatment

Van’t veer et al., nature’02

Classifier construction P

ati

ents

70-gene signature

No met.

Met.

• Classify to minimize incorrect assignments in the no met. class.

• 17/19 correct predictions on test set.

52

Human Variation • DNA of two human

beings is ~99.9% identical

• Phenotype and disease variation is due these 1/1000 mutations

Challenges: •Associate mutations to specific disease •Deal with huge datasets •Overcome the multiple hypothesis testing problem

Protein interaction detection: yeast two-bybrid

http://www.bioteach.ubc.ca/MolecularBiology/AYeastTwoHybridAssay/

Challenges in network analysis: function prediction

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9

Network distance

GO

sem

an

tic s

imil

ari

ty

Challenges in network analysis:

identifying modules

56

Complexity summary

• ~21,000 genes in the genome

• Hard to identify

• Harder to figure their function

• Even harder to figure how they work together

Promise and problems in comparative genomics (2009)

57

arabidposis c. elegans h. sapiens s. cerevisiae

genes

arabidopsis 26207 0.19 0.24 0.42

c. elegans 19992 0.26 0.38 0.38

h. sapiens 21673 0.30 0.28 0.43

s. cerevisiae 5884 0.21 0.13 0.19

a(x,y)= fraction of genes in genome y that have strict orthologs in

genome x

Source: 10/2009