
Class 10. In what ways does DNA sequence matter?

Topics to cover: What does 23andme report?

carriage of disease-causing mutations

disease risks based on SNP analysis

terminology – gene, trait, mutation, allele, polymorphism, SNP, carrier status, recessive/dominant, penetrance

How they assay ~10^6 SNPs at once

Basis for ascribing disease risk – GWAS, statistical tests of association, p-values, odds ratios, what to take seriously

What additional info. is in complete seq.

What do we mean by DNA sequence?

..GACATGGGGCAGGATGAT..TGA..
..CTGTACCCCGTCCTACTA..ACT..

What is protein, amino acid?

How is info in DNA sequence converted into protein? What is RNA, a codon, stop codon?

GAC ATG GGG CAG GAT GAT..TGA.. DNA

GAC AUG GGG CAG GAU GAU… UGA… mRNA

met gly gln asp asp stop protein
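The DNA -> mRNA -> protein steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `transcribe` and `translate` are hypothetical helper names, the elided bases (`..`) in the slide are simply left out, and translation is assumed to start at the first AUG and end at the first stop codon. Output uses one-letter amino acid codes (M = met, G = gly, Q = gln, D = asp).

```python
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position; '*' = stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna_coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return dna_coding_strand.replace("T", "U")

def translate(mrna):
    """Read codons from the first AUG until a stop codon (or the end)."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

dna = "GACATGGGGCAGGATGATTGA"  # slide sequence with the elided '..' omitted (assumption)
mrna = transcribe(dna)
print(mrna)             # GACAUGGGGCAGGAUGAUUGA
print(translate(mrna))  # MGQDD
```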

Consequences of some seq. changes

GAC AUG GGG CAG GAU GAU… UGA… wild type RNA
    met gly gln asp asp    stop

A change in the third base of GGG would not change the aa, since GGA, GGC, and GGU also encode gly: “synonymous” change

GAC AUG GGG CAG UAU GAU… UGA… SNP
    met gly gln tyr asp    stop    1 aa subst., “non-synonymous”

GAC AUG GGG UAG GAU GAU… UGA… SNP
    met gly stop    early termination

GAC AUG GGG CCC CAG GAU GAU… UGA… 3b inser.
    met gly pro gln asp asp    stop    1 aa inser.

GAC AUG GGG C^GA GGA UGA U… 1b insertion -> frame shift
    met gly arg gly stop    changes all following aa’s

Frame-shifting “indels” (non-mult. of 3b) can have large effect
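A rough way to check these consequences is to translate each variant and compare it with the wild type. The sketch below is an illustration only: `translate` and `classify` are hypothetical helpers, the elided bases in the slide sequences are omitted, and the labels are assigned from the protein products alone. Output uses one-letter amino acid codes.

```python
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position; '*' = stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Read codons from the first AUG until a stop codon (or the end)."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def classify(wild_mrna, variant_mrna):
    """Return (variant protein, rough label) by comparing with the wild type."""
    wt, var = translate(wild_mrna), translate(variant_mrna)
    if (len(variant_mrna) - len(wild_mrna)) % 3 != 0:
        return var, "frameshift"
    if var == wt:
        return var, "synonymous"
    if len(var) < len(wt) and wt.startswith(var):
        return var, "early termination"
    if len(var) != len(wt):
        return var, "in-frame indel"
    return var, "non-synonymous"

WILD = "GACAUGGGGCAGGAUGAUUGA"
print(classify(WILD, "GACAUGGGACAGGAUGAUUGA"))     # ('MGQDD', 'synonymous')
print(classify(WILD, "GACAUGGGGCAGUAUGAUUGA"))     # ('MGQYD', 'non-synonymous')
print(classify(WILD, "GACAUGGGGUAGGAUGAUUGA"))     # ('MG', 'early termination')
print(classify(WILD, "GACAUGGGGCCCCAGGAUGAUUGA"))  # ('MGPQDD', 'in-frame indel')
print(classify(WILD, "GACAUGGGGCGAGGAUGAUUGA"))    # ('MGRG', 'frameshift')
```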

What is a gene? Functional unit – usually encodes protein

genome – all the genes/DNA in an organism

Do we get 2 copies of most genes – 1 from each parent?

What is an allele? Sequence variant at some location (locus)

How big is the human genome? ~3x10^9 bp

How much of it codes for proteins? only ~1.5%, 20K genes

What does the rest of it do?

~3% involved in regulating gene expression

Up to 50% “selfish” (? parasitic) DNA, repeated seq.

Most of unknown function – even some highly conserved regions

What is a polymorphism? – sequence variant

How much does DNA seq. vary between individuals?

single nucleotide polymorphisms (SNPs) ~0.1%, ~10^6-10^7

insertions, deletions (indels) ~10^5

copy # variants (CNVs) ~10^2

What is a mutation? – polymorphism that causes disease, e.g. CF, sickle cell hemoglobin, BRCA

What is a recessive (dominant) mutation?

recessive - need 2 mut alleles at same locus for disease

dominant - 1 mut allele -> disease

heterozygote: 2 different alleles (at some gene locus) homozygote: mat. and pat. alleles are identical

some mutations always cause disease (fully penetrant), others only in combination with environmental trigger

Are carrier states recessive or dominant mutations? How many carrier states does 23andme report?

What technology does 23andme use for its testing?

Bead array – hybridization – one base extension method

Array lots of ~3 µm plastic beads into wells etched in fiber-optic bundle

Each bead has many copies of a single short DNA primer; there are ~10^6 diff. bead types, each w/ diff. primer seq.

Hybridize your DNA to the array; extend the primer DNA on the bead to copy your hybridized template DNA using DNA polymerase and fluorescent bases, one at a time, to see if A, C, G, or T is the next base in the template

How do you know which bead, with which seq., is where? Company sends you map info. How do they know? Method based on serial hybridization with fluorescent probes, recording which hybridize to which bead, then melting off these probes, but not bead primer DNA

If there are 10^6 bead types, each with diff. seq., impractical to do 10^6 hybridizations, so make green fluor-labeled and red fluor-labeled version of each probe; make pools of probes; hybridize each pool to beads and record bead color, then melt off and repeat with new pool; serial color signatures reveal which seq. is on which bead after ~log2(10^6) ≈ 20 hybridizations

Example – if 8 different bead types # 0-7

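[Figure: probes 0-7 distributed among pool 1, pool 2, and pool 3; a bead's hybridization history across the three pools shows which probe its seq. must be complementary to. The circled bead should have seq. compl. to probe 2.]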

In general, with k distinguishable probe labels (k = 2 for green/red), S pools, and n bead types, need to do S = log_k(n) hybridizations. Note S goes up slowly, ~ln(n)
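The pooling idea can be sketched as reading a bead's type number one base-k digit per hybridization. This is a minimal illustration, not the company's actual decoding scheme: the function names are hypothetical, and the color assignment (0 = red, 1 = green, with pool j reporting the j-th binary digit of the probe number) is an assumption made for the example.

```python
def stages_needed(n_bead_types, k_colors=2):
    """Smallest S with k_colors**S >= n_bead_types (integer form of log_k n)."""
    stages, capacity = 0, 1
    while capacity < n_bead_types:
        capacity *= k_colors
        stages += 1
    return stages

def color_history(bead_type, stages, k_colors=2):
    """Color seen on a bead in each successive pool: the base-k digits of its type."""
    digits = []
    for _ in range(stages):
        digits.append(bead_type % k_colors)
        bead_type //= k_colors
    return digits[::-1]

def decode(history, k_colors=2):
    """Recover the bead type from its recorded color history."""
    bead_type = 0
    for color in history:
        bead_type = bead_type * k_colors + color
    return bead_type

print(stages_needed(8))      # 3
print(stages_needed(10**6))  # 20
print(color_history(2, 3))   # [0, 1, 0]
print(decode([0, 1, 0]))     # 2
```

Each extra hybridization multiplies the number of distinguishable bead types by k, which is why the number of rounds grows only logarithmically with n.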

Parity check – make more pools than you need. Now lots of unused code numbers. Choose pool combinations such that most likely errors -> beads with unused codes, so you know these beads were mistyped and you can disregard them

[Figure: the 8-bead example with a fourth pool (pool 4) added]

Note all parity codes sum to 0 (mod 2), but any single error would lead to sum = 1

with 4 pools, 2^4 = 16 codes; 8 extras are used to spot errors
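A small sketch of the parity idea, assuming the fourth pool acts as a simple even-parity bit appended to the 3-pool code (this encoding and the function names are assumptions for illustration, not the scheme shown on the slide):

```python
def encode_with_parity(bead_type, id_pools=3):
    """3 identifying colors plus a 4th chosen so the colors sum to 0 (mod 2)."""
    bits = [(bead_type >> i) & 1 for i in reversed(range(id_pools))]
    return bits + [sum(bits) % 2]

def looks_valid(history):
    """A recorded history is accepted only if its colors sum to 0 (mod 2)."""
    return sum(history) % 2 == 0

valid_codes = [encode_with_parity(b) for b in range(8)]
print(encode_with_parity(2))                    # [0, 1, 0, 1]
print(all(looks_valid(c) for c in valid_codes)) # True

# Flip one color in a valid code: the read lands on an unused code and is rejected.
misread = encode_with_parity(2)
misread[0] ^= 1
print(looks_valid(misread))                     # False
```

Only 8 of the 16 possible 4-pool codes are ever assigned, so any single misread color produces one of the 8 unused codes.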

A polymorphism can be associated with a disease without causing it. How is that possible?

Inheritance – eggs and sperm (germ cells) get one of each pair of chromosomes randomly from a parent

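[Figure: Mom's pair of chromosomes #5, maternal (M) and paternal (P) copies, carrying alleles a and b at SNP987; mutation x lies on the chromosome carrying b; one chromosome of the pair goes to the egg.]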

suppose mutation x arises on a chr. carrying allele b at a nearby polymorphism, SNP987

offspring that inherit x (tend to) inherit SNP987 allele b -> disease has higher freq. in people with allele b vs allele a at SNP987 = “Founder” effect

Minor complication – chromosomes duplicate and recombine during formation of sperm and egg, which sometimes -> x being inherited with allele from other chromosome, but this is rare if x and b are close

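[Figure: Mom's chromosomes (M and P, carrying SNP987 alleles a and b) duplicate and recombine, so that the chromosome passed to the egg can carry x together with the allele from the other chromosome.]

Chance of recombination ~ distance between b and x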

details for aficionados

How often does recombination occur (as a function of dist. in bases)? ~1% per Mb (10^6 b) (per generation)

How many generations does it take for probability of assoc. to fall to ½, for mutations/SNPs separated by 1Mb?

0.99^n = 0.5 -> n ~69 generations

How long is association likely maintained? If ~25 yrs/generation, 69 gen -> ~1700 yrs
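The half-life arithmetic above can be reproduced directly. A minimal sketch, assuming a constant recombination probability per generation and the slide's 25 years per generation; the function name is hypothetical:

```python
import math

def generations_to_half(recomb_per_generation):
    """Solve (1 - r)^n = 0.5 for n."""
    return math.log(0.5) / math.log(1.0 - recomb_per_generation)

n = generations_to_half(0.01)  # 1% per generation for sites 1 Mb apart
print(round(n))                # 69
print(round(n * 25))           # 1724 (years, at 25 yrs/generation)
```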

Implication – such associations can persist for a long time (by the same arithmetic, ~69,000 generations, or ~1.7 million years, for mutations within 1000 bases of the SNP)

Implications for SNP-disease associations

Nearby SNPs may not themselves be responsible for increased risk of disease, just assoc. “markers”

Disease-SNP allele associations may be specific to certain ethnic groups in which mutation arose; there may be associations with different SNP alleles in different ethnic groups if disease mutations arose multiple times

Disease-associated SNPs provide locational clues to causative mutations, useful for research

Genome-wide association studies (GWAS) = source of data for attributing disease risk to presence of some variants

Basic idea – search for chr. regions with SNPs with diff. allele freq. (or genotype freq.) in cases vs controls, e.g.

let A and a designate diff. alleles at some locus; assume each person has 2 alleles

genotype       aa     aA     AA     sum    A allele freq
dis. cases     45    510   1445    2000    3400/4000 = .85
controls      120    960   1920    3000    4800/6000 = .80
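The allele frequencies in the last column follow from the genotype counts: each aA person carries one A and each AA person carries two. A short sketch (the function name is hypothetical):

```python
def allele_freq_A(n_aa, n_aA, n_AA):
    """Fraction of all alleles that are A, given counts of each genotype."""
    total_alleles = 2 * (n_aa + n_aA + n_AA)
    return (n_aA + 2 * n_AA) / total_alleles

print(allele_freq_A(45, 510, 1445))   # 0.85 (cases)
print(allele_freq_A(120, 960, 1920))  # 0.8  (controls)
```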

If you inherit A, are you really more likely to get the disease?

How much do frequencies have to differ to be statistically significant?

Basic idea – see if data are reasonably likely if 2 groups (e.g., cases and controls) do not differ in allele or genotype frequencies

If groups are not really different, you could pool the data and ask if you randomly chose 2 groups (of the size of the cases and controls) from this one population, how likely would the means of the 2 groups differ by as much as you observe. A “t” test gives you this probability. If it is very low, you may have reason to reject the null hypothesis.

The chi sq test is very like the “t” test. Chi sq = Σ(Exp-Obs)^2/Exp. Its probability distribution is known for randomly selected groups from a single population. If p(chi sq) < small # α, e.g. α = .05, you might choose to conclude the groups really do differ in allele freq., since these data are unlikely (p<0.05) if groups are the same

Using α = 0.05 as a cut-off means that you’ll make a mistake and declare identical groups different 5% of the time (5% FP rate). You pick the cut-off for whatever error rate you feel appropriate

Complication – if one tests for association with 20 independent things, expect ~1 to have p(chi sq) <.05 even when no assoc. exists (i.e. expect 1 FP).

Testing for assoc. with any of ~10^6 SNPs, one needs much stricter criterion than α = .05 in order to avoid lots of FP’s

Simplest correction – Bonferroni: divide α by n = # of SNPs tested; e.g. require p(chi sq.) < 0.05/10^6 = 5x10^-8 (~10^-8) in order that probability that any assoc. is FP be < .05

Rationale – prob. that an apparent assoc. is not a FP = 1-p; prob. that n apparent assocs. are not FP’s = (1-p)^n ~ 1-pn; you can keep this close to 1 (> 1-α) by choosing pn < α, i.e. p < α/n
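The multiple-testing arithmetic can be checked numerically. A minimal sketch assuming independent tests; the function names are hypothetical:

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests with p < alpha when no association exists."""
    return n_tests * alpha

def familywise_error(n_tests, p_cutoff):
    """Probability of at least one FP among n independent tests: 1 - (1-p)^n."""
    return 1.0 - (1.0 - p_cutoff) ** n_tests

def bonferroni_cutoff(alpha, n_tests):
    """Per-test cut-off that keeps the family-wise FP probability below alpha."""
    return alpha / n_tests

print(expected_false_positives(20, 0.05))  # 1.0
print(familywise_error(20, 0.05))          # about 0.64
cutoff = bonferroni_cutoff(0.05, 10**6)
print(cutoff)                              # 5e-08
print(familywise_error(10**6, cutoff))     # about 0.049, i.e. just under 0.05
```

The last line shows why the approximation (1-p)^n ~ 1-pn is good enough here: with pn = 0.05 the exact family-wise error is only slightly below 0.05.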

Example chi sq calculation

hypothetical #'s with each genotype

               aa     aA     AA     sum
dis. cases     45    510   1445    2000
controls      120    960   1920    3000
totals        165   1470   3365    5000

If H0 true, can pool groups for best est. of probabilities p(aa) = 165/5000; p(aA)=1470/5000, p(AA)=3365/5000

Then expected # aa among dis. cases = p(aa)*2000 = 66

Expected # of aA among dis. cases = p(aA)*2000 = 588

Compute remaining expected #’s same way or from totals ->

Expected #     aa     aA     AA     sum
dis. cases     66    588   1346    2000
controls       99    882   2019    3000
totals        165   1470   3365    5000

Chi sq = Σ(exp-obs)^2/exp = (66-45)^2/66 + … = 40.52

p(chi sq, 2 df) = 1.59x10^-9 (from table, or web) < α = 10^-8
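The calculation can be reproduced without a table. The sketch below computes the expected counts from the pooled totals and sums (exp-obs)^2/exp over all six cells; for 2 degrees of freedom the chi sq tail probability has the closed form exp(-chi sq/2), which is used here in place of a statistics library. Function and variable names are hypothetical.

```python
import math

observed = {
    "dis. cases": [45, 510, 1445],   # aa, aA, AA
    "controls":   [120, 960, 1920],
}

def chi_square(table):
    """Pearson chi sq for a rows x columns table of counts."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            exp = row_total * col_total / grand_total  # expected count under H0
            chi2 += (exp - obs) ** 2 / exp
    return chi2

chi2 = chi_square(observed)
p_value = math.exp(-chi2 / 2)  # exact tail probability for 2 df only

print(round(chi2, 2))     # 40.52
print(f"{p_value:.2e}")   # 1.59e-09
```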

So assoc. is “statistically significant”

For confirmation, repeat study in independent groups

Relative risk might be measured as p(D|AA)/p(D) but frequently expressed in terms of “odds ratio”

Odds = p(event)/[1-p(event)] e.g. “2:1” if p(event)=.67

Odds ratio = odds(D|AA)/odds(D) (assume A is hi risk allele)
= {p(D|AA)/[1-p(D|AA)]} / {p(D)/[1-p(D)]}

Odds ratios frequently larger than relative risk (so make risk seem slightly greater). Need to know p(D) in your population for accurate odds ratio
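To see how the two measures compare, the definitions above can be applied to made-up numbers. In this sketch p(D) = 0.10 and p(D|AA) = 0.15 are hypothetical values chosen only for illustration, and the function names are likewise assumptions.

```python
def odds(p):
    """Odds = p(event) / [1 - p(event)]."""
    return p / (1.0 - p)

def relative_risk(p_d_given_aa, p_d):
    return p_d_given_aa / p_d

def odds_ratio(p_d_given_aa, p_d):
    return odds(p_d_given_aa) / odds(p_d)

p_d, p_d_given_aa = 0.10, 0.15   # hypothetical probabilities, not from any study

print(round(odds(0.67), 2))                         # 2.03, i.e. about "2:1"
print(round(relative_risk(p_d_given_aa, p_d), 2))   # 1.5
print(round(odds_ratio(p_d_given_aa, p_d), 2))      # 1.59
```

With these numbers the odds ratio (1.59) is somewhat larger than the relative risk (1.5); the gap shrinks as p(D) gets smaller.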

Example of GWAS paper (basis of 23andme disease risk predictions) Nature 447:661 (2007)

appreciate the magnitude, expense, complexity – and limitations

~100 authors, 10^6 SNPs tested in each of 17,000 samples (@ $1000)

could study have been done if each test cost $1?

Note most disease risks measured by OR only ~1.2-2

[Table from the paper: disease (dis.), raw p(chi sq), and ORs]

Large # of SNPs in hit region associated with disease lends credence to the finding

Example of hit region

Most GWAS assoc. now “confirmed” in repeat studies

But does this prove that disease risk is increased?

What additional type of study might you want?

Prospective studies have not yet been done.

What additional information is in full sequence that SNP analysis misses?

SNPs were selected based on their being common in the population, e.g. SNP allele freq. >5%

Full sequence picks up all variants in an individual, many of which (?~10%) will be new

Many of these will be sequencing errors: with 99.99% accuracy expect ~10^5 errors (3x10^9 bases x 10^-4 error rate ≈ 3x10^5)

Typical mutation counts in sequenced human genomes

They estimate ~10% of variants are false positives

3 individuals’ genomes sequenced

Challenges to clinical interpretation

Many variants are new, hence no clinical experience

Even considering only those most likely to have biological effect (e.g. frameshifts in coding seq.), there will be hundreds of mutations per person

Mutations may have no clinical effect because: if heterozygous, other allele provides enough protein; gene is non-essential; other proteins do same thing; gene function is only important in special circumstances

Mutations with no effect in heterozygous parent could affect offspring that inherit 2 variant copies

At observed mutational load, ~15% of pregnancies should carry ominous mutations in both copies of at least 1 gene

p(FS mutation/allele) ~200/40,000 = 0.5%

p(couple don’t share gene with FS mutation) = .995^200 = .37

p(≥1 gene for which both are carriers) = 1-.37 = .63

p(fetus of such a couple gets both mutations) = ¼

p(random fetus has FS muts in both alleles of some gene) = .63/4 ≈ .16
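The estimate above can be followed step by step in code. This sketch simply mirrors the slide's own assumptions (~200 frameshift mutations per person spread over ~40,000 alleles, independence between genes, and a 1/4 chance of inheriting both mutant copies); the function name and parameters are hypothetical.

```python
def p_fetus_double_mutation(fs_per_person=200, n_alleles=40_000):
    """Follow the slide's estimate for a fetus carrying FS mutations in both copies of some gene."""
    p_allele = fs_per_person / n_alleles            # 0.005
    p_no_shared = (1 - p_allele) ** fs_per_person   # couple shares no mutated gene
    p_shared = 1 - p_no_shared                      # at least one shared gene
    return p_no_shared, p_shared, p_shared / 4      # fetus inherits both copies 1/4 of the time

p_no_shared, p_shared, p_fetus = p_fetus_double_mutation()
print(round(p_no_shared, 2))  # 0.37
print(round(p_shared, 2))     # 0.63
print(round(p_fetus, 2))      # 0.16
```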

Will this cause consternation among parents-to-be?

If <<15% of normal individuals have such double mutations, they may be responsible for spontaneous abortions

If ~15% of normal individuals have such double mutations, they are usually innocuous, but won’t be sure in any particular case

Bottom line – sequencing will likely provide huge amount of information of uncertain clinical significance for foreseeable future

Rapidly decreasing costs (now ~$1000) will likely make wide-scale genome sequencing inevitable

Summary

Companies like 23andme provide info about

carrier status for some common mutations

disease risk based on GWAS studies

Testing is based on SNP analysis using random bead arrays

clever method of identifying which bead is where, based on pooled hybridization probes – using extra information (parity check) to eliminate errors

Disease risk information based on genome-wide association studies (GWAS)

caveats: only associations, not confirmed prospectively

statistical evaluation – chi sq., p-values, correction for multiple comparisons, Bayesian method to get p(hypothesis|data) given p(data|hyp.), odds ratios

Additional information in genome sequence

mutation load – uncertain clinical significance

FYI papers and math exercises on Blackboard

Next 3 classes will be on sequencing technologies

!! Advertisement !!

Next semester I will teach class on ethical, legal, social issues in engineering (MAE???), focusing on issues raised by new biomedical technologies like sequencing, high cost of medical technology, FDA regulation, social/economic costs of patents, cost-benefit and cost-effectiveness evaluation of biomedical technology….

Help recruiting students will be greatly appreciated!
