Class 10. In what ways does DNA sequence matter?
Topics to cover: What does 23andme report?
carriage of disease-causing mutations
disease risks based on SNP analysis
terminology – gene, trait, mutation, allele, polymorphism, SNP, carrier status, recessive/dominant, penetrance
How they assay ~10^6 SNPs at once
Basis for ascribing disease risk – GWAS, statistical tests of association, p-values, odds ratios, what to take seriously
What additional info. is in complete seq.?
What do we mean by DNA sequence?
..GACATGGGGCAGGATGAT..TGA..
..CTGTACCCCGTCCTACTA..ACT..
What is protein, amino acid?
How is info in DNA sequence converted into protein? What is RNA, a codon, stop codon?
GAC ATG GGG CAG GAT GAT .. TGA ..   DNA
GAC AUG GGG CAG GAU GAU .. UGA ..   mRNA
    met gly gln asp asp .. stop     protein
Consequences of some seq. changes
GAC AUG GGG CAG GAU GAU… UGA…   wild type RNA
    met gly gln asp asp    stop
A change in the 3rd base of GGG would not change the aa, since GGA, GGC, and GGU also encode gly: a “synonymous” change
GAC AUG GGG CAG UAU GAU… UGA…   SNP
    met gly gln tyr asp    stop   1 aa subst., “non-synonymous” change
GAC AUG GGG UAG GAU GAU… UGA…   SNP
    met gly stop                  early termination
GAC AUG GGG CCC CAG GAU GAU… UGA…   3b insertion
    met gly pro gln asp asp    stop   1 aa insertion
GAC AUG GGG C^GA GGA UGA U…   1b insertion (G at ^) -> frame shift
    met gly arg gly stop          changes all following aa’s
Frame-shifting “indels” (non-mult. of 3b) can have large effect
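The codon-by-codon reading used in these examples can be checked with a short script. This is a minimal sketch, not part of the original slides: it builds the standard genetic code, transcribes coding-strand DNA to mRNA, and translates from the first AUG to the first stop codon. The toy sequences are the slide's example with the "…" gaps closed up, which is an assumption made only so the code has a contiguous string to read. Output uses one-letter amino acid codes (M = met, G = gly, Q = gln, D = asp, Y = tyr, P = pro, R = arg).

```python
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (replace T with U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon (or the end)."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy sequences (assumption: the slide's "…" gaps are removed).
variants = {
    "wild type":        "GACATGGGGCAGGATGATTGA",
    "synonymous (GGA)": "GACATGGGACAGGATGATTGA",
    "non-synonymous":   "GACATGGGGCAGTATGATTGA",
    "early stop":       "GACATGGGGTAGGATGATTGA",
    "3b insertion":     "GACATGGGGCCCCAGGATGATTGA",
    "1b insertion":     "GACATGGGGCGAGGATGATTGA",
}

for name, dna in variants.items():
    print(f"{name:18s} {translate(transcribe(dna))}")
```

The single-base insertion shifts every downstream codon, so the protein changes from that point on and here runs into a premature stop.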
What is a gene? Functional unit – usually encodes protein
genome – all the genes/DNA in an organism
Do we get 2 copies of most genes – 1 from each parent?
What is an allele? Sequence variant at some location (locus)
How big is the human genome? ~3 x 10^9 bp
How much of it codes for proteins? only ~1.5%, 20K genes
What does the rest of it do?
~3% involved in regulating gene expression
Up to 50% “selfish” (? parasitic) DNA, repeated seq.
Most of unknown function – even some highly conserved regions
What is a polymorphism? – sequence variant
How much does DNA seq. vary between individuals?
single nucleotide polymorphisms (SNPs): ~0.1%, ~10^6-10^7
insertions, deletions (indels): ~10^5
copy # variants (CNVs): ~10^2
What is a mutation? – polymorphism that causes disease, e.g. CF, sickle cell hemoglobin, BRCA
What is a recessive (dominant) mutation?
recessive - need 2 mut alleles at same locus for disease
dominant - 1 mut allele -> disease
heterozygote: 2 different alleles (at some gene locus)
homozygote: mat. and pat. alleles are identical
some mutations always cause disease (fully penetrant), others only in combination with environmental trigger
Are carrier states recessive or dominant mutations? How many carrier states does 23andme report?
What technology does 23andme use for its testing?
Bead array – hybridization – one base extension method
Array lots of ~3 µm plastic beads into wells etched in a fiber-optic bundle
Each bead has many copies of a single short DNA primer; there are ~10^6 diff. bead types, each w/ diff. primer seq.
Hybridize your DNA to the array; extend the primer DNA on the bead to copy your hybridized template DNA using DNA polymerase and fluorescent bases, one at a time, to see if A, C, G, or T is the next base in the template
How do you know which bead, with which seq., is where? Company sends you map info. How do they know? Method based on serial hybridization with fluorescent probes, recording which hybridize to which bead, then melting off these probes, but not the bead primer DNA
If there are 10^6 bead types, each with diff. seq., it is impractical to do 10^6 hybridizations, so make a green fluor-labeled and a red fluor-labeled version of each probe; make pools of probes; hybridize each pool to the beads and record bead color, then melt off and repeat with a new pool; serial color signatures reveal which seq. is on which bead after ~log2(10^6) ~ 20 hybridizations
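Example – if 8 different bead types, # 0-7
[Figure: three probe pools (pool 1, pool 2, pool 3) hybridized in turn to beads 0-7; a bead's hybridization history across the pools shows which of the sequences 0-7 it must be complementary to. The circled bead should have seq. compl. to probe 2.]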
In general, with k distinguishable probe labels (colors), S pools, and n bead types, need to do S = log_k(n) hybridizations (rounded up); note S goes up slowly, ~ ln(n)
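The pooled decoding idea can be sketched in a few lines. This is an illustration under stated assumptions, not the vendor's actual scheme: the color given to each probe in each pool is taken to be a base-k digit of the bead-type number (0 = green, 1 = red when k = 2), and the function names are hypothetical. The sketch counts how many hybridizations are needed and checks that every bead's color history decodes back to its own number.

```python
def hybridizations_needed(n_bead_types, n_colors=2):
    """Smallest S with n_colors**S >= n_bead_types (integer arithmetic)."""
    s, codes = 0, 1
    while codes < n_bead_types:
        codes *= n_colors
        s += 1
    return s


def color_signature(bead_id, n_pools, n_colors=2):
    """Color of the probe for bead_id in each pool: its base-k digits.

    Assumed assignment (hypothetical): digit i of bead_id sets the color
    of that bead's probe in pool i+1.
    """
    digits = []
    for _ in range(n_pools):
        digits.append(bead_id % n_colors)
        bead_id //= n_colors
    return digits


def decode(signature, n_colors=2):
    """Recover the bead type from the colors observed across the pools."""
    return sum(d * n_colors ** i for i, d in enumerate(signature))


pools = hybridizations_needed(8)          # the 8-bead example
for bead in range(8):
    sig = color_signature(bead, pools)
    assert decode(sig) == bead
    print(bead, ["green" if d == 0 else "red" for d in sig])

print("pools for 8 bead types:", pools)
print("pools for 10**6 bead types:", hybridizations_needed(10 ** 6))
```

With two colors, 8 bead types need 3 pools and 10^6 bead types need 20, which is the slow, logarithmic growth noted above.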
Parity check – make more pools than you need. Now there are lots of unused code numbers. Choose pool combinations such that the most likely errors -> beads with unused codes, so you know these beads were mistyped and you can disregard them
Example: [Figure: the 8-bead example with an added pool 4 serving as a parity check.]
Note all parity codes sum to 0 (mod 2), but any single error would lead to sum = 1
With 4 pools, 2^4 = 16 codes; 8 extras are used to spot errors
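A small sketch of the parity idea, assuming the fourth pool is chosen so that each valid 4-pool code has an even number of 1s (the exact pool layout in the figure is not recoverable, so the bit assignment here is hypothetical). It confirms that a single misread pool always lands on one of the 8 unused codes.

```python
def bits_of(bead_id, n_pools=3):
    """Binary code of a bead type across the first n_pools pools."""
    return [(bead_id >> i) & 1 for i in range(n_pools)]


def with_parity(bits):
    """Append one bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(code):
    """True if the code sums to 0 (mod 2), i.e. it is a valid code."""
    return sum(code) % 2 == 0


valid_codes = [with_parity(bits_of(b)) for b in range(8)]
assert all(parity_ok(c) for c in valid_codes)

# Flip any single pool reading: the result is never a valid code.
for code in valid_codes:
    for i in range(len(code)):
        misread = code.copy()
        misread[i] ^= 1
        assert not parity_ok(misread)

print(len(valid_codes), "valid codes out of", 2 ** 4)
```

A single parity bit detects any one error but not two errors in the same bead, since two flips restore an even sum.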
A polymorphism can be associated with a disease without causing it. How is that possible?
Inheritance – eggs and sperm (germ cells) get one of each pair of chromosomes randomly from a parent
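[Figure: Mom’s two chr. #5’s, the maternal (M) and paternal (P) copies, carrying alleles a and b respectively at SNP 987; mutation x lies near allele b; one of the two copies goes to the egg.]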
suppose mutation x arises on a chr. carrying allele b at a nearby polymorphism, SNP987; offspring that inherit x (tend to) inherit SNP987 allele b -> disease has higher freq. in people with allele b vs allele a at SNP987 = “Founder” effect
Minor complication – chromosomes duplicate and recombine during formation of sperm and egg, which sometimes -> x being inherited with allele from other chromosome, but this is rare if x and b are close
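[Figure: recombination between Mom’s two chr. #5’s (M and P) during egg formation swaps the SNP 987 alleles relative to x, so the chromosome going to the egg carries x together with allele a.]
Chance of recombination ~ distance between b and x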
details for aficionados
How often does recombination occur (as a function of dist. in bases)? ~1% per Mb (10^6 b) (per generation)
How many generations does it take for probability of assoc. to fall to ½, for mutations/SNPs separated by 1 Mb?
0.99^n = 0.5, n ~ 69 generations
How long is association likely maintained? If ~25 yrs/generation, 69 gen -> ~1700 yrs
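The half-life arithmetic above can be reproduced directly. A minimal sketch, assuming recombination probability scales linearly with distance at ~1% per Mb per generation and 25 years per generation (both taken from the slide); the function name and parameters are hypothetical.

```python
import math


def generations_to_half(distance_mb, recomb_per_mb=0.01):
    """Solve (1 - r)**n = 0.5 for n, with r = recomb_per_mb * distance_mb.

    Assumes r grows linearly with distance, which is only reasonable
    for short distances (small r).
    """
    r = recomb_per_mb * distance_mb
    return math.log(0.5) / math.log(1.0 - r)


n = generations_to_half(1.0)        # SNP and mutation 1 Mb apart
years = n * 25                      # ~25 years per generation
print(f"{n:.1f} generations, about {years:.0f} years")
```

For a 1 Mb separation this gives about 69 generations, or roughly 1700 years.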
Implication – such associations can persist for a very long time (by the same calculation, ~69,000 generations, i.e. ~1.7 million years, for mutations within 1000 bases of the SNP)
Implications for SNP-disease associations
Nearby SNPs may not themselves be responsible for increased risk of disease, just assoc. “markers”
Disease-SNP allele associations may be specific to certain ethnic groups in which the mutation arose; there may be associations with different SNP alleles in different ethnic groups if disease mutations arose multiple times
Disease-associated SNPs provide locational clues to causative mutations, useful for research
Genome-wide association studies (GWAS) = source of data for attributing disease risk to presence of some variants
Basic idea – search for chr. regions with SNPs with diff. allele freq. (or genotype freq.) in cases vs controls, e.g.
let A and a designate diff. alleles at some locus; assume each person has 2 alleles

genotype      aa    aA    AA    sum    A allele freq
dis. cases    45   510  1445   2000    3400/4000 = .85
controls     120   960  1920   3000    4800/6000 = .80
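The allele frequencies in the table follow from the genotype counts: each aA person carries one A and each AA person carries two. A short sketch (function name hypothetical):

```python
def allele_freq_A(n_aa, n_aA, n_AA):
    """Frequency of allele A given genotype counts (2 alleles per person)."""
    total_alleles = 2 * (n_aa + n_aA + n_AA)
    return (n_aA + 2 * n_AA) / total_alleles


cases = (45, 510, 1445)
controls = (120, 960, 1920)
print("cases:   ", allele_freq_A(*cases))
print("controls:", allele_freq_A(*controls))
```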
If you inherit A, are you really more likely to get the disease?
How much do frequencies have to differ to be statistically significant?
Basic idea – see if data are reasonably likely if 2 groups (e.g., cases and controls) do not differ in allele or genotype frequencies
If groups are not really different, you could pool the data and ask: if you randomly chose 2 groups (of the size of the cases and controls) from this one population, how likely is it that the means of the 2 groups would differ by as much as you observe? A “t” test gives you this probability. If it is very low, you may have reason to reject the null hypothesis.
The chi sq test is very like the “t” test: Chi sq = Σ (Exp-Obs)^2/Exp. Its probability distribution is known for randomly selected groups from a single population. If p(chi sq) < small # α, e.g. α = .05, you might choose to conclude the groups really do differ in allele freq., since these data are unlikely (p<0.05) if groups are the same
Using α = 0.05 as a cut-off means that you’ll make a mistake and declare identical groups different 5% of the time (5% FP rate). You pick the cut-off for whatever error rate you feel is appropriate
Complication – if one tests for association with 20 independent things, expect ~1 to have p(chi sq) <.05 even when no assoc. exists (i.e. expect 1 FP).
Testing for assoc. with any of ~10^6 SNPs, one needs a much stricter criterion than α = .05 in order to avoid lots of FP’s
Simplest correction – Bonferroni: divide α by n = # of SNPs tested; e.g. require p(chi sq) < 0.05/10^6 = 5 x 10^-8 (~10^-8) in order that the probability that any assoc. is a FP be < .05
Rationale – prob. that an apparent assoc. is not a FP = 1-p; prob. that n apparent assocs. are not FP’s = (1-p)^n ~ 1-pn; you can make this ~1 by choosing pn < α, i.e. p < α/n
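The multiple-testing argument is easy to check numerically. A minimal sketch assuming independent tests and no true associations; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cut-off that keeps the overall FP rate near alpha."""
    return alpha / n_tests


def prob_any_false_positive(p_cutoff, n_tests):
    """P(at least one FP) among n independent tests with no real assoc."""
    return 1.0 - (1.0 - p_cutoff) ** n_tests


# 20 independent tests at 0.05: about 1 FP expected.
print("expected FPs:", 20 * 0.05)
print("P(any FP), 20 tests:", prob_any_false_positive(0.05, 20))

# 10**6 SNPs with the Bonferroni cut-off.
cutoff = bonferroni_threshold(0.05, 10 ** 6)
print("cut-off:", cutoff)
print("P(any FP), 10**6 tests:", prob_any_false_positive(cutoff, 10 ** 6))
```

With the corrected cut-off the chance of even one false positive across all 10^6 tests stays just under 0.05, which is the approximation (1-p)^n ~ 1-pn in action.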
Example chi sq calculation
hypothetical #'s with each genotype
Observed #     aa    aA    AA    sum
dis. cases     45   510  1445   2000
controls      120   960  1920   3000
totals        165  1470  3365   5000

If H0 true, can pool groups for best est. of probabilities: p(aa) = 165/5000; p(aA) = 1470/5000; p(AA) = 3365/5000
Then expected # of aa among dis. cases = p(aa)*2000 = 66
Expected # of aA among dis. cases = p(aA)*2000 = 588
Compute remaining expected #’s the same way or from totals ->

Expected #     aa    aA    AA    sum
dis. cases     66   588  1346   2000
controls       99   882  2019   3000
totals        165  1470  3365   5000

Chi sq = Σ (exp-obs)^2/exp = (66-45)^2/66 + … = 40.52
p(chi sq, 2 df) = 1.59 x 10^-9 (from table, or web) < α = 10^-8
So assoc. is “statistically significant”
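The worked example can be reproduced without a statistics library. This sketch computes expected counts from the row and column totals and sums (exp-obs)^2/exp over all six cells. For the p-value it uses the closed form exp(-chi2/2), which holds only for 2 degrees of freedom (a 2 x 3 table); other table shapes would need a general chi-square routine.

```python
import math


def chi_square(observed):
    """Pearson chi-square for a table given as a list of rows.

    Returns (chi2, expected), where expected has the same shape as observed.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for r_total, row in zip(row_totals, observed):
        exp_row = [r_total * c_total / grand_total for c_total in col_totals]
        expected.append(exp_row)
        chi2 += sum((e - o) ** 2 / e for e, o in zip(exp_row, row))
    return chi2, expected


observed = [
    [45, 510, 1445],    # disease cases: aa, aA, AA
    [120, 960, 1920],   # controls:      aa, aA, AA
]
chi2, expected = chi_square(observed)
p_value = math.exp(-chi2 / 2)   # exact only for 2 degrees of freedom

print("expected:", expected)
print(f"chi sq = {chi2:.2f}, p = {p_value:.2e}")
```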
For confirmation, repeat study in independent groups
Next problem, you want to know p(disease|genotype) but what you observe is freq of SNP genotypes in disease group vs controls, p(genotype|disease)
Bayesian statistics allows you to infer p(disease|genotype) from p(genotype|disease)
Basic idea: 2 ways to calculate p of disease and genotype AA
p(D|AA) p(AA) = p(AA|D) p(D)
-> p(D|AA) = p(AA|D) p(D) / p(AA)
have to know p(D), p(AA), and p(AA|D) to get p(D|AA)
Relative risk might be measured as p(D|AA)/p(D) but frequently expressed in terms of “odds ratio”
Odds = p(event)/[1-p(event)] e.g. “2:1” if p(event)=.67
Odds ratio = odds(D|AA)/odds(D) (assume A is hi risk allele)
= {p(D|AA)/[1-p(D|AA)]} / {p(D)/[1-p(D)]}
Odds ratios frequently larger than relative risk (so make risk seem slightly greater). Need to know p(D) in your population for accurate odds ratio
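To make the Bayes step concrete, here is a sketch using the hypothetical genotype counts from the GWAS example. Two things are assumptions not given in the slides: the disease prevalence p(D) = 0.01 is an arbitrary illustrative value, and the controls are treated as representative of people without the disease, so that p(AA) can be obtained from the law of total probability.

```python
def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


def risk_given_genotype(p_g_given_d, p_g_given_not_d, p_d):
    """Bayes: p(D|G) = p(G|D) p(D) / p(G).

    p(G) is computed as p(G|D) p(D) + p(G|not D) (1 - p(D)).
    """
    p_g = p_g_given_d * p_d + p_g_given_not_d * (1.0 - p_d)
    return p_g_given_d * p_d / p_g


p_d = 0.01                      # hypothetical prevalence (assumption)
p_AA_given_d = 1445 / 2000      # AA frequency among cases
p_AA_given_not_d = 1920 / 3000  # AA frequency among controls

p_d_given_AA = risk_given_genotype(p_AA_given_d, p_AA_given_not_d, p_d)
relative_risk = p_d_given_AA / p_d
odds_ratio = odds(p_d_given_AA) / odds(p_d)

print(f"p(D|AA) = {p_d_given_AA:.4f}")
print(f"relative risk = {relative_risk:.3f}, odds ratio = {odds_ratio:.3f}")
```

With these inputs the odds ratio comes out slightly larger than the relative risk, and both change if a different p(D) is assumed.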
Example of GWAS paper (basis of 23andme disease risk predictions): Nature 447:661 (2007)
appreciate the magnitude, expense, complexity – and limitations
~100 authors, 10^6 SNPs tested in each of 17,000 samples (@ $1000)
could study have been done if each test cost $1?
Note most disease risks measured by OR only ~1.2-2
[Table from paper: disease (dis), raw p(chi sq), ORs]
Large # of SNPs in hit region associated with disease lends credence to the finding
Example of hit region
Most GWAS assoc. now “confirmed” in repeat studies
But does this prove that disease risk is increased?
What additional type of study might you want?
Prospective studies have not yet been done.
What additional information is in full sequence that SNP analysis misses?
SNPs were selected based on their being common in the population, e.g. SNP allele freq. >5%
Full sequence picks up all variants in an individual, many of which (?~10%) will be new
Many of these will be sequencing errors: with 99.99% accuracy expect ~10^5 errors (3 x 10^9 bp x 10^-4 ~ 3 x 10^5)
Typical mutation counts in sequenced human genomes
They estimate ~10% of variants are false positives
3 individuals’ genomes sequenced
Challenges to clinical interpretation
Many variants are new, hence no clinical experience
Even considering only those most likely to have biological effect (e.g. frameshifts in coding seq.), there will be hundreds of mutations per person
Mutations may have no clinical effect because: if heterozygous, other allele provides enough protein; gene is non-essential; other proteins do same thing; gene function is only important in special circumstances
Mutations with no effect in heterozygous parent could affect offspring that inherit 2 variant copies
At observed mutational load, ~15% of pregnancies should carry ominous mutations in both copies of at least 1 gene
p(FS mutation/allele) ~ 200/40,000 = 0.5%
p(couple don’t share gene with FS mutation) = .995^200 = .37
p(>=1 gene for which both are carriers) = 1-.37 = .63
p(fetus of such a couple gets both mutations) = ¼
p(random fetus has FS muts in both alleles of some gene) = .63/4 ~ .16
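The estimate above is a chain of four small steps, sketched here with the slide's own inputs (~200 frameshift mutations per person spread over ~40,000 alleles, i.e. 20K genes x 2 copies). The function name is hypothetical, and the calculation keeps the slide's simplifications: mutations fall on genes independently and uniformly, and a fetus inherits both mutant copies with probability 1/4.

```python
def p_fetus_double_mutant(fs_per_person=200, n_alleles=40_000):
    """Rough chance that a random fetus has FS mutations in both copies
    of at least one gene, following the slide's back-of-envelope steps."""
    p_allele = fs_per_person / n_alleles              # ~0.5% per allele
    # Chance that none of one parent's ~200 mutated genes is also
    # mutated in the other parent.
    p_no_shared = (1.0 - p_allele) ** fs_per_person
    p_shared = 1.0 - p_no_shared                      # >= 1 shared carrier gene
    return p_allele, p_no_shared, p_shared, p_shared / 4.0


p_allele, p_no_shared, p_shared, p_fetus = p_fetus_double_mutant()
print(f"p(FS/allele) = {p_allele:.3f}")
print(f"p(no shared gene) = {p_no_shared:.2f}")
print(f"p(>=1 shared gene) = {p_shared:.2f}")
print(f"p(fetus double mutant) = {p_fetus:.2f}")
```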
Will this cause consternation among parents-to-be?
If <<15% of normal individuals have such double mutations, they may be responsible for spontaneous abortions
If ~15% of normal individuals have such double mutations, they are usually innocuous, but won’t be sure in any particular case
Bottom line – sequencing will likely provide huge amount of information of uncertain clinical significance for foreseeable future
Rapidly decreasing costs (now ~$1000) will likely make wide-scale genome sequencing inevitable
Summary
Companies like 23andme provide info about
  carrier status for some common mutations
  disease risk based on GWAS studies
Testing is based on SNP analysis using random bead arrays
  clever method of identifying which bead is where, based on pooled hybridization probes – using extra information (parity check) to eliminate errors
Disease risk information based on genome-wide association studies (GWAS)
  caveats: only associations, not confirmed prospectively
  statistical evaluation – chi sq., p-values, correction for multiple comparisons, Bayesian method to get p(hypothesis|data) given p(data|hyp.), odds ratios
Additional information in genome sequence
  mutation load – uncertain clinical significance
FYI papers and math exercises on Blackboard
Next 3 classes will be on sequencing technologies
!! Advertisement !!
Next semester I will teach class on ethical, legal, social issues in engineering (MAE???), focusing on issues raised by new biomedical technologies like sequencing, high cost of medical technology, FDA regulation, social/economic costs of patents, cost-benefit and cost-effectiveness evaluation of biomedical technology….
Help recruiting students will be greatly appreciated!