introduction to genomics cm226: machine learning for...
TRANSCRIPT
Introduction to genomics
CM226: Machine Learning for Bioinformatics
Fall 2018 Sriram Sankararaman
1
What is this course about? Bioinformatics: Answering biological questions using tools from computer science, statistics and mathematics.
Machine Learning: Learning from data
2
Personalized medicine
A biological question
3
Cystic fibrosis
Caused by mutations in the CFTR gene
Ivacaftor is a drug approved for patients with a mutation the protein encoded by CFTR
4
Cancer relapse
Acute Lymphoblastic Leukemia in children
Higher risk of relapse among children with native American ancestry
Yang et al. Nature Genetics 2011
Risk
of r
elap
se
5
Path to personalized medicine
Basic biology
What are the genetic and environmental factors?
Disease prediction
Predict disease risk for an individual from genetic data
6
Path to personalized medicine
Basic biology
What are the genetic and environmental factors?
Disease prediction
Predict disease risk for an individual from genetic data
Problems in machine learning
7
What is this course about? Bioinformatics: Answering biological questions using tools from computer science, statistics and mathematics.
Machine Learning: Learning from data
8
What is this course about?Genomic revolution in biology
2001 Human genome project
2010 1000 genomes project
9
What is this course about?Genomic revolution in biology
Stephens et al. PLoS Biology 2015
Num
ber o
f hum
an g
enom
es
Num
ber o
f bas
es
10
What is this course about?Personal genomics
11
Statistics and computing important in obvious and subtle ways
What does this mean for us?
12
Administrivia
13
Course goalsIdentify biological questions.
Translate these questions into a statistical model.
Perform inference within the model.
Apply the model to data.
Interpret the results
Statistically sound ? Biologically meaningful ?
14
Course goals
For CS/Stats/EE students, identify important quantitative problems in Bioinformatics.
For Bioinformatics/Human Genetics students, learn a new set of tools and understand their strengths and limitations.
15
Enrolment/PTEsCourse over-subscribed.
Students on the waitlist will be enrolled in case some one drops.
I am not giving out PTEs until after problem set 0 submission date.
Expect several students will drop the course.
Last year, several students dropped late in the quarter.
Does require mathematical maturity (more later).
16
Enrolment/PTEsHope that all qualified students will be able to enroll.
Problem set 0 intended to mitigate late drops.
This is a course requirement.
Graded to assess your background but not part of final grade.
Be honest/realistic with yourself about your background.
17
Prerequisites
Some programming experience (R strongly encouraged)
18
PrerequisitesMachine learning builds upon
Probability and Statistics
Linear algebra
Optimization
19
Math background reviewProblem set 0 intended to evaluate if you have the necessary background.
If you can solve the problems with little difficulty, you are in good shape to take this class.
If you have trouble with a few questions, you should expect to fill in the background needed ( we will also cover this material in class).
If you find most of the questions difficult, you should consider other classes to obtain the necessary background.
20
Course formatIntroduce important problems in Bioinformatics.
Use these problems to motivate the study of learning algorithms
Will study different aspects of learning algorithms
Statistical
Computational
Issues that arise in practice
Lectures will switch between studying the methodology and the applications
21
Course formatProblem sets (worth 50% total).
Will include programming and data analyses.
Homeworks are due at 11:59pm on due date.
Late submissions not accepted.
Will use gradescope to manage submissions and grading. Code for the class:MV527B
All solutions must be clearly written or typed. Unreadable answers will not be graded. We encourage using LaTeX to typset answers.
For some assignments, it might be the case that only a subset of the problems will be graded in detail. Students may or may not be told which problems will be graded in advance.
Solutions will be graded on both correctness and clarity.
You are free to discuss homework problems. However, you must write up your own solutions. You must also acknowledge all collaborators.
22
Course formatExams
In-class midterm (20%) Date: November 5
In-class final (30%) Date: December 5
Exams are in class, closed-book and closed-notes and will cover material from the lectures and the problem sets. You will be allowed to carry a one-page cheat sheet.
No alternate or make-up exams will be administered, except for disability/medical reasons documented and communicated to the instructor prior to the exam date. In particular, exam dates and times cannot be changed to accommodate scheduling conflicts with other classes.
Problem set 0 will be graded but not count towards final grade. We will not grade any other homeworks/exams unless you submit problem set 0.
23
ForumsPiazza
Must have already got an email.
Otherwise sign up at: https://piazza.com/ucla/fall2018/cm226/home
Strongly encourage students to post here rather than email course staff directly (you will get a faster response this way)
If you do need to contact the staff privately, Piazza allows you to do this.
If you want to contact us over email, please include CM226 in the subject line.
24
ForumsGradescope
Manage problem sets and exam submissions.
For problem sets, you will need to upload pdfs of your submission to gradescope.
Regrade requests must be made through gradescope within one week after the graded homeworks have been handed out, regardless of your attendance on that day and regardless of any intervening holidays such as Memorial Day.
We reserve the right to regrade all problems for a given regrade request.
25
Tentative class syllabusSee class website:
http://web.cs.ucla.edu/~sriram/courses/cm226.fall-2018/html/index.html
26
Office hoursMy Office hours: Tuesday 1-2 pm or by appointment@Engineering VI, 296B
TA: Tevfik Dincer Office hours: Thursday 2-4pm, Location TBD
27
Questions?
28
Goals for this class and the next
Introduction to genomics
Introduction to statistics
29
Crash-course in genomicsHow does the genome code for function?
How is the genome passed on from parent to child ?
How does the genome change when it is passed on ?
What can we learn from genetic variation ?
How do we read the genome ?
30
OutlineHow does the genome code for function?
How is the genome passed on from parent to child ?
How does the genome change when it is passed on ?
What can we learn from genetic variation ?
How do we read the genome ?
31
Trait/Phenotype
Trait/phenotype: Any observable that is inherited
Height, eye color, disease status, cellular measurements, IQ
Instructions that modulate traits found in the genome
Galton 1886
32
How the genome codes for function?
33
OutlineHow does the genome code for function?
How is the genome passed on from parent to child ?
How does the genome change when it is passed on ?
What can we learn from genetic variation ?
How do we read the genome ?
34
Typical human cell has 46 chromosomes
22 pairs
1 pair of sex chromosomes
Genetics and inheritance
The chromosome painting collective35
Genetics and inheritanceOne member of each pair comes from the father (paternal) and the other from the mother (maternal)
The chromosome painting collective36
OutlineHow does the genome code for function?
How is the genome passed on from parent to child ?
How does the genome change when it is passed on ?
What can we learn from genetic variation ?
How do we read the genome ?
37
Causes of genetic variation
DNA not always inherited accurately
Mutations: changes in DNA
Changes at a single base (nucleotide)
Can have more complex changes
38
More definitions
SNP: single nucleotide change across individuals
Allele: set of bases at a SNP
Genotype: sequence of alleles along the SNPs of an individual
Genotype 1: (1,CT),(2,GG) Genotype 2: (1,TT), (2,GA)
39
A T C C T T A G G A
A T C T T T C A G A
SNP 1 SNP 2
A T C T T T C A G A
A T C T T T C A A A
Individual 1
Individual 2
Maternal
Paternal
Base/Nucleotide
Single Nucleotide Polymorphism (SNP)
Form the basis of most genetic analyses
Easy to study in high-throughput (million at a time)
Common
In humans, a SNP every ~1000 bases
A T C C T T A G G A
A T C T T T C A G A
SNP 1 SNP 2
A T C T T T C A G A
A T C T T T C A A A
Individual 1
Individual 2
Maternal
Paternal
40
Single Nucleotide Polymorphism (SNP)
0 0 1 0
SNP 1 SNP 2
1 01 1
Genotype 1
Genotype 2
Maternal
Paternal
41
0=C 1=T
0=G 1=A
Genetic inheritanceMendel’s first law
11 00
1
1 0
0
10
42
P(Transmitting allele 1)=1 P(1)=0
P(1)=0.5
Father Mother
Child
Mendel’s first law
11 X 00
10
11 X 10
1011
10 X 10
1011 01
0.5 0.5
0.25 0.250.25
1.0
Genetic inheritance
43
Father Mother
Child
00
0.25
Father Mother
Father Mother
Mendel’s second law
01 1011
? ? ? ?
SNP 1 SNP 2
Genetic inheritance
44
00
0 01 1
GenotypeMaternal
Paternal
What are the probabilities of each combination ?
Mendel’s second law
01 1011
0.50 0.50 0.00 0.00
SNP 1 SNP 2
Genetic inheritance
45
00
0 01 1
GenotypeMaternal
Paternal
Mendel’s second law
01 1011
0.25 0.25 0.25 0.25
SNP 1 SNP 2
Genetic inheritance
46
00
0 01 1
GenotypeMaternal
Paternal
Mendel’s second law
Not quite
01 1011
0.45 0.45 0.05 0.05
SNP 1 SNP 2
Genetic inheritance
47
00
0 01 1
GenotypeMaternal
Paternal
Mendel’s second law
Not quite. Recombination
Genetic inheritance
48
0 0 0 0
0 0 001
1 1
1
111
1
Back to genetic inheritanceMendel’s second law
Positions nearby inherited together.
The chromosome painting collective49
Back to genetic inheritanceMutation and recombination produce genetic variation
Mutation produces differences
Recombination shuffles these differences
50
Genetic inheritanceProbabilistic model
Can write probability P(Genotype of child | Genotype of parents)
Can use this to write more complex models
P (Genotype of individual | Genotype of sibling)
P( Genotype of individual | Genotype of grandparent)
51
OutlineHow does the genome code for function?
How is the genome passed on from parent to child ?
How does the genome change when it is passed on ?
What can we learn from genetic variation ?
How do we read the genome ?
52
What can we learn from genetic variation ?
Evolution and history
Biological function and disease
53
What can we learn from genetic variation ?
54
How people are related ?
What can we learn from genetic variation ?
History of human populations
https://blog.23andme.com/23andme-and-you/genetics-101/a-beautiful-ancestry-painting/
55
Genome-wide Association Studies (GWAS)
56
OutlineHow does the genome code for function?
How is the genome passed on from parent to child ?
How does the genome change when it is passed on ?
What can we learn from genetic variation ?
How do we read the genome ?
57
Reading our genomesGoal: Determine the sequence of bases along each chromosome
Many technologies
Break up the genome into smaller fragments
Read each fragment
Put back together
Computationally hard
58
The human genome projectGoals
Sequence an accurate reference human genome
Draft published in 2001
High-quality version completed in 2003
Cost: ~$3 billion.
Time: ~13 years.
Other outcome: Power of computing
59
Reading the genomeHuman genome provides a reference
Humans share most of their genome (~99.9% )
Can focus on reading the differences
Simpler problem
60
Reading the genome
High-throughput genotyping array
Idea: DNA molecules bind (hybridize)
Nucleotides bind to their complementary bases
A and T hybridize, C and G hybridize
Can be used to get the genotype at a chosen set of SNPs
T A A G T A C G G A
T A A T T A C A G A
61
Maps of genetic variation
Goals: Describe common patterns of genetic variation in human populations
Phase 1: Genotyped ~1 million SNPs from 270 individuals in 4 populations. Captured most SNPs with a frequency of >5%.
Phase 3: 7 additional populations included
All data publicly available.
International HapMap Project
62
Reading the genomeLimitations of genotyping
Can only read SNPs that are on the array
Biased by how these SNPs are chosen
63
Reading the genomeTechnologies: Illumina, IonTorrent, PacBio
Can read only small pieces of the genome (~100 bases)
But can read hundreds of thousands of pieces in parallel
Use the reference human genome to find the locations of the pieces (and to infer mutations)
High-throughput (or next-gen) sequencing
64
Reading the genomeCost of genome sequencing
65
Maps of genetic variation
2500 individuals from 26 populations
Discover ~90 million SNPs
Includes >99% of SNPs with frequency >1%
All data publicly available
1000 Genomes Project
66
Maps of genetic variation
Many more such efforts underway
Example: Simons Genome Diversity Project : 260 genomes from 127 populations
Also publicly available
67
Other interesting data
EXAC data: Exomes from ~60,000 individuals
Also publicly available
UK Biobank: 500,000 individuals with 200 phenotypes
Not publicly available
68
Computational and statistical problems
Recurring theme:
How to model a biological problem?
69
Recurring theme:
How to scale up algorithms to massive datasets?
Computational and statistical problems
70
Recurring theme:
Tradeoffs
Statistical accuracy
Computational efficiency
Constraints from technology
Constraints such as privacy
Computational and statistical problems
71
Learning the mapping from genotype to phenotype Association tests
Correcting for confounding
Multiple hypothesis testing
Risk prediction
Causality
Population genomics Inference of human history
Admixture inference
Computational and statistical problems
72
Regression Linear, Logistic, Ridge
Latent variable models Clustering, Mixture models
PCA
Admixture models
Directed Graphical Models Hidden Markov Models Kernels Deep Learning
Computational and statistical problems
73
This classOverview of the biological questions Basic concepts in genetics Genomes are inherited according to well-known probabilistic models
Genomes change
Genetic variation forms the starting point for inference.
Possible inferences: history, disease risk
Advances in technology allow us to sequence large numbers of genomes
74
Next classBasic concepts in statistics
Given data, how do we perform inference ?
What are the types of inference?
How do we derive procedures to perform inference?
How do we evaluate these procedures ?
75
Thank you !
76