introduction to genomics cm226: machine learning for...

76
Introduction to genomics CM226: Machine Learning for Bioinformatics Fall 2018 Sriram Sankararaman 1

Upload: others

Post on 21-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Introduction to genomics

CM226: Machine Learning for Bioinformatics

Fall 2018 Sriram Sankararaman

1

Page 2: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What is this course about? Bioinformatics: Answering biological questions using tools from computer science, statistics and mathematics.

Machine Learning: Learning from data

2

Page 3: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Personalized medicine

A biological question

3

Page 4: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Cystic fibrosis

Caused by mutations in the CFTR gene

Ivacaftor is a drug approved for patients with a mutation the protein encoded by CFTR

4

Page 5: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Cancer relapse

Acute Lymphoblastic Leukemia in children

Higher risk of relapse among children with native American ancestry

Yang et al. Nature Genetics 2011

Risk

of r

elap

se

5

Page 6: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Path to personalized medicine

Basic biology

What are the genetic and environmental factors?

Disease prediction

Predict disease risk for an individual from genetic data

6

Page 7: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Path to personalized medicine

Basic biology

What are the genetic and environmental factors?

Disease prediction

Predict disease risk for an individual from genetic data

Problems in machine learning

7

Page 8: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What is this course about? Bioinformatics: Answering biological questions using tools from computer science, statistics and mathematics.

Machine Learning: Learning from data

8

Page 9: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What is this course about?Genomic revolution in biology

2001 Human genome project

2010 1000 genomes project

9

Page 10: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What is this course about?Genomic revolution in biology

Stephens et al. PLoS Biology 2015

Num

ber o

f hum

an g

enom

es

Num

ber o

f bas

es

10

Page 11: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What is this course about?Personal genomics

11

Page 12: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Statistics and computing important in obvious and subtle ways

What does this mean for us?

12

Page 13: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Administrivia

13

Page 14: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Course goalsIdentify biological questions.

Translate these questions into a statistical model.

Perform inference within the model.

Apply the model to data.

Interpret the results

Statistically sound ? Biologically meaningful ?

14

Page 15: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Course goals

For CS/Stats/EE students, identify important quantitative problems in Bioinformatics.

For Bioinformatics/Human Genetics students, learn a new set of tools and understand their strengths and limitations.

15

Page 16: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Enrolment/PTEsCourse over-subscribed.

Students on the waitlist will be enrolled in case some one drops.

I am not giving out PTEs until after problem set 0 submission date.

Expect several students will drop the course.

Last year, several students dropped late in the quarter.

Does require mathematical maturity (more later).

16

Page 17: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Enrolment/PTEsHope that all qualified students will be able to enroll.

Problem set 0 intended to mitigate late drops.

This is a course requirement.

Graded to assess your background but not part of final grade.

Be honest/realistic with yourself about your background.

17

Page 18: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Prerequisites

Some programming experience (R strongly encouraged)

18

Page 19: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

PrerequisitesMachine learning builds upon

Probability and Statistics

Linear algebra

Optimization

19

Page 20: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Math background reviewProblem set 0 intended to evaluate if you have the necessary background.

If you can solve the problems with little difficulty, you are in good shape to take this class.

If you have trouble with a few questions, you should expect to fill in the background needed ( we will also cover this material in class).

If you find most of the questions difficult, you should consider other classes to obtain the necessary background.

20

Page 21: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Course formatIntroduce important problems in Bioinformatics.

Use these problems to motivate the study of learning algorithms

Will study different aspects of learning algorithms

Statistical

Computational

Issues that arise in practice

Lectures will switch between studying the methodology and the applications

21

Page 22: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Course formatProblem sets (worth 50% total).

Will include programming and data analyses.

Homeworks are due at 11:59pm on due date.

Late submissions not accepted.

Will use gradescope to manage submissions and grading. Code for the class:MV527B

All solutions must be clearly written or typed. Unreadable answers will not be graded. We encourage using LaTeX to typset answers.

For some assignments, it might be the case that only a subset of the problems will be graded in detail. Students may or may not be told which problems will be graded in advance.

Solutions will be graded on both correctness and clarity.

You are free to discuss homework problems. However, you must write up your own solutions. You must also acknowledge all collaborators.

22

Page 23: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Course formatExams

In-class midterm (20%) Date: November 5

In-class final (30%) Date: December 5

Exams are in class, closed-book and closed-notes and will cover material from the lectures and the problem sets. You will be allowed to carry a one-page cheat sheet.

No alternate or make-up exams will be administered, except for disability/medical reasons documented and communicated to the instructor prior to the exam date. In particular, exam dates and times cannot be changed to accommodate scheduling conflicts with other classes.

Problem set 0 will be graded but not count towards final grade. We will not grade any other homeworks/exams unless you submit problem set 0.

23

Page 24: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

ForumsPiazza

Must have already got an email.

Otherwise sign up at: https://piazza.com/ucla/fall2018/cm226/home

Strongly encourage students to post here rather than email course staff directly (you will get a faster response this way)

If you do need to contact the staff privately, Piazza allows you to do this.

If you want to contact us over email, please include CM226 in the subject line.

24

Page 25: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

ForumsGradescope

Manage problem sets and exam submissions.

For problem sets, you will need to upload pdfs of your submission to gradescope.

Regrade requests must be made through gradescope within one week after the graded homeworks have been handed out, regardless of your attendance on that day and regardless of any intervening holidays such as Memorial Day.

We reserve the right to regrade all problems for a given regrade request.

25

Page 26: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Tentative class syllabusSee class website:

http://web.cs.ucla.edu/~sriram/courses/cm226.fall-2018/html/index.html

26

Page 27: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Office hoursMy Office hours: Tuesday 1-2 pm or by appointment@Engineering VI, 296B

TA: Tevfik Dincer Office hours: Thursday 2-4pm, Location TBD

27

Page 28: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Questions?

28

Page 29: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Goals for this class and the next

Introduction to genomics

Introduction to statistics

29

Page 30: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Crash-course in genomicsHow does the genome code for function?

How is the genome passed on from parent to child ?

How does the genome change when it is passed on ?

What can we learn from genetic variation ?

How do we read the genome ?

30

Page 31: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

OutlineHow does the genome code for function?

How is the genome passed on from parent to child ?

How does the genome change when it is passed on ?

What can we learn from genetic variation ?

How do we read the genome ?

31

Page 32: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Trait/Phenotype

Trait/phenotype: Any observable that is inherited

Height, eye color, disease status, cellular measurements, IQ

Instructions that modulate traits found in the genome

Galton 1886

32

Page 33: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

How the genome codes for function?

33

Page 34: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

OutlineHow does the genome code for function?

How is the genome passed on from parent to child ?

How does the genome change when it is passed on ?

What can we learn from genetic variation ?

How do we read the genome ?

34

Page 35: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Typical human cell has 46 chromosomes

22 pairs

1 pair of sex chromosomes

Genetics and inheritance

The chromosome painting collective35

Page 36: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Genetics and inheritanceOne member of each pair comes from the father (paternal) and the other from the mother (maternal)

The chromosome painting collective36

Page 37: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

OutlineHow does the genome code for function?

How is the genome passed on from parent to child ?

How does the genome change when it is passed on ?

What can we learn from genetic variation ?

How do we read the genome ?

37

Page 38: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Causes of genetic variation

DNA not always inherited accurately

Mutations: changes in DNA

Changes at a single base (nucleotide)

Can have more complex changes

38

Page 39: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

More definitions

SNP: single nucleotide change across individuals

Allele: set of bases at a SNP

Genotype: sequence of alleles along the SNPs of an individual

Genotype 1: (1,CT),(2,GG) Genotype 2: (1,TT), (2,GA)

39

A T C C T T A G G A

A T C T T T C A G A

SNP 1 SNP 2

A T C T T T C A G A

A T C T T T C A A A

Individual 1

Individual 2

Maternal

Paternal

Base/Nucleotide

Page 40: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Single Nucleotide Polymorphism (SNP)

Form the basis of most genetic analyses

Easy to study in high-throughput (million at a time)

Common

In humans, a SNP every ~1000 bases

A T C C T T A G G A

A T C T T T C A G A

SNP 1 SNP 2

A T C T T T C A G A

A T C T T T C A A A

Individual 1

Individual 2

Maternal

Paternal

40

Page 41: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Single Nucleotide Polymorphism (SNP)

0 0 1 0

SNP 1 SNP 2

1 01 1

Genotype 1

Genotype 2

Maternal

Paternal

41

0=C 1=T

0=G 1=A

Page 42: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Genetic inheritanceMendel’s first law

11 00

1

1 0

0

10

42

P(Transmitting allele 1)=1 P(1)=0

P(1)=0.5

Father Mother

Child

Page 43: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Mendel’s first law

11 X 00

10

11 X 10

1011

10 X 10

1011 01

0.5 0.5

0.25 0.250.25

1.0

Genetic inheritance

43

Father Mother

Child

00

0.25

Father Mother

Father Mother

Page 44: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Mendel’s second law

01 1011

? ? ? ?

SNP 1 SNP 2

Genetic inheritance

44

00

0 01 1

GenotypeMaternal

Paternal

What are the probabilities of each combination ?

Page 45: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Mendel’s second law

01 1011

0.50 0.50 0.00 0.00

SNP 1 SNP 2

Genetic inheritance

45

00

0 01 1

GenotypeMaternal

Paternal

Page 46: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Mendel’s second law

01 1011

0.25 0.25 0.25 0.25

SNP 1 SNP 2

Genetic inheritance

46

00

0 01 1

GenotypeMaternal

Paternal

Page 47: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Mendel’s second law

Not quite

01 1011

0.45 0.45 0.05 0.05

SNP 1 SNP 2

Genetic inheritance

47

00

0 01 1

GenotypeMaternal

Paternal

Page 48: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Mendel’s second law

Not quite. Recombination

Genetic inheritance

48

0 0 0 0

0 0 001

1 1

1

111

1

Page 49: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Back to genetic inheritanceMendel’s second law

Positions nearby inherited together.

The chromosome painting collective49

Page 50: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Back to genetic inheritanceMutation and recombination produce genetic variation

Mutation produces differences

Recombination shuffles these differences

50

Page 51: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Genetic inheritanceProbabilistic model

Can write probability P(Genotype of child | Genotype of parents)

Can use this to write more complex models

P (Genotype of individual | Genotype of sibling)

P( Genotype of individual | Genotype of grandparent)

51

Page 52: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

OutlineHow does the genome code for function?

How is the genome passed on from parent to child ?

How does the genome change when it is passed on ?

What can we learn from genetic variation ?

How do we read the genome ?

52

Page 53: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What can we learn from genetic variation ?

Evolution and history

Biological function and disease

53

Page 54: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What can we learn from genetic variation ?

54

How people are related ?

Page 55: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

What can we learn from genetic variation ?

History of human populations

https://blog.23andme.com/23andme-and-you/genetics-101/a-beautiful-ancestry-painting/

55

Page 56: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Genome-wide Association Studies (GWAS)

56

Page 57: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

OutlineHow does the genome code for function?

How is the genome passed on from parent to child ?

How does the genome change when it is passed on ?

What can we learn from genetic variation ?

How do we read the genome ?

57

Page 58: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Reading our genomesGoal: Determine the sequence of bases along each chromosome

Many technologies

Break up the genome into smaller fragments

Read each fragment

Put back together

Computationally hard

58

Page 59: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

The human genome projectGoals

Sequence an accurate reference human genome

Draft published in 2001

High-quality version completed in 2003

Cost: ~$3 billion.

Time: ~13 years.

Other outcome: Power of computing

59

Page 60: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Reading the genomeHuman genome provides a reference

Humans share most of their genome (~99.9% )

Can focus on reading the differences

Simpler problem

60

Page 61: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Reading the genome

High-throughput genotyping array

Idea: DNA molecules bind (hybridize)

Nucleotides bind to their complementary bases

A and T hybridize, C and G hybridize

Can be used to get the genotype at a chosen set of SNPs

T A A G T A C G G A

T A A T T A C A G A

61

Page 62: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Maps of genetic variation

Goals: Describe common patterns of genetic variation in human populations

Phase 1: Genotyped ~1 million SNPs from 270 individuals in 4 populations. Captured most SNPs with a frequency of >5%.

Phase 3: 7 additional populations included

All data publicly available.

International HapMap Project

62

Page 63: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Reading the genomeLimitations of genotyping

Can only read SNPs that are on the array

Biased by how these SNPs are chosen

63

Page 64: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Reading the genomeTechnologies: Illumina, IonTorrent, PacBio

Can read only small pieces of the genome (~100 bases)

But can read hundreds of thousands of pieces in parallel

Use the reference human genome to find the locations of the pieces (and to infer mutations)

High-throughput (or next-gen) sequencing

64

Page 65: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Reading the genomeCost of genome sequencing

65

Page 66: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Maps of genetic variation

2500 individuals from 26 populations

Discover ~90 million SNPs

Includes >99% of SNPs with frequency >1%

All data publicly available

1000 Genomes Project

66

Page 67: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Maps of genetic variation

Many more such efforts underway

Example: Simons Genome Diversity Project : 260 genomes from 127 populations

Also publicly available

67

Page 68: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Other interesting data

EXAC data: Exomes from ~60,000 individuals

Also publicly available

UK Biobank: 500,000 individuals with 200 phenotypes

Not publicly available

68

Page 69: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Computational and statistical problems

Recurring theme:

How to model a biological problem?

69

Page 70: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Recurring theme:

How to scale up algorithms to massive datasets?

Computational and statistical problems

70

Page 71: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Recurring theme:

Tradeoffs

Statistical accuracy

Computational efficiency

Constraints from technology

Constraints such as privacy

Computational and statistical problems

71

Page 72: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Learning the mapping from genotype to phenotype Association tests

Correcting for confounding

Multiple hypothesis testing

Risk prediction

Causality

Population genomics Inference of human history

Admixture inference

Computational and statistical problems

72

Page 73: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Regression Linear, Logistic, Ridge

Latent variable models Clustering, Mixture models

PCA

Admixture models

Directed Graphical Models Hidden Markov Models Kernels Deep Learning

Computational and statistical problems

73

Page 74: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

This classOverview of the biological questions Basic concepts in genetics Genomes are inherited according to well-known probabilistic models

Genomes change

Genetic variation forms the starting point for inference.

Possible inferences: history, disease risk

Advances in technology allow us to sequence large numbers of genomes

74

Page 75: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Next classBasic concepts in statistics

Given data, how do we perform inference ?

What are the types of inference?

How do we derive procedures to perform inference?

How do we evaluate these procedures ?

75

Page 76: Introduction to genomics CM226: Machine Learning for ...sriram/courses/cm226.fall-2018/slides/intro… · Introduction to genomics CM226: Machine Learning for Bioinformatics Fall

Thank you !

76