Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants. Sam Danziger, Institute For Genomics and Bioinformatics, Department of Biomedical Engineering, University of California, Irvine, www.SamDanziger.com. Rainer Brachmann, Department of Medicine; Richard Lathrop, Department of Computer Science; Jue Zeng, Department of Medicine; University of California, Irvine.

Upload: tanner-mcclure

Post on 30-Dec-2015

Page 1: Sam Danziger Institute For Genomics and Bioinformatics Department of Biomedical Engineering

Choosing where to look next in a mutation sequence space:

Active Learning of informative p53 cancer rescue mutants

Sam Danziger

Institute For Genomics and Bioinformatics

Department of Biomedical Engineering

University of California, Irvine

www.SamDanziger.com

Rainer Brachmann, Department of Medicine

Richard Lathrop, Department of Computer Science

Jue Zeng, Department of Medicine

University of California, Irvine

Page 2: Outline

• Overview: Computer Guided Discovery
• Problem: Cancer and p53
• Results: Best Active Learning
• Next: Future Experiments

Page 3: Computer Guided Discovery of “Active” Mutant Proteins

[Figure labels: Known Mutants; Other Possible Mutants]

• Starting Point: A biomedically important protein with some known mutants.
• Problem: Find novel mutant proteins with an “Active” phenotype.
• Naive Solution: Make and test all other possible mutants in the wet lab.

Page 4: Why Use Computers?

[Image: Spiral Galaxy M101, ~10^9 stars. Source: http://hubblesite.org/]

Known Mutants: ~10^2

How Many Mutants are There?: ~10^11 (assuming up to 5 mutations in 200 residues)
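The slide does not say how the ~10^11 figure was counted. As a rough sketch, and purely as an assumption about the intended arithmetic, choosing a site 5 times among 200 residues gives 200^5, which lands at that order of magnitude. Counting unordered sets of up to 5 sites gives a smaller number, shown for comparison. The variable names are illustrative only.

```python
import math

residues = 200        # residues considered, from the slide
max_mutations = 5     # up to 5 mutated positions, from the slide

# Assumed reading: one site choice per mutation, order counted.
ordered_site_choices = residues ** max_mutations

# Alternative reading: unordered sets of 1 to 5 distinct sites.
unordered_site_sets = sum(
    math.comb(residues, k) for k in range(1, max_mutations + 1)
)

print(f"ordered choices:  {ordered_site_choices:.1e}")
print(f"unordered sets:   {unordered_site_sets:.1e}")
```

Neither count multiplies in the possible amino acid substitutions at each site, so both are lower bounds on the size of the space under their respective readings.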

Page 5: A Better Solution: Active Learning

Pick the best unknown mutants to know.

[Diagram: the active learning loop]

• Known (Training Set): Example 1, Example 2, Example 3, …, Example N
• Unknown: Example N+1, Example N+2, Example N+3, Example N+4, …, Example M
• Loop: Train the Classifier, Choose an Example to Label, Add the New Example To Training Set
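The loop on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not the classifier used in the talk: the examples are single numbers, the "classifier" is a threshold, and `oracle` stands in for the wet-lab label. All function names here are hypothetical.

```python
def active_learning_loop(known, unknown, train, choose, label_in_lab, rounds):
    """known: list of (example, label) pairs; unknown: list of examples."""
    known, unknown = list(known), list(unknown)
    for _ in range(rounds):
        if not unknown:
            break
        classifier = train(known)              # Train the Classifier
        example = choose(classifier, unknown)  # Choose an Example to Label
        unknown.remove(example)
        known.append((example, label_in_lab(example)))  # Add to Training Set
    return known, unknown


# Toy stand-ins (assumptions for illustration only).
def train_threshold(known):
    actives = [x for x, y in known if y == "Active"]
    inactives = [x for x, y in known if y == "Inactive"]
    return (min(actives) + max(inactives)) / 2


def choose_nearest(threshold, unknown):
    # Pick the unknown example the current classifier is least sure about.
    return min(unknown, key=lambda x: abs(x - threshold))


def oracle(x):
    return "Active" if x > 5.0 else "Inactive"


known = [(0.0, "Inactive"), (10.0, "Active")]
unknown = [1.0, 3.0, 4.5, 5.5, 7.0, 9.0]
known, unknown = active_learning_loop(
    known, unknown, train_threshold, choose_nearest, oracle, rounds=3
)
print(known)
print(unknown)
```

Swapping `choose_nearest` for a different selection function is what distinguishes one active learning method from another; the loop itself stays the same.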

Page 6: An Example of Active Learning: Minimum Marginal Hyperplane

Should unknown Mutant 1 or Mutant 2 be added to the training set?

[Figure labels: ACTIVE; INACTIVE; Known Active; Known Inactive; Unknown Mutant 1; Unknown Mutant 2]

Select Mutant 2
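A minimal sketch of this selection rule, assuming the common reading from the literature in which the unknown example closest to the separating hyperplane is chosen. The weights `w`, offset `b`, and mutant coordinates are made-up illustration values, not data from the talk.

```python
import math


def distance_to_hyperplane(x, w, b):
    """Unsigned distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm


def select_min_margin(candidates, w, b):
    """Return the name of the candidate nearest the decision boundary."""
    return min(
        candidates,
        key=lambda name: distance_to_hyperplane(candidates[name], w, b),
    )


# Hypothetical trained hyperplane and two unknown mutants.
w, b = [1.0, -1.0], 0.0
unknown = {"Mutant 1": [3.0, 0.5], "Mutant 2": [1.0, 0.8]}

chosen = select_min_margin(unknown, w, b)
print(chosen)
```

With these toy coordinates Mutant 2 sits nearest the boundary, so it is the one selected.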

Page 7: Another Example: Maximum Curiosity

Should Mutant 1 or Mutant 2 be added to the training set?

[Diagram: the Training Set and each of the following augmented sets are passed through a Cross-validator]

• Training Set + Mutant 1 (Active)
• Training Set + Mutant 1 (Inactive)
• Training Set + Mutant 2 (Active)
• Training Set + Mutant 2 (Inactive)

Change in correlation coefficient (values shown on the slide): .0411, .0276, -.6014, .0309

Select Mutant 1
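A hedged sketch of how such a selection could be computed. Two assumptions are made that the transcript does not confirm: the first two values on the slide belong to Mutant 1 and the last two to Mutant 2, and a mutant is scored by its best change over the two assumed labels (matching the "if correctly predicted" wording on Page 9). `cross_validate` is a hypothetical callback that returns a correlation coefficient for a training set.

```python
def curiosity_deltas(training, candidates, cross_validate,
                     labels=("Active", "Inactive")):
    """Change in cross-validated correlation for each assumed label."""
    baseline = cross_validate(training)
    return {
        name: {
            label: cross_validate(training + [(features, label)]) - baseline
            for label in labels
        }
        for name, features in candidates.items()
    }


def select_max_curiosity(deltas):
    """Pick the mutant whose best assumed label improves the score most."""
    return max(deltas, key=lambda name: max(deltas[name].values()))


# Slide values, with the pairing to mutants and labels assumed.
slide_deltas = {
    "Mutant 1": {"Active": 0.0411, "Inactive": 0.0276},
    "Mutant 2": {"Active": -0.6014, "Inactive": 0.0309},
}

chosen = select_max_curiosity(slide_deltas)
print(chosen)
```

Under this pairing the rule picks Mutant 1, in agreement with the slide; averaging the two values per mutant instead of taking the maximum would give the same choice here.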

Page 8: A Third Example: Entropic Tradeoff

[Figure labels: ACTIVE; INACTIVE; Known Active; Known Inactive; Unclassified; Selected Unclassified; OK (four times)]

Page 9: Which is the Best Active Learning Method?

TYPE I: Select mutants that most improve the classifier if correctly predicted.
• Maximum Curiosity
• Composite Classifier
• Improved Composite Classifier

TYPE II: Select mutants that most improve the classifier.
• Additive Curiosity
• Additive Bayesian Surprise

TYPE III: Common methods taken from the literature.
• Minimum Marginal Hyperplane
• Maximum Entropy

TYPE IV: Variations on methods from the literature.
• Maximum Marginal Hyperplane
• Minimum Entropy
• Entropic Tradeoff

TYPE C: Controls
• Non-iterated Prediction
• Predict All Inactive
• Random (30 trials)

Page 10: Outline

• Overview: Computer Guided Discovery
• Problem: Cancer and p53
• Results: Best Active Learning
• Next: Future Experiments

Page 11: The Problem: p53 and Cancer

p53 mutations occur in ~50% of human cancers.

• Tumor Suppressor Protein.
• Receives upstream signals indicating cellular stress.
• Acts as a transcription factor in the cancer suppression pathway.

[Image: p53 core domain bound to DNA. Image generated with UCSF Chimera.]

Cho, Y., Gorina, S., Jeffrey, P.D., Pavletich, N.P. Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science v265 pp.346-355, 1994

Page 12: The p53 Cancer Pathway

David W. Meek: http://www.dundee.ac.uk/biomedres/meek.htm

Page 13: The Concept of “Cancer Rescue”: Second-site Suppressor Mutations

[Diagram: p53 domain map, N to C]

• Transactivation: 1-42
• Core domain for DNA binding: 102-292
• Tetramerization: 324-355

Residues marked on the diagram: 175, 245, 248, 249, 273, 282

Also marked: 235+240

Cancer mutation prevalence data from the IARC p53 database: http://www-p53.iarc.fr/

Page 14: Goals

Ultimate Goal

Inactive p53 Cancer Mutant + Engineered Small Molecule Drug = Functionally Active Rescued p53

Intermediate Goal

Advance medical practice by revealing p53 mutant functional properties across p53’s mutation sequence space.

Immediate Goal

Find novel p53 Cancer Rescue Mutants.

Page 15: Evaluating Cancer Rescue Mutants in the Wet Lab

INACTIVE: A yeast containing an inactive p53 cancer mutant will not grow.

ACTIVE: A yeast containing an active p53 cancer rescue mutant will grow.

Baroni, T.E., Wang, T., Qian, H., Dearth, L.R., Truong, L.N., Zeng, J., Denes, A.E., Chen, S.W. and Brachmann, R.K. (2004) A global suppressor motif for p53 cancer mutants. Proc Natl Acad Sci U S A, 101, 4930-5.

Page 16: In Vitro Phenotype

Page 17: In a Nutshell

Cancer Rescue Mutants

• Build classifiers of putative p53 cancer rescue mutants.
• Use Active Learning to select the p53 mutants that will be the most informative.
• Test the predictions in-vitro.

[Diagram labels: Knowledge; Model; Experiment; Find all p53 cancer rescue mutants]

Page 18: Outline

• Overview: Computer Guided Discovery
• Problem: Cancer and p53
• Results: Best Active Learning
• Next: Future Experiments

Page 19: The Active Learning Tradeoff: How Fast Does It Learn?

Page 20: The Active Learning Tradeoff: How Accurate On The Chosen?

204 Predicts 57

Type | Method | Accuracy | Correlation Coefficient | Student-T
I | Maximum Curiosity | 77.19% +/- 5.61% | .5255 | 0.00%
I | Composite Classifier | 70.18% +/- 6.11% | .4447 | 100.0%
I | Improved Composite Classifier | 71.93% +/- 6.00% | .4637 | 100.0%
II | Additive Curiosity | 73.68% +/- 5.88% | .3857 | 99.81%
II | Additive Bayesian Surprise | 73.68% +/- 5.88% | .4342 | 99.81%
III | Minimum Marginal Hyperplane | 64.91% +/- 6.38% | .2845 | 100.0%
III | Maximum Entropy | 64.91% +/- 6.38% | .2845 | 100.0%
IV | Maximum Marginal Hyperplane | 78.95% +/- 5.45% | .3699 | 90.42%
IV | Minimum Entropy | 77.19% +/- 5.61% | .3406 | 0.00%
IV | Entropic Tradeoff | 80.70% +/- 5.27% | .4860 | 99.89%
C | Non-iterated Prediction | 56.14% +/- 6.63% | .2530 | 100.0%
C | Predict All Inactive | 80.70% +/- 5.27% | .0000 | 99.89%
C | Random (30 trials) | 74.39% +/- 3.87% | .3550 +/- .0992 | 99.24% +/- 2.89%
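The slide does not say how the +/- values on the single-run accuracies were obtained. As a check, and only as an inference from the numbers, they appear consistent with a binomial standard error on the 57 predicted mutants using an n - 1 denominator. The correct-prediction counts below (44, 37, 46) are back-calculated from the percentages and are not given in the talk.

```python
import math


def accuracy_with_error(correct, total):
    """Accuracy and binomial standard error (n - 1 denominator), in percent."""
    p = correct / total
    stderr = math.sqrt(p * (1 - p) / (total - 1))
    return 100 * p, 100 * stderr


# Assumed counts of correct predictions out of 57.
for method, correct in [
    ("Maximum Curiosity", 44),
    ("Minimum Marginal Hyperplane", 37),
    ("Entropic Tradeoff", 46),
]:
    acc, err = accuracy_with_error(correct, 57)
    print(f"{method}: {acc:.2f}% +/- {err:.2f}%")
```

The three rows come out at 77.19% +/- 5.61%, 64.91% +/- 6.38%, and 80.70% +/- 5.27%, matching the table. The Random row averages 30 trials, so its +/- is presumably computed differently.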

Page 21: The Tradeoff

[Plot: axes “How Fast Does It Learn?” and “How Accurate On The Chosen?”; methods shown: Maximum Curiosity, Entropic Tradeoff, Minimum Marginal Hyperplane]

• Sum? Length + Width
• Geometric Distance?
• Area? Length * Width

Solution: Average Score of All Three Metrics
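The three candidate ways of combining the two axes can be written out as follows. This is only a sketch: "length" and "width" are taken to be a method's values on the two tradeoff axes, "Geometric Distance" is read as the Euclidean distance from the origin, and the way the talk converts these metrics into the final averaged scores is not stated, so no scoring step is shown.

```python
import math


def combined_metrics(length, width):
    """Three ways to merge two tradeoff measurements into one number."""
    return {
        "sum": length + width,                  # Length + Width
        "distance": math.hypot(length, width),  # assumed Euclidean distance
        "area": length * width,                 # Length * Width
    }


# Illustrative values only, not results from the talk.
m = combined_metrics(3.0, 4.0)
print(m)
```

The three metrics can rank methods differently: the sum tolerates a method that is strong on one axis and weak on the other, while the area penalizes it.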

Page 22: The Overall Best

Rank | Method | Average Score
1 | Maximum Curiosity | 6.11
2 | Entropic Tradeoff | 5.56
3 | Random (30 trials) | 5.50
4 | Minimum Entropy | 4.44
5 | Maximum Marginal Hyperplane | 3.22
6 | Maximum Entropy | 3.22
7 | Additive Bayesian Surprise | 2.89
8 | Minimum Marginal Hyperplane | 2.33
9 | Additive Curiosity | 1.89

Page 23: How Fast Does It Learn? The Three Previous Examples

Page 24: How Accurate On The Chosen? The Three Previous Examples

204 Predicts 57

Type | Method | Accuracy | Correlation Coefficient | Student-T
I | Maximum Curiosity | 77.19% +/- 5.61% | .5255 | 0.00%
III | Minimum Marginal Hyperplane | 64.91% +/- 6.38% | .2845 | 100.0%
IV | Entropic Tradeoff | 80.70% +/- 5.27% | .4860 | 99.89%
C | Non-iterated Prediction | 56.14% +/- 6.63% | .2530 | 100.0%
C | Predict All Inactive | 80.70% +/- 5.27% | .0000 | 99.89%
C | Random (30 trials) | 74.39% +/- 3.87% | .3550 +/- .0992 | 99.24% +/- 2.89%

Page 25: Why Does Random Do So Well?

Tong, S. and D. Koller (2002). "Support vector machine active learning with applications to text classification." The Journal of Machine Learning Research 2: 45-66.

Very Few Examples

Page 26: Outline

• Overview: Computer Guided Discovery
• Problem: Cancer and p53
• Results: Best Active Learning
• Next: Future Experiments

Page 27: Exploring New p53 Regions

• Each new p53 region potentially introduces new rescue mechanisms.
• New pools of mutants restart the Active Learning problem.

[Diagram: p53 Core Domain, N to C; regions 113-124 and 281-289; residues 175, 245, 248, 273, 282]

Page 28: Most Interesting or Most Interesting Active?

Which Finds More Active Cancer Rescue Mutants?

[Diagram: starting from the Known Mutants, each strategy is shown over Iteration 1, Iteration 2, and Iteration 3]

• Select The Most Interesting
• Select The Most Interesting Active

Page 29: Conclusion

[Diagram labels: Theory; Knowledge; Experiment; Find Cancer Rescue Mutants]

Page 30: Acknowledgments

Baldi Lab: Pierre Baldi, Jonathan Chen, Hiroto Saigo, S. Joshua Swamidass

Brachmann Lab: Rainer Brachmann, Jue Zeng

Lathrop Lab: Richard Lathrop, Gabe Moothart

Leuke Lab: Ying Wang

Luo Lab: Ray Luo, Qiang Lu

Funding: National Institutes of Health (p53: CA112560), UCI Office of Research and Graduate Studies, UCI Institute for Genomics and Bioinformatics (BIT: LM007443), US Department of Energy (DOE)

Page 31: Questions?

[Diagram labels: Theory; Knowledge; Experiment; Find Cancer Rescue Mutants]

Page 32: Most Interesting Region

Scan the p53 core domain to find the most interesting region.

Page 33: Create All Single Point Mutations in a Region in-vitro?

• CODA*: Assemble p53 using thermodynamically optimized oligonucleotides.
• Allow all possible mutations within a region.
• Assemble mutated region with cancer mutants to look for rescue mutants.

*http://www.codagenomics.com/

Page 34: Knowledge Representation: Homology Modeling

Modeling done using Amber™ with zinc ion characteristics tuned by Dr. Qiang Lu working in Dr. Ray Luo’s lab.

1. Take a wild type crystal structure of the protein in question.

2. Substitute one or more amino acids to mutate the protein.

3. Apply simulated physical laws to determine an energy function.

4. Minimize the energy of the new mutant protein.

Page 35: Knowledge Representation: Features

Simulated Structure -> String of Numbers

• 1d: Sequence Mutation Features
• s1d: Sequence Similarity Features
• 2d: Surface Map Features
• 3d: Atomic Position Features
• 4d: “Time Dependent” Stability Information

Page 36: What is Machine Learning?

Training: Set the parameters (W) with n features.

[Diagram: m training examples (Example 1, Example 2, …, Example m), each with n features (F11, F12, …, F1n through Fm1, …, Fmn) and a class (Class 1, Class 2, …, Class m), are used to set the parameters W1, W2, …, Wn]

Testing: Use the parameters (W) to predict unclassified examples.

[Diagram: the features F11, F12, …, F1n of an Unknown example are combined with W1, W2, …, Wn to give a Prediction]
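The training and testing steps can be illustrated with a very small example. This sketch assumes the parameters W act as weights in a linear decision rule and uses a basic perceptron update to set them; the slide does not name a specific learner, and the feature values below are invented. The last feature is a constant 1.0 acting as a bias term.

```python
def predict(weights, features):
    """Testing: combine the weights with one example's features."""
    score = sum(w * f for w, f in zip(weights, features))
    return "Active" if score > 0 else "Inactive"


def train_perceptron(examples, classes, epochs=20, lr=0.1):
    """Training: set one weight per feature from labeled examples."""
    weights = [0.0] * len(examples[0])
    for _ in range(epochs):
        for features, cls in zip(examples, classes):
            target = 1 if cls == "Active" else -1
            guess = 1 if predict(weights, features) == "Active" else -1
            if guess != target:
                # Nudge the weights toward the misclassified example's class.
                weights = [w + lr * target * f
                           for w, f in zip(weights, features)]
    return weights


# m = 4 examples, n = 3 features (invented values).
examples = [[1.0, 2.0, 1.0], [2.0, 3.0, 1.0],
            [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]
classes = ["Active", "Active", "Inactive", "Inactive"]

W = train_perceptron(examples, classes)
prediction = predict(W, [1.5, 1.0, 1.0])  # an unclassified example
print(W, prediction)
```

Each row of `examples` corresponds to one row of the feature matrix on the slide, and `W` has one entry per feature column.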

Page 37: Modeling: How To Use It

Biology: Make a protein and test it in-vitro.
• PRO: Real
• CON: Slow

Computer Generated Structure: Predict a protein structure in-silico.
• PRO: Fast
• CON: Inaccurate, what does it tell us?

Machine Learning: Use Homology Modeling to guide biological research.

Page 38: Maximum Curiosity

1. Start with a training set of examples with known classes and an unclassed test set.

2. Choose a mutant from the test set that has not been considered yet. Assume the chosen mutant is “Active” or “Inactive”.

3. Cross-validate the training set with the chosen mutant and record the correlation coefficient.

[Diagram labels: Knowledge; Model; Experiment; Find the Mutants that Most Improve the Training Set]

Page 39: Exploring New p53 Regions

• Each new p53 region potentially introduces new rescue mechanisms.
• New pools of mutants restart the Active Learning problem.

[Diagram: p53 Core Domain; regions 113-124 and 281-289]

Page 40: Primary Collaborators

Dr. Rainer Brachmann, School of Medicine

Dr. Richard Lathrop, School of Information and Computer Science

Jue Zeng, School of Medicine