sam danziger institute for genomics and bioinformatics department of biomedical engineering

Choosing where to look next in a mutation sequence space:

Active Learning of informative p53 cancer rescue mutants

Sam Danziger, Institute for Genomics and Bioinformatics

Department of Biomedical Engineering

University of California, Irvine

www.SamDanziger.com

Rainer Brachmann, Department of Medicine

Richard Lathrop, Department of Computer Science

Jue Zeng, Department of Medicine

University of California, Irvine

Outline

• Overview: Computer Guided Discovery
• Problem: Cancer and p53
• Results: Best Active Learning
• Next: Future Experiments

Computer Guided Discovery of “Active” Mutant Proteins

[Figure: a small set of Known Mutants within the much larger set of Other Possible Mutants.]

• Starting Point: A biomedically important protein with some known mutants.
• Problem: Find novel mutant proteins with an “Active” phenotype.
• Naive Solution: Make and test all other possible mutants in the wet lab.

Why Use Computers?

Spiral Galaxy M101

http://hubblesite.org/

~10^9 stars.

Known Mutants: ~10^2

How Many Mutants are There? ~10^11 (assuming up to 5 mutations in 200 residues)

A Better Solution: Active Learning

Pick the best unknown mutants to know

[Figure: the Active Learning loop. A classifier is trained on a training set of known examples (Example 1 through Example N); an example is chosen from the unknown pool (the remaining examples up to Example M); the new example is added to the training set; the cycle repeats.]
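The loop in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names (`active_learning_loop`, `train_threshold`, `closeness`, `wet_lab`), the one-dimensional threshold classifier, and the toy data are all assumptions made for the example. The `oracle` argument stands in for the wet-lab test that reveals a chosen mutant's true class.

```python
def active_learning_loop(training, unknown, train, score, oracle, rounds):
    """Train, choose the best-scoring unknown example, label it, repeat."""
    training = list(training)
    unknown = list(unknown)
    chosen = []
    for _ in range(rounds):
        if not unknown:
            break
        model = train(training)                       # Train the classifier
        best = max(unknown, key=lambda x: score(model, x))  # Choose an example
        unknown.remove(best)
        training.append((best, oracle(best)))         # Add it to the training set
        chosen.append(best)
    return train(training), chosen


# --- Hypothetical toy problem: one feature, classes 0 (inactive) and 1 (active) ---
def train_threshold(training):
    """Stand-in classifier: midpoint between the two classes."""
    actives = [x for x, y in training if y == 1]
    inactives = [x for x, y in training if y == 0]
    return (min(actives) + max(inactives)) / 2.0


def closeness(threshold, x):
    """Higher score for examples nearer the decision threshold."""
    return -abs(x - threshold)


def wet_lab(x):
    """Stand-in for the experiment; the true boundary (0.6) is made up."""
    return 1 if x >= 0.6 else 0


model, chosen = active_learning_loop(
    training=[(0.0, 0), (1.0, 1)],
    unknown=[0.1, 0.3, 0.5, 0.7, 0.9],
    train=train_threshold,
    score=closeness,
    oracle=wet_lab,
    rounds=2,
)
print(chosen, model)
```

With only two experiments the toy classifier moves its threshold from 0.5 to about 0.6, because each chosen example sits where the current model is least certain.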

An Example of Active Learning: Minimum Marginal Hyperplane

Should unknown Mutant 1 or Mutant 2 be added to the training set?

[Figure: Known Active and Known Inactive mutants separated by a hyperplane into ACTIVE and INACTIVE regions, with Unknown Mutant 1 and Unknown Mutant 2 plotted among them.]

Select Mutant 2
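As a rough sketch of this selection rule, assuming a linear decision function w·x + b has already been trained: the unknown mutant with the smallest distance to the hyperplane is the one selected. The weights, bias, and feature vectors below are hypothetical values chosen only so that the outcome mirrors the slide.

```python
import math


def margin_distance(weights, bias, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(sum(w * xi for w, xi in zip(weights, x)) + bias) / norm


def select_min_margin(weights, bias, candidates):
    """Return the name of the candidate closest to the hyperplane."""
    return min(
        candidates,
        key=lambda name: margin_distance(weights, bias, candidates[name]),
    )


# Hypothetical hyperplane and two-feature mutants (not taken from the slides).
weights, bias = [1.0, -1.0], 0.0
candidates = {
    "Mutant 1": [3.0, 0.5],   # far from the boundary
    "Mutant 2": [1.0, 0.8],   # close to the boundary
}
chosen = select_min_margin(weights, bias, candidates)
print(chosen)
```

The intuition is that a mutant far from the boundary is already classified with confidence, so testing it tells the classifier little.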

Another Example: Maximum Curiosity

Should Mutant 1 or Mutant 2 be added to the training set?

[Figure: for each candidate mutant, a cross-validator scores the Training Set alone, the Training Set + the mutant assumed Active, and the Training Set + the mutant assumed Inactive, and the change in correlation coefficient is recorded. Values shown: .0411 and -.6014.]

Select Mutant 1

A Third Example: Entropic Tradeoff

[Figure: Known Active, Known Inactive, and Unclassified mutants plotted across the ACTIVE and INACTIVE regions, with the Selected Unclassified mutant marked.]

Which is the Best Active Learning Method?

TYPE I: Select mutants that most improve the classifier if correctly predicted.
• Maximum Curiosity
• Composite Classifier
• Improved Composite Classifier

TYPE II: Select mutants that most improve the classifier.
• Additive Curiosity
• Additive Bayesian Surprise

TYPE III: Common methods taken from the literature.
• Minimum Marginal Hyperplane
• Maximum Entropy

TYPE IV: Variations on methods from the literature.
• Maximum Marginal Hyperplane
• Minimum Entropy
• Entropic Tradeoff

TYPE C: Controls
• Non-iterated Prediction
• Predict All Inactive
• Random (30 trials)

The Problem: p53 and Cancer

p53 mutations occur in ~50% of human cancers

• Tumor Suppressor Protein.
• Receives upstream signals indicating cellular stress.
• Acts as a transcription factor in the cancer suppression pathway.

[Figure: p53 core domain bound to DNA. Image generated with UCSF Chimera.]

Cho, Y., Gorina, S., Jeffrey, P.D., Pavletich, N.P. Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations. Science v265 pp.346-355, 1994

The p53 Cancer Pathway

David W. Meek: http://www.dundee.ac.uk/biomedres/meek.htm

[Figure: p53 domain map showing the Transactivation domain, the Core domain for DNA binding (residues 102-292), and the Tetramerization domain (residues 324-355).]

The Concept of “Cancer Rescue”: Second-site Suppressor Mutations

[Figure: cancer mutation prevalence chart with labels 248, 249, and 235+240.]

Cancer mutation prevalence data from the IARC p53 database: http://www-p53.iarc.fr/

Ultimate Goal

[Figure: Inactive p53 Cancer Mutant + Engineered Small Molecule Drug -> Functionally Active Rescued p53.]

Advance medical practice by revealing p53 mutant functional properties across p53’s mutation sequence space.

Intermediate Goal

Find novel p53 Cancer Rescue Mutants.

Immediate Goal

Evaluating Cancer Rescue Mutants in the Wet Lab

A yeast containing an inactive p53 cancer mutant will not grow.

A yeast containing an active p53 cancer rescue mutant will grow.

[Figure: panels labeled INACTIVE and ACTIVE.]

Baroni, T.E., Wang, T., Qian, H., Dearth, L.R., Truong, L.N., Zeng, J., Denes, A.E., Chen, S.W. and Brachmann, R.K. (2004) A global suppressor motif for p53 cancer mutants. Proc Natl Acad Sci U S A, 101, 4930-5.

In Vitro Phenotype

In a Nutshell

Cancer Rescue Mutants

[Figure: a cycle linking Knowledge, Model, and Experiment around the goal “Find all p53 cancer rescue mutants”.]

• Build classifiers of putative p53 cancer rescue mutants.
• Use Active Learning to select the p53 mutants that will be the most informative.
• Test the predictions in-vitro.

The Active Learning Tradeoff: How Fast Does It Learn?

The Active Learning Tradeoff: How Accurate On The Chosen?

204 Predicts 57

Type  Method                         Accuracy           Correlation Coefficient  Student-T
I     Maximum Curiosity              77.19% +/- 5.61%   .5255                    0.00%
I     Composite Classifier           70.18% +/- 6.11%   .4447                    100.0%
I     Improved Composite Classifier  71.93% +/- 6.00%   .4637                    100.0%
II    Additive Curiosity             73.68% +/- 5.88%   .3857                    99.81%
II    Additive Bayesian Surprise     73.68% +/- 5.88%   .4342                    99.81%
III   Minimum Marginal Hyperplane    64.91% +/- 6.38%   .2845                    100.0%
III   Maximum Entropy                64.91% +/- 6.38%   .2845                    100.0%
IV    Maximum Marginal Hyperplane    78.95% +/- 5.45%   .3699                    90.42%
IV    Minimum Entropy                77.19% +/- 5.61%   .3406                    0.00%
IV    Entropic Tradeoff              80.70% +/- 5.27%   .4860                    99.89%
C     Non-iterated Prediction        56.14% +/- 6.63%   .2530                    100.0%
C     Predict All Inactive           80.70% +/- 5.27%   .0000                    99.89%
C     Random (30 trials)             74.39% +/- 3.87%   .3550 +/- .0992          99.24% +/- 2.89%
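The accuracy columns can be reproduced with a short calculation. The sketch below rests on two assumptions that the slides do not state: that the "+/-" value on single-run rows is the standard error sqrt(p(1-p)/(n-1)) over the n = 57 predicted mutants, and that the correlation coefficient is the Matthews correlation coefficient. The counts used (44 of 57 correct, and 46 true negatives with 11 false negatives for Predict All Inactive) are inferred from the reported percentages, not taken from the slides.

```python
import math


def accuracy_with_error(correct, total):
    """Accuracy and an assumed standard error sqrt(p * (1 - p) / (n - 1))."""
    p = correct / total
    return p, math.sqrt(p * (1.0 - p) / (total - 1))


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# 44 of 57 correct is consistent with the 77.19% +/- 5.61% rows.
acc, err = accuracy_with_error(44, 57)
print(f"{acc:.2%} +/- {err:.2%}")

# Predicting every mutant inactive: no positives are ever predicted,
# so the correlation is 0 despite the high accuracy.
all_inactive_cc = matthews_cc(tp=0, tn=46, fp=0, fn=11)
print(all_inactive_cc)
```

Under these assumptions the Predict All Inactive control shows why accuracy alone is misleading here: it ties for the highest accuracy in the table while carrying no information about which mutants are active.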

The Tradeoff

How Fast Does It Learn?

• Sum? Length + Width
• Geometric Distance?
• Area? Length * Width

Solution: Average Score of All Three Metrics

[Figure: Maximum Curiosity, Entropic Tradeoff, and Minimum Marginal Hyperplane compared.]
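One way to read "average score of all three metrics" is sketched below. The slides do not define Length and Width or the scoring scale, so everything here is an assumption: Length and Width are treated as two non-negative scores per method, Geometric Distance is taken to be the Euclidean norm, and each metric awards rank points (1 for the worst method up to N for the best) that are then averaged. The method names and numbers are placeholders, not results.

```python
import math


def combined_scores(length, width):
    """Three candidate ways of combining two scores into one."""
    return {
        "sum": length + width,
        "distance": math.hypot(length, width),  # assumed meaning of Geometric Distance
        "area": length * width,
    }


def average_rank_scores(methods):
    """Average, over the three metrics, of each method's rank points."""
    names = list(methods)
    per_method = {n: combined_scores(*methods[n]) for n in names}
    totals = {n: 0.0 for n in names}
    for metric in ("sum", "distance", "area"):
        ordered = sorted(names, key=lambda n: per_method[n][metric])
        for points, name in enumerate(ordered, start=1):  # worst gets 1 point
            totals[name] += points
    return {n: totals[n] / 3.0 for n in names}


# Placeholder (length, width) pairs for three hypothetical methods.
methods = {"A": (4.0, 1.0), "B": (2.0, 2.5), "C": (1.0, 1.0)}
scores = average_rank_scores(methods)
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy case method B wins on area while method A wins on sum and distance, which is the kind of disagreement that averaging across metrics is meant to smooth over.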

The Overall Best

Rank  Method                       Average Score
1     Maximum Curiosity            6.11
2     Entropic Tradeoff            5.56
3     Random (30 trials)           5.50
4     Minimum Entropy              4.44
5     Maximum Marginal Hyperplane  3.22
6     Maximum Entropy              3.22
7     Additive Bayesian Surprise   2.89
8     Minimum Marginal Hyperplane  2.33
9     Additive Curiosity           1.89

How Fast Does It Learn? The Three Previous Examples

How Accurate On The Chosen? The Three Previous Examples

204 Predicts 57

Type  Method                       Accuracy           Correlation Coefficient  Student-T
I     Maximum Curiosity            77.19% +/- 5.61%   .5255                    0.00%
III   Minimum Marginal Hyperplane  64.91% +/- 6.38%   .2845                    100.0%
IV    Entropic Tradeoff            80.70% +/- 5.27%   .4860                    99.89%
C     Non-iterated Prediction      56.14% +/- 6.63%   .2530                    100.0%
C     Predict All Inactive         80.70% +/- 5.27%   .0000                    99.89%
C     Random (30 trials)           74.39% +/- 3.87%   .3550 +/- .0992          99.24% +/- 2.89%

Why Does Random Do So Well?

Tong, S. and D. Koller (2002). "Support vector machine active learning with applications to text classification." The Journal of Machine Learning Research 2: 45-66.

Very Few Examples

Exploring New p53 Regions

• Each new p53 region potentially introduces new rescue mechanisms.
• New pools of mutants restart the Active Learning problem.

[Figure: p53 Core Domain with regions 113-124 and 281-289 and residues 248 and 273 marked.]

Most Interesting or Most Interesting Active?

Which Finds More Active Cancer Rescue Mutants?

[Figure: starting from the Known Mutants, Iterations 1 to 3 of “Select The Most Interesting” compared with Iterations 1 to 3 of “Select The Most Interesting Active”.]

Conclusion

[Figure: a cycle linking Theory and Knowledge around the goal “Find Cancer Rescue Mutants”.]

Acknowledgments

Baldi Lab: Pierre Baldi, Jonathan Chen, Hiroto Saigo, S. Joshua Swamidass

Brachmann Lab: Rainer Brachmann, Jue Zeng

Lathrop Lab: Richard Lathrop, Gabe Moothart

Leuke Lab: Ying Wang

Luo Lab: Ray Luo, Qiang Lu

Funding: National Institutes of Health (p53: CA112560), UCI Office of Research and Graduate Studies, UCI Institute for Genomics and Bioinformatics (BIT: LM007443), US Department of Energy (DOE)

Questions?


Most Interesting Region

Scan the p53 core domain to find the most interesting region.

Create All Single Point Mutations in a Region in-vitro?

CODA*: Assemble p53 using thermodynamically optimized oligonucleotides.

Allow all possible mutations within a region.

Assemble mutated region with cancer mutants to look for rescue mutants.

*http://www.codagenomics.com/

Knowledge Representation: Homology Modeling

Modeling done using Amber™ with zinc ion characteristics tuned by Dr. Qiang Lu working in Dr. Ray Luo’s lab.

1. Take a wild type crystal structure of the protein in question.

2. Substitute one or more amino acids to mutate the protein.

3. Apply simulated physical laws to determine an energy function.

4. Minimize the energy of the new mutant protein.

Knowledge Representation: Features

Simulated Structure -> String of Numbers

• 1d: Sequence Mutation Features
• s1d: Sequence Similarity Features
• 2d: Surface Map Features
• 3d: Atomic Position Features
• 4d: “Time Dependent” Stability Information

What is Machine Learning?

Training: Set the parameters (W) with n features.

Testing: Use the parameters (W) to predict unclassified examples.

[Figure: a training table with one row per example (Example 1 to Example m), n feature columns (F11 ... F1n through Fm1 ... Fmn), and a known class per row (Class 1 to Class m); an Unknown example with features F11 ... F1n is mapped to a Prediction.]
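The training and testing steps can be illustrated with a very small linear classifier. This is only a sketch: the perceptron update rule, the function names, and the four-example data set are assumptions chosen for brevity, and the slides do not say that this classifier was the one used. The weight list `w` plays the role of the parameters (W), with one weight per feature plus a bias.

```python
def predict(w, row):
    """Testing: use the parameters W to classify one feature row (1 or 0)."""
    s = sum(wj * xj for wj, xj in zip(w, row)) + w[-1]  # last weight is the bias
    return 1 if s > 0 else 0


def train_perceptron(features, classes, epochs=20, rate=1.0):
    """Training: set the parameters W from an m x n feature table."""
    n = len(features[0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for row, label in zip(features, classes):
            error = label - predict(w, row)   # 0 when the example is already right
            for j in range(n):
                w[j] += rate * error * row[j]
            w[n] += rate * error
    return w


# Hypothetical training table: m = 4 examples, n = 2 features, known classes.
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
classes = [0, 0, 0, 1]

w = train_perceptron(features, classes)
predictions = [predict(w, row) for row in features]
print(w, predictions)
```

Once `w` is fixed, classifying an unknown example is a single call to `predict` with that example's n features.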

Modeling: How To Use It

• Biology: Make a protein and test it in-
  PRO: Real
  CON: Slow
• Computer Generated Structure: Predict a protein structure in-silico
  PRO: Fast
  CON: Inaccurate, what does it tell us?
• Machine Learning: Use Homology Modeling to guide biological research

Maximum Curiosity

1. Start with a training set of examples with known classes and an unclassed test set.

2. Choose a mutant from the test set that has not been considered yet.

3. Assume the chosen mutant is “Active” or “Inactive”.

4. Cross-validate the training set with the chosen mutant and record the correlation coefficient.
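The Maximum Curiosity procedure can be sketched as follows. Several pieces are assumptions made to keep the example self-contained: a 1-nearest-neighbor classifier stands in for the real classifier, leave-one-out is used as the cross-validation scheme, the Matthews correlation coefficient is used as the correlation coefficient, and a candidate is scored by the larger of its two changes in correlation (assumed Active or assumed Inactive). The slides do not specify how the two assumed labels are combined, and the data are invented.

```python
import math


def mcc(pairs):
    """Matthews correlation over (true, predicted) label pairs."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def nearest_label(examples, x):
    """Stand-in classifier: label of the nearest training example."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(examples, key=lambda ex: dist(ex[0], x))[1]


def cross_validated_cc(examples):
    """Leave-one-out cross-validation, summarized by the correlation coefficient."""
    pairs = []
    for i, (x, y) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        pairs.append((y, nearest_label(rest, x)))
    return mcc(pairs)


def curiosity(training, candidate):
    """Best change in correlation when the candidate is assumed 0 or 1."""
    base = cross_validated_cc(training)
    return max(
        cross_validated_cc(training + [(candidate, label)]) - base
        for label in (0, 1)
    )


def select_most_curious(training, test_set):
    return max(test_set, key=lambda c: curiosity(training, c))


# Invented one-feature data: 0 = inactive, 1 = active.
training = [((0.0,), 0), ((1.0,), 0), ((2.6,), 1), ((6.0,), 1)]
test_set = [(3.0,), (8.0,)]
selected = select_most_curious(training, test_set)
print(selected)  # -> (3.0,)
```

In this toy case the candidate at 3.0 is selected because adding it as Active repairs the one cross-validation error near the class boundary, while the candidate at 8.0 leaves that error in place.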

[Figure: a cycle linking Knowledge and Model around the goal “Find the Mutants that Most Improve the Training Set”.]


Primary Collaborators

Dr. Rainer Brachmann, School of Medicine

Dr. Richard Lathrop, School of Information and Computer Science

Jue Zeng, School of Medicine
