pairwise learning for synthetic biology

PAIRWISE LEARNING FOR SYNTHETIC BIOLOGY

Michiel Stock @michielstock


KERMIT

The many applications of pairwise learning in bioscience engineering


De Clercq et al. (2015)

Costello et al. (2014) Stock et al. (2014)

Gonnelli et al. (2015) Van Peer et al. (2016)

Stock et al. (2013) Stock et al. (2017)

OUTLINE OF THIS TALK

1. Explain pairwise learning through an accessible example

2. Show how pairwise learning ties in with synthetic biology

3. Case study: design of P450 proteins

4. Machine learning and synthetic biology in a broader context


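PAIRWISE DATA

Each observation is a triplet (person, beer, rating), with the rating on a scale of 1-5:

(person, beer, 2), (person, beer, 5), (person, beer, 2), (person, beer, 1), (person, beer, 2), (person, beer, 4), …

Any resemblance to actual persons, living or dead, or actual events is purely coincidental.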

BEER RATINGS DATASET

[Figure: the ratings matrix Y (persons × beers)]

PRIOR KNOWLEDGE ON THE PERSONS

e.g. an annotated social network

[Figure: social network of persons A to K. Edges: knows. Node annotations: female, male, likes dark beers, likes bitter flavors.]

PRIOR KNOWLEDGE ON THE BEERS


e.g. description, composition

PAIRWISE PREDICTION BASED ON FEATURES

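A pairwise prediction model:

$f(\text{person}, \text{beer}) = h(\phi(\text{person}), \psi(\text{beer}))$

where $\phi(\text{person})$ and $\psi(\text{beer})$ are the descriptions of the person and the beer.

Find f by solving the following optimisation problem:

$\min_f \sum_{(i,j) \in T} L(y_{ij}, f_{ij}) + \lambda \|f\|_{\mathcal{H}}$

where $T$ is the set of observations, the first term is the data fitting penalty and the second term is the model complexity penalty.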
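The optimisation problem above leaves the loss L and the function space H open. As a minimal sketch, assume a squared loss, a squared norm penalty and a pairwise kernel that is the product of a person kernel and a beer kernel; this gives kernel ridge regression with a closed-form solution. None of these choices, nor the names `fit_pairwise_krr` and `predict_all` or the toy data, come from the slides.

```python
import numpy as np

def fit_pairwise_krr(K_person, K_beer, pairs, y, lam=1.0):
    """Fit kernel ridge regression on the observed (person, beer) pairs.

    Assumed pairwise kernel (not from the slides):
    k((i, j), (k, l)) = K_person[i, k] * K_beer[j, l].
    """
    i, j = pairs[:, 0], pairs[:, 1]
    # Kernel matrix between all observed pairs.
    gamma = K_person[np.ix_(i, i)] * K_beer[np.ix_(j, j)]
    # Dual weights: solve (gamma + lam * I) alpha = y.
    return np.linalg.solve(gamma + lam * np.eye(len(y)), y)

def predict_all(K_person, K_beer, pairs, alpha):
    """Predict a rating for every (person, beer) combination."""
    i, j = pairs[:, 0], pairs[:, 1]
    # F[p, b] = sum_k alpha[k] * K_person[p, i_k] * K_beer[b, j_k]
    return (K_person[:, i] * alpha) @ K_beer[:, j].T

# Toy data (made up): 4 persons with 3 features, 5 beers with 2 features.
rng = np.random.default_rng(0)
X_person = rng.normal(size=(4, 3))
X_beer = rng.normal(size=(5, 2))
K_person = X_person @ X_person.T  # linear kernels as a simple choice
K_beer = X_beer @ X_beer.T

# Observed (person index, beer index) pairs and their ratings.
pairs = np.array([[0, 0], [0, 3], [1, 1], [2, 4], [3, 2], [3, 0]])
y = np.array([2.0, 5.0, 2.0, 1.0, 2.0, 4.0])

alpha = fit_pairwise_krr(K_person, K_beer, pairs, y, lam=0.1)
F = predict_all(K_person, K_beer, pairs, alpha)
print(F.round(2))  # predicted ratings, persons x beers
```

Larger values of `lam` put more weight on the model complexity penalty and pull the predictions towards zero; smaller values fit the observed ratings more closely.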

FROM OBSERVATIONS TO PREDICTIONS

[Figure: the matrix Y (what people have rated, persons × beers) is mapped to the matrix F (ratings predicted by the model), predicting the missing values.]

FOUR SETTINGS FOR PAIRWISE PREDICTION

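Setting A: same persons, same beers

Setting B: new persons, same beers

Setting C: same persons, new beers

Setting D: new persons, new beers

[Figure: the persons × beers matrix extended with new persons and new beers, split into blocks A, B, C and D. Prediction gets harder from Setting A to Setting D.]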
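When evaluating a model, it helps to know which of the four settings each test pair falls in. A minimal sketch, assuming pairs are stored as (person index, beer index) rows and that the sets of held-out persons and beers are known; the function name `label_settings` and the toy indices are illustrative only.

```python
import numpy as np

def label_settings(pairs, new_persons, new_beers):
    """Label each (person, beer) pair with its prediction setting A, B, C or D."""
    person_is_new = np.isin(pairs[:, 0], sorted(new_persons))
    beer_is_new = np.isin(pairs[:, 1], sorted(new_beers))
    labels = np.full(len(pairs), "A")             # same persons, same beers
    labels[person_is_new & ~beer_is_new] = "B"    # new persons, same beers
    labels[~person_is_new & beer_is_new] = "C"    # same persons, new beers
    labels[person_is_new & beer_is_new] = "D"     # new persons, new beers
    return labels

# Toy example: person 2 and beer 2 were not seen during training.
pairs = np.array([[0, 0], [0, 2], [1, 1], [2, 0], [2, 2]])
labels = label_settings(pairs, new_persons={2}, new_beers={2})
print(labels)
```

Reporting performance separately per label avoids mixing the easier Setting A pairs with the harder Setting D pairs in a single score.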

RANKING OF BEERS

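For a given person, find the most relevant beers in a database by solving the following optimization problem:

$\text{beer}^* = \arg\max_{\text{beer} \in S} f(\text{person}, \text{beer})$

where $S$ is the set of beers in the database.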
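For a finite database, this arg max amounts to scoring every beer for the given person and sorting. A minimal sketch, assuming the model's predicted scores for one person are already available as an array; the name `top_k_beers` and the score values are made up for illustration.

```python
import numpy as np

def top_k_beers(scores, k):
    """Return the indices of the k beers with the highest predicted score."""
    # Sort by descending score; a stable sort keeps ties in database order.
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:k]

# Hypothetical predicted ratings f(person, beer) for four beers.
scores = np.array([3.1, 4.7, 1.2, 4.0])
best = top_k_beers(scores, k=2)
print(best)
```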

DESIGNING BEERS

What if we want to make a new beer?

$\text{beer}^* = \arg\max_{\text{beer} \in S} f(\text{person}, \text{beer})$

where $S$ is now the space of all beers.

PAIRWISE LEARNING FOR SYNTHETIC BIOLOGY


network inference

protein-ligand prediction

protein design

IN SILICO ENGINEERING OF P450 ENZYMES

➤ P450: superfamily of oxidoreductase enzymes.

➤ Occur throughout the tree of life.

➤ Huge medical and industrial importance!

➤ Focus: optimizing CYP52M1 to catalyze the hydroxylation of palmitic acid (production of biosurfactants).

Master's thesis of ir. Laurentijn Tilleman

WORKFLOW

In this thesis, a workflow is proposed for optimizing the sequence of cytochrome P450s based on these pairwise interactions (Figure 1.1). First, a description is given of what cytochrome P450s are and which applications they have in industrial biotechnology and the medical sector. Next, the existing models for predicting these pairwise interactions are discussed. A subsequent chapter describes which data can be used for the machine learning models and where they come from. Before the models can be built, an overview is given of the features that can be used for the sequences and for the structures of the chemical compounds. In the chapter on the different models, several machine learning models are compared both with each other and with the literature. Based on these models, an optimization is then carried out of a cytochrome P450 discussed in the literature review.

Figure 1.1: The different steps of the workflow for optimizing cytochrome P450s as proposed in this thesis.

[Figure: workflow steps: Data collection, Feature representation, Model building, Sensitivity, Optimization. The data form a proteins × ligands interaction matrix.]

OPTIMIZE PROTEIN USING SIMULATED ANNEALING

➤ Construct prior distribution P based on alignment.

➤ With decreasing temperature (T), sample proteins x with probability proportional to

$E(x \mid s) = f(x, s) + T\,P(x)$

where $x$ is the protein, $s$ the substrate, $f(x, s)$ the pairwise prediction score, $T$ the temperature (trading off a likely protein against a high predicted score) and $P(x)$ the prior based on alignment.
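The slide does not spell out the sampling scheme, so the following is only a sketch of how such an annealing loop could look. It assumes single-site mutations as proposals and a standard Metropolis acceptance rule on the objective score(x) + T * prior(x); both are stand-ins, not the method of the thesis. The functions `toy_score` and `toy_prior`, the sequence `WILD_TYPE` and the cooling schedule are hypothetical placeholders for the pairwise model, the alignment-based prior and the real protein.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def anneal(start, score, prior, temperatures, rng):
    """Simulated annealing over sequences; returns the best-scoring one visited."""
    def objective(seq, T):
        # Predicted score plus temperature-weighted prior term.
        return score(seq) + T * prior(seq)

    current = best = start
    for T in temperatures:
        # Propose a single-site mutation (assumed proposal scheme).
        pos = rng.randrange(len(current))
        candidate = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        delta = objective(candidate, T) - objective(current, T)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta >= 0 or rng.random() < math.exp(delta / T):
            current = candidate
            if score(current) > score(best):
                best = current
    return best

WILD_TYPE = "MKTAYIAKQR"  # made-up sequence, not CYP52M1

def toy_score(seq):
    # Placeholder for the pairwise prediction f(protein, substrate).
    return seq.count("L")

def toy_prior(seq):
    # Placeholder prior: penalize every deviation from the wild type.
    return -sum(a != b for a, b in zip(seq, WILD_TYPE))

temperatures = [2.0 * 0.98 ** k for k in range(500)]  # geometric cooling
best = anneal(WILD_TYPE, toy_score, toy_prior, temperatures, random.Random(0))
print(best, toy_score(best))
```

At high temperature the prior term dominates and the samples stay close to likely proteins; as the temperature drops, the predicted score takes over.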


SAMPLES FROM THE EQUILIBRIUM DISTRIBUTION


Figure 7.6: Mutual information circos plot for 1000 optimized CYP52M1 sequences, made with MISTIC. The outer circle shows the amino acid of the consensus sequence for each position. The coloured rectangles in the second circle indicate how conserved each position is: blue indicates low conservation, red indicates high conservation. Positions with a mutual information greater than 6.5 are connected. The top 5 % are connected with a red line, the top 30 % with a black line and the remaining 70 % with a grey line. The histogram in the third circle indicates how many positions each position is correlated with.

Study the distribution of the optimized proteins.
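One way to study such a set of sampled sequences is to measure how strongly pairs of positions co-vary. The sketch below computes the plain mutual information (in bits) between two positions of equal-length sequences. It is an illustration only: MISTIC's own scores may include corrections that are not reproduced here, and the function name and toy sequences are made up.

```python
import math
from collections import Counter

def mutual_information(seqs, a, b):
    """Plain mutual information (bits) between positions a and b."""
    n = len(seqs)
    count_a = Counter(s[a] for s in seqs)
    count_b = Counter(s[b] for s in seqs)
    count_ab = Counter((s[a], s[b]) for s in seqs)
    # Sum over observed residue pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )

# Toy examples: perfectly coupled positions versus independent positions.
coupled = ["AC", "AC", "GT", "GT"]
independent = ["AC", "AT", "GC", "GT"]
print(mutual_information(coupled, 0, 1))
print(mutual_information(independent, 0, 1))
```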


THE TRIANGLE OF DATA-DRIVEN SYNTHETIC BIOLOGY


Molecular techniques

High-throughput screening

Mathematical modelling and optimisation

Concept borrowed from Yves Briers

ACKNOWLEDGEMENTS

➤ Thanks to Wouter Naessens for aiding with the beer example.

➤ The P450 engineering case study is mainly the work of Laurentijn Tilleman.

