Understanding the Language of Virus Proteins to Automatically Detect Drug Resistance
Betty Cheng, Jaime Carbonell
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
Outline:
HIV & Drug Resistance
Phenotype Prediction Models
Machine Learning
Language of Proteins
Document Classification of HIV Genotypes
Comparison to State-of-the-Art & Human Experts
Other Area of Application: GPCR
Conclusions
Drug resistance is an obstacle in the treatment and control of many infectious diseases.
33.2 million people were living with HIV/AIDS in 2007; 2.1 million died from AIDS in 2007.
The high mutation rate of HIV leads to a quasi-species of virus strains inside each patient.
(Figure: diversity within the quasi-species; labels "25% diversity" and "4%".)
There are currently ~25 drugs in 4 main drug classes.
Treatments with 3+ drugs (HAART) are used to cover as many virus strains as possible in the quasi-species.
Personalized medicine: trial-and-error is not an option due to cross-resistance.
Goal: optimize treatment so that the virus population takes as long as possible to develop resistance.
Current practice: phenotype is predicted from genotype test results to identify the resistance present now.
Problem: predict resistance (high/low/none) to each drug, given the patient's HIV genotype.
Example: the Rega and ANRS systems. If at least Z of <list of mutations> are present, then predict resistance level Y to drug X.
Example: HIVdb. Sum the penalty scores from each mutation.
Advantage: the reason for a prediction is easy to understand.
Disadvantage: impossible to maintain as more data and drugs become available.
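The two rule styles above can be sketched as follows. This is a hypothetical illustration: the mutation names, threshold Z, and penalty scores are invented, not the actual Rega/ANRS or HIVdb rules.

```python
def rega_style(mutations, rule_muts, z, level):
    """Rega/ANRS style: if at least z of rule_muts are present,
    predict resistance `level`, otherwise 'none'."""
    hits = len(set(mutations) & set(rule_muts))
    return level if hits >= z else "none"

def hivdb_style(mutations, penalties, cutoffs=(15, 60)):
    """HIVdb style: sum per-mutation penalty scores and map the
    total onto none/low/high via (invented) score cutoffs."""
    total = sum(penalties.get(m, 0) for m in mutations)
    if total >= cutoffs[1]:
        return "high"
    return "low" if total >= cutoffs[0] else "none"
```

Both styles make the reason for a prediction transparent, but every new drug or mutation means hand-editing the rule set, which is exactly the maintenance problem noted above.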
Example: VirtualPhenotype™ (from Virco). Find the database sequence most similar to the test sequence at all selected mutation positions.
It does not interpolate between partial matches.
Advantage: no rules to maintain.
Disadvantages: human experts are still needed to identify mutation positions, and a large amount of data is needed to ensure a database match.
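A minimal sketch of this kind of database lookup (assumed behaviour for illustration, not Virco's actual algorithm; the sequences, positions, and labels are invented):

```python
def db_match(test_seq, db, positions):
    """Return phenotypes of database entries identical to test_seq
    at every expert-selected position.  Returns an empty list when
    there is no exact match: the method does not interpolate
    between partial matches."""
    key = tuple(test_seq[p] for p in positions)
    return [pheno for seq, pheno in db
            if tuple(seq[p] for p in positions) == key]
```

The empty-list case makes the data-hunger concrete: without a database sequence matching at all selected positions, the system has no answer.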
Machine-learning systems can "learn" by detecting patterns in training data (induction).
This enables knowledge discovery.
Systems vary in the type of features and the learning algorithm.
Features: presence of a mutation (sufficient for protease inhibitors; used in the majority of studies), mutation-based, or structure-based.
Maintenance is just re-running the learning algorithm on new data.
Training takes minutes to hours; testing a sequence takes seconds to minutes.
Glass-box algorithms allow knowledge discovery; black-box algorithms are more tolerant of extra features.
Existing systems trade off between black-box algorithms and expert-selected mutations.
Examples: a decision tree for EFV (Beerenwinkel, 2002); a neural network with 27 mutations.
Document classification assigns a topic to a document based on its words.
There is a trade-off between using all English words and selecting keywords.
Chi-square feature selection was found to be best at selecting keywords in text [Yang et al., 1997].
(Figure: example word features. Stopwords such as "a", "to", "the" carry little topic information, while keywords such as "ball", "hoop", "basket", "bat", "glove", "tackle", "touchdown" distinguish sports topics.)
View target virus proteins as documents: the alphabet has 20 amino acids, and there are no word/motif boundaries (as in Thai or Japanese text).
Features: position-independent n-grams and position-dependent n-grams (mutations).
Extract n-grams from every reading frame and represent each sequence as a vector of n-gram counts.
Example sequences:
G S V E R D S V E E V L K A F R L F D D G N S G T…
G S G M R M S R E Q L L N A W R L F C K D N S H T…
G S G E R D S R E E I L K A F R L F D D D N S G T…
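The extraction step can be sketched in a few lines. This is a simplified illustration: real features also pair each n-gram with the count thresholds described next.

```python
from collections import Counter

def ngram_counts(seq, n_max=3):
    """Count position-independent n-grams (n = 1..n_max)
    starting at every offset of the sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def position_features(seq):
    """Position-dependent features such as '188G'
    (residue G at position 188, 1-based)."""
    return {f"{i + 1}{aa}" for i, aa in enumerate(seq)}
```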
Each candidate feature pairs an n-gram (or mutation) with a count threshold:
Unigrams (e.g. A): count ≥ 5, ≥ 10, …, ≥ 100
Bigrams (e.g. AA) and trigrams (e.g. AAA): count ≥ 1, ≥ 2, …, ≥ 20
Mutations (e.g. 188G): frequency ≥ 0.05, ≥ 0.1, …, ≥ 1
The chi-square score of a feature x is

  χ²(x) = Σ_c [o(x,c) − e(x,c)]² / e(x,c),   with   e(x,c) = (n_x · t_c) / N

where o(x,c) is the observed number of sequences with feature x and resistance level c, e(x,c) is the expected number of sequences with feature x and resistance level c, N is the total number of sequences, n_x is the number of sequences with feature x, and t_c is the number of sequences with resistance level c.
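A direct implementation of this statistic for one feature (a sketch; in practice the observed counts come from tallying sequences per resistance level):

```python
def chi_square(obs):
    """obs[c] = (# sequences with feature x and level c,
                 # sequences with level c).
    Returns the chi-square score of feature x."""
    N = sum(t for _, t in obs.values())        # total sequences
    n_x = sum(o for o, _ in obs.values())      # sequences with x
    score = 0.0
    for o, t in obs.values():
        e = n_x * t / N                        # expected count e(x, c)
        if e > 0:
            score += (o - e) ** 2 / e
    return score
```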
Chi-square feature selection is the best for document classification. (Yang & Pedersen, 1997)
χ² scores are computed for every (n-gram, threshold) candidate, e.g.:
Unigram A: 23.5 at count ≥ 5, …, 12.1 at count ≥ 100
Bigram AA: 23.1 at count ≥ 1, 19.9 at count ≥ 2, …
Trigram AAA: 15.1 at count ≥ 1, …, 10.2 at count ≥ 20
For each n-gram, only its most discriminative threshold is kept, e.g. AAA (≥ 1), A (≥ 10), AA (≥ 20), with scores 30.2, 29.9, and 45.1.
Pipeline: protein sequences (e.g. G S G E R D S R E E I L K A F R L F D D D N S G T…) → n-grams extracted at every reading frame → counts of all n-grams (e.g. 01001…51571) → chi-square feature selection → selected n-grams occurring more frequently than their most discriminative thresholds, giving a binary vector (e.g. FFT…TFTFF) → classifier.
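The thresholding step at the end of the pipeline reduces each sequence to the binary "FFT…TFTFF" vector. A minimal sketch (the selected features and thresholds here are invented):

```python
def encode(counts, selected):
    """counts: n-gram -> count for one sequence.
    selected: (n-gram, threshold) pairs chosen by chi-square.
    Returns the binary feature vector fed to the classifier."""
    return [counts.get(g, 0) >= t for g, t in selected]
```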
A previous study (Rhee et al., 2006) compared the performance of three feature sets: expert-selected mutations, treatment-selected mutations (TSM), and mutations occurring more than twice in the dataset.
TSM are trained from an additional database of patients treated with a given drug class but no drugs targeting the same protein, so they cannot be specific to each drug.
Human experts or TSM were found to perform best.
Classification accuracy: χ²-selected features (3Indep, PosDep) vs. Rhee et al., 2006 (TSM, Expert):

Drug   3Indep  PosDep  TSM   Expert
Nucleoside RT Inhibitors (NRTI)
3TC    0.897   0.934   0.92  0.92
ABC    0.680   0.713   0.74  0.75
AZT    0.733   0.752   0.71  0.72
D4T    0.778   0.781   0.76  0.77
DDI    0.723   0.745   0.75  0.75
TDF    0.705   0.705   0.69  0.67
Avg.   0.753   0.772   0.76  0.76
Non-Nucleoside RT Inhibitors (NNRTI)
DLV    0.823   0.842   0.84  0.82
EFV    0.864   0.855   0.85  0.80
NVP    0.912   0.910   0.91  0.89
Avg.   0.866   0.869   0.87  0.84
Protease Inhibitors (PI)
APV    0.788   0.786   0.78  0.77
ATV    0.660   0.678   0.65  0.65
IDV    0.753   0.732   0.75  0.75
LPV    0.764   0.797   0.74  0.73
NFV    0.761   0.774   0.80  0.80
RTV    0.840   0.837   0.85  0.84
SQV    0.793   0.812   0.80  0.77
Avg.   0.765   0.774   0.77  0.76
Overall
Avg.   0.780   0.791   0.78  0.78
Using the same dataset and classifier (decision tree), our χ²-selected features performed comparably to TSM and expert-selected mutations.
We evaluated several learning algorithms. Glass-box: decision tree, naïve Bayes, random forest. Black-box: SVM.
On average 100-120 χ² features were selected, and the choice of classifier did not make much difference.
Learning Alg.   PI     NRTI   NNRTI  Avg.
Decision Tree   0.774  0.772  0.869  0.791
Naïve Bayes     0.781  0.767  0.858  0.790
Random Forest   0.800  0.785  0.875  0.808
SVM             0.809  0.807  0.880  0.822
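As a sketch of how any of these classifiers consumes the binary χ² features, here is a tiny Bernoulli naïve Bayes written from scratch (an illustration only, not the study's implementation; the feature vectors and labels below are invented):

```python
import math

def train_nb(X, y, alpha=1.0):
    """X: binary feature vectors, y: resistance labels.
    Returns per-class prior and Laplace-smoothed Bernoulli
    estimates for each feature."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        p = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
             for j in range(len(X[0]))]
        model[c] = (len(rows) / len(y), p)
    return model

def predict_nb(model, x):
    """Pick the class with the highest log-posterior."""
    best, best_lp = None, float("-inf")
    for c, (prior, p) in model.items():
        lp = math.log(prior) + sum(
            math.log(pj if xj else 1 - pj) for xj, pj in zip(x, p))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```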
We also used regression algorithms to predict the resistance factor (IC50 ratio).
Comparing the best models from each study for each drug, our model matched or outperformed Rhee et al. on 12 of 16 drugs; the average difference was < 0.01.
Regression r² (best model per drug), our method vs. Rhee et al., 2006:

Drug   Our Regression   Our r²   Rhee r²   Rhee Regression
Protease Inhibitors (PI)
APV    SVM      0.821   0.82   SVM
ATV    SVM      0.775   0.76   LSR
IDV    SVM      0.826   0.83   LSR
LPV    SVM      0.865   0.87   SVM
NFV    Linear   0.854   0.84   LSR
RTV    SVM      0.900   0.89   LSR
SQV    SVM      0.838   0.84   LSR
Nucleoside RT Inhibitors (NRTI)
3TC    SVM      0.935   0.95   SVM
ABC    Linear   0.788   0.79   LARS
AZT    Linear   0.767   0.74   SVM
D4T    SVM      0.747   0.79   SVM
DDI    SVM      0.729   0.75   SVM
TDF    SVM      0.527   0.59   SVM
Non-Nucleoside RT Inhibitors (NNRTI)
DLV    SVM      0.815   0.79   LARS
EFV    SVM      0.864   0.85   LARS
NVP    SVM      0.811   0.79   LARS
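The regression setting can be illustrated with ordinary one-variable least squares together with the r² measure reported in these tables. This is a toy stand-in for the SVM/LSR/LARS regressors; the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination of the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot
```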
53 of 54 expert-selected mutations for PI ranked 108th or higher by χ².
20 of 21 expert-selected mutations for NRTI ranked 120th or higher by χ².
All 15 expert-selected mutations for NNRTI ranked 107th or higher by χ².
Top χ²-ranked mutations (one column per drug):
RTV: V54, V71, M90, A82, -10, I10, I46, V84, -54, -71, -82, -90, R20, F33, -46, I24, I36, S73, T43, T82, I20, L46, -63, -20, F10, I32, L53, -84, D37, V48
3TC: -184, V184, N67, W210, -215, L41, R65, Y215, D69, H228, D44, C181, -67, -41, I118, M75, E43, F215, A190, Y208, K83, R82, I54, G98, L227, E218, -210, A106, I69, A39
EFV: N103, -103, I100, A190, V74, C181, S190, E101, P101, L188, G98, H225, R228, Q190, E190, E179, -190, L227, N219, Y221, E43, S103, L230, M135, Y208, R102, D179, I74, A106, Q102
Limitations: phenotype systems predict only the drug resistance that the detected genotype currently confers.
Resistance to a regimen is not a summation of resistance to individual drugs: mutations can cause resistance to one drug while increasing sensitivity to another.
Minor strains are not detected by genotype testing.
Treatment history and variation in the human host also affect response: adherence [Ying et al., 2007], and possibly haplotype, gender, state of health, and lifestyle habits.
Idea: model the impact of interactions between all these factors using a feature for each combination; χ² reduces these to a manageable number of important features before applying a glass-box model.
Amortized optimization of HAART requires both short-term and long-term response models.
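The feature-per-combination idea can be sketched as pairwise AND-features over binary host/virus factors, which χ² would then prune. The factor names below are purely hypothetical.

```python
from itertools import combinations

def pairwise_features(factors):
    """factors: name -> bool.  One AND-feature per pair of base
    factors, e.g. 'M184V&low_adherence' (hypothetical names);
    chi-square later discards uninformative combinations."""
    return {f"{a}&{b}": factors[a] and factors[b]
            for a, b in combinations(sorted(factors), 2)}
```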
Other application: G-protein coupled receptors (GPCRs), the target of 60% of current drugs.
Task: given a new protein sequence, classify it into the correct category at each level of the hierarchy; subfamily classification is based on function.
Previous classification studies relied on alignment-based features.
Karchin et al. (2002) evaluated classifiers at varying levels of complexity and concluded that SVMs were necessary to attain 85%+ accuracy.
We apply the document classification approach with χ² features and naïve Bayes or a decision tree.
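Classifying at each level of the hierarchy can be sketched as top-down descent, with one classifier per node choosing a child (a schematic only, not the actual GPCR code; the node names are invented):

```python
def classify_hierarchy(seq, classifiers, root="GPCR"):
    """classifiers maps a node name to a function seq -> child
    node; descend until reaching a leaf (a node with no
    classifier) and return the path taken."""
    path, node = [], root
    while node in classifiers:
        node = classifiers[node](seq)
        path.append(node)
    return path
```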
(Figure: classifiers ordered from complex to simple: SVMs, neural nets, clustering; decision trees, naïve Bayes; hidden Markov models (HMMs); k-nearest neighbours.)
Classifier     # of Features                 Type of Features                                                              Accuracy
Naïve Bayes    7400                          Chi-square n-gram features                                                    93.2%
SVM            9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      88.4%
BLAST          -                             Local sequence alignment                                                      83.3%
Decision Tree  2700                          Chi-square n-gram features                                                    78.0%
SAM-T2K HMM    -                             An HMM model built for each protein subfamily                                 69.9%
kernNN         9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      64.0%
Naïve Bayes with chi-square attained a 39.7% reduction in residual error.
Position-independent n-grams outperformed position-specific ones because the diversity of GPCR sequences makes sequence alignment difficult.
Classifier     # of Features                 Type of Features                                                              Accuracy
Naïve Bayes    8100                          Chi-square n-gram features                                                    92.4%
SVM            9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      86.3%
BLAST          -                             Local sequence alignment                                                      74.5%
Decision Tree  2300                          Chi-square n-gram features                                                    70.2%
SAM-T2K HMM    -                             An HMM model built for each protein subfamily                                 70.0%
kernNN         9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      51.0%
Naïve Bayes with chi-square attained a 44.5% reduction in residual error.
N-grams selected by chi-square joined together to form motifs found in the literature.
Conclusions: current phenotype prediction systems require human experts to maintain either rules or resistance-associated mutations.
The text document classification approach led to a fully automatic prediction model with results comparable to the state of the art, yet requiring no human expertise.
The mutations identified by χ² overlap strongly with those chosen by human experts.
A similar approach had found success in previous work on GPCR proteins.
Future aim: an automatic prediction model for short-term and long-term viral load response to HAART, so that amortized treatment optimization is possible.
Betty Cheng ([email protected])
Jaime Carbonell ([email protected])