Understanding the Language of Virus Proteins to Automatically Detect Drug Resistance
Betty Cheng, Jaime Carbonell
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
Outline:
HIV & Drug Resistance
Phenotype Prediction Models
Machine Learning
Language of Proteins
Document Classification of HIV Genotypes
Comparison to State-of-the-Art & Human Experts
Other Area of Application: GPCR
Conclusions
Drug resistance is an obstacle in the treatment and control of many infectious diseases.
33.2 million people were living with HIV/AIDS in 2007; 2.1 million died from AIDS in 2007.
The high mutation rate of HIV leads to a quasi-species of virus strains inside each patient.
(Figure: diversity within the quasi-species; labels "25% diversity" and "4%".)
There are currently ~25 drugs in 4 main drug classes.
Treatments with 3+ drugs (HAART) are used to cover as many virus strains as possible in the quasi-species.
Personalized medicine: trial-and-error is not an option due to cross-resistance.
Goal: optimize treatment so that the virus population takes as long as possible to develop resistance.
Current practice: phenotype is predicted from genotype test results to identify the resistance present now.
Problem: predict resistance (high/low/none) to each drug, given the patient's HIV genotype.
Example: the Rega and ANRS systems. If at least Z of <list of mutations> are present, then predict resistance level Y to drug X.
Example: HIVdb. Sum the penalty scores from each mutation.
Advantage: the reason for a prediction is easy to understand.
Disadvantage: impossible to maintain as more data and drugs become available.
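The two rule styles above can be sketched as follows. This is a hypothetical illustration: the mutation names, threshold Z, and penalty scores are invented, not the actual Rega/ANRS or HIVdb rules.

```python
def rega_style(mutations, rule_muts, z, level):
    """Rega/ANRS style: if at least z of rule_muts are present,
    predict resistance `level`, otherwise 'none'."""
    hits = len(set(mutations) & set(rule_muts))
    return level if hits >= z else "none"

def hivdb_style(mutations, penalties, cutoffs=(15, 60)):
    """HIVdb style: sum per-mutation penalty scores and map the
    total onto none/low/high via (invented) score cutoffs."""
    total = sum(penalties.get(m, 0) for m in mutations)
    if total >= cutoffs[1]:
        return "high"
    return "low" if total >= cutoffs[0] else "none"
```

Both styles make the reason for a prediction transparent, but every new drug or mutation means hand-editing the rule set, which is exactly the maintenance problem noted above.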
Example: VirtualPhenotype™ (from Virco). Find the database sequence most similar to the test sequence at all selected mutation positions.
It does not interpolate between partial matches.
Advantage: no rules to maintain.
Disadvantages: human experts are still needed to identify mutation positions, and a large amount of data is needed to ensure a database match.
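A minimal sketch of this kind of database lookup (assumed behaviour for illustration, not Virco's actual algorithm; the sequences, positions, and labels are invented):

```python
def db_match(test_seq, db, positions):
    """Return phenotypes of database entries identical to test_seq
    at every expert-selected position.  Returns an empty list when
    there is no exact match: the method does not interpolate
    between partial matches."""
    key = tuple(test_seq[p] for p in positions)
    return [pheno for seq, pheno in db
            if tuple(seq[p] for p in positions) == key]
```

The empty-list case makes the data-hunger concrete: without a database sequence matching at all selected positions, the system has no answer.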
Machine-learning systems can "learn" by detecting patterns in training data (induction).
This enables knowledge discovery.
Systems vary in the type of features and the learning algorithm.
Features: presence of a mutation (sufficient for protease inhibitors; used in the majority of studies), mutation-based, or structure-based.
Maintenance is just re-running the learning algorithm on new data.
Training takes minutes to hours; testing a sequence takes seconds to minutes.
Glass-box algorithms allow knowledge discovery; black-box algorithms are more tolerant of extra features.
Existing systems trade off between black-box algorithms and expert-selected mutations.
Examples: a decision tree for EFV (Beerenwinkel, 2002); a neural network with 27 mutations.
Document classification assigns a topic to a document based on its words.
There is a trade-off between using all English words and selecting keywords.
Chi-square feature selection was found to be best at selecting keywords in text [Yang et al., 1997].
(Figure: example word features. Stopwords such as "a", "to", "the" carry little topic information, while keywords such as "ball", "hoop", "basket", "bat", "glove", "tackle", "touchdown" distinguish sports topics.)
View target virus proteins as documents: the alphabet has 20 amino acids, and there are no word/motif boundaries (as in Thai or Japanese text).
Features: position-independent n-grams and position-dependent n-grams (mutations).
Extract n-grams from every reading frame and represent each sequence as a vector of n-gram counts.
Example sequences:
G S V E R D S V E E V L K A F R L F D D G N S G T…
G S G M R M S R E Q L L N A W R L F C K D N S H T…
G S G E R D S R E E I L K A F R L F D D D N S G T…
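The extraction step can be sketched in a few lines. This is a simplified illustration: real features also pair each n-gram with the count thresholds described next.

```python
from collections import Counter

def ngram_counts(seq, n_max=3):
    """Count position-independent n-grams (n = 1..n_max)
    starting at every offset of the sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def position_features(seq):
    """Position-dependent features such as '188G'
    (residue G at position 188, 1-based)."""
    return {f"{i + 1}{aa}" for i, aa in enumerate(seq)}
```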
Each candidate feature pairs an n-gram (or mutation) with a count threshold:
Unigrams (e.g. A): count ≥ 5, ≥ 10, …, ≥ 100
Bigrams (e.g. AA) and trigrams (e.g. AAA): count ≥ 1, ≥ 2, …, ≥ 20
Mutations (e.g. 188G): frequency ≥ 0.05, ≥ 0.1, …, ≥ 1
The chi-square score of a feature x is

  χ²(x) = Σ_c [o(x,c) − e(x,c)]² / e(x,c),   with   e(x,c) = (n_x · t_c) / N

where o(x,c) is the observed number of sequences with feature x and resistance level c, e(x,c) is the expected number of sequences with feature x and resistance level c, N is the total number of sequences, n_x is the number of sequences with feature x, and t_c is the number of sequences with resistance level c.
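A direct implementation of this statistic for one feature (a sketch; in practice the observed counts come from tallying sequences per resistance level):

```python
def chi_square(obs):
    """obs[c] = (# sequences with feature x and level c,
                 # sequences with level c).
    Returns the chi-square score of feature x."""
    N = sum(t for _, t in obs.values())        # total sequences
    n_x = sum(o for o, _ in obs.values())      # sequences with x
    score = 0.0
    for o, t in obs.values():
        e = n_x * t / N                        # expected count e(x, c)
        if e > 0:
            score += (o - e) ** 2 / e
    return score
```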
Chi-square feature selection is the best for document classification. (Yang & Pedersen, 1997)
χ² scores are computed for every (n-gram, threshold) candidate, e.g.:
Unigram A: 23.5 at count ≥ 5, …, 12.1 at count ≥ 100
Bigram AA: 23.1 at count ≥ 1, 19.9 at count ≥ 2, …
Trigram AAA: 15.1 at count ≥ 1, …, 10.2 at count ≥ 20
For each n-gram, only its most discriminative threshold is kept, e.g. AAA (≥ 1), A (≥ 10), AA (≥ 20), with scores 30.2, 29.9, and 45.1.
Pipeline: protein sequences (e.g. G S G E R D S R E E I L K A F R L F D D D N S G T…) → n-grams extracted at every reading frame → counts of all n-grams (e.g. 01001…51571) → chi-square feature selection → selected n-grams occurring more frequently than their most discriminative thresholds, giving a binary vector (e.g. FFT…TFTFF) → classifier.
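The thresholding step at the end of the pipeline reduces each sequence to the binary "FFT…TFTFF" vector. A minimal sketch (the selected features and thresholds here are invented):

```python
def encode(counts, selected):
    """counts: n-gram -> count for one sequence.
    selected: (n-gram, threshold) pairs chosen by chi-square.
    Returns the binary feature vector fed to the classifier."""
    return [counts.get(g, 0) >= t for g, t in selected]
```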
A previous study (Rhee et al., 2006) compared the performance of three feature sets: expert-selected mutations, treatment-selected mutations (TSM), and mutations occurring more than twice in the dataset.
TSM are trained from an additional database of patients treated with a given drug class but no drugs targeting the same protein, so they cannot be specific to each drug.
Human experts or TSM were found to perform best.
Classification accuracy: χ²-selected features (3Indep, PosDep) vs. Rhee et al., 2006 (TSM, Expert):

Drug   3Indep  PosDep  TSM   Expert
Nucleoside RT Inhibitors (NRTI)
3TC    0.897   0.934   0.92  0.92
ABC    0.680   0.713   0.74  0.75
AZT    0.733   0.752   0.71  0.72
D4T    0.778   0.781   0.76  0.77
DDI    0.723   0.745   0.75  0.75
TDF    0.705   0.705   0.69  0.67
Avg.   0.753   0.772   0.76  0.76
Non-Nucleoside RT Inhibitors (NNRTI)
DLV    0.823   0.842   0.84  0.82
EFV    0.864   0.855   0.85  0.80
NVP    0.912   0.910   0.91  0.89
Avg.   0.866   0.869   0.87  0.84
Protease Inhibitors (PI)
APV    0.788   0.786   0.78  0.77
ATV    0.660   0.678   0.65  0.65
IDV    0.753   0.732   0.75  0.75
LPV    0.764   0.797   0.74  0.73
NFV    0.761   0.774   0.80  0.80
RTV    0.840   0.837   0.85  0.84
SQV    0.793   0.812   0.80  0.77
Avg.   0.765   0.774   0.77  0.76
Overall
Avg.   0.780   0.791   0.78  0.78
Using the same dataset and classifier (decision tree), our χ²-selected features performed comparably to TSM and expert-selected mutations.
We evaluated several learning algorithms. Glass-box: decision tree, naïve Bayes, random forest. Black-box: SVM.
On average 100-120 χ² features were selected, and the choice of classifier did not make much difference.
Learning Alg.   PI     NRTI   NNRTI  Avg.
Decision Tree   0.774  0.772  0.869  0.791
Naïve Bayes     0.781  0.767  0.858  0.790
Random Forest   0.800  0.785  0.875  0.808
SVM             0.809  0.807  0.880  0.822
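As a sketch of how any of these classifiers consumes the binary χ² features, here is a tiny Bernoulli naïve Bayes written from scratch (an illustration only, not the study's implementation; the feature vectors and labels below are invented):

```python
import math

def train_nb(X, y, alpha=1.0):
    """X: binary feature vectors, y: resistance labels.
    Returns per-class prior and Laplace-smoothed Bernoulli
    estimates for each feature."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        p = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
             for j in range(len(X[0]))]
        model[c] = (len(rows) / len(y), p)
    return model

def predict_nb(model, x):
    """Pick the class with the highest log-posterior."""
    best, best_lp = None, float("-inf")
    for c, (prior, p) in model.items():
        lp = math.log(prior) + sum(
            math.log(pj if xj else 1 - pj) for xj, pj in zip(x, p))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```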
We also used regression algorithms to predict the resistance factor (IC50 ratio).
Comparing the best models from each study for each drug, our model matched or outperformed Rhee et al. on 12 of 16 drugs; the average difference was < 0.01.
Regression r² (best model per drug), our method vs. Rhee et al., 2006:

Drug   Our Regression   Our r²   Rhee r²   Rhee Regression
Protease Inhibitors (PI)
APV    SVM      0.821   0.82   SVM
ATV    SVM      0.775   0.76   LSR
IDV    SVM      0.826   0.83   LSR
LPV    SVM      0.865   0.87   SVM
NFV    Linear   0.854   0.84   LSR
RTV    SVM      0.900   0.89   LSR
SQV    SVM      0.838   0.84   LSR
Nucleoside RT Inhibitors (NRTI)
3TC    SVM      0.935   0.95   SVM
ABC    Linear   0.788   0.79   LARS
AZT    Linear   0.767   0.74   SVM
D4T    SVM      0.747   0.79   SVM
DDI    SVM      0.729   0.75   SVM
TDF    SVM      0.527   0.59   SVM
Non-Nucleoside RT Inhibitors (NNRTI)
DLV    SVM      0.815   0.79   LARS
EFV    SVM      0.864   0.85   LARS
NVP    SVM      0.811   0.79   LARS
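The regression setting can be illustrated with ordinary one-variable least squares together with the r² measure reported in these tables. This is a toy stand-in for the SVM/LSR/LARS regressors; the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    """Coefficient of determination of the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot
```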
53 of 54 expert-selected mutations for PI ranked 108th or higher by χ².
20 of 21 expert-selected mutations for NRTI ranked 120th or higher by χ².
All 15 expert-selected mutations for NNRTI ranked 107th or higher by χ².
Top χ²-ranked mutations (one column per drug):
RTV: V54, V71, M90, A82, -10, I10, I46, V84, -54, -71, -82, -90, R20, F33, -46, I24, I36, S73, T43, T82, I20, L46, -63, -20, F10, I32, L53, -84, D37, V48
3TC: -184, V184, N67, W210, -215, L41, R65, Y215, D69, H228, D44, C181, -67, -41, I118, M75, E43, F215, A190, Y208, K83, R82, I54, G98, L227, E218, -210, A106, I69, A39
EFV: N103, -103, I100, A190, V74, C181, S190, E101, P101, L188, G98, H225, R228, Q190, E190, E179, -190, L227, N219, Y221, E43, S103, L230, M135, Y208, R102, D179, I74, A106, Q102
Limitations: phenotype systems predict only the drug resistance that the detected genotype currently confers.
Resistance to a regimen is not a summation of resistance to individual drugs: mutations can cause resistance to one drug while increasing sensitivity to another.
Minor strains are not detected by genotype testing.
Treatment history and variation in the human host also affect response: adherence [Ying et al., 2007], and possibly haplotype, gender, state of health, and lifestyle habits.
Idea: model the impact of interactions between all these factors using a feature for each combination; χ² reduces these to a manageable number of important features before applying a glass-box model.
Amortized optimization of HAART requires both short-term and long-term response models.
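The feature-per-combination idea can be sketched as pairwise AND-features over binary host/virus factors, which χ² would then prune. The factor names below are purely hypothetical.

```python
from itertools import combinations

def pairwise_features(factors):
    """factors: name -> bool.  One AND-feature per pair of base
    factors, e.g. 'M184V&low_adherence' (hypothetical names);
    chi-square later discards uninformative combinations."""
    return {f"{a}&{b}": factors[a] and factors[b]
            for a, b in combinations(sorted(factors), 2)}
```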
Other application: G-protein coupled receptors (GPCRs), the target of 60% of current drugs.
Task: given a new protein sequence, classify it into the correct category at each level of the hierarchy; subfamily classification is based on function.
Previous classification studies relied on alignment-based features.
Karchin et al. (2002) evaluated classifiers at varying levels of complexity and concluded that SVMs were necessary to attain 85%+ accuracy.
We apply the document classification approach with χ² features and naïve Bayes or a decision tree.
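Classifying at each level of the hierarchy can be sketched as top-down descent, with one classifier per node choosing a child (a schematic only, not the actual GPCR code; the node names are invented):

```python
def classify_hierarchy(seq, classifiers, root="GPCR"):
    """classifiers maps a node name to a function seq -> child
    node; descend until reaching a leaf (a node with no
    classifier) and return the path taken."""
    path, node = [], root
    while node in classifiers:
        node = classifiers[node](seq)
        path.append(node)
    return path
```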
(Figure: classifiers ordered from complex to simple: SVMs, neural nets, clustering; decision trees, naïve Bayes; hidden Markov models (HMMs); k-nearest neighbours.)
Classifier     # of Features                 Type of Features                                                              Accuracy
Naïve Bayes    7400                          Chi-square n-gram features                                                    93.2%
SVM            9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      88.4%
BLAST          -                             Local sequence alignment                                                      83.3%
Decision Tree  2700                          Chi-square n-gram features                                                    78.0%
SAM-T2K HMM    -                             An HMM model built for each protein subfamily                                 69.9%
kernNN         9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      64.0%
Naïve Bayes with chi-square attained a 39.7% reduction in residual error.
Position-independent n-grams outperformed position-specific ones because the diversity of GPCR sequences makes sequence alignment difficult.
Classifier     # of Features                 Type of Features                                                              Accuracy
Naïve Bayes    8100                          Chi-square n-gram features                                                    92.4%
SVM            9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      86.3%
BLAST          -                             Local sequence alignment                                                      74.5%
Decision Tree  2300                          Chi-square n-gram features                                                    70.2%
SAM-T2K HMM    -                             An HMM model built for each protein subfamily                                 70.0%
kernNN         9 per match state in the HMM  Gradient of the log-likelihood that the sequence is generated by the HMM      51.0%
Naïve Bayes with chi-square attained a 44.5% reduction in residual error.
N-grams selected by chi-square joined together to form motifs found in the literature.
Conclusions: current phenotype prediction systems require human experts to maintain either rules or resistance-associated mutations.
The text document classification approach led to a fully automatic prediction model with results comparable to the state of the art, yet requiring no human expertise.
The mutations identified by χ² overlap strongly with those chosen by human experts.
A similar approach had found success in previous work on GPCR proteins.
Future aim: an automatic prediction model for short-term and long-term viral load response to HAART, so that amortized treatment optimization is possible.
Betty Cheng ([email protected])
Jaime Carbonell ([email protected])