Weighing Evidence in the Absence of a Gold Standard

Phil Long

Genome Institute of Singapore

(joint work with K.R.K. “Krish” Murthy, Vinsensius Vega, Nir Friedman and Edison Liu.)

Problem: Ortholog mapping

• Pair genes in one organism with their equivalent counterparts in another

• Useful for supporting medical research using animal models

A little molecular biology

• DNA has nucleotides (A, C, T and G) arranged linearly along chromosomes

• Regions of DNA, called genes, encode proteins

• Proteins are biochemical workhorses

• Proteins are made up of amino acids
  • also strung together linearly
  • fold up to form a 3D structure

Mutations and evolution

• Speciation often proceeds roughly as follows:
  • one species is separated into two populations
  • the separate populations’ genomes drift apart through mutation
  • important parts (e.g. genes) drift less

• Orthologs have a common evolutionary ancestor

• Genes are sometimes copied
  • original retains function
  • copy drifts or dies out

• Both fine-grained and coarse-grained mutations

Evidence of orthology

• (protein) sequence similarity

• comparison with third organism

• conservation of synteny

• ...

Conserved synteny

• Neighbor relationships are often preserved

• Consequently, similarity among their neighbors is evidence that a pair of genes are orthologs

Plan

• Identify numerical features corresponding to
  • sequence similarity
  • common similarity to third organism
  • conservation of synteny

• “Learn” mapping from feature values to prediction

Problem – no “gold standard”

• for mouse-human orthology, Jackson database reasonable

• for human-zebrafish? human-pombe?

Another “no gold standard” problem: protein-protein interactions

• Sources of evidence:

• Yeast two-hybrid

• Rosetta Stone

• Phage display

• All yield errors

• ...

Related Theoretical Work [MV95] – Problem

• Goal:
  • given m training examples generated as below
  • output accurate classifier h

• Training example generation:
  • All variables {0,1}-valued
  • Y chosen randomly, fixed
  • X1,...,Xn chosen independently with Pr(Xi = Y) = pi, where pi is
    • unknown,
    • same when Y is 0 or 1 (crucial for analysis)
  • only X1,...,Xn given to training algorithm

Related Theoretical Work [MV95] – Results

• If n ≥ 3, can approach Bayes error (best possible for source) as m gets large

• Idea:

• a variable is “good” if it often agrees with the others

• can e.g. solve for Pr(X1 = Y) as function of Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3)

• can estimate Pr(X1 = X2), Pr(X1 = X3), and Pr(X2 = X3) from the training data

• can plug in to get estimates of Pr(X1 = Y),...,Pr(Xn = Y)

• can use resulting estimates of Pr(X1 = Y),...,Pr(Xn = Y) to approximate optimal classifier for source
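The "solve for" step can be made concrete. Under the model above, Pr(Xi = Xj) = pi pj + (1 - pi)(1 - pj), so writing ai = 2 pi - 1 and cij = 2 Pr(Xi = Xj) - 1 gives cij = ai aj, and hence a1 = sqrt(c12 c13 / c23). The sketch below is one way to carry this out, not necessarily the procedure of [MV95]; it assumes every pi > 1/2, and the function name and the simulated accuracies are illustrative.

```python
import numpy as np

def estimate_accuracies(X):
    """Estimate Pr(X_i = Y) for three binary columns of X without seeing Y.

    Assumes each column agrees with Y more often than not (p_i > 1/2).
    """
    def c(i, j):
        # c_ij = 2 * Pr(X_i = X_j) - 1, which equals a_i * a_j under the model
        return 2.0 * np.mean(X[:, i] == X[:, j]) - 1.0

    c01, c02, c12 = c(0, 1), c(0, 2), c(1, 2)
    a = np.sqrt(np.array([c01 * c02 / c12,
                          c01 * c12 / c02,
                          c02 * c12 / c01]))
    return (1.0 + a) / 2.0  # p_i = (1 + a_i) / 2

# Simulated check: Y is generated but never shown to the estimator.
rng = np.random.default_rng(0)
m = 200_000
p_true = np.array([0.9, 0.8, 0.7])           # illustrative accuracies
y = rng.integers(0, 2, size=m)
correct = rng.random((m, 3)) < p_true         # where each X_i copies Y
X = np.where(correct, y[:, None], 1 - y[:, None])
p_hat = estimate_accuracies(X)
```

With enough examples, `p_hat` should land close to `p_true`; with few examples the products under the square root can become noisy or even negative.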

In our problem(s)...

• Pr(Y = 1) small

• X1,...,Xn continuous-valued

• Reasonable to assume X1,...,Xn conditionally independent given Y

• Reasonable to assume Pr(Y = 1 | Xi = x) increasing in x, for all i

• Sufficient to sort training examples in order of associated conditional probabilities that Y = 1

Key Idea

• Suppose Pr(Y = 1) known

• For variable i,
  • Set threshold on Xi so that Pr(Ui = 1) = Pr(Y = 1)
  • Then Pr(Y = 1 and Ui = 0) = Pr(Y = 0 and Ui = 1)
  • Can solve for these error probabilities for all i in terms of the probabilities that the Ui’s agree,...

[Figure: a row of examples marked - or +, split by the threshold into a Ui = 0 group and a Ui = 1 group]

- - - - - - - - - - - - - + - - - - + + - + - - + + +
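The equality of the two error probabilities follows from matching the marginals: Pr(Ui = 1) and Pr(Y = 1) both contain Pr(Y = 1 and Ui = 1), so what remains on each side must be equal. A small simulation illustrates this; the Gaussian feature and the value 0.1 for Pr(Y = 1) are illustrative assumptions, and the labels are used only to check the claim.

```python
import numpy as np

rng = np.random.default_rng(1)
m, prior = 100_000, 0.1                      # prior = assumed Pr(Y = 1)
y = (rng.random(m) < prior).astype(int)
# Illustrative feature: N(-1, 1) when Y = 0, N(+1, 1) when Y = 1.
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# Threshold chosen from the feature alone so that Pr(U = 1) = Pr(Y = 1).
theta = np.quantile(x, 1.0 - prior)
u = (x >= theta).astype(int)

# The labels are used only here, to verify the two error types match.
false_neg = np.mean((y == 1) & (u == 0))     # Pr(Y = 1 and U = 0)
false_pos = np.mean((y == 0) & (u == 1))     # Pr(Y = 0 and U = 1)
```

The two rates differ only by the sampling gap between the empirical fraction of positives and `prior`.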

Final Plan (informal)

• Assume various values of Pr(Y = 1); predict orthologs given each

• For pairs of genes predicted to be orthologs even when Pr(Y = 1) assumed small, confidently predict orthology

• For pairs of genes predicted to be orthologs only when Pr(Y = 1) assumed pretty big, predict orthology more tentatively

Final Plan – Probabilistic Viewpoint

• Consider hidden variable Z:
  • takes values uniformly distributed in [0,1]
  • interpretation: “obviously orthologous”

• Assumptions
  • Pr(Y = 1 | Z = z) increasing in z
  • For all z, Pr(Z ≥ z | Xi = x) increasing in x

• For various z
  • Let Vz = 1 if Z ≥ z, Vz = 0 otherwise
  • Let Uz,i = 1 if Xi ≥ θz,i, Uz,i = 0 otherwise, where θz,i is chosen so that Pr(Uz,i = 1) = Pr(Vz = 1)

• Interpretations:
  • Vz is “In the top 100(1-z)% most likely to have Y = 1 overall”
  • Uz,i is “In the top 100(1-z)% most likely to have Y = 1 given Xi”

Final Plan – Algorithm

• For various z, estimate the conditional probability that Vz = 1, i.e. that Z ≥ z, given each training example, using the estimated probabilities that pairs of Uz,i’s agree

• Add these estimates over z to estimate Z for each example; sort by the estimates.
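The slides leave several details of the algorithm open, so the following is only one possible reading of it. It thresholds each feature at its z-quantile, estimates each Uz,i's error rate from pairwise disagreements using the small-error approximation from the "Practical problem" slide, turns those into an estimate of Pr(Vz = 1) for each example, and adds the estimates over z. The grid of z values, the averaging over all pairs (j, k), the clipping of error estimates, and the naive-Bayes combination of the Uz,i's are all assumptions made for this sketch, and `peer_scores` is a hypothetical name.

```python
import numpy as np
from itertools import combinations

def peer_scores(X, z_grid=(0.5, 0.7, 0.8, 0.9, 0.95), eps=1e-3):
    """Score each row of X (m examples, n >= 3 features); higher = more likely Y = 1."""
    m, n = X.shape
    total = np.zeros(m)
    for z in z_grid:
        pi = 1.0 - z                                    # Pr(V_z = 1)
        # Threshold each feature at its z-quantile, so Pr(U_{z,i} = 1) is about pi.
        U = (X >= np.quantile(X, z, axis=0)).astype(float)
        # D[i, j] estimates Pr(U_{z,i} != U_{z,j}).
        D = (U[:, :, None] != U[:, None, :]).mean(axis=0)
        err = np.empty(n)
        for i in range(n):
            others = [j for j in range(n) if j != i]
            # Assumption: average the three-variable approximation over all (j, k).
            err[i] = np.mean([0.5 * (D[i, j] + D[i, k] - D[j, k])
                              for j, k in combinations(others, 2)])
        # Assumption: keep estimates strictly between 0 and the error of a
        # random guess with the same marginal, 2 * pi * (1 - pi).
        err = np.clip(err, eps, 2 * pi * (1 - pi) - eps)
        fn = err / (2 * pi)            # Pr(U = 0 | V = 1), since both error types equal err / 2
        fp = err / (2 * (1 - pi))      # Pr(U = 1 | V = 0)
        # Assumption: combine the U's as if independent given V_z (naive Bayes).
        logit = (np.log(pi / (1 - pi))
                 + U @ np.log((1 - fn) / fp)
                 + (1 - U) @ np.log(fn / (1 - fp)))
        total += 1.0 / (1.0 + np.exp(-logit))           # estimate of Pr(V_z = 1 | example)
    return total

# Illustrative data: labels are generated but never passed to peer_scores.
rng = np.random.default_rng(0)
m, mu = 2000, np.array([1.0, 0.9, 0.8, 0.7, 0.6])
y = (rng.random(m) < 0.1).astype(int)
X = rng.normal(loc=(2 * y - 1)[:, None] * mu, scale=1.0)
scores = peer_scores(X)
ranking = np.argsort(-scores)          # most confidently predicted examples first
```

Adding the per-z estimates is natural because the expected value of Z given an example is the integral over z of Pr(Z ≥ z) given that example, which the sum over the grid approximates up to scaling.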

Practical problem

• Small errors in the estimates of the Pr(Uz,i = Uz,j)’s can lead to large errors in the estimates of the Pr(Uz,i = Vz)’s (in fact, the program crashes).

• Solution:
  • the case where Pr(Vz = 1) is small is the important one (confident predictions)
  • in that case, can approximate: Pr(Uz,i ≠ Vz) ≈ ½ (Pr(Uz,i ≠ Uz,j) + Pr(Uz,i ≠ Uz,k) - Pr(Uz,j ≠ Uz,k)).
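A short justification for the approximation: two binary variables disagree exactly when one of them, but not both, differs from Vz, so Pr(Uz,i ≠ Uz,j) = Pr(Uz,i ≠ Vz) + Pr(Uz,j ≠ Vz) - 2 Pr(both differ from Vz). When errors are rare the last term is small, and dropping it from all three pairwise disagreements leaves the formula above. The sketch below checks this on simulated data; the Gaussian features, the value 0.05 for Pr(Vz = 1), and the function name are illustrative assumptions, and the simulated label plays the role of Vz.

```python
import numpy as np

def approx_error_rates(U):
    """Approximate Pr(U_i != V) for three binary columns from disagreements alone."""
    def d(i, j):
        return np.mean(U[:, i] != U[:, j])
    d01, d02, d12 = d(0, 1), d(0, 2), d(1, 2)
    return 0.5 * np.array([d01 + d02 - d12,
                           d01 + d12 - d02,
                           d02 + d12 - d01])

rng = np.random.default_rng(2)
m, prior = 200_000, 0.05                       # prior stands in for Pr(V_z = 1)
v = (rng.random(m) < prior).astype(int)
mu = np.array([1.5, 1.25, 1.0])                # illustrative feature strengths
X = rng.normal(loc=(2 * v - 1)[:, None] * mu, scale=1.0)
U = (X >= np.quantile(X, 1.0 - prior, axis=0)).astype(int)

approx = approx_error_rates(U)                 # uses only the U's
true_err = (U != v[:, None]).mean(axis=0)      # uses v, for comparison only
```

Unlike solving the exact equations, this estimate involves no division or square root, which is presumably why it avoids the crashes mentioned above.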

Evaluation: Artificial Source

• Examples generated using a randomly chosen probability distribution:
  • Pr(Y = 1) = 0.1, n = 5
  • For each i,
    • choose μi uniformly from [min,max]
    • set distributions for ith variable:
      • Pr(Xi | Y=0) = N(-μi,1),
      • Pr(Xi | Y=1) = N(μi,1).

• Evaluate using area under the ROC curve

• Repeat 100 times, average

ROC curve

[Figure: ROC curve plotting true positives against false positives, both axes running to 1, with the area under the ROC curve marked]
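The area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so it can be computed from ranks without drawing the curve. A minimal sketch, assuming no tied scores; `auc` is an illustrative helper, not code from the talk.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)    # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Sum of positive ranks, minus the minimum possible sum, counts the
    # (positive, negative) pairs that are ordered correctly.
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 3 of 4 pairs ordered correctly
```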

Results: Artificial Source

m      min μ   max μ   peer AUC   opt (w/ Y’s)
1000   0.2     1.0     .940       .985
1000   0.1     0.5     .811       .881
1000   0.05    0.25    .635       .818
1000   0.02    0.1     .611       .753

Evaluation: mouse-human ortholog mapping

• Use Jackson mouse-human ortholog database as “gold standard”

• Apply algorithm, post-processing to map each gene to unique ortholog

• Compare with analogous BLAST-only algorithm

• Plot ROC curve

• Treat anything not in database as non-ortholog
  • some “false positives” in fact correct
  • error rate overestimated

Results: mouse-human ortholog mapping

Open problems

• Given our assumptions, is there an algorithm for learning using random examples that always approaches the optimal AUC given knowledge of the source?

• Is discretizing the independent variables necessary?

• How does our method compare with other natural algorithms? (E.g. what about algorithms based on clustering?)
