a discrete optimization approach for svd best truncation choice based on roc curves

29
A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves IEEE BIBE 2013 13rd IEEE International Conference on Bioinformatics and Bioengineering, 11st November, Chania, Greece, EU Davide Chicco, Marco Masseroli [email protected]

Upload: davide-chicco

Post on 14-Jul-2015

124 views

Category:

Education


0 download

TRANSCRIPT

A Discrete Optimization Approach

for SVD Best Truncation Choice based

on ROC Curves

IEEE BIBE 201313rd IEEE International Conference on

Bioinformatics and Bioengineering, 11st November, Chania, Greece, EU

Davide Chicco, Marco Masseroli

[email protected]

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 2

Summary

1. The context & the problem

• Biomolecular annotations

• Prediction of biomolecular annotations

• SVD (Singular Value Decomposition)

• SVD Truncation

2. The proposed solution

• ROC Area Under the Curve comparison

• Truncation level choices

3. Evaluation

• Evaluation data set & results

4. Conclusions

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 3

Biomolecular annotations

• The concept of annotation: association of nucleotide or amino

acid sequences with useful information describing their features

• This information is expressed through controlled vocabularies,

sometimes structured as ontologies, where every controlled

term of the vocabulary is associated with a unique

alphanumeric code

• The association of such a code with a gene or protein ID

constitutes an annotation

Gene /

Protein

Biological function feature

Annotation

gene2bff

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves”

Biomolecular annotations (2)

• The association of an information/feature with a gene or

protein ID constitutes an annotation

• Annotation example:

• gene: GD4

• feature: “is present in the mitochondrial membrane”

4

Gene /

Protein

Biological function feature

Annotation

gene2bff

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 5

Prediction of biomolecular annotations

• Many available annotations in different databanks

• However, available annotations are incomplete

• Only a few of them represent highly reliable, human–curated

information

• To support and quicken the time–consuming curation process,

prioritized lists of computationally predicted annotations

are extremely useful

• These lists could be generated softwares based that implement

Machine Learning algorithms

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 7

Annotation prediction through

Singular Value Decomposition – SVD

• Annotation matrix A {0, 1} m x n

− m rows: genes / proteins

− n columns: annotation terms

A(i,j) = 1 if gene / protein i is annotated to term j or to any

descendant of j in the considered ontology structure (true

path rule)

A(i,j) = 0 otherwise (it is unknown)

term01 term02 term03 term04 … termN

gene01 0 0 0 0 … 0

gene02 0 1 1 0 … 1

… … … … … … …

geneM 0 0 0 0 … 0

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 8

Annotation prediction through

Singular Value Decomposition – SVD

• Annotation matrix A {0, 1} m x n

− m rows: genes / proteins

− n columns: annotation terms

A(i,j) = 1 if gene / protein i is annotated to term j or to any

descendant of j in the considered ontology structure (true

path rule)

A(i,j) = 0 otherwise (it is unknown)

term01 term02 term03 term04 … termN

gene01 0 0 0 0 … 0

gene02 0 1 1 0 … 1

… … … … … … …

geneM 0 0 0 0 … 0

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 9

Compute SVD:

Compute reduced rank approximation:

• An annotation prediction is performed by computing a reduced

rank approximation Ak of the annotation matrix A

(where 0 < k < r, with r the number of non zero singular values

of A, i.e. the rank of A)

TA U V

TA U V

TA U V TA U V TA U V

T

k k k kA U V

k

T

k k k kA U V T

k k k kA U V T

k k k kA U V T

k k k kA U V

k

Singular Value Decomposition – SVD

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 10

Singular Value Decomposition – SVD (2)

• Ak contains real valued entries related to the likelihood that

gene i shall be annotated to term j

For a certain real threshold τ:

if Ak(i,j) > τ, gene i is predicted to be annotated to term j

− The threshold τ can be chosen in order to obtain the

best predicted annotations [Khatri et al., 2005]

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 11

Singular Value Decomposition – SVD (3)

• It is possible to rewrite the SVD decomposition in an equivalent

form, such that the predicted annotation profile is given by:

ak,iT = ai

T Vk VkT

where ak,iT is a row vector containing the predictions for gene i

• Note that Vk depends on the whole set of genes

• Indeed, the columns of Vk are a set of eigenvectors of the

global term-to-term correlation matrix T = ATA, estimated from

the whole set of available annotations

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 12

Evaluation of the prediction

To evaluate the prediction, we compare each A(i,j) element to its

corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0

• if A(i,j) = 1 & Ak(i,j) > τ: AC: Annotation Confirmed

(AC <- AC+1)

• if A(i,j) = 1 & Ak(i,j) ≤ τ: AR: Annotation to be Reviewed

(AR <- AR+1)

• if A(i,j) = 0 & Ak(i,j) ≤ τ: NAC: No Annotation Confirmed

(NAC <- NAC+1)

• if A(i,j) = 0 & Ak(i,j) > τ: AP: annotation predicted

(AP <- AP+1)

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 13

SVD truncation

• The main problem of truncated SVD: how to choose the

truncation?

• Where to truncate?

How to choose the k here?

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 14

New concept: Receiver Operating Characteristic

(ROC) curve

Starting from the annotation prediction evaluation factor we just

introduced

AC: Annotation Confirmed

AR: Annotation to be Reviewed

NAC: No Annotation Confirmed

AP: Annotation Predicted

We can design the Receiver Operating Characteristic curves for

every prediction:

On the x, the annotation to be reviewed rate:𝑨𝑹

𝑨𝑪+𝑨𝑹

On the y, the annotation predicted rate:𝑨𝑷

𝑨𝑷+𝑵𝑨𝑪

Input Output

Yes Yes

Yes No

No No

No Yes

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 15

New concept: Receiver Operating Characteristic

(ROC) curve (2)

On the y, the annotation confirmed rate:𝑨𝑪

𝑨𝑪+𝑨𝑹

On the x, the annotation predicted rate:𝑨𝑷

𝑨𝑷+𝑵𝑨𝑪

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 16

SVD truncation choice

Algorithm:

1) Choose some possible truncation levels

2) Compute the Receiver Operating Characteristic for each

SVD prediction of those truncation levels

3) Compute the Area Under the Curve (AUC) of each ROC

4) Choose the truncation level of the ROC that has

maximum AUC

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 17

SVD truncation choice (2)

Algorithm:

1) Choose some possible truncation levels

2) Compute the Receiver Operating Characteristic for each

SVD prediction of those truncation levels

3) Compute the Area Under the Curve (AUC) of each ROC

4) Choose the truncation level of the ROC that has

maximum AUC

Quite easy!

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 18

SVD truncation choice (3)

Algorithm:

1) Choose some possible truncation levels

2) Compute the Receiver Operating Characteristic for each

SVD prediction of those truncation levels

3) Compute the Area Under the Curve (AUC) of each ROC

4) Choose the truncation level of the ROC that has

maximum AUC

Quite challenging!

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 19

Minimum AUC between all the ROCs of various

truncation levels

1) Choose some possible truncation levels

We cannot compute the SVD, its ROC and its AUC for every

truncation values because would be too expensive (for time

and resources).

Algorithm:

1) Since the matrix A(i,j) has m rows (genes) and n columns

(annotation terms), we take p = min(m, n)

2) Since r ≤ p is the number of non-zero singular values

along the diagonal of , the best truncation value is in the

interval [1; r]

3) newInterval = {1, r}

4) k = firstElement(newInterval)

5) step = length(newInterval) / numStep

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 20

Minimum AUC between all the ROCs of various

truncation levels (2)

4. We make a sampling of all the N non-null singular values,

with constant sample intervals of size step (step=10% * N)

5. For every sampled singular value, we compute the SVD

and its corresponding ROC AUC for ACrate in [0%, 100%]

and APrate in [0%, 1%]

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 21

Minimum AUC between all the ROCs of various

truncation levels (3)

Given the first AUC, if the AUCs of all the three subsequent

samples decrease, we take it for the zoom next step

This means we found a local maximum.

Local

Best

Index

zoom

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 22

Minimum AUC between all the ROCs of various

truncation levels (3)

If the AUC differences of the last three singular values are

lower than gamma = 10%, , we take it for the zoom next step

This means that the AUCs do not grow up enough

Chosen

Index

zoom

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 23

Minimum AUC between all the ROCs of various

truncation levels (3)

Once we chose the index where to zoom, we re-run the

algorithm in the sub-interval

Until one of the previously described condition is satisfied

Or the maximum number of zooms (numZoom = 4) is reached

zoom

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 24

Example

Dataset: annotations with Gallus gallus genes and Biological

Process Gene Ontology terms

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 25

Results

• To evaluate the performance of our method, we used

annotations of

terms: Biological process (BP), Cellular component (CC) and

Molecular function (MF) GO features

organisms Bos Taurus, Danio rerio, Gallus gallus genes

• Available on July 2009 in an old version of the Gene Ontology

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 26

Results (2)

We then checked the, against the percentage of annotations

predicted percentage of annotations predicted with our SVD

method and our optimized truncation levelby the SVD method

and fixed truncation level (k=500) used by Draghici et al. in the

paper “A semantic analysis of the annotations of the human

genome” (2005)

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 27

Conclusions

Problem: SVD truncation in

the prediction of genomic

annotations context

Proposed solution: finding the

truncation level corresponding

to the maximum AUC of the

ROC curve, and it’s near to

zero

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 28

Conclusions (2)

•To avoid computing SVD for all the possible truncation levels

(too expensive!), we proposed an algorithm for the search of

local and global maxima, by zooming sub-intervals

•The best SVD truncation levels suggested by this algorithm for

our dataset (annotations of Bos Taurus, Danio Rerio, and Gallus

gallus genes, and GO terms) gave better results than other

truncation levels, in a reasonable time.

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 29

Future developments

• To obtain the best sampling, we could study the gradient

variations in the distribution of the AUC values for different

truncation levels and the histogram of the eigenvalues

• Our approach is not limited to the Gene Ontology and can be

applied to any controlled annotations

“A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC Curves” 30

Thanks for your attention!!!

www.DavideChicco.it

[email protected]

A Discrete Optimization Approach for SVD Best

Truncation Choice based on ROC Curves