HashNet: Deep Learning to Hash by Continuation


Zhangjie Cao†, Mingsheng Long†, Jianmin Wang†, and Philip S. Yu†‡
†KLiss, MOE; NEL-BDS; TNList; School of Software, Tsinghua University, China
‡University of Illinois at Chicago, IL, USA


Figure 1: (left) The proposed HashNet for deep learning to hash by continuation, which comprises four key components: (1) a standard convolutional neural network (CNN), e.g. AlexNet or ResNet, for learning deep image representations, (2) a fully-connected hash layer (fch) for transforming the deep representation into a K-dimensional representation, (3) a sign activation function (sgn) for binarizing the K-dimensional representation into a K-bit binary hash code, and (4) a novel weighted cross-entropy loss for similarity-preserving learning from sparse data. (right) Plot of smoothed responses of the sign function h = sgn(z): red is the sign function, and blue, green and orange show the functions h = tanh(βz) with bandwidths βb < βg < βo. The key property is limβ→∞ tanh(βz) = sgn(z). Best viewed in color.

Summary

1. An end-to-end deep learning to hash architecture for image retrieval, which unifies four key components (see the sketch after this list):
• a standard convolutional neural network
• a fully-connected hash layer (fch) for transforming the deep representation into a K-dimensional representation
• a sign activation function (sgn) for binarizing the K-dimensional representation into a K-bit binary hash code
• a novel weighted cross-entropy loss for similarity-preserving learning from sparse data
2. A learning-by-continuation method to optimize HashNet
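
The four components above map naturally onto a small network definition. The following is a minimal, hypothetical sketch in PyTorch (not the authors' released code), assuming an AlexNet backbone from torchvision; the names fch, beta and encode are illustrative, not taken from the paper.

```python
# Hypothetical sketch of the HashNet pipeline: CNN backbone -> fch hash layer
# -> tanh(beta * z) as a smooth surrogate of sgn(z). Names are illustrative.
import torch
import torch.nn as nn
from torchvision import models

class HashNet(nn.Module):
    def __init__(self, num_bits: int = 64):
        super().__init__()
        backbone = models.alexnet(weights=None)        # standard CNN; a pretrained model would normally be loaded here
        self.features = backbone.features
        self.avgpool = backbone.avgpool
        self.fc = nn.Sequential(*list(backbone.classifier.children())[:-1])  # keep fc6/fc7 (4096-d output)
        self.fch = nn.Linear(4096, num_bits)           # fully-connected hash layer
        self.beta = 1.0                                # bandwidth of tanh(beta * z), raised during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.features(x)
        z = self.avgpool(z).flatten(1)
        z = self.fc(z)
        z = self.fch(z)                                # K-dimensional continuous representation
        return torch.tanh(self.beta * z)               # approaches sgn(z) as beta grows

    @torch.no_grad()
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sign(self.forward(x))             # K-bit binary hash code in {-1, +1}
```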

Challenges

1. Limitations of previous works:
• Ill-posed gradient problem: We need the sign function as the activation function to convert deep continuous representations into binary hash codes, but the gradient of the sign function is zero for all nonzero inputs, making standard back-propagation infeasible.
→ Attack the ill-posed gradient problem with the continuation method, which addresses a complex optimization problem by smoothing the original function, turning it into an easier-to-optimize problem.
• Data imbalance problem: The number of similar pairs is much smaller than that of dissimilar pairs in real retrieval systems, which makes similarity-preserving learning ineffective.
→ Propose a novel weighted cross-entropy loss for similarity-preserving learning from sparse similarity.

Model Formulation

Weighted Maximum Likelihood (WML) estimation:
\log P(\mathcal{S} \mid \mathbf{H}) = \sum_{s_{ij} \in \mathcal{S}} w_{ij} \log P(s_{ij} \mid \mathbf{h}_i, \mathbf{h}_j)   (1)

Weight Formulation:
w_{ij} = c_{ij} \cdot \begin{cases} |\mathcal{S}| / |\mathcal{S}_1|, & s_{ij} = 1 \\ |\mathcal{S}| / |\mathcal{S}_0|, & s_{ij} = 0 \end{cases}   (2)

Pairwise logistic function:
P(s_{ij} \mid \mathbf{h}_i, \mathbf{h}_j) = \begin{cases} \sigma(\langle \mathbf{h}_i, \mathbf{h}_j \rangle), & s_{ij} = 1 \\ 1 - \sigma(\langle \mathbf{h}_i, \mathbf{h}_j \rangle), & s_{ij} = 0 \end{cases} = \sigma(\langle \mathbf{h}_i, \mathbf{h}_j \rangle)^{s_{ij}} \left(1 - \sigma(\langle \mathbf{h}_i, \mathbf{h}_j \rangle)\right)^{1 - s_{ij}}   (3)

Optimization problem of HashNet:
\min_{\Theta} \sum_{s_{ij} \in \mathcal{S}} w_{ij} \left( \log\left(1 + \exp\left(\alpha \langle \mathbf{h}_i, \mathbf{h}_j \rangle\right)\right) - \alpha s_{ij} \langle \mathbf{h}_i, \mathbf{h}_j \rangle \right)   (4)
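
For concreteness, here is a minimal sketch of the weighted pairwise loss in Eq. (4) combined with the weights of Eq. (2), assuming PyTorch; c_ij is taken as 1 (the discrete-similarity case) and self-pairs are not excluded, so treat it as an illustration rather than the authors' implementation.

```python
# Sketch of Eq. (4) with the class-balancing weights of Eq. (2).
# h: (n, K) continuous codes from the network; s: (n, n) binary similarity matrix.
import torch
import torch.nn.functional as F

def weighted_pairwise_loss(h: torch.Tensor, s: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    ip = alpha * (h @ h.t())                            # alpha * <h_i, h_j> for every pair
    s = s.float()
    n_pairs = s.numel()                                 # |S|
    n_sim = s.sum().clamp(min=1.0)                      # |S1|
    n_dis = (n_pairs - s.sum()).clamp(min=1.0)          # |S0|
    w = torch.where(s == 1, n_pairs / n_sim, n_pairs / n_dis)   # Eq. (2) with c_ij = 1
    # log(1 + exp(x)) - s * x, computed via softplus for numerical stability.
    per_pair = F.softplus(ip) - s * ip
    return (w * per_pair).mean()
```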

Learning by Continuation

Algorithm:
Input: a sequence 1 = β0 < β1 < . . . < βm = ∞
for stage t = 0 to m do
    Train HashNet (4) with tanh(βt z) as activation
    Set the converged HashNet as the initialization of the next stage
end
Output: HashNet with sgn(z) as activation (βm → ∞)
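
A minimal sketch of this schedule is given below, reusing the hypothetical HashNet module and weighted_pairwise_loss from the earlier sketches; pairwise_similarity is a hypothetical helper that returns s_ij = 1 when two images share a label, and the beta sequence, optimizer and epoch count are illustrative choices, not the paper's settings.

```python
# Sketch of learning by continuation: train to convergence at each bandwidth
# beta_t, then warm-start the next stage with a larger beta_t.
import torch

def train_by_continuation(model, loader, pairwise_similarity,
                          betas=(1, 2, 4, 8, 16, 32, 64, 128),   # 1 = beta_0 < ... < beta_m (illustrative)
                          epochs_per_stage=10, lr=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for beta in betas:
        model.beta = float(beta)                  # use tanh(beta_t * z) as the activation
        for _ in range(epochs_per_stage):
            for images, labels in loader:
                h = model(images)
                s = pairwise_similarity(labels)   # hypothetical helper: (n, n) 0/1 similarity matrix
                loss = weighted_pairwise_loss(h, s)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The converged parameters become the initialization of the next stage.
    return model                                  # for large beta, tanh(beta * z) is close to sgn(z)
```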

Experiment Setup

Datasets: ImageNet, NUS-WIDE and MS COCO
Baselines: LSH, SH, ITQ, ITQ-CCA, BRE, KSH, SDH, CNNH, DNNH and DHN
Criteria: MAP, precision-recall curve and precision@top-R curve
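
As a reference for the main criterion, the following is a small sketch of MAP over Hamming ranking, assuming NumPy, codes in {-1, +1} and a 0/1 query-to-database relevance matrix; the function name and the top_k cutoff are illustrative and may differ from the exact protocol used in the paper.

```python
# Sketch of mean average precision (MAP) over Hamming ranking.
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance, top_k=None):
    """query_codes: (q, K) and db_codes: (n, K) in {-1, +1}; relevance: (q, n) 0/1."""
    K = query_codes.shape[1]
    aps = []
    for qc, rel in zip(query_codes, relevance):
        hamming = 0.5 * (K - db_codes @ qc)            # Hamming distance via inner product
        order = np.argsort(hamming, kind="stable")     # rank database by increasing distance
        if top_k is not None:
            order = order[:top_k]
        hits = rel[order]
        if hits.sum() == 0:
            aps.append(0.0)
            continue
        prec_at_i = np.cumsum(hits) / (np.arange(len(hits)) + 1.0)
        aps.append(float((prec_at_i * hits).sum() / hits.sum()))
    return float(np.mean(aps))
```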

Results

Dataset     Method     16 bits   32 bits   48 bits   64 bits
ImageNet    HashNet    0.5059    0.6306    0.6633    0.6835
            DHN        0.3106    0.4717    0.5419    0.5732
            DNNH       0.2903    0.4605    0.5301    0.5645
            CNNH       0.2812    0.4498    0.5245    0.5538
            SDH        0.2985    0.4551    0.5549    0.5852
            KSH        0.1599    0.2976    0.3422    0.3943
            ITQ-CCA    0.2659    0.4362    0.5479    0.5764
            ITQ        0.3255    0.4620    0.5170    0.5520
            BRE        0.5027    0.5290    0.5475    0.5546
            SH         0.2066    0.3280    0.3951    0.4191
            LSH        0.1007    0.2350    0.3121    0.3596
NUS-WIDE    HashNet    0.6623    0.6988    0.7114    0.7163
            DHN        0.6374    0.6637    0.6692    0.6714
            DNNH       0.5976    0.6158    0.6345    0.6388
            CNNH       0.5696    0.5827    0.5926    0.5996
            SDH        0.4756    0.5545    0.5786    0.5812
            KSH        0.3561    0.3327    0.3124    0.3368
            ITQ-CCA    0.4598    0.4052    0.3732    0.3467
            ITQ        0.5086    0.5425    0.5580    0.5611
            BRE        0.0628    0.2525    0.3300    0.3578
            SH         0.4058    0.4209    0.4211    0.4104
            LSH        0.3283    0.4227    0.4333    0.5009
MS COCO     HashNet    0.6873    0.7184    0.7301    0.7362
            DHN        0.6774    0.7013    0.6948    0.6944
            DNNH       0.5932    0.6034    0.6045    0.6099
            CNNH       0.5642    0.5744    0.5711    0.5671
            SDH        0.5545    0.5642    0.5723    0.5799
            KSH        0.5212    0.5343    0.5343    0.5361
            ITQ-CCA    0.5659    0.5624    0.5297    0.5019
            ITQ        0.5818    0.6243    0.6460    0.6574
            BRE        0.5920    0.6224    0.6300    0.6336
            SH         0.4951    0.5071    0.5099    0.5101
            LSH        0.4592    0.4856    0.5440    0.5849

Table 1: MAP of Hamming ranking on the three datasets

[Figure 2 panels (a)-(i): precision within Hamming radius 2 vs. number of bits, precision-recall curves @ 64 bits, and precision vs. number of top returned images @ 64 bits, with curves for HashNet, DHN, DNNH, CNNH, ITQ-CCA, KSH, ITQ and SH.]

Figure 2: The experimental results of precision within Hamming radius 2, precision-recall curve @ 64 bits and precision curve w.r.t. top-N @ 64 bits on the three datasets.

Code Quality

[Figure 3 panels: (a) ImageNet, (b) NUS-WIDE, (c) COCO; frequency histograms of code values for HashNet vs. DHN.]

Figure 3: Histogram of non-binarized codes of HashNet and DHN.

Ablation Study

Method         ImageNet                                    NUS-WIDE
               16 bits   32 bits   48 bits   64 bits       16 bits   32 bits   48 bits   64 bits
HashNet+C      0.5059    0.6306    0.6633    0.6835        0.6646    0.7024    0.7209    0.7259
HashNet        0.5059    0.6306    0.6633    0.6835        0.6623    0.6988    0.7114    0.7163
HashNet-W      0.3350    0.4852    0.5668    0.5992        0.6400    0.6638    0.6788    0.6933
HashNet-sgn    0.4249    0.5450    0.5828    0.6061        0.6603    0.6770    0.6921    0.7020

Table 2: MAP Results of HashNet and Its Variants on ImageNet and NUS-WIDE

Method         MS COCO
               16 bits   32 bits   48 bits   64 bits
HashNet+C      0.6876    0.7261    0.7371    0.7419
HashNet        0.6873    0.7184    0.7301    0.7362
HashNet-W      0.6853    0.7174    0.7297    0.7348
HashNet-sgn    0.6449    0.6891    0.7056    0.7138

Table 3: MAP Results of HashNet and Its Variants on MS COCO

HashNet+C: variant using continuous similarity.
HashNet-W: variant using maximum likelihood without weighting.
HashNet-sgn: variant using tanh() instead of sgn() as the activation function.

Convergence Analysis

[Figure 4 panels: (a) ImageNet, (b) NUS-WIDE, (c) COCO; loss value vs. number of iterations, with curves for HashNet-sign, HashNet+sign, DHN-sign and DHN+sign.]

Figure 4: Losses of HashNet and DHN through the training process.

Visualization

[Figure 5 panels: (a) HashNet, (b) DHN; t-SNE embeddings of the learned hash codes.]

Figure 5: The t-SNE of hash codes learned by HashNet and DHN.

Contact Information
• Web: http://ise.thss.tsinghua.edu.cn/~mlong/
• Email: [email protected]