
3DViewGraph: Learning Global Features for 3D Shapes from A Graph of Unordered Views with Attention

Zhizhong Han1,2, Xiyang Wang1, Chi-Man Vong3, Yu-Shen Liu1*, Matthias Zwicker2, C.L. Philip Chen4

1School of Software, Tsinghua University, Beijing, China
Beijing National Research Center for Information Science and Technology (BNRist)
2Department of Computer Science, University of Maryland, College Park, USA
3Department of Computer and Information Science, University of Macau, Macau, China
4Faculty of Science and Technology, University of Macau, Macau, China
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Learning global features by aggregating information over multiple views has been shown to be effective for 3D shape analysis. For view aggregation in deep learning models, pooling has been applied extensively. However, pooling leads to a loss of the content within views and of the spatial relationship among views, which limits the discriminability of learned features. We propose 3DViewGraph to resolve this issue, which learns 3D global features by more effectively aggregating unordered views with attention. Specifically, unordered views taken around a shape are regarded as view nodes on a view graph. 3DViewGraph first learns a novel latent semantic mapping to project low-level view features into meaningful latent semantic embeddings in a lower-dimensional space, which is spanned by latent semantic patterns. Then, the content and spatial information of each pair of view nodes are encoded by a novel spatial pattern correlation, where the correlation is computed among latent semantic patterns. Finally, all spatial pattern correlations are integrated with attention weights learned by a novel attention mechanism. This further increases the discriminability of learned features by highlighting the unordered view nodes with distinctive characteristics and suppressing the ones with appearance ambiguity. We show that 3DViewGraph outperforms state-of-the-art methods on three large-scale benchmarks.

1 Introduction
Global features of 3D shapes can be learned from raw 3D representations, such as meshes, voxels, and point clouds. As an alternative, a number of works in 3D shape analysis employed multiple views [Su and others, 2015; Han et al., 2019b] as the raw 3D representation, exploiting the advantage that multiple views can facilitate the understanding of both manifold and non-manifold 3D shapes via computer vision techniques. Therefore, effectively and efficiently aggregating comprehensive information over multiple views is critical for the discriminability of learned features, especially in deep learning models.

*Corresponding author: Yu-Shen Liu

Pooling was designed as a procedure for information abstraction in deep learning models. In order to describe a 3D shape by considering features from multiple views, view aggregation is usually performed by max or mean pooling, where pooling only employs the max or mean value of each dimension across all view features [Su and others, 2015]. Although pooling is able to eliminate the rotation effect of 3D shapes, both the content information within views and the spatial relationship among views cannot be fully preserved. As a consequence, this limits the discriminability of learned features. In this work, we address the challenge of learning 3D features in a deep learning model by more effectively aggregating the content information within individual views and the spatial relationship among multiple unordered views.

To tackle this issue, we propose a novel deep learning model called 3D View Graph (3DViewGraph), which learns 3D global features from multiple unordered views. By taking multiple views around a 3D shape on a unit sphere, we represent the shape as a view graph formed by the views, where each view denotes a node, and the nodes are fully connected with each other by edges. 3DViewGraph learns highly discriminative global 3D shape features by simultaneously encoding both the content information within the view nodes and the spatial relationship among the view nodes. Our main contributions are summarized as follows:

i) We propose a novel deep learning model called 3DViewGraph for 3D global feature learning by effectively aggregating multiple unordered views. It not only encodes the content information within all views, but also preserves the spatial relationship among the views.

ii) We propose an approach to learn a low-dimensional latent semantic embedding of the views by directly capturing the similarities between each view and a set of latent semantic patterns. As an advantage, 3DViewGraph avoids explicitly mining the latent semantic patterns across the whole training set.

iii) We perform view aggregation by integrating a novel spatial pattern correlation, which encodes the content information and the spatial relationship in each pair of views.

iv) We propose a novel attention mechanism to increase the discriminability of learned features by highlighting the unordered view nodes with distinctive characteristics and suppressing the ones with appearance ambiguities.

2 Related work
Deep learning models have made significant progress in learning 3D shape features from different raw representations, such as meshes [Han and others, 2018], voxels [Wu and others, 2016], point clouds [Qi and others, 2017] and views [Su and others, 2015]. Due to the page limit, we focus on reviewing view-based deep learning models to highlight the novelty of our view aggregation.
View-based methods. View-based methods represent a 3D shape as a set of rendered views [Kanezaki et al., 2018] or panorama views [Sfikas and others, 2017]. Besides direct set-to-set comparison [Bai and others, 2017], pooling is the most widely used way of aggregating multiple views in deep learning models [Su and others, 2015]. In addition to global feature learning, pooling can also be used to learn local features [Huang et al., 2017; Yu et al., 2018] for segmentation or correspondence by aggregating local patches.

Although pooling can aggregate views on the fly in these models, it cannot encode all the content information within views or the spatial relationship among views. Thus, strategies such as concatenation [Savva and others, 2016], view pair weighting [Johns et al., 2016], cluster-specific pooling [Wang and others, 2017], and RNNs [Han and others, 2019] were employed to resolve this issue. However, these methods cannot learn from unordered views or fully capture the spatial information among unordered views.

To resolve the aforementioned issues, 3DViewGraph aggregates unordered views more effectively by simultaneously encoding their content information and spatial relationship.
Graph-based methods. To handle the irregular structure of graphs, various methods have been proposed [Hamilton and others, 2017]. Although we formulate the multiple views of a 3D shape as a view graph, existing methods proposed for graphs cannot be directly used for learning the 3D feature in our scenario, for two reasons. First, these methods mainly focus on locally learning a meaningful representation for each node in a graph from its raw attributes, rather than globally learning the feature of the whole graph. Second, these methods mainly learn how to process the nodes in a graph with a fixed order, while the order of the views involved in 3DViewGraph is always ambiguous because of the rotation of 3D shapes.

Moreover, some methods have employed graphs to retrieve 3D shapes from multiple views [Anan et al., 2015; An-An et al., 2016]. Different from these methods, 3DViewGraph employs a more efficient way of view aggregation in deep learning models, which makes the learned features useful for both classification and retrieval.

3 3DViewGraph
Overview. Fig. 1 shows an overview of 3DViewGraph, where the global feature F^i ∈ R^{1×F} of a 3D shape m^i is learned from its corresponding view graph G^i. Here, m^i is the i-th shape in a training set of M 3D shapes, where i ∈ [1, M]. Based on the F-dimensional feature F^i, 3DViewGraph classifies m^i into one of L shape classes according to the probability P^i = [P^i(l^i = 1|F^i), ..., P^i(l^i = a|F^i), ..., P^i(l^i = L|F^i)], which is provided by a final softmax classifier (Fig. 1(f)), where l^i is the class label of m^i.

We first take a set of unordered views v^i = {v^i_j | j ∈ [1, V]} on a unit sphere centered at m^i, as shown in Fig. 1(a). Here, we use "unordered views" to indicate that the views cannot be organized in a sequential way. The views v^i_j are regarded as view nodes D^i_j (shown schematically by symbols) of an undirected graph G^i, where each D^i_j is fully connected with the other view nodes D^i_{j'} by edges E^i_{j,j'}, such that G^i = ({D^i_j}, {E^i_{j,j'}}).

Next, we extract low-level features f^i_j of each view v^i_j using a fine-tuned VGG19 network [Simonyan and Zisserman, 2014], as shown in Fig. 1(b), where f^i_j ∈ R^{1×4096} is extracted from the last fully connected layer. To obtain lower-dimensional, semantically more meaningful view features, we subsequently learn a latent semantic mapping Φ (Fig. 1(c)) to project a low-level view feature f^i_j into its latent semantic embedding d^i_j.
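To make this input stage concrete, below is a minimal PyTorch sketch of the view-feature extraction; it is our illustration rather than the authors' code, and the variable names, the torchvision VGG19 backbone, and the choice of V = 20 views are assumptions.

```python
# Minimal sketch (assumptions: torchvision's VGG19 as the backbone,
# V pre-rendered 224x224 views per shape; all names are hypothetical).
import torch
import torchvision.models as models

V = 20                                    # number of unordered views per shape
views = torch.rand(V, 3, 224, 224)        # placeholder rendered views v_j^i

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
# Drop the final 1000-way classification layer so the output is the
# 4096-d activation of the last fully connected layer.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

with torch.no_grad():
    f = vgg(views)                        # f: (V, 4096) low-level view features
print(f.shape)                            # torch.Size([20, 4096])
```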

To resolve the effect of rotation, 3DViewGraph encodes the content and spatial information of G^i by exhaustively computing our novel spatial pattern correlation between each pair of view nodes. As illustrated in Fig. 1(d), we compute the pattern correlation c^i_{j,j'} between D^i_j and each other node D^i_{j'}, and we weight it with their spatial similarity s^i_{j,j'}. In addition, for each node D^i_j, we compute its cumulative correlation C^i_j to summarize all spatial pattern correlations as the characteristics of the 3D shape from the j-th view node D^i_j.

Finally, we obtain the global feature F^i of shape m^i by integrating all cumulative correlations C^i_j with our novel attention weights α^i, as shown in Fig. 1(e) and (f). α^i aims to highlight the view nodes with distinctive characteristics while suppressing the ones with appearance ambiguity.

Latent semantic mapping learning. To learn global features from unordered views, 3DViewGraph encodes the content information within all views and the spatial relationship among views in a pairwise way. 3DViewGraph relies on the intuition that correlations between pairs of views can effectively represent discriminative characteristics of a 3D shape, especially considering the relative spatial position of the views. To implement this intuition, each view should be encoded in terms of a small set of common elements across all views in the training set. Unfortunately, the low-level features f^i_j are too high-dimensional and not suitable as a representation of the views in terms of a set of common elements.

To resolve this issue, 3DViewGraph introduces a latent semantic mapping Φ by learning a kernel function K to directly capture the similarities between the V views v^i_j and N latent semantic patterns {φ_n}. Our approach avoids additionally and explicitly mining {φ_n} across the whole training set as the common elements.

Figure 1: The demonstration of the 3DViewGraph framework: (a) unordered views, (b) low-level features, (c) latent semantic embeddings, (d) spatial pattern correlation, (e) attentioned correlation aggregation, (f) global features.

Figure 2: The demonstration of the latent semantic mapping Φ, implemented as a row-wise convolution with filters {ω_n} and biases {ε_n} followed by a row-wise softmax normalization.

Φ projects low-level view features f^i_j into the latent semantic space spanned by {φ_n} as latent semantic embeddings d^i_j. d^i_j represents view node D^i_j with more semantic meaning but lower dimension than f^i_j. Specifically, predicted by the kernel K, the n-th dimension of d^i_j characterizes the similarity between f^i_j and the n-th semantic pattern φ_n, such that d^i_j = [K(f^i_j, φ_1), ..., K(f^i_j, φ_n), ..., K(f^i_j, φ_N)] ∈ R^{1×N}. We define the kernel K as

$$K(f^i_j, \phi_n) = \frac{\exp(-\beta\|f^i_j - \phi_n\|_2^2)}{\sum_{n'=1}^{N}\exp(-\beta\|f^i_j - \phi_{n'}\|_2^2)}, \quad (1)$$

where the similarity K(f^i_j, φ_n) decays exponentially with the squared distance between f^i_j and φ_n through exp(·), and is normalized across the similarities between f^i_j and all φ_n. The parameter β controls the decay of the response with the distance. This equation can be further simplified by cancelling the norm of f^i_j from the numerator and the denominator as follows,

$$K(f^i_j, \phi_n) = \frac{\exp(-\beta\|f^i_j\|_2^2 + 2\beta f^i_j \phi_n^T - \beta\|\phi_n\|_2^2)}{\sum_{n'=1}^{N}\exp(-\beta\|f^i_j\|_2^2 + 2\beta f^i_j \phi_{n'}^T - \beta\|\phi_{n'}\|_2^2)} = \frac{\exp(2\beta f^i_j \phi_n^T - \beta\|\phi_n\|_2^2)}{\sum_{n'=1}^{N}\exp(2\beta f^i_j \phi_{n'}^T - \beta\|\phi_{n'}\|_2^2)} = \frac{\exp(f^i_j \omega_n + \varepsilon_n)}{\sum_{n'=1}^{N}\exp(f^i_j \omega_{n'} + \varepsilon_{n'})}, \quad (2)$$

where in the last step we substituted 2βφ_n^T and −β‖φ_n‖²₂ by ω_n and ε_n, respectively. Here, {ω_n}, {ε_n} and {φ_n} are sets of learnable parameters, and both {ω_n} and {ε_n} depend on {φ_n}. However, to obtain more flexible training by following the viewpoint in [Arandjelovic and others, 2016], we employ two independent sets {ω_n} and {ε_n}, decoupling {ω_n} and {ε_n} from {φ_n}. This decoupling enables 3DViewGraph to directly predict the similarity between f^i_j and φ_n by the kernel K without explicitly mining φ_n across all low-level view features in the training set.

Based on the last line of Eq. 2, we implement the latent semantic mapping Φ as a row-wise convolution, with each pair of ω_n and ε_n corresponding to one filter, followed by a row-wise softmax normalization, as shown in Fig. 2.
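As an illustration of Eq. 2 and of this row-wise convolution plus softmax implementation, the following is a minimal NumPy sketch; it is our illustration rather than the authors' code, and all names are hypothetical.

```python
import numpy as np

def latent_semantic_mapping(f, omega, eps):
    """Eq. 2. f: (V, 4096) low-level view features;
    omega: (4096, N) filters standing in for 2*beta*phi_n^T;
    eps: (N,) biases standing in for -beta*||phi_n||^2.
    Returns d: (V, N) latent semantic embeddings, one row per view."""
    scores = f @ omega + eps                       # affine scores f*omega_n + eps_n
    scores -= scores.max(axis=1, keepdims=True)    # numerical stabilization
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax

# toy usage
rng = np.random.default_rng(0)
d = latent_semantic_mapping(rng.normal(size=(20, 4096)),
                            0.01 * rng.normal(size=(4096, 128)),
                            np.zeros(128))
assert np.allclose(d.sum(axis=1), 1.0)             # each embedding sums to 1
```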

Spatial pattern correlation. The pattern correlation c^i_{j,j'} aims to encode the content of view nodes D^i_j and D^i_{j'}. c^i_{j,j'} makes the semantic patterns that co-occur in both views more prominent and the non-co-occurring ones more subtle. More precisely, we use the latent semantic embeddings d^i_j and d^i_{j'} to compute c^i_{j,j'} as follows,

$$c^i_{j,j'} = (d^i_j)^T \times d^i_{j'}, \quad (3)$$

where c^i_{j,j'} is an N × N matrix whose entry c^i_{j,j'}(n, n') measures the correlation between the semantic pattern φ_n contributing to d^i_j and φ_{n'} contributing to d^i_{j'}.

Figure 3: The illustration of the spatial similarity s^i_{j,j'}: (a) edges on the unit sphere, (b) the length of an edge, (c) the spatial similarities.

We further enhance the pattern correlation c^i_{j,j'} between the view nodes D^i_j and D^i_{j'} by their spatial similarity s^i_{j,j'}, which forms the spatial pattern correlation s^i_{j,j'} c^i_{j,j'}.

Fig. 3 visualizes how we compute the spatial similarity s^i_{j,j'}. In Fig. 3(a), we show all edges E^i_{j,j'} connecting D^i_j to all other view nodes D^i_{j'} in different colors, where D^i_j is shown schematically by a symbol. The length of E^i_{j,j'} is measured by the length of the shortest arc connecting the two view nodes D^i_j and D^i_{j'} on the unit sphere. Thus, E^i_{j,j'} = 2π × 1 × (θ/2π) = θ, as illustrated in Fig. 3(b), where θ is the central angle of the arc and the factor 1 corresponds to the radius of the unit sphere. To reduce the high variance of {E^i_{j,j'}}, we employ E^i_{j,j'} = 0.5(1 − cos θ) instead of E^i_{j,j'} = θ, which normalizes E^i_{j,j'} into the range [0, 1]. Finally, s^i_{j,j'} decays with E^i_{j,j'} as follows,

$$s^i_{j,j'} = \exp(-\sigma E^i_{j,j'}), \quad (4)$$

where σ is a parameter to control the decay of the response with the edge length. In Fig. 3(c), we visualize s^i_{j,j'} by mapping its value to the width of the edge E^i_{j,j'}.

To represent the characteristics of 3D shape m^i from the j-th view node D^i_j on G^i, we finally introduce the cumulative correlation C^i_j, which encodes all spatial pattern correlations starting from D^i_j as follows,

$$C^i_j = \sum_{j'=1}^{V} s^i_{j,j'}\, c^i_{j,j'}. \quad (5)$$
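Putting Eqs. 3-5 together, the following minimal NumPy sketch (ours, with hypothetical names) derives the normalized edge lengths from the viewpoint directions on the unit sphere, converts them to spatial similarities, and accumulates the pairwise outer-product correlations into one C^i_j per view node.

```python
import numpy as np

def cumulative_correlations(d, view_dirs, sigma=10.0):
    """d: (V, N) latent semantic embeddings;
    view_dirs: (V, 3) unit vectors of the viewpoints on the sphere.
    Returns C: (V, N, N), one cumulative correlation C_j per view node."""
    cos_theta = np.clip(view_dirs @ view_dirs.T, -1.0, 1.0)  # cos of central angles
    E = 0.5 * (1.0 - cos_theta)             # normalized arc length in [0, 1]
    s = np.exp(-sigma * E)                  # spatial similarity, Eq. 4
    c = np.einsum('jn,km->jknm', d, d)      # c[j, j'] = d_j^T d_j', Eq. 3
    return np.einsum('jk,jknm->jnm', s, c)  # C_j = sum_j' s_jj' c_jj', Eq. 5

# toy usage
rng = np.random.default_rng(0)
dirs = rng.normal(size=(20, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # put viewpoints on the sphere
C = cumulative_correlations(rng.random((20, 128)), dirs)
print(C.shape)                              # (20, 128, 128)
```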

Attentioned correlation aggregation. Intuitively, more views will provide more information to any deep learning model, which should allow it to produce more discriminative 3D features. However, additional views may also introduce appearance ambiguities that negatively affect the discriminability of learned features, as shown in Fig. 4.

Figure 4: Ambiguous and distinctive views of (a) a bathtub, (b) a sink, and (c) a bed, with their learned attention weights α^i_j: distinctive views receive large weights, while ambiguous views receive small ones.

To resolve this issue, 3DViewGraph employs a novel attention mechanism in the aggregation of the 3D shape characteristics from all unordered view nodes of a shape, as illustrated in Fig. 1(e). 3DViewGraph learns attention weights α^i = {α^i_j | j ∈ [1, V]} for all view nodes D^i_j on G^i, where α^i_j is large (the second row in Fig. 4) if the view v^i_j has distinctive characteristics, and small (the first row in Fig. 4) if v^i_j exhibits appearance ambiguity with views from other shapes. Note that Σ_{j=1}^{V} α^i_j = 1.

Our novel attention mechanism evaluates how distinctive each view is relative to the views that 3DViewGraph has already processed. To comprehensively represent the characteristics of the processed views, the attention mechanism employs the fully connected weights W_F of the final softmax classifier, which accumulate the information of all views, as shown in Fig. 1(f). The attention mechanism projects the characteristics C^i_j of 3D shape m^i from the j-th view node D^i_j and the characteristics W_F of the processed views into a common space to calculate the distinctiveness of view v^i_j, as defined below,

$$\alpha^i_j = \omega(W_C C^i_j \omega_C + W_F \omega_F + b), \qquad \alpha^i = \mathrm{softmax}(\alpha^i), \quad (6)$$

where W_C, ω_C, ω_F, b and ω are learnable parameters in the attention mechanism, and W_F ∈ R^{L×F}, where F is the dimension of the learned global feature F^i and L is the number of shape classes. With W_C ∈ R^{L×N} and ω_C ∈ R^{N×1}, C^i_j is projected into an L × 1 dimensional space, where b ∈ R^{L×1} is a bias in that space. In addition, W_F is projected into the same space by ω_F ∈ R^{F×1} to compute the similarities between C^i_j and W_F along all L dimensions. Subsequently, the attention weight α^i_j is calculated by summarizing the similarities along all L dimensions with a linear mapping ω ∈ R^{1×L}. Finally, the α^i_j in α^i for all views of the i-th shape are normalized by a softmax.

Based on α^i, the characteristics C^i_j of 3D shape m^i from all view nodes are aggregated with the weights α^i into the attentioned correlation aggregation C^i, as defined below,

$$C^i = \sum_{j=1}^{V} \alpha^i_j\, C^i_j, \quad (7)$$

where C^i represents 3D shape m^i as an N × N matrix, as shown in Fig. 1(e). Finally, the global feature F^i of 3D shape m^i is learned by a fully connected layer, followed by a sigmoid function, with the attentioned correlation aggregation C^i as input, as shown in Fig. 1(f).
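A minimal NumPy sketch of Eqs. 6-7 (our illustration; the names and the explicit per-view loop are assumptions) may help clarify the dimensions involved:

```python
import numpy as np

def attention_aggregate(C, W_C, w_c, W_F, w_f, b, w):
    """Eqs. 6-7. C: (V, N, N) cumulative correlations;
    W_C: (L, N), w_c: (N, 1) project each C_j into an L x 1 space;
    W_F: (L, F), w_f: (F, 1) project the classifier weights there too;
    b: (L, 1) bias; w: (1, L) final linear map to a scalar score."""
    V = C.shape[0]
    scores = np.empty(V)
    for j in range(V):
        proj = W_C @ C[j] @ w_c + W_F @ w_f + b   # (L, 1), inside Eq. 6
        scores[j] = (w @ proj).item()             # scalar distinctiveness of view j
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                           # softmax over the V views
    C_i = np.einsum('j,jnm->nm', alpha, C)        # Eq. 7: weighted sum of the C_j
    return C_i, alpha
```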

Using F^i, the final softmax classifier computes the probabilities P^i to classify the 3D shape m^i into one of L shape classes as

$$P^i = \mathrm{softmax}(W_F F^i + b_F), \quad (8)$$

where W_F ∈ R^{L×F} and b_F ∈ R^{L×1} are learnable parameters for the computation of P^i. W_F is also used to represent the characteristics of all views that 3DViewGraph has processed when calculating α^i in Eq. 6.
Learning and inference. The parameters involved in 3DViewGraph are optimized by minimizing the negative log-likelihood O over the M 3D shapes in the training set, where Q^i is the ground-truth label distribution,

$$O = -\frac{1}{M}\sum_{i=1}^{M}\sum_{a=1}^{L} Q^i(l^i = a)\,\log P^i(l^i = a). \quad (9)$$
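The following minimal NumPy sketch (ours; names hypothetical) traces Eqs. 8-9 for a single shape, assuming the fully connected layer flattens C^i before the sigmoid:

```python
import numpy as np

def forward_and_loss(C_i, W_fc, W_F, b_F, true_class):
    """C_i: (N, N) aggregated matrix; W_fc: (N*N, F) fully connected layer;
    W_F: (L, F), b_F: (L, 1) softmax classifier; true_class: index in [0, L)."""
    F_i = 1.0 / (1.0 + np.exp(-(C_i.reshape(1, -1) @ W_fc)))  # (1, F), sigmoid FC
    logits = W_F @ F_i.T + b_F                # (L, 1), Eq. 8 before the softmax
    e = np.exp(logits - logits.max())
    P_i = e / e.sum()                         # class probabilities P^i
    return -np.log(P_i[true_class, 0])        # Eq. 9 for one training shape
```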

The parameter optimization is conducted by back-propagating the classification errors of the 3D shapes. Notably, W_F is updated by two terms with the learning rate ε as follows,

$$W_F \leftarrow W_F - \varepsilon\Big(\frac{\partial O}{\partial W_F} + \sum_{j=1}^{V}\frac{\partial \alpha^i_j}{\partial W_F}\Big). \quad (10)$$

The advantage of Eq. (10) is that W_F can be learned more flexibly, which aids optimization convergence. W_F also enables α^i to simultaneously observe the characteristics of shape m^i from each view node D^i_j and to take all views that have been processed from different shapes as a reference.


4 Results and analysis
We evaluate 3DViewGraph by comparing it with the state-of-the-art methods in shape classification and retrieval under ModelNet40 [Wu and others, 2015], ModelNet10 and ShapeNetCore55 [Savva and others, 2017]. We also present ablation studies to justify the effectiveness of the novel elements.

Table 1: F comparison, ε = 0.009, σ = 10, N = 128.

F        64     128    256    512    1024
Acc (%)  93.44  93.03  93.80  93.07  93.19

Parameters. We first explore how the important parameters F, N and σ affect the performance of 3DViewGraph under ModelNet40. The comparisons in Tables 1, 2 and 3 show that their effect is slight within a proper range.

Table 2: N comparison, ε = 0.009, σ = 10, F = 256.

N        32     64     128    256    512
Acc (%)  90.84  92.91  93.80  93.44  93.40

Classification. As shown by the comparison under ModelNet in Table 4, 3DViewGraph outperforms all the other methods under the same condition¹. In addition, we show the single-view classification accuracy of the fine-tuned VGG ("VGG(ModelNet)"). To highlight the contributions of VGG fine-tuning, spatial similarity, and attention, we remove fine-tuning ("Ours(No finetune)") or set all spatial similarities ("Ours(No spatiality)") or attention weights ("Ours(No attention)") to 1. The degenerated results indicate that these elements are important for 3DViewGraph to achieve high accuracy. A similar phenomenon is observed when we justify the effect of C^i_j and W_F in Eq. 6 by setting them to 1 ("Ours(No attention-C^i_j)" and "Ours(No attention-W_F)"), respectively. We also justify the latent semantic embedding and the spatial pattern correlation by replacing them with single view features ("Ours(No latent)") and summation ("Ours(No correlation)"); the degenerated results also show that they are important elements. Finally, we compare our proposed view aggregation with mean ("Ours(MeanPool)") and max pooling ("Ours(MaxPool)") by directly pooling all single view features together. Due to the loss of content information in each view and of spatial information among multiple views, pooling performs worse.

Table 3: σ comparison, ε = 0.009, N = 128, F = 256.

σ        0      1      5      10     11
Acc (%)  92.91  93.48  93.72  93.80  93.48

3DViewGraph also achieves the best results under the more challenging ShapeNetCore55 benchmark, based on the fine-tuned VGG ("VGG(ShapeNetCore55)"), as shown in Table 5. We also find that different parameters, such as N and σ, do not significantly affect the performance.

¹We use the same modality of views from the same camera system for the comparison, where the results of RotationNet are from Fig. 4 (d) and (e) in https://arxiv.org/pdf/1603.06208.pdf. Moreover, the benchmarks are used with the standard training and test split.

Table 4: Classification comparison under ModelNet with ε = 0.009, σ = 10, F = 256, N = 128, unless noted otherwise.

Methods                            MN40 (%)  MN10 (%)
3DGAN [Wu and others, 2016]        83.3      91.0
PointNet++ [Qi and others, 2017]   91.9      -
FoldingNet [Yang et al., 2018]     88.4      94.4
PANO [Sfikas and others, 2017]     90.7      91.1
Pairwise [Johns et al., 2016]      90.7      92.8
GIFT [Bai and others, 2017]        89.5      91.5
Domi [Wang and others, 2017]       92.2      -
MVCNN [Su and others, 2015]        90.1      -
Spherical [Cao et al., 2017]       93.31     -
Rotation [Kanezaki et al., 2018]   92.37     94.39
SO-Net [Li and others, 2018]       90.9      94.1
SVSL [Han and others, 2019]        93.31     94.82
VIPGAN [Han et al., 2019a]         91.98     94.05
VGG(ModelNet40)                    87.27     -
VGG(ModelNet10)                    -         88.63
Ours                               93.80     94.82
Ours(σ = 5)                        93.72     95.04
Ours(No finetune)                  90.40     -
Ours(No spatiality)                92.91     94.16
Ours(No attention)                 93.07     93.72
Ours(No attention-C^i_j)           91.82     93.39
Ours(No attention-W_F)             91.57     93.28
Ours(No latent)                    92.34     92.95
Ours(No correlation)               89.30     93.83
Ours(MeanPool)                     92.38     93.06
Ours(MaxPool)                      91.89     92.84

Attention visualization. We visualize the attention learned by 3DViewGraph under ModelNet40, which demonstrates how 3DViewGraph understands 3D shapes by analyzing views on a view graph. In Fig. 5, the attention weights α^i on the view nodes D^i_j of G^i are visualized as a vector represented by scattered black nodes, where the corresponding views are also shown nearby, such as the views of a toilet in Fig. 5(a), a table in Fig. 5(b) and a cone in Fig. 5(c). The coordinates of the black nodes along the y-axis indicate how much attention 3DViewGraph pays to the corresponding view nodes. In addition, the views that receive the most and the least attention are highlighted by a red upward and a blue downward arrow, respectively.

Figure 6: The precision and recall comparison with graph-based multi-view learning methods under PSB.

Fig. 5 demonstrates that 3DViewGraph is able to understand each view, since the view with the most ambiguous appearance in a view graph is suppressed while the view with the most distinctive appearance is highlighted. For example, the most ambiguous views of the toilet, table and cone merely show some basic shapes


Table 6: Retrieval comparison under ShapeNetCore55, ε = 0.009, σ = 10, F = 256, N = 128.

         micro                                       macro
Methods  P@N     R@N     F1@N    mAP@N   NDCG@N     P@N     R@N     F1@N    mAP@N   NDCG@N
All      0.818   0.803   0.798   0.772   0.865      0.618   0.667   0.590   0.583   0.657
Taco     0.701   0.711   0.699   0.676   0.756      -       -       -       -       -
Ours     0.6090  0.8034  0.6164  0.8492  0.9054     0.1929  0.8301  0.2446  0.7019  0.8461

Table 5: Classification comparison under ShapeNetCore55 with ε = 0.009, σ = 10, F = 256, N = 128, unless noted otherwise.

Methods                       Views  Accuracy (%)
VIPGAN [Han et al., 2019a]    12     82.97
SVSL [Han and others, 2019]   12     85.47
VGG(ShapeNetCore55)           1      81.33
Ours                          20     86.87
Ours(N = 256)                 20     86.36
Ours(σ = 5)                   20     86.56
Ours(σ = 5, N = 256)          20     86.71

Figure 5: The visualization of attention weights (black nodes) learned for the views of (a) a toilet, (b) a table, and (c) a cone. The highest and lowest attention weights are indicated by a red upward arrow and a blue downward arrow, respectively.

that provide little useful information for classification, such as the rectangles of the toilet and table, and the circle of the cone. In contrast, the most distinctive views of the toilet, table and cone exhibit more unique and distinctive characteristics.
Retrieval. We evaluate the retrieval performance of 3DViewGraph under ModelNet in Table 7, where the retrieval range is also shown; we outperform the state-of-the-art methods. We further detail the precision and recall curves of these results in Fig. 7. In addition, 3DViewGraph also achieves the best results under ShapeNetCore55 in Table 6. We compare against 10 state-of-the-art methods on the testing set of the SHREC2017 retrieval contest [Savva and others, 2017] and against Taco [Cohen et al., 2018], where we summarize the 10 methods ("All") by presenting the best result for each

Table 7: Retrieval comparison (mAP) under ModelNet, ε = 0.009, σ = 10, F = 256, N = 128.

Methods                          Range        MN40   MN10
SHD                              Test-Test    33.26  44.05
LFD                              Test-Test    40.91  49.82
3DNets [Wu and others, 2015]     Test-Test    49.23  68.26
GImage [Sinha et al., 2016]      Test-Test    51.30  74.90
DPano [Shi and others, 2015]     Test-Test    76.81  84.18
MVCNN [Su and others, 2015]      Test-Test    79.50  -
PANO [Sfikas and others, 2017]   Test-Test    83.45  87.39
GIFT [Bai and others, 2017]      Random       81.94  91.12
Triplet [He et al., 2018]        Test-Test    88.0   -
Ours                             Test-Test    90.54  92.40
Ours                             Test-Train   93.49  95.17
Ours                             Train-Train  98.75  99.79
Ours                             All-All      96.95  98.52

Figure 7: The precision and recall curve comparison among different methods (SHD, LFD, Fisher vector, 3D ShapeNets, DeepPano, Panorama, MVCNN, GIFT, and Ours under the Test-Test, Test-Train, Train-Train and All-All ranges) under (a) ModelNet40 and (b) ModelNet10.

metric due to the page limit. Finally, we demonstrate that 3DViewGraph is also superior to other graph-based multi-view learning methods [Anan et al., 2015; An-An et al., 2016] under the Princeton Shape Benchmark (PSB) in Fig. 6.

5 Conclusion
In view-based deep learning models for 3D shape analysis, view aggregation via the widely used pooling leads to a loss of information about the content and spatial relationship of views. We propose 3DViewGraph to address this issue for 3D global feature learning by more effectively aggregating unordered views with attention. By organizing the unordered views taken around a 3D shape into a view graph, 3DViewGraph learns global features of the 3D shape by simultaneously encoding both the content information within the view nodes and the spatial relationship among the view nodes. Through a novel latent semantic mapping, low-level view features are projected into a meaningful, lower-dimensional latent semantic embedding using a learned kernel function, which directly captures the similarities between low-level view features and latent semantic patterns. The latent semantic mapping successfully facilitates 3DViewGraph in encoding the content information and the spatial relationship in each pair of view nodes using a novel spatial pattern correlation. Further, our novel attention mechanism effectively increases the discriminability of the learned features by highlighting the unordered view nodes with distinctive characteristics and suppressing the ones with appearance ambiguity. Our results in classification and retrieval under three large-scale benchmarks show that 3DViewGraph learns better global features than the state-of-the-art methods due to its more effective view aggregation.

6 Acknowledgments
This work was supported by the National Key R&D Program of China (2018YFB0505400), NSF (1813583), University of Macau (MYRG2018-00138-FST), and FDCT (273/2017/A). We thank all anonymous reviewers for their constructive comments.

References
[An-An et al., 2016] Liu An-An, Nie Wei-Zhi, and Su Yu-Ting. Multi-modal clique-graph matching for view-based 3D model retrieval. IEEE Transactions on Image Processing, 25(5):2103-2115, 2016.
[Anan et al., 2015] Liu Anan, Wang Zhongyang, Nie Weizhi, and Su Yuting. Graph-based characteristic view set extraction and matching for 3D model retrieval. Information Sciences, 320:429-442, 2015.
[Arandjelovic and others, 2016] Relja Arandjelovic et al. NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5297-5307, 2016.
[Bai and others, 2017] Song Bai et al. GIFT: Towards scalable 3D shape retrieval. IEEE Transactions on Multimedia, 19(6):1257-1271, 2017.
[Cao et al., 2017] Zhangjie Cao, Qixing Huang, and Karthik Ramani. 3D object classification via spherical projections. In International Conference on 3D Vision, 2017.
[Cohen et al., 2018] Taco S. Cohen, Mario Geiger, Jonas Kohler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations, 2018.
[Hamilton and others, 2017] William L. Hamilton et al. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 40(3):52-74, 2017.
[Han and others, 2018] Zhizhong Han et al. Deep spatiality: Unsupervised learning of spatially-enhanced global and local 3D features by deep neural network with coupled softmax. IEEE Transactions on Image Processing, 27(6):3049-3063, 2018.
[Han and others, 2019] Zhizhong Han et al. SeqViews2SeqLabels: Learning 3D global features via aggregating sequential views by RNN with attention. IEEE Transactions on Image Processing, 28(2), 2019.
[Han et al., 2019a] Zhizhong Han, Mingyang Shang, Yu-Shen Liu, and Matthias Zwicker. View Inter-Prediction GAN: Unsupervised representation learning for 3D shapes by learning global shape memories to support local view predictions. In AAAI, 2019.
[Han et al., 2019b] Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, and Matthias Zwicker. Y2Seq2Seq: Cross-modal representation learning for 3D shape and text by joint reconstruction and prediction of view and word sequences. In AAAI, 2019.
[He et al., 2018] Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3D object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Huang et al., 2017] H. Huang, E. Kalegorakis, S. Chaudhuri, D. Ceylan, V. Kim, and E. Yumer. Learning local shape descriptors with view-based convolutional neural networks. ACM Transactions on Graphics, 2017.
[Johns et al., 2016] Edward Johns, Stefan Leutenegger, and Andrew J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3813-3822, 2016.
[Kanezaki et al., 2018] Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Li and others, 2018] Jiaxin Li et al. SO-Net: Self-organizing network for point cloud analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Qi and others, 2017] Charles Qi et al. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5105-5114, 2017.
[Savva and others, 2016] M. Savva et al. SHREC'16 track: Large-scale 3D shape retrieval from ShapeNet Core55. In EG 2016 Workshop on 3D Object Recognition, 2016.
[Savva and others, 2017] Manolis Savva et al. SHREC'17: Large-scale 3D shape retrieval from ShapeNet Core55. In Eurographics Workshop on 3D Object Retrieval, 2017.
[Sfikas and others, 2017] Konstantinos Sfikas et al. Exploiting the PANORAMA representation for convolutional neural network classification and retrieval. In EG Workshop on 3D Object Retrieval, pages 1-7, 2017.
[Shi and others, 2015] B. Shi et al. DeepPano: Deep panoramic representation for 3D shape recognition. IEEE Signal Processing Letters, 22(12):2339-2343, 2015.
[Simonyan and Zisserman, 2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[Sinha et al., 2016] Ayan Sinha, Jing Bai, and Karthik Ramani. Deep learning 3D shape surfaces using geometry images. In European Conference on Computer Vision, pages 223-240, 2016.
[Su and others, 2015] Hang Su et al. Multi-view convolutional neural networks for 3D shape recognition. In International Conference on Computer Vision, pages 945-953, 2015.
[Wang and others, 2017] Chu Wang et al. Dominant set clustering and pooling for multi-view 3D object recognition. In British Machine Vision Conference, 2017.
[Wu and others, 2015] Zhirong Wu et al. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1912-1920, 2015.
[Wu and others, 2016] Jiajun Wu et al. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82-90, 2016.
[Yang et al., 2018] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Yu et al., 2018] Tan Yu, Jingjing Meng, and Junsong Yuan. Multi-view harmonized bilinear network for 3D object recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.