3D Position Estimation Using Recursive Neural Networks
Hanna Winter ([email protected])


Problem Statement

Today, video feeds from phone or drone cameras are used for augmented reality and 3D reconstruction. Additionally, recent work using rendered image data for deep learning has shown that viewpoint estimation in single images is possible [4]. Building on these observations, we create a dataset of rendered image sequences that emulate video frames, and use a VGG Net [3] attached to a Recursive Neural Net (RNN) to predict an object's 3D bounding box for a given image sequence. Extending this technique to predict feature points for point cloud creation would be the most useful application.

Data

We created a dataset of synthetic images by rendering ShapeNet models [2] from a variety of different viewpoints, adjusting the rendering pipeline from the Render for CNN paper [4] to meet our specific needs.

Rendering pipeline:
1. Render: render each image sequence of a 3D model in Blender, with viewpoints calculated by semi-randomly interpolating from the initial viewpoint.
2. Crop: crop the images according to [4], except that images within a given sequence are cropped the same way, with minimal randomization.
3. Overlay: give each rendered image sequence a background randomly sampled from the SUN397 dataset.
4. Preprocess: resize and/or pad all images to 224x224 and add a 4th channel of 0s and 1s indicating the 2D bounding box around the object (a sketch of this step follows the list below).

Data set:
• Training examples: 50 image sequences of length 20 for each of the 12 classes from PASCAL 3D+ [5], for a total of 530 image sequences (10% removed for validation and testing).
• Labels: ground truth 3D bounding boxes, (x, y, z, width, height, depth).
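To make step 4 of the pipeline concrete, here is a minimal sketch of the resize/pad-and-mask preprocessing. The poster does not publish its implementation, so the function name, the use of NumPy and Pillow, and the (x0, y0, x1, y1) pixel convention for the 2D box are all our assumptions.

```python
# Hypothetical sketch of pipeline step 4; not the authors' code.
import numpy as np
from PIL import Image

TARGET = 224

def preprocess(image, bbox):
    """Resize/pad an RGB image to 224x224 and append a binary 4th channel
    marking the 2D bounding box. `bbox` = (x0, y0, x1, y1) in pixels
    (an assumed convention)."""
    w, h = image.size
    scale = TARGET / max(w, h)                  # fit the long side, keep aspect ratio
    new_w, new_h = round(w * scale), round(h * scale)
    image = image.resize((new_w, new_h), Image.BILINEAR)

    rgb = np.zeros((TARGET, TARGET, 3), dtype=np.float32)
    mask = np.zeros((TARGET, TARGET, 1), dtype=np.float32)
    # Pad symmetrically so the resized image sits in the center.
    ox, oy = (TARGET - new_w) // 2, (TARGET - new_h) // 2
    rgb[oy:oy + new_h, ox:ox + new_w] = np.asarray(image, dtype=np.float32) / 255.0

    # Scale the 2D box into the padded frame and set its region to 1.
    x0, y0, x1, y1 = (np.array(bbox, dtype=np.float32) * scale).round().astype(int)
    mask[oy + y0:oy + y1, ox + x0:ox + x1] = 1.0
    return np.concatenate([rgb, mask], axis=-1)   # shape (224, 224, 4)
```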


References

[1] ABADI, M., AGARWAL, A., BARHAM, P., BREVDO, E., CHEN, Z., CITRO, C., CORRADO, G. S., DAVIS, A., DEAN, J., DEVIN, M., GHEMAWAT, S., GOODFELLOW, I., HARP, A., IRVING, G., ISARD, M., JIA, Y., JOZEFOWICZ, R., KAISER, L., KUDLUR, M., LEVENBERG, J., MANÉ, D., MONGA, R., MOORE, S., MURRAY, D., OLAH, C., SCHUSTER, M., SHLENS, J., STEINER, B., SUTSKEVER, I., TALWAR, K., TUCKER, P., VANHOUCKE, V., VASUDEVAN, V., VIÉGAS, F., VINYALS, O., WARDEN, P., WATTENBERG, M., WICKE, M., YU, Y., AND ZHENG, X. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

[2] CHANG, A. X., FUNKHOUSER, T., GUIBAS, L., HANRAHAN, P., HUANG, Q., LI, Z., SAVARESE, S., SAVVA, M., SONG, S., SU, H., XIAO, J., YI, L., AND YU, F. 2015. ShapeNet: An information-rich 3D model repository. Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago.

[3] SIMONYAN, K., AND ZISSERMAN, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

[4] SU, H., QI, C. R., LI, Y., AND GUIBAS, L. J. 2015. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In The IEEE International Conference on Computer Vision (ICCV).

[5] XIANG, Y., MOTTAGHI, R., AND SAVARESE, S. 2014. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV).

Results

        MSE     IOU    Accuracy
Train   0.0408  0.362  0.53
Test    0.0483  0.275  0.41


Preliminary test results for our VGG19+RNN model. The ground truth bounding box is shown in black around the 3D model; our model's output is the yellow bounding box. These results are displayed in Blender.

Discussion

The model was able to produce semi-reasonable results with minimal training: 100 training iterations at a batch size of 5. The loss decreases continually over these iterations, showing that with more training and hyperparameter tuning the model should be able to improve. Additionally, training on image sequences longer than 15 frames should produce even better results.

Implementation

We build a deep learning architecture with TensorFlow [1], using an out-of-the-box VGG19 network and replacing all but the first fully connected layer with an RNN. The image sequences are fed through the VGG as one large batch of images and reshaped back into sequences before going through the RNN. The output of the network is the object's 3D bounding box. A sketch of this architecture appears below.
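As an illustration only, here is a minimal sketch of such an architecture in the modern Keras API. The poster used 2016-era TensorFlow and does not specify its RNN cell, so the LSTM, the Dense(4096) stand-in for VGG19's first fully connected layer, and the weights=None workaround for the 4-channel input are all our assumptions; the layer sizes follow the hyperparameters listed below.

```python
# Hedged sketch of a VGG19+RNN bounding-box regressor; not the authors' code.
import tensorflow as tf

SEQ_LEN, SIZE, CHANNELS = 15, 224, 4   # 4th channel = 2D bounding-box mask

def build_model():
    # VGG19 convolutional trunk. weights=None because the stock ImageNet
    # weights expect 3 input channels, not 4 (an assumed workaround).
    vgg = tf.keras.applications.VGG19(include_top=False, weights=None,
                                      pooling="max",
                                      input_shape=(SIZE, SIZE, CHANNELS))
    frames = tf.keras.Input(shape=(SEQ_LEN, SIZE, SIZE, CHANNELS))
    # TimeDistributed folds the sequence into the batch dimension for the CNN
    # and unfolds it afterwards -- the "large batch, then reshape" step.
    feats = tf.keras.layers.TimeDistributed(vgg)(frames)
    # Stand-in for the first (kept) fully connected layer of VGG19.
    feats = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(4096, activation="relu"))(feats)
    # The remaining FC layers are replaced by a 1-layer RNN, hidden size 2048.
    hidden = tf.keras.layers.LSTM(2048)(feats)
    # Regress the 3D bounding box (x, y, z, width, height, depth).
    return tf.keras.Model(frames, tf.keras.layers.Dense(6)(hidden))
```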

Hyperparameters: initial network parameters (no hyperparameter tuning)
• Image sequence length: 15
• Batch size: 5
• Learning rate: 0.1
• Optimizer: momentum update
• RNN hidden size: 2048
• RNN layers: 1

A training call matching these values is sketched below.
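For example, reusing build_model from the sketch above, training under these hyperparameters might look like the following. The MSE loss and the 0.9 momentum coefficient are assumptions, since the poster only names "momentum update".

```python
# Hypothetical training setup; only the hyperparameter values come from the poster.
import numpy as np

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),  # momentum update
    loss="mse")  # squared error against the ground-truth (x, y, z, w, h, d) box

# Dummy stand-in arrays so the call runs; real data comes from the rendering pipeline.
train_sequences = np.zeros((5, 15, 224, 224, 4), dtype=np.float32)
train_boxes = np.zeros((5, 6), dtype=np.float32)
model.fit(train_sequences, train_boxes, batch_size=5, epochs=1)
```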

Future

Extending this method to perform 3D feature point prediction would be the logical next step. Additionally, the model can probably be improved by training on larger datasets with longer image sequences, which would require more computationally powerful machines for training and testing. Finally, aspects of the dataset can be improved: the current cropping scheme is too randomized to mimic video data accurately, and some of the 3D models do not render correctly in Blender.

Sample of rendered image sequences after resizing and padding. This displays 6 different classes from the 12 total categories of PASCAL 3D+.

Results of our VGG+RNN model on our dataset.

MSE: the mean squared difference between the ground truth bounding box and our model's output.
IOU: the intersection volume over the union volume of the 3D bounding boxes.
Accuracy: we say our model is correct if the IOU > 0.3.

A sketch of these metrics is given below.
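For concreteness, here is a minimal NumPy sketch of the three metrics for axis-aligned boxes. Treating (x, y, z) as the box center is our assumption; the poster does not state the box parameterization.

```python
# Assumed metric implementations; box = (x, y, z, width, height, depth).
import numpy as np

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU, assuming (x, y, z) is the box center."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    # Per-axis overlap, clamped at zero when the boxes are disjoint.
    overlap = np.maximum(0.0, np.minimum(a_max, b_max) - np.maximum(a_min, b_min))
    inter = overlap.prod()
    union = a[3:].prod() + b[3:].prod() - inter
    return inter / union if union > 0 else 0.0

def mse(preds, truths):
    # Mean squared difference between predicted and ground-truth box parameters.
    return float(np.mean((np.asarray(preds, float) - np.asarray(truths, float)) ** 2))

def accuracy(preds, truths, threshold=0.3):
    # A prediction counts as correct when its IoU exceeds the 0.3 threshold.
    return float(np.mean([iou_3d(p, t) > threshold for p, t in zip(preds, truths)]))
```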

Test data: Currently, testing is done on the rendered image sequences; testing on real images would produce a more accurate measure of performance. Other 3D datasets with real images are available for testing but do not have the 3D bounding box labels we require. Additionally, there does not seem to exist a 3D dataset with image sequences or video frame data.