
Toward Long Distance Tabletop Hand-Document Telepresence

Conclusion & Future Work

Evaluation

Introduction

Proposed Method

• Two distributed environment setups over long distances
• Palo Alto/US ~ Yokohama/Japan (~5100 miles)
• Palo Alto/US ~ Verona/Italy (~6000 miles)

System Setup

• We presented a novel system for hand-document telepresence with high resolution document capture and hand skeleton tracking, and with two separate channels for transmitting these data.

• We evaluated our system over long distances, and compared our system to a tele-immersive system that was tested over much shorter distances.

• For future work, one direction is to evaluate our system on user level tasks over long distances, and to use our system in realistic situations with remote colleagues.

Related Work
[1] 4K document capture: (Kim et al., DocEng’15)
[2] Deep learning based hand tracking: (Zimmermann et al., ICCV’17)
[3] H-TIME: Haptic-enabled tele-immersive musculoskeletal examination: (Tian et al., ACM Multimedia’17)

Chelhwon Kim1, Patrick Chiu1, Joseph de la Pena2, Laurent Denoue1, Jun Shingu2, Yulius Tjahjadi1

1FX Palo Alto Laboratory, 2Fuji Xerox

Standard video conferencing Our system

• Discussion over a document in a telepresence setting: showing the user’s hand position on the document

• Problem: How to capture and transmit the hand movements efficiently with high resolution document images

• Too low-res to read document
• Occlusion by hand

• Hi-res document capture
• Hand skeleton: less occlusion
• Two-channel data transmission at different rates

Document & Hand Tracking

Normalized skeleton by homography (figure axes marked 1.0)

Page boundary & hand skeleton detection
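The "normalized skeleton by homography" step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the four detected page corners are mapped onto the unit square (suggested by the 1.0 axis labels), and the function names `homography` and `normalize_skeleton` are hypothetical. It uses only the standard library, solving the usual eight-unknown linear system for a four-point homography.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Assumed target frame: the page is mapped onto the unit square.
UNIT_SQUARE: List[Point] = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate page corners")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: List[Point], dst: List[Point]) -> List[List[float]]:
    """3x3 homography (with h33 fixed to 1) mapping four src points to dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: List[List[float]], p: Point) -> Point:
    """Map one image point through h, including the perspective divide."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def normalize_skeleton(corners: List[Point], keypoints: List[Point]) -> List[Point]:
    """Express hand keypoints in page coordinates (unit square)."""
    h = homography(corners, UNIT_SQUARE)
    return [apply_homography(h, p) for p in keypoints]
```

Because the keypoints are expressed relative to the page rather than the webcam image, a remote viewer could redraw them over the high-resolution page image by scaling with that image's width and height.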

Stitched and rectified hi-res image

A sequence of high-res document images

Document capture

Light-weight app on a web browser at the remote site

Server

Hi-res document image (~12 MB)

Hand skeleton data (~2.6 kB)

4K camera

webcam

• 4K camera for hi-res document capture [1]

• Webcam for hand tracking (Deep Learning based hand pose estimation [2]) & Document page tracking

• Document page on tabletop

Table 1: Data transmission statistics of hand skeleton data from US to Japan and to Italy. For Latency, the first two numbers are the lower and upper bound of the mean of the latency. For Jitter, the first number is the mean of the jitter. The numbers in parentheses are the standard deviation. All values are in msec.

Palo Alto-Yokohama    Session 1       Session 2      Session 3
Latency               78~180 (111)    9~115 (11)     74~171 (82)
Jitter                79 (77)         6 (10)         78 (66)

Palo Alto-Verona      Session 1       Session 2      Session 3
Latency               11~189 (21)     20~188 (31)    23~191 (46)
Jitter                8 (19)          9 (26)         13 (41)

Table 2: Data transmission latency of document page image (12 MB) from US to Japan and to Italy. All values are in sec.

                      Session 1    Session 2    Session 3
Palo Alto-Yokohama    24.44        11.71        26.53
Palo Alto-Verona      24.68        13.96        28.48

the system latency with the upper and lower bounds of the time difference d respectively. The green bar represents the theoretical network latency that matches the speed of light.
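The theoretical speed-of-light latency can be estimated with a short calculation. This sketch assumes a straight-line path at the vacuum speed of light and uses the approximate distances quoted for the two links (about 5100 and 6000 miles); the function name is illustrative.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # vacuum speed of light
KM_PER_MILE = 1.609344


def light_latency_ms(distance_miles: float) -> float:
    """One-way travel time of light over the given distance, in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000.0


for site, miles in [("Yokohama", 5100), ("Verona", 6000)]:
    print(f"Palo Alto to {site}: {light_latency_ms(miles):.1f} msec")
```

These come to roughly 27 msec and 32 msec. Real links are slower, since signals in fiber travel below the vacuum speed of light and routes are not straight lines.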

Table 1 shows the upper and lower bounds of the mean of the system latency for all test sessions. The numbers in parentheses are the standard deviations of the latency. The system’s jitter in the table is defined as the delay between two consecutive hand skeleton data samples at the receiving side. While the data transmission latency between the US and Japan shows some large variations (especially for Session 1 and Session 2), the latency between the US and Italy is more stable (i.e., small standard deviations). The average of the means of the upper bound (worst latency) across all test sessions in Table 1 is 155 msec from Palo Alto/US to Yokohama/Japan (approximately 5100 miles), and 189 msec to Verona/Italy (approximately 6000 miles). The network latency at the speed of light from Palo Alto to Yokohama is around 27 msec, and to Verona is around 32 msec.
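The worst-case averages quoted above follow directly from the upper bounds in Table 1, and the jitter definition can be read literally as statistics over gaps between consecutive arrivals. The sketch below makes both concrete. The exact jitter computation is not given in the text, so `jitter_stats` is one plausible reading (it assumes a population standard deviation), and the example timestamps in the usage note are made up.

```python
from statistics import mean, pstdev
from typing import List, Tuple

# Upper bounds of the mean latency per session, copied from Table 1 (msec).
UPPER_BOUND_MS = {
    "Palo Alto-Yokohama": [180, 115, 171],
    "Palo Alto-Verona": [189, 188, 191],
}


def worst_case_average(upper_bounds_ms: List[float]) -> float:
    """Average of the per-session upper bounds of the mean latency."""
    return mean(upper_bounds_ms)


def jitter_stats(arrival_ms: List[float]) -> Tuple[float, float]:
    """Mean and standard deviation of gaps between consecutive arrivals."""
    gaps = [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]
    return mean(gaps), pstdev(gaps)


for link, bounds in UPPER_BOUND_MS.items():
    print(link, round(worst_case_average(bounds)), "msec")
```

For example, `jitter_stats([0, 100, 250, 300])` works on the gaps 100, 150, and 50 msec and returns their mean and spread.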

Table 2 shows the high-res document page image streaming latency to the remote site. The average latency is 20.9 sec from Palo Alto to Yokohama, and 22.4 sec from Palo Alto to Verona.
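The Table 2 averages can be reproduced from the per-session values. The sketch also derives an effective throughput for the 12 MB page image. That throughput figure is not reported in the source; it is a derived quantity that assumes decimal megabytes and treats the whole measured latency as transfer time.

```python
from typing import List

# Per-session page image latency, copied from Table 2 (sec).
TABLE2_SEC = {
    "Palo Alto-Yokohama": [24.44, 11.71, 26.53],
    "Palo Alto-Verona": [24.68, 13.96, 28.48],
}
IMAGE_MB = 12.0  # approximate page image size stated in the text


def average_latency(samples_sec: List[float]) -> float:
    """Arithmetic mean of the session latencies."""
    return sum(samples_sec) / len(samples_sec)


def effective_throughput_mbit_s(image_mb: float, latency_sec: float) -> float:
    """Implied throughput if the full latency were spent transferring data."""
    return image_mb * 8.0 / latency_sec


for link, samples in TABLE2_SEC.items():
    avg = average_latency(samples)
    rate = effective_throughput_mbit_s(IMAGE_MB, avg)
    print(f"{link}: {avg:.1f} sec average, about {rate:.1f} Mbit/s")
```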

The average system latency for transmitting the hand skeleton data and the hi-res document page data from Palo Alto to Yokohama is smaller than that from Palo Alto to Verona, which is consistent with the distances of the two remote sites from the local site.

We compare our system with the H-TIME [9] tele-immersive system in terms of transmission data size, distance between two distributed sites, one-way latency, and data frequency (see Fig. 8). Due to our small skeleton data size (~2.6 kB, 2 orders of magnitude smaller), we can achieve relatively low latency (~190 msec) over much longer distance (~6000 miles, 2 orders of magnitude farther) compared with H-TIME (~400 msec) that transmits mesh data (~450 kB) over much shorter distances (~30 miles). The frequency of data transmission is 6 fps in our system, and 25 fps in H-TIME.

Figure 8: Comparison of our system with H-TIME [9].
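The "2 orders of magnitude" statements in the H-TIME comparison can be checked with a base-10 logarithm of the ratios. The per-second data rates at the end are a derived illustration (payload size times frame rate), not figures reported in the source.

```python
import math


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Base-10 logarithm of the ratio between two positive quantities."""
    return math.log10(larger / smaller)


# Values quoted in the comparison (approximate).
OURS = {"payload_kb": 2.6, "distance_miles": 6000, "fps": 6}
H_TIME = {"payload_kb": 450, "distance_miles": 30, "fps": 25}

size_gap = orders_of_magnitude(H_TIME["payload_kb"], OURS["payload_kb"])
distance_gap = orders_of_magnitude(OURS["distance_miles"], H_TIME["distance_miles"])
print(f"payload: {size_gap:.2f} orders smaller")
print(f"distance: {distance_gap:.2f} orders farther")

# Derived illustration: data sent per second at the stated frame rates.
ours_rate_kb_s = OURS["payload_kb"] * OURS["fps"]
h_time_rate_kb_s = H_TIME["payload_kb"] * H_TIME["fps"]
print(f"{ours_rate_kb_s:.1f} kB/s vs {h_time_rate_kb_s:.0f} kB/s")
```

Both ratios are a little over two orders of magnitude, which matches the wording of the comparison.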

5 CONCLUSION AND FUTURE WORK

We presented a novel system for hand-document telepresence with high resolution document capture and hand skeleton tracking, and with two separate channels for transmitting these data. We evaluated our system over long distances, and compared our system to a tele-immersive system that was tested over much shorter distances.

For future work, one direction is to evaluate our system on user level tasks over long distances, and to use our system in realistic situations with remote colleagues. Further enhancements to our system include improving the transmission of the large document image using progressive decoding, optimizing the hand and document trackers to improve the frame rate, and supporting more than one hand over a document from multiple sites.

REFERENCES

[1] Géry Casiez, Nicolas Roussel, and Daniel Vogel. 2012. 1€ filter: a simple speed-based low-pass filter for noisy input in interactive systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2527–2530.

[2] Piotr Dollár and C Lawrence Zitnick. 2015. Fast edge detection using structured forests. IEEE transactions on pattern analysis and machine intelligence 37, 8 (2015), 1558–1570.

[3] Aaron M Genest, Carl Gutwin, Anthony Tang, Michael Kalyn, and Zenja Ivkovic. 2013. KinectArms: a toolkit for capturing and displaying arm embodiments in distributed tabletop groupware. In Proceedings of the 2013 conference on Computer supported cooperative work. ACM, 157–166.

[4] Richard Hartley and Andrew Zisserman. 2003. Multiple view geometry in computer vision. Cambridge university press.

[5] Chelhwon Kim, Patrick Chiu, and Henry Tang. 2015. High-Quality Capture of Documents on a Cluttered Tabletop with a 4K Video Camera. In Proceedings of the 2015 ACM Symposium on Document Engineering (DocEng ’15). ACM, New York, NY, USA, 219–222. https://doi.org/10.1145/2682571.2797074

[6] J. Matas, C. Galambos, and J. Kittler. 2000. Robust Detection of Lines Using the Progressive Probabilistic Hough Transform. Computer Vision and Image Understanding 78, 1 (2000), 119–137. https://doi.org/10.1006/cviu.1999.0831

[7] Suraj Raghuraman, Karthik Venkatraman, Zhanyu Wang, Balakrishnan Prabhakaran, and Xiaohu Guo. 2013. A 3D Tele-immersion streaming approach using skeleton-based prediction. In Proceedings of ACM Multimedia 2013. ACM, 721–724.

[8] A. Tang, C. Neustaedter, and S. Greenberg. 2007. Videoarms: embodiments for mixed presence groupware. People and Computers 20 (2007), 85–102.

[9] Yuan Tian, Suraj Raghuraman, Thiru Annaswamy, Aleksander Borresen, Klara Nahrstedt, and Balakrishnan Prabhakaran. 2017. H-TIME: Haptic-enabled tele-immersive musculoskeletal examination. In Proceedings of ACM Multimedia 2017. ACM, 137–145.

[10] E. Wood, J. Taylor, J. Fogarty, A. Fitzgibbon, and J. Shotton. 2016. ShadowHands: High-fidelity remote hand gesture visualization using a hand tracker. In Proceedings of ACM ISS 2016. ACM, 77–84.

[11] Ying Xiong. 2016. Fast and Accurate Document Detection for Scanning, Dropbox blogs. Retrieved August 9, 2016 from https://blogs.dropbox.com/tech/2016/08/fast-and-accurate-document-detection-for-scanning/

[12] Z. Yang, Y. Cui, Z. Anwar, R. Bocchino, N. Kiyanclar, K. Nahrstedt, R. H. Campbell, and W. Yurcik. 2006. Realtime 3d video compression for tele-immersive environments. In Proceedings of Multimedia Computing and Networking 2006. ACM.

[13] Christian Zimmermann and Thomas Brox. 2017. Learning to Estimate 3D Hand Pose from Single RGB Images. In Proceedings of ICCV 2017. IEEE, 4913–4921.

Data transmission with two channels
• Document page image (~12 MB): only when the page has changed
• Hand skeleton data (~2.6 kB): 5~6 fps
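The two-channel policy above can be sketched as a small sender that forwards a page image only when it differs from the last one sent and rate-limits skeleton updates. Everything here is a hypothetical illustration rather than the system's actual code: the class and method names are invented, page changes are detected by hashing the image bytes (the real system relies on its page tracker), and the `sent` list stands in for the two network channels.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TwoChannelSender:
    skeleton_fps: float = 6.0  # upper end of the 5~6 fps stated in the text
    sent: List[Tuple[str, int]] = field(default_factory=list)
    _last_page_digest: Optional[str] = None
    _last_skeleton_time: float = float("-inf")

    def submit_page(self, image_bytes: bytes) -> bool:
        """Send the page image only if it changed since the last send."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest == self._last_page_digest:
            return False
        self._last_page_digest = digest
        self.sent.append(("page", len(image_bytes)))
        return True

    def submit_skeleton(self, keypoints: List[List[float]], now: float) -> bool:
        """Send skeleton keypoints, dropping updates that exceed the rate."""
        if now - self._last_skeleton_time < 1.0 / self.skeleton_fps:
            return False
        self._last_skeleton_time = now
        payload = json.dumps(keypoints).encode("utf-8")
        self.sent.append(("skeleton", len(payload)))
        return True
```

Keeping the channels separate means a slow multi-megabyte page transfer does not have to delay the small, frequent skeleton messages.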

• Hand Skeleton Data (~2.6 kB) Latency (worst session)
• Palo Alto ~ Yokohama: ~180 msec
• Palo Alto ~ Verona: ~191 msec

• High-res Document Data (~12 MB) Latency (worst session)
• Palo Alto ~ Yokohama: ~26.53 sec
• Palo Alto ~ Verona: ~28.48 sec

• Our system can achieve relatively low latency over a long distance since we transmit the small hand skeleton data

• Comparison with H-TIME [3] system that transmits large mesh data over shorter distance