A Robust Abstraction for First-Person Video Streaming: Techniques, Applications, and Experiments

Neil J. McCurdy, William G. Griswold, Leslie A. Lenert
Department of Computer Science and Engineering
University of California, San Diego
Why stream first-person video?

• Remote vision at dangerous job sites
  – Disaster response
  – Hazmat
  – SWAT
• Live streams for remote loved ones
  – "My-day" live diaries
• Citizen reporting
  – Cell-phone cameras broadcasting newsworthy events
  – Think YouTube, but live
  – No tripods, no expert camera work
Challenges of first-person video

• Limited bandwidth "in the wild"
  – Cellular networks (60-80 Kbps)
  – Multiple cameras on 802.11 drop total throughput
• First-person video compression is difficult
  – Low inter-frame overlap reduces compression opportunities
  – Must reduce either frame rate or image quality
  – Low frame-rate video is disorienting: how do the frames relate to one another?
• Aesthetic challenges
  – Blair Witch-style nausea
  – Constant motion is difficult to track
  – Camera operator's interests may not intersect the viewer's interests
RealityFlythrough (RFT): A novel solution

What we do
• Reduce the frame rate
• Approximately reconstruct camera motion using sensors and image processing

Benefits
• High-quality frames
• Disorientation minimized
• Long dwell time on each frame
• Aesthetically appealing
  – Calm
  – Mesmerizing
Roadmap
• Introduction
• Video compression challenges
• How RealityFlythrough works
• Experimental results
• Conclusion
Video compression challenges revisited

• High-panning video has little redundancy between frames
  – Most codecs do little better than MJPEG
  – e.g., sizes of different encodings of the 1st clip:
    • mpeg4: 364 KB
    • mjpeg: 359 KB
• Of course, with redundancy, mpeg4 improves
  – For the 2nd clip:
    • mpeg4: 284 KB
    • mjpeg: 386 KB
• Decimating the frame rate to preserve image quality reduces temporal redundancy even further, forcing still more decimation of the frame rate
  – Causes confusion and disorientation
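This kind of comparison is easy to reproduce. Below is a minimal sketch, assuming ffmpeg is installed; the input filename and quality setting are illustrative assumptions, not the encoding parameters used in our experiments.

```python
# Sketch: compare MJPEG vs. MPEG-4 output sizes for a first-person clip.
# The clip name and -q:v setting are illustrative, not the paper's setup.
import os
import subprocess

CLIP = "first_person_clip.avi"  # hypothetical input clip

for codec, out in [("mjpeg", "clip_mjpeg.avi"), ("mpeg4", "clip_mpeg4.avi")]:
    subprocess.run(
        ["ffmpeg", "-y", "-i", CLIP, "-c:v", codec, "-q:v", "5", out],
        check=True,
    )
    print(f"{codec}: {os.path.getsize(out) / 1024:.0f} KB")
```

On high-panning footage the two sizes come out close, as in the numbers above, because inter-frame prediction has little overlap to exploit.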
How RFT Works

RFT System Architecture

[Architecture diagram: Cameras perform ImageCapture and SensorCapture; StreamCombine merges them into an H.323 video-conferencing stream (352x288 video resolution), carried over 802.11 or 1xEVDO cellular (~60 Kbps) to the RFT Server, which hosts the RFT MCU (Multipoint Control Unit) and the RFT Engine.]
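As one way to picture the StreamCombine step, here is a minimal sketch that pairs each frame with the sensor reading taken at capture time. The field names and wire format are assumptions; the deployed system actually sends an H.323 video-conferencing stream.

```python
# Sketch of the StreamCombine idea: attach the sensor reading taken at
# capture time to each frame, so the server can position the frame in
# 3D space. Field names and framing are illustrative assumptions.
import json
import struct

def combine(jpeg_bytes: bytes, yaw: float, pitch: float, roll: float) -> bytes:
    """Prefix a JPEG frame with length-delimited sensor metadata."""
    meta = json.dumps({"yaw": yaw, "pitch": pitch, "roll": roll}).encode()
    # Two big-endian lengths, then metadata, then the frame itself.
    return struct.pack("!II", len(meta), len(jpeg_bytes)) + meta + jpeg_bytes
```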
Simplifying 3D space

• We know the orientation of each frame
• We project the camera's image onto a virtual wall at that same orientation
• When the user's orientation is the same as the camera's, the entire screen is filled with the image
• Results in a 2D simplification of 3D space
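A minimal sketch of the virtual-wall idea follows: place the frame on a plane facing the direction the camera pointed, then project it from the viewer's current orientation. The focal length, wall distance, and wall extents are illustrative assumptions.

```python
# Sketch of the "virtual wall": the frame sits on a plane at the
# camera's recorded orientation; the viewer sees it in perspective.
import numpy as np

def rotation(yaw: float, pitch: float) -> np.ndarray:
    """World-from-camera rotation: yaw about y, then pitch about x."""
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    return Ry @ Rx

def wall_corners(yaw, pitch, dist=1.0, half_w=0.5, half_h=0.4):
    """3D corners of the wall the frame is projected onto."""
    local = np.array([[-half_w, -half_h, dist], [half_w, -half_h, dist],
                      [half_w, half_h, dist], [-half_w, half_h, dist]])
    return (rotation(yaw, pitch) @ local.T).T

def project(points, viewer_yaw, viewer_pitch, f=1.0):
    """Project world points into the viewer's image plane (z > 0 assumed)."""
    cam = (rotation(viewer_yaw, viewer_pitch).T @ points.T).T
    return f * cam[:, :2] / cam[:, 2:3]

# When the viewer's orientation matches the camera's, the projected quad
# fills the (normalized) screen symmetrically:
print(project(wall_corners(0.3, 0.1), 0.3, 0.1))
```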
The transition

• A transition between frames is achieved by moving the user's orientation from the point of view of the source frame to the point of view of the destination frame
• The virtual walls are shown in perspective
• Overlapping portions of images are alpha-blended
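A minimal sketch of such a transition, assuming simple linear interpolation and a fixed step count (the actual easing and timing are not specified here, and angle wraparound is ignored for brevity):

```python
# Sketch of a transition: interpolate the viewer's orientation from the
# source frame's pose to the destination's, cross-fading the overlap.
import numpy as np

def transition_poses(src, dst, steps=30):
    """Yield interpolated (yaw, pitch) viewer orientations.

    Linear interpolation; angle wraparound at +/-pi is ignored here.
    """
    for t in np.linspace(0.0, 1.0, steps):
        yield tuple((1 - t) * np.asarray(src) + t * np.asarray(dst))

def blend(img_a: np.ndarray, img_b: np.ndarray, t: float) -> np.ndarray:
    """Alpha-blend overlapping pixels; t runs 0 -> 1 over the transition."""
    mix = (1 - t) * img_a.astype(float) + t * img_b.astype(float)
    return mix.astype(np.uint8)
```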
Images are projected inside a sphere

[Figure: frames rendered on the inside of a sphere surrounding the viewer]
Point matching improves the experience

• If frames overlap, point matching allows for more accurate placement
  – Uses the SIFT method [Lowe, 2004]; autopano implementation
  – The client device computes the match and transmits the meta-data with the frame
• 2D morphing between frames improves the blend
• Works with inter-frame and inter-camera matching
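A minimal stand-in for this step is sketched below using OpenCV's SIFT rather than the autopano implementation the system actually uses; the ratio-test and RANSAC thresholds are illustrative.

```python
# Sketch: SIFT point matching between two frames, returning a homography
# that places one frame relative to the other (OpenCV as a stand-in for
# the autopano implementation referenced above).
import cv2
import numpy as np

def match_frames(gray_a: np.ndarray, gray_b: np.ndarray):
    """Return a homography mapping frame A onto frame B, or None."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    if len(good) < 4:
        return None  # too little overlap; fall back to sensor placement
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```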
Point matching meets sensors

• New point-matched frames join the panorama
• The panorama consists of the 5 most recent frames (older ones are discarded)
• A new panorama is started when a non-point-matched frame arrives; sensor data positions that frame
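The bookkeeping this implies is small; here is a minimal sketch under the assumptions above (class and field names are illustrative, not the system's actual API):

```python
# Sketch of panorama bookkeeping: keep the 5 most recent point-matched
# frames; an unmatched frame, placed by sensor data, starts a new panorama.
from collections import deque

class Panorama:
    """Holds the 5 most recent frames; deque drops the oldest automatically."""
    def __init__(self):
        self.frames = deque(maxlen=5)  # (frame, placement) pairs

    def add(self, frame, placement):
        # `placement` is a homography for point-matched frames, or a
        # sensor-derived pose for the frame that seeds a new panorama.
        self.frames.append((frame, placement))

def on_frame(current, frame, homography, sensor_pose):
    """Grow the current panorama, or start a new one when matching fails."""
    if homography is not None:
        current.add(frame, homography)
        return current
    fresh = Panorama()
    fresh.add(frame, sensor_pose)  # sensor data positions the frame
    return fresh
```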
Experimental results

Field study

Experimental setup
• Hazmat bulking process
  – Full hazmat suits worn
  – Labor-intensive
  – Accurate motion model for the head-mounted camera
• 0.5 fps video transmitted over 1xEVDO
• A hazmat supervisor used the video to explain the bulking process

Results
• Ran for 64 minutes
• Much more camera motion than expected
• The supervisor preferred transitions over other encoding techniques
  – Not because of frame quality
  – Traditional first-person video was too busy ("It interferes with my thinking. Literally, it's messing with my head")
  – 1 fps "video" without transitions was seen as useless
Lab study

• Goal: determine whether people may actually prefer transitions to traditional first-person video

Experimental setup
• Three first-person videos encoded in 4 different ways:
  – encFast: RFT transitions sampled at 1 fps
  – encSlow: RFT transitions sampled at 0.67 fps
  – encIdeal: regular video encoded at 11 fps (∞ bitrate)
  – encChoppy: regular video encoded at 5 fps (same bitrate as the RFT encodings)
• Subjects did side-by-side comparisons and ranked the encodings in order of preference
• Subjects answered questions to help them arrive at a task-independent ranking
Taking out the trash

[Video stills comparing the encChoppy, encFast, and encIdeal encodings]
Results and analysis

• 12/14 subjects preferred one of our encodings to encChoppy
• 4/14 subjects preferred our encodings to encIdeal, with 4 more on the fence!
• Our encodings grew on people (4 subjects ranked our encodings higher at the end of the experiment than at the beginning)
• Positives: calm, smooth, slow-motion, sharp, artistic, soft, not-so-dizzy
• Negatives: herky-jerky, artificial, makes me feel detached, insecure

Our encodings gave subjects time to catch up with what the camera operator was seeing; first-person video tends to dart around too much.
Conclusion

• First-person video is difficult to compress
• To stream it, we must sacrifice image quality or frame rate
• Very low frame-rate video (< 5 fps) is disorienting
• Video streamed at a low bitrate (e.g. 60 Kbps) loses both frame rate and image quality and can be painful to watch

[Diagram: low overlap → low frame-rate → low quality]

• Our solution
  – Transmit high-quality, low frame-rate (~1 fps) video along with tilt-sensor meta-data
  – "Reconstruct" intervening frames by inferring camera motion from the meta-data

http://www.realityflythrough.com
[email protected]
Other slides

Lab study results
[Figure]
Why digital instead of analog?

• RealityFlythrough piggy-backs on the wireless mesh network that first responders deploy on-site
• Varying network conditions can be better managed in the digital domain: frame rates can be throttled and image quality can be degraded (see the sketch below)
  – We can also guarantee eventual delivery of high-quality data
• Multiple cameras can be supported using the same bandwidth-management techniques
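As a minimal sketch of that digital-domain throttling, the policy below maps a measured uplink rate to a frame rate and JPEG quality. The thresholds and quality values are illustrative assumptions, not the deployed system's policy.

```python
# Sketch: adapt frame rate and image quality to measured bandwidth.
# Thresholds and quality values are illustrative assumptions.
def choose_encoding(measured_kbps: float):
    """Pick (fps, jpeg_quality) for the available uplink."""
    if measured_kbps >= 300:
        return 5.0, 90
    if measured_kbps >= 100:
        return 2.0, 85
    if measured_kbps >= 60:   # 1xEVDO-class link, as in the field study
        return 1.0, 80
    return 0.5, 70            # degrade gracefully; send high quality later
```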
Related Work

• Panoramic Viewfinder – Baudisch et al.
• Recognizing Panoramas – Brown and Lowe
• View Morphing – Seitz and Dyer
• Efficient Representations of Video Sequences and their Applications – Irani et al.
• Predictive perceptual compression for real time video communication – Komogortsev and Khan