architekture.com, inc


design with intelligence

Optimizing Video Conferences with Macromedia Flash Technologies

Jim Cheng


Allen Ellison

February 2005

ii Copyright 2005, Architekture.com, All Rights Reserved.

Copyright © 2005 Architekture.com, Inc. All rights reserved. This white paper is for information purposes only. ARCHITEKTURE.COM MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT. Macromedia, Macromedia Flash, Flash Communication Server, and Flash Player are either trademarks or registered trademarks of Macromedia, Inc. in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners. ARCHITEKTURE.COM, INC. 600 GRANT STREET SUITE 850 DENVER, CO 80203 (720) 231-3166


INTRODUCTION

Macromedia Flash Communication Server and Macromedia Flash Player together offer many exciting possibilities for live video conferencing. Choosing the optimal hardware and software settings, however, has remained burdensome and arcane. All too often, developers must contend with audio synchronization problems, frozen video images, and lag. Even for seasoned Macromedia Flash developers, implementing a high-quality Flash-based video conferencing application becomes a challenge when confronted with the bewildering selection of cameras, network configurations, and software settings. Yet the ability to create high-quality video conferencing experiences in Flash is essential to meeting client expectations for many of today’s cutting-edge Flash Communication Server applications.

In the course of developing such applications for a variety of clients during 2004, Architekture.com conducted significant research on optimizing high-bandwidth video conferencing applications. Our goal was to find a good balance between video and sound quality while limiting the use of CPU and network resources, so as to mitigate problems associated with skipped frames, lag, or out-of-sync sound. We are pleased to present our findings and recommendations to the Flash developer community in this white paper.

Architekture.com is a leading Macromedia Flash development firm with recognized expertise in Flash Communication Server. Our world-class development team creates cutting-edge solutions that push the limits of what is thought possible. We specialize in the development of immersive, real-time multi-player simulations, as well as rapid prototype development and real-time business collaboration applications.


CONTENTS

Introduction
Why Optimization Matters
Focusing on the Client Side
Testing Environment
Hardware
    Cameras
    Microphones
    Networking
Software Settings
    Camera Settings
        Camera.setMode()
        Camera.setQuality()
        Camera.setKeyFrameInterval()
    Microphone Settings
        Microphone.setRate()
        Microphone.setGain() and Microphone.setSilenceLevel()
        Microphone.setUseEchoSuppression()
    Buffer Times
    Embedded Video Sizes
    MovieClip.attachAudio()
    Stream Latency
Scaling
    Flash Communication Server Limitations
    Network Limitations
    Client Machine Limitations
    CPU Utilization and Resolution
Summary
Appendix A: Error Margins and Significance
Appendix B: Detailed Experimental Setups and Results
    Camera Testing
    Encoding/Decoding and CPU Utilization
    Video Settings
    Scaling
Appendix C: Where to Download Test Files
Appendix D: IIDC/DCAM Camera List


WHY OPTIMIZATION MATTERS

Many-to-many video conferencing on desktop computers requires significant resources, both in processor utilization and in network bandwidth. To achieve optimal results, it is necessary to find a good balance between video and sound quality. That balance must keep resource use low enough that processor and network loads do not introduce deleterious effects such as frame skipping, lag, or out-of-sync sound into the video conference experience. Poor hardware choices and improper software settings often contribute to a poor video conferencing experience. The bewildering number of options can also make it seem next to impossible to create high-quality video conferencing experiences, even with best-of-breed tools. This discourages clients and developers alike. It also convinces many that, even with today’s technologies, video conferencing applications are difficult to use and cannot deliver on the promise of rich audio and visual communication between groups of individuals. Judicious choices of hardware configuration and software settings, however, can make all the difference between a glitchy, nearly useless video conference application and an impressive high-quality experience that exceeds client expectations. In the course of developing rich video conferencing applications using Macromedia technologies, we at Architekture.com have spent many hours determining the best choices for specifying and configuring collaborative video conferencing products for our clients. We hope that sharing our results with the Flash developer community will lead to the development and release of many high-quality video conferencing applications in the future.

FOCUSING ON THE CLIENT SIDE

Although Flash Communication Server plays a crucial role in facilitating video conferencing with Flash technologies, in live video conferencing it mostly serves to relay streams from one client machine to another. In our testing environments, we have noted that even fairly modest server hardware can easily accommodate relatively intensive video conferencing situations that push the limit of a typical professional license. A single 2.8 GHz Pentium 4 system with 512 MB of RAM is one such setup. The limitations affecting video conferencing performance are instead concentrated mainly on the client side, because this is where the bulk of the work is done. When publishing a stream, the client machine has to acquire video and audio data, encode it, and push it across the network to the server, all in real time. In a many-to-many conferencing situation, the same machine must also subscribe to the streams published by all of the other participants and decode them in real time. It must then present the results onscreen and through the speakers or headphones, again in real time (or as close to it as possible). Consequently, our optimization research and recommendations focus almost entirely on client-side systems.

TESTING ENVIRONMENT

Principal testing was conducted in the Architekture.com development laboratory on a hyper-threaded 2.8 GHz Pentium 4 computer with 1.25 GB of RAM, running Windows XP Professional SP1. The Flash Communication Server application ran on a machine with a similar processor and 512 MB of RAM under Windows Server 2003. These machines were connected through a switch on a 100 Mbps Ethernet LAN. Tests were conducted with in-house testing utilities running under Flash Player 7.0.19.0. We also conducted some additional testing on our clients’ machines for their proprietary video conferencing applications.

HARDWARE

A developer's ability to make or suggest hardware configurations for use with an application will vary depending on client requirements. However, we have found that the choice of hardware goes a long way in affecting the overall video conferencing experience. Even if you are building a video conferencing application for the web and have no control over the hardware configurations of client machines, these findings may help in determining minimum system requirements and in optimizing software settings for an expected range of client machines and network configurations. Our goal in making effective hardware choices for optimal performance is to minimize the load on the client processor and network while maintaining a high-quality audio and video stream. During our tests, we found that high processor loads were strongly correlated with poor performance, because the CPU’s time became divided between processes supporting the video conference and other applications contending for processor time. Maintaining reasonable network loads is an important secondary consideration, particularly in low-bandwidth settings, because available network bandwidth directly limits the amount of data that can be transferred between the client machine and Flash Communication Server.

CAMERAS

Cameras play a basic role in acquiring the video signal for conferencing applications. However, the video signal itself usually requires some degree of additional processing by the CPU before it is ready for use by the client Flash Player. Equally important are the drivers used to interface the camera with the operating system. Poorly written camera drivers, coupled with a camera’s high data throughput, can place even greater demands on the processor.

For most video conferencing applications, camera resolutions greater than 640 x 480 and frame rates greater than 30 frames per second (fps) are generally not necessary. Furthermore, consumer-level cameras intended for video conferencing seldom provide higher resolutions and frame rates than these for real-time video feeds. We therefore limit our discussion to such cameras. We do not consider the higher-resolution or higher-frame-rate cameras typically used for scientific and industrial applications.

Most cameras designed for video conferencing use one of two serial bus architectures to communicate with the client machine: USB (typically the faster 2.0 specification) or FireWire, also known as IEEE 1394. FireWire cameras can be further divided into two categories based on data transfer protocol:
- DV (digital video) cameras, which provide a compressed data stream to the computer.
- IIDC/DCAM cameras, which output uncompressed data streams and also offer camera hardware control over the FireWire bus.

Our tests, as well as available documentation, suggest that the protocols used to transfer data from the camera to the computer differ significantly in overall processor demands.

To determine the processor use required to handle video acquisition for different cameras, we tested three representative cameras under identical resolution and frame rate settings. Each used a different bus and protocol combination to transfer data to the client machine:
- Apple iSight, an IIDC/DCAM-compliant webcam that connects through a 400-Mbit FireWire bus.
- Sony DCR-TRV460, a DV-compliant consumer camcorder that also connects through a 400-Mbit FireWire bus.
- Creative Labs NX Ultra, a higher-quality USB webcam.

All cameras were specified by their manufacturers as having a maximum live video resolution of 640 x 480 pixels. All were also specified as capable of yielding streams of up to 30 fps, with the exception of the Creative NX Ultra camera, which was limited to 15 fps according to manufacturer specifications. Although the Sony DCR-TRV460 camera also sports a USB connection, we used only its FireWire DV connection for our tests. Table 1 provides an overview of the cameras we used for our tests.


Table 1: Basic Camera Capabilities

Camera              Data Bus         Max. Resolution   Max. FPS
Apple iSight        1394 IIDC/DCAM   640 x 480         30
Sony DCR-TRV460     1394 DV          640 x 480         30
Creative NX Ultra   USB              640 x 480         15
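The "high data throughput" of a camera can be put into rough numbers. The sketch below estimates the uncompressed bandwidth an IIDC/DCAM camera would push across the bus. It makes two assumptions:
- A YUV 4:2:2 pixel format at 16 bits per pixel. This is a common choice for such cameras, but we did not verify it for the iSight.
- Bus protocol overhead is ignored.

The function name is hypothetical.

```python
def raw_stream_mbit_per_s(width: int, height: int, fps: float, bits_per_pixel: int = 16) -> float:
    """Estimate uncompressed video throughput in megabits per second.

    bits_per_pixel defaults to 16, an ASSUMED YUV 4:2:2 pixel format.
    """
    bits_per_frame = width * height * bits_per_pixel
    return bits_per_frame * fps / 1_000_000


BUS_400_MBIT = 400  # nominal signaling rate of a 400-Mbit 1394 bus

for w, h, fps in [(160, 120, 15), (320, 240, 30), (640, 480, 30)]:
    mbit = raw_stream_mbit_per_s(w, h, fps)
    share = mbit / BUS_400_MBIT
    print(f"{w}x{h} @ {fps} fps: {mbit:.1f} Mbit/s ({share:.0%} of a 400-Mbit bus)")
```

Under these assumptions, a 640 x 480 stream at 30 fps amounts to roughly 147 Mbit/s of raw pixel data. That fits within a 400-Mbit 1394 bus, but it is still a substantial amount of data for the driver and CPU to move into Flash every second.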

We measured CPU utilization for locally visualizing video output from each camera at varying resolutions and frame rates. Our goal was to isolate the processor requirements for processing the video signal and importing it into Flash. We therefore conducted these tests entirely locally, using a simple Flash application running under Flash Player 7.0.19.0 without Flash Communication Server integration. All tested resolutions used the standard-definition aspect ratio of 4:3 (160 x 120, 200 x 150, 240 x 180, 320 x 240, 400 x 300, and 640 x 480). Each resolution was tested at frame rates of 1, 5, 10, 15, 24, and 30 fps. CPU utilization was measured using Windows Task Manager and averaged over roughly 30 seconds of video acquisition, with all other applications and non-essential processes disabled. We obtained data points for all cameras at all of our test resolutions and frame rates, but no camera supported all of the resolutions natively. After a Camera.setMode() call, the actual resolution and frame rate can be checked programmatically through the camera object’s width, height, and currentFps properties and compared with the requested values. When an unsupported resolution or frame rate was requested, Flash typically had the camera deliver the stream at a lower resolution and then scaled it up for display, with fairly obvious pixelization. Figure 1 shows example frame captures illustrating this pixelization effect.

[Figure 1 frame captures: Creative Labs NX Ultra, 240 x 180 requested (camera resolution: 176 x 132); Apple iSight, 240 x 180 requested (camera resolution: 240 x 180)]

Figure 1: Sample frame captures illustrating pixelization

In this example, a resolution of 240 x 180 was requested of both the Creative Labs NX Ultra and the Apple iSight cameras. The NX Ultra does not support a 240 x 180 capture resolution, so it instead yields a 176 x 132 stream. Flash then scales the image up to the display resolution of 240 x 180, which causes the pixelization. The Apple iSight, on the other hand, natively supports a 240 x 180 capture resolution, resulting in significantly better picture quality. Table 2 lists the supported resolutions for each camera in the test set.
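The pixelization in this example follows from simple arithmetic. When the camera delivers fewer pixels than requested, each captured pixel is stretched across more than one display pixel. The minimal sketch below uses the 240 x 180 request and the 176 x 132 stream from the NX Ultra case. The function name is hypothetical.

```python
def upscale_factors(requested: tuple[int, int], actual: tuple[int, int]) -> tuple[float, float]:
    """Return (linear scale, area scale) needed to stretch `actual` up to `requested`."""
    req_w, req_h = requested
    act_w, act_h = actual
    linear = req_w / act_w                    # horizontal stretch factor
    area = (req_w * req_h) / (act_w * act_h)  # display pixels per captured pixel
    return linear, area


linear, area = upscale_factors((240, 180), (176, 132))
print(f"linear stretch: {linear:.2f}x, display pixels per captured pixel: {area:.2f}")
# A native mode, such as the iSight's 240 x 180, needs no stretching at all:
print(upscale_factors((240, 180), (240, 180)))
```

Each captured NX Ultra pixel ends up covering nearly two display pixels, which is consistent with the visible blockiness.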

Table 2: Supported Camera Resolutions

Camera              160x120  200x150  240x180  320x240  400x300  640x480
Apple iSight        Yes      Yes      Yes      Yes      No       Yes
Sony DCR-TRV460     Yes      No       No       Yes      No       Yes
Creative NX Ultra   Yes      No       No       Yes      No       Yes

The cameras tested do not all support the same range of resolutions and frame rates. For this reason, we focused our comparative analysis on configurations supported by multiple cameras, even though we obtained data points for a significantly larger set of configurations. In particular, the 160 x 120, 320 x 240, and 640 x 480 resolutions allowed commensurate comparisons between the cameras. Those comparisons covered frame rates up to 15 fps for all cameras, and up to 30 fps for the Sony DCR-TRV460 and Apple iSight cameras.

We also made several interesting observations regarding frame rates:
- With the Creative NX Ultra camera, Flash was able to request and receive video streams at frame rates up to 30 fps, as reported by the Camera.fps property. The camera itself, however, is specified as having a maximum frame rate of 15 fps. We suspect this is due either to inaccurate reporting by the driver or to software-level interpolation, but our experimental results do not provide conclusive evidence for either possibility.
- The Apple iSight camera is not officially supported on the Windows platform, but we were able to use it with the default Microsoft drivers for 1394 desktop cameras. With this driver, however, the frame rate was capped at 15 fps. Using the third-party Unibrain Fire-i IIDC/DCAM driver instead allowed us to reach the camera’s specified hardware maximum of 30 fps, as shown in Figure 2.
- The Creative Labs NX Ultra camera yielded significantly noisier CPU utilization data than the other cameras during testing. We presume this is due to USB bus usage by other devices, including our keyboard and mouse, but could not conclusively determine the source.

Overall, the processor load results came in strongly in favor of the IIDC/DCAM-compliant Apple iSight camera. In all comparable cases, its processor utilization for image acquisition and import into Flash was roughly half that required by the other two cameras at the same resolution and frame rate. The Unibrain Fire-i driver slightly outperformed the Microsoft driver. Processor utilization was roughly comparable between the Sony DCR-TRV460 and the Creative NX Ultra cameras at low resolutions. At 320 x 240, the DV-compliant Sony DCR-TRV460 came in second, outperforming the Creative Labs NX Ultra. At 640 x 480 with higher frame rates, however, the Sony DCR-TRV460 came in last. As expected, processor utilization increased with higher resolutions and frame rates.

From a hardware perspective, we recommend IIDC/DCAM-compliant cameras, because their uncompressed data stream appears to significantly reduce the overhead needed to process the image for consumption by Flash. This matters most when processor resources are at a premium, for example on slower machines, with visually rich user interfaces, or in video conferences involving more than two participants.

Figure 2 shows graphs of experimental results for various requested resolutions at reported frame rates of 15, 24, and 30 fps (lower CPU utilization is better). Note that resolutions other than 160 x 120, 320 x 240, and 640 x 480 are not directly commensurable between cameras, because the cameras’ actual hardware resolutions differ.


[Figure 2 graphs: "Resolution vs. CPU Utilization" at 15, 24, and 30 FPS; x-axis: Resolution (160x120, 200x150, 240x180, 320x240, 400x300, 640x480); y-axis: % CPU Utilization (0 to 25)]

Figure 2: Resolution versus CPU utilization graphs
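The choice of 160 x 120, 320 x 240, and 640 x 480 as the comparable set can be checked directly against Table 2. The sketch below encodes the table as Python sets and intersects them. It adds no data beyond the table.

```python
# Native resolutions per camera, transcribed from Table 2.
SUPPORTED = {
    "Apple iSight": {(160, 120), (200, 150), (240, 180), (320, 240), (640, 480)},
    "Sony DCR-TRV460": {(160, 120), (320, 240), (640, 480)},
    "Creative NX Ultra": {(160, 120), (320, 240), (640, 480)},
}

# Resolutions that every camera supports natively allow a fair comparison.
common = set.intersection(*SUPPORTED.values())
print(sorted(common))

# Every listed resolution keeps the 4:3 standard-definition aspect ratio.
assert all(w * 3 == h * 4 for modes in SUPPORTED.values() for (w, h) in modes)
```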


MICROPHONES

One of the most common problems we encountered with microphones used for video conferencing was the introduction of unwanted echoes and background noise. Flash does provide software echo suppression, but we obtained significantly better results by reducing echoes and irrelevant background noise at the hardware level through proper microphone selection. Echoes and ambient noise are particularly undesirable for two reasons. They make speech less intelligible, and they interfere with our ability to accurately set the silence level needed to toggle the microphone activity state. In the course of developing video conferencing applications for our clients, we have experimented with a number of microphone setups, including analog headsets, USB headsets, and discrete microphone and speaker combinations. Our aim was to find the best configurations for high-quality sound capture with minimal unwanted noise. The best setup we have found so far for reducing echo and ambient noise is a noise-canceling USB headset. Software-level improvements to audio quality are discussed later.
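The way background noise interferes with the silence level can be illustrated with a simplified model. Flash’s Microphone.setSilenceLevel() takes an activity threshold and a timeout. The Python function below is a hypothetical stand-in that differs from it in two ways:
- It works on a list of activity readings on a 0 to 100 scale.
- It counts the timeout in samples rather than in milliseconds.

It is not Flash Player’s actual algorithm.

```python
def activity_states(levels, silence_level=10, timeout_samples=3):
    """Simplified model of silence-level gating (hypothetical, not Flash's code).

    The microphone counts as active whenever a reading reaches silence_level,
    and goes inactive only after `timeout_samples` consecutive quieter readings.
    """
    states, quiet_run, active = [], 0, False
    for level in levels:
        if level >= silence_level:
            active, quiet_run = True, 0
        else:
            quiet_run += 1
            if quiet_run >= timeout_samples:
                active = False
        states.append(active)
    return states


speech_then_pause = [40, 55, 30, 2, 3, 2, 1, 45]  # talking, a pause, talking again
room_hum = [12, 13, 12, 14, 13, 12, 13, 14]       # no speech, steady ambient noise

print(activity_states(speech_then_pause))  # goes inactive during the pause
print(activity_states(room_hum))           # the noise alone keeps it active
```

In the second run, steady ambient noise just above the threshold keeps the microphone marked active even though nobody is speaking. Raising the threshold to compensate risks cutting off quiet speech, which is one reason hardware-level noise reduction helps.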

NETWORKING

Our video conferencing application development is, for the most part, geared towards high-bandwidth intranet applications. For this reason, we primarily conduct our testing over 100 Mbit Ethernet connections, with and without non-RTMP “noise” traffic. In our experiments, we have not yet encountered any problems with network saturation. These experiments included up to five actual participants and simulated conferences involving up to 10 participants. For LAN-based intranet applications, a 100 Mbit Ethernet setup appears to be quite sufficient for video conferencing. We have not tested other local network technologies such as 802.11. Given ample bandwidth and network latencies comparable to those of 100 Mbit Ethernet, however, we would expect similar results.

High-quality live video conferencing over high-bandwidth Ethernet connections is possible for small numbers of simultaneous participants, even at relatively high resolutions such as 320 x 240. Additionally, with a judicious choice of video encoding parameters, bandwidth utilization can be capped at reasonably low levels without significant loss of video quality. A cap of 38,400 bytes per second per video stream is one example. We describe these parameters later.

In lower-bandwidth scenarios, such as conferencing across the Internet, available bandwidth will be markedly lower than on a LAN, and latency will increase. Latency here means the time elapsed between when the video is encoded on one machine and when it is decoded on the recipient machine. These issues are essentially facts of life when developing Internet-based applications. They can be dealt with fairly effectively, however, by minimizing bandwidth usage and allowing for increased latency.

It should also be noted that in many-to-many video conferencing, the bandwidth required grows much faster than the number of participants. Each client must download one stream from every other participant, so per-client download bandwidth grows linearly. The total bandwidth the server must relay grows quadratically: with n participants, the server sends n(n - 1) streams. We discuss this issue in greater depth when we consider network limitations on scaling. It is particularly relevant when bandwidth is limited, but it is an important concern for any collaborative video conference as the number of participants grows.
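The growth in bandwidth is easiest to see with the arithmetic written out. The sketch below rests on these assumptions:
- Every participant publishes one video stream at the same capped rate (the 38,400 bytes per second example above).
- Every participant subscribes to all of the others through a single server.
- Audio and protocol overhead are ignored.

The function is hypothetical.

```python
def conference_bandwidth(participants: int, per_stream_bytes: int = 38_400) -> dict:
    """Bytes per second for a conference relayed through one server.

    Each participant publishes one stream and subscribes to every other one.
    Audio and protocol overhead are ignored in this sketch.
    """
    n = participants
    return {
        "client_up": per_stream_bytes,                # one published stream
        "client_down": (n - 1) * per_stream_bytes,    # n - 1 subscribed streams
        "server_in": n * per_stream_bytes,            # streams arriving at the server
        "server_out": n * (n - 1) * per_stream_bytes, # streams relayed to subscribers
    }


for n in (2, 5, 10):
    b = conference_bandwidth(n)
    print(f"{n:>2} participants: client down {b['client_down'] * 8 / 1e6:.2f} Mbit/s, "
          f"server out {b['server_out'] * 8 / 1e6:.2f} Mbit/s")
```

Going from five to ten participants roughly doubles what each client downloads, but multiplies the server’s outgoing relay load by 4.5 (from 20 to 90 streams).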

SOFTWARE SETTINGS

We have experimented with a large number of the possible software settings in Flash Player 7 for video conferencing and have documented our observations in this section. In particular, we have found that many of the typical glitches observed in video conferencing can be addressed with changes in the settings used in the Flash Player client-side communication objects. We also review several other interesting items that we have found in engineering video conferencing applications.

CAMERA SETTINGS

The principal methods for manipulating the camera object in Flash Player are setMode(), setQuality(), and setKeyFrameInterval(). Because the camera object generates the bulk of the data streamed to Flash Communication Server, these settings have a significant effect on both video quality and the overall video conferencing experience. We’ll consider each of these methods in turn. For each, we discuss the possible options along with our observations, test results, and recommendations for configuring an optimal video conferencing experience.

Camera.setMode()

The Camera.setMode() method specifies the desired resolution and frame rate for the video data being captured. Of course, due to hardware limitations, each physical camera natively supports only certain resolutions and frame rates. If the specified settings are not intrinsically supported by the camera, Flash Player instead falls back to the closest possible setting. By default, capture size is favored over frame rate, but this preference can be reversed by passing false for the optional favorSize parameter. This behavior does allow specification of practically any resolution and frame rate. We have found, however, that using unsupported resolutions is undesirable, because it usually results in a pixelated image (as shown in Figure 1 earlier).
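The favorSize trade-off can be made concrete with a toy model in Python. The model is illustrative only:
- The camera modes in it are hypothetical.
- Closeness is judged by pixel-area difference.
- The function choose_mode is our own illustration. Flash Player’s real fallback logic may differ.

```python
# Hypothetical native modes (width, height, max fps) for an imaginary camera.
HYPOTHETICAL_MODES = [(160, 120, 30), (176, 132, 30), (320, 240, 30), (640, 480, 15)]


def choose_mode(width, height, fps, favor_size=True, modes=HYPOTHETICAL_MODES):
    """Toy model of falling back to the closest native camera mode.

    favor_size=True keeps the capture size as close as possible and accepts a
    lower frame rate. favor_size=False first restricts the search to modes that
    can reach the requested frame rate. This is NOT Flash Player's algorithm.
    """
    def size_distance(mode):
        return abs(mode[0] * mode[1] - width * height)

    candidates = modes
    if not favor_size:
        fast_enough = [m for m in modes if m[2] >= fps]
        # If no mode is fast enough, fall back to the fastest one available.
        candidates = fast_enough or [max(modes, key=lambda m: m[2])]
    w, h, max_fps = min(candidates, key=size_distance)
    return w, h, min(fps, max_fps)


print(choose_mode(640, 480, 30, favor_size=True))   # keeps 640 x 480, drops the frame rate
print(choose_mode(640, 480, 30, favor_size=False))  # keeps 30 fps, drops the size
print(choose_mode(240, 180, 15))                    # no native 240 x 180: falls back to 176 x 132
```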


From experience, we have found that resolutions of 160 x 120 and 320 x 240 tend to be good choices because they seem to be supported natively by many typical cameras used for video conferencing applications, and they are small enough to function well when encoding for streaming. It is possible to detect programmatically whether the specified size and frame rate were actually used for the camera hardware by inspecting the read-only width, height, and currentFps properties. From our previous tests conducting basic video capture without encoding for network transport, we observed that lower resolutions and frame rates reduce the processor demand on the machine. With this in mind, we recommend choosing the lowest acceptable capture size and frame rate for an application. For high-bandwidth intranet applications, we have found that a resolution of 320 x 240 at 24 fps works relatively well for up to five simultaneous participants. For conferences intended to be conducted across the Internet through broadband connections, capture size and frame rate will need to be scaled down accordingly.
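The check described above can be expressed as a small comparison routine. In a Flash client, the reported values would be read from the camera object’s width, height, and currentFps properties. In this Python sketch, they are passed in as a hypothetical CameraReport record instead. The 10 percent frame-rate tolerance is an assumption, since a measured frame rate fluctuates.

```python
from dataclasses import dataclass


@dataclass
class CameraReport:
    """Values read back after setMode(); mirrors width, height, and currentFps."""
    width: int
    height: int
    current_fps: float


def mode_mismatches(req_width, req_height, req_fps, report, fps_tolerance=0.1):
    """List the ways the delivered stream differs from what was requested.

    fps_tolerance is an ASSUMED fractional allowance for frame-rate jitter.
    """
    problems = []
    if (report.width, report.height) != (req_width, req_height):
        problems.append(
            f"size {report.width}x{report.height} instead of {req_width}x{req_height}"
        )
    if report.current_fps < req_fps * (1 - fps_tolerance):
        problems.append(f"{report.current_fps:g} fps instead of {req_fps:g} fps")
    return problems


# The Figure 1 case (15 fps assumed): 240 x 180 requested, 176 x 132 delivered.
print(mode_mismatches(240, 180, 15, CameraReport(176, 132, 15.0)))
# 320 x 240 at 24 fps, delivered essentially as requested.
print(mode_mismatches(320, 240, 24, CameraReport(320, 240, 23.8)))
```

An application could use a non-empty result to step down to the next lower resolution, in line with choosing the lowest acceptable capture size.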

Camera.setQuality()

Camera.setQuality() specifies two things: the maximum bandwidth, in bytes per second, to be used by an outgoing video stream, and the required video quality of that stream. By default, these are 16384 and 0, respectively. These settings allow several different setups, each with its own benefits:
- Setting bandwidth to zero lets Flash use as much bandwidth as necessary to maintain the specified video quality.
- Setting quality to zero lets Flash throttle video quality to avoid exceeding the given bandwidth cap.
- Setting quality to 100 uses the lossless, non-compressing codec instead.
- Specifying both an exact bandwidth limit and a required video quality is useful when both are equally important.

We were unable to detect significant differences in processor utilization between the various setups. However, our experiments revealed marked differences in how Flash handles the edge cases where quality or bandwidth must be sacrificed to remain within the specified limits. In particular, we focused on settings intended for intranets with high-bandwidth network connectivity. For the case where both a maximum bandwidth and a desired frame quality are specified, we found that certain ranges gave very acceptable results, with smooth playback and no audio synchronization issues. Those ranges were a bandwidth limit between 400,000 and 900,000 bytes per second and a frame quality setting of 60 to 85. Lower frame quality settings yielded increasingly pixelated video, as expected. Low bandwidth limits, however, yielded skipped frames, as described in the camera object’s documentation.
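As a reading aid, the setups just described can be summarized in a small Python function. The function describe_set_quality is hypothetical and does not call Flash; it only restates the rules in this section. It treats a bandwidth and quality of both zero as invalid, because this section does not describe that combination.

```python
def describe_set_quality(bandwidth: int, quality: int) -> str:
    """Restate this section's description of Camera.setQuality(bandwidth, quality).

    bandwidth is in bytes per second and quality runs from 0 to 100.
    Reading aid only; no Flash API is involved.
    """
    if bandwidth < 0 or not 0 <= quality <= 100:
        raise ValueError("bandwidth must be >= 0 and quality must be 0-100")
    if bandwidth == 0 and quality == 0:
        # Not covered by this section, so the sketch rejects it.
        raise ValueError("set at least one of bandwidth or quality")
    codec = "no compression" if quality == 100 else "compressed"
    if bandwidth == 0:
        return f"fixed quality {quality} ({codec}); bandwidth used as needed"
    if quality == 0:
        return f"bandwidth capped at {bandwidth} B/s; quality throttled to fit"
    return (f"quality {quality} ({codec}) and cap of {bandwidth} B/s both enforced; "
            "frames may be skipped")


for args in [(0, 80), (38_400, 0), (400_000, 70), (0, 100), (16_384, 0)]:
    print(args, "->", describe_set_quality(*args))
```

With the default pair of 16384 and 0, the function reports the capped-bandwidth case.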


We also note that when we chose relatively high bandwidth caps, actual outgoing bandwidth usage seemed to plateau below the specified cap. For example, for a 320 x 240 stream captured at 24 fps, we observed that total bandwidth usage seldom exceeded 250,000 bytes per second. This held even though maximum bandwidth was allocated for video and the server-to-client maximum total bandwidth on the Flash Communication Server application was set to higher values.

Next, we specified the frame quality and left bandwidth usage up to Flash (bandwidth set to zero). Under these conditions, we conducted a series of experiments to determine actual bandwidth usage and observed video quality for various frame quality settings. We simulated video conferencing conditions by publishing and self-subscribing to the same stream, using the settings recommended by Giacomo “Peldi” Guilizzoni on his weblog.

Table 3: Camera.setQuality() Basic Settings

Setting              Value
Bandwidth            0
FPS                  24
Favor Size           0
Frame Quality        As below
Key Frame Interval   8
Camera Width         280
Camera Height        208
Buffer Time          0
Audio Rate           22 kHz

Table 4 shows the results we obtained for each specified frame quality. Outgoing bandwidth usage per second and processor utilization were averaged over 30 seconds of simulated video conference usage with intermittent audio and relatively little physical motion.

Table 4: Variable Frame Quality Results

Frame Quality   Bandwidth (bytes/s)   CPU Util. (%)   Subjective Findings
100             250,000               33              Excellent picture, marked frame skipping
90              68,000                29              Excellent picture, some frame skipping
80              36,000                30              Excellent picture, occasional frame skipping
70              24,000                Not measured    Faint pixelization, smooth playback
60              19,000                Not measured    Mild pixelization, smooth playback
50              13,000                Not measured    Medium pixelization, smooth playback
40              11,000                Not measured    Loss of fine detail, smooth playback
30              10,000                Not measured    Moderate loss of detail, smooth playback
20              9,000                 Not measured    Severe loss of detail, smooth playback
10              8,000                 27              Loss of gross detail, smooth playback
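Flash expresses these figures in bytes per second, while network links are usually rated in bits per second. The sketch below converts Table 4’s measured rates to kilobits per second. It also computes an average per-frame size at the 24 fps used in Table 3. It uses only the table’s own numbers.

```python
# (frame quality, outgoing bytes per second, smooth playback?) transcribed from Table 4.
TABLE_4 = [
    (100, 250_000, False), (90, 68_000, False), (80, 36_000, False),
    (70, 24_000, True), (60, 19_000, True), (50, 13_000, True),
    (40, 11_000, True), (30, 10_000, True), (20, 9_000, True), (10, 8_000, True),
]
FPS = 24  # capture rate used for these runs (Table 3)

for quality, bytes_per_s, smooth in TABLE_4:
    kbit_per_s = bytes_per_s * 8 / 1000
    bytes_per_frame = bytes_per_s / FPS
    note = "smooth" if smooth else "frame skipping"
    print(f"q={quality:>3}: {kbit_per_s:7.0f} kbit/s, ~{bytes_per_frame:6.0f} B/frame, {note}")

# Highest frame quality that still played back smoothly in these runs.
best_smooth = max(q for q, _, smooth in TABLE_4 if smooth)
print("highest smooth quality:", best_smooth)
```

The jump from quality 90 to 100 multiplies the average frame size by more than three and a half, which fits the marked frame skipping observed at 100.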


From the data, we observed that CPU utilization dropped rather slowly with decreasing frame quality. High frame quality yielded very high-quality pictures at the cost of frame skipping, whereas lower frame quality yielded smooth playback by sacrificing detail. The sweet spot, as it were, seems to be a frame quality between 70 and 80. It is also interesting to note what happened at a frame quality of 100, which uses the lossless codec with no compression and causes exceptionally high bandwidth consumption. There, CPU utilization seems to be somewhat greater than when the frame quality is set to lower values and the video data is compressed. We then repeated the experiment using similar settings, with the frame quality set to 80 and the specified bandwidth varied, to obtain the results shown in Table 5.

Table 5: Variable Bandwidth Results

Spec. Bandwidth (bytes/s)   CPU Use (%)    Subjective Findings
19,200                      30             Smooth, significant pixelization upon movement
38,400                      Not measured   Smooth, some pixelization upon movement
51,200                      Not measured   Occasional frame skips, pixelization on gross movement
76,800                      Not measured   Frequent frame skips, pixelization with extreme movement
128,000                     Not measured   Frequent frame skips, high-quality picture
192,000                     Not measured   Frequent frame skips, high-quality picture
256,000                     Not measured   Very frequent frame skips, high-quality picture
384,000                     30             Constant frame skip, high-quality picture

Here, the trade-off seems to be between smooth video playback and greater pixelization upon movement. If the video image is very still over time, a high-quality picture can be obtained at practically all of the specified bandwidths. However, this is somewhat impractical for most video conferencing applications, where one would expect at least a small amount of movement. For a frame quality of 80, the sweet spot is apparently somewhere between 38,400 and 51,200 bytes per second, though 38,400 is quite acceptable if you don't mind momentary pixelization when a video conference participant moves suddenly. Processor utilization, however, appears to be fairly constant throughout. Capping the bandwidth while leaving the frame quality unspecified (a frame quality of 0), so that Flash modulates the frame quality as needed, has the considerable benefit of keeping bandwidth usage capped at relatively low levels without significantly sacrificing image quality. This is particularly important for low-bandwidth scenarios, such as video conferencing over the Internet, and for scaling video conferences to larger numbers of simultaneous participants on an intranet. It is our preferred setting, because momentary pixelization upon gross movements is preferable to frequent and unpredictable frame skipping. However, each application may benefit from experimentation with various bandwidth and frame quality settings, depending on its requirements and preferences. Alternatively, Guilizzoni offers a handy calculator for choosing these settings, with a number of configurable options, at:

http://www.peldi.com/blog/archives/2005/01/generating_opti_1.html

Camera.setKeyFrameInterval()

The key frame interval determines how often a full key frame is published to the stream, as opposed to an interpolated frame generated by the codec. Flash allows values from 1 to 48, with a default of 15 (every fifteenth frame is a key frame). Testing with various key frame intervals indicates that low intervals tend to increase frame skipping (because additional bandwidth is used to transmit full key frames more often), whereas large intervals reduce or eliminate frame skipping but lengthen the time the picture takes to return to normal quality after the frame quality has been automatically throttled down in response to motion. For applications demanding very high-quality video, we typically set the key frame interval equal to or greater than the frame rate, because we feel that occasionally lowered frame quality is preferable to frequent frame skipping.
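Because the key frame interval is counted in frames, its effect in wall-clock time depends on the frame rate. The Python sketch below, with hypothetical helper names, converts an interval into seconds between key frames and checks our rule of thumb that the interval be at least the frame rate (that is, no more than one key frame per second). The range check mirrors the 1 to 48 limits stated above.

```python
MIN_INTERVAL, MAX_INTERVAL = 1, 48  # range Flash accepts for the key frame interval


def seconds_between_key_frames(interval: int, fps: float) -> float:
    """Wall-clock spacing of key frames when every `interval`-th frame is a key frame."""
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        raise ValueError(f"key frame interval must be {MIN_INTERVAL} to {MAX_INTERVAL}")
    if fps <= 0:
        raise ValueError("frame rate must be positive")
    return interval / fps


def follows_rule_of_thumb(interval: int, fps: float) -> bool:
    """True when the interval is at least the frame rate (at most one key frame per second)."""
    return interval >= fps


if __name__ == "__main__":
    for interval, fps in ((15, 15), (30, 15), (48, 24), (15, 24)):
        spacing = seconds_between_key_frames(interval, fps)
        print(interval, fps, spacing, follows_rule_of_thumb(interval, fps))
```

For example, the default interval of 15 at 24 fps sends a key frame every 0.625 seconds, which falls short of the rule of thumb, while an interval of 48 at 24 fps spaces key frames two seconds apart.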

MICROPHONE SETTINGS

There are several settings that can be specified for the microphone object from within Flash. Specifically, these are the sampling rate, the gain, the silence level and time out, and whether to enable echo suppression in the audio codec. These settings affect sound acquisition and encoding for publishing to Flash Communication Server.

Microphone.setRate()

This method determines the sampling rate, in kilohertz (kHz), used to acquire sound from the microphone. Flash allows settings of 5, 8, 11, 22, and 44 kHz, with 8 being the default in most cases. In general, higher sampling rates yield more natural-sounding audio at the cost of increased bandwidth usage. We generally use settings of 22 or 44 kHz to achieve relatively high-quality audio transmission and haven't noticed significant performance gains with lower sampling rates.

Microphone.setGain() and Microphone.setSilenceLevel()

The gain on the microphone is applied as a multiplier that boosts the input, much like a volume knob: zero silences the audio, the default level of 50 leaves the signal strength unchanged, and the maximum value is 100. This setting is used in conjunction with the silence level, which determines the threshold above which the microphone is activated for publishing audio data. Optionally, the Microphone.setSilenceLevel() method can also take a second parameter specifying the silence timeout, which is the time in milliseconds that audio should continue to be published after the sound level drops below the specified silence level. We have noted that it can often be rather difficult to set the audio gain and silence levels as precisely as we would like to enable the microphone to toggle state correctly. In some cases, the sweet spot for the silence level has been as narrow as one unit: too low a value causes the microphone to be keyed on constantly and pick up all manner of ambient noise, while a slightly higher value fails to detect a video conference participant's voice at normal conversational volume. The proper choice of gain and silence level values seems to differ significantly between individual machines and microphone setups, so we are unable to recommend specific values beyond experimentation with particular hardware setups. We do, however, recommend implementing a telltale “talk” light in many cases so that participants can see whether their audio signal is being broadcast. Too frequently, we have seen video of a participant mouthing words silently on-screen, unaware that the microphone remains deactivated. If it is necessary to silence the audio programmatically in response to low activity levels, or to implement a push-to-talk feature, setting the gain to zero is an effective means of doing so. However, we have not found setting the silence level to 100 to be effective in all instances, because very loud microphone input can raise the activity level to 100 and thus breach the threshold.
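The interplay of gain, silence level, and silence timeout is easier to reason about with a model. The Python sketch below is our simplified model, not Flash's internal algorithm: it takes a series of activity-level samples (0 to 100) taken at a fixed period and reports, for each sample, whether the microphone would be considered active and the talk light lit. It assumes that a level equal to the silence level counts as crossing the threshold, which is consistent with a silence level of 100 being breached by very loud input, and it models only a gain of zero (silencing), not the gain's scaling effect. In a Flash client the light would more naturally be driven by the Microphone object's activity notifications; all names here are hypothetical.

```python
def talk_light(levels, silence_level, timeout_ms, sample_ms, gain=50):
    """Simplified model: one on/off flag per activity-level sample.

    The microphone turns on when a sample reaches `silence_level` and stays on
    until `timeout_ms` have passed since the last such sample. A gain of zero
    silences the input entirely; other gain values are not modeled.
    """
    states = []
    last_loud_ms = None
    for i, level in enumerate(levels):
        now_ms = i * sample_ms
        if gain > 0 and level >= silence_level:  # ">=" is an assumption (see lead-in)
            last_loud_ms = now_ms
        active = last_loud_ms is not None and now_ms - last_loud_ms <= timeout_ms
        states.append(active)
    return states


if __name__ == "__main__":
    # A short burst of speech followed by silence, sampled every 100 ms.
    print(talk_light([0, 30, 5, 5, 5, 5], silence_level=10, timeout_ms=200, sample_ms=100))
    # Push-to-talk "off": gain of zero keeps the light dark regardless of input.
    print(talk_light([0, 30, 5], silence_level=10, timeout_ms=200, sample_ms=100, gain=0))
```

The model makes the narrow sweet spot easy to see: nudging `silence_level` by a single unit can flip whether a given conversational level ever lights the indicator.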

Microphone.setUseEchoSuppression()

Flash allows optional echo suppression in the audio codec to be toggled on and off using ActionScript. We usually enable it with good results, although we have found that a more effective approach to echo reduction is to use USB headsets with noise cancellation rather than analog headsets or discrete microphone and speaker setups. This has the added benefit of filtering out most background noise before it reaches Flash, making it easier to choose the right silence level.

BUFFER TIMES

The NetStream object allows a buffer time to be set on both publishing and subscribing but with significantly different effects. If set on publishing, it determines the maximum size of the outgoing buffer that, when full, will cause the remaining frames to be dropped. The Macromedia documentation states that this is generally not a concern on high-bandwidth connections and we have found this to be the case in our use. On the subscribing end, the buffer time determines the amount of data to be buffered prior to display. We have typically set both of these to zero with excellent results for use with live video conferencing applications.

EMBEDDED VIDEO SIZES

Our experience with sizing embedded videos suggests that processor load is minimized when the embedded video object is sized to match the subscribed video stream's resolution exactly. In experiments where the displayed video was sized either larger or smaller than the published resolution, we observed increased processor utilization. Given that the camera resolution on a publishing machine can be changed easily, we recommend matching subscribers' embedded video object sizes with the stream's video resolution. The stream's native resolution can be determined programmatically on the subscriber machine by examining the attached Video object's width and height properties.

MOVIECLIP.ATTACHAUDIO()

To control certain aspects of a stream's incoming sound (such as volume and panning), a developer can use the MovieClip.attachAudio() method to attach the incoming sound to a MovieClip and then control it through a Sound object, as suggested in the client-side development documentation. In our experience, however, while this technique does provide additional control over the incoming sound, it also has an unfortunate tendency to desynchronize audio playback from video playback. We have not yet found an adequate solution to this problem and recommend against using MovieClip.attachAudio() on live video conferencing streams.

STREAM LATENCY

Latency can be a significant problem in many video conferencing situations. It manifests itself as the delay between events captured at the publishing machine and their arrival and display on a subscriber machine. Because there is no native provision for determining latency on the client, we measure it by broadcasting a message with the NetStream.send() method on a publishing machine and timing the interval between the initial broadcast and the receipt of the message on a second, self-subscribed stream. While this technique measures data latency, all of our observations thus far indicate that it coincides directly with video latency, so we have taken to interpreting data latency as video latency. In the course of our research, we have noted that, upon subscribing to a live stream, latency typically averages below 50 milliseconds (ms) when audio data is entirely absent. However, once streamed audio data is played back, latency typically increases rapidly to several hundred milliseconds, with little to no recovery to previous levels, even after the audio data has ceased. We have also observed that in some cases with continuous audio data (for example, when the microphone is always keyed on because of significant volume or too low a silence level), the measured latency increases slowly and continuously. While in many cases the latency tends to level out at 200 to 400 ms (values that we find acceptable), it sometimes continues to grow into the seconds, yielding a very poor-quality video conferencing experience. Although we can typically restore latency to low levels by closing the subscribed stream and resubscribing, this solution is not particularly appealing because it interrupts the video and audio for several seconds while the stream is reconnected. To date, we have not found an adequate way to cap latency at manageable levels. It is also important to note that we have not discovered a way to automate the measurement of audio latency; aside from a questionable hardware-based solution, such as feeding the speaker jack into the microphone jack and monitoring the audio activity level, we are at a loss as to how to measure it. A means of determining audio latency would, of course, be extremely valuable, because we could then identify and measure audio sync issues automatically as they occur.
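The probe-based measurement described above can be summarized in a few lines. The Python sketch below, using hypothetical class and method names, records the send time of each probe message and its arrival on the self-subscribed stream, keeps a rolling average, and flags sustained latency above a limit. In a Flash client the timestamps would come from the player's millisecond timer and the probes would travel via NetStream.send(); here they are plain numbers supplied by the caller. The 400 ms default reflects the upper end of the range we find acceptable, and whether to act on the flag by resubscribing is an application decision, given the interruption that resubscribing causes.

```python
from collections import deque


class LatencyMonitor:
    """Rolling average of probe delays measured on a self-subscribed stream."""

    def __init__(self, window: int = 10, limit_ms: float = 400.0):
        self.samples = deque(maxlen=window)  # most recent latencies (ms)
        self.limit_ms = limit_ms
        self.pending = {}                    # probe id -> send time (ms)

    def on_send(self, probe_id, sent_ms: float) -> None:
        self.pending[probe_id] = sent_ms

    def on_receive(self, probe_id, received_ms: float) -> None:
        sent_ms = self.pending.pop(probe_id, None)
        if sent_ms is not None:  # ignore probes we did not send or already counted
            self.samples.append(received_ms - sent_ms)

    def average_ms(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def sustained_high_latency(self) -> bool:
        """True once a full window of samples averages above the limit."""
        return len(self.samples) == self.samples.maxlen and self.average_ms() > self.limit_ms


if __name__ == "__main__":
    monitor = LatencyMonitor(window=3, limit_ms=400.0)
    for probe_id, sent, received in ((1, 0, 50), (2, 1000, 1900), (3, 2000, 3000)):
        monitor.on_send(probe_id, sent)
        monitor.on_receive(probe_id, received)
    print(monitor.average_ms(), monitor.sustained_high_latency())
```

Requiring a full window before flagging avoids reacting to a single slow probe, which matters given how disruptive a resubscription is.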

SCALING

While it is relatively easy to create a high-quality video conferencing experience for two simultaneous participants, the demands on both the network and the machines increase quickly as an application is scaled to involve greater numbers of simultaneous participants. Specifically, the bandwidth needed to support many-to-many video conferencing grows quadratically with the number of participants, such that n² streams are required for n participants: each participant publishes one stream and subscribes to the streams of the other n - 1 participants, giving n + n(n - 1) = n² streams in total. (For more information on bandwidth usage, see Brian Hock's Macromedia white paper entitled Calculating Your Bandwidth and Software License Needs for the Macromedia Flash Communication Server MX.) Additionally, each client machine must dedicate additional resources to decoding each subscribed stream. These factors limit the maximum number of participants in a single video conference on several fronts: the Flash Communication Server, the network infrastructure's available bandwidth, and the capabilities of the client machines.

FLASH COMMUNICATION SERVER LIMITATIONS

Flash Communication Server is licensed in increments of 10 Mbit per second or 2,500 simultaneous connections, so the primary consideration when it comes to scaling Flash Communication Server to accommodate increasing numbers of video conference participants is adequate bandwidth support by its current license(s). For video conferencing applications, the 10 Mbit per second peak bandwidth limit will almost surely be reached before coming close to making 2,500 simultaneous connections. There aren’t any limits on the number of streams served, just peak bandwidth usage and total simultaneous connections. A single professional license offers 10 Mbit per second, or about 1.19 megabytes per second in available bandwidth. To calculate the usage for a hypothetical case, let us assume a fairly typical high-bandwidth video conferencing stream with a maximum of 38,400 bytes per second allocated to video data, and a 44 kHz audio sampling rate. Experimentally, this utilizes a peak maximum of roughly 50 kilobytes per second.
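The license arithmetic above mixes bits and bytes, which is easy to get wrong. The Python sketch below (helper names are hypothetical) converts the 10 Mbit per second cap into bytes per second and shows that the "about 1.19 megabytes" figure corresponds to binary megabytes of 1,048,576 bytes. It also converts the 38,400 bytes per second video cap and the roughly 50 kilobytes per second per-stream estimate into kilobits per second, the unit in which network links are usually rated.

```python
BITS_PER_BYTE = 8
BYTES_PER_MIB = 1_048_576  # binary megabyte


def mbit_to_bytes_per_sec(mbit_per_sec: float) -> float:
    """Convert megabits per second (10**6 bits) to bytes per second."""
    return mbit_per_sec * 1_000_000 / BITS_PER_BYTE


def bytes_to_kbit_per_sec(bytes_per_sec: float) -> float:
    """Convert bytes per second to kilobits per second (10**3 bits)."""
    return bytes_per_sec * BITS_PER_BYTE / 1_000


if __name__ == "__main__":
    cap = mbit_to_bytes_per_sec(10)
    print(cap, round(cap / BYTES_PER_MIB, 2))  # license cap in bytes/s and MiB/s
    print(bytes_to_kbit_per_sec(38_400))       # video cap in kbit/s
    print(bytes_to_kbit_per_sec(50_000))       # per-stream estimate in kbit/s
```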


Using 50 kilobytes per second as our estimated bandwidth usage, for increasing numbers of participants, we can generate the total streams and estimated maximum bandwidth usage per second in Table 6.

Table 6: Example Bandwidth Calculations for n Participants

Participants | Total Streams | Max. Bandwidth (Bytes per Sec.)
2 | 4 | 200,000
3 | 9 | 450,000
4 | 16 | 800,000
5 | 25 | 1,250,000
6 | 36 | 1,800,000
7 | 49 | 2,450,000
8 | 64 | 3,200,000
9 | 81 | 4,050,000
10 | 100 | 5,000,000

Of course, these numbers are a rough estimate and probably err slightly on the high side, because we are assuming that all streams are simultaneously reaching their expected peak bandwidth utilization. However, we can use these figures to estimate the bandwidth load on the Flash Communication Server software. Given our earlier assumptions, a single professional license will likely become saturated somewhere between four and five simultaneous participants. To accommodate larger numbers of participants, the maximum bandwidth cap on Flash Communication Server would need to be increased by stacking additional licenses or purchasing higher capacity licenses from Macromedia. In practice, actual bandwidth usage will depend on the choice of settings and how the application is actually used. As screen real estate on the client side is also expected to diminish with increasing numbers of video conference participants, we recommend a strategy of reducing per-stream bandwidth usage with increasing numbers of participants by scaling down the capture resolution and frame rate, video bandwidth cap, or frame quality as the number of participants in a video conference grows. Even with an unlimited capacity license on Flash Communication Server, the limitations on hardware, operating system, and processor performance will eventually impose a hard ceiling on the number of simultaneous participants supported for a video conferencing application.
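The figures in Table 6 follow from two assumptions stated above: roughly 50,000 bytes per second per stream at peak, and n² streams for n participants. The Python sketch below (hypothetical names) regenerates the table, finds the largest conference whose estimated peak fits under a single 10 Mbit per second license, and shows how halving the per-stream budget changes that ceiling. It counts streams as each participant publishing one stream and subscribing to the other n - 1, which reproduces the n² figure.

```python
PER_STREAM_BYTES = 50_000            # estimated peak per stream (bytes/s)
LICENSE_CAP_BYTES = 10_000_000 // 8  # one 10 Mbit/s license = 1,250,000 bytes/s


def total_streams(participants: int) -> int:
    """n published streams plus n * (n - 1) subscriptions, i.e. n squared."""
    return participants + participants * (participants - 1)


def peak_bandwidth(participants: int, per_stream: int = PER_STREAM_BYTES) -> int:
    """Worst case: every stream at its expected peak simultaneously."""
    return total_streams(participants) * per_stream


def max_participants(cap: int = LICENSE_CAP_BYTES, per_stream: int = PER_STREAM_BYTES) -> int:
    """Largest conference whose estimated peak does not exceed the cap."""
    n = 1
    while peak_bandwidth(n + 1, per_stream) <= cap:
        n += 1
    return n


if __name__ == "__main__":
    for n in range(2, 11):
        print(n, total_streams(n), f"{peak_bandwidth(n):,}")
    print(max_participants())                   # 50,000 bytes/s per stream
    print(max_participants(per_stream=25_000))  # halved per-stream budget
```

With the default figures, the sketch reports five participants, whose estimated peak exactly equals the license cap; this is the saturation point between four and five participants described above. Halving the per-stream budget raises the ceiling to seven, which illustrates why reducing per-stream bandwidth pays off as conferences grow.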

NETWORK LIMITATIONS

Much of our research has focused on video conferencing in high-bandwidth intranet configurations with ample network headroom. However, network limitations should be kept in mind when scaling video conferencing applications for deployment on any network configuration, particularly in heavily used environments or across the Internet. Also, when comparing the bandwidth utilization reported by Flash Communication Server to the actual bandwidth used on the physical network, some additional network overhead for packet envelopes, retransmitted packets, and control messages should be taken into account. In our experience, video conferencing works very well in an intranet setting. However, in busy local network environments, you will need to account for additional, non-video conference traffic, such as e-mail, web browsing, and file transfers, that also contends for network bandwidth. Depending on local traffic volume and the network architecture, you may encounter lower available bandwidth and quality of service than might be expected under ideal conditions. While we have not encountered any problems traceable to network congestion in test cases involving both shared and dedicated 100 Mbit Ethernet connections, we do suggest testing to ensure that an application runs well in its specific network environment. When video conferencing is conducted over the Internet, other factors come into play. First, significantly greater latency and lower available bandwidth can be expected than in a local network configuration, even for users with broadband connections. Also, some users may have asymmetric upload and download bandwidth capacities. These limitations place additional constraints on the size and quality of the video streams that can be delivered to each client. As recommended by Guilizzoni and Hock, lowering the capture size, bandwidth, and video quality of your streams will be necessary to accommodate the limitations of Internet-based conferencing.
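Asymmetric Internet connections make it worth checking each participant's upload and download separately. The Python sketch below (hypothetical names) estimates both for an n-way conference in which each participant publishes one stream and subscribes to the other n - 1, and applies an optional overhead factor for packet envelopes, retransmissions, and control messages. We have not measured that overhead, so the factor, like the example link speeds, is purely an assumption to be replaced with figures from your own network.

```python
def client_bandwidth(participants: int, per_stream: float, overhead: float = 0.0):
    """Return (upload, download) bytes/s for one participant.

    `overhead` is an assumed fractional allowance for network overhead
    (0.10 means 10 percent); it is not a measured value.
    """
    factor = 1.0 + overhead
    upload = per_stream * factor                         # the participant's own stream
    download = (participants - 1) * per_stream * factor  # everyone else's streams
    return upload, download


def fits_connection(participants, per_stream, upload_limit, download_limit, overhead=0.0):
    """True if both directions stay within the connection's limits (bytes/s)."""
    upload, download = client_bandwidth(participants, per_stream, overhead)
    return upload <= upload_limit and download <= download_limit


if __name__ == "__main__":
    # Hypothetical asymmetric link: 16,000 bytes/s up, 192,000 bytes/s down.
    for per_stream in (50_000, 12_000):
        print(per_stream, fits_connection(4, per_stream, 16_000, 192_000, overhead=0.10))
```

On such a link the upload direction is usually the binding constraint, which is why lowering the capture size, bandwidth, and quality of each published stream matters most for Internet participants.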

CLIENT MACHINE LIMITATIONS

On the client machines, the principal consideration in scaling to larger numbers of participants is the incremental growth of the number of streams that need to be decoded and displayed. We have conducted a number of simulated tests on single video conference clients subscribing to and displaying up to 10 live streams without significant problems when used with reasonable bandwidth and quality settings. We observed that the settings in Table 7 yield very acceptable performance with smooth playback when decoding and rendering up to 10 incoming streams on our test machine.

Table 7: Recommended 10-Participant Settings

Bandwidth: 38,400
FPS: 15
Favor Size: 0
Frame Quality: 0
Key Frame Interval: 30
Camera Width: 160
Camera Height: 120
Buffer Time: 0
Audio Rate: 22 kHz

Average processor utilization for 10 streams utilizing the settings in Table 7 was only 36%, demonstrating that a high-quality 10-participant video conference is entirely possible on current systems using Macromedia Flash technologies. We have also conducted additional tests with varying parameters, but have found this combination of settings to yield the best results.
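When experimenting with settings like those in Table 7, it helps to keep each combination as a single record and check it against the limits described in this paper before applying it. The Python sketch below is a hypothetical record type, not a Flash API; its checks encode the key frame interval range of 1 to 48, the supported microphone rates of 5, 8, 11, 22, and 44 kHz, frame quality from 0 (unspecified) to 100, and our rule of thumb that the key frame interval should be at least the frame rate.

```python
from dataclasses import dataclass

SUPPORTED_AUDIO_KHZ = (5, 8, 11, 22, 44)


@dataclass(frozen=True)
class ConferenceSettings:
    bandwidth: int          # bytes per second; 0 leaves bandwidth unspecified
    fps: int
    favor_size: int         # as listed in our tables (0 throughout)
    frame_quality: int      # 0 leaves quality unspecified, 100 means no compression
    key_frame_interval: int
    camera_width: int
    camera_height: int
    buffer_time: float      # seconds
    audio_rate_khz: int

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means none were found."""
        issues = []
        if self.bandwidth < 0:
            issues.append("bandwidth must be zero or positive")
        if not 0 <= self.frame_quality <= 100:
            issues.append("frame quality must be between 0 and 100")
        if not 1 <= self.key_frame_interval <= 48:
            issues.append("key frame interval must be between 1 and 48")
        if self.audio_rate_khz not in SUPPORTED_AUDIO_KHZ:
            issues.append("unsupported microphone rate")
        if self.buffer_time < 0:
            issues.append("buffer time must be zero or positive")
        if self.key_frame_interval < self.fps:
            issues.append("key frame interval is below the frame rate")
        return issues


# Table 7: recommended 10-participant settings.
TEN_PARTICIPANTS = ConferenceSettings(
    bandwidth=38_400, fps=15, favor_size=0, frame_quality=0, key_frame_interval=30,
    camera_width=160, camera_height=120, buffer_time=0, audio_rate_khz=22,
)

if __name__ == "__main__":
    print(TEN_PARTICIPANTS.problems())
```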

CPU UTILIZATION AND RESOLUTION

We wanted to determine the effect of stream resolution on processor usage and determine optimal resolutions to use with different numbers of simultaneous participants. Using matched publishing and display resolutions, we measured averaged CPU utilization on our test machine over 60 to 90 seconds when subscribed to 4, 6, or 8 simulated video conferencing streams at various resolutions in 4:3 aspect ratios using the settings in Table 8.

Table 8: CPU Utilization versus Resolution Basic Settings

Bandwidth: 38,400
FPS: 24
Favor Size: 0
Frame Quality: 0
Key Frame Interval: 48
Buffer Time: 0
Audio Rate: 44 kHz

Figure 3 shows the plotted results obtained for the 4, 6, and 8 stream cases at various resolutions.

Figure 3: CPU utilization versus stream resolution area graph

The x-axis in this graph is measured in somewhat unusual units: the area resolution of a stream's video feed in thousands of pixels (kilopixels). For example, a video resolution of 320 x 240 yields an area of 76.8 kilopixels (320 x 240 = 76,800 pixels). To convert from the area back to the original 4:3 dimensions, divide the area in pixels by 12 and take the square root of the result; multiply this value by 4 to obtain the width and by 3 to obtain the height. We used this unit of measurement so that we could compare various resolutions against each other quantitatively. The numeric results are provided in Appendix B. The positions of the 160 x 120 and 320 x 240 capture resolutions, which are typically supported at the hardware level by many commonly used video conferencing cameras, are indicated on the graph for reference. At present, we suspect that the shelves in the graph, where CPU utilization remains fairly stable across substantial changes in resolution and then changes markedly between certain resolutions, stem from the encoding algorithm Flash uses to compress video. However, we do not have sufficient information to determine conclusively whether this is the case.
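The conversion between the area units on the x-axis and 4:3 dimensions can be checked with a short Python sketch (hypothetical helper names). Because a 4:3 frame is 4 units wide and 3 units tall, its area is 12 square units, so dividing the area in pixels by 12 and taking the square root recovers the unit length.

```python
import math


def area_kilopixels(width: int, height: int) -> float:
    """Area of a frame in thousands of pixels, the unit used on the x-axis."""
    return width * height / 1000


def dimensions_from_area(kilopixels: float) -> tuple[float, float]:
    """Recover (width, height) of a 4:3 frame from its area in kilopixels."""
    unit = math.sqrt(kilopixels * 1000 / 12)  # a 4:3 frame spans 12 square units
    return 4 * unit, 3 * unit


if __name__ == "__main__":
    for w, h in ((160, 120), (320, 240), (640, 480)):
        kp = area_kilopixels(w, h)
        print((w, h), kp, dimensions_from_area(kp))
```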


SUMMARY

In summary, we offer these findings on optimal hardware and software configurations for use in live video conferencing applications using Flash Communication Server:

• Cameras for video conferencing differ significantly in the processor load needed for video acquisition. We have found that Firewire cameras using the IIDC/DCAM protocol perform significantly better than USB cameras or DV Firewire cameras.

• USB headsets with active noise cancellation are preferred, because they provide superior sound quality and echo reduction compared to analog headsets or discrete setups.

• Resolutions natively supported by the camera hardware are preferable in order to avoid pixelization. Typically, 160 x 120 and 320 x 240 are supported and work reasonably well for streaming.

• Bandwidth utilization should be carefully balanced with image quality. Maximizing either or both tends to yield poor results. A bandwidth limit of about 38,400 bytes per second with an unspecified frame quality and a key frame interval at or above the camera frame rate serves our purposes rather well. Experimentation may be in order to find the configuration best fitting a given application’s requirements. Giacomo Guilizzoni has provided an easy-to-use calculator that recommends values for various setups at:

http://www.peldi.com/blog/archives/2005/01/generating_opti_1.html

• Microphone sampling rates of 22 or 44 kHz work well. Low sampling rates, while reducing bandwidth usage, also result in poor audio quality.

• Embedded videos used for displaying subscribed streams should be sized to match the originating camera resolution for optimal performance.

• MovieClip.attachAudio() should not be used to manipulate the audio from a subscribed stream. This has a tendency to introduce unwanted synchronization issues.


APPENDIX A: ERROR MARGINS AND SIGNIFICANCE

Most of our test results, particularly processor utilization figures read from Windows Task Manager, were obtained by manually estimating averages of values reported periodically by various tools. Additionally, video conferences were typically simulated by speaking calmly into our USB headsets in front of our test cameras for up to several minutes. Unfortunately, such practices limit our ability to reproduce visual and audio inputs exactly for each test case. As such, we have assumed a moderate error margin and, given our inability to obtain results with high precision or granularity, have refrained from reading significance into cases where only marginal differences were observed. We are actively working toward results with greater statistical rigor by researching tools that would yield both more reproducible test cases and more precise results. Using such tools, we would be able to analyze significant variations more effectively. To alleviate some of the problems that our current methods introduce, we provide detailed experimental results and community access to our experimental tools in these appendixes, so that others in the Flash Communication Server development community can reproduce our tests and compare results.
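One way to make the "moderate error margin" explicit is to record the periodic readings rather than estimating their average by eye, and to compare averages against a stated margin. The Python sketch below uses the standard library's statistics module; the helper names, the readings, and the margin value in the example are hypothetical, since we have not fixed a numeric margin.

```python
from statistics import mean, stdev


def summarize(readings):
    """Mean and sample standard deviation of periodic CPU-utilization readings (%)."""
    return mean(readings), stdev(readings)


def meaningfully_different(a, b, margin):
    """Treat two sets of readings as different only if their means differ by more than `margin`."""
    return abs(mean(a) - mean(b)) > margin


if __name__ == "__main__":
    run_a = [29, 30, 31, 30]  # hypothetical Task Manager readings
    run_b = [30, 31, 32, 31]
    print(summarize(run_a), meaningfully_different(run_a, run_b, margin=2))
```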


APPENDIX B: DETAILED EXPERIMENTAL SETUPS AND RESULTS

CAMERA TESTING

For our camera tests, we used three representative cameras supporting different protocols with our CamTest tool: the Apple iSight, an IIDC/DCAM-compliant webcam that connects via Firewire; the Sony DCR-TRV460, a DV-compliant camcorder that also connects via Firewire; and the Creative Labs NX Ultra, a USB webcam. All cameras were specified as having a maximum live video resolution of 640 x 480 pixels and as capable of yielding streams of up to 30 fps (with the exception of the Creative NX Ultra, which was limited to 15 fps). Although the Sony DCR-TRV460 camcorder also supports a USB connection, we tested it only over its DV connection.

Table 9: Basic Camera Specifications

Camera | Data Bus | Max. Resolution | Max. FPS
Apple iSight | IIDC/DCAM | 640x480 | 30
Sony DCR-TRV460 | DV | 640x480 | 30
Creative NX Ultra | USB | 640x480 | 15

CPU utilization for locally visualizing video output at varying resolutions and frame rates was measured for each camera with Windows Task Manager, with all non-essential processes disabled. To isolate the processor requirements for bringing the video signal into Flash, these tests were conducted entirely locally using a simple Flash application running under Flash Player 7.0.19.0, with no Flash Communication Server integration. All resolutions tested used the standard-definition aspect ratio of 4:3 (160 x 120, 200 x 150, 240 x 180, 320 x 240, 400 x 300, and 640 x 480), at rates of 1, 5, 10, 15, 24, and 30 fps. CPU utilization was averaged over roughly 30 seconds of video acquisition. Table 10 shows which of these resolutions each camera supported; the footnotes give the actual sizes of the video streams delivered when a given resolution was requested.


Table 10: Supported Resolutions for Test Cameras (Extended)

Camera | 160x120 | 200x150 | 240x180 | 320x240 | 400x300 | 640x480
Apple iSight | Yes | Yes | Yes | Yes | No¹ | Yes
Sony DCR-TRV460 | Yes | No² | No² | Yes | No¹ | Yes³
Creative NX Ultra | Yes⁴ | No⁵ | No⁵ | Yes⁴ | No⁶ | Yes⁴

¹ A video stream of 320 x 240 was obtained when a 400 x 300 stream was requested.
² Video streams of 160 x 120 were obtained when 200 x 150 and 240 x 180 streams were requested.
³ The Sony DCR-TRV460 camera produces an interlaced video stream at 640 x 480.
⁴ The Creative NX Ultra camera produced slightly letterboxed frames at 160 x 120, 320 x 240, and 640 x 480.
⁵ Video streams of 176 x 132 were obtained when 200 x 150 and 240 x 180 streams were requested.
⁶ A video stream of 352 x 264 was obtained when a 400 x 300 stream was requested.

As a result, the CPU utilization observations for the 200 x 150, 240 x 180, and 400 x 300 resolutions should be interpreted with more caution than those for the resolutions at which all tested cameras provided matched video streams. It is probable that scaling lower-resolution video streams up to the originally requested size in Flash Player contributes somewhat to the overall CPU utilization. Additionally, we had some issues with frame rates. In the case of the Creative NX Ultra, although the camera itself is specified as having a maximum frame rate of 15 fps, Flash was able to request and receive video streams at frame rates of up to 30 fps. We suspect this might be due to inaccurate reporting by the driver or to software-level interpolation; our experimental results do not yield conclusive evidence for either possibility. In the case of the Apple iSight camera, we were able to attain a maximum frame rate of only 15 fps, although the technical specifications state that 30 fps is possible. This was likely due to the use of the generic Windows 1394 Desktop Camera driver, because no manufacturer-supplied driver for Windows was available. Resolution and frame rate testing for the Apple iSight was therefore limited to 15 fps and below for the tests described here, although we were later able to obtain a 30 fps frame rate from the Apple iSight using the Unibrain third-party Fire-i drivers for 1394 IIDC/DCAM cameras, as described in the main body of this white paper. It should also be noted that results for the Creative NX Ultra were significantly noisier than for the other cameras, presumably due to noise from additional USB devices connected to the test machine. Figure 4 presents graphs of our experimental results (lower CPU utilization is better).


[Figure 4 graphs not reproduced. Six panels plot % CPU utilization (0 to 25) against frames per second (0 to 30) for the Apple iSight (1394 IIDC/DCAM), Sony DCR-TRV460 (1394 DV), and Creative NX Ultra (USB); the first panel is titled "Frame Rate vs. CPU Utilization at 160x120".]

Figure 4: Frame rate versus CPU utilization graphs


Figure 5 shows the results of graphing the same data to compare CPU utilization at 15, 24, and 30 fps.


[Figure 5 graphs not reproduced. The surviving panels plot % CPU utilization (0 to 25) against resolution (160x120, 200x150, 240x180, 320x240, 400x300, 640x480) for the Sony DCR-TRV 460 and Creative NX Ultra.]

Figure 5: Resolution versus utilization graphs

ENCODING/DECODING AND CPU UTILIZATION

With an understanding of the effects of various cameras and video stream formats on processor utilization, we next used our FCSDiag tool to analyze the CPU utilization incurred by publishing audio and video to Flash Communication Server, as well as that needed to subscribe to a stream from Flash Communication Server. We tested a broadcast-only configuration (with no local visualization) and a simple loopback configuration, in which the published stream was resubscribed and rendered by the same machine, under several different video setting configurations that have proven to give high-quality results. The loopback case effectively simulates the load on a participant machine in a simple 1-to-1 video conference. The first test was conducted with a configuration that, as we have found through prior work in Flash video conferencing, yields a relatively high-quality experience. Two additional tests were also conducted: the first with the camera bandwidth set to 38,400 and Flash allowed to throttle the video quality dynamically, and the second with the video quality set to 90 and the bandwidth left unspecified (set to zero), as recent experiments have shown that these configurations also yield relatively high-quality results. Table 11 lists the configurations used for each of these tests. In the graphed results that follow, “Publish Only” CPU utilization encompasses the CPU use needed for video acquisition and for publishing the encoded stream to Flash Communication Server, while “Loopback” CPU utilization adds the processor use needed to subscribe to and display the same stream on the test machine.

Table 11: Encoding/Decoding Test Configurations

Setting | Test Configuration A | Test Configuration B | Test Configuration C
Bandwidth | 400,000 | 38,400 | 0
FPS | 24⁷ | 24⁷ | 24⁷
Favor Size | 0 | 0 | 0
Frame Quality | 85 | 0 | 90
Key Frame Interval | 48 | 48 | 48
Camera Width | 320 | 320 | 320
Camera Height | 240 | 240 | 240
Buffer Time | 0.01 | 0.01 | 0.01
Audio Rate | 22 kHz | 22 kHz | 22 kHz

⁷ In practice, this results in an actual frame rate of 15 FPS for the Apple iSight due to driver limitations.

Figure 6 shows the results graphically.


[Figure 6 graphs not reproduced. Three panels, titled "Encoding / Decoding - Test Configuration A", "Encoding / Decoding - Test Configuration B", and "Encoding / Decoding - Test Configuration C", plot % CPU utilization (0 to 30) for Publish Only and Loopback; the categories on the first panel's x-axis (Camera / Flash Player) are Apple iSight - FP7, Sony DCR-TRV460 - FP7, and Creative NX Ultra - FP7.]

Figure 6: Encoding/Decoding Graphs


Our three test configurations yielded comparable CPU utilization despite the differences in settings. Because the Apple iSight camera was operating at only 15 fps in these tests (as described earlier), we believe that the CPU utilization in tests employing it is artificially lowered to some extent. Taking this into account, there do not appear to be substantial differences in the amount of work needed to encode or decode video from the cameras tested. In terms of subjective quality, all tested configurations with the different cameras were quite acceptable, as we had anticipated. Additionally, when combined with our earlier results for CPU utilization during video acquisition with our test cameras, the data from this series of experiments enable us to break down CPU usage into its constituent parts. Applying this to the data for Test Configuration A, we derive the breakdown shown in Figure 7. Results for the other test configurations are similar.

[Graph: Test Configuration A - Loopback Breakdown. % CPU Utilization (0 to 30), divided into Acquisition, Encoding, and Decoding.]

Figure 7: CPU utilization breakdown for Test Configuration A

While encoding and decoding CPU usage differ slightly across the three cases shown, the choice of camera remains the greatest factor affecting total CPU utilization in these simulated simplest-case 1-to-1 video conferencing tests.
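The breakdown in Figure 7 presumably follows from subtracting the measurements from one another: since “Publish Only” covers acquisition plus encoding and “Loopback” adds decoding on top, encoding is the publish-only figure minus the acquisition-only figure, and decoding is the loopback figure minus the publish-only figure. The Python sketch below shows that arithmetic under this assumption; the function name and the sample numbers are hypothetical and are not values read from Figure 7.

```python
from dataclasses import dataclass


@dataclass
class CpuBreakdown:
    """Per-stage CPU utilization, in percent of total CPU."""
    acquisition: float
    encoding: float
    decoding: float


def loopback_breakdown(acquisition: float, publish_only: float, loopback: float) -> CpuBreakdown:
    """Split a loopback measurement into acquisition, encoding, and decoding.

    Assumes publish_only = acquisition + encoding and
    loopback = publish_only + decoding (hypothetical helper).
    """
    if not (0 <= acquisition <= publish_only <= loopback):
        raise ValueError("expected 0 <= acquisition <= publish_only <= loopback")
    return CpuBreakdown(
        acquisition=acquisition,
        encoding=publish_only - acquisition,
        decoding=loopback - publish_only,
    )


# Hypothetical measurements (percent CPU), for illustration only.
parts = loopback_breakdown(acquisition=10.0, publish_only=16.5, loopback=20.0)
print(parts)  # CpuBreakdown(acquisition=10.0, encoding=6.5, decoding=3.5)
```

The ordering check rejects measurement sets where, for example, noise makes the publish-only figure lower than the acquisition-only figure, since the subtraction would then produce a negative stage cost.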

VIDEO SETTINGS

From prior experience with video conferences involving five participants, we had usually set both frame quality and the maximum bandwidth for the camera during testing and used a static video resolution of 320 x 240 pixels at 24 fps (the same as the Flash movie’s frame rate). After significant trial and error, we had arrived at the settings shown in Table 12. These settings yielded the best overall performance with minimal frozen frames and synchronization issues.


Table 12: Initial Video Settings

Bandwidth (bytes/s): 400,000-900,000
FPS: 24
Favor Size: 0
Frame Quality: 60-85
Key Frame Interval: 48
Camera Width: 320
Camera Height: 240
Buffer Time: 0.01
Audio Rate: 22 kHz

In FCSDiag loopback tests employing both video and audio input, the CPU utilization and average latency (time for a NetStream.send call to reach the Flash Communication Server application and return) did not show significant variation within the range of bandwidth and frame quality settings given in Table 12 and were essentially the same as the results obtained in Table 8 for Test Configuration A in the encoding/decoding tests. Subjectively, the video stream appeared very smooth and no frozen frames or problems with audio synchronization were observed. At lower frame quality settings, some pixelization was observed as expected. Table 13 lists typical CPU utilization and average latency obtained using these settings on Flash Player 7.

Table 13: CPU Utilization and Latency for Cameras

Camera | Avg. Latency (ms) | % CPU Utilization
Apple iSight | 150 | 13
Sony DCR-TRV460 | 180 | 21
Creative NX Ultra | 180 | 25

Average latency tends to remain fairly stable, with the loopback signal delayed about 150 to 180 ms from real time once audio data is introduced into the stream. On some occasions, however, latency rises to markedly higher values (about 1,500 ms) for unknown reasons, and the results are unsatisfactory because the received stream lags more than a second behind real time. We have also experimented with setting only the maximum bandwidth or only the frame quality, leaving Flash to manage the other in real time. We were first introduced to this possibility through Giacomo Guilizzoni’s weblog, where he presented a calculator for optimal Flash Communication Server video settings under different conditions.
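The average-latency figures above are means over repeated round-trip samples, and a single excursion of the kind just described distorts such a mean considerably. As a sketch of how the samples might be summarized, the Python below computes the mean round-trip time and flags samples above a threshold; the 1,000 ms threshold, the function name, and the sample values are all assumptions for illustration, not FCSDiag's actual logic.

```python
from statistics import mean


def summarize_latency(samples_ms, spike_threshold_ms=1000.0):
    """Return (mean latency, list of spike samples) for round-trip times in ms.

    Each sample is the time for a message to reach the server and return.
    The threshold is a hypothetical cut-off for "markedly higher" latency.
    """
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    spikes = [s for s in samples_ms if s > spike_threshold_ms]
    return mean(samples_ms), spikes


# Hypothetical samples: steady 150-180 ms with one ~1,500 ms excursion.
avg, spikes = summarize_latency([150, 160, 170, 180, 1500])
# avg == 432, spikes == [1500]
```

One spike pulls the mean from about 165 ms to 432 ms here, which is why reporting spikes separately from the typical latency is useful when judging conference quality.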


Adapting his results to our needs, we conducted a number of tests to quantify the effect of each parameter under these regimes. Our results indicate that these approaches also produce relatively high-quality results when the settings are chosen properly. Because these tests used a different program from the earlier tests, one that measures and graphs bandwidth utilization in real time, the resulting CPU utilization figures are not directly comparable to the data obtained in previous experiments. We used the Creative NX Ultra for these experiments. For our initial battery of tests, we set the bandwidth to 0 and lowered the frame quality from 100 down to 0 with the audio muted (to keep latency relatively constant), under the conditions given in Table 14.

Table 14: Variable Frame Quality Settings

Bandwidth (bytes/s): 0
FPS: 24
Favor Size: 0
Frame Quality: As below
Key Frame Interval: 8
Camera Width: 280
Camera Height: 208
Buffer Time: 0
Audio Rate: 22 kHz (muted)

We obtained the results listed in Table 15, which gives the average bandwidth utilization in bytes per second, selected average CPU utilization figures, and subjective findings for each test.

Table 15: Variable Frame Quality Results

Frame Quality | Bandwidth (bytes/s) | CPU Util. (%) | Subjective Findings
100 | 250,000 | 33 | High-quality picture, marked frame skipping
90 | 68,000 | 29 | High-quality picture, some frame skipping
80 | 36,000 | 30 | High-quality picture, occasional frame skipping
70 | 24,000 | Not measured | Faint pixelization, smooth playback
60 | 19,000 | Not measured | Mild pixelization, smooth playback
50 | 13,000 | Not measured | Medium pixelization, smooth playback
40 | 11,000 | Not measured | Loss of fine detail, smooth playback
30 | 10,000 | Not measured | Moderate loss of detail, smooth playback
20 | 9,000 | Not measured | Severe loss of detail, smooth playback
10 | 8,000 | 27 | Loss of gross detail, smooth playback

Here, CPU utilization drops only slowly as frame quality decreases. High frame quality yielded very high-quality pictures at the cost of frame skipping, whereas lower frame quality yielded smooth playback by sacrificing detail. The sweet spot appears to be a frame quality of about 70 to 80. It is also interesting that at a frame quality of 100 (zero compression, accompanied by exceptionally high bandwidth consumption), CPU utilization is somewhat greater than at lower frame quality settings where the video data is compressed.

Subsequently, we performed another battery of experiments, this time varying the specified bandwidth while keeping the frame quality at 80, with settings otherwise identical to those given in Table 6. Although a frame quality of 80 had produced occasional frame skipping, as shown in Table 15, our previous experience is that this value typically strikes a decent balance between high bandwidth and CPU utilization on one side and low picture quality on the other, so it was chosen for this set of experiments. Table 16 lists the results.

Table 16: Variable Bandwidth Results

Specified Bandwidth (bytes/s) | CPU Util. (%) | Subjective Findings
19,200 | 30 | Smooth, significant pixelization upon movement
38,400 | Not measured | Smooth, some pixelization upon movement
51,200 | Not measured | Occasional frame skips, pixelization on gross movement
76,800 | Not measured | Frequent frame skips, pixelization with extreme movement
128,000 | Not measured | Frequent frame skips, high-quality picture
192,000 | Not measured | Frequent frame skips, high-quality picture
256,000 | Not measured | Very frequent frame skips, high-quality picture
384,000 | 30 | Constant frame skips, high-quality picture

The trade-off is between smooth video playback and greater pixelization upon movement. If the video image stays very still, a high-quality picture can be obtained at practically every specified bandwidth. For a frame quality of 80, the sweet spot is apparently between 38,400 and 51,200 bytes per second, and 38,400 is quite acceptable if momentary pixelization upon a participant’s sudden movement can be tolerated. Such settings also keep bandwidth usage capped relatively low without significantly sacrificing image quality. This is a particular benefit because we expect that keeping bandwidth usage in check becomes increasingly necessary as a video conference scales to more participants. Additionally, several ad hoc tests indicate that a low key-frame interval tends to increase frame skipping, whereas high key-frame intervals, particularly ones higher than the frame rate, decrease frame skipping but lengthen the time the image takes to recover after it has become pixelated due to motion. Although these tests were not repeated with the Apple iSight camera or the Sony DCR-TRV460 camcorder, their results led to the choice of Test Configurations B and C in the encoding/decoding tests described earlier, which replicate a subset of these batteries for the two additional cameras.
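The frame-quality results in Table 15 can be read as a lookup: given a per-stream bandwidth budget, pick the highest frame quality whose measured average bandwidth stays within it. The Python sketch below encodes the Table 15 measurements (Creative NX Ultra at 280 x 208, 24 fps, bandwidth set to 0) and applies that rule; the selection rule and the helper names are our illustrative assumptions, and the numbers will not transfer directly to other cameras or resolutions. Flash bandwidth figures are in bytes per second, so a conversion to kilobits per second is included for comparison with network link speeds.

```python
from typing import Optional

# Average bandwidth (bytes per second) measured for each frame quality in Table 15.
TABLE_15_BYTES_PER_SEC = {
    100: 250_000,
    90: 68_000,
    80: 36_000,
    70: 24_000,
    60: 19_000,
    50: 13_000,
    40: 11_000,
    30: 10_000,
    20: 9_000,
    10: 8_000,
}


def highest_quality_within(budget_bytes_per_sec: int) -> Optional[int]:
    """Return the highest tested frame quality whose measured bandwidth fits the budget."""
    fitting = [q for q, bw in TABLE_15_BYTES_PER_SEC.items() if bw <= budget_bytes_per_sec]
    return max(fitting) if fitting else None


def to_kbit_per_sec(bytes_per_sec: int) -> float:
    """Convert a bandwidth figure in bytes/s to kilobits/s (1 kbit = 1,000 bits)."""
    return bytes_per_sec * 8 / 1000


print(highest_quality_within(38_400))  # 80
print(to_kbit_per_sec(38_400))         # 307.2
```

With the 38,400 bytes-per-second cap discussed above, the lookup lands on a frame quality of 80, consistent with the settings chosen for the variable-bandwidth battery.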


SCALING

The other major goal of our research was to determine the feasibility and extensibility of Flash video conferencing for larger conferences, scaling up to 10 simultaneous participants. To do this, we conducted a number of tests using our FCSDiag suite of test applications. Because of both screen-size and network-bandwidth constraints, we primarily examined a resolution of 160 x 120 for each participant’s video stream. The principal considerations in finding optimal settings for a 10-participant conference are keeping CPU utilization relatively low, since each machine must encode its own stream as well as decode the incoming streams from every other participant, and minimizing network bandwidth utilization, since the aggregate bandwidth relayed through the server grows with the square of the number of participants (each of n participants receives n - 1 streams) rather than linearly.

Some of the initial scaling tests documented in the following tables were performed before we determined that the Apple iSight camera significantly reduces the CPU overhead of video acquisition. Our initial tests used the Creative NX Ultra camera with relatively naïve video settings and gave marginally acceptable results. Significantly better results were obtained in tests with the Apple iSight camera that incorporated refinements to the video configuration learned through testing. Our efforts to determine optimal configurations for scaling video conferences to 10 participants are described below.

All tests were conducted with the test machine publishing its own stream and subscribing to and displaying n streams (n varying between 1 and 10) with identical video settings, broadcast from a second participant machine through Flash Communication Server. This effectively simulates the load on a participant machine in a conference with n + 1 participants, where the participant machine is not monitoring a loopback stream.
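To make the scaling argument concrete: if every participant publishes one stream at a capped bandwidth and subscribes to the streams of all the others, each client uploads one stream and downloads n - 1, while the server receives n streams and relays n(n - 1) copies. The Python sketch below computes these figures for the 38,400 bytes-per-second cap used in Test 1; the helper name is hypothetical, and it assumes every stream runs at its cap and ignores audio and protocol overhead, so real numbers will differ.

```python
def conference_bandwidth(participants: int, stream_bytes_per_sec: int) -> dict:
    """Estimate bandwidth (bytes/s) for a full-mesh conference relayed through one server.

    Assumes each participant publishes one stream at the cap and subscribes
    to every other participant's stream (no loopback), ignoring audio and
    protocol overhead.
    """
    if participants < 2:
        raise ValueError("a conference needs at least two participants")
    others = participants - 1
    return {
        "client_upload": stream_bytes_per_sec,
        "client_download": others * stream_bytes_per_sec,
        "server_inbound": participants * stream_bytes_per_sec,
        "server_outbound": participants * others * stream_bytes_per_sec,
    }


for n in (2, 5, 11):
    est = conference_bandwidth(n, 38_400)
    print(n, est["client_download"], est["server_outbound"])
# n = 11 (the largest Test 1 case, 10 subscribed streams):
# client_download 384000, server_outbound 4224000
```

Per-client download grows linearly with the number of participants, while the server's outbound load grows quadratically, which is why capping each stream's bandwidth matters more as the conference grows.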
A second participant machine was used to provide the streams to be subscribed on the test machine as this allowed us to focus the second machine’s camera (Logitech QuickCam Orbit) on ambient street traffic outside our facility. With large numbers of video feeds, it was significantly easier to assess frame skipping when imaging steadily moving vehicles rather than facial movements. Audio data was collected and published by both machines from ambient sound in the room. Our initial test (Test 1) was conducted with the configuration shown in Table 17, chosen to sacrifice video quality momentarily if necessary to contain bandwidth usage to reasonable limits.


Table 17: Test 1 Configuration

Bandwidth (bytes/s): 38,400
FPS: 24
Favor Size: 0
Frame Quality: 0
Key Frame Interval: 8
Camera Width: 160
Camera Height: 120
Buffer Time: 0
Audio Rate: 22 kHz

While previous testing using the same configuration regime had produced quality results at substantially higher resolution, such results did not scale well with additional streams. Subjectively, while video quality was always high, smooth playback was only observed up to the case with two subscribed streams (n = 2), beyond which increasing numbers of skipped frames and eventually skipped audio were observed. The CPU utilization for this test set was also assessed and is given in Figure 8.

[Graph: CPU Utilization vs. Subscribed Streams, Test 1. CPU Utilization (%) from 0 to 60 plotted against Subscribed Streams (N) from 1 to 10; plotted values are 14, 20, 20, 25, 29, 34, 40, 42, 45, and 50.]

Figure 8: Test 1 results

Table 18 provides a breakdown of the subjective observations at each step.

Table 18: Test 1 Results

Streams (N) | CPU Util. (%) | Subjective Findings
1 | 14 | Good picture quality, smooth playback
2 | 20 | Good picture quality, smooth playback
3 | 20 | Good picture quality, rare frame skips
4 | 25 | Good picture quality, rare frame skips
5 | 29 | Good picture quality, rare frame skips
6 | 34 | Good picture quality, occasional audio and frame skips
7 | 40 | Good picture quality, occasional audio and frame skips
8 | 42 | Good picture quality, occasional audio and frame skips
9 | 45 | Good picture quality, occasional audio and frame skips
10 | 50 | Good picture quality, occasional audio and frame skips
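A least-squares line through the Table 18 data gives a rough per-stream CPU cost for Test 1. The Python sketch below fits it; treating the growth as linear is a simplifying assumption on our part, and the resulting slope is only a summary of this one data set, not a prediction for other settings or machines.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Test 1 data from Table 18: subscribed streams N and % CPU utilization.
streams = list(range(1, 11))
cpu = [14, 20, 20, 25, 29, 34, 40, 42, 45, 50]
slope, intercept = least_squares_line(streams, cpu)
print(f"{slope:.2f}% per stream, {intercept:.1f}% baseline")
# 3.99% per stream, 9.9% baseline
```

Read this way, each additional 160 x 120 subscribed stream cost roughly 4% CPU on the test machine in Test 1.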

We also found it convenient at this point to test whether the additional network load of video streams published at a higher resolution than necessary, together with the need to scale those streams down for display on subscribing machines, would hurt performance (as one might expect), and if so, by how much. To do so, we repeated Test 1 with the second machine publishing its stream at 320 x 240 instead of 160 x 120, so that the test machine subscribed to n streams at a higher resolution than necessary and had to scale them down to 160 x 120 for display (Test 2). This resulted in significantly higher CPU utilization, and while video quality remained high throughout (as in the previous test), frame and audio skipping were even worse. Clearly, publishing streams at higher resolutions than their subscribers need has significant negative consequences for the overall video conferencing experience. Table 19 lists the data obtained from this test.

Table 19: Test 2 Results

Streams (N) | CPU Util. (%) | Subjective Findings
1 | 15 | Good picture quality, smooth playback
2 | 27 | Good picture quality, rare frame skips
3 | 37 | Good picture quality, rare frame skips
4 | 47 | Good picture quality, occasional frame skips
5 | 55 | Good picture quality, occasional audio and frame skips
6 | 59 | Good picture quality, occasional audio and frame skips
7 | 61 | Good picture quality, occasional audio and frame skips
8 | 66 | Good picture quality, persistent audio and frame skips
9 | 68 | Good picture quality, persistent audio and frame skips
10 | 69 | Good picture quality, severe audio and frame skips

We also wanted to determine the optimal stream resolution with respect to processor usage as video conferencing applications scale to more participants. We therefore used matched publishing and display resolutions and measured CPU utilization as a function of both the number of subscribed incoming streams and the total pixels per stream (for example, 160 x 120 is 19,200 pixels). CPU utilization was measured and averaged on a test machine over 60 to 90 seconds while subscribed to 4, 6, or 8 simulated video conferencing streams at various 4:3 resolutions, using the settings in Table 20.

Table 20: CPU Utilization versus Resolution Basic Settings

Bandwidth (bytes/s): 38,400
FPS: 24
Favor Size: 0
Frame Quality: 0
Key Frame Interval: 48
Buffer Time: 0
Audio Rate: 44 kHz

This series of experiments yielded the results in Table 21 for our range of tested stream counts and resolutions, with pixels given as total pixels per stream. This data is also presented graphically in Figure 3 in the main body of this paper.

Table 21: Streams/Pixels versus CPU Utilization

Pixels per stream | % CPU, 4 streams | % CPU, 6 streams | % CPU, 8 streams
7,500 | 12.8 | - | -
10,800 | 15.1 | - | 31.6
14,575 | 23.4 | - | -
14,700 | - | - | 40.6
19,200 | 22.7 | 29.1 | 38.3
24,300 | 20.2 | - | 39.6
30,000 | 21.5 | 36 | 40
36,300 | 22.8 | 36 | -
43,200 | 36.8 | 43.6 | -
50,700 | 38.5 | 40.2 | -
58,800 | 37.2 | 39 | -
67,500 | 36.4 | 40.8 | -
76,800 | 37.4 | 40.1 | 50.4
86,700 | 37.9 | - | -
97,200 | 36.9 | - | -
108,300 | 37.9 | - | -

We chose primarily to measure cases in which stream resolution could be balanced against total CPU utilization, so that individual streams still offered reasonable size and picture quality while total CPU utilization stayed relatively low (for example, under 50%). The commonly supported resolutions of 160 x 120 and 320 x 240 correspond to 19,200 and 76,800 total pixels, respectively. Of particular interest is the appearance of shelving: CPU utilization tends to remain fairly constant across substantial changes in resolution, with marked jumps between certain relatively small changes in resolution. We suspect, but cannot conclusively confirm, that this is an artifact of the encoding algorithm Flash uses when compressing video to the various tested output resolutions.
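The pixel counts in Table 21 correspond to 4:3 frame sizes. The Python sketch below enumerates such sizes, assuming widths in 20-pixel steps; that step size is our assumption, but it reproduces every tabulated pixel count except 14,575 (for example, 160 x 120 = 19,200 and 320 x 240 = 76,800). The function name is hypothetical.

```python
def four_by_three_sizes(min_width: int, max_width: int, step: int = 20):
    """Yield (width, height, total_pixels) for 4:3 frames between the given widths."""
    for width in range(min_width, max_width + 1, step):
        if width % 4:
            continue  # height = 3/4 * width must be a whole number of pixels
        height = width * 3 // 4
        yield width, height, width * height


for w, h, px in four_by_three_sizes(100, 380):
    print(f"{w} x {h} = {px:,} pixels")
```

Running it from 100 to 380 pixels wide lists the fifteen frame sizes from 100 x 75 (7,500 pixels) to 380 x 285 (108,300 pixels).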


APPENDIX C: WHERE TO DOWNLOAD TEST FILES

Our suite of Flash Communication Server application testing tools, FCSDiag, which we used to perform the tests given in this white paper, can be obtained from our website at the following URL:

http://www.architekture.com/whitepapers/fcs_diag.zip

Feedback, bug reports, and suggestions for improving these applications are particularly welcome. Please send any e-mails to Allen Ellison at [email protected].


APPENDIX D: IIDC/DCAM CAMERA LIST

Model | Manufacturer | X Resolution | Y Resolution | FPS | Optics | IIDC | Price
Marlin F-033C | Allied Vision Tech. | 656 | 494 | 74 | C/CS | v1.30 | $990.00
Marlin F-046C | Allied Vision Tech. | 780 | 582 | 53 | C/CS | v1.30 | $1,090.00
Marlin F-131C | Allied Vision Tech. | 1280 | 1024 | 25 | C/CS | v1.30 | $990.00
CF-2000 | AME Optimedia Tech. | 640 | 480 | 30 | 4mm | v1.04 | -
MOTIONeer | AOS Tech. | 1280 | 1024 | 500 | - | - | -
C102T | Aplux | 640 | 480 | 30 | 6mm | v1.04 | -
iSight | Apple | 640 | 480 | 30 | fixed AF | v1.30 | $145.00
A301f | Basler | 658 | 494 | 75 | F/C | yes | $1,460.00
A302f | Basler | 782 | 582 | 30 | F/C | yes | $1,510.00
A601f | Basler | 659 | 493 | 60 | C | v1.30 | $999.00
A602f | Basler | 659 | 493 | 100 | C | v1.30 | $1,475.00
EOS-1Ds | Canon | 4074 | 2704 | 3 | EF | No | $8,000.00
CCi4-1394 | C-Cam Techn. | 1280 | 1024 | 7.5 | C | No | $1,300.00
BCi4 | C-Cam Techn. | 1280 | 1024 | - | C/F | - | -
BCi5 | C-Cam Techn. | 1280 | 1024 | 27.5 | C | - | $1,500.00
CChsl1300 | C-Cam Techn. | 1280 | 1024 | 480 | C/F | - | -
MicroPix 640 | CCDDirect.com | 640 | 480 | 30 | C/CS | v1.30 | $1,100.00
1200 HS | Cooke | 1280 | 1024 | 625 | C | - | -
iSweet | Cool Stream | 640 | 480 | 30 | N | yes | -
EX6620 | Exsys | 640 | 480 | 30 | 4mm | v1.04 | $200.00
DBK21F04 | The Imaging Source | 640 | 480 | 30 | C/CS | v1.04 | $650.00
DFK21F04 | The Imaging Source | 640 | 480 | 30 | C/CS | v1.04 | $590.00
DBM21F04 | The Imaging Source | 640 | 480 | 30 | C/CS | v1.04 | $620.00
DFM21F04 | The Imaging Source | 640 | 480 | 30 | C/CS | v1.04 | $560.00
DBM21F04-ML | The Imaging Source | 640 | 480 | 30 | C/CS | v1.04 | $610.00
DFM21F04-ML | The Imaging Source | 640 | 480 | 30 | C/CS | v1.04 | $550.00
IFWC-V400 | IMI Tech | 640 | 480 | 30 | C | yes | -
IFWC-V400/Z | IMI Tech | 640 | 480 | 30 | C | yes | -
IFWC-V400/T | IMI Tech | 640 | 480 | 30 | C | yes | -
1394 KD | iRez | 640 | 480 | 30 | 6mm | v1.04 | $150.00
LW-1.3 | ISG | 1280 | 1024 | 27 | C | v1.30 | -
LW-ELIS-1024A | ISG | 1024 | 1 | 10,000 | C | v1.30 | -
LW-SLIS-2048A | ISG | 2048 | 1 | 30,000 | C | v1.30 | -
FireView | Metacontrols | 680 | 480 | 30 | C | yes | $999.00
CamRecord 500 | Optronis | 1280 | 1024 | 500 | - | - | -
PCO 1200hs | PCO Imaging | 1280 | 1024 | 625 | - | - | -
Ultima 1024 | Photron | 1024 | 1024 | 500 | C/CS/Block | - | -
Ultima 512 | Photron | 512 | 512 | 2,000 | C/Block | - | -
Ultima APX | Photron | 1024 | 1024 | 2,000 | C,F | - | -
FireFly2 | Point Grey | 640 | 480 | 30 | 4, 6 or 8mm | v1.04 | $160.00
Bumblebee | Point Grey | 640 | 480 | 30 (x2) | 4, 6 or 8mm | v1.30 | $2,900.00
Bumblebee Highres | Point Grey | 1024 | 768 | 15 (x2) | 4, 6 or 8mm | v1.30 | $2,900.00
DragonFly | Point Grey | 640 | 480 | 30 | C/CS, 4, 6 or 8mm | v1.30 | $695.00
DragonFly Highres | Point Grey | 1024 | 768 | - | C/CS, 4, 6 or 8mm | v1.30 | $895.00
Scorpion | Point Grey | 640 | 480 | 30 | C/CS | v1.30 | -
CV1280C | Prosilica | 1280 | 1024 | 30 | C | v1.30 | $2,200.00
CV1280FC | Prosilica | 1280 | 1024 | 29 | C | v1.30 | $2,900.00
PMI-4201 | Q-Imaging | 2032 | 2044 | 2 | F | No | -
Osiris | SAC | 640 | 480 | 30 | C | yes | -
Osiris II | SAC | 640 | 480 | 30 | C | yes | -
Osiris III | SAC | 640 | 480 | 30 | C | yes | -
Osiris IV | SAC | 1298 | 1040 | - | C | v1.30 | -
DFW-V300 | Sony | 640 | 480 | 30 | C | v1.04 | $1,000.00
DFW-V500 | Sony | 640 | 480 | 30 | C | v1.20 | $1,100.00
DFW-VL500 | Sony | 640 | 480 | 30 | 5.4-64.8mm | v1.20 | $1,150.00
Firetail | Symagery | 1280 | 1024 | - | C | - | -
TSB15LV01EVM | TI | 640 | 480 | 30 | fixed | v1.04 | -
Fire-i | Unibrain | 640 | 480 | 30 | 6mm | v1.04 | $90.00
Fire-i Board | Unibrain | 640 | 480 | 30 | 2.1, 4.5, 12mm | v1.04 | -
Fire-i400 | Unibrain | 640 | 480 | 30 | C | yes | $450.00
CCD-1300F | VDS-Vosskuhler | 1280 | 1024 | 25 | C | yes | -
Ultracam3 | Video Scope Intl. | 512 | 512 | 30 | - | - | -
DCAM | Videre Design | 640 | 480 | 30 | - | yes | -
DCAM-L | Videre Design | 640 | 480 | 30 | CS | yes | -
Phantom v4.1 | Visible Solutions | 512 | 512 | 1000 | C/F/... | - | -
Phantom v5 | Visible Solutions | 1024 | 1024 | 1000 | F | - | -
Phantom v6 | Visible Solutions | 512 | 512 | 1000 | C | - | $52,000.00
Phantom v7 | Visible Solutions | 800 | 600 | 4800 | C | - | -
Phantom v9 | Visible Solutions | 1632 | 1200 | 1019 | F | - | -
