
Virtual Theremin Design Project Nicholas Dargus, Daniel DeAraujo and Jeremy Gillard

ECE496Y Design Project Course - Final Report

Title: Virtual Theremin using IEEE1394

Project I.D. # 2002105

Prepared by: Nicholas Dargus, Daniel DeAraujo and Jeremy Gillard

Supervisor: Prof. James MacLean

Section #: 5

Section Coordinator:

Phil Anderson

Date: Friday, April 11, 2003

The Edward S. Rogers Sr. Dept of Electrical and Computer Engineering University of Toronto

Executive Summary

This report outlines the design of the Virtual Theremin project, which makes use of

computer vision techniques. In this project we explore the capabilities of tracking a

Theremin performer in real time using an IEEE1394-based camera. Development of this

project was done in the C/C++ programming language, under a Linux environment with

IEEE1394 support.

The main goals of our project were: to detect a performer’s hands in real time, to apply

algorithms to track a performer’s hands in real time, and to emulate the sound of a physical

Theremin instrument. Although there were several changes in design, the project was

completed as scheduled and was a complete success.

The design project was broken down, accordingly, into three main components: image

acquisition, image processing and tracking, and sound generation. This report

concentrates on the methods and materials used to complete these components, which are

discussed in detail in the body of the report.


Page 2 of 43

Nicholas Dargus’ Contributions

Nicholas contributed to several different aspects of the design project. He worked on the

initial computer system setup and later moved to working on the main graphical user

interface (GUI) that would serve as the front end for the Virtual Theremin.

Nicholas’ first responsibility was the basic initialization and setup of a computer with a

Linux-based operating system. He also needed to perform the IEEE1394 Linux driver

installation on this system, along with all of the necessary fixes to ensure its proper

operation. After the system setup was complete, Nicholas was originally scheduled to set

up a GUI that would control the IEEE1394 camera and run the Virtual Theremin. It was

later decided that this would be too complex a task, and that modifying an existing

interface would be a more efficient use of time.

Making use of the user-interface code builder, Glade, he modified the Coriander GUI to

add the extra functionality to allow the Virtual Theremin to operate correctly. He also

created the necessary interfaces for Daniel’s tracking modules and Jeremy’s sound

generation module.

The following sections were written by Nicholas, with contributions and suggestions

from Daniel and Jeremy: Introduction, Design (4.1, and 4.2), Appendix A and Appendix

B.



Daniel DeAraujo’s Contributions

Daniel was responsible for the skin detection and hand tracking algorithms used in the

Virtual Theremin. His module would take an image provided by Nicholas’s module in

YUV format, search it for skin colored pixels, determine the center of mass of the hands,

and return the (x,y) co-ordinates of each hand to Jeremy’s sound generation module.

Daniel implemented his initial algorithms using Octave. This code consisted of a RGB to

YUV converter, a skin detection filter, a blob-growing algorithm, a full-image hand

locator, and a simple heuristic hand locator. He also implemented a script that was used

to extract U-V information from images consisting only of skin colored pixels which was

used to determine the approximate region in U-V space that skin tone occupies.

Once the Octave simulations were completed, Daniel converted each of his algorithms

into C in the quickest and most direct fashion possible, to ensure that the algorithms

functioned in C before beginning the optimization step. Once the code was functional in

C, the algorithms were tested and integrated into Nicholas’s module. The integrated

package was tested and fixed, while Daniel optimized the hand detection functions for

maximum speed. The optimized code can be found in Appendix C.

The following sections were written by Daniel, with contributions and suggestions from

Nicholas and Jeremy: Materials and Programming Methods, Design (4.3) and

Conclusion.



Jeremy Gillard’s Contributions

Jeremy contributed to the design project in multiple areas. His first responsibility was to

perform research on the Theremin instrument to determine how it functions and how it is

played so that an accurate conversion model could be created relating hand position to

sound output.

He researched the Linux sound environment so that the sound card would be correctly

enabled and available as an output device to the system. Considerable research also went

into determining what sound APIs were available, and how they could be used to access

the soundcard.

Afterwards, a function was created that would determine which digital samples had to be

written to the soundcard to produce the desired Theremin sound. He then programmed an

interactive software application that mimicked the Theremin output using the keyboard.

This software was used to adjust and fine tune the sound output function.

Next, a conversion algorithm was created that would map hand positions on the screen to

a specific amplitude and frequency the Theremin would produce based on those

positions. To thoroughly test this algorithm, a mouse-driven application was created. This

application gave real-time feedback on how the system would respond to

changing hand positions, allowing for more accurate mapping between hand position and

sound output.

The following sections were written by Jeremy, with contributions and suggestions from

Daniel and Nicholas: Introduction and Design (4.4)



Acknowledgements

The authors of this report would like to take the opportunity to thank the silent member of

the design team, our project supervisor Professor W. James MacLean. He has been an

outstanding supervisor. Not only is he an immense source of knowledge, but he is also

very friendly and a pleasure to work for.

Professor MacLean kept the design team on track by holding weekly meetings and was

always available for consultation. He guided us throughout the duration of the project

and made the work fun. Sincerest thanks to James MacLean from the design team.

In addition, the authors would also like to acknowledge that many of the concepts and

material used in our design originated from the publications listed in our references at the

end of this report. Without them, this project would not have been possible.



Table of Contents

CONTRIBUTIONS
SECTION 1: ACKNOWLEDGEMENTS
SECTION 2: INTRODUCTION
    2.1 Project Background
        2.1.1 The Theremin
        2.1.2 YUV Color Space
        2.1.3 Octave
        2.1.4 IEEE1394
    2.2 Motivation
    2.3 Project Objectives
    2.4 Report Outline
    2.5 Literature Review
        2.5.1 Image Acquisition
        2.5.2 Image Processing and Tracking
        2.5.3 Sound Generation
        2.5.4 Programming References
SECTION 3: Materials and Programming Methods
    3.1 Materials Used
        3.1.1 Hardware
        3.1.2 Software
    3.2 Programming Techniques
        3.2.1 Freestore
        3.2.2 Concurrent Code Development
        3.2.3 Incremental Software Development
SECTION 4: Design
    4.1 Testing the IEEE1394 Bus
    4.2 Graphical User Interface Implementation and Image Acquisition
        4.2.1 Toggle Buttons
        4.2.2 Spin Boxes
    4.3 Image Processing
        4.3.1 Skin Detection
        4.3.2 Hand Tracking
        4.3.3 Development Details
    4.4 Sound Card Support in Linux
        4.4.1 Interactive Program for sound emulation using keyboard support
        4.4.2 Driven Application for Sound Mapping Testing
SECTION 5: CONCLUSIONS
    5.1 Further Work
    5.2 Possible Applications
SECTION 6: REFERENCES
APPENDIX A: COMPUTER SYSTEM AND SOFTWARE VERSIONS
APPENDIX B: SETUP AND INSTALLATION OF LINUX
APPENDIX C: SOURCE CODE



Section 2: Introduction

The purpose of our project was to create a Virtual Theremin musical instrument using

computer vision techniques under a Linux-based environment. The project was broken

down into three main components: image acquisition, image processing and target

tracking, and sound generation, each of which was created as an individual module for the

main program.

2.1 Project Background

2.1.1 The Theremin

The Theremin musical instrument was invented by

physicist Lev Sergeivitch Termen (anglicized to Leon

Theremin) in 1919 while working for the Russian

government on alarm devices. One of the alarm devices he

created caused a whistling noise which changed in a

predictable way as a body approached. He found

that he could play out melodies by moving his hand in

discrete amounts in front of the alarm. This gave him the

inspiration to create a musical instrument that could be played without physical contact.

Theremin was granted a US patent on February 28, 1928 for the "Thereminvox", as it

was called at the time.

Leon Theremin playing his creation. [8]

The Theremin is based on the theory of beat frequencies. When a note is played

slightly out of tune relative to a reference note, there is a recognizable

pulse until both notes are brought to the same frequency. As these notes get further away

in frequency, beats are produced at a faster rate. The Theremin uses an electronic

oscillator to create a stable high frequency reference tone. Another electronic oscillator

which is initially in tune with the first oscillator is controlled by a capacitive sensing

antenna. The difference between the two pitches created by the oscillators is in the

audible range, and is amplified. Ideally, the Theremin would produce a perfect sine

wave as its output, but due to coupling of its oscillators, it produces an asymmetrically

slewed sine wave.
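The beat-frequency principle described above can be sketched in a few lines of C. This is only an illustration of the arithmetic: the function name and the oscillator frequencies in the usage note are assumptions made for the example, not values taken from a real instrument.

```c
/* Audible pitch of a heterodyne oscillator pair: the absolute difference
 * between the fixed reference oscillator and the hand-controlled variable
 * oscillator. Both arguments are in Hz. (Hypothetical helper.) */
double beat_frequency_hz(double reference_hz, double variable_hz)
{
    double difference = reference_hz - variable_hz;
    return difference < 0.0 ? -difference : difference;
}
```

For example, with an assumed 170 kHz reference and the variable oscillator pulled down to 169.56 kHz by the performer's hand, the function returns 440 Hz. When both oscillators are in tune the result is 0 Hz, which corresponds to silence.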

The capacitance of the antenna is changed by

moving a hand towards the antenna or away

from it. This allows one to alter the pitch

produced by the instrument. A second antenna

is used to control the volume. Changing the

capacitance of the set of antennae with your

hands allows you to play the Theremin musical

instrument.

Figure 2.1: Image of a Theremin

2.1.2 YUV Color Space

The YUV color model is an alternate way of representing a standard Red, Green, Blue

(RGB) image. It was originally devised as a method of transmitting a full-color signal

from a television station while maintaining compatibility with existing black and white

television sets.

A YUV image is composed of a Luminance channel (Y) which contains information

about the intensity of the pixels in the image and two Chrominance channels (U and V)

which contain all the necessary color information required for accurate reconstruction of

the image. The Y channel is the grayscale image that would be constructed from the

original RGB image. The U channel is created by subtracting the Y channel from the

original Blue channel (B − Y), and the V channel is constructed in a similar fashion by

subtracting the Y channel from the Red channel (R − Y). In practice both differences are also scaled.
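A minimal sketch of the conversion follows, assuming the classic analog YUV definition in which U is the scaled blue difference and V is the scaled red difference. The type and function names are hypothetical, inputs are assumed to be normalized to the range 0 to 1, and the camera's own digital encoding may use different scaling and offsets.

```c
/* One pixel in YUV form. (Hypothetical type for illustration.) */
typedef struct {
    double y;  /* luminance */
    double u;  /* scaled blue difference, B - Y */
    double v;  /* scaled red difference, R - Y */
} yuv_t;

/* Convert normalized RGB (each component in 0..1) to YUV using the
 * standard luma weights and the usual analog scale factors. */
yuv_t rgb_to_yuv(double r, double g, double b)
{
    yuv_t out;
    out.y = 0.299 * r + 0.587 * g + 0.114 * b;
    out.u = 0.492 * (b - out.y);
    out.v = 0.877 * (r - out.y);
    return out;
}
```

A grey pixel (equal R, G and B) yields U and V of approximately zero, which is why the Y channel alone is sufficient for a black and white display.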

We use the YUV representation of colors instead of HSI, as previously outlined in our

proposal [13], or other similar color spaces because our ADS Pyro camera can directly

capture images in YUV format, which negates the need for a costly conversion stage.

2.1.3 Octave

(From [9]) GNU Octave is a high-level language, primarily intended for numerical

computations. It provides a convenient command line interface for solving linear and

nonlinear problems numerically, and for performing other numerical experiments using a

language that is mostly compatible with (Mathworks’) Matlab. GNU Octave is also freely

redistributable software.

Octave was used as the development platform for the initial skin detection and hand

tracking algorithms because it has a simpler language and syntax compared to C, and

provides easy access to all the data being manipulated. However, it is much too slow to be

considered as a platform for a real-time system and was therefore only used to test the

correctness of the above algorithms.



2.1.4 IEEE1394

IEEE1394 is a very fast external bus standard that supports high data

transfer rates [26]. Currently there are two versions available, 1394a which has a transfer

rate of 400Mbps and 1394b, which has a transfer rate of 800Mbps.

A single IEEE 1394 port can be used to connect up to 63 external devices. In addition to

its high speed, IEEE 1394 also supports isochronous data transfers. Isochronous data

transfers are time dependent and refer to processes where data must be delivered within

certain time constraints without corruption. This mechanism is ideal for our project

because we are tracking in real-time and need to transfer large amounts of data from our

camera to the computer.
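To give a feel for the data rates involved, the sketch below estimates the raw bandwidth of an uncompressed video stream. The function name is hypothetical, and the 640 x 480 resolution, 16 bits per pixel (YUV 4:2:2) and 30 frames per second used in the usage note are illustrative assumptions, since the camera mode is not stated at this point in the report.

```c
/* Raw bandwidth of an uncompressed video stream in megabits per second
 * (1 Mbps = 1,000,000 bits per second). Protocol overhead is ignored. */
double video_mbps(int width, int height, int bits_per_pixel,
                  int frames_per_second)
{
    return (double)width * height * bits_per_pixel * frames_per_second / 1e6;
}
```

Under those assumed values, `video_mbps(640, 480, 16, 30)` comes to roughly 147 Mbps, which fits within the 400Mbps offered by 1394a.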

2.2 Motivation

In the past when performing with a Theremin instrument, the performer had to move his

or her hands in and out of several electro-magnetic fields in order to create music. The

positions of their hands, with reference to the Theremin, are what generate music. Using

knowledge gained from the study of Theremins, our design group wanted to be able to

create a Virtual Theremin that made use of real-time object tracking.

Our Virtual Theremin will work by tracking a performer’s hands in real time and then,

depending on the locations of the hands, create the sound a Theremin would produce.

Several problems arise from this venture that must be made known. With a physical

Theremin, the music is created by the performer’s hands changing the capacitance of a

set of antennae. Our Virtual Theremin will not use antennae. Instead we will



track the performer’s hands using our camera and treat the hands as if they were actually

causing changes in a real Theremin. We will also have to determine how the different

hand locations affect the sounds that should be created. The time delay from when the

performer’s hands move, to when the Virtual Theremin outputs a new sound will have to

be small so as not to be noticeable to the performer. In addition, any user feedback video

should not have any perceivable time delay between hand movement and sound output.

2.3 Project Objectives

Our primary project objective was to create a Virtual Theremin instrument that closely

mimics the ideal Theremin. This includes tracking players’ hands in real time and

translating their position relative to a set of Virtual antennae to allow for an appropriate

sound response from the system.

In addition, we wanted to create an interface for the system that gives Theremin players

feedback as to how the system is tracking and allows for control over system parameters.

This includes:

• Adjustments to account for different lighting conditions.

• Controls to set the video feed parameters.

• Toggles to switch between multiple user feedback modes including hand location

markers and skin highlighting.

2.4 Report Outline

This report explains the development of the Virtual Theremin system. Section 2 gives

the introduction and motivation for the Virtual Theremin project. Section 3 describes the

materials and programming methods used. System development and testing are discussed

in Section 4, which is divided into the separate system modules: image acquisition and

GUI development, image processing and tracking, and sound generation.

Finally, Section 5 concludes the report and discusses future work and other possible

applications that this system could be used for, given further development.

2.5 Literature Review

As mentioned previously our project was broken down into three main components:

image acquisition, image processing and tracking, and sound generation, each of which

was thoroughly investigated. The following works comprise our research into each area.

2.5.1 Image Acquisition

The Embedded Systems Programming: Fundamentals of FireWire webpage contains an

overview of how the IEEE1394 protocol operates and is hosted by CMP Media LLC [1].

It discusses the protocol’s topology, data transfer and transaction processes, protocol

layers and configuration. It also goes into details about how the bus is managed and the

methods used to provide an easy-to-use, low-cost, high-speed connection. This website

also provides technical diagrams and links to other IEEE1394 references and

documentation.

The 1394 Trade Association is incorporated as a non-profit trade organization founded to

support the development of computer and consumer electronics systems that are easily

connected with each other via a single serial multimedia link. The association’s webpage

provides information on IEEE1394 technology and development [2]. A short history of

the IEEE1394 technology is provided along with the benefits and future of using such a

technology. Links to other IEEE1394 documentation are also available.

The Linux1394.org webpage is run by Dan Dennedy, a maintainer of several Linux

libraries and applications [3]. The site is devoted to providing 1394 hardware access

under the Linux operating system and contains information and frequently asked

questions on how to get started using IEEE1394. Also available on the website, in the

form of downloads, are several software packages and C libraries that allow direct access

to IEEE1394 devices. The software packages and C libraries come with their own

documentation on how to access data coming through the IEEE1394 port. In our project

we used several of these libraries to communicate with the Pyro camera in a Linux

environment. Links to other web-based references and data archives were also available

through this webpage.

2.5.2 Image Processing and Tracking

In Digital Image Processing by Pratt, and the similarly titled book by Gonzales, we found

in-depth information on many image processing concepts useful to our project, such as

filtering, image segmentation, pattern recognition and color space conversions. Image

segmentation methods found in both books were used to detect the location of the hands

in space and the accuracy will be improved with an application of basic filters.

In Pratt, there is detailed information and algorithms for transforming images from Red

Green Blue (RGB) space to hue, saturation and intensity (HSI) space [4]. Since we



originally planned to be using the HSI space to look for skin tone colours, the

transformation algorithms available in this book were extremely important.

The decision to switch from HSI to YUV required us to find documentation relating to

this new color space. Joe Maller’s webpage [17] discussed the history of YUV and its

use in various filters used by the Final Cut Pro video editing software package. He also

provides the formulae necessary to convert an RGB value into the YUV color space.

The Naked People Skin Filter site has information on the different ranges of skin color

that can easily be searched for in an HSI and YUV domain [5]. This information was

relevant to our project because it will allow us to locate the different areas of skin-tone,

such as the hands and head of the performer, and make our program more efficient when

trying to locate the hands.

Bare-Hand Human-Computer Interaction describes one possible way to track movement

by processing each frame in its entirety, as our prototype will do [24]. This is a simple

yet effective way to implement basic tracking functionality. It is robust since the tracking

algorithm can never really “lose” the object it is tracking, since it is performing a full

search every time. The document also establishes that the maximum acceptable latency

is 50ms for ease of use, and endeavored to produce an algorithm that could process the

video faster than 20Hz.
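The full-frame search idea can be sketched as follows. This is not the algorithm from [24] nor the code in Appendix C; it is a minimal illustration, under the assumption that a previous step has already produced a binary mask in which non-zero bytes mark candidate pixels. The function name and mask layout are hypothetical. Note that the 50ms latency limit and the 20Hz target quoted above are two views of the same budget, since 1 / 0.05 s = 20 frames per second.

```c
#include <stddef.h>

/* Centre of mass of all non-zero pixels in a row-major width x height mask.
 * Every pixel of the frame is visited on every call, so there is no state
 * carried between frames that could cause the target to be "lost".
 * Returns 1 and writes the centroid to (*cx, *cy), or returns 0 if the
 * mask contains no set pixels. */
int mask_centroid(const unsigned char *mask, int width, int height,
                  double *cx, double *cy)
{
    long sum_x = 0, sum_y = 0, count = 0;

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (mask[(size_t)y * width + x]) {
                sum_x += x;
                sum_y += y;
                count++;
            }
        }
    }
    if (count == 0)
        return 0;

    *cx = (double)sum_x / count;
    *cy = (double)sum_y / count;
    return 1;
}
```

The cost of this approach grows with the frame area rather than with the size of the tracked object, which is the price paid for its robustness.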

2.5.3 Sound Generation

The Take a Look at Theremins website describes methods by which the Theremin can

produce music without the performer physically touching the instrument [15]. This site is

an excellent resource for Theremin technical documents and playing techniques. It helped

us learn how to mimic, on the computer, the Theremin’s responses to stimulus.

The Open Sound System Programmer’s Guide manual gives detailed instructions on how

to program applications for the Open Sound System (OSS) [7]. The Open Sound

System (OSS) is a device driver for sound cards and other sound devices under UNIX and

UNIX-Compatible operating systems. The manual progresses from background

information on OSS devices and programming techniques to a detailed description of

programming of specific aspects of sound cards, such as the mixer. Both aspects of using

either a digitized voice or a synthesizer for output are explained, which helped in

determining which type of sound output will be best for this project.

The online Linux Sound HOWTO resource gives useful information on sound support

for Linux [8]. To be able to program an application for outputting sound, we need to

make sure that we have our soundcard installed correctly and functioning correctly. This

document lists supported Linux hardware, describes how to configure the kernel drivers,

and answers frequently asked questions relating to Linux sound. An overview of how a

sound card functions is also present, which was useful in helping to determine whether to use

either a digitized voice or synthesis.

Method for the Theremin, an instructional manual written by Robert B. Sexton on how

the Theremin musical instrument is played, gave us two important pieces of

information [15]. First, it helped in our tracking design model. We needed to know how

specific hand movements affect the Theremin. With this information, we can correctly



respond to the movements of the Virtual Theremin player’s hands, so as to duplicate the

response that is to be expected of a physical Theremin instrument. Secondly, it served as

a guide, to aid in our learning how to play the instrument effectively so as to demonstrate

that our system correctly duplicates the use of a physical Theremin.

Sine Wave Modulation Synthesis for Programmers was created by Ian Miller [12]. It is a

valuable resource for determining how sine waves can be used to synthesize sounds.

Equations for creating an appropriate sine wave at a particular frequency based on

the sampling rate are presented, as well as more advanced information on sine wave

manipulation.
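As an illustration of the kind of relationship involved, the standard expression for a sampled sine wave is given below. This is the textbook form rather than a formula quoted from [12]; the symbols A (amplitude), f (tone frequency), f_s (sampling rate) and n (sample index) are introduced here for the example.

```latex
s[n] = A \sin\!\left( \frac{2 \pi f n}{f_s} \right), \qquad n = 0, 1, 2, \dots
```

For instance, assuming a 440 Hz tone and a 44100 Hz sampling rate, one period of the wave spans f_s / f = 44100 / 440, or about 100.2 samples.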

John Simonton’s article [25] discusses the properties of Theremin sound, as well

as how the sounds are produced. It is useful for anyone wondering

why the Theremin sounds the way that it does, as well as those who are interested in

harmonics and pitch sensitivity.

2.5.4 Programming References

Documentation was needed to explain the usage of several IEEE1394 programming

libraries. Geocrawler [18] was found to be most beneficial as it provided examples and

group discussions. Geocrawler is the leading news, collaboration and distribution

community for IT and Open Source development, implementation and innovation. It

currently boasts approximately 6,000,000 emails archived that discuss open source

development.



The DFS’s C page webpage was created by DF Stermole [14]. It contains basic

information about Linux systems, as well as information about programming in C. The

pertinent information from this source comes in the form of a sample C program to allow

a user to obtain input from the keyboard using a single key press under a Linux

environment.

The A Pthreads Tutorial page was created by Andrae Muys [10]. The website contains

information on how to create multi-threaded applications under a Unix-type environment.

Programming concepts are explained by providing short, easy-to-understand sample

programs. Topics covered include benefits of concurrency, creating threads, mutexes and

synchronization and examples of classical concurrency problems.

The Debian website contains information about the Debian operating system [11].

Documentation on the setup of the operating system is available, as well as information

on packaging and usage of the system. It is an invaluable resource for learning about

getting started with your Debian Linux system.

During the development of the simulation in Octave, the Octave Online Documentation

[9] was found to be an invaluable resource, as it covers all of the built-in functions in

detail. Correct syntactical use and valid parameter options are described, and many usage

examples are provided on the site.



Section 3: Materials and Programming Methods

3.1 Materials Used

3.1.1 Hardware

For our project, we decided to use the following hardware components:

• Standard PC, 2GHz processor, 256MB RAM

• IEEE1394 PCI interface card, with OHCI compliant chipset

• IEEE1394 video camera

3.1.2 Software

The following software was also used in the development of the Virtual Theremin:

• GNU GCC and G++ compilers, version 2.96

• DDD graphical debugger, version 3.3.1

• Pthreads library, version 1003.1c

• Coriander source code, version 0.26

3.2 Programming Techniques

During software development, we learned and applied the following programming

techniques. The use of these techniques as they were applied in our software will be

discussed in the relevant sections below.

3.2.1 Freestore

A freestore is used to improve access times to dynamically allocated memory space

during run time. Instead of allocating and deallocating memory blocks as they are

needed, a large number of memory blocks are allocated and stored in a linked list. When



a new block of memory is required by the program, it calls a macro that returns the next

available block from the linked list, instead of having to find appropriate space in

memory. This offers large performance gains, since dynamically allocating new

storage is one of the more time-intensive operations a program performs.
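The freestore idea can be sketched as below. This is a minimal illustration and not the code in Appendix C: all names are hypothetical, the 64-byte payload size is an arbitrary assumption, and plain functions are used in place of the macro mentioned above for readability.

```c
#include <stddef.h>
#include <stdlib.h>

/* One fixed-size block in the freestore. */
typedef struct block {
    struct block *next;      /* link to the next free block */
    unsigned char data[64];  /* payload; the size is an arbitrary choice */
} block_t;

static block_t *free_list = NULL;  /* head of the linked list of free blocks */

/* Pre-allocate `count` blocks with a single malloc call and chain them
 * into the free list. Returns 1 on success, 0 if allocation failed.
 * The pool is intentionally never released in this short sketch. */
int freestore_init(size_t count)
{
    block_t *pool = malloc(count * sizeof *pool);
    if (pool == NULL)
        return 0;
    for (size_t i = 0; i < count; i++) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
    return 1;
}

/* Take the next available block from the list in constant time.
 * Returns NULL when the freestore is exhausted. */
block_t *freestore_get(void)
{
    block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

/* Return a block to the freestore so that it can be reused. */
void freestore_put(block_t *b)
{
    b->next = free_list;
    free_list = b;
}
```

Both `freestore_get` and `freestore_put` are a handful of pointer operations, so the cost of the general-purpose allocator is paid once at start-up instead of inside the per-frame processing loop.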

3.2.2 Concurrent Code Development

To make our code development more efficient, we modularized our Virtual Theremin

design which allowed each member to work independently. Each module was planned

such that it had a rigid set of input and output requirements placed on it. Each member

was aware of what they needed to take as inputs and what was expected at the outputs.

By taking this approach, our group was able to maximize our overall design efficiency

and minimize integration time.

3.2.3 Incremental Software Development

To ensure that our code was of the highest quality, we decided to implement incremental

software development methods. We agreed that each team member would write their

code in such a way that every existing part of the module was tested before beginning

development of a new one. During the earlier stages of development, we made daily backups of

our code, and once integration began and integration problems were solved, we backed

up our code every few hours. The version of our software found in Appendix C is the

most recent version, Mar19-2.



Section 4: Design

This project was made up of three main components: image acquisition, image

processing and tracking, and sound generation. The methods used to develop, integrate

and test each of these components are listed below.

4.1 Testing the IEEE1394 Bus

After completing the installation of Debian as outlined in Appendix B, the IEEE1394 bus

was checked using two separate tests to ensure correct operation. The first test used a

small graphical program called gscanbus, which can be found at [3]. Gscanbus was

created to scan the IEEE1394 bus, check that it exists and then provide information about any

connected devices. The second test used a video capturing program called Coriander

which can also be found at [3]. Coriander was created to output video feed from an

IEEE1394 video device to the screen.

The gscanbus test was successful and found the IEEE1394 bus on the first trial, verifying

that the IEEE1394 bus existed and was functioning properly. Information about our Pyro

camera connected to the bus was displayed to the screen.

The Coriander test was successful in verifying the operation of the bus, as video feed

from the Pyro camera was output to the screen correctly, but after a

few minutes the video transmission from the camera would terminate. The transmission

would simply stop for no apparent reason and would resume only after a clean reboot of

the entire system. Thinking other IEEE1394 users may have experienced this problem,



we searched through [3] for some information. It was soon discovered that this

transmission problem was caused by the existing OHCI1394 driver.

The driver had been written in such a way that control of the IEEE1394 bus would be

handled by attached devices, not the Linux kernel. This wouldn’t have been a problem

had our camera been designed to control and manage the bus. Instead the camera was

designed assuming the kernel would be in control. Both kernel and camera were getting

confused about who had control of the bus and at a certain point during transmission

neither the kernel nor the camera would have control and the bus would lock, ceasing all

transmissions.

This problem was solved by editing the OHCI1394 driver to force the kernel to take

control of the bus. Following instructions listed on [3], the ohci1394.c driver file was

edited and the value of the bus control variable, attempt_root on line 162, was changed

from 0 to 1 as depicted in Figure 4.1 below.

It should be noted that this modification is considered a “hack” by [3] and can cause

major problems for anyone who is connecting more than one PC to the IEEE1394 bus.

Currently a more complete implementation of the bus management is being developed.

159 /* Module Parameters */
160 MODULE_PARM(attempt_root,"i");
161 MODULE_PARM_DESC(attempt_root, "Attempt to make the host root (default = 0).");
162 static int attempt_root = 0;   /* changed to 1 */

Figure 4.1: Code fragment from ohci1394.c



4.2 Graphical User Interface Implementation and Image Acquisition

At the outset of the project, it was

assumed that our group would

have to write the software to

capture images from the camera

and display them on the monitor. It

was also assumed that we would

have to design a graphical user

interface (GUI) to make our

software easier to use. After some

preliminary work, we discovered that creating an image capturing GUI would be more

complex than we anticipated.

To solve this problem and allow more time to focus on the image processing of our

design project, our group came up with a simple solution. Because Coriander is open

source, we modified the original Coriander source code and added our own

functionalities and algorithms to it. This was an ideal solution for several reasons, but

mainly because Coriander already supported a functioning, easy to use GUI.

Furthermore, it also provided all the functionality for capturing images from the camera

and outputting them to the screen.

Figure 4.2: Coriander



Alterations to the Coriander source code included adding interfaces to call the different

functions provided in Daniel's and Jeremy's modules. Modifications were also made to

Coriander’s GUI using a user-interface code builder called Glade. These changes can

be viewed when Figure 4.3 is compared to that of the original, displayed in Figure 4.2.

The new version includes several toggle buttons

as well as several spin boxes, each of which is

explained below.

4.2.1 Toggle Buttons

As can be seen in Figure 4.3, five extra toggle

buttons were added to the Coriander GUI:

“Save RGB & YUV”, “Toggle Tracking”,

“Toggle Sound”, “Toggle Skin Highlighting” and

“Toggle Hand Markers”.

The “Save RGB & YUV” toggle button was added purely to grab a snapshot of what the camera was displaying and write it to disk. The image was saved in both RGB and YUV formats and was mainly used by Daniel for testing the skin detection algorithms.

The “Toggle Tracking” button enables the hand tracking algorithm that constitutes the

main component of the Virtual Theremin. Clicking this button will pass images from the

Pyro camera to Daniel's hand tracking module.

Figure 4.3: Modified Coriander



Enabling tracking also enables the “Toggle Sound”, “Toggle Skin Highlighting” and

“Toggle Hand Markers” buttons. Toggling the sound feature enables the sound

generation provided by Jeremy's sound module. “Toggle Skin Highlighting” and

“Toggle Hand Markers” are present in the GUI to provide visual feedback to the

performing user.

The skin highlighting feature shows the user that the program is working by displaying, in pink, what is currently being tracked as skin. The hand markers feature is another visual aid that, when enabled, draws markers on the screen showing the user what the computer is interpreting as hands. The left hand gets a blue square, while the right gets a green one. An example of these buttons in action can be seen in Figure 4.4.

The exact processes for performing the tracking, highlighting and sound generation are

explained in more detail by Daniel and Jeremy, later in this section.

Figure 4.4: Visual example of toggling skin highlighting and hand markers



4.2.2 Spin Boxes

Along with the toggle buttons, four spin boxes were added to the Coriander GUI: “U MIN”, “U MAX”, “V MIN” and “V MAX”. Since our program is affected by different lighting conditions, as will be explained in section 4.3.1, these spin boxes allow a user to adjust the skin detection range to suit different lighting environments.

4.3 Image Processing

The image processing module makes use of the information acquired by the image

acquisition stage to determine the location of skin colored pixels in the image. Once this

information is generated, the image will be searched by the processing module to

compute the locations of the player’s hands in the image and return this information to

the program so it can be used to simulate the Theremin sound in the sound generation

module.

4.3.1 Skin Detection

The skin detection for each image is performed in YUV space. Since the determination of

skin colored pixels depends very little on the intensity of the pixel, we can safely ignore

the Y component of each pixel and use the color information stored in the U and V

channels to decide whether a pixel is skin colored or not. This reduces the skin detection

problem from a three-dimensional problem in RGB space to a two-dimensional problem

in UV space.
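A minimal sketch of such a two-dimensional test is shown below. It assumes the U and V values are available as separate 8-bit planes, which the report does not specify, and the type and function names (uv_range_t, is_skin, segment_skin) are illustrative rather than taken from our source code. The four thresholds correspond to the “U MIN”, “U MAX”, “V MIN” and “V MAX” spin boxes in the GUI.

```c
#include <stdint.h>

/* Hypothetical threshold set; the fields mirror the four GUI spin boxes. */
typedef struct {
    uint8_t u_min, u_max;
    uint8_t v_min, v_max;
} uv_range_t;

/* Returns 1 if the chrominance pair (u, v) lies inside the range, else 0.
   The luminance (Y) value is deliberately not an argument. */
static int is_skin(uint8_t u, uint8_t v, const uv_range_t *r)
{
    return u >= r->u_min && u <= r->u_max &&
           v >= r->v_min && v <= r->v_max;
}

/* Builds a binary mask (1 = skin) from separate U and V planes of n pixels. */
static void segment_skin(const uint8_t *u, const uint8_t *v, uint8_t *mask,
                         int n, const uv_range_t *r)
{
    for (int i = 0; i < n; i++)
        mask[i] = (uint8_t)is_skin(u[i], v[i], r);
}
```

Because only two comparisons per channel are needed, the cost per pixel is constant and independent of brightness.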

Originally, we had planned to perform skin detection in the HSI color space, but we soon

discovered that the conversion between RGB and HSI was very complex and time consuming.



After discussing the situation with Professor MacLean, we began looking into other color

spaces that would be acceptable for our project. We decided to use the YUV color space

because our camera could be configured to output images in this format, removing the

need for any color space conversion to take place.

In addition to the reduction in complexity, YUV and similar color spaces which separate

intensity from color information offer another advantage in the skin detection problem. In

these color spaces, skin colored pixels of any background or ethnic origin all fall within a

small connected region of the color information space [5]. Since we did not have any

information regarding the values of skin tone in the YUV color space, it was necessary to

take sample images of skin and determine their chrominance values. Using Octave, we

created a function that would take a directory of images and extract the necessary

chrominance values (U and V), assuming that the images only contained skin colored

pixels or black pixels. The number of occurrences of each set of values was recorded and

plotted. The preliminary results can be seen in the following plot.



We used the information collected to determine an appropriate range of U and V which

was used to segment the image into skin colored and non-skin colored pixels. Results of

this segmentation can be seen below:

Figure 4.4: Relative Frequency of U, V pairs for skin colored pixels



Figure 4.5: Test Image A

Figure 4.6: Pixels detected as skin-color

The results obtained from our filter are quite adequate for our project, but it is worth

noting that the output of the filter is affected by the lighting conditions in the playing

environment. In particular, we see in Figure 4.6 that the palm of the hand is not

completely identified as skin. Looking to Figure 4.5, we see that the palm has a blue tint,

due to the subject being positioned close to the computer monitor at the time the image

was taken. This is a significant property of the skin detection algorithm, so we have

integrated a manual adjustment of the U-V range into the GUI so we can adjust the

sensitivity of the detection algorithm based on the current lighting conditions.

4.3.2 Hand Tracking

Once the image has been segmented with the skin detection algorithm, it must be

searched to determine the location of the left and right hands in the image. The hand

location section of the module consists of three main parts: a blob growing function

which finds all connected skin pixels from a given start pixel, a full image search

algorithm which creates a list of blobs from the entire image and determines which blobs



are the hands, and a quick search which searches the image based on the previous known

location of the hands.

The first function is used by the other functions to determine which pixels are part of a

connected component of skin. Called grow_blob() in handtracker.c (see Appendix C for

source code), this function uses packets from a freestore to create a list of pixels that

should be visited. The algorithm searches the 4-neighbourhood of the current pixel and

adds any skin colored pixels to the list. The algorithm then removes the next pixel from

the list and repeats the same process until the list is empty. At each pixel insertion, the

algorithm also accumulates information on the blob, in particular, the size of the blob in

pixels, and the pixel co-ordinate representing the center of mass of the blob.
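The blob growing procedure described above can be sketched as follows. This is an illustration only and is not the grow_blob() code from handtracker.c: it uses a plain array as the list of pixels to visit instead of freestore packets, and the names blob_t and grow_blob_sketch, along with the mask and visited arrays, are assumptions made for the example.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    long size;          /* number of pixels in the blob */
    long sum_x, sum_y;  /* coordinate sums; centre of mass = sum / size */
} blob_t;

/* Grows a blob from (sx, sy) over the 4-neighbourhood.
   mask: 1 = skin pixel, 0 = other. visited: set to 1 when a pixel is
   added to the list. Returns 0 on success, -1 if allocation fails. */
static int grow_blob_sketch(const uint8_t *mask, uint8_t *visited,
                            int w, int h, int sx, int sy, blob_t *out)
{
    static const int dx[4] = { 1, -1, 0, 0 };
    static const int dy[4] = { 0, 0, 1, -1 };
    int *list = malloc(sizeof *list * (size_t)w * (size_t)h);
    long head = 0, tail = 0;

    out->size = out->sum_x = out->sum_y = 0;
    if (list == NULL)
        return -1;
    if (sx < 0 || sx >= w || sy < 0 || sy >= h ||
        !mask[sy * w + sx] || visited[sy * w + sx]) {
        free(list);
        return 0;                       /* nothing to grow from here */
    }

    visited[sy * w + sx] = 1;
    list[tail++] = sy * w + sx;
    out->size = 1;
    out->sum_x = sx;
    out->sum_y = sy;

    while (head < tail) {               /* repeat until the list is empty */
        int p = list[head++];
        int x = p % w, y = p / w;
        for (int k = 0; k < 4; k++) {
            int nx = x + dx[k], ny = y + dy[k];
            if (nx < 0 || nx >= w || ny < 0 || ny >= h)
                continue;
            int q = ny * w + nx;
            if (mask[q] && !visited[q]) {
                visited[q] = 1;         /* each pixel is inserted once */
                list[tail++] = q;
                out->size++;            /* accumulate at insertion time */
                out->sum_x += nx;
                out->sum_y += ny;
            }
        }
    }
    free(list);
    return 0;
}
```

Marking a pixel as visited when it is inserted, rather than when it is removed, keeps the list bounded by the number of pixels in the image.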

The full image search function, find_hands() begins at the top left of the image and

searches each pixel until it finds a skin colored one that has not already been added to

another blob. It then passes the selected pixel to the grow_blob() function and stores the

resulting blob into a linked list, sorted by size. Once the entire image has been searched,

the three largest blobs are selected from the list, which are assumed to be the left and

right hands, and the head. We assume that the hands will be the leftmost and rightmost

blobs, and return the center of mass co-ordinates of these two blobs.
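The selection step at the end of the full search might look like the sketch below. It sorts an array with qsort() rather than maintaining the size-sorted linked list used in our module, and the names blob_info_t and pick_hands are hypothetical. "Left" and "right" here mean leftmost and rightmost in image co-ordinates.

```c
#include <stdlib.h>

typedef struct {
    long size;   /* blob size in pixels */
    int cx, cy;  /* centre of mass */
} blob_info_t;

/* Comparison function for qsort(): largest blob first. */
static int by_size_desc(const void *a, const void *b)
{
    long sa = ((const blob_info_t *)a)->size;
    long sb = ((const blob_info_t *)b)->size;
    return (sa < sb) - (sa > sb);
}

/* Picks the leftmost and rightmost of the (up to) three largest blobs.
   Returns 0 on success, -1 if fewer than two blobs are available. */
static int pick_hands(blob_info_t *blobs, int n,
                      blob_info_t *left, blob_info_t *right)
{
    if (n < 2)
        return -1;
    qsort(blobs, (size_t)n, sizeof *blobs, by_size_desc);

    int top = n < 3 ? n : 3;
    int l = 0, r = 0;
    for (int i = 1; i < top; i++) {
        if (blobs[i].cx < blobs[l].cx) l = i;
        if (blobs[i].cx > blobs[r].cx) r = i;
    }
    *left = blobs[l];
    *right = blobs[r];
    return 0;
}
```

Restricting the choice to the three largest blobs means small patches of noise near the image edges are not mistaken for hands.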

Once a full image search has been performed, it is no longer necessary to search the

entire image for the hands, since we assume that the player’s hands will not have moved

significantly from their positions in the previous frame. We will begin our search from the

previous known location of the hands, and search the 8-neighbourhood of the pixel for a

skin colored pixel. If one is found, we will call the grow_blob() function on that pixel and



assume that the returned blob is the hand. If for some reason we are unable to locate

either of the two hands based on the previous location, we resort to a full image search.
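As a rough sketch, the first step of the quick search could be written as below. The function name quick_seed and the binary mask layout are assumptions for illustration, and the sketch also tests the previous pixel itself, which the description above does not state explicitly. A return value of 0 corresponds to a quick search miss, after which the caller would fall back to the full image search.

```c
#include <stdint.h>

/* Looks at the previous hand position (px, py) and its 8 neighbours.
   On a hit, stores the first skin pixel found in (*fx, *fy) and returns 1.
   Returns 0 on a miss. mask: 1 = skin pixel, 0 = other. */
static int quick_seed(const uint8_t *mask, int w, int h,
                      int px, int py, int *fx, int *fy)
{
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int x = px + dx, y = py + dy;
            if (x < 0 || x >= w || y < 0 || y >= h)
                continue;               /* neighbour lies outside the image */
            if (mask[y * w + x]) {
                *fx = x;
                *fy = y;
                return 1;
            }
        }
    }
    return 0;
}
```

The pixel returned in (*fx, *fy) would then be used as the start pixel for blob growing.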

The time saved by the quick search is significant: a hit is up to 5 times faster than a full

search if the image is noisy. Fortunately, the penalty incurred by a miss in the quick

search will not affect the performance of the system, since the full image search is still

much faster than the required 50ms. The relevant timing results obtained by searching a

noisy image can be found in the following table:

(All times in ms)   Full search   Quick search hit   Quick search miss
320x240             3             <1                 5
640x480             17            5                  24

In addition to the three major functions discussed above, there is a wrapper function, locate_hands(), which serves as the interface to the module and automatically determines which of the two hand detection functions should be called, and an initialization function, locator_init(), which allocates empty blob and pixel packets to the respective freestores.

4.3.3 Development Details

Once the module was complete, it was found that the full search algorithm was

excessively slow. The first trial run of the full search took 273ms to completely search a

320x240 image. However, the quick search was found to be quite fast, taking less than

3ms to find the hands. Since there were real time constraints to be dealt with, we decided

that we could only spend time performing one type of optimization. We could optimize



the skin detection filter to minimize the number of full searches required, or we could

improve the processing time of the full search. We decided on the latter course of action,

since we could apply the optimizations to both the full and the quick searches, and

hopefully gain more improvement for an equivalent time investment. This selection

yielded improvements much greater than we anticipated.

At the start of the optimization process, we spent time removing unneeded variables and

redundant calculations, and using more efficient methods of initializing internal data

storage, but this gave us only marginal improvements, a few milliseconds at best. We

then considered the internal data structures themselves, and realized that the same storage

space was being used every time through the detection process, but new memory was

being allocated every time. This concern was brought up to Professor MacLean, who

suggested we investigate the freestore programming technique, to reuse existing memory

allocations instead of finding new memory every time. We first implemented the pixel

storage and blob information storage as a freestore, with moderate speed improvements

of up to 50ms. However, once we began to reuse large existing arrays used for temporary

image storage, our processing times dropped dramatically. The end result of our

optimizations was a 100-fold improvement over our original full search times, from

296ms to 3ms. The quick search also improved, from 5ms to less than 1ms. As such, we

were able to make our system more robust, by inserting full frame searches at regular

intervals to ensure that, even if the hands were “lost” while tracking, they could be found

again within a certain time frame and there would be no noticeable delay in tracking.
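The freestore idea can be illustrated with the short sketch below. It is not the code from our module: the packet layout and the names packet_t, freestore_t, freestore_init, freestore_get and freestore_put are assumptions chosen for the example. Packets are allocated once, handed out from a free list, and returned to that list instead of being released to the system.

```c
#include <stdlib.h>

typedef struct packet {
    int x, y;               /* payload, e.g. a pixel co-ordinate */
    struct packet *next;
} packet_t;

typedef struct {
    packet_t *free_list;    /* packets available for reuse */
} freestore_t;

/* Pre-allocates n packets. Returns 0 on success, -1 on allocation failure. */
static int freestore_init(freestore_t *fs, int n)
{
    fs->free_list = NULL;
    for (int i = 0; i < n; i++) {
        packet_t *p = malloc(sizeof *p);
        if (p == NULL)
            return -1;
        p->next = fs->free_list;
        fs->free_list = p;
    }
    return 0;
}

/* Takes a packet from the store; calls malloc only if the store is empty. */
static packet_t *freestore_get(freestore_t *fs)
{
    packet_t *p = fs->free_list;
    if (p != NULL)
        fs->free_list = p->next;
    else
        p = malloc(sizeof *p);
    if (p != NULL)
        p->next = NULL;
    return p;
}

/* Returns a packet to the store for later reuse instead of freeing it. */
static void freestore_put(freestore_t *fs, packet_t *p)
{
    p->next = fs->free_list;
    fs->free_list = p;
}
```

After the first few frames the store holds enough packets for typical images, so the per-frame cost of memory allocation largely disappears.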



4.4 Sound card support in Linux

One of the major benefits of the Linux operating system is that it is easily customizable.

It allows users to tweak the system to suit their own individual needs. Sound setup in

Linux is more complicated than in most other operating systems, but is more efficient at

managing resources.

Recent kernel distributions for Debian Linux include many of the standard audio device

drivers, but they are not enabled by default. To use them, they must be enabled and

recompiled into the kernel. Our system contains a soundcard that uses one of the drivers

provided by the kernel.

Linux devices have the interesting property that they are registered as files in the file

system of the operating system. This is a convenient property, as it allows some of the

standard file operations to be performed on the devices. Therefore, if you wish to write

samples to the soundcard for output, you can just write a file containing a set of samples

to the soundcard device file. This is an excellent way to test to see that the soundcard

installation has been performed correctly. By writing a standard audio file to the device,

sound should be heard from the system speakers.
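A minimal sketch of this kind of test is given below. The function name write_samples is hypothetical, and the device path is left as a parameter because the report does not name the device file; /dev/dsp is the conventional OSS path, but that is an assumption here. The sketch uses only the standard open(), write() and close() calls, exactly as it would for an ordinary file.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Writes n bytes of samples to the file or device at path.
   Returns 0 on success, -1 on any error. */
static int write_samples(const char *path,
                         const unsigned char *samples, size_t n)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < n) {                  /* write() may accept fewer bytes */
        ssize_t k = write(fd, samples + done, n - done);
        if (k <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)k;
    }
    return close(fd) == 0 ? 0 : -1;
}
```

Calling write_samples("/dev/dsp", buffer, length) on a system with a working OSS driver would be expected to produce sound, provided the buffer matches the sample format the device is currently set to.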

In addition, the Linux mixer device for the soundcard needs to be assigned to the users

who can have access to it. In our system, we want any user to have access to mixer

functionality, and thus the device’s privileges were set to accommodate this.

To design our software to interface with the soundcard device, the Open Sound System

(OSS) application programming interface was used. The Open Sound System is a device



driver for sound cards operating under a Unix-compatible operating system. Sound cards

normally have different devices or ports which produce or record sound. The OSS

provides a common programming interface that can control all of the devices and ports

associated with the soundcard. By using this API, our Virtual Theremin software will be

able to be run on any Linux system that uses OSS. This is beneficial as it does not

constrain the software to any specific type of soundcard.

4.4.1 Interactive Program for sound emulation using keyboard support

The first step in generating Theremin sound was to create a test application. This

framework allowed the sound output characteristics to be adjusted until it conformed to

what an ideal Theremin should sound like.

The test application consisted of a thread that would continually output samples from a

waveform, defined below, to the soundcard device. Using the keyboard, two variables

could be adjusted both up and down. These variables correspond to both an amplitude

and a frequency component. In the C programming language, the keyboard input is

buffered so that multiple keys are read until the enter key is pressed. We wanted to have

a direct effect on the sound outputted in relation to the key presses so the buffering was

disabled. The test was successful as the sound thread read the variables and outputted

samples from the waveform to the soundcard.
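The report does not record how the keyboard buffering was disabled. One common way on Linux, shown here purely as an assumed sketch, is to turn off canonical mode through the POSIX termios interface so that each key press is delivered immediately. The function names set_unbuffered_input and restore_input are hypothetical.

```c
#include <termios.h>
#include <unistd.h>

/* Switches the terminal on fd to non-canonical, no-echo mode so that
   read() returns after a single key press. The previous settings are
   stored in *saved. Returns 0 on success, -1 on failure (for example
   when fd is not a terminal). */
static int set_unbuffered_input(int fd, struct termios *saved)
{
    struct termios t;

    if (tcgetattr(fd, saved) != 0)
        return -1;
    t = *saved;
    t.c_lflag &= ~(tcflag_t)(ICANON | ECHO);
    t.c_cc[VMIN] = 1;    /* return as soon as one byte is available */
    t.c_cc[VTIME] = 0;   /* no inter-byte timeout */
    return tcsetattr(fd, TCSANOW, &t) == 0 ? 0 : -1;
}

/* Restores the settings saved by set_unbuffered_input(). */
static int restore_input(int fd, const struct termios *saved)
{
    return tcsetattr(fd, TCSANOW, saved) == 0 ? 0 : -1;
}
```

A test program would call set_unbuffered_input(STDIN_FILENO, &saved) at start-up and restore_input(STDIN_FILENO, &saved) before exiting, so the terminal is left in its original state.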

Once the testing application was complete, the function which created samples for a specific

waveform was written. A sine wave was used as our output waveform and this waveform

defines the ideal Theremin output. Since Theremin sound is continuous, complete cycles



of the waveform were defined to eliminate chop, which is an undesirable property where

the output sound temporarily halts.

    f = (2*PI*hz)/sr;
    count = (int)floor(sr/hz);   /* number of samples in buffer for complete sine wave */
    for (i = 0; i < (count/2); i++) {
        /* each sample value is written to two consecutive buffer entries */
        audio_buffer[buffer_index] = vol*sin(f*i) + 128;   /* so no negative values */
        buffer_index++;
        audio_buffer[buffer_index] = vol*sin(f*i) + 128;   /* so no negative values */
        buffer_index++;
        if (buffer_index == 2) {   /* write to audio device when buffer is full */
            write(audio_fd, audio_buffer, BUF_SIZE);
            buffer_index = 0;
        }
    }

Figure 4.7: Code segment from keyboard testing program

The code segment in Figure 4.7 shows the manner in which amplitude and frequency were defined to produce complete cycles of samples from a sine wave. Amplitude corresponds to the ‘vol’ variable, which ranges from zero to one hundred, with zero meaning no amplitude and one hundred being full amplitude. The output frequency is set by the ‘hz’ variable, and ‘f’ is the corresponding phase step per sample at the sampling rate ‘sr’.

This application allowed the output waveform to be adjusted until the sound produced was consistent with what our design group believed to be an ideal Theremin.

4.4.2 Mouse Driven Application for Sound Mapping Testing

To create a mapping between hand position and sound, a new test application was created that used the mouse cursor position to act in place of hand positions. Since we defined that the left hand would control the volume, by being tracked vertically, and the right hand would control the pitch, by being tracked horizontally, the current mouse position could stand in for both hand positions. We ignored the vertical position of the right hand, and the horizontal position of the left hand. This test application was written using OpenGL, taking advantage of the GLUT library, which provided the mouse handling functionality.

To map a given hand position to a certain output amplitude and frequency of our sine

wave, a look-up table was created. This table translates a given x and y position for each

hand to a set of amplitude and frequency for output. The table was created by taking the

given set of pixels of the image, and defining a linear range of output for each dimension.

The vertical pixel positions would map between the required output volume range.

Similarly, the horizontal pixel positions would map out the given frequency range.
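A look-up table of this kind could be built as in the sketch below. The table names, the build_tables function, the 320x240 image size and the choice of mapping the top image row to full volume are all assumptions made for illustration; the report does not give the actual volume or frequency limits, so they are passed in as parameters.

```c
#define IMG_W 320   /* assumed image width in pixels */
#define IMG_H 240   /* assumed image height in pixels */

static int   vol_table[IMG_H];    /* image row -> output volume */
static float freq_table[IMG_W];   /* image column -> output frequency in Hz */

/* Fills both tables with a linear mapping over the given output ranges.
   Row 0 (top of the image) maps to vol_max, the last row to vol_min.
   Column 0 maps to f_min, the last column to f_max. */
static void build_tables(int vol_min, int vol_max, float f_min, float f_max)
{
    for (int y = 0; y < IMG_H; y++)
        vol_table[y] = vol_max - (vol_max - vol_min) * y / (IMG_H - 1);
    for (int x = 0; x < IMG_W; x++)
        freq_table[x] = f_min + (f_max - f_min) * (float)x / (float)(IMG_W - 1);
}
```

With the tables built once at start-up, converting a tracked hand position to sound parameters is two array reads per frame, for example vol_table[left_y] and freq_table[right_x].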

While using this test application, it was discovered that too many samples were being

written to the soundcard. This caused a noticeable delay between the movements of the

mouse and the sound generated. This was corrected by implementing a timer for the

sound output thread. The thread was timed to run at the same frequency as the soundcard

sampling rate. This made sure that no extra samples were being written to the audio

buffer that might cause a delay between mouse movement and sound output.

With the hand positions to sound output module complete, it was integrated into the

Virtual Theremin system. The integrated system was thoroughly tested by our design

group to make sure that the sound output module functioned correctly as a system

component. Testing of the sound output system component was done by our design

group by checking to see if hand position in the GUI matched the sound output.



Section 5: Conclusions

Overall, our project was successful. We achieved all the objectives that were outlined in

our proposal [13] and were able to demonstrate our functioning system at the design fair.

Most notable is our processing speed, which far exceeds the desired 50ms. With a

320x240 pixel image, we were able to keep the processing delay under 5ms at all times,

and a full-size 640x480 image could be processed in less than 25ms. Theoretically this

could result in frame rates upwards of 40fps with a full-size video feed, which is

substantially faster than our original requirements. Our system was also able to track the

player’s hands without an explicit setup phase. Though a small amount of initial

calibration was required to compensate for the lighting conditions present in the playing

environment, the player could walk into the camera’s field of vision, and the system

would automatically begin tracking him. The system can track the user's hands through

regular Theremin movements and will track the hands throughout the entire image,

except if the user places his hand over his face. Once the hand is relocated however, the

system will quickly resume correct tracking.

Our Virtual Theremin mimics a real Theremin, both in terms of sound and physical

movements. Even though we are limited to a one octave pitch range, appropriate hand

movements will cause the system to respond with the required changes in pitch and

volume. Since a real Theremin should theoretically produce a pure sine wave, we

consider our computer generated sine wave to be an appropriate approximation.



The UI we developed is very intuitive to use. All the essential features are easily

accessible via buttons and text boxes, and system properties such as video capture size,

skin detection range and user feedback modes are easily customizable.

5.1 Further Work

Though the Virtual Theremin was an instructive and fun way to learn about hand tracking

and human-computer interaction, the applications of our system are very broad, and the

system serves as a sound basis for further development in computer vision. Our system is quite robust

but could still use some adjustments. In particular, the skin detection algorithm could be

improved by using a data range that more accurately represents the distribution of skin-

colored pixels in U-V space. It may also be useful to augment the system with an

algorithm to automatically adjust the U-V range based on the current lighting conditions

detected by the camera.

5.2 Possible Applications

One particularly useful project could consist of adapting the tracking data provided by

our project to control a cursor on a computer screen in a similar fashion to the control

provided by a mouse. This could be useful in giving presentations in locations where it is

not always convenient to use a mouse (limited space, inconvenient placement of

computer equipment with respect to other apparatus, such as microphones). With further

development, we could use image segmentation techniques to recognize various hand and

finger positions to provide additional controls to the user.



Section 6: References

1. (2002, Sept. 1). CMP MEDIA LLC. [Online]. Available: http://www.embedded.com/1999/9906/9906feat2.htm
2. (2002, Sept. 22). 1394 Trade Association. [Online]. Available: http://www.1394ta.org/
3. Daniel Dennedy. (2002, Sept. 1). Linux1394. [Online]. Available: http://www.linux1394.org
4. Pratt, William K., Digital Image Processing, 2nd Ed. New York: John Wiley & Sons, Inc., 1991.
5. Fleck & Forsyth. (2002, Sept. 22). Naked People Skin Filter. [Online]. Available: http://www.cs.hmc.edu/~fleck/naked-skin.html
6. Sexton, Robert. “Method for the Theremin – Book 1 Basics,” Texas: Tactus Press, 1996.
7. Tranter, Jeff. (2002, Sept. 26). Open Sound System Programmer’s Guide. (1.11) [Online]. Available: http://www.4front-tech.com/pguide/oss.pdf
8. Tranter, Jeff. (2002, Sept. 26). The Linux Sound HOWTO. [Online]. Available: http://www.tldp.org/HOWTO/Sound-HOWTO/index.html
9. Eaton, John W. (2002, Nov. 17). GNU Octave – Table of Contents. [Online]. Available: http://www.octave.org/doc/octave_toc.html
10. Muys, Andrae. (2003, Jan. 5). A Pthreads Tutorial. [Online]. Available: http://www.cs.nmsu.edu/~jcook/Tools/pthreads/pthreads.html
11. (2002, Sept. 22). Debian. [Online]. Available: http://www.debian.org
12. Wilson, Ian. (2003, Jan. 5). Sine Wave Modulation Synthesis for Programmers. [Online]. Available: http://www.geocities.com/SiliconValley/Campus/8645/synth.html
13. Dargus, DeAraujo & Gillard. Virtual Theremin using IEEE1394 - Technical Proposal. Toronto: University of Toronto, 2002.
14. Stermole, DF. (2003, Jan. 4). DFS's C Page 2001-2002. [Online]. Available: http://www.macdonald.egate.net/CompSci/index.html
15. Sexton, Robert. (2002, Sept. 22). Take a Look at Theremins. [Online]. Available: http://www.ccsi.com/~bobs/theremin.html
16. Hawksley, John. (2003, Jan. 5). Sampling. [Online]. Available: http://www.armory.com/~greebo/sampling.html
17. (2003, Jan. 6). RGB and YUV Color. [Online]. Available: http://www.joemaller.com/fcp/fxscript_yuv_color.shtml
18. (2003, Jan. 8). Geocrawler. [Online]. Available: http://www.geocrawler.com/archives/3/2711/2000/11/0/4732161/
19. (2002, Oct. 1). The Linux Kernel HOWTO. [Online]. Available: http://www.linux.org/docs/ldp/howto/Kernel-HOWTO.html
20. (2002, Nov. 22). Octave-forge Combined Index. [Online]. Available: http://octave.sourceforge.net/index/index.html
21. Daniel Dennedy. (2002, Sept. 1). Linux1394. [Online]. Available: http://www.linux1394.org
22. Gonzalez, Rafael C. & Woods, Richard E., Digital Image Processing, New Ed. New York: Addison-Wesley Publishing Company Inc., 1992.
23. Hardenberg & Bérard. (2002, Sept. 26). Bare-Hand Human-Computer Interaction. Berlin, Germany. [Online]. Available: http://iihm.imag.fr/publs/2001/PUI01_Hardenberg.pdf
24. Simonton, John. (2003, Feb. 10). On Theremin Tone. [Online]. Available: http://www.paia.com/thereton.htm
25. (2003, April 4). Webopedia. [Online]. Available: http://www.webopedia.com/TERM/I/IEEE_1394.html



Appendix A: Computer System and Software Versions

A.1 Computer System

• i386-based processor, 2GHz
• 256MB DDR RAM
• VIA KT333 chipset-based Motherboard
• NVidia GeForce4 MX 440-based video card with 64MB DDR RAM
• D-Link DFW-500 Rev. 1.69 1394 PCI card
• SoundBlaster compatible soundcard on the motherboard
• 7200 RPM ATA-133 hard disk

A.2 Software Versions

Software name   Version     Description [3]
Coriander       0.26        A graphical utility that lets you control all of the features of an IEEE-1394 Digital Camera
gscanbus        0.7.1-1.1   Utility to display connected devices and do transactions
libraw1394-d    0.9.0-2     A library to control A/V devices using the 1394ta AV/C commands. This library also contains librom1394 for reading and decoding the CSR Config ROM of any device on the bus.
libdc1394       0.9.0-2     A library that is intended to provide a high level programming interface for application developers who wish to control IEEE1394 based cameras that conform to the 1394-based Digital Camera Specification (found at http://www.1394ta.org/).
Linux kernel    2.4.16      Linux kernel that supports IEEE1394



Appendix B: Setup and Installation of Linux with IEEE1394 Support

Prior to starting the image acquisition process, a suitable Linux environment was created.

A complete listing of all hardware and all software components used in the system can be

found in appendix A.

B.1 Installation of Debian

As outlined in [13] our group selected the Debian version of Linux because of its

somewhat simple installation procedures. Debian makes use of a command called

“dselect” that selects, downloads and installs packages available to the operating system

as well as determining whether a package has any dependencies. These dependencies are then

downloaded and installed as well. Debian was also selected because it was readily

available for download from the University of Toronto’s network, which provided our

group with quick access times.

B.2 Installation and Setup of IEEE1394 and Sound

After completing the Debian installation process it was discovered that the kernel used by

the OS was insufficient for our needs, as it did not support IEEE1394. Since IEEE1394

is the basis behind our real-time image capturing, a compatible kernel version was

required. Presently our system uses kernel build 2.4.16, which was selected for two

reasons. The first was that any version below build 2.4 did not support the IEEE1394

technology, and the second was that any build preceding or following 2.4.16 had

issues with other pieces of hardware in the computer system. For example, the IEEE1394

would work, but the sound card would not, and vice versa.



With an IEEE1394-compatible kernel installed, the IEEE1394 drivers were activated by

accessing the kernel’s configuration menu and enabling: IEEE1394

support, OHCI support and RAW1394 I/O support. Sound support was also enabled at

this point.



Appendix C: Source Code

The relevant source code for all the modules can be found on the following pages.
