ECE532 GROUP REPORT
Paint With Vision By Team #1
Amanjit Dhillon, Daniel Di Felice, Yusheng Wu
Table of Contents
Overview
Block Diagram
Brief Description of IP
Project Outcome
Project Schedule
Descriptions of System Components
Description of Design Tree
Tips and Tricks
Overview
The Paint With Vision project aimed to create an interactive experience for the user by solving a
computer vision challenge with custom hardware and software solutions. At a high level, an
LED is used to simulate a paint brush, while the LED's position is tracked by a camera and used
to draw on the screen. The drawing is overlaid on top of the video feed, which provides real-time
feedback to the user while creating the virtual experience.
Object tracking is a challenging problem in computer vision. While many algorithms exist, most
implementations are in software, and in general they require high-performance computer
hardware to handle the data processing for real-time applications. Our motivation for this project
was therefore to explore using custom hardware to aid a computer vision task in software. The
goals of this project were to (1) create an interactive augmented reality experience for the user
with the painting application; (2) implement custom hardware and software to realize LED
tracking and overlay painting on the camera video stream; (3) design the painting application
with usability in mind by including handy features such as erasing and dynamic brush size
changes; and (4) handle noise in object tracking and ensure the robustness of the tracking
algorithm against environmental variables.
Block Diagram
[Figure: system block diagram. An AXI Crossbar Interconnect connects the MicroBlaze
processor (with its data cache, BRAM instruction memory, and BRAM stack memory), the
Memory Interface Generator (off-chip DDR SDRAM holding the frame buffers and heap), the
Camera S2MM AXI Video DMA fed by the OV7670 decoder (OV7670 camera video input), the
VGA MM2S AXI Video DMA feeding an AXI-Stream Data FIFO and the vga640x320 block
(VGA video output), the Compositor, GPIO (switches enable draw/erase), and UART Lite (USB
programming interface). Legend: existing Xilinx IP, custom IP, and custom software IP; the
custom software IP comprises the Partial Frame Filter, Connected Component Labelling, and
Blob Select Draw or Erase.]
Brief Description of IP
Table 1: List of All IP Blocks Used in the Design

Instance Name               Block Name                                   Version   Origin
axi_gpio_0                  AXI GPIO                                     2.0       Xilinx
axi_intc_0                  AXI Interrupt Controller                     4.1       Xilinx
axi_uartlite_0              AXI Uartlite                                 2.0       Xilinx
axi_vdma_0                  AXI Video Direct Memory Access               6.2       Xilinx
axi_vdma_1                  AXI Video Direct Memory Access               6.2       Xilinx
axis_data_fifo_0            AXI4-Stream Data FIFO                        1.1       Xilinx
clk_wiz_1                   Clocking Wizard                              5.1       Xilinx
compositor_w_burst_0        compositor_w_burst_v1.0                      1.0       Full Custom
mdm_1                       MicroBlaze Debug Module                      3.1       Xilinx
microblaze_0                MicroBlaze                                   9.3       Xilinx
microblaze_0_axi_periph     AXI Crossbar Subsystem                                 Xilinx
                              AXI Data Width Converter                   2.1       Xilinx
                              AXI Crossbar                               2.1       Xilinx
microblaze_0_local_memory   Block Memory Generator Subsystem                       Xilinx
                              Local Memory Bus                           1.0/3.0   Xilinx
                              LMB BRAM Controller                        4.0       Xilinx
                              Block Memory Generator                     8.2       Xilinx
mig_7series_0               Memory Interface Generator (MIG 7 Series)    2.0       Xilinx
OV7670_0                    OV7670                                       1.0       Semi-Custom (some instructor-provided IP)
proc_sys_reset_0            Processor System Reset                       5.0       Xilinx
rst_clk_wiz_1_100M          Processor System Reset                       5.0       Xilinx
rst_mig_7series_0_81M       Processor System Reset                       5.0       Xilinx
vga640x480_0                vga640x320_v1_0                              1.0       Semi-Custom (with online IP)
• VIDEO INPUT TO MEMORY
A semi-custom OV7670 decoder block (OV7670_0) is used to configure the OV7670 camera
peripheral and convert its incoming signals to an AXI-Stream. Most of the internal
components in the block were adapted from code provided by the course instructors, with
modifications made to increase the resolution and bit depth and to make the output signals
AXI-Stream compliant. The stream is then pushed into an AXI Video Direct Memory Access
block (axi_vdma_0), which writes the camera frame buffer out to main memory (DDR SDRAM)
at 30 frames per second.
• PROCESSOR CODE
The MicroBlaze processor (microblaze_0) reads in the camera frame buffer and runs a detection
algorithm to determine the position, size, and colour of the LED, if it is visible in the image. It
then writes pixel data for a square to a second, drawing frame buffer in memory, which is
initialized to be fully transparent at the start of system operation. The square is written at the
same coordinates as the LED in the camera frame buffer and has the same size and colour.
The processor also periodically polls the AXI GPIO (axi_gpio_0) to check the position of the
board switches: switch 0 enables drawing, while switch 1 enables the erase function. The
processor also has an internal data cache for main memory accesses.
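The drawing step described above can be sketched in C. This is an illustrative software model rather than the project's exact code: the 32-bit pixel layout, the TRANSPARENT sentinel value, and the function name are assumptions made for the sketch.

```c
#include <stdint.h>

#define WIDTH  640
#define HEIGHT 480
#define TRANSPARENT 0xFF000000u  /* assumed sentinel meaning "no paint here" */

/* Write (draw) or clear (erase) a square of side `size` centred at the
 * LED's detected coordinates (cx, cy) into the drawing frame buffer. */
static void stamp_square(uint32_t *draw_buf, int cx, int cy, int size,
                         uint32_t colour, int erase)
{
    for (int y = cy - size / 2; y < cy + size / 2; y++) {
        if (y < 0 || y >= HEIGHT) continue;      /* clip to the frame */
        for (int x = cx - size / 2; x < cx + size / 2; x++) {
            if (x < 0 || x >= WIDTH) continue;
            draw_buf[y * WIDTH + x] = erase ? TRANSPARENT : colour;
        }
    }
}
```

In the real system the draw/erase choice comes from the polled GPIO switch values; here it is simply passed in as the `erase` flag.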
• VIDEO OUTPUT FROM MEMORY
The display buffer is then read by another AXI Video Direct Memory Access block (axi_vdma_1),
which converts it into a stream. The stream is passed through an AXI4-Stream Data FIFO
(axis_data_fifo_0) for synchronization before continuing to the semi-custom vga640x320
module (instantiated as vga640x480_0). This module converts the stream to VGA-compliant
signals to drive a screen at 60 Hz. Much of the code for the vga640x320 block, specifically the
various pulse timings, was taken from an online tutorial [1].
• MISCELLANEOUS
The Memory Interface Generator (mig_7series_0) block is the memory controller for the off-chip
DDR SDRAM. The Block Memory Generator Subsystem (microblaze_0_local_memory),
consisting of Local Memory Buses, LMB BRAM Controllers, and Block Memory Generator
blocks, is the interface to the BRAM used by the processor to store instructions and the stack.
The AXI Uartlite (axi_uartlite_0) is the programming interface used to modify instructions or
debug the system. The AXI Crossbar Subsystem (microblaze_0_axi_periph), consisting of AXI
Data Width Converters on the ports of an AXI Crossbar, is the crossbar interconnect which
connects all major system components as shown in the block diagram. The remaining blocks
are clock, reset, or interrupt modules.
Project Outcome
The application was completed in terms of the proposed functionality. The position of the LED is
tracked correctly and smoothly. The size of the LED as it appears on the screen is also
obtained during tracking, which allows the paint brush to change size dynamically. The user can
also choose to draw, erase, or stand by using the toggle switches. The application successfully
delivers the proposed features, creating an augmented reality drawing experience for the user
with integrated custom hardware and software.
The implementation solves the challenge of performing a real-time computer vision task using
combined hardware and software solutions. Initially the LED tracking algorithm was simulated
in software and tested on a PC. Simulation results suggested that although the algorithm can
function as an LED detector, it is prone to background noise and requires a powerful processor.
Therefore, much of the development effort was spent on improving tracking accuracy and
optimizing the performance of the algorithm through integrated hardware and software
solutions. The successful adaptation of the algorithm provided the foundation for developing the
application to support an interactive drawing experience.
However, there are areas for improvement. Although the application completes the tracking
and drawing task, the quality of the experience could be improved, as the frame rate does not
keep up with the camera (30 fps). This is mainly due to (1) accumulated delays in the hardware
components, such as the compositor, the VDMAs, and the MicroBlaze processor, when
performing intensive memory operations on the AXI bus; and (2) limited BRAM on the
MicroBlaze processor, which forces the software to extend processing buffers into DDR
memory, which is slower to access. In addition, the TFT module from Xilinx was initially used to
display the output via VGA, but it caused flickering when memory was accessed while the TFT
was issuing reads.
These areas could be improved if the hardware platform provided more computing power for
processing data. It would also help if multiple memory buses could be used to avoid congestion,
particularly display underflow, on the AXI interconnect. Switching to a higher-end FPGA platform
could certainly improve performance. In addition, the architecture could be changed from
memory-based operation to streaming operation, where a large FIFO holds one frame of video
data and pixels are processed in sequence. This would reduce memory bus traffic, although
memory would still be required for the compositor to overlay the drawing and for the software
component. For future development, we would like to try the following:
1. Change the encoding of the frame buffer to occupy less memory and use less bandwidth
2. Use a FIFO to hold one image frame of data to reduce memory accesses and delay
3. Use multiple buses and memories to separate independent tasks for parallel processing
and to reduce memory traffic
4. Experiment with migrating more software components into hardware, and with parallelizing
image processing into patches to speed up the process
Project Schedule
Table 2: List of Original and Modified Weekly Milestones
Milestone #1
  Original tasks:
  - Create and test the VGA module and output a solid colour to a PC monitor
  - Create the infrastructure with the MicroBlaze processor
  Revised tasks:
  - Created the MicroBlaze system
  - Created the TFT module

Milestone #2
  Original tasks:
  - Integrate the video camera into the design and display the captured frames on the
    PC monitor
  - Create and test the compositor with video input and colour patterns in memory;
    display the frames on a PC monitor
  - Create LED pens with triggers
  Revised tasks:
  - Imported DDR memory into the block design
  - Displayed pixels on VGA and wrote a short program drawing a bouncing horizontal
    line across the screen
  - Imported interrupts into the block design for a possible double-buffering
    configuration
  - Attempted to build the camera and searched for resources
  - Soldered LED circuits
  - Created the compositor custom IP block
  - Tested the compositor on a fixed solid-colour camera frame buffer
  - Changed the overlay frame buffer with delayed double buffering, which appears to
    work

Milestone #3
  Original tasks:
  - Implement a preliminary LED detection algorithm to detect the position of the brush
  - Create and test the LED detector with the compositor and output the frame to a
    monitor
  Revised tasks:
  - Researched and compared different video tracking algorithms
  - Implemented a tracking algorithm based on the simple video tracking example in the
    lecture slides, plus connected-component labelling
  - Built a system using the AXI DataMover to capture the camera stream and write it
    to memory (consisting of the AXI DataMover, two AXI FIFOs for commands and data,
    and a Camera IP core which wraps the OV7670 and generates commands)
  - Ran simulations on individual parts and built up to the final system

Milestone #4
  Original tasks:
  - Improve the LED detector design to detect the area of the brush
  - Implement the drawing and erasing functions
  Revised tasks:
  - Improved the robustness of the LED tracking algorithm in software; the current
    build can handle background noise (i.e. bright spots and light sources other than
    our LED)
  - Experimented with a different tracking algorithm and compared it to the working
    algorithm
  - Made hacks to fine-tune the algorithm to run on the FPGA with limited resources
  - Separated a portion of the code and helped to implement it in hardware as the
    PixelFilter block

Milestone #5
  Original tasks:
  - Improve the LED detector module's accuracy and noise reduction
  - Add support for colours and drawing with multiple colours
  - P1 testing
  Revised tasks:
  - Started seeing something on the screen
  - Flipped a bit in the I2C configuration to change from QVGA to VGA, with
    corresponding changes to the AXI DataMover command generation
  - Now getting full-screen output, but with incorrect colour and a circular shift of
    lines

Milestone #6
  Original tasks:
  - Complete outstanding tasks
  - Complete P2 and P3 tests
  Revised tasks:
  - Discovered a few bugs in the compositor code, including the done signal always
    being set high and the wrong reset value coming into the compositor
  - Debugged the VDMA logic

Milestone #7
  Original tasks:
  - Report writing
  Revised tasks:
  - Recreated the Compositor block and the VDMA blocks
Major Changes to Original Schedule
The main difference between the two schedules is that most tasks on the ideal schedule last
only one week, while on the real schedule most tasks took multiple weeks to complete. For
example, the compositor, which was created in week 2, had a bug that persisted until week 6.
Reading data from the camera also took at least half of the entire scheduled time to complete.
Our team did not allocate the correct amount of time to each block. There was also a lot of
datasheet reading and simulation work required to complete the project, which we did not
account for in our schedule.
Descriptions of System Components
The project used both IP supplied by Xilinx and custom IP and components. The IP supplied
by Xilinx is shown in the appendix, and the IP and components developed by our team are
described below:
• VIDEO INPUT SUBSYSTEM
This subsystem consists of the OV7670 decoder block and the AXI VDMA memory interfacing
block. The OV7670 decoder is a semi-custom IP consisting of the following internal modules:
▪ I2C_AV_Config – this was provided by the course instructors in a tutorial and it functions
to send a configuration bitstream to the OV7670 Camera. The only changes made were
to some of the configuration registers. COM7[4] was cleared to switch from QVGA to
VGA, and RGB444[1] was cleared and COM15[5:4] set to 01 to switch from RGB444 to
RGB565. Please see the camera data sheet for more details [2].
▪ ov7670_capture – this was also provided. The changes were to the state machine so
that it would generate addresses for the wider VGA range instead of the QVGA range.
▪ debounce – this was also provided to debounce one of the board push buttons. No
changes were made.
▪ ov7670_top – this was also provided. A state machine was added in order to set AXI-
Stream signals to communicate with the VDMA (tready, tvalid, tlast, fsync). Internal
BRAM and clock generators were removed. Finally, the output pixel data is a 32-bit
value padded with extra zeroes to convert to the RGB888 format required by the TFT in
the original system. The padding is shown below:
OV7670 Camera output (RGB565): RRRR RGGG GGGB BBBB
OV7670 decoder output (RGB888): 0000 0000 RRRR R000 GGGG GG00 BBBB B000
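The padding above can be expressed in C as a software model of what the hardware does by wiring in constant zeroes; the function name is ours, not part of the project code.

```c
#include <stdint.h>

/* Expand an RGB565 camera pixel (RRRRRGGGGGGBBBBB) into the zero-padded
 * 32-bit RGB888 word expected by the TFT:
 *   0000 0000 RRRR R000 GGGG GG00 BBBB B000
 * The low bits of each channel stay zero, matching the hardware padding. */
static uint32_t rgb565_to_rgb888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1F;  /* 5 red bits   */
    uint32_t g = (p >> 5)  & 0x3F;  /* 6 green bits */
    uint32_t b =  p        & 0x1F;  /* 5 blue bits  */
    return (r << 19) | (g << 10) | (b << 3);
}
```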
The decoder block runs passively and continuously at 30 frames per second, regardless of
whether the VDMA is accepting data. At the beginning of each frame, it sends the VDMA an
fsync pulse for frame synchronization.
The VDMA is a Xilinx IP, and version 6.2 [3] was used. Only the stream to memory-mapped
(S2MM) write channel is enabled, with support for only a single frame buffer. The maximum
burst size was increased to 256 to reduce protocol overhead, following display underflows
caused by the TFT Controller in the original design. A line buffer of depth 4096 is used to store
an entire display line. No tests were done to optimize this, but the buffer needs to be
sufficiently large to keep up with the passive decoder's data stream. fsync synchronization was
used, and asynchronous slave and master clocks were used to cross from the camera and
decoder's 25.175 MHz domain into the main system's 100 MHz domain.
• VIDEO OUTPUT SUBSYSTEM
This subsystem consists of the vga640x320 encoder block and the AXI VDMA memory
interfacing block, with an AXI4-Stream Data FIFO in between.
Similar to the video input subsystem, a Xilinx VDMA 6.2 was used, but this time in memory-
mapped to stream (MM2S) read mode. The same 256 maximum burst size and fsync
synchronization were used. The VDMA block supports asynchronous clocks, but simulations
with asynchronous clocks were not working late in development, so a FIFO was used instead.
The FIFO and VDMA both have 512-deep data buffers, for a total depth of 1024, which was
sufficient in tests.
The vga640x320 encoder block is a semi-custom block adapted from an online tutorial by Drew
Fustini [1], which showed how to display colour bars. The colour bars were replaced with
stream data, the bit depth was increased, and signals for AXI-Stream compliance were added.
The block sets RGB444 signals as well as VGA vsync and hsync signals for a 60 Hz refresh
rate. Like the OV7670 decoder block, it runs passively and signals the VDMA with an fsync
pulse at the start of each frame. While it is still susceptible to underflow, it is much better than
the Xilinx TFT Controller in this regard. The suspected reason is that it uses significantly less
bandwidth, since it only reads the part of the buffer which appears on the screen (~1.2 MB) and
ignores the porches (~2 MB altogether). The large buffer could also be a factor. As such, this
subsystem was used to replace the TFT Controller. Since the frame buffer encoding was set
based on the TFT's RGB888 requirement, it could now be changed to a more memory- and
bandwidth-efficient encoding.
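To put the ~1.2 MB figure in context, the arithmetic can be sketched as below, assuming 4 bytes per pixel in the zero-padded RGB888 buffer; the 60 Hz bandwidth figure is our own derived number, not one stated in the report.

```c
/* Visible 640x480 region read by the vga640x320 block each refresh,
 * at 4 bytes per pixel (padded RGB888). */
enum { H_VISIBLE = 640, V_VISIBLE = 480, BYTES_PER_PIXEL = 4 };

static long visible_bytes(void)
{
    return (long)H_VISIBLE * V_VISIBLE * BYTES_PER_PIXEL;  /* 1,228,800 ~ 1.2 MB */
}

/* Sustained read bandwidth needed to refresh the screen at 60 Hz. */
static long read_bandwidth_60hz(void)
{
    return visible_bytes() * 60;  /* ~73.7 MB/s */
}
```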
• COMPOSITOR
The compositor is an AXI Full peripheral using 256-beat burst transactions. The block is
composed of a slave interface and a master interface. The slave interface is used to obtain the
address of the frame buffer for the camera, the address of the draw buffer for the MicroBlaze,
and the address of the display buffer for the VGA. It also has start and done registers for more
control over the device.
Base Address + Offset   Function
0x00                    Frame Buffer Address – base address of the data from the video stream
0x04                    Draw Buffer Address – base address of the data from the drawing stream
0x08                    Display Buffer Address – base address of the display stream
0x0C                    Start – the compositor runs while bit 0 is 1 and stops when bit 0 is 0
0x10 to 0x1C            (no function assigned)
The master interface uses the addresses obtained from the slave and reads those locations in
memory to obtain the data. The master reads 640 x 480 pixels of data from the frame buffer,
reads the same amount from the draw buffer, and writes the same amount to the display buffer.
It sets the done signal high when the transfer is complete.
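The master's per-pixel overlay behaviour can be modelled in C as follows. How transparency is encoded in the draw buffer is not documented here, so the TRANSPARENT sentinel is an assumption made for the sketch.

```c
#include <stdint.h>
#include <stddef.h>

#define TRANSPARENT 0xFF000000u  /* assumed "no paint" sentinel */

/* Software model of the compositor master: for each pixel, pass the
 * camera frame through unless the draw buffer has paint there, in
 * which case the drawn pixel takes priority. */
static void composite(const uint32_t *frame, const uint32_t *draw,
                      uint32_t *display, size_t npixels)
{
    for (size_t i = 0; i < npixels; i++)
        display[i] = (draw[i] == TRANSPARENT) ? frame[i] : draw[i];
}
```

The real block performs this transfer with 256-beat AXI bursts rather than one pixel at a time, and raises the done register when the full 640 x 480 transfer completes.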
• SOFTWARE COMPONENT
The software implements the Connected-Component Labelling (CCL) algorithm to search for
the LED. The CCL algorithm is heavily optimized to operate with limited hardware resources,
and it also integrates with hardware blocks to improve efficiency. The CCL algorithm was
chosen for its ability to detect regions of interest (the LED) and record the size of each region.
CCL makes it possible to dynamically resize the paint brush, and it also reliably tracks the LED.
At a high level, the CCL algorithm labels connected regions by comparing neighbouring pixels
at every pixel and determining whether they share the same label, which implies they represent
the same object. Original CCL algorithms typically consist of two passes: the first pass labels
all pixels of interest in the image (in our case, white pixels), and the second pass aggregates
equivalent labels and finalizes the output blobs. A visual demonstration is shown below.
CCL requires a data structure to keep track of equivalent labels, as this information is needed
in the second pass to remove duplicate labels. It is typically implemented as a graph, and a
union-find algorithm is used to aggregate the labels. Due to the limited hardware resources, our
algorithm was modified to use an array to keep track of label equivalency, which effectively
saves memory and improves cache reuse.
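The array-based equivalence table can be sketched as below. This is a minimal hypothetical version for illustration, not the project's exact code; the MAX_LABELS bound and function names are our own.

```c
#include <stdint.h>

#define MAX_LABELS 256

/* parent[l] points towards the representative of l's equivalence class;
 * a label is a root when parent[l] == l. A flat array replaces the
 * usual graph structure, saving memory and improving cache reuse. */
static uint16_t parent[MAX_LABELS];

static void labels_init(void)
{
    for (uint16_t l = 0; l < MAX_LABELS; l++)
        parent[l] = l;
}

/* find with path compression: walk up to the root, then point the
 * whole chain directly at it so later lookups are O(1)-ish. */
static uint16_t find(uint16_t l)
{
    uint16_t root = l;
    while (parent[root] != root)
        root = parent[root];
    while (parent[l] != root) {
        uint16_t next = parent[l];
        parent[l] = root;
        l = next;
    }
    return root;
}

/* Record during the first pass that labels a and b touch, i.e. they
 * belong to the same blob. */
static void merge(uint16_t a, uint16_t b)
{
    parent[find(a)] = find(b);
}
```

The second pass then replaces every pixel's label with find(label), so all equivalent labels collapse into one blob.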
The completed data flow diagram is shown below:
The new frame is first reduced in size by the partial frame filter, which restricts the search
region to a small window where the LED is likely to be present. The CCL algorithm then
operates on the reduced frame to output a list of blobs that are likely to be the LED. Lastly, the
average position of pixels matching the LED's colour is returned by the pixel filter in hardware,
which gives a rough estimate of where the LED is. This information is used to select the blob
that is most likely to be the LED. Once the LED is found, its position and size are used to draw
on the screen as well as to define the search window for the next frame. With this
implementation, the processing load is significantly reduced. LED discovery employs an
ensemble approach, in which the LED position output by CCL in software is combined with the
LED position estimated in hardware to make the decision. This approach, along with the
reduced search region, largely eliminates noise. Overall, the software is optimized for this
application to run on this platform, and it performs as expected.
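The ensemble selection step, choosing the CCL blob closest to the hardware estimate, can be sketched as below. This is a minimal illustrative version; the real selection may also weigh blob size or reject blobs beyond a distance threshold.

```c
#include <stdint.h>

typedef struct {
    int x, y;   /* blob centroid       */
    int size;   /* blob area in pixels */
} Blob;

/* Select the blob whose centroid is closest to the rough LED position
 * (hw_x, hw_y) reported by the PixelFilter hardware. Returns the index
 * of the chosen blob, or -1 if the list is empty. Squared distance
 * avoids a needless sqrt. */
static int select_blob(const Blob *blobs, int nblobs, int hw_x, int hw_y)
{
    int best = -1;
    long best_d2 = -1;
    for (int i = 0; i < nblobs; i++) {
        long dx = blobs[i].x - hw_x;
        long dy = blobs[i].y - hw_y;
        long d2 = dx * dx + dy * dy;
        if (best < 0 || d2 < best_d2) {
            best = i;
            best_d2 = d2;
        }
    }
    return best;
}
```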
Description of Design Tree
The project files can be viewed on GitHub here: https://github.com/ddifelice/G01_PaintWithVision.
PaintWithVision: main project directory, consisting of the integrated IP and software.
PixelFilter_1.0: custom hardware IP for detecting the average position of all blue pixels in the image. It is used to assist the LED (blue) detection in software.
VGA: custom hardware IP for outputting an image via VGA.
compositor_w_burst_0: custom hardware IP directory for the burst compositor.
Tips and Tricks
• Know your system. The DDR became a major stumbling block for video-to-memory
processing and will be an issue in any design with high bandwidth requirements.
• See if there are other IP blocks available before resorting to creating your own custom blocks.
• Simulate before synthesis. A lot of time can be saved by ensuring your system works as
expected before synthesizing.
References
[1] D. Fustini. (2013, April 12). Draw VGA color bars with FPGA in Verilog. [Forum]. Available:
http://www.element14.com/community/thread/23394/l/draw-vga-color-bars-with-fpga-in-verilog?displayFullThread=true
[2] OmniVision. (2006, August 21). OV7670/OV7171 CMOS VGA (640x480) CameraChip Sensor with
OmniPixel Technology. [Datasheet].
[3] Xilinx. (2014, April 2). LogiCORE IP AXI Video Direct Memory Access v6.2. [Datasheet]. Available:
http://www.xilinx.com/support/documentation/ip_documentation/axi_vdma/v6_2/pg020_axi_vdma.pdf