ECE532 GROUP REPORT
Paint With Vision By Team #1
Amanjit Dhillon, Daniel Di Felice, Yusheng Wu
Table of Contents
Overview
Block Diagram
Brief Description of IP
Project Outcome
Project Schedule
Descriptions of System Components
Description of Design Tree
Tips and Tricks
Overview
The Paint With Vision project aimed to create an interactive experience for the user by solving a
computer vision challenge with custom hardware and software solutions. At a high level, an
LED is used to simulate a paint brush, while the LED's position is tracked by a camera and used
to draw on the screen. The drawing is overlaid on top of the video feed, which provides real-time
feedback to the user while creating the virtual experience.
Object tracking is a challenging problem in computer vision. While many algorithms exist, most
implementations are in software, and in general they require high-performance computer
hardware to handle the data processing for real-time applications. Our motivation for this project
was therefore to explore using custom hardware to aid a computer vision task in software. The
goals of this project were to (1) create an interactive augmented reality experience for the user
with the painting application; (2) implement custom hardware and software to realize LED
tracking and overlay painting on the camera video stream; (3) design the painting application
with usability in mind by including handy features such as erasing and dynamic brush size
changes; and (4) handle noise in object tracking and ensure the robustness of the tracking
algorithm against environmental variables.
Block Diagram
[Figure: system block diagram. An AXI Crossbar Interconnect connects the MicroBlaze
processor (with its data cache, BRAM instruction memory, and BRAM stack memory), the
Memory Interface Generator (off-chip DDR SDRAM holding the frame buffers and heap), the
Camera S2MM AXI Video DMA fed by the OV7670 decoder (OV7670 camera video input), the
VGA MM2S AXI Video DMA feeding an AXI-Stream Data FIFO and the vga640x320 block
(VGA video output), the Compositor, GPIO (switches enable draw/erase), and UART Lite (USB
programming interface). Legend: existing Xilinx IP, custom IP, and custom software IP; the
custom software IP comprises the Partial Frame Filter, Connected Component Labelling, and
Blob Select Draw or Erase.]
Brief Description of IP
Table 1: List of All IP Blocks Used in the Design

Instance Name               Block Name                                   Version   Origin
axi_gpio_0                  AXI GPIO                                     2.0       Xilinx
axi_intc_0                  AXI Interrupt Controller                     4.1       Xilinx
axi_uartlite_0              AXI Uartlite                                 2.0       Xilinx
axi_vdma_0                  AXI Video Direct Memory Access               6.2       Xilinx
axi_vdma_1                  AXI Video Direct Memory Access               6.2       Xilinx
axis_data_fifo_0            AXI4-Stream Data FIFO                        1.1       Xilinx
clk_wiz_1                   Clocking Wizard                              5.1       Xilinx
compositor_w_burst_0        compositor_w_burst_v1.0                      1.0       Full Custom
mdm_1                       MicroBlaze Debug Module                      3.1       Xilinx
microblaze_0                MicroBlaze                                   9.3       Xilinx
microblaze_0_axi_periph     AXI Crossbar Subsystem                                 Xilinx
                              AXI Data Width Converter                   2.1       Xilinx
                              AXI Crossbar                               2.1       Xilinx
microblaze_0_local_memory   Block Memory Generator Subsystem                       Xilinx
                              Local Memory Bus                           1.0/3.0   Xilinx
                              LMB BRAM Controller                        4.0       Xilinx
                              Block Memory Generator                     8.2       Xilinx
mig_7series_0               Memory Interface Generator (MIG 7 Series)    2.0       Xilinx
OV7670_0                    OV7670                                       1.0       Semi-Custom (some instructor-provided IP)
proc_sys_reset_0            Processor System Reset                       5.0       Xilinx
rst_clk_wiz_1_100M          Processor System Reset                       5.0       Xilinx
rst_mig_7series_0_81M       Processor System Reset                       5.0       Xilinx
vga640x480_0                vga640x320_v1_0                              1.0       Semi-Custom (with online IP)
• VIDEO INPUT TO MEMORY
A semi-custom OV7670 decoder block (OV7670_0) is used to configure the OV7670 camera
peripheral and convert its incoming signals to an AXI-Stream. Most of the internal
components in the block were adapted from code provided by the course instructors, with
modifications made to increase the resolution and bit depth and to make the output signals
AXI-Stream compliant. The stream is then pushed into an AXI Video Direct Memory Access
block (axi_vdma_0), which writes the camera frame buffer out to main memory (DDR SDRAM)
at 30 frames per second.
• PROCESSOR CODE
The MicroBlaze processor (microblaze_0) reads in the camera frame buffer and runs a detection
algorithm to determine the position, size, and colour of the LED, if it is visible in the image. It
then writes pixel data for a square to a second, drawing frame buffer in memory, which is
initialized to be fully transparent at the start of system operation. The square is written at the
same coordinates as the LED in the camera frame buffer and has the same size and colour.
The processor also periodically polls the AXI GPIO (axi_gpio_0) to check the position of the
board switches: switch 0 enables drawing, while switch 1 enables the erase function. The
processor also has an internal data cache for main memory accesses.
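The drawing step described above can be sketched in C. This is an illustrative software model rather than the project's exact code: the 32-bit pixel layout, the TRANSPARENT sentinel value, and the function name are assumptions made for the sketch.

```c
#include <stdint.h>

#define WIDTH  640
#define HEIGHT 480
#define TRANSPARENT 0xFF000000u  /* assumed sentinel meaning "no paint here" */

/* Write (draw) or clear (erase) a square of side `size` centred at the
 * LED's detected coordinates (cx, cy) into the drawing frame buffer. */
static void stamp_square(uint32_t *draw_buf, int cx, int cy, int size,
                         uint32_t colour, int erase)
{
    for (int y = cy - size / 2; y < cy + size / 2; y++) {
        if (y < 0 || y >= HEIGHT) continue;      /* clip to the frame */
        for (int x = cx - size / 2; x < cx + size / 2; x++) {
            if (x < 0 || x >= WIDTH) continue;
            draw_buf[y * WIDTH + x] = erase ? TRANSPARENT : colour;
        }
    }
}
```

In the real system the draw/erase choice comes from the polled GPIO switch values; here it is simply passed in as the `erase` flag.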
• VIDEO OUTPUT FROM MEMORY
The display buffer is then read by another AXI Video Direct Memory Access block (axi_vdma_1),
which converts it into a stream. The stream is passed through an AXI4-Stream Data FIFO
(axis_data_fifo_0) for synchronization before continuing to the semi-custom vga640x320
module (instantiated as vga640x480_0). This module converts the stream to VGA-compliant
signals to drive a screen at 60 Hz. Much of the code for the vga640x320 block, specifically the
various pulse timings, was taken from an online tutorial [1].
• MISCELLANEOUS
The Memory Interface Generator (mig_7series_0) block is the memory controller for the off-chip
DDR SDRAM. The Block Memory Generator Subsystem (microblaze_0_local_memory),
consisting of Local Memory Buses, LMB BRAM Controllers, and Block Memory Generator
blocks, is the interface to the BRAM used by the processor to store instructions and the stack.
The AXI Uartlite (axi_uartlite_0) is the programming interface used to modify instructions or
debug the system. The AXI Crossbar Subsystem (microblaze_0_axi_periph), consisting of AXI
Data Width Converters on the ports of an AXI Crossbar, is the crossbar interconnect which
connects all major system components as shown in the block diagram. The remaining blocks
are clock, reset, or interrupt modules.
Project Outcome
The application was completed in terms of the proposed functionality. The position of the LED is
tracked correctly and smoothly. The size of the LED as it appears on the screen is also
obtained during tracking, which allows the paint brush to change size dynamically. The user can
also choose to draw, erase, or stand by using the toggle switches. The application successfully
delivers the proposed features, creating an augmented reality drawing experience for the user
with integrated custom hardware and software.
The implementation solves the challenge of performing a real-time computer vision task using
combined hardware and software solutions. Initially the LED tracking algorithm was simulated
in software and tested on a PC. Simulation results suggested that although the algorithm can
function as an LED detector, it is prone to background noise and requires a powerful processor.
Therefore, much of the development effort was spent on improving tracking accuracy and
optimizing the performance of the algorithm through integrated hardware and software
solutions. The successful adaptation of the algorithm provided the foundation for developing the
application to support an interactive drawing experience.
However, there are areas for improvement. Although the application completes the tracking
and drawing task, the quality of the experience could be improved, as the frame rate does not
keep up with the camera (30 fps). This is mainly due to (1) accumulated delays in the hardware
components, such as the compositor, the VDMAs, and the MicroBlaze processor, when
performing intensive memory operations on the AXI bus; and (2) limited BRAM on the
MicroBlaze processor, which forces the software to extend processing buffers into DDR
memory, which is slower to access. In addition, the TFT module from Xilinx was initially used to
display the output via VGA, but it caused flickering when memory was accessed while the TFT
was issuing reads.
These areas could be improved if the hardware platform provided more computing power for
processing data. It would also help if multiple memory buses could be used to avoid congestion,
particularly display underflow, on the AXI interconnect. Switching to a higher-end FPGA platform
could certainly improve performance. In addition, the architecture could be changed from
memory-based operation to streaming operation, where a large FIFO holds one frame of video
data and pixels are processed in sequence. This would reduce memory bus traffic, although
memory would still be required for the compositor to overlay the drawing and for the software
component. For future development, we would like to try the following:
1. Change the encoding of the frame buffer to occupy less memory and use less bandwidth
2. Use a FIFO to hold one image frame of data to reduce memory accesses and delay
3. Use multiple buses and memories to separate independent tasks for parallel processing
and to reduce memory traffic
4. Experiment with migrating more software components into hardware, and with parallelizing
image processing into patches to speed up the process
Project Schedule
Table 2: List of Original and Modified Weekly Milestones
Milestone #1
  Original tasks:
  - Create and test the VGA module and output a solid colour to a PC monitor
  - Create the infrastructure with the MicroBlaze processor
  Revised tasks:
  - Created the MicroBlaze system
  - Created the TFT module

Milestone #2
  Original tasks:
  - Integrate the video camera into the design and display the captured frames on the
    PC monitor
  - Create and test the compositor with video input and colour patterns in memory;
    display the frames on a PC monitor
  - Create LED pens with triggers
  Revised tasks:
  - Imported DDR memory into the block design
  - Displayed pixels on VGA and wrote a short program drawing a bouncing horizontal
    line across the screen
  - Imported interrupts into the block design for a possible double-buffering
    configuration
  - Attempted to build the camera and searched for resources
  - Soldered LED circuits
  - Created the compositor custom IP block
  - Tested the compositor on a fixed solid-colour camera frame buffer
  - Changed the overlay frame buffer with delayed double buffering, which appears to
    work

Milestone #3
  Original tasks:
  - Implement a preliminary LED detection algorithm to detect the position of the brush
  - Create and test the LED detector with the compositor and output the frame to a
    monitor
  Revised tasks:
  - Researched and compared different video tracking algorithms
  - Implemented a tracking algorithm based on the simple video tracking example in the
    lecture slides, plus connected-component labelling
  - Built a system using the AXI DataMover to capture the camera stream and write it
    to memory (consisting of the AXI DataMover, two AXI FIFOs for commands and data,
    and a Camera IP core which wraps the OV7670 and generates commands)
  - Ran simulations on individual parts and built up to the final system

Milestone #4
  Original tasks:
  - Improve the LED detector design to detect the area of the brush
  - Implement the drawing and erasing functions
  Revised tasks:
  - Improved the robustness of the LED tracking algorithm in software; the current
    build can handle background noise (i.e. bright spots and light sources other than
    our LED)
  - Experimented with a different tracking algorithm and compared it to the working
    algorithm
  - Made hacks to fine-tune the algorithm to run on the FPGA with limited resources
  - Separated a portion of the code and helped to implement it in hardware as the
    PixelFilter block

Milestone #5
  Original tasks:
  - Improve the LED detector module's accuracy and noise reduction
  - Add support for colours and drawing with multiple colours
  - P1 testing
  Revised tasks:
  - Started seeing something on the screen
  - Flipped a bit in the I2C configuration to change from QVGA to VGA, with
    corresponding changes to the AXI DataMover command generation
  - Now getting full-screen output, but with incorrect colour and a circular shift of
    lines

Milestone #6
  Original tasks:
  - Complete outstanding tasks
  - Complete P2 and P3 tests
  Revised tasks:
  - Discovered a few bugs in the compositor code, including the done signal always
    being set high and the wrong reset value coming into the compositor
  - Debugged the VDMA logic

Milestone #7
  Original tasks:
  - Report writing
  Revised tasks:
  - Recreated the Compositor block and the VDMA blocks
Major Changes to Original Schedule
The main difference between the two schedules is that most tasks on the ideal schedule last
only one week, while on the real schedule most tasks took multiple weeks to complete. For
example, the compositor, which was created in week 2, had a bug that persisted until week 6.
Reading data from the camera also took at least half of the entire scheduled time to complete.
Our team did not allocate the correct amount of time to each block. There was also a lot of
datasheet reading and simulation work required to complete the project, which we did not
account for in our schedule.
Descriptions of System Components
The project used both IP supplied by Xilinx and custom IP and components. The IP supplied
by Xilinx is shown in the appendix, and the IP and components developed by our team are
described below:
• VIDEO INPUT SUBSYSTEM
This subsystem consists of the OV7670 decoder block and the AXI VDMA memory interfacing
block. The OV7670 decoder is a semi-custom IP consisting of the following internal modules:
▪ I2C_AV_Config – this was provided by the course instructors in a tutorial and it functions
to send a configuration bitstream to the OV7670 Camera. The only changes made were
to some of the configuration registers. COM7[4] was cleared to switch from QVGA to
VGA, and RGB444[1] was cleared and COM15[5:4] set to 01 to switch from RGB444 to
RGB565. Please see the camera data sheet for more details [2].
▪ ov7670_capture – this was also provided. The changes were to the state machine so
that it would generate addresses for the wider VGA range instead of the QVGA range.
▪ debounce – this was also provided to debounce one of the board push buttons. No
changes were made.
▪ ov7670_top – this was also provided. A state machine was added in order to set AXI-
Stream signals to communicate with the VDMA (tready, tvalid, tlast, fsync). Internal
BRAM and clock generators were removed. Finally, the output pixel data is a 32-bit
value padded with extra zeroes to convert to the RGB888 format required by the TFT in
the original system. The padding is shown below:
OV7670 Camera output (RGB565): RRRR RGGG GGGB BBBB
OV7670 decoder output (RGB888): 0000 0000 RRRR R000 GGGG GG00 BBBB B000
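The padding above can be expressed in C as a software model of what the hardware does by wiring in constant zeroes; the function name is ours, not part of the project code.

```c
#include <stdint.h>

/* Expand an RGB565 camera pixel (RRRRRGGGGGGBBBBB) into the zero-padded
 * 32-bit RGB888 word expected by the TFT:
 *   0000 0000 RRRR R000 GGGG GG00 BBBB B000
 * The low bits of each channel stay zero, matching the hardware padding. */
static uint32_t rgb565_to_rgb888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1F;  /* 5 red bits   */
    uint32_t g = (p >> 5)  & 0x3F;  /* 6 green bits */
    uint32_t b =  p        & 0x1F;  /* 5 blue bits  */
    return (r << 19) | (g << 10) | (b << 3);
}
```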
The decoder block runs passively and continuously at 30 frames per second, regardless of
whether the VDMA is accepting data. At the beginning of each frame, it sends the VDMA an
fsync pulse for frame synchronization.
The VDMA is a Xilinx IP, and version 6.2 [3] was used. Only the stream to memory-mapped
(S2MM) write channel is enabled, with support for only a single frame buffer. The maximum
burst size was increased to 256 to reduce protocol overhead, following display underflows
caused by the TFT Controller in the original design. A line buffer of depth 4096 is used to store
an entire display line. No tests were done to optimize this, but the buffer needs to be
sufficiently large to keep up with the passive decoder's data stream. fsync synchronization was
used, and asynchronous slave and master clocks were used to cross from the camera and
decoder's 25.175 MHz domain into the main system's 100 MHz domain.
• VIDEO OUTPUT SUBSYSTEM
This subsystem consists of the vga640x320 encoder block and the AXI VDMA memory
interfacing block, with an AXI4-Stream Data FIFO in between.
Similar to the video input subsystem, a Xilinx VDMA 6.2 was used, but this time in memory-
mapped to stream (MM2S) read mode. The same 256 maximum burst size and fsync
synchronization were used. The VDMA block supports asynchronous clocks, but simulations
with asynchronous clocks were not working late in development, so a FIFO was used instead.
The FIFO and VDMA both have 512-deep data buffers, for a total depth of 1024, which was
sufficient in tests.
The vga640x320 encoder block is a semi-custom block adapted from an online tutorial by Drew
Fustini [1], which showed how to display colour bars. The colour bars were replaced with
stream data, the bit depth was increased, and signals for AXI-Stream compliance were added.
The block sets RGB444 signals as well as VGA vsync and hsync signals for a 60 Hz refresh
rate. Like the OV7670 decoder block, it runs passively and signals the VDMA with an fsync
pulse at the start of each frame. While it is still susceptible to underflow, it is much better than
the Xilinx TFT Controller in this regard. The suspected reason is that it uses significantly less
bandwidth, since it only reads the part of the buffer which appears on the screen (~1.2 MB) and
ignores the porches (~2 MB altogether). The large buffer could also be a factor. As such, this
subsystem was used to replace the TFT Controller. Since the frame buffer encoding was set
based on the TFT's RGB888 requirement, it could now be changed to a more memory- and
bandwidth-efficient encoding.
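To put the ~1.2 MB figure in context, the arithmetic can be sketched as below, assuming 4 bytes per pixel in the zero-padded RGB888 buffer; the 60 Hz bandwidth figure is our own derived number, not one stated in the report.

```c
/* Visible 640x480 region read by the vga640x320 block each refresh,
 * at 4 bytes per pixel (padded RGB888). */
enum { H_VISIBLE = 640, V_VISIBLE = 480, BYTES_PER_PIXEL = 4 };

static long visible_bytes(void)
{
    return (long)H_VISIBLE * V_VISIBLE * BYTES_PER_PIXEL;  /* 1,228,800 ~ 1.2 MB */
}

/* Sustained read bandwidth needed to refresh the screen at 60 Hz. */
static long read_bandwidth_60hz(void)
{
    return visible_bytes() * 60;  /* ~73.7 MB/s */
}
```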
• COMPOSITOR
The compositor is an AXI Full peripheral using 256-beat burst transactions. The block is
composed of a slave interface and a master interface. The slave interface is used to obtain the
address of the frame buffer for the camera, the address of the draw buffer for the MicroBlaze,
and the address of the display buffer for the VGA. It also has start and done registers for more
control over the device.
Base Address + Offset   Function
0x00                    Frame Buffer Address – base address of the data from the video stream
0x04                    Draw Buffer Address – base address of the data from the drawing stream
0x08                    Display Buffer Address – base address of the display stream
0x0C                    Start – the compositor runs while bit 0 is 1 and stops when bit 0 is 0
0x10 to 0x1C            (no function assigned)
The master interface uses the addresses obtained from the slave and reads those locations in
memory to obtain the data. The master reads 640 x 480 pixels of data from the frame buffer,
reads the same amount from the draw buffer, and writes the same amount to the display buffer.
It sets the done signal high when the transfer is complete.
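The master's per-pixel overlay behaviour can be modelled in C as follows. How transparency is encoded in the draw buffer is not documented here, so the TRANSPARENT sentinel is an assumption made for the sketch.

```c
#include <stdint.h>
#include <stddef.h>

#define TRANSPARENT 0xFF000000u  /* assumed "no paint" sentinel */

/* Software model of the compositor master: for each pixel, pass the
 * camera frame through unless the draw buffer has paint there, in
 * which case the drawn pixel takes priority. */
static void composite(const uint32_t *frame, const uint32_t *draw,
                      uint32_t *display, size_t npixels)
{
    for (size_t i = 0; i < npixels; i++)
        display[i] = (draw[i] == TRANSPARENT) ? frame[i] : draw[i];
}
```

The real block performs this transfer with 256-beat AXI bursts rather than one pixel at a time, and raises the done register when the full 640 x 480 transfer completes.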
• SOFTWARE COMPONENT
The software implements the Connected-Component Labelling (CCL) algorithm to search for
the LED. The CCL algorithm is heavily optimized to operate with limited hardware resources,
and it also integrates with hardware blocks to improve efficiency. The CCL algorithm was
chosen for its ability to detect regions of interest (the LED) and record the size of each region.
CCL makes it possible to dynamically resize the paint brush, and it also reliably tracks the LED.
At a high level, the CCL algorithm labels connected regions by comparing neighbouring pixels
at every pixel and determining whether they share the same label, which implies they represent
the same object. Original CCL algorithms typically consist of two passes: the first pass labels
all pixels of interest in the image (in our case, white pixels), and the second pass aggregates
equivalent labels and finalizes the output blobs. A visual demonstration is shown below.
CCL requires a data structure to keep track of equivalent labels, as this information is needed
in the second pass to remove duplicate labels. It is typically implemented as a graph, and a
union-find algorithm is used to aggregate the labels. Due to the limited hardware resources, our
algorithm was modified to use an array to keep track of label equivalency, which effectively
saves memory and improves cache reuse.
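The array-based equivalence table can be sketched as below. This is a minimal hypothetical version for illustration, not the project's exact code; the MAX_LABELS bound and function names are our own.

```c
#include <stdint.h>

#define MAX_LABELS 256

/* parent[l] points towards the representative of l's equivalence class;
 * a label is a root when parent[l] == l. A flat array replaces the
 * usual graph structure, saving memory and improving cache reuse. */
static uint16_t parent[MAX_LABELS];

static void labels_init(void)
{
    for (uint16_t l = 0; l < MAX_LABELS; l++)
        parent[l] = l;
}

/* find with path compression: walk up to the root, then point the
 * whole chain directly at it so later lookups are O(1)-ish. */
static uint16_t find(uint16_t l)
{
    uint16_t root = l;
    while (parent[root] != root)
        root = parent[root];
    while (parent[l] != root) {
        uint16_t next = parent[l];
        parent[l] = root;
        l = next;
    }
    return root;
}

/* Record during the first pass that labels a and b touch, i.e. they
 * belong to the same blob. */
static void merge(uint16_t a, uint16_t b)
{
    parent[find(a)] = find(b);
}
```

The second pass then replaces every pixel's label with find(label), so all equivalent labels collapse into one blob.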
The completed data flow diagram is shown below:
The new frame is first reduced in size by the partial frame filter, which restricts the search
region to a small window where the LED is likely to be present. The CCL algorithm then
operates on the reduced frame to output a list of blobs that are likely to be the LED. Lastly, the
average position of pixels matching the LED's colour is returned by the pixel filter in hardware,
which gives a rough estimate of where the LED is. This information is used to select the blob
that is most likely to be the LED. Once the LED is found, its position and size are used to draw
on the screen as well as to define the search window for the next frame. With this
implementation, the processing load is significantly reduced. LED discovery employs an
ensemble approach, in which the LED position output by CCL in software is combined with the
LED position estimated in hardware to make the decision. This approach, along with the
reduced search region, largely eliminates noise. Overall, the software is optimized for this
application to run on this platform, and it performs as expected.
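The ensemble selection step, choosing the CCL blob closest to the hardware estimate, can be sketched as below. This is a minimal illustrative version; the real selection may also weigh blob size or reject blobs beyond a distance threshold.

```c
#include <stdint.h>

typedef struct {
    int x, y;   /* blob centroid       */
    int size;   /* blob area in pixels */
} Blob;

/* Select the blob whose centroid is closest to the rough LED position
 * (hw_x, hw_y) reported by the PixelFilter hardware. Returns the index
 * of the chosen blob, or -1 if the list is empty. Squared distance
 * avoids a needless sqrt. */
static int select_blob(const Blob *blobs, int nblobs, int hw_x, int hw_y)
{
    int best = -1;
    long best_d2 = -1;
    for (int i = 0; i < nblobs; i++) {
        long dx = blobs[i].x - hw_x;
        long dy = blobs[i].y - hw_y;
        long d2 = dx * dx + dy * dy;
        if (best < 0 || d2 < best_d2) {
            best = i;
            best_d2 = d2;
        }
    }
    return best;
}
```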
Description of Design Tree
The project files can be viewed on GitHub here: https://github.com/ddifelice/G01_PaintWithVision.
PaintWithVision: main project directory, consisting of the integrated IP and software.
PixelFilter_1.0: custom hardware IP for detecting the average position of all blue pixels in the image. It is used to assist the LED (blue) detection in software.
VGA: custom hardware IP for outputting an image via VGA.
compositor_w_burst_0: custom hardware IP directory for the burst compositor.
Tips and Tricks
• Know your system. The DDR became a major stumbling block for video-to-memory
processing and will be an issue in any design with high bandwidth requirements.
• See if there are other IP blocks available before resorting to creating your own custom blocks.
• Simulate before synthesis. A lot of time can be saved by ensuring your system works as
expected before synthesizing.
References
[1] D. Fustini. (2013, April 12). Draw VGA color bars with FPGA in Verilog. [Forum]. Available:
http://www.element14.com/community/thread/23394/l/draw-vga-color-bars-with-fpga-in-verilog?displayFullThread=true
[2] OmniVision. (2006, August 21). OV7670/OV7171 CMOS VGA (640x480) CameraChip Sensor with
OmniPixel Technology. [Datasheet].
[3] Xilinx. (2014, April 2). LogiCORE IP AXI Video Direct Memory Access v6.2. [Datasheet]. Available:
http://www.xilinx.com/support/documentation/ip_documentation/axi_vdma/v6_2/pg020_axi_vdma.pdf