Hardware Implementation of Cascade Support Vector Machine
Qian Wang, Texas A&M University
3/6/2015
Outline
Motivation
Support Vector Machine
– Basic Support Vector Machine
– Cascade Support Vector Machine
– Hardware Architecture of Cascade SVM
– Experimental results
Relevant Works in Our Group
– Memristor-based Neuromorphic Processor
– Liquid State Machine
Everything is becoming more and more data-intensive:
• Bioinformatics researchers often need to process tens of billions of data points.
• The world's fastest radio telescope collects up to 360 TB of data per day.
• Wearable devices process the data obtained from our bodies every day.
What can we do with this "Big Data"?
• Machine learning from a large set of data to reveal relationships and dependencies and to predict outcomes and behaviors;
• The obtained predictive model is then used to interpret and predict new data.
[Images: Human Genome Project, astronomy research, smart healthcare devices, the big data market]
[Images: the "Curiosity" rover on Mars, speech recognition, social networks, bioinformatics]
Machine Learning (Mitchell 1997)
– Learning from past experiences to improve the performance of a certain task
– Applications of machine learning:
– Integrating human expertise into artificial intelligence systems;
– Enabling "Mars rovers" to navigate themselves;
– Speech recognition;
– Extracting hidden information from complex, large data sets:
– Social media analysis; bioinformatics.
Challenges
Machine learning applications on general-purpose CPUs:
• take a huge amount of CPU time (e.g., several weeks or even months);
• have very high energy consumption.
[Image: software simulation on a CPU]
A specific task: Y = AX² + BX + C, using 5-bit fixed-point numbers.
[Figure: a program running on a CPU vs. a dedicated hardware datapath, assuming the same clock rate.]
Our solution
– A dedicated VLSI hardware design is usually much more time- and energy-efficient than a general-purpose CPU:
It is not limited by an instruction set;
It contains only the functional logic necessary for the specific task;
It needs no instruction memory (program code);
It fully exploits hardware parallelism (see the sketch below).
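To make the contrast concrete, here is a minimal C++ sketch of the idea (illustrative only: the 5-bit fixed-point width comes from the slide, but the datapath structure is an assumption, not the slide's exact design). A CPU would evaluate the polynomial as a sequence of fetched and decoded instructions; a dedicated datapath computes the whole expression every clock cycle:

```cpp
#include <cstdint>

constexpr uint8_t MASK5 = 0x1F;  // model 5-bit fixed-point operands

// One "clock cycle" of a dedicated datapath for Y = A*X^2 + B*X + C:
// the multipliers and adders exist as parallel wired logic, with no
// instruction fetch, decode, or instruction memory involved.
uint8_t datapath_cycle(uint8_t A, uint8_t B, uint8_t C, uint8_t X) {
    uint8_t x2 = (X * X) & MASK5;   // squarer
    uint8_t t1 = (A * x2) & MASK5;  // multiplier (operates in parallel)
    uint8_t t2 = (B * X) & MASK5;   // multiplier (operates in parallel)
    return (t1 + t2 + C) & MASK5;   // adder tree produces Y
}
```

A CPU at the same clock rate would instead spend several instruction cycles (fetch, decode, multiply, add) per evaluation.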
How do we design hardware?
Dedicated hardware designs: Application-Specific Integrated Circuits (ASICs) and Field-Programmable Gate Arrays (FPGAs).
The design space spans speed, power, area, and the software algorithms themselves: reconfigurability, potential parallelism, reusability, scalability, hardware-friendly algorithms, binary arithmetic (precision), storage organization, analog-to-digital conversion, memory access styles, resilience, and the various interesting features of the ML algorithm to be realized in HW.
Publications
Support Vector Machine
– [TVLSI'14] Qian Wang, Peng Li and Yongtae Kim, "A parallel digital VLSI architecture for integrated support vector machine training and classification," IEEE Trans. on Very Large Scale Integration Systems.
Spiking Neural Network
– [IEEENano'14] *Qian Wang, *Yongtae Kim and Peng Li, "Architectural design exploration for neuromorphic processors with memristive synapses," in Proc. of the 14th Intl. Conf. on Nanotechnology, August 2014.
– [IEEETNANO'14] *Qian Wang, *Yongtae Kim and Peng Li, "Neuromorphic Processors with Memristive Synapses: Synaptic Crossbar Interface and Architectural Exploration" (under review).
– [TVLSI'15] *Qian Wang, *Youjie Li, *Botang Shao, *Siddharta Dey and Peng Li, "Energy Efficient Parallel Neuromorphic Architectures with Approximate Arithmetic on FPGA" (under review).
Outline
Motivation
Support Vector Machine
– Basic Support Vector Machine
– Cascade Support Vector Machine
– Hardware Architecture of Cascade SVM
– Experimental results
Relevant Works in Our Group
– Memristor-based Neuromorphic Processor
– Liquid State Machine
Support Vector Machine (SVM)
[Figure: the separating hyperplane w^T x + b = 0 in the (x1, x2) plane, with the margin between "+" and "−" samples marked]
Basic idea: construct a separating hyperplane such that the margin of separation between the "+" and "−" samples is maximized.
Primal problem:
Minimize $\Phi(w,\xi)=\frac{1}{2}\lVert w\rVert^{2}+C\sum_{i=1}^{n}\xi_{i}$
s.t. $y_i(w^{T}x_i+b)\ge 1-\xi_i,\ \xi_i\ge 0$.
Applying the method of Lagrange multipliers yields the dual problem:
Maximize $\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i,x_j)$
s.t. $0\le\alpha_i\le C$ and $\sum_{i=1}^{n}\alpha_i y_i=0$,
where the kernel is $K(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$.
A learning and classification algorithm successfully applied to a wide range of real-world pattern recognition problems
The support vectors define the separating hyperplane, which is then used to classify future input vectors as "+" or "−".
[Figure: SVM training takes labeled samples in the (x1, x2) plane and acts as a "filtering process," keeping only the support vectors that define the maximum-margin hyperplane w^T x + b = 0; SVM testing then applies the trained model to unlabeled samples and yields accurate predictions.]
Kernel method: a kernel value is evaluated between every pair of training samples. During SVM training, if there are n samples, the total number of kernel calculations is n²! (A minimal sketch of this cost follows.)
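The sketch below assumes a Gaussian kernel on 2-D samples (the function name and the gamma parameter are illustrative); the nested loop makes the n² kernel evaluations explicit:

```cpp
#include <array>
#include <cmath>
#include <vector>

// Full n-by-n Gaussian kernel matrix over 2-D samples: the nested
// loops perform exactly n * n kernel evaluations, which is what
// dominates SVM training time.
std::vector<std::vector<double>> kernel_matrix(
        const std::vector<std::array<double, 2>>& x, double gamma) {
    const std::size_t n = x.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            const double d0 = x[i][0] - x[j][0];
            const double d1 = x[i][1] - x[j][1];
            K[i][j] = std::exp(-gamma * (d0 * d0 + d1 * d1));
        }
    return K;
}
```

For n = 400 samples this is already 160,000 kernel evaluations, which is why splitting the data into smaller subsets pays off.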
Cascade SVM
[Figure: the original large data set is split into subsets D1–D4 (Di: i-th data set), each trained by a 1st-layer SVM; the resulting support vector (SV) sets SV1–SV4 are pairwise merged and retrained in a 2nd layer, and a final 3rd-layer SVM produces the overall SV set.]
[ H. P. Graf, Proc. Adv. Neural Inf. Process. Syst., 2004 ]
Training process of the basic SVM
– SVM training is time-consuming: it is dominated by kernel evaluations and has O(n²) time complexity.
Parallel SVM (Cascade)
– Multiple smaller sub-data sets are processed in parallel;
– partial results are combined in the 2nd and 3rd layers, where the workload is small.
Global convergence:
– Feed the 3rd-layer result back to the 1st layer to check the KKT conditions;
– samples violating the KKT conditions join the next round of optimization.
Amdahl's law:
– Significant speedup can be achieved if the runtime of the 1st layer dominates (see the sketch below).
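A minimal C++ sketch of this cascade flow (illustrative: train_svm is a stand-in for any binary SVM solver, e.g., SMO; in the hardware, the 1st-layer calls run in parallel):

```cpp
#include <utility>
#include <vector>

struct Sample { double x1, x2; int y; };
using Dataset = std::vector<Sample>;

// Stand-in for a real SVM solver: returns the support vectors of d.
Dataset train_svm(const Dataset& d) { /* e.g., SMO would go here */ return d; }

Dataset cascade_train(std::vector<Dataset> sets) {
    // 1st layer: train each subset independently (parallel in HW).
    for (auto& d : sets) d = train_svm(d);
    // 2nd/3rd layers: pairwise merge SV sets and retrain until one
    // set remains; these layers see far fewer samples.
    while (sets.size() > 1) {
        std::vector<Dataset> next;
        for (std::size_t i = 0; i + 1 < sets.size(); i += 2) {
            Dataset merged = sets[i];
            merged.insert(merged.end(), sets[i + 1].begin(), sets[i + 1].end());
            next.push_back(train_svm(merged));
        }
        if (sets.size() % 2) next.push_back(sets.back());  // odd set carried over
        sets = std::move(next);
    }
    return sets.front();  // final SVs; fed back to layer 1 for the KKT check
}
```

Since the 1st-layer loop dominates the runtime and is exactly the part that parallelizes, Amdahl's law predicts a near-linear speedup.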
Overall HW Architecture
– Array of basic SVM units;
– Distributed cache memories;
– Multi-layer system bus;
– Global FSM as the controller.
[Figure: the SVM array and the distributed memory blocks (holding the binary operands y_i, α_i, x_i) are connected through a multi-layer system bus with a read/write interface and address-mapping control, coordinated by the global controller.]
Critical issues for the detailed implementation:
– How can a moderate number of SVMs be used to construct the HW architecture?
– How can efficient use be made of the on-chip memories?
– Flexibility of each SVM unit in processing variable-sized data sets;
– Different configurations to trade off power, area and throughput.
How can a moderate number of SVMs be used to construct the HW architecture?
[Figure: software data flow of a Cascade SVM: seven SVM nodes process D1–D4 into SV1–SV4, then SV12 and SV34, down to the final SV set.]
The 7 SVMs are not working simultaneously, so we should fully exploit the concept of HW reusability!
• We implement 4 SVMs to perform the 1st-layer training: D1–D4 are stored in distributed memories, and the SVMs access their private memories in parallel.
• For the 2nd layer, we simply reuse 2 of the 4 SVMs. But how can they find SV1 ∪ SV2 or SV3 ∪ SV4?
• Since each merged set spans two memory blocks, we simply need to enable each "reused SVM" to access multiple memory blocks.
The resulting data flow of the HW architecture:
[Figure: (a) 1st layer: four SVMs train D1–D4 in parallel, each reading y, x(1), x(2) from its private MEM, with results saved in its MMU. (b) 2nd layer: two reused SVMs each access two MEM/MMU pairs to read the merged SV sets and write the new α values. (c) 3rd layer: one reused SVM accesses all four MEM/MMU pairs.]
– D1–D4 are stored in MEM1–MEM4;
– the 1st-layer SVMs are implemented in HW and reused for the following layers;
– the training results are saved in the MMUs (explained on the next slide);
– the final data flow is illustrated in the figure above.
How can efficient use be made of the on-chip memories?
[Figure: the support vector index tables inside the MMUs translate the continuous virtual address space seen by one SVM unit into the physical addresses of the SVs in two separate SRAMs (MMU (a)/SRAM (a) and MMU (b)/SRAM (b)).]
MMU (Memory Management Unit):
– records the address of each SV;
– performs the "address mapping" that helps a reused SVM locate the SVs.
The goal is to "identify" the SVs within the original data set, so we only need to record their locations in memory; there is no need to duplicate them into additional storage space.
[Figure: 1st layer, parallel training: each SVM reads y, x(1), x(2) and α from its private MEM while its MMU records the SV addresses. 2nd layer, partial-result combination: the MMUs perform the "address mapping" so that one reused SVM can process D1 and D2 through two MEM/MMU pairs, writing the new α values back.]
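A minimal C++ sketch of this address mapping (illustrative data structures, not the RTL): each MMU holds an index table of SV locations in its own SRAM, and a reused SVM's continuous virtual address is translated into a (bank, physical address) pair.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One MMU: an index table holding the physical SRAM addresses of the
// support vectors identified during 1st-layer training.
struct Mmu {
    std::vector<uint32_t> sv_addr;
};

// A reused SVM issues continuous virtual addresses 0, 1, 2, ... over
// the merged SV set; the MMUs translate them into the SRAM bank and
// physical address where that SV actually resides.
std::pair<std::size_t, uint32_t> map_address(const std::vector<Mmu>& mmus,
                                             uint32_t vaddr) {
    for (std::size_t bank = 0; bank < mmus.size(); ++bank) {
        if (vaddr < mmus[bank].sv_addr.size())
            return {bank, mmus[bank].sv_addr[vaddr]};  // SV found in this bank
        vaddr -= static_cast<uint32_t>(mmus[bank].sv_addr.size());
    }
    return {mmus.size(), 0};  // virtual address beyond the merged SV set
}
```

In the 2nd layer, a reused SVM simply sweeps vaddr from 0 to |SV1| + |SV2| − 1 to traverse SV1 ∪ SV2 without any data copying.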
Implementation of the Multi-layer System Bus
– According to the data flow explained earlier, we want:
– to reuse SVM units for different layers of the Cascade SVM;
– to let a reused SVM access data stored in multiple memory blocks.
– A multi-layer system bus is required to support all the necessary data transfers.
Design of the Flexible SVM Unit
– A single SVM unit may be reused for different layers of the cascade tree;
– it should be capable of processing variable-sized data sets;
– a serial processing scheme is applied to the kernel calculation.
[Figure: datapath of one SVM unit. An address generator fetches y_j, x_i(1), x_j(1), x_i(2), x_j(2) from the memory; subtract-and-square units followed by an adder accumulate the squared distance, a LUT produces the kernel value k_ij, and a 32-bit multiplier with an add/register loop, a comparator and clipping to {0, C} updates the α values under control of a local FSM.]
Implementation details:
– Gaussian kernel;
– 32-bit fixed-point arithmetic (a sketch follows).
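A minimal fixed-point sketch of the kernel path (the Q16.16 format, LUT size and input range are assumptions for illustration; the slides only state 32-bit fixed-point arithmetic and a LUT-based Gaussian kernel):

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr int FRAC_BITS = 16;  // assume Q16.16 fixed point
constexpr int32_t to_fix(double v) {
    return static_cast<int32_t>(v * (1 << FRAC_BITS));
}

std::array<int32_t, 1024> exp_lut;  // exp(-x) samples for x in [0, 8)

void init_lut() {
    for (int k = 0; k < 1024; ++k)
        exp_lut[k] = to_fix(std::exp(-8.0 * k / 1024.0));
}

// k_ij = exp(-gamma * ||x_i - x_j||^2); all operands are Q16.16.
int32_t gaussian_kernel(int32_t xi0, int32_t xi1, int32_t xj0, int32_t xj1,
                        int32_t gamma) {
    const int64_t d0 = xi0 - xj0, d1 = xi1 - xj1;
    const int64_t dist2 = (d0 * d0 + d1 * d1) >> FRAC_BITS;  // Q16.16
    const int64_t arg = (gamma * dist2) >> FRAC_BITS;        // Q16.16
    const int64_t idx = (arg * 1024) >> (FRAC_BITS + 3);     // map [0,8) to LUT
    return (idx >= 0 && idx < 1024) ? exp_lut[idx] : 0;      // saturate outside
}
```

init_lut() would run once at configuration time, mirroring how a hardware LUT is loaded before training starts.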
Classification & KKT Check
– The formulas have a very similar form to the training algorithm;
– we can therefore reuse the logic in the SVM units to reduce the area overhead (a sketch follows the table below).
[Figure: the four MEM blocks each feed the SVM through an AMP unit with address control; the indices of the support vectors flow in, and the indices of the KKT violators are written back.]
$\alpha_i=0 \;\Rightarrow\; y_i\Big(\sum_{j=1}^{N}\alpha_j y_j K(\vec{x}_j,\vec{x}_i)\Big)\ge 1$
$0<\alpha_i<C \;\Rightarrow\; y_i\Big(\sum_{j=1}^{N}\alpha_j y_j K(\vec{x}_j,\vec{x}_i)\Big)=1$
$\alpha_i=C \;\Rightarrow\; y_i\Big(\sum_{j=1}^{N}\alpha_j y_j K(\vec{x}_j,\vec{x}_i)\Big)\le 1$
Impact of the feedback on the training accuracy and runtime (400 samples):

Design    | Without feedback     | One feedback
          | Runtime  | Accuracy | Runtime  | Accuracy
Flat SVM  | 0.394 s  | 98%      | (unnecessary)
2-Core    | 0.104 s  | 94.25%   | 0.120 s  | 98%
4-Core    | 32.8 ms  | 92.50%   | 37.55 ms | 98%
8-Core    | 13.9 ms  | 89.75%   | 16.13 ms | 98%

The KKT violators still have a chance to get back into the optimization!
Decision function:
$f(\vec{x})=\sum_{i=1}^{N_{sv}}\alpha_{sv}\,y_{sv}\,K(\vec{x},\vec{x}_{sv}),\qquad \text{if } f(\vec{x})>0 \text{ then "+"},\ \text{if } f(\vec{x})<0 \text{ then "−"}$
The address information of the KKT violators is recorded in the MMUs.
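A minimal C++ sketch of the classification and KKT check (illustrative names and tolerance; the decision value reuses the same multiply-accumulate form as training, which is why the hardware can reuse the SVM units' logic):

```cpp
#include <cmath>
#include <vector>

// Gaussian kernel on equal-length feature vectors (gamma assumed 0.5).
double kernel(const std::vector<double>& a, const std::vector<double>& b) {
    double d2 = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k)
        d2 += (a[k] - b[k]) * (a[k] - b[k]);
    return std::exp(-0.5 * d2);
}

// f(x) = sum over SVs of alpha_sv * y_sv * K(x, x_sv); sign gives class.
double decision(const std::vector<std::vector<double>>& sv,
                const std::vector<double>& alpha, const std::vector<int>& y,
                const std::vector<double>& x) {
    double f = 0.0;
    for (std::size_t i = 0; i < sv.size(); ++i)
        f += alpha[i] * y[i] * kernel(x, sv[i]);
    return f;  // f > 0 -> "+", f < 0 -> "-"
}

// KKT check for one sample, matching the three conditions above;
// violators are fed back into the next round of optimization.
bool violates_kkt(double alpha_i, int y_i, double f_xi, double C,
                  double tol = 1e-3) {
    const double m = y_i * f_xi;
    if (alpha_i <= tol)     return m < 1.0 - tol;  // alpha = 0: need m >= 1
    if (alpha_i >= C - tol) return m > 1.0 + tol;  // alpha = C: need m <= 1
    return std::fabs(m - 1.0) > tol;               // 0 < alpha < C: m = 1
}
```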
Experimental Results
– Synthesized using a commercial 90 nm CMOS standard-cell library;
– on-chip memories generated by the corresponding SRAM compiler;
– layout generated using the same library; area, power and maximum clock frequency (178 MHz) measured from the layout.
[Figures: the decision boundary obtained from training 400 2-D samples, and the layout of the 8-core design, which occupies 6.68 mm² including I/O pads.]
200 Samples | P (mW) | Area (µm²) | Speedup | Energy Reduction
Flat SVM    | 15.52  | 373,518    | 1x      | 1x
2-Core      | 27.74  | 727,946    | 3.67x   | 2.05x
4-Core      | 64.43  | 1,499,828  | 10.54x  | 2.54x
8-Core      | 126    | 3,143,700  | 28.79x  | 3.54x
Experimental Results
Energy = Runtime x Power
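As a consistency check on the table (a worked example using only the reported numbers): runtime scales as 1/speedup, so for the 8-core design

$\text{energy reduction}=\text{speedup}\times\frac{P_{\text{flat}}}{P_{\text{8-core}}}=28.79\times\frac{15.52}{126}\approx 3.55,$

which matches the reported 3.54x up to rounding.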
[Plots: runtime (s) and energy (J) versus the number of training samples (50 to 400), on log scales, for the 1-, 2-, 4- and 8-core SVM designs.]
Datasets of different sizes are used to evaluate the performance of each HW design. Focusing on a fixed dataset, as the number of cores increases:
– power and area increase roughly linearly;
– the speedup increases much faster.
[Bar chart: core area (µm²), power (mW) and speedup (1x) for four configurations: Flat SVM (1-core), Temporal Reuse (1-core), Fully Parallel (2-core) and Hybrid (2-core).]
[Figure: (a) temporal reuse of one SVM unit, which processes subsets 1 and 2 from one memory in sequence; (b) temporal reuse of two SVM units with memories 1 and 2 and MMU1–MMU4, onto which the seven logical SVMs (SVM1–SVM7) of the cascade are mapped over subsets 1–4.]
We can configure the flexible architecture in different ways:
1. Fully parallel processing: reuse SVMs across different layers;
2. Temporal reuse of an SVM unit: reuse SVMs within the same layer.
Due to the O(n²) cost of kernel evaluation, temporal reuse alone still yields about a 2x speedup (see the estimate below)!
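A quick count of first-layer kernel evaluations explains the 2x (ignoring the smaller higher-layer passes, and assuming the n samples are split into two equal halves processed one after the other on the same unit):

$\underbrace{n^{2}}_{\text{flat SVM}}\quad\text{vs.}\quad 2\left(\frac{n}{2}\right)^{2}=\frac{n^{2}}{2},$

i.e. the kernel workload is halved even with no added hardware.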
Integrating “Temporal Reuse Scheme” into Cascade SVM HW
It will introduce a small area/power overhead. It will introduce a further speedup .
A new angle for the tradeoffs between speed and hardware cost !
Comparison of Runtimes and Energy Consumption: Software Approach vs. Hardware Approach
C++ SVM program on an Intel Pentium T4300 (2.1 GHz, 45 nm) vs. ASIC designs of Cascade SVMs (178 MHz, 90 nm).
• Even though the Intel CPU has a higher clock frequency and uses a more advanced technology node, our ASIC designs still outperform it by a wide margin!
Thank you! Questions?