MaPU: A Novel Mathematical Computing Architecture
Shashank Kedia & Robert Macy III
1
Why MaPU?
● High-performance CPUs and GPUs offer good theoretical performance but low power efficiency relative to that performance
● Superscalar and GPGPU designs have proven to be power inefficient
● Most systems operate at 60% of peak performance
● Supercomputers using thousands of processors have massive power and space requirements
● Goal: develop a chip that performs mathematical calculations with a good performance-to-power ratio relative to GPUs and CPUs
2
Architecture Overview
Three main components:
● Scalar Pipeline: communicates with the system on chip and controls the microcode pipeline.
● Microcode Pipeline: consists of functional units (FUs) that define the data flow.
● Multi-Granularity Parallel Memory System (MGP): allows efficient custom data access patterns.
3
Architecture Details: MGP Memory System
MGP allows efficient data access patterns.
The memory system is described by three parameters: W, the number of bytes that can be accessed in parallel; N, the total capacity in bytes; and G, the number of bytes available for reading/writing. Choosing these parameters partitions the memory system and so defines the memory access pattern.
Physical banks combine to form logic banks.
Each logic bank consists of G physical banks.
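The partitioning described above can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes there are W physical banks, each one byte wide, so that grouping G physical banks per logic bank yields W / G logic banks. The names `MGPPartition` and `partition` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MGPPartition:
    """Derived bank layout for one choice of (W, N, G). Hypothetical type."""
    logic_banks: int
    physical_banks_per_logic_bank: int
    bytes_per_physical_bank: int


def partition(W: int, N: int, G: int) -> MGPPartition:
    """Group physical banks into logic banks.

    Assumption (not stated in the slides): the memory has W physical banks,
    each one byte wide, sharing the N bytes of capacity equally.
    """
    if W <= 0 or N <= 0 or G <= 0:
        raise ValueError("W, N and G must be positive")
    if W % G != 0:
        raise ValueError("G must divide W so logic banks are equal in size")
    if N % W != 0:
        raise ValueError("N must be a multiple of W")
    return MGPPartition(
        logic_banks=W // G,
        physical_banks_per_logic_bank=G,
        bytes_per_physical_bank=N // W,
    )


if __name__ == "__main__":
    print(partition(W=4, N=64, G=4))  # one logic bank spanning all physical banks
    print(partition(W=4, N=64, G=1))  # every physical bank is its own logic bank
```

Under these assumptions, G = W collapses the memory into a single wide logic bank, while G = 1 leaves W independent narrow banks.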
4
Architecture Details: MGP Memory System (matrix accesses)
Matrices can be accessed in row or column order.
Matrix accesses in MGP require storing the i-th row in logic bank (i mod W).
Rows can be accessed by setting G=W and columns by setting G=1.
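A toy model may help show why this row placement permits both access orders. It is a sketch under loud assumptions: it models banks as Python lists rather than reproducing the real MGP address mapping, it assumes each row holds W elements, and it assumes the row count is a multiple of W. The function names are hypothetical.

```python
def store_rows(matrix, W):
    """Place row i in bank (i mod W), as the slide describes."""
    banks = [[] for _ in range(W)]
    for i, row in enumerate(matrix):
        banks[i % W].append(list(row))
    return banks


def read_row(banks, i, W):
    """Row access (G = W in the slide): the whole row comes from one bank."""
    return banks[i % W][i // W]


def read_column(banks, j, W, block=0):
    """Column access (G = 1 in the slide): one element from each of W banks.

    `block` selects which group of W consecutive rows is read.
    """
    return [banks[b][block][j] for b in range(W)]


if __name__ == "__main__":
    W = 4
    matrix = [[r * W + c for c in range(W)] for r in range(W)]
    banks = store_rows(matrix, W)
    print(read_row(banks, 2, W))     # elements of row 2
    print(read_column(banks, 1, W))  # elements of column 1
```

Because consecutive rows land in different banks, a column read touches each bank exactly once, so all W elements can in principle be fetched in parallel without a bank conflict.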
5
Architecture Details: Cascading pipeline with state machine-based program model
Dataflow can change to fit the desired algorithm.
This is facilitated by customizing the FUs used and their interactions via microcode.
State machines can be used to describe each FU.
This allows easier FU organization: the user specifies a state machine for each FU, plus a final state machine specifying the delays that ensure the appropriate execution order.
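The program model above can be illustrated with a small simulation. This is only a sketch of the idea, not MaPU's microcode syntax: each FU is given a hypothetical state machine as a list of (state, cycles) pairs, and a dictionary of start delays stands in for the final state machine that orders execution. All names and cycle counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FUStateMachine:
    """Hypothetical per-FU state machine: a fixed sequence of timed states."""
    name: str
    states: list  # list of (state_name, cycles) pairs

    def state_at(self, local_cycle):
        """Return the state this FU is in, local_cycle cycles after it starts."""
        if local_cycle < 0:
            return "idle"
        for state, cycles in self.states:
            if local_cycle < cycles:
                return state
            local_cycle -= cycles
        return "done"


def timeline(fus, start_delays, total_cycles):
    """Top-level schedule: each FU starts after its delay (the ordering role)."""
    return [
        {fu.name: fu.state_at(cycle - start_delays[fu.name]) for fu in fus}
        for cycle in range(total_cycles)
    ]


if __name__ == "__main__":
    fus = [
        FUStateMachine("load", [("load", 2)]),
        FUStateMachine("mac", [("multiply_accumulate", 3)]),
        FUStateMachine("store", [("store", 1)]),
    ]
    # Delays chosen so each FU starts once its producer has finished.
    delays = {"load": 0, "mac": 2, "store": 5}
    for cycle, states in enumerate(timeline(fus, delays, 6)):
        print(cycle, states)
```

In this sketch the delays alone enforce the load, compute, store order; changing the algorithm means changing the per-FU state lists and the delays, not the FUs themselves.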
6
Architecture Details: SoC Architecture
Overview of the tape-out design implemented by the authors.
APE (Algebraic Processing Engine) refers to the MaPU cores.
CSU is a DMA controller.
7
Results: Comparison with C66x
All comparisons shown here are from simulations.
APE runs at 1 GHz and C66x at 1.25 GHz.
8
Results: Power Usage
9
Results: Power Usage
Figure 15 in the paper seems to be incorrect and a copy of Figure 14
10
Results: Comparison with other processors
Source: M. H. Ionica and D. Gregg, “The Movidius Myriad architecture’s potential for scientific computing,” IEEE Micro, vol. 35, no. 1, pp. 6–14, 2015.
11
Results: Microcode Statistics
12
Conclusion
Introduces a new architecture for fast and efficient matrix-related computations.
Defines a process for molding the architecture to specific uses by defining state machines in the microcode pipeline.
Demonstrates an improvement in power efficiency over CPUs/GPUs.
Offers few points of comparison against competing architectures.
13
Discussion
1. Given the overhead (defining state machines) and the compiler optimizations required, is it still better than an ASIC?
2. Is this as generic an architecture as claimed?
3. Are simulation results as useful, given that a physical chip has been taped out?
14
Thank You
15