Ishikawa Laboratory, UNIVERSITY OF TOKYO
http://www.k2.t.u-tokyo.ac.jp/
http://www.k2.t.u-tokyo.ac.jp/index-e.html
Stage-Distributed Time-Division Permutation Routing in a Multistage Optically Interconnected Fabric
Alvaro Cassinelli(1), Makoto Naruse(2), Alain Goulet(1), and Masatoshi Ishikawa(1)
(1) University of Tokyo, Dept. Information Physics and Computing, 7-3-1 Hongo Bunkyo-ku, Tokyo 113-0033, Japan.
(2) Communications Research Laboratory, 4-2-1 Nukui-kita, Koganei, Tokyo 184-8795, Japan.
[Figure: multistage optical hypercube interconnecting processor arrays; labels X, Y, Z, W]
PLAN of the presentation
I. Introduction: space-domain optical switching fabrics
II. Column-Control in Multistage Interconnection Networks (CCMINs)
III. Folded Optical Implementation of a transparent CCMIN
IV. Packet switching in a buffered CCMIN (“new”)
V. Conclusion and Further Research
VI. Some References
I. Introduction: the problem under study
How to design an efficient optical switching fabric for addressing:
1) Processor-memory bottleneck in supercomputers
2) Router bottleneck in the Next Generation Optical Internet
These problems have some similarities: low latency required, synchronization, high bandwidth…
The traffic characteristics change: synchronous/asynchronous, regular/arbitrary request patterns, fixed/variable length of data bursts (granularity)
In fact, the above problems are case studies among a continuum of situations…
I. Introduction: optics inside routers
[Figure: scheme of a router: input interface, switching fabric, output interface, controller]
Where optics?
• interconnect router subsystems
• at the (unbuffered) switching fabric (OXC)
• at the interfaces and controller (“all-optical routing”)
This presentation concerns: SPACE-DOMAIN OPTICAL SWITCHING FABRICS
II. Column-Control in Multistage Interconnection Networks
II.1 Multistage Interconnection Networks
II.2 Column-Control in MINs
II.3 Permutation Capacity of CCMIN
II.4 Unbuffered CCMIN for permutation routing
II.1 Multistage Interconnection Networks
“Basic” switching fabric: Full-Crossbar (XC)
• Wide-sense non-blocking
• Low latency
• O(N²) complexity (using 2x2 switches)
• Simultaneous switching noise
• Central controller bottleneck
• Poor modularity
Circuit Switching: good for low-latency memory-processor communications.
Packet Switching: Maximum throughput of 63% without buffers (uniform traffic).
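The 63% figure can be checked with a short calculation. The sketch below assumes the usual model for an unbuffered N x N crossbar under uniform traffic: each input issues a request with probability `load`, each request picks an output uniformly at random, and an output serves one of the requests addressed to it. The function name `crossbar_acceptance` is illustrative and not taken from the slides.

```python
import math


def crossbar_acceptance(n_ports: int, load: float) -> float:
    """Fraction of requests accepted by an unbuffered N x N crossbar.

    Assumed model: independent Bernoulli requests with probability `load`,
    uniformly distributed destinations, one winner per contended output.
    """
    if load == 0:
        return 1.0
    # Probability that a given output is requested by at least one input.
    output_busy = 1.0 - (1.0 - load / n_ports) ** n_ports
    # Accepted requests per input divided by offered requests per input.
    return output_busy / load


if __name__ == "__main__":
    for n in (4, 64, 1024):
        print(n, round(crossbar_acceptance(n, 1.0), 4))
    # Large-N limit at full load: 1 - 1/e
    print("limit", round(1.0 - math.exp(-1.0), 4))
```

At full load the accepted fraction equals the throughput per port, which approaches 1 - 1/e (about 0.63) as N grows.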
II.1 Multistage Interconnection Networks
Alternative architecture: Multistage Interconnection Network (MIN)
…It still has point-to-point full connectivity (and is “self-routing”).
• Full point-to-point connectivity
• O(N log₂ N) complexity
• Distributed routing possible
• Fault tolerance possible (re-routing)
• Easier repair thanks to modularity
• Internal blocking
• Large optical losses
• Large crosstalk
II.2 Column-Control in MINs
“Stage-global switch”: column-control lines…
• Column-control simplifies hardware and control
• Possible physical merge of active switching and passive interconnection: 2-state “global” switches with long-range interconnections, suited for optical implementation (free-space, guided-wave)
Nice: the CCMIN is still capable of point-to-point connectivity
II.3 Permutation Capacity of CCMIN
…if blocking was a problem for a MIN…
…things are much worse for the CCMIN
[Figure: local blocking compared with “global-stage” blocking]
As a consequence of “global-stage” blocking, the permutation capacity of the CCMIN is extremely reduced.
However…
II.3 Permutation Capacity of CCMIN
• Requests are serviced by circuit switching (or by on-the-fly packet switching)
• Input requests are independent Bernoulli trials (parameter: the input request probability per unit time)
• Uniform traffic: equal probability of requesting any output port
[Figure: probability of request acceptance vs. input request probability per unit time, 64x64 network; curves: crossbar, standard MIN, CCMIN]
Crossbar: tends to 63% when N → ∞, because of HOL blocking.
Standard MIN and CCMIN: both tend to 0 when N → ∞.
CCMIN cannot be used to service arbitrary requests in a circuit-switched manner!
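The trend of the "standard MIN" curve can be reproduced with the classical independence approximation for an unbuffered network of 2x2 switches: if each input of a switch carries a request with probability p, each output carries one with probability 1 - (1 - p/2)². This is a sketch under that assumption; it is not necessarily the model used to produce the plot, and it does not cover the CCMIN curve. The function name `min_acceptance` is illustrative.

```python
def min_acceptance(n_stages: int, load: float) -> float:
    """Approximate request acceptance of an unbuffered MIN of 2x2 switches.

    Assumptions (not from the slides): uniform destinations, independent
    requests at every stage, blocked requests are dropped.
    """
    if load == 0:
        return 1.0
    p = load
    for _ in range(n_stages):
        # A switch output is busy unless neither input asks for it.
        p = 1.0 - (1.0 - p / 2.0) ** 2
    return p / load


if __name__ == "__main__":
    # 64x64 network: log2(64) = 6 stages.
    for load in (0.2, 0.5, 1.0):
        print(load, round(min_acceptance(6, load), 3))
    # Acceptance keeps falling as stages are added (N grows).
    print([round(min_acceptance(n, 1.0), 3) for n in (6, 10, 14, 18)])
```

Under this approximation the acceptance at full load is roughly 0.36 for six stages and keeps decreasing with the number of stages, in line with the remark that the MIN curve tends to 0 for large N.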
II.4 Unbuffered CCMIN for permutation routing
Reduced permutation capacity may still be useful for synchronous “permutation routing” in parallel computers (*)
4-D hypercube-connected multiprocessor…
Synchronous, weakly connected parallel computer (processors use same permutation / time slot)
[Figure: 16 processors (1 to 16) connected through column-controlled stages C1, C2, C3, C4, and the corresponding 4-D hypercube with edges C1 to C4]
(*) issue well studied in the past on “standard” blocking MINs
III. Folded Optical Implementation of a transparent CCMIN
III.1 Designing a CCMIN for circuit-switched permutation routing
III.2 “Folded” Optical Implementation
III.3 Experimental Demonstration
III.4 Possible applications
III.1 Designing a CCMIN for circuit-switched permutation routing
3-stage CC-“Baseline Network”: {c3, id} {c2, id} {c1, id}
• Number of permutations: 2^n (n=3)
• These are {c3, id} x {c2, id} x {c1, id}
• These are just the required permutations to implement the (3D) hypercube!
[Figure: 3D hypercube with edges labelled c1, c2, c3]
A multistage version of most parallel-computer direct-network topologies (hypercube, cube-connected-cycles, deBruijn, etc.) can be implemented as a CCMIN with properly designed inter-stage permutation modules.
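The count of 2^n permutations can be verified by enumeration. The sketch below assumes that c_k complements bit k of the binary address (k = 1 is the least significant bit) and that each stage independently applies either c_k or the identity; the helper names are illustrative.

```python
from itertools import product


def cube(k: int, n: int) -> list:
    """Cube permutation c_k on N = 2**n addresses: complement bit k (k=1 is LSB)."""
    return [a ^ (1 << (k - 1)) for a in range(1 << n)]


def compose(p: list, q: list) -> list:
    """Permutation obtained by applying p first, then q."""
    return [q[p[a]] for a in range(len(p))]


def reachable_permutations(n: int) -> set:
    """All products of one choice per stage from {c_n, id} x ... x {c_1, id}."""
    identity = list(range(1 << n))
    perms = set()
    for choice in product((False, True), repeat=n):
        perm = identity
        for k, use_cube in zip(range(n, 0, -1), choice):
            if use_cube:
                perm = compose(perm, cube(k, n))
        perms.add(tuple(perm))
    return perms


perms = reachable_permutations(3)
print(len(perms))  # number of distinct permutations for n = 3
for p in sorted(perms):
    print(p)
```

Every reachable permutation has the form a → a XOR m for some n-bit mask m, and the single-stage choices (m with one bit set) connect each node to its hypercube neighbours.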
III.2 “Folded” Optical Implementation
Optical Multistage Architecture Paradigm: Multistage Interconnection Network architecture + Dense & Efficient 3D folded inter-stage optical interconnects (fixed interconnections)
[Figure: cascaded shuffle stages]
Plane implementation:
• electronic
• planar lightwave circuit (PLC)
3D implementation:
• free space
• guided-wave
III.2 “Folded” Optical Implementation (slide not shown in main presentation)
Guided-wave (fiber-based) Modules vs. Free-Space
• fixed, no broadcast: optical fiber ok.
• better efficiency (and just like free-space optics, no cross-talk in 3D).
• No space-invariance imposed.
• Precise and robust alignment possible.
• Theoretically more volume efficient than free-space counterpart.
• “hard” to build? Not fundamentally difficult (can be automated, permutation decomposition possible)
• Alignment of output and input
• Fundamental power dissipation limit is very far off compared with electronics.
[Figure: prototype fiber module (fibers and holders) with input and output faces; “integrated” 2D folded perfect shuffle permutation module]
Wave-guide arrays for fixed, point-to-point and space-variant interconnections are an interesting alternative to free-space optics
Prototype (non-integrated) 4x4 fiber module (slide not shown in main presentation)
Two holder prototypes: Zirconium, SiO2
Pitch: 250 ± 5 µm
Multimode graded-index fibers: NA = 0.21 (core 50 µm, cladding 126 µm)
Transmission loss: 3 dB/km
Input: VCSEL (854 ± 4 nm)
Output: CCD
[Figure: input and output faces of the module]
III.2 Multiple-permutation module
Besides density, reduced crosstalk and optical efficiency, there is another nice feature of the guided-wave approach to plane-to-plane optical interconnections…
Interleaving multiple permutations is possible
[Figure: 3D bi-permutation module built by stacking planar lightwave circuits (for instance), with input and output faces]
A small mechanical/optical perturbation produces a drastic change of the interconnection pattern
Multi-permutation modules as CCMIN’s “global-stage” switches (*)
(*) and not only CCMIN’s stage permutations
Cube Permutations for N = 2^n (slide not shown in main presentation)
Cube permutation $c_k$: $\{b_n, \dots, b_{k+1}, b_k, b_{k-1}, \dots, b_2, b_1\} \mapsto \{b_n, \dots, b_{k+1}, \bar{b}_k, b_{k-1}, \dots, b_2, b_1\}$
Unfolded (example with n=4): c1, c2, c3, c4
Folded: if k ≤ n/2, c_k exchanges only rows; if k > n/2, c_k exchanges only columns.
The modules are just the same, rotated.
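The row/column property of the folded cube permutations can be checked numerically. The sketch below assumes one particular folding, which is a hypothetical choice since the slides do not specify the mapping: the low n/2 bits of the address give the row index and the high n/2 bits give the column index (n even).

```python
def fold(address: int, n: int) -> tuple:
    """Map an n-bit address to (row, column) on a 2**(n/2) x 2**(n/2) plane.

    Hypothetical folding: low half of the bits -> row, high half -> column.
    """
    half = n // 2
    row = address & ((1 << half) - 1)
    column = address >> half
    return row, column


def moved_coordinates(k: int, n: int) -> set:
    """Which plane coordinates change when c_k (complement bit k) is applied."""
    changed = set()
    for a in range(1 << n):
        r0, c0 = fold(a, n)
        r1, c1 = fold(a ^ (1 << (k - 1)), n)
        if r0 != r1:
            changed.add("row")
        if c0 != c1:
            changed.add("column")
    return changed


if __name__ == "__main__":
    n = 4
    for k in range(1, n + 1):
        print(f"c{k} changes only: {sorted(moved_coordinates(k, n))}")
```

With this folding, c_k for k ≤ n/2 changes only the row index and c_k for k > n/2 changes only the column index, so a module for the second group is the module for the first group rotated by 90 degrees.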
III.2 Experimental Demonstration
Prototype implementation using optical fibers
[Figure: unfolded hypercube and identity permutations on 16 lines (0 to 15); plane mapping (“folding”) (*) onto rows and columns]
(*) not unique!
[Figure: row-column folded bi-permutation module (Id, C2): 250 µm, dummy fiber rows, switching pitch: 125 µm]
III.2 Experimental Demonstration (slide not shown in main presentation)
“Spanned” 4D hypercube (use four bi-permutation modules)
Four-dimensional hypercube-connected multiprocessor… …topology is mapped on a plane
(processors interconnected through a 2D optical “socket”, or lying in a VLSI chip matrix)
[Figure: 4D hypercube with edges c1, c2, c3, c4 and its spanned version built from modules {c1, id}, {c2, id}, {c3, id}, {c4, id}]
III.2 Experimental Demonstration (slide not shown in main presentation)
[Figure: input (VCSEL array), cascaded modules {c2, id} and {c1, id}, output (CCD camera); views of the exit of the first module and the input of the second module]
Inter-module coupling efficiency: 1.7 dB (no additional optics, matching oil or antireflection coating).
Alignment tolerance: 5 µm (half peak power).
Commutation pitch: 125 µm
Validation of simple cascaded architecture.
III.2 Experimental Demonstration
Visualization of 2D permutation switching using a pair of modules
[Figure: input (VCSEL array), modules (C1 or Id) and (C2 or Id), output (CCD)]
Selected permutation product: Id x Id, C1 x C2, Id x C2, C1 x Id
III.2 Demonstration: electromechanical actuator
X-Y electro-magnetic actuated device
Resonant frequency = 430 Hz (62.5 µm)
(can vibrate the module in both X and Y directions; in principle, permutation interleaving is possible in both directions)
(Micro electro-mechanical actuators (MEMS) may also be an interesting alternative when switching latency in the millisecond range is tolerable)
III.2 Demonstration: electromechanical actuator (slide not shown in main presentation)
Resonant-frequency: round-robin permutation scheduling
[Figure: timeline divided into time slots: Interconnect 1, Interconnect 2, Interconnect 3, …, Interconnect N]
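A round-robin schedule over the CCMIN permutations can be sketched as follows. This is an illustration under assumptions not stated on the slide: the fabric is the hypercube-type CCMIN of section III.1, where every configuration maps input i to output i XOR m, and the slot counter itself is used as the mask m, so bit k of the slot number tells stage k whether to apply c_k or the identity.

```python
def stage_states(slot: int, n: int) -> list:
    """True where stage k (k = 1..n) applies c_k in this slot, False for id."""
    return [bool((slot >> (k - 1)) & 1) for k in range(1, n + 1)]


def round_robin_schedule(n: int) -> list:
    """schedule[slot][i] = output reached by input i during that time slot."""
    n_ports = 1 << n
    return [[i ^ slot for i in range(n_ports)] for slot in range(n_ports)]


schedule = round_robin_schedule(4)

if __name__ == "__main__":
    for slot in range(4):
        print(slot, stage_states(slot, 4), schedule[slot][:4])
```

Over one cycle of N slots every input is connected to every output exactly once, and no two inputs ever share an output within a slot.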
III.2 Demonstration: electromechanical actuator (slide not shown in main presentation)
Input: slow row/column scan of VCSEL array
No electromagnetic actuation: fixed Identity permutation.
Electromagnetic actuation: Identity & Cube2 permutations alternate at 860 Hz.
III.2 Demonstration: electromechanical actuator
Input: 635 nm laser modulated at 500 MHz
Output: high-speed photodetector
[Figure: actuator position and photodetector signal vs. time; scale: 200 µs]
• Switching latency between interconnections ≈ 0.96 ms (*)
• Time slot (3 dB) ≈ 200 µs
With a 10 Gb/s optical link, the burst size is 2 Mbits per channel (every millisecond), i.e. an average bandwidth of about 2 Gb/s per channel.
(*) MEMS routers: ms range.
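The burst-size figure can be reproduced with a few lines of arithmetic. The sketch assumes the time slot is 200 µs, which is the value consistent with a 2 Mbit burst on a 10 Gb/s link; the function names are illustrative.

```python
def burst_bits(link_rate_bps: float, time_slot_s: float) -> float:
    """Bits that fit in one usable time slot of the interconnection."""
    return link_rate_bps * time_slot_s


def average_rate_bps(link_rate_bps: float, time_slot_s: float,
                     switching_latency_s: float) -> float:
    """Average rate when each slot is followed by one switching latency."""
    period = time_slot_s + switching_latency_s
    return burst_bits(link_rate_bps, time_slot_s) / period


if __name__ == "__main__":
    link = 10e9          # 10 Gb/s link (from the slide)
    slot = 200e-6        # usable time slot, assumed to be 200 microseconds
    latency = 0.96e-3    # switching latency (from the slide)
    print("burst size [Mbit]:", burst_bits(link, slot) / 1e6)
    print("slot rate [Hz]:", 1.0 / (slot + latency))
    print("average rate [Gb/s]:", average_rate_bps(link, slot, latency) / 1e9)
```

The slot rate comes out near 860 Hz, matching the alternation frequency quoted for the actuator, and the average rate is about 1.7 Gb/s, which the slide rounds to 2 Gb/s by counting one burst per millisecond.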
III.4 Possible applications of an optical CCMIN
Possible computing applications:
• The present system is not usable for typical memory-processor communications, which require low latencies (< 100 ns), unless other switching hardware is used (acousto-optic cells: µs range / electro-optical material: ns range)
• If processing time is large (so that slow switching latency is tolerable) and the “bursts” of data are large, the electromechanical system may be used (FFT, large database retrieval, …?)
Communication networks:
• burst switching at the WAN level (ms range reconfiguration times).
• scientific-dedicated, transparent networks with long holding times and high bandwidth (TransLight, GLIF). MEMS switches are currently used (reconfiguration times in the range of a second are ok). An optical GSMIN may be used to regularly provide interconnection configurations.
• if switching time is reduced, it can be used to perform cyclic permutation scheduling in a virtual output queued (VOQ) switch, leading to 100% throughput (Stanford “Tiny-Tera Switch”)
…Slow switching may be okay (slide not shown in main presentation)
[Figure: timeline alternating burst interconnects (Interconnect 1, Interconnect 2, …) with one-stage computation (ex. 1 ms)]
Burst interconnection within “short” time slot (ex. 10 Gbps, 100 ns → 1 kbit)
Interconnection switching interval (ex. 1 ms)
IV. Packet switching in a buffered CCMIN
IV.1 Buffering in blocking networks
IV.2 FIFO Buffered CCMIN architecture
IV.3 Performance evaluation
IV.4 Delay-line “buffered” architecture
IV.1 Buffering for packet switching
Blocking is a serious drawback for circuit switching…
…less serious for packet switching
Buffering is a solution adopted in “usual” MINs…
• Unbuffered networks (even wide-sense non-blocking) suffer from HOL blocking: buffering is unavoidable.
• Input queues, output queues, virtual output queues and internal buffering have been explored in crossbars as well as in MINs;
• However, an advantage of buffered MINs over buffered crossbars is that the stage-distributed switching marries well with the distribution of buffering (thus avoiding large buffers)
…how much is a CCMIN improved by buffering?
IV.2 FIFO Buffered CCMIN architecture
[Figure: input, global switches GS-En(k) with arbitration, inter-stage FIFO buffers (Buffer 1 … Buffer N), output; parameters: total length of buffer, depth of analysis, length of transferred packets/cycle]
Why may this architecture compare well with “standard” buffered MINs?
• For uniform traffic, at each stage half of the packets wait, and half pass: individual switch/buffer control is, presumably, not really required…
What’s more:
• Arbitration for configuring the Global Switches may not be necessary at all!
IV.3 Performance: global control vs. local control
Seven-stage, 128x128 input/output fabrics (rem: inter-stage transfer with maximum speed-up equal to the size of the buffer)
[Figure: probability of packet acceptance vs. input request probability per unit time; curves: crossbar, standard MIN and Global Switched MIN, each MIN curve labelled by buffer size (0 to 6)]
Performance of Global Switched MIN compares very well with that of a standard MIN.
• GSMIN performance improves faster with buffer size
• For buffer size = 5 packets, equivalent performances
• For buffer size = 3 packets, performances are better than Xbar
IV.3 Performance: global control with blind alternate
“Blind” switch alternation of a GSMIN
[Figure: probability of packet acceptance vs. input request probability per unit time; curves: crossbar, “fair” switching and “blind” alternate, labelled by buffer size]
This is very interesting, because it means that a Standard MIN can be operated “blindly” if traffic is uniform enough.
Interconnection scheduling bottleneck (Clos, etc.) is eliminated by using a Time-Division Permutation Routing strategy.
As expected, “blind” alternation of switch states gives the same performance as a “fair” switch selection
(for uniform traffic)
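The idea of blind alternation can be explored with a toy simulation. The model below is a simplified sketch and not the simulator behind the plotted curves: it assumes a hypercube-type CCMIN in which stage k either complements bit k of the line address or leaves it unchanged, one FIFO of `buffer_size` packets in front of every stage on every line, at most one packet transferred per line per cycle (no speed-up), and all stages toggling their global state every cycle regardless of traffic. All names are illustrative.

```python
import random
from collections import deque


def simulate_gsmin(n_stages: int, buffer_size: int, load: float,
                   cycles: int, seed: int = 0) -> float:
    """Fraction of offered packets accepted at the inputs (toy model)."""
    rng = random.Random(seed)
    n_ports = 1 << n_stages
    # buffers[k][p]: FIFO in front of stage k on line p, storing destinations.
    buffers = [[deque() for _ in range(n_ports)] for _ in range(n_stages)]
    offered = accepted = 0
    for t in range(cycles):
        state = t % 2  # blind alternation: 0 = identity, 1 = cube permutation
        # Serve the last stage first so a packet moves one stage per cycle.
        for k in reversed(range(n_stages)):
            for p in range(n_ports):
                queue = buffers[k][p]
                if not queue:
                    continue
                dest = queue[0]
                needed = ((p ^ dest) >> k) & 1  # must bit k be complemented?
                if needed != state:
                    continue  # head packet waits for the other global state
                next_line = p ^ (state << k)
                if k == n_stages - 1:
                    queue.popleft()  # delivered: next_line equals dest here
                elif len(buffers[k + 1][next_line]) < buffer_size:
                    buffers[k + 1][next_line].append(queue.popleft())
        # New arrivals: Bernoulli requests with uniform destinations.
        for p in range(n_ports):
            if rng.random() < load:
                offered += 1
                if len(buffers[0][p]) < buffer_size:
                    buffers[0][p].append(rng.randrange(n_ports))
                    accepted += 1
    return accepted / offered if offered else 1.0


if __name__ == "__main__":
    for size in (1, 2, 4, 8):
        print(size, round(simulate_gsmin(4, size, 1.0, 2000), 3))
```

Because the global state of a stage is the same for all lines, the packets that move in a cycle follow a permutation and never collide; the only losses come from full buffers, so acceptance should rise as the buffer size grows.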
IV.4 Delay-line “buffered” architecture
Reliable optical memories are still too difficult to implement...
What about just delaying packets?
[Figure: input, global switches with delay-line “buffer”, output]
(since there are only two states per stage, a single delay-line alone may give good performance)
[Figure: input, switch with delay-line “buffer”, output] (slide not shown in main presentation)
… we didn’t study a “standard” MIN with delay-lines
IV.4 Performance of a delay-line “buffered” architecture
[Figure: probability of packet acceptance vs. input request probability per unit time; curves: crossbar, Global Switched MIN with FIFO buffers (labelled by buffer size), delay-line]
Using a single selectable delay per channel and per stage, performance lies somewhere in between that of the one-packet and two-packet sized FIFO buffered architectures.
Blind alternation of global switch states is assumed
(we didn’t study a “standard” MIN with delay-lines)
V. Conclusion
V.1 Results
V.2 Further Research
V.1 Conclusion
Summarizing:
• Column-Control simplifies MIN hardware and control;
• Column-Controlled MIN can be efficiently implemented using dense plane-to-plane optical interconnections;
• Column-Control MIN may have enough permutation capacity for specific applications (highly parallel algorithms);
• Column-Controlled MIN can be used for packet switching if buffered, giving roughly the same performance as “standard” MINs;
• Path-selection mechanism may be “blind” (i.e. round-robin, time-division permutation routing) without appreciable degradation of performance.
V.2 Further Research
On transparent circuit switched CCMINs:
• An arbitrary permutation request may be serviced by multiplexing in time the available set of permutations. This needs input buffers and speed-up (i.e. short switching latency). This has been explored in standard MINs using 2x2 switches…
• Design of “active” modules, and multi-function modules (containing more than two permutations, but also other optical functions, e.g. optical delay lines)
On buffered packet switched CCMINs:
• Other models of buffers: in particular, inter-stage virtual output queues (VOQ) may give very good performance in a CCMIN (because with a speed-up of only 2, each stage will have 100% throughput). Two parallel delay-line buffers?
• How heavily do the studied architectures rely on the URM assumption? Study more realistic traffic models / ways to balance the non-regular traffic.
V.2 Fast switching permutation modules (slide not shown in main presentation)
[Figure: stack of PLC layers coupled in the normal direction; input and switching region shown in the cross state and in the by-pass (straight) state]
Reconfiguration time can be of the order of nanoseconds!
• Simulation of a crossbar by speed-up (TDM connections for local area networks)
• Core of permutation routing switches for inter-processor communications in a parallel computer
V.2 …“advanced” further research (slide not shown in main presentation)
Based on the observation that VOQ and speed-up, plus optimal permutation decomposition, are the basic ingredients of the Birkhoff-von Neumann switch (plus load-balancing to simplify the decomposition => Tiny-Tera switch) with 100% throughput, it will be interesting to study:
1) a “constrained” decomposition of a rate matrix onto the set of available CCMIN permutations
2) a multistage version of the BVN switch, where the permutation decomposition is done:
a) at each stage (using bi-permutation modules, this will probably lead to a simple forced-alternate mode, and reduce the size of the VOQ to only 2, which may be accommodated by simple delay-lines!),
b) every few stages, so that the available set of permutations will be very reduced, but still larger than 2. This may optimize the design of buffer functions (no need to put them in all stages).
Thank you for your attention
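Item 1) can be illustrated for the hypercube-type CCMIN. The sketch assumes that the available permutations are exactly the N maps i → i XOR m described in section III.1; this restriction and the function names are assumptions made for the example. Since each input/output pair (i, j) is then served by exactly one mask, m = i XOR j, the smallest fraction of time that mask m must be scheduled is the largest rate among the pairs it serves.

```python
def xor_schedule_weights(rate: list) -> list:
    """Minimal time fraction for each XOR permutation, given a rate matrix.

    rate[i][j] is the fraction of time input i needs output j; the number
    of ports must be a power of two.
    """
    n_ports = len(rate)
    return [max(rate[i][i ^ m] for i in range(n_ports)) for m in range(n_ports)]


def is_schedulable(rate: list) -> bool:
    """True if the weights fit in one frame (sum of time fractions <= 1)."""
    return sum(xor_schedule_weights(rate)) <= 1.0 + 1e-12


if __name__ == "__main__":
    n = 4
    uniform = [[1.0 / n] * n for _ in range(n)]
    print(xor_schedule_weights(uniform), is_schedulable(uniform))

    # Admissible (row and column sums <= 1) but not coverable by XOR masks.
    skewed = [[0.0] * n for _ in range(n)]
    skewed[0][1] = 0.6  # served by mask 1
    skewed[0][2] = 0.4  # served by mask 2
    skewed[1][3] = 0.6  # also served by mask 2
    print(xor_schedule_weights(skewed), is_schedulable(skewed))
```

Uniform traffic fits exactly, with every permutation used for 1/N of the time, which is the round-robin case. The skewed matrix shows that an admissible rate matrix is not always schedulable on the reduced permutation set without speed-up.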
VI. Some References
Traffic models:
J. Cao et al., “Internet traffic tends toward Poisson and Independent as load Increases”, Nonlinear Estimation and Classification, eds. C. Holmes et al., Springer, NY, 2002.
thermo-optic matrix [Goh01]
round-robin (TDM). [Thompson91].
Crosstalk can be solved decomposing a permutation into semi-permutations, with an increase of the number of network stages [Qiao]
“Volume-consumption comparisons of free-space and guided-wave optical interconnections”, Y.Li and J. Popelek, p.1815-1825, Appl.Opt. Vol 39, n.11, april 2000.
Study of inter-stage VOQ in MINs:
Kolias, “Dual Banyan Switch”, [Kolias]
W.J. Dally, “Virtual-Channel Flow Control”, IEEE Trans. Parallel and Distr. Systems, Vol. 3, No. 2, Mar. 1992, pp. 194-205. Dally studies “DAMQ” (dynamically allocated multi-queue buffers), which looks quite similar to “hop-mode” buffers.
(slide not shown in main presentation)