Slide 1
The Architecture of the M40: A Backbone IP Router
Pradeep Sindhu
March 11, 2004
Slide 2
The M40: Juniper’s First Product
Put the entire forwarding path in hardware for the first time
Achieve line-rate performance for eight 2.5 Gbps interfaces
Do it against overwhelming competition
Do it with a limited budget and a small team (40 at FCS)
Do it in two years
  We took 2 years and 4 months
We didn’t understand the full complexity at the beginning - it unfolded only as we went along!
We succeeded only because the M40 team was incredibly talented and driven
Copyright © 2000, Juniper Networks, Inc. Slide 2
Slide 3
So What Is a Backbone IP Router Anyway?
Certain Minimum Qualifications
  Capable of switching IP & MPLS datagrams: L3 forwarding
  Symmetric any-port-to-any-port switching speed
  Delay-bandwidth buffering plus congestion control
  Internet scale routing tables
  Internet scale IS-IS, OSPF, MPLS, BGP4
Today’s Benchmark
  Line rate forwarding on all ports for 40-byte packets
  Performance independent of load
  Support of CoS queuing, shaping and policing
  L2 and L3 VPNs
  Traffic engineering
  Classification and filtering at line rate
Slide 4
Why Are They Hard to Build?
Bottom line: inherent complexity
  Scaling along multiple dimensions
    Bandwidth, packets per second
    #interfaces, #channels, #routes, #neighbors, #policies, #filters
  Unpredictable, hostile environment
  Need for reliable, seamless interoperability
  System design and partitioning is non-intuitive
  Deep technical expertise across multiple disciplines
    Software: routing protocols, embedded systems, network management
    Hardware: ASIC design, board design, high speed circuit design
    Mechanical: power, packaging, thermal, emissions
  Changing requirements
Building routers requires a special viewpoint
  The network is the system, not the box
  Routers uniquely integrate the network at scale
Slide 5
Specifications
Hardware
  20 Gbps line rate; 40 Mpps; 400 ms buffer
  POS 8xOC-48, 32xOC-12, 128xOC-3
  ATM 32xOC-12, 128xOC-3
  128xDS-3
  32xGbE
  34” x 19” x 26”
Software
  BGP4, OSPF, IS-IS, MPLS/RSVP
  DVMRP, PIM SM & DM
  Control, configuration & monitoring
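As a rough cross-check of the hardware figures above, the following Python sketch works out what a 400 ms buffer at 20 Gbps implies in bytes and in 64-byte cells. It assumes decimal units (1 Gbps = 10^9 bit/s) and nominal 2.5 Gbps ports; the function and constant names are illustrative and not taken from the M40 design.

```python
# Back-of-the-envelope buffer sizing from the slide's figures (illustrative only).
PORTS = 8                       # eight OC-48 interfaces
PORT_RATE_BPS = 2_500_000_000   # nominal 2.5 Gbps per port, decimal units assumed
BUFFER_MS = 400                 # buffering depth quoted on the slide
CELL_BYTES = 64                 # cell size used by the forwarding engine


def buffer_sizing(ports=PORTS, port_rate_bps=PORT_RATE_BPS,
                  buffer_ms=BUFFER_MS, cell_bytes=CELL_BYTES):
    """Return (aggregate bit rate, buffer size in bytes, buffer size in cells)."""
    line_rate_bps = ports * port_rate_bps
    buffer_bits = line_rate_bps * buffer_ms // 1000
    buffer_bytes = buffer_bits // 8
    buffer_cells = buffer_bytes // cell_bytes
    return line_rate_bps, buffer_bytes, buffer_cells


if __name__ == "__main__":
    rate, size_bytes, size_cells = buffer_sizing()
    print(f"aggregate line rate: {rate / 1e9:.0f} Gbps")
    print(f"delay-bandwidth buffer: {size_bytes / 1e9:.0f} GB")
    print(f"equivalent 64-byte cells: {size_cells:,}")
```

Under these assumptions the buffer comes to 1 GB in total, or about 15.6 million cells.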
Slide 6
Forwarding Engine Approach
Design is based on highly integrated silicon
  Treat silicon as an empty canvas
  Let technology set the limits, not preconception
  Apply computer design experience
  Use high volume components where possible
  Entire forwarding path in hardware; no corner cases
  Partition design around clean, stable interfaces
Why?
  Every major advance in systems in the last 30 years can be traced ultimately to silicon integration
  Companies that have bet against the compounding power of integration have died
  History will repeat itself because exponentials are hard for people to understand
Slide 7
Design Philosophy
No compromise line-rate performance. PERIOD.
No assumptions needed about
  traffic conditions
  packet sizes
  interface types
  encapsulation types
  etc.
Hardware needs to handle worst-case conditions
Slide 8
System Level Partitioning
Problem is broken into two roughly equally complex parts that interact infrequently
[Diagram: Forwarding Engine (FE), specialized hardware, handles all packets; Routing Engine (RE), general-purpose computer (Pentium based), handles control packets only; the two are linked by Fast Ethernet]
Why this partitioning is good
  Loading of one does not affect the other, eliminating a common failure mode of legacy routers
  Facilitates independent hardware and software development and early software testing
  RE is a standard off-the-shelf Intel platform, so it leverages industry advances in computer design
  RE can be leveraged across multiple generations of FEs with no change
  Software structure can now be clean because software is not burdened with real-time considerations
Good architecture is the art of defining clean, stable interfaces; it is the only way we know to build anything complex
Slide 9
Routing Engine
Standard 233 MHz Pentium PC
256 MB memory
Specialized BIOS for booting
LS-120
Flash memory
Hard Disk for dumps
100BT link to Forwarding Engine
Slide 10
Software Structure: JunOS
Built for scale using modern OS design principles
  Strong protection
  Modularity
  Clean, stable interfaces
Reliable, maintainable, serviceable
Average of three to four major releases per year
[Diagram: RPD, DCD, MgD, ChassisD and Apps above the JUNOS Kernel; group labels: Routing Protocols, Mgmt Apps]
Slide 11
Forwarding Engine Architecture
[Diagram: blocks labeled M, C, 603, A1, A2 and SRAM; two groups of BI 0, BI 1 ... BI 7 and DI 0, DI 1 ... DI 7]
Slide 12
Chipset
A: implements A1 & A2 (1x)
B: implements BI, BO, and BM (8x)
C: implements route lookup (1x)
D: implements SONET & POS
Slide 13
Physical Structure
Active Backplane
  contains A1 and A2 chips
Up to 8 FPCs
  each FPC has up to 4 PICs
  each FPC has a 603 control processor
  each PIC handles up to 622 Mbps line rate
1 SCB
  603 control processor, memory, Ethernet
  C chip and route lookup memory
Slide 14
Card Cage
[Diagram: card cage showing the active backplane, FPCs carrying PICs, the SCB, and the airflow direction]
Slide 15
Terminology
Stream
  source of non-interleaved packets
Cell
  64 byte datum
Notification
  16 byte pointer to packet + control bits
Bank
  unit of main memory on one FPC
Key
  variable length quantity used to do route lookup
Slide 16
Memory Organization
Divided into 64 byte cells
Logically one giant buffer
Physically distributed among line-cards
  Two 72 bit wide DIMMs
  125 MHz clock
Packets read and written as cells
  Cells written as they arrive
  No garbage collection
  Cells chained together via offsets
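The cell-based organization above can be pictured with a small Python model. This is only a sketch under stated assumptions: the circular write pointer, the per-cell `link` field, and the convention that a link of 0 marks the last cell are all hypothetical, chosen to illustrate "cells written as they arrive" and "chained together via offsets"; the slides do not describe the real layout.

```python
CELL_SIZE = 64  # bytes per cell, as stated on the slide


def cellify(packet):
    """Split a packet (bytes) into chunks of at most CELL_SIZE bytes."""
    return [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]


class CellMemory:
    """Toy shared buffer: cells are written in arrival order and linked by offset."""

    def __init__(self, num_cells):
        self.data = [b""] * num_cells
        self.link = [0] * num_cells   # offset to the packet's next cell; 0 = last cell
        self.write_ptr = 0            # assumed circular allocation, no garbage collection

    def write_cell(self, chunk, prev=None):
        """Store one cell at the write pointer and chain it to the previous cell."""
        addr = self.write_ptr
        self.write_ptr = (self.write_ptr + 1) % len(self.data)
        self.data[addr] = chunk
        self.link[addr] = 0
        if prev is not None:
            self.link[prev] = (addr - prev) % len(self.data)
        return addr

    def read_packet(self, head):
        """Follow the offset chain from the head cell and reassemble the packet."""
        out, addr = bytearray(), head
        while True:
            out += self.data[addr]
            if self.link[addr] == 0:
                return bytes(out)
            addr = (addr + self.link[addr]) % len(self.data)


if __name__ == "__main__":
    mem = CellMemory(16)
    pkt_a, pkt_b = bytes(150), bytes([1]) * 100
    cells_a, cells_b = cellify(pkt_a), cellify(pkt_b)
    # Cells from two streams arrive interleaved: a0 b0 a1 b1 a2
    head_a = prev_a = mem.write_cell(cells_a[0])
    head_b = prev_b = mem.write_cell(cells_b[0])
    prev_a = mem.write_cell(cells_a[1], prev_a)
    prev_b = mem.write_cell(cells_b[1], prev_b)
    prev_a = mem.write_cell(cells_a[2], prev_a)
    print(mem.read_packet(head_a) == pkt_a, mem.read_packet(head_b) == pkt_b)
```

Because every cell simply goes to the next free slot, cells of different packets interleave in memory and the offset chain is what ties a packet back together on the read side.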
Slide 17
Packet Flow: Input
[Diagram: Line -> DI -> BI -> A1 -> C; DI and BI joined by the BD interface; BI to A1 carries cells to memory; A1 to C carries Key+info]
DI
  SONET decapsulation
  POS/HDLC
BI
  Layers 2 and 3
  Cellification
  Write cells to memory
  IIF determination
  Input accounting
A1
  Switch cells to memory
  Build ICells
  Forward Key to C
Slide 18
Packet Flow: Route Lookup
[Diagram: A1 sends Key + Info to C; C sends Result + Info to A2; an SRAM block is shown]
Key = variable # bits (up to 31 bytes)
Result = nexthop_id + destMask
Slide 19
Packet Flow: Output
[Diagram: C -> A2 -> BO -> DO -> Line; C to A2 carries Result+info; A2 to BO carries cells from memory; BO and DO joined by the BD interface]
A2
  Switch cells from memory
  Forward notification to BO
BO
  Output queueing
  Read cells from memory
  Packetization
  Nexthop lookups
  Layers 2 and 3
  Output accounting
DO
  POS/HDLC
  SONET encapsulation
Slide 20
Output Queuing
Arriving notifications queued by BO
  4 queues per stream
  weighted round robin service
  random-early drop
Each notification is 16 bytes
  pointer to start of packet
  first few offsets
  next-hop id
  control bits
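The weighted round robin service named above can be sketched in a few lines of Python. This is a generic textbook form, not the BO chip's actual scheduler: the weights are made up, service is counted in notifications rather than bytes, and random-early drop is not modeled.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Yield queued items; in each round, queue i may send up to weights[i] items."""
    while any(queues):                      # an empty deque is falsy
        for queue, weight in zip(queues, weights):
            for _ in range(weight):
                if not queue:
                    break
                yield queue.popleft()


if __name__ == "__main__":
    # Four queues for one stream, holding placeholder notification labels.
    queues = [
        deque(["a1", "a2", "a3"]),
        deque(["b1", "b2", "b3"]),
        deque(["c1"]),
        deque(),
    ]
    weights = [2, 1, 1, 1]                  # hypothetical weights
    print(list(weighted_round_robin(queues, weights)))
```

With these example weights the first queue gets two service opportunities per round and the others one each, so a backlogged high-weight queue drains faster without starving the rest.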
Slide 21
Route Lookup
Generic Problem: Find best (longest) match in table
Our Solution: JTree
[Diagram: Key walked through a tree with branches labeled 0 and 1; table entries are Pattern + Mask; the lookup yields a Result]
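To make the generic problem concrete, here is a minimal longest-prefix-match lookup over a plain binary trie in Python. It is not the JTree itself, whose node layout the slides do not describe; the class names and the bit-string representation of keys are assumptions made for illustration.

```python
class TrieNode:
    """One node of a binary trie; `result` is set if a prefix ends here."""
    __slots__ = ("children", "result")

    def __init__(self):
        self.children = [None, None]   # branch on bit 0 / bit 1
        self.result = None


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, result):
        """Add a prefix given as a string of '0'/'1' characters."""
        node = self.root
        for bit in prefix_bits:
            i = int(bit)
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.result = result

    def lookup(self, key_bits):
        """Return the result of the longest stored prefix that matches the key."""
        node, best = self.root, self.root.result
        for bit in key_bits:
            node = node.children[int(bit)]
            if node is None:
                break
            if node.result is not None:
                best = node.result
        return best


if __name__ == "__main__":
    table = BinaryTrie()
    table.insert("", "default")        # zero-length prefix matches everything
    table.insert("10", "nexthop-A")
    table.insert("1011", "nexthop-B")
    for key in ("10110000", "10000000", "01110000"):
        print(key, "->", table.lookup(key))
```

The walk consumes one key bit per step and remembers the most specific result seen so far, which is exactly the "best (longest) match" requirement stated on the slide.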
Slide 22
Input Switch Organization
Input switch connects BIs to memory
Memory implemented by multiple banks
Cells of each stream are written to increasing bank number
Perfect pattern guarantees freedom from bank conflicts
Simple TDM discipline suffices!
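One way to see why an increasing-bank pattern avoids conflicts is the staggered schedule sketched below in Python. The details are assumptions: eight banks (one per FPC), one writer per input port, and a start offset equal to the port number. The slides state only that each stream writes to increasing bank numbers and that a simple TDM discipline suffices.

```python
NUM_BANKS = 8  # assumed: one memory bank per FPC, up to 8 FPCs


def bank_for(port, slot, num_banks=NUM_BANKS):
    """Bank written by input `port` in time slot `slot` (hypothetical staggered TDM)."""
    return (port + slot) % num_banks


def conflict_free(num_slots, num_banks=NUM_BANKS):
    """Return True if no two ports target the same bank in any time slot."""
    for slot in range(num_slots):
        banks = [bank_for(port, slot, num_banks) for port in range(num_banks)]
        if len(set(banks)) != num_banks:
            return False
    return True


if __name__ == "__main__":
    for slot in range(3):
        print("slot", slot, [bank_for(port, slot) for port in range(NUM_BANKS)])
    print("conflict free:", conflict_free(64))
```

In every slot the ports hit a rotation of the bank numbers, so each bank sees exactly one write per slot and no arbitration is needed.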
Slide 23
Output Switch
Output switch connects BOs to memory
Same multiple bank memory
Reads can be a lot more chaotic
  deterministic within a packet
  but not across packets
Only probabilistically conflict free
Reservation table handles conflict cases
Slide 24
Control Channel
Provides PIO channel to chips
Used for booting & configuration
[Diagram: control channel reaching the A1, C, A2 and B chips]
Slide 25
High Speed Links and Clocking
Single synchronous domain
Single ended, low voltage (GTL)
Clock sent with data
250 Mbits/sec per wire
16 bit wide data path => 4 Gbps
Slide 26
Retrospective
2.3 years from start to product launch
Small team: 8 ramping to 40
Combined a lot of different areas of expertise
No major mistakes, just a lot of little ones
System building experience was key
  Average experience was ~10 years
  Average person had delivered 2 systems
Implementation was incredibly optimized
Great one-time leverage of knowledge from computer industry to networking industry
This product launched the company
This product changed the way routers are built