7/30/2019 A-975f5095-4d33-4907-8af5-ed1e7e50b5a0-8cfe9203-f3d3-4a98-823b-50b02ebc2d97_130423_35702_24
2010 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 1
David Tsiang, Cedrik Begin, Guglielmo Morandin
4/22/13
Goals and requirements of switch fabrics
Buffering strategies (input, output, CIOQ)
Transport (packet vs cell)
Topologies (single stage, multi-stage)
Congestion management (proactive, reactive)
Multicast
Service provider examples
Enterprise and Datacenter examples
Scale
  Bandwidth per fabric interface
  Number of fabric interfaces
Fairness
  Usually want non-blocking and fair (sometimes weighted fairness)
  Non-blocking: no cross-flow interference between src-dest flows
    e.g. a congested flow doesn't unduly interfere with a non-congested flow
Latency
  Service provider: 100 us (WAN distances dominate, jitter)
  Enterprise: 10s of us (campus distances)
  Datacenter: 1 us (datacenter distances, compute perf. is latency sensitive)
Cost
  SP vs Datacenter vs Enterprise
Redundancy
  1:1, 1+1, 1:N
Central: shared memory
Input: deep buffers only on input
Output: deep buffers only on output
Combined input/output: deep buffers on input and output
Usually associated with central memory switch fabric designs
Bandwidth scale limited by memory bandwidth (can be distributed over several parallel memory slices to improve)
Limited queue scale (not practical for multi-chassis)
Similar performance characteristics to an output buffered switch, without the need for a large speedup
Examples: early Cisco routers (AGS+, 7000, 7500), smaller routers (ISR, ASR1K, Procket, early Juniper routers M40, M160)
[Figure: FIAs write to and read from a central memory. FIA: Fabric Interface Adaptor]
Buffers on input
Requires Virtual Output Queues (VOQs) to be non-blocking
Most common type of buffering (GSR, N7K, ASR9K, Panini)
[Figure: FIA send and receive paths through the switch fabric, with VOQs in memory (Mem)]
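The reason VOQs are needed can be sketched with a toy model. This is only an illustration under simple assumptions: one input, outputs identified by integers, and a hypothetical set of outputs that are currently congested. The function names are made up for this example.

```python
from collections import deque

def fifo_send(queue, blocked_outputs):
    """Single FIFO per input: only the head cell is eligible to be sent."""
    q = deque(queue)
    sent = []
    while q and q[0] not in blocked_outputs:
        sent.append(q.popleft())
    return sent

def voq_send(queue, blocked_outputs):
    """One queue per output (VOQ): any cell whose output is free can be sent."""
    voqs = {}
    for dest in queue:
        voqs.setdefault(dest, deque()).append(dest)
    sent = []
    for dest, q in voqs.items():
        if dest not in blocked_outputs:
            sent.extend(q)
    return sent

cells = [2, 0, 1, 0]  # destination output of each queued cell, head first
print(fifo_send(cells, blocked_outputs={2}))  # [] : head-of-line blocked
print(voq_send(cells, blocked_outputs={2}))   # [0, 0, 1]
```

With a single FIFO the cell for congested output 2 holds back traffic for outputs 0 and 1; with per-output queues only the congested flow waits.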
Only works if there is no congestion within the switch fabric!
Can be achieved if speedup is high (path from Send to RCV is >> FIA input BW).
Pure output buffered switch is not practical for large systems (needs a speedup of N)
[Figure: FIA send and receive paths through the switch fabric, with output queues (OQ) in memory (Mem)]
High speedup (e.g. 2-3X FIA BW) is enough most of the time
Input queues for cases where speedup is insufficient (blocking)
CRS uses this (VOQ scale impractical for an input-only approach)
[Figure: FIA send and receive paths through the switch fabric, with input queues (IQ) and output queues (OQ) in memory (Mem)]
Two main methods of transporting data across a switch fabric
Packet: send whole packets (or even multiple pkts as a frame)
  Advantages:
    Simpler: no reassembly of cells (but may have to reorder packets)
    Higher efficiency: per-packet overhead vs per-cell overhead
    Lower average latency (can do cut-through on egress)
  Disadvantages:
    Slightly higher complexity for buffered switch chips
    Higher worst-case (WC) latency (small packets must wait behind larger packets)
    Not as scalable (large scale switches require distribution, which requires cells to be efficient; packet requires bundling of links to achieve low latency for large packets, which does not allow for large scale distribution)
Cell: segment packets into smaller sized cells
  Advantages:
    Lower WC latency (important for TDM types of traffic)
    Scalable (easy to evenly distribute cells)
(Cell continued):
  Disadvantages:
    Higher complexity: requires segmentation and reassembly of cells
    Higher overhead: per-cell overhead, packet packing efficiency
    Worse average latency: reassembly and reordering cell buffer adds latency, can't do egress cut-through
Generally:
  Packet transport: pure packet fabrics, single chassis scale fabrics
  Cell transport: hybrid packet/TDM switches, most large scale multi-chassis fabrics.
Mesh (pt-pt, bus)
Scale limited to bus bandwidth or FIA bandwidth
Used in smaller and/or older systems
[Figure: FIAs interconnected by a bus]
Single stage (crossbar, central memory)
Scale limited to number of serdes I/O on a single chip
  e.g. 224x224 (SM15) of serdes on a chip limits a system to 224 FIAs; Panini has 768 FIAs
Use parallel crossbars for bandwidth scale and redundancy
[Figure: FIAs connected through parallel crossbars]
3-stage symmetric CLOS
  #S1 = #S2 = #S3; traffic always takes two hops (S1->S2, S2->S3)
For an NxN xbar chip, can connect up to N^2 FIAs
  Scale is usually less due to the common practice of combining S1 and S3 (folded: N^2/2) and the number of S1s and FIAs achievable in an LC chassis.
Provably non-blocking via rearrangement (connection oriented) or load balancing (requires some speedup to overcome imperfect randomized load balancing)
[Figure: FIAs connected to S1 stages, S1 stages to S2 stages, S2 stages to S3 stages, and S3 stages to FIAs]
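The N^2 and N^2/2 bounds can be checked with a small calculation. This is a sketch under the idealized assumption that every chip port is usable and one port connects one FIA; the function name is hypothetical.

```python
def clos_max_fias(n, folded=False):
    """Upper bound on FIA ports for a 3-stage symmetric Clos of n x n chips.

    Unfolded: each S1 chip takes n FIA links and fans out to n S2 chips;
    each S2 chip has n ports, so it reaches n S1 chips: n * n FIA ports.
    Folded (S1 and S3 in one chip): half of the chip's ports face the FIAs
    and half face the S2 stage, so each chip serves n/2 FIAs: n * n / 2.
    """
    if folded:
        return n * n // 2
    return n * n

print(clos_max_fias(128))               # 16384
print(clos_max_fias(128, folded=True))  # 8192
```

As the slide notes, real systems land below these bounds because of chassis packaging limits.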
n-stage topologies (Hyper-cube, Torus, Hyper-torus)
Flows can take a variable number of hops from src to destination
Lower cost
  Typically FIAs interconnect directly: fewer components, no fabric chassis needed
  Less interconnection cost, but interconnection can be complex for:
    less than a fully populated system
    system with varying speed nodes
Requires complex scheduling to be non-blocking
  Typically flow based path selection
  Must be able to dynamically change path if flow bandwidths change (reordering?)
  Slow to recover from failures (massive path recomputations)
Centralized timeslot scheduling
  Allows bufferless crossbars with no data loss (no collisions)
  Difficult to achieve maximum match (O(n^2.5), O(n^3) complexity)
    Approximate maximum match instead (e.g. ISLIP, PIM, WFA): needs speedup to overcome imperfect match.
  Not that scalable (O(n) complexity, but n can be large. Sched. speed can be an issue as well).
Distributed timeslot scheduling
  Scheduling done by each destination independently
  Imprecise: sources may receive multiple grants but can only act on one
  Results in loss of bandwidth: can be overcome with speedup
  Scalable since it's distributed, but somewhat inefficient
Distributed bandwidth scheduling
  Distribute bandwidth (credits or MTU pkts) on request
  src sends when ready
  Collisions can occur: need buffering in the xbar and some speedup
  Scalable since it's distributed, and efficient
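To illustrate what an approximate match looks like, here is a simplified, single-iteration round-robin request/grant/accept scheduler in the style of iSLIP. It is a sketch, not the scheduler of any product described here: it assumes a square n x n crossbar, and the function and variable names are made up for the example.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request/grant/accept round.

    requests[i] is the set of outputs that input i has cells for.
    grant_ptr[o] / accept_ptr[i] are round-robin pointers, updated in place
    only when a grant is accepted. Returns {input: output}.
    """
    n = len(requests)
    # Grant: each output picks the first requesting input at/after its pointer.
    grants = {}
    for out in range(n):
        for k in range(n):
            inp = (grant_ptr[out] + k) % n
            if out in requests[inp]:
                grants.setdefault(inp, []).append(out)
                break
    # Accept: each input picks the first granting output at/after its pointer.
    match = {}
    for inp, outs in grants.items():
        for k in range(n):
            out = (accept_ptr[inp] + k) % n
            if out in outs:
                match[inp] = out
                accept_ptr[inp] = (out + 1) % n
                grant_ptr[out] = (inp + 1) % n
                break
    return match

requests = [{0, 1}, {0}, {2}]
gptr, aptr = [0, 0, 0], [0, 0, 0]
print(islip_iteration(requests, gptr, aptr))  # {0: 0, 2: 2}
print(islip_iteration(requests, gptr, aptr))  # {1: 0, 0: 1, 2: 2}
```

In the first cell time input 0 receives two grants but can accept only one, so output 1 stays idle even though a full match (0->1, 1->0, 2->2) exists. That lost bandwidth is what speedup, or additional iterations, has to make up for. The pointer updates desynchronize the outputs, so the second cell time reaches the full match.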
i.e. no scheduling: just send
Switch fabric either:
  buffers and asserts flow control if buffers get too full
  or just drops if buffers get too full (may require ack + retransmission)
Requires a large speedup to get good performance in oversubscribed scenarios
Is blocking for congestion > speedup, because flow control within the switch fabric is usually coarse (not to VOQ level)
  Can reduce blocking by adding secondary flow control from destination back to source: can be at a VOQ level
Hybrid schemes
  Can combine proactive and reactive schemes
    e.g. send speculatively and if congested (reactive) request to re-send (proactive)
  Better latency if non-congested
Typically very challenging
  Scheduling per MC group not practical
    Combinatoric number of groups: 2^n-1, where n is the number of FIAs
  Usually drop on congestion or reactive flow control
Alternative: turn multicast into unicast; can now isolate congestion
  But can be blocking of unicast (ingress replication) or expensive (server replication).
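A quick calculation shows why per-group scheduling does not scale. The function name is hypothetical; the formula is the 2^n-1 count of non-empty destination sets given above.

```python
def multicast_group_count(n_fias):
    """Number of distinct non-empty destination sets over n_fias FIAs."""
    return 2 ** n_fias - 1

for n in (4, 16, 64):
    print(n, multicast_group_count(n))
```

Even at 64 FIAs the count is about 1.8e19, far beyond any per-group queue or scheduler state.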
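Ingress Replication
  Can block ingress if not enough speedup to overcome replication dilation
  Staggered delivery
Fabric multicast
  No impact to linecards, scalable to 100% multicast
  Drop on congestion
[Figure: ingress replication, where the LC sends copies 1 2 3 4 through the switch fabric, compared with fabric multicast, where the LC sends one copy and the switch fabric delivers copies 1-4]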
Binary Tree Replication
  Can block ingress if not enough speedup to overcome replication dilation (but less chance of this due to distribution of replication).
  Very staggered delivery
MC Server Replication
  No ingress blocking
  Additional expense of MC server cards
  Staggered delivery
[Figure: binary tree replication across LCs and the switch fabric, compared with replication by an MC server that delivers copies 1-4]
Fabric works in cell periods of 128 ns. Cell clock is distributed to LCs and switch cards.
Packets are segmented into cells in the ingress path
For each cell a request is sent to the SCA (Scheduler Control ASIC)
SCA determines which input -> output connections to make. Sends grants to the IFIA and controls the XBARs
Cells are sent across the serial links and XBARs to the EFIA
Packets are reassembled in the egress path
Ingress ToFab queues: per destination slot HPQ + LPQs, multicast HPQ + LPQs (MDRR).
FIA (ToFab): H/L unicast queues per destination LC + H/L multicast Q
SCA algorithm used to ensure fairness + maximize throughput over the fabric
  Schedules between UC/MC requests (alternates priority between UC/MC).
  Within a priority, input LCs get their fair share of traffic towards output linecards
FIA (FrFab) has: per source UC/MC reassembly queues that can flow control the SCA if nearing full.
Multicast to different LCs is performed by the crossbar.
  A given multicast cell is transmitted to N destinations across the crossbar.
Partial grants supported.
  If a cell wants to go to destinations 1, 2 and 3, the fabric may first grant 1, 3 and then grant 2 in a subsequent cell time.
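Partial grants (sometimes called fanout splitting) can be sketched as follows. This is an illustrative model only: the per-cell-time sets of free outputs are supplied by the caller, and the function name is hypothetical.

```python
def schedule_multicast(fanout, free_outputs_per_slot):
    """Serve a multicast cell over successive cell times.

    In each cell time the cell is granted to whichever of its remaining
    destinations are free. Returns (grants per cell time, unserved set).
    """
    remaining = set(fanout)
    grants = []
    for free in free_outputs_per_slot:
        if not remaining:
            break
        granted = remaining & set(free)
        grants.append(granted)
        remaining -= granted
    return grants, remaining

# The example from the slide: destinations 1, 2, 3; output 2 is busy at first.
grants, left = schedule_multicast({1, 2, 3}, [{1, 3}, {2}])
print(grants, left)  # [{1, 3}, {2}] set()
```

Without partial grants the cell would have to wait until all three outputs were free in the same cell time.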
Switch cards:
  Redundant switch card, which allows errors in a single serial link stream to be corrected.
  Redundant stream carries XOR of 4 other streams
  Provides 4+1 redundancy.
CSC (Clock and Scheduler) cards:
  2 of these in the system; one is operational and the other standby.
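The 4+1 XOR scheme works like parity in a RAID stripe: any one stream can be rebuilt from the other four. A minimal sketch, assuming the receiver already knows which stream is bad (for example from a per-link error check, which is an assumption here); the helper name and byte values are made up.

```python
def xor_streams(streams):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytes(len(streams[0]))
    for s in streams:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

data = [b"\x12\x34", b"\xab\xcd", b"\x00\xff", b"\x55\xaa"]  # 4 data streams
parity = xor_streams(data)                                   # redundant stream

# Stream 2 is corrupted or lost: rebuild it from the survivors plus parity.
survivors = data[:2] + data[3:]
recovered = xor_streams(survivors + [parity])
print(recovered == data[2])  # True
```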
Generation  Fabric connection  FIA                    SCH                     XBAR
622M        5x1.25G            FIA                    SCA                     XBAR
2.5G                           FIA-48, Fusilli
10G         20x1.25G           TFIA, FFIA, Superfish  SCA192                  XBAR192
20G         20x2.5G            EROS                   HAD, Hecate (priority)  IRIS
Cell based (fixed 136B cells w. cell packing)
Unscheduled
Single stage / 3 stage fabric
Single chassis / multi-chassis capable
Architecture scales up to 1536 EFIAs.
  2/4 EFIAs per LC.
VOQ not feasible (system has 1M+ output queues): solved with fabric speedup with flow control
3 generations: 40G -> 120G -> 400G per slot.
  New generations required to support previous generations
Input buffered (fabric congestion)
Output buffered (fabric speedup)
Multicast replication in fabric.
2 priorities per UC/MC in fabric.
[Figure: line cards (40/120/400 Gbps) connected through fabric planes 1 of 8 to 8 of 8, each with S1, S2 and S3 stages; fabric chassis]
136-byte cells
2.5X speedup
Buffered non-blocking switch
Multi-stage interconnect, 3 stage Clos topology
2 levels of priority
Multicast support: 1M multicast groups
IFIA:
  Segments packets into fixed size cells
  Distributes cells evenly to planes.
S1 Stage:
  Distributes all cells evenly to all S2s.
S2 Stage:
  UC: Directs cell to S3 stage based on destination address.
  MC: Replicates cell to S3 stages based on FGID
S3 Stage:
  UC: Directs cell to EFIA based on destination address.
  MC: Replicates cell to EFIAs based on FGID
EFIA:
  Receives cells from all planes and reassembles packet per source/cast.
Single chassis can have:
  3-stage topology if switching element does not have enough links to do single stage.
  Single stage topology:
    Full mesh between IFIAs and EFIAs and Fabric chips.
    Fabric chips in S123 mode, whereby incoming cells are routed directly to the EFIAs.
Linecard Chassis:
  Fabric cards contain S1 and S3 stages (these may be combined into ASICs doing both stages).
Fabric Chassis:
  Fabric cards contain S2 stages.
  1 or more fabric cards may implement the S2 stage of a plane.
Full mesh between IFIAs, EFIAs and Fabric chips.
Fabric chips in S13 mode:
  Traffic local to a chassis does not go over optical links.
  Traffic destined to other chassis goes over optical links.
Input buffering: per destination (EFIA) H/L queue.
  System scale (~1M OQ) deemed too large for VOQ.
Output buffering: per faceplate port configurable number of queues.
[Figure: unicast queueing along the path packets from NPU, IFIA, S1, S2, S3, EFIA, packets to NPU]
  IFIA: 3072 high priority and 3072 low priority fabric destination (EFIA) queues
  S1 has a single data queue
  S2: queues per priority per S3 group
  S3: queues per priority per fabric destination (EFIA)
  EFIA: 4K raw queues
  Other labels: Reseq & Reassembly, Discard Filter, 8k shaped queues, Fabric Destination BP
[Figure: multicast queueing along the path packets from NPU, IFIA, S1, S2, S3, EFIA, packets to NPU]
  IFIA: 1 high priority multicast queue and 1 low priority multicast queue
  S1 has a single data queue for both UC and MC data cells
  S2: queues per S3 group per priority and cast (i.e. separate queues for MC)
  S3: queues per destination per priority and cast (i.e. separate queues for MC)
  EFIA: some number of MC raw queues
  Other labels: Reseq & Reassembly, Discard Filter, 8k shaped queues, Fabric Destination BP
IFIAs send traffic into the fabric.
2 main flow controls to regulate.
  Handle cases where fabric speedup is insufficient
Destination Backpressure:
  Used to minimize buffer occupancy in the fabric for short term congestion.
  Operates at per destination EFIA granularity.
  S3 queue congestion + S2 feed forward counts contribute.
  Ingress FIAs implement a slow start algorithm to minimize overshoot.
Discard:
  Operates at a per faceplate port granularity.
  Used to alleviate potential fabric congestion by reducing the amount of congested traffic entering the fabric.
    That is, we do not want to send packets across the fabric which are going to be discarded at the EFIA anyway.
  Also minimizes the amount of queuing delay at the IFIA due to congestion at the destination.
Other flow controls in the fabric as well:
  Stages may backpressure upstream stages if they are out of resources.
    Not expected.
[Figure: the IFIA, S1, S2, S3, EFIA path annotated with the Discard and Dest BP flow controls; other labels: Reseq & Reassembly, Discard Filter, 8k shaped queues, Fabric Destination BP]
Multicast is performed in the fabric at the S2 and S3 stages.
IFIAs have 2 queues (H/L)
Cell header contains FGID field, which is used by S2 and S3 stages as an index to a replication table.
  1M fabric groups.
  S2 and S3 replicate cells
No flow control mechanisms.
  Separate queues for H/L multicast.
  If there is congestion, multicast cells are dropped.
Scheduling between unicast and multicast cells is WRR at S2 and S3 stages.
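The FGID lookup at S2 and S3 can be sketched as two levels of table-driven replication. The table contents, port names, and cell representation below are hypothetical and only illustrate the indexing idea.

```python
def replicate(cell, table):
    """Return one (port, cell) copy for every port listed under the cell's FGID."""
    return [(port, cell) for port in table.get(cell["fgid"], [])]

# Hypothetical replication tables: FGID -> S3 chips at S2, FGID -> EFIAs at each S3.
s2_table = {7: ["s3_a", "s3_b"]}
s3_tables = {"s3_a": {7: ["efia0", "efia1"]}, "s3_b": {7: ["efia5"]}}

def deliver(cell):
    """EFIAs reached by a multicast cell after S2 and S3 replication."""
    efias = []
    for s3, copy in replicate(cell, s2_table):
        for efia, _ in replicate(copy, s3_tables[s3]):
            efias.append(efia)
    return efias

print(deliver({"fgid": 7, "payload": b"x"}))  # ['efia0', 'efia1', 'efia5']
```

The ingress side sends a single copy; the number of copies only grows as the cell gets closer to its destinations.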
4/8 planes in the system.
  FQ: 8 planes: 1 plane per fabric card
  HQ: 8 planes: 2 planes per fabric card
  QQ: 4 planes: 1 plane per fabric card
Enough speedup in fabric to handle one plane down and not adversely affect performance.
Further planes down result in reduced fabric performance
Fabric can operate with a minimum of 2 planes.
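A rough way to see the effect of losing planes, assuming (this is an assumption, not a statement from the slides) that the 2.5X speedup quoted for this fabric applies with all planes up and that cells are spread evenly over the surviving planes. The function name is hypothetical.

```python
def effective_speedup(full_speedup, total_planes, planes_up):
    """Speedup left when traffic is spread evenly over the surviving planes."""
    return full_speedup * planes_up / total_planes

for up in (8, 7, 4, 2):
    print(up, effective_speedup(2.5, 8, up))
```

Under these assumptions one plane down still leaves about 2.19X, while at the 2-plane minimum the fabric offers less than line rate (0.625X).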
Generation  Fabric links             FIA              XBAR
40G         2.5G (8b10b)             Sprayer, Sponge  SEA (36x72)
120G        5G (scrambler + 8b10b)   Seal, Crab       Superstar (128x144)
400G        8.625G (scrambler)       Inbar            Sapir (128x128)
Cell based (variable cell size)
Distributed scheduling
Single stage / 3 stage fabric topologies
Single chassis / multi-chassis capable
Architecture scales up to 768 FIAs
  Up to 6 FIAs per LC.
Input buffered: 64K VOQs (4 COS per 10GE)
Multicast replication in fabric (512K groups)
2 independent pipes in fabric.
  OTN, Data UC, MC
Panini Multi chassis Architecture
[Figure: Panini Multi-chassis Architecture - Multi-Chassis Fabric Architecture. Fabric planes 1 of 6 to 6 of 6, each with S1, S2 and S3 stages; fabric chassis; labels: nx200G, 240Gbps, CXP]
64~256B cells
nx200G (1x speedup)
3 stage CLOS topology
Single priority, 2 paths in fabric
Multicast support: 512K multicast groups, replication at S2 and S3
IFIA:
  Segments packets into variable sized cells
  Distributes cells evenly to planes.
S1 Stage:
  Distributes all cells evenly to all S2s.
S2 Stage:
  UC: Directs cell to S3 stage based on destination address.
  MC: Replicates cell to S3 stages based on FGID
S3 Stage:
  UC: Directs cell to EFIA based on destination address.
  MC: Replicates cell to EFIAs based on FGID
EFIA:
  Receives cells from all planes and reassembles packet per source/cast.
Single stage topology
Full mesh between FIAs and Fabric chips.
FIAs spray cells to Fabric chips, which will send them to the correct FIA.
3 stage topology.
In each Linecard Chassis of the FIA are connected to each S1S3.
Full mesh between S1S3s and S2s on a plane.
Fabric Chassis:
  Fabric cards contain S2 stages.
  1 or multiple fabric cards may implement the S2 stage of a plane.
3 stage topology
In each Linecard Chassis of the FIAs are connected to each S1S3.
Full mesh between S1S3 and S2 stages
S2 stage shared between 2 chassis fabric cards.
Unicast: distributed scheduling
  VOQs in IFIA indicate occupancy state to the EFIAs they are associated with.
  EFIAs will issue credits fairly (WFQ) to IFIAs based on:
    VOQ state from IFIAs
    Number of links up between S3 and itself
    Congestion indication from fabric
  IFIAs will send cells of packets into the fabric for VOQs with credit.
Multicast: unscheduled
  Packets sent into the fabric
  Congestion may result in drops or global flow control, depending on priority.
Congestion in fabric between UC/MC:
  When there is only UC traffic, there should be no sustained congestion, as each EFIA controls the traffic towards it.
  When MC is introduced:
    congestion may occur and the fabric will indicate it to the EFIA
    EFIA will adapt UC credits down to a configured value to make room for MC traffic
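The egress-driven credit loop can be sketched as below. The slides say credits are issued with WFQ; this example substitutes a simpler weighted round-robin to approximate that behavior, and it reduces "VOQ state" to a backlog count. All names, weights, and the per-interval credit budget are hypothetical.

```python
def issue_credits(backlog, weights, budget):
    """Split `budget` credits among IFIAs that report VOQ backlog.

    backlog: {ifia: cells waiting in its VOQ for this EFIA}
    weights: {ifia: positive integer share per round}
    An IFIA never receives more credits than its backlog.
    """
    credits = {src: 0 for src in backlog}
    remaining = budget
    while remaining > 0:
        active = [s for s in sorted(backlog) if credits[s] < backlog[s]]
        if not active:
            break
        for src in active:
            grant = min(weights[src], backlog[src] - credits[src], remaining)
            credits[src] += grant
            remaining -= grant
            if remaining == 0:
                break
    return credits

backlog = {"ifia0": 10, "ifia1": 2, "ifia2": 0}
weights = {"ifia0": 2, "ifia1": 1, "ifia2": 1}
print(issue_credits(backlog, weights, budget=9))
# {'ifia0': 7, 'ifia1': 2, 'ifia2': 0}
```

Lowering `budget` is one way to picture the EFIA reducing unicast credits when the fabric signals congestion, for example to make room for multicast.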
Ingress Unicast:
  64K VOQs: enough for 4 COS queues per 10GE
Ingress Multicast:
  4 class queues towards the fabric
Fabric:
  Queues per pipe per destination.
  S2: queue per S3 destination.
  S3: queue per EFIA.
Egress:
  Per cast/class queues towards the egress NPU
Multicast is performed in the fabric at the S2 and S3 stages.
Cell header contains FGID field, which is used by S2 and S3 stages as an index to a replication table.
  512K fabric groups.
  S2 and S3 replicate cells to wanted destinations
6 planes in the system.
Enough speedup in fabric to handle one plane down and not adversely affect performance.
Further planes down result in reduced fabric performance
Fabric can operate with a minimum of 1 plane.
Packets are segmented into 64B cells
  48B of payload
  8B of cell header
  8B of CRC
No cell packing: a given cell may only have data for one packet.
Cell is split (in 16-bit chunks) over 4 serial links, one to each XBAR
A fifth redundant serial link contains information for error correction
Links are 8b10b encoded
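The cost of fixed-size cells without packing can be worked out directly from the numbers above (64B cells carrying 48B of payload). The function names are hypothetical, and link encoding overhead (8b10b) is deliberately left out of the calculation.

```python
import math

def cells_needed(packet_len, payload_per_cell=48):
    """Cells used by one packet when a cell carries data for one packet only."""
    return math.ceil(packet_len / payload_per_cell)

def fabric_efficiency(packet_len, cell_size=64, payload_per_cell=48):
    """Fraction of transmitted cell bytes that is packet data."""
    return packet_len / (cells_needed(packet_len, payload_per_cell) * cell_size)

for size in (40, 49, 1500):
    print(size, cells_needed(size), round(fabric_efficiency(size), 3))
```

A 49-byte packet is the worst case here: it spills one byte into a second cell, so only 49 of 128 transmitted bytes are packet data. This kind of loss is one reason a fabric needs speedup, and it is what cell packing is meant to reduce.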
Packets are segmented into 136B cells:
  12B of header
  120B of payload
  4B of RS code (for error correction)
Unicast cells can be packed such that a cell can contain data from 2 packets
Multicast cells are not packed
Control cells
  Idle, Discard, SRCC
Cells are:
  8b10b encoded for 2.5G links
  Scrambled + 8b10b encoded for 5G links
  Scrambled for 8.625G links
[Figure: cell layout: header (12 bytes), cell payload (120B, shown as four 30-byte units), RS (4 bytes). Single packet payload: Packet 1 (120 bytes). Two packet payloads: Packet 1 and Packet 2 share the cell payload]
13 byte header | 64-256 bytes payload | 1 byte CRC
Idle cells sent if no data to send
11.5G Serdes
64/66 encoding
FEC covers a group of cells (for optical links)
Retransmit used for electrical link error correction
CRC-32 covers the packet
[Figure: Cell: fabric hdr (13 bytes), data payload (64-255 bytes), crc8. Packet: fabric hdr, pkt hdr, pkt payload (up to 9.6k bytes), crc32; size labels: 4 bytes, 14 bytes]
[Figure: flow control along the path packets from NPU, IFIA, S1, S2, S3, EFIA, packets to NPU]
  3072 high priority and 3072 low priority fabric destination queues
  S1 hiccups control per plane scheduling at Sprayer
  S2 queues per priority per fabric group
  S2 OOR state used to control scheduling in S1
  S2 queue state and feed forward incorporated into destination BP
  S3 queues per priority per fabric destination
  S3 OOR state used to control scheduling in S2
  S3 queue state generates destination BP
  4K raw queues in EFIA
  EFIA raw queue state controls the discard filter
  Other labels: Reseq & Reassembly, Discard Filter, 8k shaped queues, Fabric Destination BP
First high performance enterprise switch
  FCS in 1998
First implementation was shared bus, evolved to single stage fabric
Large set of features, also supported by special service cards
Wire rate at minimum packet size
Port ASICs on linecards, decision engine and Dbus arbiter on supervisor card
16 Gb/s total system bandwidth
Input and output buffers
No backpressure from output
  Bus arbiter prevented multiple port ASICs from writing on the bus simultaneously
Two bus priorities to support VoIP
  More queuing classes on port ASICs
[Figure: port ASICs attached to the DBUS and RBUS, with the decision engine and arbiter (ARB)]
[Figure: Ioslices with input queues (per priority) and a decision engine, connected over high speed serial links through fabric interfaces and the crossbar to Ioslices with output queues]
[Figure: fabric interfaces connected through the crossbar, with the decision engine]
Input queues
No VOQ, only per-priority input queues
No congestion feedback from egress ports to ingress
Blocking possible
Crossbar
Can drop packets when congested
Initially centralized decision engine, later distributed on each linecard
To support line rate at minimum packet size as newer generations of linecards got faster
[Figure: crossbar with fabric interfaces and decision engine]
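The blocking that per-priority input queues allow can be illustrated with a toy model. This is a minimal sketch, not the switch's actual arbitration: the function names, the one-packet-per-output-per-slot rule, and the fixed input order are all assumptions made for illustration.

```python
from collections import deque

def fifo_slot(inputs):
    """One time slot with a single FIFO per input (no VOQ).

    inputs: list of deques of destination output ids.
    Only the head of each FIFO may be sent, and each output accepts one
    packet per slot (assumed). Returns the destinations served this slot.
    """
    taken, served = set(), []
    for q in inputs:
        if q and q[0] not in taken:
            taken.add(q[0])
            served.append(q.popleft())
    return served

def voq_slot(inputs):
    """Same slot, but any queued packet may be sent, as with per-output VOQs."""
    taken, served = set(), []
    for q in inputs:
        for dest in list(q):
            if dest not in taken:
                taken.add(dest)
                q.remove(dest)
                served.append(dest)
                break
    return served

# Two inputs both have a packet for output 1 at the head of the queue.
print(fifo_slot([deque([1, 2]), deque([1, 3])]))  # second input is blocked
print(voq_slot([deque([1, 2]), deque([1, 3])]))   # second input sends to output 3
```

With FIFO input queues the packet for output 3 waits behind the blocked packet for output 1, even though output 3 is idle.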
Packet based
Arbiters (conceptually one per output buffer) independently decide which input is writing to which output
Each crossbar link is a bundle of 8 SerDes
Lower port count required
Input and output queues
Input queues cause blocking
3x internal overspeed to compensate
Requires store and forward
Input queues can drop
Two priorities supported as two separate datapaths and queues
[Figure: two priority datapaths (Prio1, Prio2) with separate queues, served by WRR at egress]
Crossbar does replication according to Fabric Port Of Exit (FPOE) mask
FPOE set by ingress port ASIC
Done by writing to multiple output queues simultaneously
Multiple retries possible to satisfy replication mask
3x internal overspeed helps to maintain rate
Egress fabric interface and egress port ASIC use Destination Index in packet header to access lookup table and perform further replications
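Mask-driven replication with retries can be sketched as a loop over a bitmask. This is an illustrative model only: `replicate`, `output_free`, and the `max_tries` bound are hypothetical names and assumptions, not taken from the hardware.

```python
def replicate(fpoe_mask, output_free, max_tries=3):
    """Write one packet to every output whose bit is set in fpoe_mask.

    output_free(attempt) returns a bitmask of outputs that can accept a
    write on that attempt (assumed interface). Each attempt writes to all
    pending outputs that are free at once. Returns the mask of outputs
    that are still unserved after max_tries attempts.
    """
    pending = fpoe_mask
    for attempt in range(max_tries):
        if pending == 0:
            break
        pending &= ~output_free(attempt)
    return pending

# Outputs 0, 1 and 3 are requested; output 3 is busy on the first attempt.
free_by_attempt = [0b0011, 0b1000]
left = replicate(0b1011, lambda a: free_by_attempt[a], max_tries=2)
print(bin(left))
```

The internal overspeed matters here because every retry consumes crossbar time that would otherwise carry new packets.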
Support for no-drop protocols (Fibre Channel over Ethernet)
Bandwidth optimized
More than 16 slots per chassis
High density, including oversubscribed linecards
Centrally scheduled
Packet based
3-stage fabric
Buffered crossbar
Input buffered
Single chassis topology, up to 16 linecard slots + 2 supervisor slots
Multicast replication in fabric
One priority per cast in fabric
8 priorities per cast in ioslices
[Figure: VOQs per (port, class) exchange request, grant, and credit return with the scheduler; crossbar with input and output buffers (one of three stages shown); output queues]
Packets are queued into VOQs
VOQ system shared among ports on an ingress ioslice
Multiple small packets are accumulated into a superframe
Up to a max size of about 3000 bytes
Requests are made to central scheduler
No size information, MTU assumed
Destination port and priority
Grants are generated according to egress buffer availability
Central scheduler keeps track of buffer availability on every egress queue in the system
Superframes are sent to fabric upon grant reception
Smaller than max size if grants arrive quickly, when little or no congestion is present
Split into fragments if packet is bigger than max superframe size
No drops in crossbars and outputs
Optionally no drops in VOQs by issuing pause frames
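The superframe behavior on grant reception can be sketched as follows. This is a hedged illustration: `build_superframe` and `MAX_SF_BYTES` are hypothetical names, the slide only says "about 3000 bytes", and the rule that a fragment never shares a superframe with other packets is an assumption.

```python
from collections import deque

MAX_SF_BYTES = 3000  # "about 3000 bytes" on the slide; exact value assumed

def build_superframe(voq, max_bytes=MAX_SF_BYTES):
    """Pop packets from one VOQ into a superframe when a grant arrives.

    voq is a deque of packet lengths in bytes. Whole packets are packed
    while they fit. A packet larger than max_bytes is fragmented, and the
    remainder stays at the head of the VOQ for the next grant.
    """
    frame, used = [], 0
    while voq and used < max_bytes:
        size = voq[0]
        room = max_bytes - used
        if size <= room:
            frame.append(voq.popleft())
            used += size
        elif not frame:
            # Oversized packet alone at the head: send one fragment.
            frame.append(room)
            voq[0] = size - room
            used += room
        else:
            break  # next packet does not fit; send what we have
    return frame

voq = deque([64, 64, 1500, 1500, 9000])
print(build_superframe(voq))  # small packets accumulated together
print(build_superframe(voq))  # one packet, since the jumbo does not fit behind it
print(build_superframe(voq))  # first fragment of the 9000-byte packet
```

If grants arrive quickly, the VOQ holds few packets at each grant, so the superframes come out smaller than the maximum, which matches the low-congestion case described above.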
Up to 8 VOQs per destination port at ingress ioslice
Shared across ingress ports
Ingress drops according to tail drop or some form of AQM (WRED, AFD)
Ingress ioslice determines load balancing over fabric planes
Round robin or pseudo-random
One egress queue for each egress port, priority
No drops at egress
One credit loop per egress port, priority
Buffer hard-partitioned
Credits are returned to central scheduler when packet leaves system, creating available buffer
Egress scheduler controls egress queue drain rate, so it controls credit return rate
Central scheduler distributes aggregate grant rate across requesting VOQs
[Figure: several ingress VOQs (voq 42,1 and voq 42,2) feed egress queues 42,1 and 42,2 under the central scheduler; the egress scheduler drains the egress queues]
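One credit loop for a single egress (port, priority) queue can be modeled in a few lines. This is a sketch under assumptions: the class and method names are hypothetical, one credit stands for one buffer, and requests are served oldest first, whereas the slide only says the grant rate is distributed across requesting VOQs.

```python
from collections import deque

class CentralScheduler:
    """Toy credit loop for one egress (port, priority) queue."""

    def __init__(self, egress_buffers):
        self.credits = egress_buffers  # free buffers at the egress queue
        self.requests = deque()        # ingress VOQ ids waiting for a grant

    def request(self, voq_id):
        self.requests.append(voq_id)

    def grant(self):
        """Grant the oldest request if a credit is available, else None."""
        if self.credits == 0 or not self.requests:
            return None
        self.credits -= 1
        return self.requests.popleft()

    def credit_return(self):
        """Called when a packet leaves the egress queue."""
        self.credits += 1

sched = CentralScheduler(egress_buffers=2)
for voq_id in ("a", "b", "c"):
    sched.request(voq_id)
print(sched.grant(), sched.grant(), sched.grant())  # third request must wait
sched.credit_return()                               # egress drained one packet
print(sched.grant())
```

Because grants are only issued against free egress buffer, the egress drain rate sets the grant rate, and the egress queue cannot overflow.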
Three-stage folded Clos network
Links are bundles of 8 SerDes
Aggregator ASIC in linecard to reduce number of arbitration links to central schedulers (active/standby)
Depending on ioslice bandwidth, multiple links between ioslices and crossbar(s) on linecard
[Figure: three-stage folded Clos. Labels: Linecard, Spine card, Sup, FIA, xbar, credit aggregator, central scheduler, Stage 1 and 3, Stage 2]
Not scheduled
No superframing possible
No VOQs, only input queues, independent of group membership
Replication performed at S2 and S3 stages
Separate internal datapath in xbar for multicast
Packet header contains index into replication table
Multiple retries to satisfy all required destination crossbar ports
Limited by timer. On timeout, drop
Egress ioslices do further replication to individual ports
Load balanced to fabric planes using flow hash
No reordering required
Max rate of single flow limited by link bundle capacity
Fabric can also be programmed to flow control multicast
More blocking, but preferable for financial applications where MC is low average bandwidth but very bursty
Scheduling between unicast and multicast is DWRR
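For readers unfamiliar with DWRR (deficit weighted round robin), a generic sketch follows. It is the textbook algorithm, not the fabric's implementation; the quanta, queue names, and packet sizes are made-up values.

```python
from collections import deque

def dwrr(queues, quanta, rounds):
    """Deficit weighted round robin over queues of packet lengths.

    queues: dict name -> deque of packet sizes in bytes
    quanta: dict name -> bytes of credit added per round (the weight)
    Returns the (name, size) transmit order.
    """
    deficit = {name: 0 for name in queues}
    sent = []
    for _ in range(rounds):
        for name, q in queues.items():
            if not q:
                deficit[name] = 0  # idle queues do not bank credit
                continue
            deficit[name] += quanta[name]
            while q and q[0] <= deficit[name]:
                size = q.popleft()
                deficit[name] -= size
                sent.append((name, size))
            if not q:
                deficit[name] = 0
    return sent

queues = {
    "unicast": deque([500, 500, 500, 500]),
    "multicast": deque([500, 500]),
}
order = dwrr(queues, {"unicast": 1000, "multicast": 500}, rounds=2)
print([name for name, _ in order])
```

With a 2:1 quantum ratio, unicast sends two 500-byte packets for every multicast packet while both queues are backlogged.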
Low latency
Sub-microsecond
Cut-through operation
Single chassis, multiple chips
Support for no-drop classes (FCoE)
Packet-based, with full cut-through
Speculative transmission
Similar to shared Ethernet: collision detection and retransmission
Single crossbar stage with large overspeed
Unbuffered crossbar
Input and output buffered
Single-stage crossbars
Fabric latency-optimized for 10 Gbps or 40 Gbps
Single links or bundles of 4
Fabric speedup 3.6
Out-of-band ack/nack from crossbar for each transmission
Out-of-band Xon/Xoff broadcast from each unicast output queue to all ingress
[Figure: IFIA, XBAR, EFIA with ack/nack and Xon/Xoff signaling. Labels: 48, 48, 4; 12 x 10G or 3 x 40G (shown twice); 576 x 10G or 144 x 40G]
Packets are queued into VOQs
Separate VOQ system for each input port on an ingress ioslice to support ingress cut-through
8 classes of service across entire switch
Superframing when congested
If destination not congested, unicast packet is sent immediately
Random path selection
Crossbar nacks packet if no downlink available; ingress stops and retries on a different path
Only one packet in flight per VOQ
No need to reorder packets at egress
Large speedup in egress buffer and egress downlinks to reduce collision probability
Egress queue per (priority, port)
Broadcasts Xoff before it gets too full, to avoid egress drops
After getting Xon, ingress waits a random time before attempting to send
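The speculative send with nack and retry can be sketched as below. This is only an illustration: `send_speculative` and `downlink_free` are hypothetical names, and trying each path at most once per packet is an assumption, since the slide only says the ingress retries on a different path.

```python
import random

def send_speculative(paths, downlink_free, rng=random):
    """Try crossbar paths in random order until one acks.

    paths: list of path ids.
    downlink_free(path) -> True means the crossbar acks, False means nack.
    Returns (path_used, nacks), or (None, nacks) if every path nacked.
    """
    order = list(paths)
    rng.shuffle(order)  # random path selection
    nacks = 0
    for path in order:
        if downlink_free(path):
            return path, nacks
        nacks += 1      # nacked: stop and retry on a different path
    return None, nacks

# Only the downlink reachable through path 2 is free in this example.
path, nacks = send_speculative([0, 1, 2, 3], lambda p: p == 2)
print(path, nacks)
```

Since only one packet per VOQ is in flight, a retry on another path cannot overtake an earlier packet of the same VOQ, which is why no egress reordering is needed.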
Sent immediately by ingress
Subject only to uplink availability, no egress Xon/Xoff
Replication performed in crossbar
One copy to each ioslice participating in the MC group
Crossbar nacks with success mask
If some destinations did not get a copy, ingress retries
No memory in crossbar
Egress chip has shared memory
One memory write, multiple reads according to group membership
Some destinations may be pruned based on queue lengths
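The "one write, multiple reads" behavior of the egress shared memory can be sketched with a reference count per stored packet. This is a hedged model: the class, the reference counting, and the fixed `prune_threshold` are assumptions for illustration, not details from the slides.

```python
class SharedMemoryEgress:
    """Toy egress buffer: one stored copy, one pointer per member port."""

    def __init__(self, prune_threshold):
        self.prune_threshold = prune_threshold  # max queue length before pruning
        self.queues = {}     # port -> list of buffer ids
        self.refcount = {}   # buffer id -> reads still outstanding
        self.next_id = 0

    def write(self, member_ports):
        """Store one copy and enqueue a pointer on each unpruned member port."""
        targets = [p for p in member_ports
                   if len(self.queues.setdefault(p, [])) < self.prune_threshold]
        if not targets:
            return None      # every destination was pruned
        buf = self.next_id
        self.next_id += 1
        self.refcount[buf] = len(targets)
        for p in targets:
            self.queues[p].append(buf)
        return buf

    def read(self, port):
        """Send the next packet on a port and free the buffer after the last read."""
        buf = self.queues[port].pop(0)
        self.refcount[buf] -= 1
        if self.refcount[buf] == 0:
            del self.refcount[buf]
        return buf

eg = SharedMemoryEgress(prune_threshold=1)
eg.write([1, 2])   # one write, pointers queued on ports 1 and 2
eg.write([2, 3])   # port 2 is already full, so it is pruned
print(eg.queues)
```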