The Trees of Networking

Submitted by rayc on Mon, 10/25/2021 - 09:12

Spanning Tree Protocol (STP) is a mechanism that Layer 2 switches use to prevent switching loops over redundant switch links. Switches learn about other switches in the network and the ports they are connected to by sending out Bridge Protocol Data Units (BPDUs), which advertise STP information. These BPDUs are used to determine which switch ports should forward traffic and which should block it. In the simple 3 switch topology shown below, SW1 connects to SW2 on G1/0/2 and to SW3 on G1/0/3, SW2 connects to SW1 on G1/0/1 and to SW3 on G1/0/3, and SW3 connects to SW1 on G1/0/1 and to SW2 on G1/0/2. Without STP, all of the links will forward frames. Let's take a look at why that can be a bad thing.

3 Switch Topology

Without STP

Think about how a frame gets sent around a Layer 2 network. Say a PC connected to SW2 (PC1) wants to send data to a PC connected to SW3 (PC2). In this first scenario, let's assume that all Layer 2 MAC information is known by all devices. PC1 knows the destination MAC address for PC2, so there is no need to send an ARP request for PC2's MAC. The frame is built using the known destination MAC and sent to SW2. SW2 looks at the Layer 2 header, sees that the destination MAC belongs to PC2, checks its MAC address table, finds that PC2 is reachable out port G1/0/3, and sends the frame to SW3. SW3 does the same lookup and forwards the frame to PC2 out the known egress interface as per SW3's MAC address table.

PC1 sending data to PC2 without STP

Great, that was easy, and a fully redundant switching network works perfectly. But what if we don't know the MAC address? Same topology, but this time PC1 doesn't know the MAC address of the host PC2. PC1 will send an ARP request out to the network. Remember that an ARP request is a broadcast. SW2 sees the ARP request and forwards it to both SW1 and SW3. SW1 and SW3 will each flood the ARP request out all of their other interfaces. PC2 will of course receive the ARP request and reply, but SW1 also receives the ARP request from SW3 and forwards it to SW2 again. SW2 will forward the broadcast frame again, and so on. Ethernet frames have no time-to-live field, so nothing ever removes these copies from the network. As you can see, this ends up in a big loop called a broadcast storm. This is what STP prevents.
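
To make the loop concrete, here is a minimal Python sketch of the flooding behaviour described above. It only models the three inter-switch links (host ports are left out), and the second topology assumes SW1 is the Root Bridge so that STP blocks the SW2 to SW3 link. The function and variable names are made up for illustration.

```python
# Inter-switch links of the triangle topology (host ports are left out).
TRIANGLE = {
    "SW1": ["SW2", "SW3"],
    "SW2": ["SW1", "SW3"],
    "SW3": ["SW1", "SW2"],
}

# Same topology with the SW2-SW3 link blocked, as STP would do
# (assumption: SW1 is the Root Bridge).
WITH_STP = {
    "SW1": ["SW2", "SW3"],
    "SW2": ["SW1"],
    "SW3": ["SW1"],
}


def flood_broadcast(links, first_switch, hops):
    """Return how many copies of one broadcast are in flight after each hop."""
    # Each copy is (switch that just received it, neighbour it came from).
    copies = [(first_switch, None)]
    copies_per_hop = []
    for _ in range(hops):
        next_copies = []
        for switch, came_from in copies:
            # Flood out every inter-switch link except the ingress one.
            for neighbour in links[switch]:
                if neighbour != came_from:
                    next_copies.append((neighbour, switch))
        copies_per_hop.append(len(next_copies))
        copies = next_copies
    return copies_per_hop


print(flood_broadcast(TRIANGLE, "SW2", 6))  # [2, 2, 2, 2, 2, 2]
print(flood_broadcast(WITH_STP, "SW2", 6))  # [1, 1, 0, 0, 0, 0]
```

Without STP, two copies of the single ARP request keep circling the triangle in opposite directions forever. With one link blocked, the broadcast reaches every switch once and then dies out.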

STP Versions

As mentioned earlier, Spanning Tree sends BPDU messages to all neighbouring switches in order to discover the switching topology, find any redundant links, and block them to prevent broadcast storms. There have been several iterations of STP over the years, which I will go through in future blog entries. For now, I will just list them.
•    802.1D STP: This is the original IEEE Spanning Tree standard.
•    PVST: Cisco proprietary version of 802.1D that runs a separate spanning tree instance per VLAN.
•    PVST+: Cisco proprietary enhancement of PVST that allows interoperability with other STP standards.
•    802.1W RSTP: Rapid Spanning Tree. Provides faster failover and additional features over 802.1D STP.
•    802.1S MST: Multiple Spanning Tree. Industry standard STP that maps VLANs to spanning tree instances, which allows for load balancing of VLANs.

How STP Works

For now we will use 802.1D STP to discuss how STP functions, as it is crucial for understanding the other iterations of STP. 802.1D STP is an IEEE standard that ensures a loop-free topology for a single VLAN. It does this by transitioning ports through a series of port states in order to determine which ports should forward data and which ports should block data. To begin this process, when a switch is powered on it initially assumes that it is the STP Root Bridge. The Root Bridge is the bridge (switch and bridge are interchangeable in this context) that is most important in the Layer 2 network; it is considered the center of the network, and every other switch calculates its best path towards it. The switch, assuming it is the Root Bridge, creates a Configuration BPDU advertising itself as the root and sends it out all active ports. When a neighbouring switch receives this Configuration BPDU (after already assuming that it is the root itself and sending its own Configuration BPDU advertising as such), it looks at the sending switch's Bridge Identifier. The Bridge ID consists of 2 parts:

  • Bridge Priority: 2 byte field that ranges from 0 - 65535 and is by default set to 32768 + VLAN ID.
  • MAC Address: 6 byte field that contains the sending switch's MAC address.
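
As a rough illustration of how the two parts of the Bridge ID are compared, the Python sketch below treats the Bridge ID as a (priority, MAC) pair and picks the lowest one as the root. The switch names, MAC addresses, and helper names are hypothetical; only the default priority of 32768 + VLAN ID comes from the list above.

```python
from typing import NamedTuple


class BridgeID(NamedTuple):
    priority: int  # 2 byte bridge priority
    mac: str       # 6 byte MAC address, written as aa:bb:cc:dd:ee:ff

    def sort_key(self):
        # Priority is compared first; the MAC address only breaks ties.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the name of the switch with the lowest Bridge ID."""
    return min(bridges, key=lambda name: bridges[name].sort_key())


# Hypothetical switches, all at the default priority for VLAN 1 (32768 + 1).
bridges = {
    "SW1": BridgeID(32769, "00:1a:2b:00:00:01"),
    "SW2": BridgeID(32769, "00:1a:2b:00:00:02"),
    "SW3": BridgeID(32769, "00:1a:2b:00:00:03"),
}
print(elect_root(bridges))  # SW1

# Lowering the priority on SW3 makes it win regardless of its MAC address.
bridges["SW3"] = BridgeID(4097, "00:1a:2b:00:00:03")
print(elect_root(bridges))  # SW3
```

With every switch left at the default priority, the election falls through to the MAC address, which is why the root is often just whichever switch happens to have the lowest MAC unless you set a priority yourself.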

 

To see what STP BPDU headers look like click here for a pcap of the various STP versions. 

STP Port States

After checking the Bridge ID, the switch determines whether the BPDU is inferior or superior to its own. A superior BPDU carries a lower Bridge ID than the switch's own. If the BPDU is inferior, it is ignored. If the BPDU is superior, the switch marks that bridge as the root and begins advertising it to other switches in the network. Once the Root Bridge has been identified, the next step in the STP process is to determine the port roles. There are only 3 port roles:

  • Root Port (RP): This is the port with the shortest path to the Root Bridge. There should be only 1 RP per VLAN on a switch.
  • Designated Port (DP): This is a network port that receives and forwards BPDU frames to other switches. This port provides connectivity to downstream devices and switches. There should be only a single DP per link.
  • Blocking Port: A port that is not forwarding traffic due to STP calculations.

 

Note that the Root Bridge will not have any Root Ports (RPs) at all, and all of its ports should be DPs. The Root Port is calculated by finding the shortest path to the Root Bridge. This is found by looking at the Cost value in the STP BPDU. The cost is the sum of the path costs of all the egress interfaces leading to the Root Bridge. As a BPDU is received, the path cost of the ingress interface is added to the advertised cost to the root. There are other Root Port election mechanisms that come into play depending on your topology. The following order is used when selecting a Root Port:

  1. Interface with the lowest path cost
  2. Interface associated with the lowest system priority of the advertising switch
  3. Interface associated with the lowest system MAC address of the advertising switch
  4. Lowest port priority of advertising switch
  5. Lowest port Number of advertising Switch

 

The last 2 steps in the election process come into play when you have multiple links connected to the same switch, for example, if SW2 and SW3 in our topology above had multiple links connecting to each other.
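
The five tie-breakers above can be sketched in Python as a single ordered comparison. This is only an illustration: the `ReceivedBPDU` type, its field names, and all of the values in the example (costs, MACs, port priority 128, port numbers) are assumptions, not something taken from a real switch.

```python
from typing import NamedTuple


class ReceivedBPDU(NamedTuple):
    local_port: str            # port on this switch that received the BPDU
    root_path_cost: int        # advertised cost plus the ingress port cost
    sender_priority: int       # system priority of the advertising switch
    sender_mac: int            # system MAC of the advertising switch
    sender_port_priority: int  # port priority of the advertising port
    sender_port_number: int    # port number of the advertising port


def select_root_port(bpdus):
    """Pick the local port whose BPDU wins the five tie-breakers, in order."""
    best = min(
        bpdus,
        key=lambda b: (
            b.root_path_cost,
            b.sender_priority,
            b.sender_mac,
            b.sender_port_priority,
            b.sender_port_number,
        ),
    )
    return best.local_port


# Two parallel links to the same neighbour: everything ties until port number.
candidates = [
    ReceivedBPDU("G1/0/2", 8, 32769, 0x001A2B000003, 128, 3),
    ReceivedBPDU("G1/0/3", 8, 32769, 0x001A2B000003, 128, 2),
]
print(select_root_port(candidates))  # G1/0/3

# A direct link to the root with a lower path cost wins at the first step.
candidates.append(ReceivedBPDU("G1/0/1", 4, 32769, 0x001A2B000001, 128, 2))
print(select_root_port(candidates))  # G1/0/1
```

Note that steps 4 and 5 look at the port priority and port number of the advertising switch, not of the local switch, which is why the sketch stores them as sender fields.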

Each interface type has a default cost. STP has 2 modes for path cost: short mode and long mode. To configure long mode, use the global configuration command spanning-tree pathcost method long. Short mode was the original standard; however, as link speeds increased it stopped being useful, because every link of 20Gbps or faster ends up with the same cost of 1. For 802.1D STP the path costs are as follows:

Link Speed    Short Mode Cost    Long Mode Cost
10Mbps        100                2000000
100Mbps       19                 200000
1Gbps         4                  20000
10Gbps        2                  2000
20Gbps        1                  1000
100Gbps       1                  200
1Tbps         1                  20
10Tbps        1                  2
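
To show how these per-port costs add up into a root path cost, here is a small Python sketch that uses the values from the table above. The function names are my own, and the example path (a 1Gbps link followed by a 100Mbps link) is hypothetical.

```python
# Link speed in Mbps -> (short mode cost, long mode cost), from the table above.
PATH_COST = {
    10: (100, 2_000_000),
    100: (19, 200_000),
    1_000: (4, 20_000),
    10_000: (2, 2_000),
    20_000: (1, 1_000),
    100_000: (1, 200),
    1_000_000: (1, 20),
    10_000_000: (1, 2),
}


def port_cost(speed_mbps, mode="short"):
    """Look up the default cost of one port for the given path cost mode."""
    short_cost, long_cost = PATH_COST[speed_mbps]
    return short_cost if mode == "short" else long_cost


def root_path_cost(link_speeds_mbps, mode="short"):
    """Sum the port costs of every link on the way to the Root Bridge."""
    return sum(port_cost(speed, mode) for speed in link_speeds_mbps)


# Hypothetical path to the root: a 1Gbps link followed by a 100Mbps link.
print(root_path_cost([1_000, 100]))          # 23
print(root_path_cost([1_000, 100], "long"))  # 220000

# Short mode cannot tell 20Gbps and 100Gbps apart, long mode can.
print(port_cost(20_000), port_cost(100_000))                  # 1 1
print(port_cost(20_000, "long"), port_cost(100_000, "long"))  # 1000 200
```

As a side observation, every long mode value in the table works out to 20,000,000 divided by the link speed in Mbps, which is why long mode keeps distinguishing links all the way up to 10Tbps.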

Once the RPs and DPs have been elected, the next step is to identify the port states. There are several states a switch port transitions through in order to reach the final state of forwarding or blocking traffic. 802.1D STP port states are as follows:

  • Disabled: The port is administratively shut down.
  • Blocking: This is a final STP state. The port is enabled but not forwarding any traffic. BPDUs can still be received on a blocking port.
  • Listening: This is a transitional state. The port moves from Blocking to Listening and can now send or receive BPDUs. Network traffic is still not forwarded. This state by default lasts 15 seconds.
  • Learning: This is also a transitional state. The port has moved from Listening to Learning and will now update the MAC address table with network traffic that is received. The switch still does not forward this network traffic. This state by default lasts for 15 seconds.
  • Forwarding: This is a final state for STP. All network traffic is now forwarded and the MAC table is updated as normal.
  • Broken: The switch has detected a configuration or operational issue on the port. The port discards packets as long as the problem continues.
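
The transitional states and their timers can be sketched as a simple timeline in Python. This is a simplified model under the default timers listed above; it ignores the Disabled and Broken states, and the names are made up for illustration.

```python
FORWARD_DELAY = 15  # seconds, the default length of Listening and of Learning

# What a port does in each state: (receives BPDUs, learns MACs, forwards traffic)
STATE_BEHAVIOUR = {
    "Blocking": (True, False, False),
    "Listening": (True, False, False),
    "Learning": (True, True, False),
    "Forwarding": (True, True, True),
}


def transition_timeline(forward_delay=FORWARD_DELAY):
    """Return (state, seconds since leaving Blocking) for each state entered."""
    timeline = []
    elapsed = 0
    for state in ("Listening", "Learning"):
        timeline.append((state, elapsed))
        elapsed += forward_delay  # each transitional state lasts one delay
    timeline.append(("Forwarding", elapsed))
    return timeline


print(transition_timeline())
# [('Listening', 0), ('Learning', 15), ('Forwarding', 30)]
```

In other words, with default timers a port that leaves Blocking needs 30 seconds before it forwards any user traffic.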

 

When you have 2 non-root switches connected together using redundant links, the switches need to determine which ports will be forwarding and which will be blocking. To do this, the switches use the following steps:

  1. Interface is a Designated Port and not a Root Port
  2. The switch with the lowest root path cost forwards and the other blocks
  3. Remote switch system priority is checked and the lowest priority forwards
  4. System MAC addresses are compared, and the lowest MAC forwards
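
Here is a minimal Python sketch of that decision for a single link between two non-root switches. It assumes step 1 has already been checked (neither end of the link is a Root Port), and the dictionary keys, costs, and MAC values are hypothetical.

```python
def resolve_link(switch_a, switch_b):
    """Decide which end of a link between two non-root switches forwards.

    Assumes neither port on this link is a Root Port (step 1 above).
    """
    def key(switch):
        # Steps 2 to 4: root path cost, then system priority, then system MAC.
        return (switch["root_path_cost"], switch["priority"], switch["mac"])

    winner, loser = sorted([switch_a, switch_b], key=key)
    return {winner["name"]: "Forwarding", loser["name"]: "Blocking"}


# Hypothetical values: both switches sit one 1Gbps hop away from the root.
sw2 = {"name": "SW2", "root_path_cost": 4, "priority": 32769, "mac": 0x001A2B000002}
sw3 = {"name": "SW3", "root_path_cost": 4, "priority": 32769, "mac": 0x001A2B000003}
print(resolve_link(sw2, sw3))  # {'SW2': 'Forwarding', 'SW3': 'Blocking'}
```

With equal cost and equal priority the lower MAC decides, so in this example SW2 keeps its end of the link forwarding and SW3 blocks.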

 

Once all Root Ports and Designated Ports have been found and all port states have transitioned to either a forwarding or blocking state, the STP topology is considered stable. BPDUs continue to be advertised by the Root Bridge every Hello Time interval, which by default is 2 seconds. I'll discuss STP failover and timers in a future blog post.

Port Types

STP has two main port types.

  • P2p: These ports connect to another network device such as a PC or RSTP Switch.
  • P2p Edge: These ports are configured with portfast enabled. I will explain portfast in another post.

 

STP Link Failure/Change

When a switch detects a failure or a link change, it creates a TCN (Topology Change Notification) BPDU and sends it towards the Root Switch out of its Root Port. If there is an upstream switch between the sending switch and the Root Switch, the upstream switch sends an acknowledgement to the switch it received the TCN from and continues to forward the TCN BPDU. Once the Root Switch has received the TCN, it sends a new Configuration BPDU with the Topology Change (TC) flag set to all switches. When the downstream switches receive the Configuration BPDU with the TC flag set, they set the MAC address aging timer to the same value as the Forward Delay timer (15 seconds by default). This allows the switch to flush the MAC addresses of any devices that are no longer active. Once the switch receives the second Configuration BPDU, the MAC address aging timer is reset back to its original value (300 seconds by default).
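
The effect of shortening the MAC address timer can be sketched with a toy MAC table in Python. The class, its method names, and the timestamps are all hypothetical; only the 300 second and 15 second defaults come from the paragraph above.

```python
class MacTable:
    DEFAULT_AGING = 300  # seconds
    FORWARD_DELAY = 15   # seconds

    def __init__(self):
        self.aging_time = self.DEFAULT_AGING
        self.entries = {}  # MAC address -> (port, time last seen)

    def learn(self, mac, port, now):
        self.entries[mac] = (port, now)

    def topology_change(self, active):
        # Shorten aging while the topology change flag is seen, then restore it.
        self.aging_time = self.FORWARD_DELAY if active else self.DEFAULT_AGING

    def age_out(self, now):
        """Drop entries idle for longer than the current aging time."""
        self.entries = {
            mac: (port, seen)
            for mac, (port, seen) in self.entries.items()
            if now - seen <= self.aging_time
        }
        return sorted(self.entries)


table = MacTable()
table.learn("aa:aa:aa:aa:aa:aa", "G1/0/1", now=0)   # host that then goes quiet
table.learn("bb:bb:bb:bb:bb:bb", "G1/0/3", now=95)  # host that is still talking
print(table.age_out(now=100))  # both entries survive with 300 second aging

table.topology_change(active=True)
print(table.age_out(now=100))  # only the recently seen bb:... entry remains

table.topology_change(active=False)  # back to 300 second aging
```

The point of the shorter timer is that entries for hosts that are still sending traffic get refreshed and stay, while stale entries that may now point out the wrong port disappear within 15 seconds instead of 5 minutes.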

I'd like to mention that TCN BPDUs are generated on a per-VLAN basis. In some large networks this means that a single link failure can result in a large number of TCN BPDUs and STP recalculations. You can view the details of topology changes using the command show spanning-tree [vlan] detail.

sh span vl 10 detail

In the above output you can see that there have been 8 topology changes for VLAN 10 and that the last one occurred 3 minutes and 40 seconds ago. You can also see the details of the hello, max age, and forward delay timers along with how many BPDUs have been sent and received. Notice that for Port 3 and 4, the number of BPDUs received is 1. This is because these ports connect to SW3, and the links on SW3 are in a BLK state.

STP will handle direct link failures in different ways depending on the topology and the failure. Using our 3 switch topology above with SW1 as the root, let's take a look at the 3 types of direct link failures that can occur.

  1. Link failure between SW2 and SW3: This type of link failure has no impact on data flow, as the ports between SW2 and SW3 are not forwarding. Both switches will still send a TCN BPDU to the root, however, and all of the switches flush their MAC address tables.
  2. Link failure between SW1 and SW3: SW3 will remove its best BPDU for SW1 as the RP has gone down. SW1 will advertise a TCN BPDU to SW2. SW2 and SW3 will both receive this TCN BPDU and set their MAC timer to the Forward Delay timer. SW3 must either hear from the Root Switch again or wait for the max age timer to expire before the port state can be reset and it can listen to BPDUs from SW2.
  3. Link failure between SW1 and SW2: Again SW1 will send a TCN BPDU, and SW2 will advertise itself as the Root to SW3. SW3 sets its MAC timer to the max age timer and flushes the CAM table. SW3 receives SW2's inferior BPDU, discards it, changes its port to SW2 from Blocking to Listening, and then forwards SW1's superior BPDU to SW2. SW2 will then mark its port to SW3 as the new RP.

 

Old STP Path Cost Values

I would just like to add, for reference, that there are 3 scales for STP root path cost: the Old mode, the Short mode, and the Long mode.

Link Speed    Old Mode    Short Mode    Long Mode
4Mbps         250         250
10Mbps        100         100           2000000
16Mbps        63          62
45Mbps        22          39
100Mbps       10          19            200000
155Mbps       6           14
622Mbps       2           6
1Gbps         1           4             20000
10Gbps        0           2             2000