A few weeks ago I helped troubleshoot an FCoE problem involving a flaky twinax cable that was causing excessive bit errors. The symptom being observed was very strange at first glance: CRC errors were being logged on interfaces that should have been idle. At first we thought the switch was simply incrementing the wrong CRC counter, but as we would soon discover, the root cause was much more interesting. In addition, after reviewing the NVGRE, VXLAN and STT encapsulation proposals, I believe a similar sort of problem may exist with all three. If this turns out to be the case, I think now would be the time to address it.
The problematic FCoE topology was similar to the following:
As shown in the above diagram, the defective twinax cable was used to connect Server 1 to interface 1 on the FCoE switch. Since we were performing FCoE I/O between Server 1 and the Storage, and we knew the twinax cable was causing bit errors, we expected to see:
- the “receive CRC error” counter increment on FCoE switch interface 1;
- the “transmit CRC error” counter increment on FCoE switch interface 3; and
- the “receive CRC error” counter increment on the storage array.
Unfortunately, this wasn’t the case and instead we observed:
- the expected behavior from above; AND
- the “transmit CRC error” counter occasionally increment on FCoE switch interface 2! This was completely unexpected because only Server 1 was generating any kind of significant I/O and it was all FCoE based. Therefore this traffic should only have been visible to the storage array and Server 2 should not have been receiving any frames.
Upon reviewing an Xgig trace between FCoE switch interface 3 and the storage port, I noticed not only a few CRC errors but also a number of ABTS and bad SCSI status frames. As I expected, these were related to exchanges or I/Os that were impacted by a corrupted frame. However, in at least one of the cases, the bad SCSI status was being returned by the Storage port when it detected a missing frame. The significance of this wouldn’t be clear until later.
Finally, we inserted the analyzer between Server 2 and FCoE Switch interface 2. We set it to trigger on a bad CRC/FCS, and when it triggered I was very surprised to see an FCoE data frame from Server 1 that was addressed to the D_ID of the storage array! I mean, think about this for a second: a frame containing data sent from a host to the storage array was being transmitted to another host on a completely different VLAN!
The reason this happened has to do with one of the basic mechanisms of Ethernet switching known as the unicast flood. In a previous post, I described the process of forwarding in an FCoE environment, so rather than rehash that here, I’ll direct those of you who want more detail to that post. In this case, the FCoE frame was being flooded because the corruption occurred in the Ethertype field.
For additional detail, refer to the diagram of a typical FCoE Frame below. Note that with FCoE, the frame typically starts with the Ethernet DA and SA, followed by the 802.1Q VLAN Tag.
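To make that layout concrete, here is a rough sketch in Python that assembles the byte layout just described (DA, SA, 802.1Q tag, FCoE Ethertype, encapsulated FC frame). It is purely illustrative; the MAC addresses and payload are made up, the FCoE header fields are collapsed into the payload, and the FCS is omitted since it is normally computed in hardware.

```python
import struct

def build_fcoe_frame(dst_mac: bytes, src_mac: bytes, vlan_id: int, fc_frame: bytes) -> bytes:
    """Illustrative FCoE frame layout: DA, SA, 802.1Q tag, FCoE Ethertype, payload."""
    TPID_8021Q = 0x8100      # the 802.1Q tag begins with this Ethertype value
    ETHERTYPE_FCOE = 0x8906  # the actual Ethertype of the frame

    tag = struct.pack("!HH", TPID_8021Q, vlan_id & 0x0FFF)  # PCP/DEI assumed 0
    return dst_mac + src_mac + tag + struct.pack("!H", ETHERTYPE_FCOE) + fc_frame

# Hypothetical FPMA-style addresses (FC-MAP prefix 0e:fc:00 + FC_ID) and a dummy FC frame
frame = build_fcoe_frame(bytes.fromhex("0efc00010203"),
                         bytes.fromhex("0efc000a0b0c"),
                         vlan_id=100,
                         fc_frame=b"\x00" * 28)
print(frame.hex())
```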
In general, when an Ethernet frame is received with a VLAN tag, if the DA is known, the frame will be forwarded based upon a MAC Address table that reflects the switch’s understanding of the correct route to the destination address. If the DA is unknown, the frame will be flooded on that VLAN, and eventually the MAC Address table will be updated to allow future frames to be forwarded to the appropriate destination. Taking this a step further, if the frame is a FIP or FCoE frame, additional forwarding rules will be applied that are (hopefully) consistent with FC-BB-5 annex C and D. These additional rules are described in another post that has to do with FIP Snooping Bridges.
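As a very simplified sketch of that lookup-or-flood decision (ignoring the extra FIP/FCoE rules from FC-BB-5, and using table and port structures of my own invention rather than any vendor’s implementation):

```python
# Simplified L2 forwarding decision: look the DA up in the per-VLAN MAC table,
# forward on a hit, flood the VLAN on a miss.
mac_table = {}  # (vlan_id, dst_mac) -> egress port

def forward(vlan_id, dst_mac, src_mac, ingress_port, vlan_ports):
    # Learn the source address (note: FPMAs are typically NOT learned this way).
    mac_table[(vlan_id, src_mac)] = ingress_port

    egress = mac_table.get((vlan_id, dst_mac))
    if egress is not None:
        return [egress]                                   # known DA: unicast forward
    # Unknown DA: flood to every port in the VLAN except the one it arrived on.
    return [p for p in vlan_ports[vlan_id] if p != ingress_port]
```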
Three additional items that I have to mention before getting to the point:
- The 802.1Q tag actually begins with an Ethertype value (0x8100) that tells the switch how to interpret the rest of the tag (e.g., the VLAN ID) and then the actual Ethertype of the frame (e.g., FCoE or 0x8906).
- The FCoE capable switches that I am aware of do not learn Fabric Provided MAC Addresses (FPMAs) in the same way that normal MAC Addresses are learned. Part of the reason has to do with the special handling requirements outlined above.
- Due to performance concerns, the FCoE capable switches that I am aware of all support cut-through routing of FCoE frames; this means that the frame header can be transmitted before the FCS is received. As a result, frames with a bad FCS are forwarded.
Getting back to the trace between FCoE Switch interface 2 and Server 2: in the trace I captured, I noticed that the frame with a bad FCS had a VLAN Tag that was corrupted (let’s say to 0x8101; to be honest, I can’t remember the exact value). As a result, instead of detecting an Ethertype of 0x8100 and handling the tag appropriately, the switch detected a different Ethertype (0x8101) and did what any Ethernet switch would do when a frame is received on the default VLAN (remember, there is no VLAN ID since the tag was corrupted) and the DA is unknown (due to the FPMA learning restriction): it flooded the frame on the default VLAN.
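To illustrate the failure mode, here is a toy parse (my own simplification, not any vendor’s logic) of the bytes that follow the SA. With a single bit flipped, 0x8100 becomes 0x8101, the switch never reads the VLAN ID or the inner FCoE Ethertype, and the frame falls back to ordinary default-VLAN handling:

```python
import struct

def classify(after_sa: bytes) -> str:
    """Toy classification of the bytes immediately following the SA."""
    (first_ethertype,) = struct.unpack("!H", after_sa[:2])
    if first_ethertype == 0x8100:                     # 802.1Q tag recognized
        (tci,) = struct.unpack("!H", after_sa[2:4])
        vlan_id = tci & 0x0FFF
        (inner,) = struct.unpack("!H", after_sa[4:6])
        if inner == 0x8906:
            return f"FCoE frame on VLAN {vlan_id}: apply FC-BB-5 forwarding rules"
        return f"tagged frame on VLAN {vlan_id}, Ethertype {inner:#06x}"
    # No recognizable tag: the frame lands on the port's default VLAN and an
    # unknown Ethertype (e.g., 0x8101) is just opaque payload.
    return f"untagged frame, Ethertype {first_ethertype:#06x}: default VLAN, flood if DA unknown"

good = struct.pack("!HHH", 0x8100, 100, 0x8906)   # healthy FCoE frame on VLAN 100
bad  = struct.pack("!HHH", 0x8101, 100, 0x8906)   # one bit flipped in the tag's Ethertype
print(classify(good))
print(classify(bad))
```

Running this prints an FCoE classification for the healthy frame and a default-VLAN/flood decision for the corrupted one, which is exactly the behavior we captured.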
To make matters more interesting, in order for FIP to work properly, a default VLAN must be defined and typically all FCoE devices will use the same default VLAN for FIP VLAN discovery. Note, this isn’t a requirement of the protocol, but it is easiest to configure it this way and it works perfectly fine as long as there are no bit errors.
The bottom line is this: if you are using FCoE and you are experiencing bit errors, there is a chance (albeit a small one) that an FCoE data frame may be flooded on the default VLAN, and this VLAN will probably be used by other FCoE capable devices. As a result, ensuring that no bit errors are being logged in your FCoE environment is a VERY good idea.
Does this mean you should avoid FCoE? Of course not! The same problem can occur in any Ethernet network (as long as cut-through routing is used). I’ll go out on a limb and say that this problem could occur any time protocol-based isolation is used to partition a physical link. Note that I explicitly said PROTOCOL based; I am not asserting that this could happen with DWDM or TDM based approaches.
The implications for Network Virtualization
Network Virtualization supports multi-tenancy, and the current approaches use a protocol-based partitioning method. Essentially, all of the proposed tunneling encapsulations (e.g., VXLAN, NVGRE and STT) make use of a “Tenant ID” to signal to the tunnel end points which tenant the frame belongs to. Although I have not had the opportunity to force a similar type of problem to occur with any of the proposed NV approaches, I think it may be possible for a bit in the Tenant ID field to get flipped and for a data frame to be delivered to the wrong tenant. In fairness, this problem would seem to be much less likely to happen with NV and could probably be prevented from occurring by the NV Control Plane. As a result, I’ll avoid speculating much more on what could happen until I have more data. That having been said, I’m raising the issue now so that it can be investigated by those of us who are interested and a solution can be built into the protocols (if needed).
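As a rough illustration of the concern, take VXLAN, where the tenant identifier is a 24-bit VNI carried in the outer header. The sketch below (purely illustrative, with a made-up VNI value, and saying nothing about how any real implementation validates the field) builds a minimal VXLAN header and flips a single bit in the VNI; the result is simply another valid VNI, i.e., potentially another tenant.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    # Minimal VXLAN header (8 bytes): flags byte with the I bit set,
    # 3 reserved bytes, 24-bit VNI, 1 reserved byte.
    return struct.pack("!B3xI", 0x08, (vni & 0xFFFFFF) << 8)

def vni_of(header: bytes) -> int:
    return struct.unpack("!I", header[4:8])[0] >> 8

hdr = vxlan_header(5001)                 # hypothetical VNI for "tenant A"
corrupted = bytearray(hdr)
corrupted[6] ^= 0x01                     # a single bit error in the VNI field
print(vni_of(hdr), "->", vni_of(bytes(corrupted)))   # 5001 -> 5000: a different tenant
```

Whether the outer checksums or the NV control plane would catch this in practice is exactly the question I think is worth investigating.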
Thanks for reading!