Mastering Lossless Networks for RDMA: A Comprehensive Guide

Technology Feast | How to build a lossless network for RDMA

Time: September 10th, 2024

Why do we need lossless networks?

After reading the previous technical article on ECMP, you should have some understanding of lossless networks. This article discusses a related technology, RDMA (Remote Direct Memory Access). You may ask: why do we need RDMA? Why do we need lossless networks? What benefits can these technologies bring us?

A satisfactory answer may not be found at the network level alone. The following simple examples from front-end services and back-end applications should help answer these questions.

First, a large number of online services on the Internet, such as online search, shopping, and live streaming, must respond to high-frequency user requests very quickly. Delay at any link in the data center significantly degrades the end user's experience, which in turn affects the service's traffic, reputation, and number of active users.

In addition, driven by machine learning and AI, the demand for computing power is growing exponentially. To support increasingly complex neural networks and deep learning models, data centers deploy large numbers of distributed computing clusters. However, communication delays among the many parallel programs greatly reduce the efficiency of the overall computation.

Furthermore, to cope with explosive data growth and improve storage read and write efficiency in data centers, distributed storage over converged Ethernet networking is becoming more and more popular. However, because storage traffic consists mainly of elephant flows, a packet lost to congestion triggers retransmission of the elephant flow, which not only reduces efficiency but also aggravates congestion.

Therefore, from the perspective of both front-end user experience and back-end application efficiency, data center networks today need latency to be as low as possible and efficiency to be as high as possible.

RDMA technology came into being to reduce latency inside the data center network and improve processing efficiency. It allows user-mode applications to read and write remote memory directly, without CPU intervention or multiple memory copies. It also bypasses the kernel and hands data directly to the network card, achieving high throughput, ultra-low latency, and low CPU overhead.

The current RDMA transport protocol on Ethernet is RoCEv2. RoCEv2 is based on the connectionless UDP protocol. Compared with the connection-oriented TCP protocol, UDP is faster and consumes fewer CPU resources. However, unlike TCP, it has no sliding window, acknowledgment, or similar mechanisms for reliable transmission. Once packet loss occurs, detection and retransmission are left to the layers above UDP, which greatly reduces the transmission efficiency of RDMA.

Therefore, to bring out the true performance of RDMA and break through the network bottleneck of large-scale distributed systems in data centers, RDMA needs a lossless network environment, that is, one with no packet loss. The key to avoiding packet loss is to solve network congestion.

Why does congestion occur?

There are many reasons for congestion. The following are three key and common reasons in data center scenarios:

1. Convergence ratio

When designing the data center network architecture, most people will adopt an asymmetric bandwidth design from the perspective of cost and benefit, that is, the upstream and downstream link bandwidths are inconsistent. The convergence ratio of the switch is simply the total input bandwidth divided by the total output bandwidth. Taking Ruijie's 10G switch RG-S6220-48XS6QXS-H as an example, the downstream bandwidth available for server input is 48*10G=480G, the upstream output bandwidth is 6*40G=240G, and the overall convergence ratio is 2:1. For the 25G switch RG-S6510-48VS8CQ, the downstream bandwidth available for server input is 48*25G=1200G, the upstream output bandwidth is 8*100G=800G, and the overall convergence ratio is 1.5:1.

That is to say, when the total rate at which the downstream servers send packets upstream exceeds the total uplink bandwidth, congestion occurs at the uplink ports.
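The convergence ratio arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation; the function name `convergence_ratio` is made up for illustration, and the port counts and speeds are the ones quoted in the text.

```python
from fractions import Fraction


def convergence_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Return (downstream Gbps, upstream Gbps, ratio) for one switch."""
    downstream = down_ports * down_gbps
    upstream = up_ports * up_gbps
    return downstream, upstream, Fraction(downstream, upstream)


# Port counts and speeds as quoted in the article for the two switch models.
switches = {
    "RG-S6220-48XS6QXS-H": (48, 10, 6, 40),
    "RG-S6510-48VS8CQ": (48, 25, 8, 100),
}

for name, ports in switches.items():
    down, up, ratio = convergence_ratio(*ports)
    print(f"{name}: {down}G / {up}G = {float(ratio)}:1")
```

Any ratio above 1:1 means that the servers can, in the worst case, offer more traffic than the uplinks can carry.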

2. ECMP

Currently, most data center networks use a fabric architecture and ECMP to build multiple equal-cost, load-balanced links. ECMP simply applies a perturbation factor and selects a link for forwarding by hashing, but this process does not consider whether the selected link is itself congested. ECMP has no congestion-awareness mechanism; it only distributes flows across different links. If a link is already congested, ECMP is likely to aggravate the congestion on that link.
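The following minimal sketch illustrates why hash-based selection is blind to congestion. Real switches use vendor-specific hash functions; CRC32 is used here only as a stand-in, and `ecmp_pick`, the sample 5-tuple, and the link utilisation figures are all hypothetical.

```python
import zlib


def ecmp_pick(flow, num_links, seed=0):
    """Pick an uplink index from the flow 5-tuple; link load is never consulted."""
    key = "|".join(str(field) for field in flow).encode()
    # `seed` plays the role of the perturbation factor mentioned in the text.
    return zlib.crc32(key, seed) % num_links


# src IP, dst IP, protocol (17 = UDP), src port, dst port (4791 = RoCEv2)
flow = ("10.0.0.1", "10.0.1.9", 17, 49152, 4791)

# Hypothetical utilisation per uplink. The hash never looks at these values,
# so the flow may land on the 95% loaded link just as easily as on an idle one.
link_load = [0.95, 0.10, 0.10, 0.10]

chosen = ecmp_pick(flow, len(link_load))
print(f"flow mapped to link {chosen}, current load {link_load[chosen]:.0%}")
```

Because the result depends only on the packet headers and the seed, every packet of the flow follows the same link no matter how congested that link becomes.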

3. TCP Incast

TCP incast is a many-to-one communication pattern. It occurs frequently as data centers move to the cloud, especially in distributed storage and computing applications implemented in a scale-out manner, such as Hadoop, MapReduce, and HDFS.

For example, when a Parent Server initiates a request to a group of nodes (a server cluster or storage cluster), all nodes in the cluster receive the request at the same time and respond almost simultaneously. Many nodes send TCP data streams to one machine (the Parent Server) at once, generating a "micro-burst" of traffic that exceeds the egress port buffer on the switch port connected to the Parent Server and causes congestion.

▲ Figure 1 TCP Incast traffic model

As mentioned earlier, RDMA differs from TCP in that it requires a lossless network. For ordinary micro-burst traffic, the switch buffer helps to some extent by queuing the burst packets. However, because increasing switch buffer capacity is very expensive, its role is limited. Once too many packets are queued in the buffer, packet loss still occurs.

To prevent packet loss caused by buffer overflow, the switch must introduce other mechanisms, such as flow control, to regulate the traffic on the link and relieve the pressure on the switch buffer.
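A rough fluid model shows how quickly an incast burst can exceed a buffer. The sketch below assumes that every sender link runs at the same speed as the egress link and that all bursts start at the same instant; the function name, burst size, and buffer size are hypothetical numbers chosen for illustration, not measurements from any switch.

```python
def incast_drop_bytes(senders, burst_bytes, buffer_bytes):
    """Bytes dropped at the egress port when `senders` nodes burst at once.

    Assumption: all links have equal speed, so while the bursts overlap the
    port receives `senders` bytes for every byte it drains. Without a buffer
    limit the queue would peak at (senders - 1) * burst_bytes; whatever
    exceeds the buffer is dropped.
    """
    peak_queue = (senders - 1) * burst_bytes
    return max(0, peak_queue - buffer_bytes)


KB = 1024
MB = 1024 * KB

# Hypothetical example: 64 KB responses arriving at a port with 1 MB of buffer.
for n in (8, 16, 32):
    dropped = incast_drop_bytes(n, 64 * KB, 1 * MB)
    print(f"{n} senders: {dropped} bytes dropped")
```

In this model the buffer absorbs the burst from 8 or 16 senders but not from 32, which is why buffering alone cannot guarantee a lossless network.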

How does PFC implement flow control?

IEEE 802.1Qbb (Priority-based Flow Control), referred to as PFC, is part of the IEEE Data Center Bridging (DCB) protocol family and is an enhanced version of flow control.

Before discussing PFC, let us first look at the IEEE 802.3x flow control mechanism: when the receiver cannot process the frames it receives, it notifies the sender to temporarily stop sending so that the frames are not discarded.

As shown in the following figure, when ports G0/1 and G0/2 forward packets at a rate of 1 Gbps, port F0/1 will be congested. To avoid packet loss, enable the Flow Control function on ports G0/1 and G0/2.

▲ Figure 2 Traffic model of port congestion

● When F0/1 is congested while forwarding messages, switch B queues the messages in the port buffer. When the congestion exceeds a certain threshold, port G0/2 sends a PAUSE frame to G0/1, notifying G0/1 to temporarily stop sending messages.

● G0/1 temporarily stops sending packets to G0/2 after receiving the PAUSE frame. The PAUSE frame carries the duration of the pause. Switch A resumes sending either when this pause time expires or when it receives a control frame with a pause time of 0.
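To give a feel for the pause duration carried in a PAUSE frame: in IEEE 802.3x the pause time is a 16-bit value counted in quanta of 512 bit times, so the real duration depends on the link speed. The helper below is a small sketch of that conversion; the function name is hypothetical.

```python
def pause_duration_us(quanta, link_bps):
    """Convert a PAUSE time in quanta (512 bit times each) to microseconds."""
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("pause time is a 16-bit field")
    return quanta * 512 / link_bps * 1e6


# Longest pause a single frame can request at 1 Gbps and at 10 Gbps.
for speed in (10**9, 10**10):
    longest = pause_duration_us(0xFFFF, speed)
    print(f"{speed // 10**9} Gbps: max pause {longest:.2f} us")

# A pause time of 0 tells the sender to resume immediately.
print(pause_duration_us(0, 10**9))
```

Since one frame can pause a 1 Gbps link for only about 33.5 ms, a congested receiver has to keep sending PAUSE frames for as long as the congestion lasts.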

The IEEE 802.3x protocol has a disadvantage: once the link is paused, the sender can no longer send any data packets. If the pause is caused by lower-priority data streams, higher-priority data streams on the same link are paused as well, so the cost outweighs the benefit.

As shown in the message analysis in the figure below, PFC is an extension of the basic flow control IEEE 802.3X, allowing the creation of 8 virtual channels on an Ethernet link, specifying a corresponding priority for each virtual channel, allowing any virtual channel to be paused and restarted individually, while allowing the traffic of other virtual channels to pass uninterrupted.

▲ Figure 3 Analysis of the PFC protocol message structure
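As an illustration of the per-priority pause described above, the following sketch assembles a PFC frame: a MAC Control frame (EtherType 0x8808) with opcode 0x0101, a priority-enable vector, and eight 16-bit pause times, one per priority. The helper name `build_pfc_frame` and the all-zero source MAC are assumptions made for the example; the frame is built without the FCS.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def build_pfc_frame(src_mac, pause_quanta):
    """Build a PFC frame; pause_quanta maps priority (0-7) to a pause time.

    A pause time of 0 for an enabled priority asks the peer to resume it.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    enable_vector = 0
    times = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7 or not 0 <= quanta <= 0xFFFF:
            raise ValueError("priority must be 0-7 and quanta 16 bits")
        enable_vector |= 1 << priority
        times[priority] = quanta
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *times)
    header = PFC_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    # Pad to the 60-byte minimum Ethernet frame size (FCS excluded).
    return (header + payload).ljust(60, b"\x00")


# Pause only priority 3 for the maximum time; the other 7 priorities keep flowing.
frame = build_pfc_frame(bytes(6), {3: 0xFFFF})
print(frame.hex())
```

Only the bit for priority 3 is set in the enable vector, so traffic in the other seven virtual channels is unaffected.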

PFC refines the granularity of flow control from physical ports to 8 virtual channels, which correspond to the 8 hardware send queues on the Smart NIC hardware (these queues are named Traffic Class, TC0, TC1 ... TC7 respectively). There are also different mapping methods under different RDMA encapsulation protocols.

RoCEv1:
This protocol encapsulates the RDMA data segment in the Ethernet payload and adds an Ethernet header, so the result is a Layer 2 packet. To classify it, only the 3-bit PCP (Priority Code Point) field in the VLAN (IEEE 802.1Q) header can be used to set the priority value.

▲ Figure 4 Layer 2 Ethernet frame VLAN header structure

RoCEv2:
This protocol encapsulates the RDMA data segment in the UDP payload, adds the UDP header, then the IP header, and finally the Ethernet header, which makes it a Layer 3 packet. It can be classified by using either the PCP field in the Ethernet VLAN header or the DSCP field in the IP header.

▲ Figure 5 Layer 3 IP message header structure

In simple terms, in a Layer 2 network PFC uses the PCP bits in the VLAN header to distinguish data flows. In a Layer 3 network PFC can use either PCP or DSCP, so that different data flows enjoy independent flow control. Most data centers currently use Layer 3 networks, so DSCP is the more advantageous choice.
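The two classification fields can be read with simple bit operations. The sketch below assumes the standard layouts: a 16-bit VLAN TCI (3-bit PCP, 1-bit DEI, 12-bit VLAN ID) and the 8-bit IP traffic class byte (6-bit DSCP, 2-bit ECN). The DSCP-to-priority rule in `priority_from_dscp` is purely hypothetical; on a real switch this mapping is whatever the administrator configures.

```python
def pcp_from_tci(tci):
    """Extract the 3-bit PCP from a 16-bit VLAN TCI (PCP | DEI | VLAN ID)."""
    return (tci >> 13) & 0x7


def dscp_ecn_from_tos(tos):
    """Split the 8-bit IP traffic class byte into (DSCP, ECN)."""
    return tos >> 2, tos & 0x3


def priority_from_dscp(dscp):
    """Hypothetical mapping: use the top 3 DSCP bits as the PFC priority."""
    return dscp >> 3


# Layer 2 case: VLAN 100 tagged with PCP 5.
tci = (5 << 13) | 100
print("PCP:", pcp_from_tci(tci))

# Layer 3 case: DSCP 26 with ECN bits 10.
tos = (26 << 2) | 0b10
dscp, ecn = dscp_ecn_from_tos(tos)
print("DSCP:", dscp, "ECN:", format(ecn, "02b"), "priority:", priority_from_dscp(dscp))
```

The DSCP value travels in the IP header, while the PCP value exists only in frames that carry a VLAN tag.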

PFC deadlock

Although PFC can implement queue-based flow control by mapping different priorities to different queues, it also introduces new problems, such as PFC deadlock.

PFC deadlock refers to a network state in which, because of micro-loops or other causes, several switches become congested at the same time, the buffer consumption of each port exceeds its threshold, and the switches wait for one another to release resources, so that the data flows on all of them are permanently blocked.

Under normal circumstances, when a switch port is congested and the XOFF watermark is reached, the switch (the downstream device) sends a PAUSE frame out of the port on which the data arrives to apply back pressure. After receiving the PAUSE frame, the upstream device stops sending data. If its own port buffer consumption then exceeds the threshold, it applies back pressure to its upstream device in turn. Back pressure propagates hop by hop in this way until the end server stops sending data for the Pause Time specified in the PAUSE frame, thereby eliminating packet loss caused by congestion at the network nodes.

However, in special cases, such as when a link failure or device failure occurs, a short loop may occur during the re-convergence of BGP routing, resulting in a circular buffer dependency. As shown in the figure below, when all four switches reach the XOFF watermark, they all send PAUSE frames to the other end at the same time. At this time, all switches in the topology are in a stopped state. Due to the back pressure effect of PFC, the throughput of the entire network or part of the network will become zero.

▲ Figure 6 PFC deadlock diagram

Even in a network designed to be loop-free, a transient loop can cause a deadlock. Although such loops disappear quickly once the fault is repaired, the deadlocks they cause are not temporary and cannot recover automatically, even if the servers are restarted to interrupt the traffic.
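The circular buffer dependency can be viewed as a cycle in a graph where an edge from A to B means "A is paused by B and waits for B to drain". The sketch below uses a depth-first search to find such a cycle; the switch names and both topologies are hypothetical, and this is a way to reason about the problem rather than how any switch detects deadlock.

```python
def find_pause_cycle(waits_on):
    """Return one cycle in the 'is paused by' graph as a list, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in waits_on}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in waits_on.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(color):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Four switches pausing each other in a ring, as in the deadlock scenario.
ring = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
print(find_pause_cycle(ring))

# Back pressure that only flows towards the servers has no cycle.
tree = {"leaf1": ["spine"], "leaf2": ["spine"]}
print(find_pause_cycle(tree))
```

As long as the dependency graph has no cycle, back pressure eventually reaches a server that can stop sending; once a cycle exists, no switch in it can ever drain.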

There are two ways to deal with deadlock: one is to prevent loops in the data center, and the other is to use the deadlock detection function of the network equipment. Once a deadlock has persisted for a set period, the deadlock detection function on the Ruijie RG-S6510-48VS8CQ ignores the received PFC frames and forwards or discards the packets in the buffer (forwarding is the default).

For example, the timer can be configured to run 10 detection rounds, each checking whether a PFC PAUSE frame is received within 10 ms. If a frame is received in all 10 rounds, a deadlock is declared and the default action is applied to the packets in the buffer. After that, a recovery time of 100 ms elapses before normal operation resumes and detection restarts. The command is as follows:

Priority-flow-control deadlock cos-value 5 detects 10 recover 100 //10 detections, 100ms recover.
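The detection logic described above can be modelled in a few lines. This is one possible reading of the description (10 consecutive 10 ms rounds with a PAUSE frame, then a 100 ms recovery period during which PFC frames are ignored), not the switch's actual implementation; the function name and the input format are hypothetical.

```python
def deadlock_events(pause_seen, detect=10, window_ms=10, recover_ms=100):
    """Return the times (ms) at which a deadlock would be declared.

    pause_seen[i] is True if a PFC PAUSE frame arrived during round i.
    """
    events = []
    consecutive = 0
    skip_until = 0  # rounds before this index fall inside a recovery period
    recover_rounds = recover_ms // window_ms
    for i, seen in enumerate(pause_seen):
        if i < skip_until:
            continue  # PFC frames are ignored while recovering
        consecutive = consecutive + 1 if seen else 0
        if consecutive == detect:
            events.append((i + 1) * window_ms)
            consecutive = 0
            skip_until = i + 1 + recover_rounds
    return events


# PAUSE frames in every round for 300 ms: declared at 100 ms, recovery until
# 200 ms, then 10 more rounds lead to a second declaration at 300 ms.
print(deadlock_events([True] * 30))

# A single quiet round resets the counter, so no deadlock is declared.
print(deadlock_events([True] * 9 + [False] + [True] * 9))
```

Requiring several consecutive rounds keeps a short, legitimate burst of PAUSE frames from being mistaken for a deadlock.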

The PFC flow control mechanism is used in RDMA lossless networks to pause the peer's traffic before the switch port buffer overflows, preventing packet loss. However, because back pressure must be applied hop by hop, it is inefficient, so a more efficient, end-to-end flow control capability is needed.

Using ECN to achieve end-to-end congestion control

The current RoCE congestion control relies on ECN (Explicit Congestion Notification). ECN was originally defined in RFC 3168. When network devices detect congestion, they embed a congestion indicator in the IP header and a congestion confirmation in the TCP header.

The RoCEv2 standard defines RoCEv2 Congestion Management (RCM). After ECN is enabled, once a network device detects congestion in RoCEv2 traffic, it marks the ECN field in the IP header of the data packet.

▲ Figure 7 IP packet header ECN field structure

The destination end node interprets this congestion indicator in the same way as the FECN congestion indicator in the BTH (Base Transport Header, which is located in the IB data segment). In other words, when ECN-marked packets reach their destination, a congestion notification is fed back to the source node, and the source node responds by limiting the sending rate of the affected Queue Pairs (QPs).

ECN interaction process

▲ Figure 8 Schematic diagram of the ECN interaction process

1. The IP packet sent by the sender is marked as ECN-capable (ECN field 10).

2. When a switch receives the packet while its queue is congested, it changes the ECN field to 11 and forwards it. Other switches in the network pass the packet through transparently.

3. The receiver receives the packet with ECN 11, learns that congestion has occurred, and processes the packet normally.

4. The receiver generates a congestion notification and sends a CNP (Congestion Notification Packet) every ms. The ECN field of the CNP is 01, and the network must not discard it. The receiver may send a single CNP for multiple ECN-marked packets that belong to the same QP (the format is given below).

5. Switches that receive the CNP forward it normally.

6. After receiving the CNP with ECN 01, the sender parses it and applies the rate-limiting algorithm to the corresponding flow (the ECN-enabled QP).
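The six steps can be walked through with a toy model. The sketch below follows the ECN field values given in the steps (10 for ECN-capable, 11 for congestion, 01 on the CNP); the packet dictionaries, queue depths, threshold, and the halving of the rate are all placeholders chosen for illustration. In particular, the rate cut is not the actual rate-limiting algorithm, and CNP timing is reduced to "one CNP per QP per batch".

```python
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11


def switch_forward(packet, queue_depth, ecn_threshold):
    """Step 2: mark CE when the egress queue is congested and ECN is supported."""
    if queue_depth > ecn_threshold and packet["ecn"] in (ECT0, ECT1):
        return {**packet, "ecn": CE}
    return packet


def receiver_handle(packets):
    """Steps 3-4: accept every packet, emit one CNP per QP that saw CE marks."""
    congested_qps = sorted({p["qp"] for p in packets if p["ecn"] == CE})
    return [{"type": "CNP", "qp": qp, "ecn": ECT1} for qp in congested_qps]


def sender_handle(rates_gbps, cnps, decrease_factor=0.5):
    """Step 6: slow down only the QPs named in CNPs (placeholder rate cut)."""
    slowed = {cnp["qp"] for cnp in cnps}
    return {
        qp: rate * decrease_factor if qp in slowed else rate
        for qp, rate in rates_gbps.items()
    }


# Step 1: the sender marks its packets ECN-capable (10).
sent = [{"qp": 7, "ecn": ECT0}] * 3 + [{"qp": 9, "ecn": ECT0}]

# QP 7 crosses a congested queue; QP 9 does not (hypothetical depths).
arrived = [
    switch_forward(p, queue_depth=200 if p["qp"] == 7 else 10, ecn_threshold=100)
    for p in sent
]
cnps = receiver_handle(arrived)
print(cnps)
print(sender_handle({7: 100.0, 9: 100.0}, cnps))
```

Three marked packets on QP 7 produce a single CNP, and only QP 7 is slowed down while QP 9 keeps its rate.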

The CNP packet format of RoCEv2 is as follows:

● MAC Header

● IPv4/IPv6 Header

● UDP Header

● BTH, with the following field values:
    DestQP: set to the QPN for which the RoCEv2 CNP is generated
    Opcode: set to b'10000001' (0x81)
    PSN: set to 0
    SE: set to 0
    M: set to 0
    P-Key: set to the same value as in the BTH of the ECN-marked packet

● Reserved (16 bytes): MUST be set to 0 by the sender; ignored by the receiver

● ICRC

● FCS

▲ Table 1 CNP message structure

It is worth noting that the CNP, as a congestion control message, is itself subject to delay and packet loss. Every hop and every link between the sender and the receiver adds some delay, which lengthens the time before the sender receives the CNP; meanwhile, congestion at the switch port keeps building. If the sender cannot slow down in time, packet loss may still occur. It is therefore recommended to keep the congestion notification domain reasonably small, so that the ECN control loop does not span too many hops and the sender can reduce its rate in time.

Summary

RDMA networks achieve lossless operation by deploying the PFC and ECN functions in the network. PFC lets us control the traffic of the RDMA-dedicated queues on a link and apply back pressure to the upstream device when congestion occurs at a switch ingress port. ECN gives us end-to-end congestion control: when a switch egress port is congested, the data packets are ECN-marked and the traffic sender is asked to reduce its sending rate.

To get the best forwarding performance from the network, we generally recommend adjusting the ECN and PFC buffer watermarks so that ECN is triggered before PFC; that is, the network keeps forwarding data at full speed while the servers proactively reduce their sending rate. If this does not resolve the congestion, PFC then pauses the upstream switch. The throughput of the network is reduced, but no packets are lost.

RDMA in data center networks requires not only a lossless forwarding plane but also refined operation and maintenance to cope with network environments that are sensitive to latency and packet loss.
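The recommendation that ECN should trigger before PFC comes down to the order of two thresholds on the same queue. The sketch below replays a growing queue depth against both watermarks and reports which one is reached first; the function name, the sample depths, and the threshold values in KB are hypothetical and not tuning advice for any particular switch.

```python
def first_triggers(queue_samples_kb, ecn_threshold_kb, pfc_xoff_kb):
    """Return the sample indexes at which ECN marking and PFC pause first fire."""
    ecn_at = pfc_at = None
    for i, depth in enumerate(queue_samples_kb):
        if ecn_at is None and depth >= ecn_threshold_kb:
            ecn_at = i
        if pfc_at is None and depth >= pfc_xoff_kb:
            pfc_at = i
    return ecn_at, pfc_at


# A queue that grows by 50 KB per sample, from 0 KB to 500 KB.
samples = list(range(0, 501, 50))

# Recommended ordering: the ECN watermark sits below the PFC XOFF watermark.
ecn_at, pfc_at = first_triggers(samples, ecn_threshold_kb=150, pfc_xoff_kb=400)
print(f"ECN first fires at sample {ecn_at}, PFC at sample {pfc_at}")

# Reversed ordering: PFC pauses the link before the sender is ever told to slow down.
print(first_triggers(samples, ecn_threshold_kb=400, pfc_xoff_kb=150))
```

The gap between the two indexes is the time the senders have to react to ECN before PFC has to step in.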
