RDMA over Converged Ethernet
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. RoCE is a link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. Although RoCE benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional, non-converged Ethernet network.
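Because RoCE is a link layer protocol, a RoCE packet is an InfiniBand transport packet carried directly in an Ethernet frame under RoCE's own EtherType, 0x8915, with a 40-byte Global Routing Header (GRH) carrying 16-byte GIDs where IP addresses would otherwise appear. The following Python sketch builds such a header; the field layout follows the InfiniBand GRH format, and all concrete addresses are hypothetical, purely for illustration:

```python
import struct

ROCE_ETHERTYPE = 0x8915  # EtherType assigned to RoCE

def ethernet_header(dst_mac: bytes, src_mac: bytes) -> bytes:
    """14-byte Ethernet II header with the RoCE EtherType."""
    assert len(dst_mac) == 6 and len(src_mac) == 6
    return dst_mac + src_mac + struct.pack("!H", ROCE_ETHERTYPE)

def grh(src_gid: bytes, dst_gid: bytes, payload_len: int) -> bytes:
    """40-byte Global Routing Header: version/traffic class/flow label,
    payload length, next header, hop limit, then the two 16-byte GIDs."""
    assert len(src_gid) == 16 and len(dst_gid) == 16
    ver_tclass_flow = 6 << 28   # version field is 6, matching IPv6
    next_header = 0x1B          # IBA transport follows the GRH
    hop_limit = 64
    return (struct.pack("!IHBB", ver_tclass_flow, payload_len,
                        next_header, hop_limit)
            + src_gid + dst_gid)

# Hypothetical MAC addresses and GIDs, for illustration only.
frame = (ethernet_header(b"\x00\x11\x22\x33\x44\x55",
                         b"\x66\x77\x88\x99\xaa\xbb")
         + grh(b"\x00" * 16, b"\xff" * 16, payload_len=256))
```

Since the GIDs occupy the positions that IPv6 source and destination addresses would, a single Ethernet broadcast domain suffices for delivery: no routing information beyond the MAC addresses is consulted.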
Network-intensive applications such as networked storage or cluster computing need a network infrastructure with high bandwidth and low latency. Compared with other network application programming interfaces such as Berkeley sockets, RDMA offers lower latency, lower CPU load and higher bandwidth. The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol. RoCE HCAs exist with latencies as low as 1.3 microseconds, while the lowest known iWARP HCA latency in 2011 was 3 microseconds.
RoCE versus InfiniBand
RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly cluster-based, onto a common converged Ethernet fabric. Others expected that InfiniBand would keep offering higher bandwidth and lower latency than is possible over Ethernet. Although Ethernet is a more familiar technology to most than InfiniBand, the cost of InfiniBand equipment, especially switches, was predicted in 2009 to be lower than that of 40 Gigabit Ethernet.
RoCE versus iWARP
While the RoCE protocol defines how to perform RDMA over the Ethernet link layer, the iWARP protocol defines how to perform RDMA over a connection-oriented transport such as the Transmission Control Protocol (TCP). Unlike RoCE, iWARP is therefore neither bound to Ethernet nor limited to a single Ethernet broadcast domain. However, the memory requirements of many connections, together with TCP's flow and reliability controls, lead to scalability and performance issues in large-scale datacenters and large-scale applications (large-scale enterprises, cloud computing, Web 2.0 applications, etc.). Also, multicast is defined in the RoCE specification, while the current iWARP specification does not define how to perform multicast RDMA.
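The connection-count problem can be illustrated with simple arithmetic. This is a hedged back-of-envelope sketch, assuming a hypothetical 256 KiB of buffering per TCP connection; the real figure depends on the network stack and its tuning:

```python
# Back-of-envelope sketch of per-connection memory cost for a
# TCP-based transport such as iWARP.  256 KiB per connection is an
# assumed figure for illustration, not a measurement.
BYTES_PER_TCP_CONNECTION = 256 * 1024

def connection_memory_gib(connections: int) -> float:
    """GiB of buffer memory for the given number of reliable connections."""
    return connections * BYTES_PER_TCP_CONNECTION / 2**30

# A full mesh of n hosts requires n*(n-1)/2 connections in total, so
# the connection count, and with it the memory, grows quadratically.
hosts = 1000
total_connections = hosts * (hosts - 1) // 2
```

Under these assumed numbers, a 1000-host full mesh already needs hundreds of thousands of connections and over a hundred GiB of aggregate buffering, which is the kind of host-side cost behind the scalability concern described above.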
Some aspects that could have been defined in the RoCE specification have been left out. These are:
- How to translate between primary RoCE GIDs and Ethernet MAC addresses.
- How to translate between secondary RoCE GIDs and Ethernet MAC addresses. It is not clear whether it is possible to implement secondary GIDs in the RoCE protocol without adding a RoCE-specific address resolution protocol.
- How to implement VLANs for the RoCE protocol. Current implementations store the VLAN ID in the twelfth and thirteenth byte of the sixteen-byte GID, although the RoCE specification does not mention VLANs at all.
- How to translate between RoCE multicast GIDs and Ethernet MAC addresses. Implementations in 2010 used the same address mapping that has been specified for mapping IPv6 multicast addresses to Ethernet MAC addresses.
- How to restrict multicast traffic to a subset of the ports of an Ethernet switch. As of September 2013, an equivalent of the Multicast Listener Discovery protocol has not yet been defined for RoCE.
- At least one vendor that offers an RDMA over Ethernet solution has chosen a wire protocol other than RoCE.
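Two of the de facto conventions listed above can be sketched concretely. The multicast mapping follows the IPv6 convention of RFC 2464: the Ethernet MAC is the fixed prefix 33:33 followed by the last four bytes of the 16-byte address. The VLAN extraction mirrors the GID layout described above; the zero-based byte positions used here are an interpretation of "the twelfth and thirteenth byte," not taken from any specification:

```python
def multicast_gid_to_mac(gid: bytes) -> bytes:
    """Map a 16-byte multicast GID to an Ethernet multicast MAC the way
    RFC 2464 maps IPv6 multicast addresses: prefix 33:33 followed by
    the last four bytes of the address."""
    assert len(gid) == 16
    return b"\x33\x33" + gid[-4:]

def vlan_id_from_gid(gid: bytes) -> int:
    """Read the VLAN ID that implementations store inside the GID.
    The byte positions (11 and 12, zero-based) are an assumed reading
    of 'the twelfth and thirteenth byte'; a VLAN ID is 12 bits, so
    only the low 12 bits are significant."""
    assert len(gid) == 16
    return ((gid[11] << 8) | gid[12]) & 0x0FFF
```

Because every multicast GID collapses onto only four distinguishing MAC bytes, this mapping is many-to-one, which is one reason switch-level multicast filtering (the missing Multicast Listener Discovery equivalent noted above) matters.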
See also
- Data center bridging (DCB), sometimes called Converged Ethernet or Converged Enhanced Ethernet
- Remote direct memory access (RDMA)