Introduction
On any real network fabric, packets can be dropped or delayed by momentary congestion and hot spots. This happens quite frequently in real forwarding mechanisms. As a result, sender/receiver pairs can take a long time (hundreds or even thousands of microseconds) to realize that a packet needs to be sent again. In that time, a CPU could have executed a quarter of a million instructions.
TCP is a general-purpose transport protocol, but it is not particularly well optimized for HPC applications. TCP runs in the OS kernel, which makes it well suited to sharing network connections among all the processes running on a computer. But sharing requires overhead, which introduces latency and variability and is too expensive when microseconds matter and consistent latency is paramount.
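To put the instruction count in perspective, here is a rough back-of-the-envelope calculation. The clock speed and instructions-per-cycle figures are illustrative assumptions, not values from the source; real CPUs vary widely.

```python
def instructions_during(delay_us, clock_mhz, instructions_per_cycle):
    """Estimate how many instructions one core retires while waiting delay_us microseconds.

    clock_mhz and instructions_per_cycle are assumed, illustrative values.
    One MHz is one cycle per microsecond, so cycles = delay_us * clock_mhz.
    """
    cycles = delay_us * clock_mhz
    return round(cycles * instructions_per_cycle)


# Assumed example: a 100-microsecond wait on a 2.5 GHz core retiring 1 instruction per cycle.
print(instructions_during(100, 2500, 1.0))
```

Under these assumptions, a 100-microsecond wait already corresponds to about a quarter of a million instructions; a wait of thousands of microseconds costs proportionally more.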
What is the Incast condition?
Incast is a condition in which many servers try to send a lot of data to a single server at once, effectively overwhelming it. It can happen during the synchronization phase shown in the image below: the surge of traffic overloads the receiver, and packets are dropped. TCP handles general networking conditions quite well, but it is not particularly well suited to the challenges that HPC applications introduce.
Figure: Many servers send data to a single server with the same bandwidth
(AWS re:Invent 2022)
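The Incast effect described above can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real switch or NIC: the function name, the fixed-size receive queue, and the one-packet-per-tick sending rate are all assumptions made for illustration, and retransmissions are not modeled.

```python
def simulate_incast(num_senders, packets_per_sender, buffer_slots, drain_per_tick):
    """Toy model: every sender pushes one packet per tick into one shared receive queue.

    Returns (delivered, dropped). Dropped packets are simply counted, not resent.
    """
    remaining = [packets_per_sender] * num_senders
    queue = 0
    delivered = 0
    dropped = 0
    while any(remaining) or queue:
        # All senders that still have data transmit at the same time.
        for i in range(num_senders):
            if remaining[i] > 0:
                remaining[i] -= 1
                if queue < buffer_slots:
                    queue += 1
                else:
                    dropped += 1  # receive queue is full, packet is lost
        # The receiver drains a fixed number of packets per tick.
        served = min(queue, drain_per_tick)
        queue -= served
        delivered += served
    return delivered, dropped


print(simulate_incast(1, 10, 4, 1))  # a single sender never overflows the queue
print(simulate_incast(8, 10, 4, 1))  # eight simultaneous senders overflow it
```

With one sender the receiver keeps up and nothing is dropped. With eight senders transmitting in the same ticks, most packets arrive at a full queue and are lost, which is the situation a transport then has to detect and recover from.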
Why SRD?
AWS developed its own reliable datagram protocol, called Scalable Reliable Datagram (SRD). Much of its design was influenced by what AWS observed in operating other technologies, including InfiniBand and Ethernet.
First, SRD is an Ethernet-based transport. Ethernet gives AWS such a wide range of control over the results that it does not want to give it up (not easily, anyhow).
Second, and perhaps most importantly, SRD relaxes the requirement for in-order packet delivery, on the understanding that ordering can be re-imposed in the higher layers of the stack if required.
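To illustrate how ordering can be restored above a transport that delivers packets out of order, here is a minimal reorder-buffer sketch. It assumes each packet carries a sequence number starting at 0; the function name and packet format are hypothetical and do not reflect how SRD or EFA software actually represents messages.

```python
def reorder(packets):
    """Yield payloads in sequence order from (seq, payload) pairs arriving in any order.

    Assumes sequence numbers start at 0 and each one arrives exactly once.
    """
    pending = {}   # packets that arrived ahead of their turn
    next_seq = 0   # next sequence number the application expects
    for seq, payload in packets:
        pending[seq] = payload
        # Release every packet that is now contiguous with what was already delivered.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


arrivals = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
print(list(reorder(arrivals)))
```

The transport is free to deliver packets as they arrive, and only applications that need ordering pay the cost of buffering.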
In the example below, the fastest sender on the TCP side becomes a slower sender on the EFA side under Incast when SRD is used. The comparison shows the improvement in latency variability of the new EFA with SRD over existing TCP usage.
(AWS re:Invent 2022)
SRD quickly detects the Incast condition and promptly adjusts so that each stream gets its fair share.
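The source does not describe how SRD computes a fair share. As a generic illustration of the idea, the sketch below uses classic max-min fair allocation, a standard technique swapped in here for illustration only; the function name and the bandwidth units are assumptions.

```python
def max_min_fair_share(capacity, demands):
    """Split receiver capacity among streams using max-min fairness.

    Streams that want less than an equal share get what they ask for;
    the leftover capacity is divided equally among the rest.
    """
    allocation = [0.0] * len(demands)
    # Serve the smallest demands first.
    unsatisfied = sorted(range(len(demands)), key=lambda i: demands[i])
    remaining = float(capacity)
    while unsatisfied:
        share = remaining / len(unsatisfied)
        i = unsatisfied[0]
        if demands[i] <= share:
            # This stream fits within an equal share, so grant it in full.
            allocation[i] = demands[i]
            remaining -= demands[i]
            unsatisfied.pop(0)
        else:
            # Every remaining stream wants more than an equal share.
            for j in unsatisfied:
                allocation[j] = share
            break
    return allocation


print(max_min_fair_share(100, [100, 100, 100, 100]))
print(max_min_fair_share(100, [10, 100, 100]))
```

When four senders each try to use the receiver's full bandwidth, each ends up with a quarter of it; a sender that needs little keeps its small allocation and the others split the rest.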
Figure: SRD setup that leverages Nitro technology without going through the TCP layer
(AWS re:Invent 2022)
Conclusion
HPC/ML applications can be run at scale on the AWS public cloud thanks to the Elastic Fabric Adapter (EFA). It offers consistently low latency, with tail latency orders of magnitude lower than that of TCP. This is made possible by SRD's innovative network transport semantics and an unconventional division of functionality between the network interface card and several layers of host software. By running SRD multipath load balancing and congestion control on the Nitro card, it reduces the chance of packet drops in the network and enables quicker recovery from drops.
This new protocol, designed by AWS for its own data center network, offers the following advantages:
- Network-aware multipath routing
- Guaranteed delivery
- Significantly reduced tail latency
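To show why spreading packets over many paths helps avoid hot spots, the sketch below compares two simplified strategies: pinning each flow to one path, and spraying packets across all paths. Both functions are hypothetical illustrations. The modulo "hash" and the round-robin spraying are assumptions, and the source does not describe SRD's actual path selection.

```python
def pinned_loads(flows, num_paths):
    """Each flow is pinned to a single path chosen by a simple hash (flow_id modulo num_paths)."""
    loads = [0] * num_paths
    for flow_id, packets in flows.items():
        loads[flow_id % num_paths] += packets
    return loads


def sprayed_loads(flows, num_paths):
    """Packets from all flows are spread round-robin over every available path."""
    loads = [0] * num_paths
    next_path = 0
    for packets in flows.values():
        for _ in range(packets):
            loads[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return loads


# Four flows of 100 packets each; three of them happen to hash to path 0.
flows = {0: 100, 4: 100, 8: 100, 1: 100}
print(pinned_loads(flows, 4))
print(sprayed_loads(flows, 4))
```

In the pinned case, an unlucky hash collision piles three flows onto one path while two paths sit idle. Spraying evens out the load, at the cost of packets arriving out of order, which is why relaxed ordering and multipath go together.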
WRITTEN BY Niravbhai Desai