Read how we designed and deployed Multipath Reliable Connection (MRC), a multi-company step towards Ultra Ethernet, at 800G in Microsoft's datacenters.
Demonstrated with a 75k GPU pretraining job running stably through multiple faults.
OpenAI blog: buff.ly/X0KL4Xy
Paper: buff.ly/I3OcXWn
OpenAI introduces MRC (Multipath Reliable Connection), a new supercomputer networking protocol released via OCP to improve resilience and performance in large-scale AI training clusters.