//
sign in
Post
by @danabra.mov
PostEmbed
by @danabra.mov
Record
by @jimpick.com
Record
by @atsui.org
+ new component
Post
Read how we designed and deployed Multipath Reliable Connection (MRC), a multi-company step towards Ultra Ethernet, at 800G in Microsoft's datacenters. Demonstrated with a 75k GPU pretraining job running stably through multiple faults. OpenAI blog: buff.ly/X0KL4Xy Paper: buff.ly/I3OcXWn
1mo
OpenAI introduces MRC (Multipath Reliable Connection), a new supercomputer networking protocol released via OCP to improve resilience and performance in large-scale AI training clusters.
buff.ly
Supercomputer networking to accelerate large scale AI training
Torsten Hoefler 🇨🇭