Journal articles

High-Quality Fault Resiliency in Fat Trees

Abstract: Coupling regular topologies with optimised routing algorithms is key to pushing the performance of supercomputer interconnection networks. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalised Fat-Trees (PGFTs) which minimises congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula that relies on a fast preprocessing phase. This allows complete re-routing of networks with tens of thousands of nodes in less than a second. In turn, this helps centralised fabric management react to faults with high-quality routing tables and no impact on running applications in current and future very large-scale HPC clusters.

https://hal-universite-paris-saclay.archives-ouvertes.fr/hal-03861715
Contributor: John Gliksberg
Submitted on: Monday, November 21, 2022 - 5:55:25 PM
Last modification on: Wednesday, November 23, 2022 - 5:41:06 PM

Files

article.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Citation

John Gliksberg, Antoine Capra, Alexandre Louvet, Pedro Javier Garcia, Devan Sohier. High-Quality Fault Resiliency in Fat Trees. IEEE Micro, 2020, 40 (1), pp.44-49. ⟨10.1109/MM.2019.2949978⟩. ⟨hal-03861715⟩
