High-Quality Fault Resiliency in Fat Trees - Université Paris-Saclay
Journal Articles, IEEE Micro, Year: 2020

High-Quality Fault Resiliency in Fat Trees

Abstract

Coupling regular topologies with optimised routing algorithms is key to pushing the performance of supercomputer interconnection networks. In this paper, we present Dmodc, a fast deterministic routing algorithm for Parallel Generalised Fat-Trees (PGFTs) that minimises congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete re-routing of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralised fabric management react to faults with high-quality routing tables and no impact on running applications in current and future very large-scale HPC clusters.
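The abstract does not give the Dmodc formula itself. As a rough illustration of what computing forwarding tables from a closed-form arithmetic rule can look like, the sketch below uses a simplified destination-modulo rule in the spirit of classic D-mod-k fat-tree routing. It is not the paper's algorithm: it assumes a fault-free tree, and the names `up_port`, `forwarding_table`, and the `up_ports` parameter are hypothetical.

```python
from math import prod
from typing import List, Sequence


def up_port(dest: int, level: int, up_ports: Sequence[int]) -> int:
    """Pick the up-port index used to reach `dest` from a switch at `level`.

    Assumption (not from the paper): level 0 is the leaf-switch level and
    up_ports[l] is the number of up ports on every level-l switch.
    """
    # Divide out the choices already made at lower levels, then spread
    # destinations round-robin over this level's up ports.
    divisor = prod(up_ports[:level])  # equals 1 when level == 0
    return (dest // divisor) % up_ports[level]


def forwarding_table(level: int, up_ports: Sequence[int], n_dests: int) -> List[int]:
    """Up-port choice for every destination ID in 0..n_dests-1."""
    return [up_port(d, level, up_ports) for d in range(n_dests)]


if __name__ == "__main__":
    # Hypothetical two-level tree with 2 up ports per switch at each level.
    print(forwarding_table(0, [2, 2], 8))
    print(forwarding_table(1, [2, 2], 8))
```

Because each entry is a pure function of the destination ID and the switch level, a table can be rebuilt without any path search. According to the abstract, Dmodc adds a fast preprocessing phase so that a closed-form rule still yields low-congestion tables when equipment has failed; that part is not modelled here.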
Origin: Files produced by the author(s)

Dates and versions

hal-03861715, version 1 (21-11-2022)

Licence

Attribution

Cite

John Gliksberg, Antoine Capra, Alexandre Louvet, Pedro Javier Garcia, Devan Sohier. High-Quality Fault Resiliency in Fat Trees. IEEE Micro, 2020, 40 (1), pp.44-49. ⟨10.1109/MM.2019.2949978⟩. ⟨hal-03861715⟩