High-performance Tensor Contractions for GPUs

Ahmad Abdelfattah; Marc Baboulin; Veselin Dobrev; Jack J Dongarra; Christopher Earl; Joël Falcou; Azzam Haidar; Ian Karlin; Tzanio V Kolev; Ian Masliah; Stanimire Tomov

doi:10.1016/j.procs.2016.05.302

Conference paper, Year: 2016


Ahmad Abdelfattah
Role: Author
Innovative Computing Laboratory [Knoxville]

Marc Baboulin
Role: Author
PersonId: 16585
IdHAL: marc-baboulin
IdRef: 105979163
Systèmes parallèles (LRI)

Veselin Dobrev
Role: Author
Lawrence Livermore National Laboratory

Jack J Dongarra
Role: Author
Innovative Computing Laboratory [Knoxville]

Christopher Earl
Role: Author
Lawrence Livermore National Laboratory

Joël Falcou
Role: Author
Systèmes parallèles (LRI)

Azzam Haidar
Role: Author
Innovative Computing Laboratory [Knoxville]

Ian Karlin
Role: Author
Lawrence Livermore National Laboratory

Tzanio V Kolev
Role: Author
Lawrence Livermore National Laboratory

Ian Masliah
Role: Author
Systèmes parallèles (LRI)

Stanimire Tomov
Role: Author
Innovative Computing Laboratory [Knoxville]

Abstract

We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain with existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, by using our framework to batch contractions and by exploiting application specifics, we demonstrate close-to-peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which are the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This representation is a key factor in achieving a many-fold algorithmic acceleration (compared with not using it), because data loaded into fast memory can be reused. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations that achieve 90+% of a theoretically derived peak on GPUs. For example, on a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, we are 2.8× faster than CUBLAS and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60 GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
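To make the idea of "index reordering plus GEMM" concrete, here is a minimal sketch in plain Python. It is not the authors' framework or interface: the chosen contraction C[i][a][b] = sum_k A[i][k] * B[a][k][b], the function names, and the nested-list storage are all assumptions made for illustration. The sketch reorders B so that the contracted index k comes first, flattens the free indices (a, b) into one column index, performs a single matrix-matrix product, and then unflattens the result. The last function only shows what batching means (one small contraction applied to many independent tensors); a GPU implementation would process the batch inside a kernel rather than in a host loop.

```python
# Sketch (illustrative assumptions throughout): the contraction
#   C[i][a][b] = sum_k A[i][k] * B[a][k][b]
# written as an index reordering of B followed by one GEMM.

def contract_direct(A, B):
    """Reference version: evaluate the contraction with explicit loops."""
    ni, nk = len(A), len(A[0])
    na, nb = len(B), len(B[0][0])
    return [[[sum(A[i][k] * B[a][k][b] for k in range(nk))
              for b in range(nb)]
             for a in range(na)]
            for i in range(ni)]

def gemm(X, Y):
    """Plain matrix-matrix product of X (m x k) and Y (k x n)."""
    m, kk, n = len(X), len(Y), len(Y[0])
    return [[sum(X[r][t] * Y[t][c] for t in range(kk)) for c in range(n)]
            for r in range(m)]

def contract_via_gemm(A, B):
    """Reorder B[a][k][b] into Bmat[k][a*nb + b], then call gemm once."""
    na, nk, nb = len(B), len(B[0]), len(B[0][0])
    # Index reordering: bring the contracted index k to the front and
    # flatten the free indices (a, b) into a single column index.
    Bmat = [[B[a][k][b] for a in range(na) for b in range(nb)]
            for k in range(nk)]
    Cmat = gemm(A, Bmat)  # shape: ni x (na * nb)
    # Unflatten the column index back into (a, b).
    return [[[row[a * nb + b] for b in range(nb)] for a in range(na)]
            for row in Cmat]

def batched_contract(A, Bs):
    """Apply the same small contraction to a batch of independent tensors."""
    return [contract_via_gemm(A, B) for B in Bs]
```

With integer inputs, contract_via_gemm(A, B) and contract_direct(A, B) return identical nested lists, which gives a quick check that the reordering preserves the contraction. The practical benefit of the GEMM form is that the reordered operand can be loaded into fast memory once and reused across the whole product.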

Keywords

Applications; FEM; Batched linear algebra; GPU; Tensor HPC; Tensor contractions

Domains

Numerical Analysis [cs.NA]

Main file

iccs2016.pdf (2.36 MB)

Origin: Files produced by the author(s)

Contributor: Marc Baboulin

https://hal.science/hal-01409251

Submitted on: Monday, December 5, 2016, 18:04:35

Last modified on: Tuesday, February 13, 2024, 03:25:12

Long-term archiving on: Tuesday, March 21, 2017, 01:15:53

Dates and versions

hal-01409251, version 1 (05-12-2016)

Identifiers

HAL Id: hal-01409251, version 1
DOI: 10.1016/j.procs.2016.05.302

Cite

Ahmad Abdelfattah, Marc Baboulin, Veselin Dobrev, Jack J Dongarra, Christopher Earl, et al. High-performance Tensor Contractions for GPUs. International Conference on Computational Science 2016 (ICCS 2016), Jun 2016, San Diego, CA, United States. pp.108-118, ⟨10.1016/j.procs.2016.05.302⟩. ⟨hal-01409251⟩


Collections

CNRS UMR8623 LRI-PARSYS UNIV-PARIS-SACLAY LISN GS-ENGINEERING GS-COMPUTER-SCIENCE LISN-PARSYS
