Book chapter, 2023

Vision and Multi-modal Transformers

Abstract

Transformers, which rely on the self-attention mechanism to capture global dependencies, have dominated natural language modelling, and their use in other domains, e.g. speech processing, has shown great potential. The impressive results obtained in these domains have led computer vision researchers to apply Transformers to visual data. However, applying an architecture designed for sequential data to data represented as 2-D matrices is not straightforward. This chapter presents how Transformers were introduced into the domain of vision processing, challenging the historical approaches based on Convolutional Neural Networks. After a brief reminder of historical methods in computer vision, namely convolution and self-attention, the chapter focuses on the modifications introduced into the Transformer architecture to deal with the peculiarities of visual data, using two different strategies. In the last part, recent work applying the Transformer architecture in a multimodal context is also presented.
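To illustrate how a sequence-oriented Transformer can ingest a 2-D image, the sketch below shows the patch-based strategy popularised by ViT: the image is split into non-overlapping patches, each flattened patch is linearly projected, and positional information is added. This is a minimal, illustrative sketch and not code from the chapter; the function name, patch size, and embedding dimension are assumptions, and random matrices stand in for learned parameters.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and project each
    flattened patch to an embedding vector (illustrative names and values)."""
    rng = rng or np.random.default_rng(0)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of patches, then flatten each patch into one vector.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)
    # Linear projection of each patch (random weights stand in for a learned layer).
    projection = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    tokens = patches @ projection
    # Random values stand in for learned positional embeddings, which preserve spatial order.
    tokens += rng.standard_normal(tokens.shape) * 0.02
    return tokens  # shape: (num_patches, embed_dim), a sequence a Transformer encoder can process

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768) for a 224x224 RGB image with 16x16 patches
```

The resulting token sequence plays the same role as word embeddings in language modelling, which is what allows the standard self-attention layers to be reused on visual data.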
Main file: Vision_and_Multi-modal_Transformers.pdf (4.69 MB). Origin: files produced by the author(s).

Dates and versions

hal-04413851 , version 1 (26-01-2024)

Identifiers

Cite

Camille Guinaudeau. Vision and Multi-modal Transformers. Mohamed Chetouani; Virginia Dignum; Paul Lukowicz; Carles Sierra. Human-Centered Artificial Intelligence. Advanced Lectures, 13500, Springer, pp.106 - 122, 2023, Lecture Notes in Computer Science, 978-3-031-24348-6. ⟨10.1007/978-3-031-24349-3_7⟩. ⟨hal-04413851⟩