Book chapter, 2023

Vision and Multi-modal Transformers

Abstract

Transformers, which rely on the self-attention mechanism to capture global dependencies, have dominated natural language modelling, and their use in other domains, e.g. speech processing, has shown great potential. The impressive results obtained in these domains have led computer vision researchers to apply Transformers to visual data. However, applying an architecture designed for sequential data to data represented as 2-D matrices is not straightforward. This chapter presents how Transformers were introduced into the domain of vision processing, challenging the historical approaches based on Convolutional Neural Networks. After a brief reminder of historical methods in computer vision, namely convolution and self-attention, the chapter focuses on the modifications introduced into the Transformer architecture to deal with the peculiarities of visual data, using two different strategies. In the last part, recent work applying the Transformer architecture in a multimodal context is also presented.
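To illustrate how a sequence-oriented Transformer can ingest a 2-D image, the sketch below shows the patch-based strategy popularised by ViT: the image is split into non-overlapping patches, each flattened patch is linearly projected, and positional information is added. This is a minimal, illustrative sketch and not code from the chapter; the function name, patch size, and embedding dimension are assumptions, and random matrices stand in for learned parameters.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and project each
    flattened patch to an embedding vector (illustrative names and values)."""
    rng = rng or np.random.default_rng(0)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of patches, then flatten each patch into one vector.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)
    # Linear projection of each patch (random weights stand in for a learned layer).
    projection = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    tokens = patches @ projection
    # Random values stand in for learned positional embeddings, which preserve spatial order.
    tokens += rng.standard_normal(tokens.shape) * 0.02
    return tokens  # shape: (num_patches, embed_dim), a sequence a Transformer encoder can process

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768) for a 224x224 RGB image with 16x16 patches
```

The resulting token sequence plays the same role as word embeddings in language modelling, which is what allows the standard self-attention layers to be reused on visual data.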
Main file: Vision_and_Multi-modal_Transformers.pdf (4.69 MB). Origin: files produced by the author(s).

Dates and versions

hal-04413851 , version 1 (26-01-2024)

Identifiers

Cite

Camille Guinaudeau. Vision and Multi-modal Transformers. Mohamed Chetouani; Virginia Dignum; Paul Lukowicz; Carles Sierra. Human-Centered Artificial Intelligence. Advanced Lectures, 13500, Springer, pp.106 - 122, 2023, Lecture Notes in Computer Science, 978-3-031-24348-6. ⟨10.1007/978-3-031-24349-3_7⟩. ⟨hal-04413851⟩