GAN-based data augmentation for transcriptomics: survey and comparative assessment - Laboratoire Interdisciplinaire des Sciences du Numérique
Conference paper, Year: 2023

GAN-based data augmentation for transcriptomics: survey and comparative assessment

Abstract

Motivation: Transcriptomics data are becoming more accessible thanks to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enlarging the training set, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.

Results: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields an accuracy of 94% for binary classification and 70% for tissue classification. In comparison, we achieved accuracies of 98% and 94%, respectively, when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.

Availability and implementation: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics
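The augmentation protocol summarized in the abstract (a small labeled training set enlarged with generated samples before a classifier is trained) can be illustrated with a minimal sketch. This is not the authors' implementation: the per-class Gaussian sampler below is only a stand-in for a trained conditional GAN, the data are random toy values rather than TCGA expression profiles, and all function names are hypothetical.

```python
import random
import statistics


def make_toy_samples(n_per_class, n_genes, rng):
    """Toy stand-in for labeled expression profiles (NOT real RNA-seq data)."""
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            # Class 1 is shifted by 0.5 on every "gene" so the classes differ.
            samples.append([rng.gauss(0.5 * label, 1.0) for _ in range(n_genes)])
            labels.append(label)
    return samples, labels


def fit_stand_in_generator(samples, labels):
    """Estimate per-class, per-gene mean and standard deviation.

    Assumption: this replaces the trained conditional GAN for illustration only.
    """
    params = {}
    for label in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == label]
        columns = list(zip(*rows))
        params[label] = [(statistics.fmean(c), statistics.stdev(c)) for c in columns]
    return params


def generate(params, n_per_class, rng):
    """Draw labeled synthetic samples from the stand-in generator."""
    samples, labels = [], []
    for label, gene_params in params.items():
        for _ in range(n_per_class):
            samples.append([rng.gauss(mu, sd) for mu, sd in gene_params])
            labels.append(label)
    return samples, labels


def augment(real_x, real_y, n_synthetic_per_class, rng):
    """Return the real training set followed by generated samples."""
    params = fit_stand_in_generator(real_x, real_y)
    syn_x, syn_y = generate(params, n_synthetic_per_class, rng)
    return real_x + syn_x, real_y + syn_y


rng = random.Random(0)
real_x, real_y = make_toy_samples(n_per_class=25, n_genes=20, rng=rng)  # 50 samples
aug_x, aug_y = augment(real_x, real_y, n_synthetic_per_class=500, rng=rng)  # + 1000
print(len(real_x), len(aug_x))  # prints "50 1050"
```

A downstream classifier would then be trained on `aug_x` and `aug_y` instead of `real_x` and `real_y`, and evaluated on held-out real samples only, so that generated data never leak into the test set.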
Main file
main.pdf (611.08 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-04431920, version 1 (1 February 2024)

Identifiers

HAL Id: hal-04431920
DOI: 10.1093/bioinformatics/btad239

Cite

Alice Lacan, Michèle Sebag, Blaise Hanczar. GAN-based data augmentation for transcriptomics: survey and comparative assessment. 31st Intelligent Systems for Molecular Biology (ISMB 2023), Jul 2023, Lyon, France. pp.i111-i120, ⟨10.1093/bioinformatics/btad239⟩. ⟨hal-04431920⟩