Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain

Tomohiro Nishiyama; Lisa Raithel; Roland Roller; Pierre Zweigenbaum; Eiji Aramaki

Conference paper. Year: 2024

Tomohiro Nishiyama

Role: Author
PersonId : 1343001

Nara Institute of Science and Technology - Graduate School of Information Science

Lisa Raithel

Role: Author
PersonId : 1192128

Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence

Sciences et Technologies des Langues - LISN

Roland Roller

Role: Author
PersonId : 1343006

Deutsches Forschungszentrum für Künstliche Intelligenz GmbH = German Research Center for Artificial Intelligence

Pierre Zweigenbaum

Role: Author
PersonId : 14995
IdHAL : pierre-zweigenbaum
ORCID : 0000-0001-8410-4808
IdRef : 06664268X

Sciences et Technologies des Langues - LISN

Eiji Aramaki

Role: Author
PersonId : 1343011

Nara Institute of Science and Technology - Graduate School of Information Science

Abstract

Since medical text cannot be shared easily due to privacy concerns, synthetic data bears much potential for natural language processing applications. In the context of social media and user-generated messages about drug intake and adverse drug effects, this work presents different methods to examine the authenticity of synthetic text. We conclude that the generated tweets are untraceable and show enough authenticity from the medical point of view to be used as a replacement for a real Twitter corpus. However, original data might still be the preferred choice as they contain much more diversity.

Keywords

Natural Language Processing; Natural Language Generation; Biomedical domain; Adverse Drug Effects

Domains

Computation and Language [cs.CL]; Artificial Intelligence [cs.AI]; Document and Text Processing

Main file

Nishiyama_CALDPSEUDO2024.pdf (658.25 KB)

Origin: Publisher files allowed on an open archive
License: CC BY - Attribution

https://hal.science/hal-04528240

Submitted on: Monday, April 1, 2024, 00:37:14

Last modified on: Friday, April 5, 2024, 03:24:06

Dates and versions

hal-04528240, version 1 (01-04-2024)

License

Attribution

Identifiers

HAL Id: hal-04528240, version 1

Cite

Tomohiro Nishiyama, Lisa Raithel, Roland Roller, Pierre Zweigenbaum, Eiji Aramaki. Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain. Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo), Mar 2024, St. Julian’s, Malta. pp.8-17. ⟨hal-04528240⟩


Collections

CNRS INRIA CENTRALESUPELEC UNIV-PARIS-SACLAY ANR LISN GS-COMPUTER-SCIENCE
