MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents - Laboratoire Interdisciplinaire des Sciences du Numérique
Conference paper, Year: 2022

MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents

Montse Cuadros
  • Role: Author
  • PersonId: 962765
Aitor García-Pablos
  • Role: Author
  • PersonId: 1194603
Lucie Gianola
Cyril Grouin
Manuel Herranz
  • Role: Author
  • PersonId: 1087895
Patrick Paroubek

Abstract

This paper presents the outcomes of the MAPA project: a set of annotated corpora for 24 languages of the European Union and an open-source, customisable toolkit able to detect and substitute sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. In the context of the project, the toolkit has been developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.
Main file
Arranz_LEGAL2022.pdf (834.66 KB)
Origin: Publisher files allowed on an open archive
Licence: CC BY NC - Attribution - NonCommercial

Dates and versions

hal-03873042, version 1 (26-11-2022)

Licence

Attribution - NonCommercial

Identifiers

  • HAL Id: hal-03873042, version 1

Cite

Victoria Arranz, Khalid Choukri, Montse Cuadros, Aitor García-Pablos, Lucie Gianola, et al. MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents. Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Language Resources (LEGAL - MDLR 2022), Jun 2022, Marseille, France. pp. 64-72. ⟨hal-03873042⟩