Bazinga! A Dataset for Multi-Party Dialogues Structuring

Paul Lerner 1 Juliette Bergoënd 1 Camille Guinaudeau 1, 2 Hervé Bredin 3 Benjamin Maurice 1 Sharleyne Lefevre 1 Martin Bouteiller 1 Aman Berhe 1 Léo Galmant 1, 4 Ruiqing Yin 1, 4 Claude Barras 5 
2 TLP - Traitement du Langage Parlé
LISN - Laboratoire Interdisciplinaire des Sciences du Numérique, STL - Sciences et Technologies des Langues
4 ILES - Information, Langue Ecrite et Signée
LISN - Laboratoire Interdisciplinaire des Sciences du Numérique, STL - Sciences et Technologies des Langues
Abstract : We introduce a dataset built around a large collection of TV (and movie) series. These are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base, which makes it possible to collect metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide a baseline for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of these tasks and of transfer learning from models trained on mono-speaker audio or written text, which is more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a large part of Bazinga! is only partially annotated, we also expect this dataset to foster research on self- or weakly-supervised learning methods.
Document type :
Conference papers

https://hal-universite-paris-saclay.archives-ouvertes.fr/hal-03737453
Contributor : Paul Lerner
Submitted on : Wednesday, July 27, 2022 - 2:50:10 PM
Last modification on : Friday, August 5, 2022 - 9:27:32 AM

File

2022.lrec-1.367.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id : hal-03737453, version 1

Citation

Paul Lerner, Juliette Bergoënd, Camille Guinaudeau, Hervé Bredin, Benjamin Maurice, et al. Bazinga! A Dataset for Multi-Party Dialogues Structuring. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Jun 2022, Marseille, France. pp. 3434-3441. ⟨hal-03737453⟩
