Bazinga! A Dataset for Multi-Party Dialogues Structuring

Claude Barras

Abstract

We introduce a dataset built around a large collection of TV (and movie) series. These are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base, which allows the collection of metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide baselines for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of these tasks and of transfer learning from models trained on mono-speaker audio or written text, which are more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a large part of Bazinga! is only partially annotated, we also expect this dataset to foster research on self- or weakly-supervised learning methods.
Main file
2022.lrec-1.367.pdf (1.58 MB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-03737453, version 1 (27-07-2022)

Licence

Attribution - CC BY 4.0

Identifiers

  • HAL Id: hal-03737453, version 1

Cite

Paul Lerner, Juliette Bergoënd, Camille Guinaudeau, Hervé Bredin, Benjamin Maurice, et al. Bazinga! A Dataset for Multi-Party Dialogues Structuring. 13th Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association (ELRA), Jun 2022, Marseille, France. pp. 3434-3441. ⟨hal-03737453⟩