Bazinga! A Dataset for Multi-Party Dialogues Structuring

We introduce a dataset built around a large collection of TV (and movie) series. Those are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base that allows the collection of metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide a baseline for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of the tasks and of transfer learning from models trained on mono-speaker audio or written text, which is more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because (a large) part of Bazinga! is only partially annotated, we also expect this dataset to foster research towards self- or weakly-supervised learning methods.

Keywords

Multi-Party Dialogues, Speaker Diarization, Entity Linking

Domains

Multimedia [cs.MM]

Main file

2022.lrec-1.367.pdf (1.58 MB)

Origin: Publisher files allowed on an open archive

Contributor: Paul Lerner

https://universite-paris-saclay.hal.science/hal-03737453

Submitted on: Wednesday, July 27, 2022, 14:50:10

Last modified on: Tuesday, February 6, 2024, 14:40:08

Long-term archiving on: Friday, October 28, 2022, 18:07:02

Dates and versions

hal-03737453, version 1 (27-07-2022)

License

Attribution

Identifiers

HAL Id: hal-03737453, version 1

Cite

Paul Lerner, Juliette Bergoënd, Camille Guinaudeau, Hervé Bredin, Benjamin Maurice, et al. Bazinga! A Dataset for Multi-Party Dialogues Structuring. 13th Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association (ELRA), Jun 2022, Marseille, France. pp. 3434-3441. ⟨hal-03737453⟩


