GLADIS: A General and Large Acronym Disambiguation Benchmark

Acronym Disambiguation (AD) is crucial for natural language understanding across various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and prior benchmarks are rather small. To accelerate research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and value of our new benchmark.

Keywords

Acronym Disambiguation; Entity Linking; Benchmark

Domains

Computer Science [cs] / Computation and Language [cs.CL]

Main file

2302.01860.pdf (694.67 KB)

Origin: Files produced by the author(s)

Contributor: Lihu Chen

https://hal.science/hal-04039173

Submitted on: Tuesday, March 21, 2023, 12:43:49 PM

Last modification on: Wednesday, May 15, 2024, 1:20:04 PM

Dates and versions

hal-04039173, version 1 (21-03-2023)

Licence

Attribution - NonCommercial - ShareAlike

Identifiers

HAL Id: hal-04039173, version 1

Cite

Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek. GLADIS: A General and Large Acronym Disambiguation Benchmark. EACL 2023 - The 17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023, Dubrovnik, Croatia. ⟨hal-04039173⟩

Collections

CEA INSTITUT-TELECOM INRIA INRIA2 CEA-UPSAY UNIV-PARIS-SACLAY JOLIOT CEA-DRF NEUROSPIN LTCI INFRES DIG IP_PARIS ANR GS-COMPUTER-SCIENCE GS-LIFE-SCIENCES-HEALTH
