DIALECTBENCH
An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages.
DIALECTBENCH is the first large-scale benchmark for NLP on language varieties, aggregating datasets for 281 varieties across 10 text-level tasks.
Data and Language Varieties Selection
To find usable language resources, we surveyed papers published in the ACL Anthology over the last 10 years, as well as commonly used online data repositories. We selected languages that have well-established, high-resourced varieties. Varieties may differ by location, ethnicity, or other factors.
Cluster-Variety Mapping
We construct several language clusters, each comprising both high-resourced dialects and their low-resourced counterparts. We use the Glottolog language database to define the clusters and to assign varieties to them.
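The idea of assigning varieties to clusters can be illustrated with a minimal sketch. It assumes the genealogical information is available as simple child-to-parent links and that a set of nodes has been chosen as cluster roots; the names `PARENT`, `CLUSTER_ROOTS`, `find_cluster`, and `group_by_cluster`, as well as all node names, are hypothetical placeholders and are not taken from Glottolog or from the DIALECTBENCH code.

```python
from collections import defaultdict

# Toy child -> parent links standing in for a genealogical tree.
# All names are placeholders (assumption), not real Glottolog entries.
PARENT = {
    "cluster_a": None,
    "cluster_b": None,
    "a_standard": "cluster_a",
    "a_regional": "cluster_a",
    "a_regional_north": "a_regional",
    "b_standard": "cluster_b",
}

# Nodes we choose to treat as cluster roots (assumption for illustration).
CLUSTER_ROOTS = {"cluster_a", "cluster_b"}


def find_cluster(variety, parent=PARENT, roots=CLUSTER_ROOTS):
    """Walk up the tree until a node chosen as a cluster root is reached."""
    node = variety
    while node is not None:
        if node in roots:
            return node
        node = parent.get(node)
    return None  # variety is unknown or lies outside every cluster


def group_by_cluster(varieties):
    """Group varieties under the cluster root found for each of them."""
    clusters = defaultdict(list)
    for variety in varieties:
        clusters[find_cluster(variety)].append(variety)
    return dict(clusters)


clusters = group_by_cluster(["a_standard", "a_regional_north", "b_standard"])
print(clusters)
```

In this sketch a variety nested several levels deep still resolves to the nearest ancestor marked as a cluster root, so a high-resourced variety and its lower-resourced relatives end up in the same group.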
Tasks
The tasks and data sources of DIALECTBENCH
Language Clusters and Varieties
DIALECTBENCH language clusters with their variety counts per task.
Summaries
Task-specific result summary using the Maximum Obtainable Score. Across tasks, the lowest-scoring varieties lag noticeably behind the average task performance.
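A summary of this kind could be computed along the following lines. This is only a sketch under a stated assumption: it takes the Maximum Obtainable Score of a variety to be the best score any evaluated model reaches on that variety for a given task. The nested `SCORES` layout, the `summarize` function, and every number below are made up for illustration and are not DIALECTBENCH results.

```python
from statistics import mean

# Hypothetical layout: task -> variety -> model -> score.
# Numbers are invented for illustration, not benchmark results.
SCORES = {
    "task_x": {
        "variety_1": {"model_a": 0.80, "model_b": 0.90},
        "variety_2": {"model_a": 0.40, "model_b": 0.30},
        "variety_3": {"model_a": 0.70, "model_b": 0.60},
    },
}


def summarize(scores):
    """Per task: mean and minimum of the per-variety best scores, plus the gap."""
    summary = {}
    for task, by_variety in scores.items():
        # Assumed reading of "Maximum Obtainable Score": best model per variety.
        best = {v: max(models.values()) for v, models in by_variety.items()}
        worst_variety = min(best, key=best.get)
        average = mean(best.values())
        summary[task] = {
            "mean": average,
            "min_variety": worst_variety,
            "min": best[worst_variety],
            "gap": average - best[worst_variety],
        }
    return summary


summary = summarize(SCORES)
print(summary["task_x"])
```

The `gap` field makes the lag explicit: it is the distance between the average best score on a task and the best score of the weakest variety.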
Bibtex
@inproceedings{faisal-etal-2024-dialectbench,
title = "{DIALECTBENCH}: An {NLP} Benchmark for Dialects, Varieties, and Closely-Related Languages",
author = "Faisal, Fahim and
Ahia, Orevaoghene and
Srivastava, Aarohi and
Ahuja, Kabir and
Chiang, David and
Tsvetkov, Yulia and
Anastasopoulos, Antonios",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.777",
doi = "10.18653/v1/2024.acl-long.777",
pages = "14412--14454",
}