DialectBench

DIALECTBENCH is the first large-scale benchmark for NLP on language varieties. It aggregates datasets for 10 text-level tasks covering 281 language varieties.

Data and Language Varieties Selection: We surveyed papers published in the ACL Anthology over the last 10 years, as well as commonly used online data repositories, to find usable language resources. We selected languages that have well-established, high-resourced varieties. Varieties may differ by location, ethnicity, or other factors.

Cluster-Variety Mapping: We construct several language clusters comprising both high-resourced dialects and their low-resourced counterparts. We use the Glottolog language database to define clusters and assign varieties.
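As a rough illustration of the cluster-variety mapping described above, the sketch below groups varieties into clusters and separates high-resourced members from low-resourced ones. It is not the DIALECTBENCH code: the record layout, the `build_clusters` helper, and all cluster and variety names are hypothetical placeholders, not real Glottolog entries.

```python
from collections import defaultdict

# Hypothetical records: (variety, cluster, is_high_resourced).
# All names are placeholders, not actual Glottolog entries.
RECORDS = [
    ("variety-a1", "cluster-a", True),
    ("variety-a2", "cluster-a", False),
    ("variety-a3", "cluster-a", False),
    ("variety-b1", "cluster-b", True),
    ("variety-b2", "cluster-b", False),
]


def build_clusters(records):
    """Group varieties by cluster, split into high- and low-resourced lists."""
    clusters = defaultdict(lambda: {"high": [], "low": []})
    for variety, cluster, is_high in records:
        clusters[cluster]["high" if is_high else "low"].append(variety)
    return dict(clusters)


clusters = build_clusters(RECORDS)
for name, members in sorted(clusters.items()):
    total = len(members["high"]) + len(members["low"])
    print(f"{name}: {total} varieties, high={members['high']}, low={members['low']}")
```

In practice the cluster assignment would presumably be looked up from Glottolog rather than written by hand; the hand-written list here only stands in for that lookup.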

Tasks

Figure: The tasks and data sources of DIALECTBENCH.

Language Clusters and Varieties

Figure: DIALECTBENCH language clusters with their variety counts per task.

Summaries

Figure: Task-specific result summary using the Maximum Obtainable Score. For each task, the varieties with the minimum scores lag noticeably behind the average task performance.
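The comparison in this summary can be sketched in a few lines. The example assumes that the Maximum Obtainable Score of a variety is the best score any evaluated model reaches on it, which is an interpretation and not a definition given on this page. The `summarize` helper, the model names, and all numbers are made up for illustration and are not DIALECTBENCH results.

```python
from statistics import mean

# Hypothetical scores for one task: variety -> {model name: score}.
# These numbers are invented for illustration only.
SCORES = {
    "variety-a1": {"model-x": 90.0, "model-y": 85.0},
    "variety-a2": {"model-x": 60.0, "model-y": 70.0},
    "variety-a3": {"model-x": 40.0, "model-y": 35.0},
}


def summarize(scores):
    """Compare the weakest variety with the task average.

    Assumption: the maximum obtainable score of a variety is the
    highest score across all models evaluated on that variety.
    """
    best = {variety: max(by_model.values()) for variety, by_model in scores.items()}
    average = mean(best.values())
    min_variety = min(best, key=best.get)
    return {
        "best": best,
        "average": average,
        "min_variety": min_variety,
        "gap": average - best[min_variety],
    }


summary = summarize(SCORES)
print(f"average={summary['average']:.2f}, "
      f"lowest={summary['min_variety']}, gap={summary['gap']:.2f}")
```

A large `gap` value corresponds to the lag the summary describes: the lowest-scoring variety sits well below the task average even when the best available model is used.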

BibTeX

@inproceedings{faisal-etal-2024-dialectbench,
    title = "{DIALECTBENCH}: An {NLP} Benchmark for Dialects, Varieties, and Closely-Related Languages",
    author = "Faisal, Fahim  and
      Ahia, Orevaoghene  and
      Srivastava, Aarohi  and
      Ahuja, Kabir  and
      Chiang, David  and
      Tsvetkov, Yulia  and
      Anastasopoulos, Antonios",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.777",
    doi = "10.18653/v1/2024.acl-long.777",
    pages = "14412--14454",
}