As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether and to what extent NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, making observations about their cross-lingual consistency and offering suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions.
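As a rough illustration of the kind of quantification involved, the sketch below compares the distribution of countries associated with a dataset's linked entities against the distribution of the language's speakers. The country counts and the choice of Jensen-Shannon divergence as the gap measure are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def normalize(counts):
    """Raw counts -> probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1) between p and q."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    kl = lambda a: sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical toy counts: countries of entities linked in an English dataset
dataset_entities = Counter({"US": 620, "UK": 180, "India": 40, "Nigeria": 10})
# Hypothetical toy counts: where the language's speakers actually live (millions)
speaker_population = Counter({"US": 330, "UK": 67, "India": 260, "Nigeria": 125})

gap = js_divergence(normalize(dataset_entities), normalize(speaker_population))
print(f"representativeness gap (JSD): {gap:.3f}")  # 0 would mean a perfect match
```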
EMNLP
SD-QA: Spoken Dialectal Question Answering for the Real World
Question answering (QA) systems are now available through numerous commercial applications for a wide variety of domains, serving millions of users who interact with them via speech interfaces. However, current benchmarks in QA research do not account for the errors that speech recognition models might introduce, nor do they consider the language variations (dialects) of the users. To address this gap, we augment an existing QA dataset to construct a multi-dialect, spoken QA benchmark in five languages (Arabic, Bengali, English, Kiswahili, Korean) with more than 68k audio prompts in 24 dialects from 255 speakers. We provide baseline results showcasing the real-world performance of QA systems and analyze the effect of language variety and other sensitive speaker attributes on downstream performance. Last, we study the fairness of the ASR and QA models with respect to the underlying user populations.
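To make the evaluation setup concrete, here is a minimal, assumed sketch of scoring the same QA system on each dialect's ASR output and reporting the resulting gap; the `asr_transcribe` and `answer_question` placeholders and the toy examples are hypothetical stand-ins, not SD-QA's released code.

```python
from collections import defaultdict

def asr_transcribe(audio_path: str, dialect: str) -> str:
    """Placeholder for a speech recognizer (hypothetical)."""
    return "who wrote hamlet"  # toy transcript

def answer_question(question: str) -> str:
    """Placeholder for a QA model (hypothetical)."""
    return "William Shakespeare"

def exact_match(pred: str, gold: str) -> int:
    return int(pred.strip().lower() == gold.strip().lower())

def per_dialect_scores(examples):
    """examples: dicts with 'audio', 'dialect', and 'gold' keys."""
    scores = defaultdict(list)
    for ex in examples:
        question = asr_transcribe(ex["audio"], ex["dialect"])
        scores[ex["dialect"]].append(exact_match(answer_question(question), ex["gold"]))
    return {d: sum(s) / len(s) for d, s in scores.items()}

# Toy data: the dialect gap is the spread between best- and worst-served dialects.
examples = [
    {"audio": "a1.wav", "dialect": "en-IN", "gold": "William Shakespeare"},
    {"audio": "a2.wav", "dialect": "en-NG", "gold": "William Shakespeare"},
]
scores = per_dialect_scores(examples)
print(scores, "gap:", max(scores.values()) - min(scores.values()))
```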
AACL
Phylogeny-Inspired Adaptation of Multilingual Models to New Languages
The literature-based discovery process identifies important but implicit relations among information embedded in published literature. Existing techniques from Information Retrieval (IR) and Natural Language Processing (NLP) attempt to identify hidden or unpublished connections between information concepts within published literature; however, these techniques overlook the prediction of future and emerging relations among scientific knowledge components, such as author-selected keywords encapsulated within the literature. A Keyword Co-occurrence Network (KCN), built upon author-selected keywords, can be considered a knowledge graph that captures both these knowledge components and the knowledge structure of a scientific domain by examining the relationships between knowledge entities. Using data from two multidisciplinary research domains outside the bio-medical domain, and capitalizing on bibliometrics, the dynamicity of temporal KCNs, and a recurrent neural network, this study develops novel features that support the prediction of future literature-based discoveries, i.e., the emerging connections (co-appearances in the same article) among keywords. Temporal importance extracted from both bipartite and unipartite networks, communities defined by genealogical relations, and the relative importance of temporal citation counts were used in the feature construction process. Both node- and edge-level features were fed into a recurrent neural network to forecast the feature values and predict future relations between the scientific concepts/topics represented by the author-selected keywords. High performance, compared against both a contemporary heterogeneous network-based method and the preferential attachment process, suggests that these features complement both the prediction of future literature-based discoveries and emerging trend analysis.
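As a rough sketch of the overall pipeline described above, one can build yearly keyword co-occurrence graphs, derive a per-year feature sequence for each candidate keyword pair, and let a recurrent network score future co-occurrence. The feature set, toy data, and untrained model below are simplified assumptions, not the study's exact implementation.

```python
import itertools
import networkx as nx
import torch
import torch.nn as nn

def yearly_graph(papers):
    """papers: list of keyword lists for one year -> co-occurrence graph."""
    g = nx.Graph()
    for kws in papers:
        g.add_nodes_from(kws)
        g.add_edges_from(itertools.combinations(set(kws), 2))
    return g

def pair_features(g, u, v):
    """Simple per-year features for a keyword pair (degrees, common neighbors)."""
    du = g.degree(u) if u in g else 0
    dv = g.degree(v) if v in g else 0
    common = len(list(nx.common_neighbors(g, u, v))) if u in g and v in g else 0
    return [du, dv, common]

class PairLSTM(nn.Module):
    """LSTM over a pair's yearly feature sequence, followed by a link score."""
    def __init__(self, n_features=3, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, years, n_features)
        _, (h, _) = self.lstm(x)
        return torch.sigmoid(self.out(h[-1])).squeeze(-1)

# Toy corpus: keyword lists per paper, grouped by year (hypothetical data).
years = [
    [["deep learning", "nlp"], ["nlp", "parsing"]],                  # year 1
    [["deep learning", "nlp", "transformers"], ["parsing", "nlp"]],  # year 2
]
graphs = [yearly_graph(papers) for papers in years]
pair = ("deep learning", "parsing")   # keywords that have not yet co-occurred
seq = torch.tensor([[pair_features(g, *pair) for g in graphs]], dtype=torch.float)
print("predicted co-occurrence probability:", PairLSTM()(seq).item())
```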