I am a Ph.D. student in the Department of Computer Science and part of the GMU NLP group at George Mason University, where I am advised by Dr. Antonios Anastasopoulos. My research focuses on adapting language models for low-resource languages. More broadly, I am interested in exploring the intersection of language and computational modeling.
I am on the job market. I am graduating later this year (Fall 2025) and looking for postdoc, machine-learning engineer, and research scientist positions, as well as Summer 2025 internships. (my CV)
News
Nov 1, 2024
I am at EMNLP 2024, presenting my paper at the MRL Workshop.
Aug 25, 2024
DialectBench received the Best Social Impact Award at ACL 2024.
Aug 15, 2024
Wrapped up my summer internship at eBay, where I worked on creating a policy-aligned synthetic dataset for e-commerce LLM safety alignment.
Aug 15, 2022
Paper “Phylogeny-Inspired Adaptation of Multilingual Models to New Languages” accepted to the AACL 2022 main conference.
Selected Publications
ACL
DIALECTBENCH: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages
Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different varieties. We provide substantial proof of performance disparities between standard and non-standard language varieties, and we also identify language clusters with larger performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for varieties and one step towards advancing it further.
ACL
Dataset Geography: Mapping Language Data to Language Users
Faisal, Fahim, Wang, Yinkai, and Anastasopoulos, Antonios
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) May 2022
As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify if and by how much NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency and giving suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions.
EMNLP
SD-QA: Spoken Dialectal Question Answering for the Real World
Question answering (QA) systems are now available through numerous commercial applications for a wide variety of domains, serving millions of users that interact with them via speech interfaces. However, current benchmarks in QA research do not account for the errors that speech recognition models might introduce, nor do they consider the language variations (dialects) of the users. To address this gap, we augment an existing QA dataset to construct a multi-dialect, spoken QA benchmark on five languages (Arabic, Bengali, English, Kiswahili, Korean) with more than 68k audio prompts in 24 dialects from 255 speakers. We provide baseline results showcasing the real-world performance of QA systems and analyze the effect of language variety and other sensitive speaker attributes on downstream performance. Last, we study the fairness of the ASR and QA models with respect to the underlying user populations.
AACL
Phylogeny-Inspired Adaptation of Multilingual Models to New Languages