Welo Data has released new research introducing a bilingual evaluation framework that reveals how leading large language models (LLMs) handle complex causal reasoning tasks across multiple languages. The study, “Diagnosing Performance Gaps in Causal Reasoning via Bilingual Prompting in LLMs,” identifies patterns of performance degradation in reasoning when models are tested using prompts that blend languages, conditions that more closely reflect real-world usage.
Eight LLMs from four major developers were evaluated using more than 70,700 prompts across six languages (English, Spanish, Japanese, Korean, Arabic, and Turkish) and four distinct question types: causal discovery, confounder identification, language variation, and norm violation.
Multilingual Evaluation Approach
Existing causal reasoning benchmarks often fall short in both linguistic diversity and task complexity. To address this gap, the Welo Data research team developed a rigorously constructed dataset comprising narrative-based causal reasoning prompts designed by human analysts with advanced degrees and at least five years of professional experience.
Key Findings and Implications
- Bilingual prompts introduced subtle but consistent accuracy drops, with performance declining by an average of 4.6% compared to monolingual prompts.
- Models displayed a recency bias, privileging the question language over the story language, which influenced both response accuracy and the language of model-generated reasoning.
- Across binary causal tasks, models exhibited a negative response bias, tending to reject causal claims, particularly when the correct answer was “yes.”
- Larger models significantly outperformed smaller ones, especially in confounder detection and norm violation scenarios.
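The release does not describe the study’s actual evaluation pipeline, but the core protocol it implies, pairing a story in one language with a question in another and comparing monolingual against bilingual accuracy, can be sketched in a few lines. All names below (`Item`, `build_prompt`, `bilingual_gap`) are hypothetical illustrations, not the study’s code or data schema:

```python
from dataclasses import dataclass
from itertools import permutations

# Hypothetical item: a causal story plus a binary question, each available
# in several languages. Structure is illustrative only.
@dataclass
class Item:
    stories: dict[str, str]    # language code -> story text
    questions: dict[str, str]  # language code -> question text
    answer: str                # gold label, "yes" or "no"

def build_prompt(item: Item, story_lang: str, question_lang: str) -> str:
    """Blend a story in one language with a question in another."""
    return f"{item.stories[story_lang]}\n\n{item.questions[question_lang]}"

def accuracy(model, items, story_lang, question_lang):
    """Fraction of items the model answers correctly under one language pairing."""
    correct = 0
    for item in items:
        pred = model(build_prompt(item, story_lang, question_lang))
        correct += int(pred.strip().lower() == item.answer)
    return correct / len(items)

def bilingual_gap(model, items, langs):
    """Average monolingual accuracy minus average bilingual accuracy.

    A positive gap would correspond to the kind of accuracy drop the
    study reports for blended-language prompts.
    """
    mono = [accuracy(model, items, l, l) for l in langs]
    bi = [accuracy(model, items, s, q) for s, q in permutations(langs, 2)]
    return sum(mono) / len(mono) - sum(bi) / len(bi)
```

In this framing, `model` is any callable from prompt text to an answer string, so the same harness can score multiple LLMs across every story-language/question-language pairing.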
“Multilingual evaluation typically relies on monolingual prompts translated into various languages,” said Dr. Abigail Thornton, Head of Research at Welo Data and co-author of the study. “But real-world use cases can often be bilingual or multilingual. Our findings show that prompt structure, language pairing, and model architecture can significantly shape how models interpret causality when reasoning requests are in more than one language.”
“Evaluations like these push beyond surface-level benchmarks and give us real insight into how language models handle complexity,” said Dr. David Harper, Data Scientist at Welo Data and lead author of the study. “The implications for multilingual applications—from global compliance to cross-border customer service—are significant.”
The research demonstrates that bilingual use cases surface inherent deficiencies that are not apparent in traditional benchmarks. These insights are essential for developers aiming to improve model generalization, robustness, and consistency in global applications.
This work represents the latest phase of Welo Data’s ongoing efforts to build transparent and scientifically grounded evaluation methodologies for LLMs. Its Model Assessment Suite is designed to stress-test LLM capabilities using complex, domain-specific scenarios in multiple languages.
More information is available at welodata.ai.
Welo Data, a division of Welocalize, stands at the forefront of the AI training data industry, delivering exceptional data quality and security. Supported by a global network of over 500,000 AI training professionals and domain experts, along with cutting-edge technological infrastructure, Welo Data fulfills the growing demand for dependable training data across diverse AI applications. Its service offerings span a variety of critical areas, including data annotation and labeling, large language model (LLM) enhancement, data collection and generation, and relevance and intent assessment. Welo Data’s technical expertise ensures that datasets are not only accurate but also culturally aligned, tackling significant AI development challenges like minimizing model bias and improving inclusivity. Its NIMO (Network Identity Management and Operations) framework guarantees the highest level of accuracy and quality in AI training data by leveraging advanced workforce assurance methods.