Welo Data has released new research introducing a bilingual evaluation framework that reveals how leading large language models (LLMs) handle complex causal reasoning tasks across multiple languages. The study, “Diagnosing Performance Gaps in Causal Reasoning via Bilingual Prompting in LLMs,” identifies patterns of performance degradation in reasoning when models are tested using prompts that blend languages, conditions that more closely reflect real-world usage.
Eight LLMs from four major developers were evaluated using more than 70,700 prompts across six languages (English, Spanish, Japanese, Korean, Arabic, and Turkish) and four distinct question types: causal discovery, confounder identification, language variation, and norm violation.
Multilingual Evaluation Approach
Existing causal reasoning benchmarks often fall short in both linguistic diversity and task complexity. To address this gap, the Welo Data research team developed a rigorously constructed dataset comprising narrative-based causal reasoning prompts designed by human analysts with advanced degrees and at least five years of professional experience.
Key Findings and Implications
- Bilingual prompts introduced subtle but consistent accuracy drops, with performance declining by an average of 4.6% compared to monolingual prompts.
- Models displayed a recency bias, privileging the question language over the story language, which influenced both response accuracy and the language of model-generated reasoning.
- Across binary causal tasks, models exhibited a negative response bias, tending to reject causal claims, particularly when the correct answer was “yes.”
- Larger models significantly outperformed smaller ones, especially in confounder detection and norm violation scenarios.
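The release does not describe the study’s actual evaluation pipeline, but the core protocol it implies, pairing a story in one language with a question in another and comparing monolingual against bilingual accuracy, can be sketched in a few lines. All names below (`Item`, `build_prompt`, `bilingual_gap`) are hypothetical illustrations, not the study’s code or data schema:

```python
from dataclasses import dataclass
from itertools import permutations

# Hypothetical item: a causal story plus a binary question, each available
# in several languages. Structure is illustrative only.
@dataclass
class Item:
    stories: dict[str, str]    # language code -> story text
    questions: dict[str, str]  # language code -> question text
    answer: str                # gold label, "yes" or "no"

def build_prompt(item: Item, story_lang: str, question_lang: str) -> str:
    """Blend a story in one language with a question in another."""
    return f"{item.stories[story_lang]}\n\n{item.questions[question_lang]}"

def accuracy(model, items, story_lang, question_lang):
    """Fraction of items the model answers correctly under one language pairing."""
    correct = 0
    for item in items:
        pred = model(build_prompt(item, story_lang, question_lang))
        correct += int(pred.strip().lower() == item.answer)
    return correct / len(items)

def bilingual_gap(model, items, langs):
    """Average monolingual accuracy minus average bilingual accuracy.

    A positive gap would correspond to the kind of accuracy drop the
    study reports for blended-language prompts.
    """
    mono = [accuracy(model, items, l, l) for l in langs]
    bi = [accuracy(model, items, s, q) for s, q in permutations(langs, 2)]
    return sum(mono) / len(mono) - sum(bi) / len(bi)
```

In this framing, `model` is any callable from prompt text to an answer string, so the same harness can score multiple LLMs across every story-language/question-language pairing.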
“Multilingual evaluation typically relies on monolingual prompts translated into various languages,” said Dr. Abigail Thornton, Head of Research at Welo Data and co-author of the study. “But real-world use cases can often be bilingual or multilingual. Our findings show that prompt structure, language pairing, and model architecture can significantly shape how models interpret causality when reasoning requests are in more than one language.”
“Evaluations like these push beyond surface-level benchmarks and give us real insight into how language models handle complexity,” said Dr. David Harper, Data Scientist at Welo Data and lead author of the study. “The implications for multilingual applications—from global compliance to cross-border customer service—are significant.”
The research demonstrates that bilingual use cases surface inherent deficiencies that are not apparent in traditional benchmarks. These insights are essential for developers aiming to improve model generalization, robustness, and consistency in global applications.
This work represents the latest phase of Welo Data’s ongoing efforts to build transparent and scientifically grounded evaluation methodologies for LLMs. Its Model Assessment Suite is designed to stress-test LLM capabilities using complex, domain-specific scenarios in multiple languages.
More information is available at welodata.ai.
Welo Data, a division of Welocalize, stands at the forefront of the AI training data industry, delivering exceptional data quality and security. Supported by a global network of over 500,000 AI training professionals and domain experts, along with cutting-edge technological infrastructure, Welo Data fulfills the growing demand for dependable training data across diverse AI applications. Its service offerings span a variety of critical areas, including data annotation and labeling, large language model (LLM) enhancement, data collection and generation, and relevance and intent assessment. Welo Data’s technical expertise ensures that datasets are not only accurate but also culturally aligned, tackling significant AI development challenges like minimizing model bias and improving inclusivity. Its NIMO (Network Identity Management and Operations) framework guarantees the highest level of accuracy and quality in AI training data by leveraging advanced workforce assurance methods.