Welocalize, a leader in AI-enabled translation and multilingual content solutions, today announced the release of LangMark, a human-annotated dataset designed to advance research in automatic post-editing (APE) of machine translation (MT) output. Comprising more than 206,000 translation triplets across seven language pairs, LangMark addresses a key need for high-quality evaluation data tailored to modern neural machine translation (NMT) systems.
Developed from real enterprise marketing content and annotated by expert linguists, LangMark includes English source segments, NMT outputs, and corresponding human post-edits in Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish.
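Each LangMark entry is a triplet of source, machine output, and human post-edit. As a minimal sketch of that structure (the field names here are hypothetical; the released dataset's actual schema may differ):

```python
from dataclasses import dataclass

@dataclass
class LangMarkTriplet:
    # Hypothetical field names for illustration only.
    source: str     # English source segment
    mt_output: str  # raw neural MT output in the target language
    post_edit: str  # human linguist's post-edited version
    lang_pair: str  # e.g. "en-fr"

example = LangMarkTriplet(
    source="Discover our new product line.",
    mt_output="Découvrez notre nouvelle gamme de produit.",
    post_edit="Découvrez notre nouvelle gamme de produits.",
    lang_pair="en-fr",
)
```

An APE model is given the source and the MT output and asked to produce (or decline to produce) an edit; the human post-edit serves as the reference.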
“LangMark offers a much-needed benchmark that reflects how post-editing actually happens in the real world,” said Diego Velazquez, AI/ML Engineer at Welocalize. “It challenges models to not only improve translations, but to decide when edits are needed—and when they aren’t.”
Key Findings from the Study:
- LLMs Can Outperform NMT in Complex Languages. In few-shot prompting experiments, large language models (LLMs)—particularly GPT-4o—outperformed a strong proprietary NMT baseline in most languages, especially Japanese and Russian.
- APE Remains a Nuanced Task. Most LLMs made significantly fewer edits than human linguists, highlighting the complexity of matching human editorial judgment.
- Conservative Models Performed Best. High-precision models like GPT-4o, which made fewer but more accurate edits, achieved better overall scores than high-recall models such as Claude 3.5-Sonnet, which tended to over-edit.
- Evaluation Metrics Need to Evolve. Traditional metrics such as BLEU and chrF do not fully capture the decision-making behind post-editing. The study emphasizes the need for new metrics that account for whether an edit is necessary at all—not just how similar the output is to a reference.
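The precision/recall framing behind the findings above—did a model edit exactly the segments that human linguists also edited, or did it over-edit?—can be sketched with a simple edit-decision score. This is an illustrative stand-in, not the study's actual evaluation protocol:

```python
def was_edited(mt: str, post_edit: str) -> bool:
    """A segment counts as edited if the post-edit differs from the raw MT."""
    return mt.strip() != post_edit.strip()

def edit_decision_scores(mt_outputs, model_outputs, human_edits):
    """Precision and recall of a model's *decision to edit*, judged
    against human post-editors. A false positive is an over-edit
    (the model changed a segment humans left alone); a false
    negative is a missed edit."""
    tp = fp = fn = 0
    for mt, model, human in zip(mt_outputs, model_outputs, human_edits):
        model_edited = was_edited(mt, model)
        human_edited = was_edited(mt, human)
        if model_edited and human_edited:
            tp += 1
        elif model_edited and not human_edited:
            fp += 1  # over-editing
        elif human_edited:
            fn += 1  # missed edit
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this framing, a conservative high-precision model edits fewer segments but is usually right when it does, while a high-recall model catches more human-flagged segments at the cost of over-editing.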
LangMark was designed to reflect professional translation workflows. Its editing criteria, source materials, and linguist qualifications mirror enterprise content practices, making it a rigorous and realistic benchmark for testing APE models.
“This dataset is a powerful resource for anyone building multilingual AI,” added Konstantinos Karageorgos, AI/ML Engineering Lead at Welocalize. “It helps move us beyond surface-level metrics and toward systems that better emulate human judgment and language nuance.”
LangMark is intended to support future research into context-aware APE, multilingual model evaluation, and human-in-the-loop translation systems.
Download the full study here.
Welocalize, Inc.
Welocalize is a leading technology-enabled provider of translation, localization, and AI-driven content solutions, helping businesses communicate, innovate, and grow globally. Specializing in complex and regulated industries, Welocalize delivers precise, scalable multilingual content through a powerful combination of advanced AI technologies and expert human talent. At the core is Welocalize’s AI-enabled OPAL platform, which transforms translation workflows by integrating machine translation (MT) and large language models (LLMs) to provide fast, accurate, and culturally relevant content in over 300 languages. With a commitment to excellence, Welocalize holds seven ISO certifications. Welocalize is headquartered in New York with offices around the globe. welocalize.com