Follow-up on the TranslateGemma subtitle benchmark: human review of segments rated “clean” by MetricX-24 and COMETKiwi [D]


By AI Maestro, May 12, 2026

A few weeks ago I shared the results of a benchmark comparing six large language models (LLMs) on subtitle translation, scored with two reference-free Quality Estimation (QE) metrics: MX from MetricX-24 (~13B mT5-XXL) and CK from COMETKiwi (~10.7B XLM-R-XXL). The scores were combined into a TQI index, which put TranslateGemma-12b first in every language pair.

The original benchmark suggested that these high scores might be overly optimistic. Specifically, the question was whether the metrics are sensitive enough to catch errors at their highest confidence levels. These metrics correlate well with human judgment at a population level (that’s what they’re trained for), but that correlation doesn’t tell us whether the individual segments they label as “clean” are actually free of errors.

To address this, we ran an independent human review on 21 English subtitle segments from one tutorial video. TranslateGemma was tested in four languages: Spanish (ES), Japanese (JA), Thai (TH), and Simplified Chinese (ZH-CN); Korean and Traditional Chinese were excluded. The segments were chosen because their translations passed the dashboard clean-rule (MX < 5 AND CK ≥ 0.70) in all four languages simultaneously, giving 21 × 4 = 84 translations.
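As a rough illustration of the selection step, here is a minimal Python sketch of the clean-rule applied across all four languages. Only the thresholds (MX < 5, CK ≥ 0.70) and the language codes come from the post; the function names, the dictionary layout, and the example scores are hypothetical and are not the dashboard's actual code.

```python
# Thresholds from the post's clean-rule: MX < 5 AND CK >= 0.70.
# MetricX-24 is an error score (lower is better); COMETKiwi is a
# quality score (higher is better).
MX_MAX = 5.0
CK_MIN = 0.70
LANGS = ("ES", "JA", "TH", "ZH-CN")


def is_clean(mx: float, ck: float) -> bool:
    """Return True if a single translation passes the clean-rule."""
    return mx < MX_MAX and ck >= CK_MIN


def clean_in_all_languages(scores: dict[str, tuple[float, float]]) -> bool:
    """Return True if the segment is clean in every language in LANGS.

    `scores` maps a language code to an (MX, CK) pair. This layout is an
    assumption made for the example.
    """
    return all(lang in scores and is_clean(*scores[lang]) for lang in LANGS)


# Made-up scores for one segment, for illustration only.
example_segment = {
    "ES": (1.2, 0.88),
    "JA": (2.9, 0.86),
    "TH": (3.4, 0.79),
    "ZH-CN": (2.1, 0.84),
}

# Same segment, but the Thai translation falls below the CK threshold.
example_rejected = {**example_segment, "TH": (3.4, 0.65)}

print(clean_in_all_languages(example_segment))
print(clean_in_all_languages(example_rejected))
```

Requiring the rule to hold in all four languages at once keeps the same 21 source segments comparable across languages, at the cost of a smaller sample.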

Professional linguists then carried out full MQM annotation, scoring each translation for accuracy (mistranslation, omission, addition, untranslated), fluency (grammar, punctuation, inconsistency), style, and terminology. The results are as follows:

  • Auto-flagged: 1 out of 84 translations
  • Human-flagged: 60 out of 84 counting any error, 13 out of 84 counting Major errors only
  • Metric-blindness rate (auto-clean ∩ human-flagged / auto-clean): 59 out of 83 = 71% counting any error, 12 out of 83 = 14.5% counting Major errors only
  • All 25 human-found Accuracy-class errors fell in the metric-blind quadrant. Zero overlap with the auto-flagged region (which contained one Style-category Major error).
  • Japanese carries 10 out of 15 total mistranslations across the dataset, all metric-blind, despite having the highest mean COMETKiwi score (0.863) of the four languages.
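The metric-blindness figures above follow directly from the reported counts. The sketch below reproduces the arithmetic; the function name and argument names are hypothetical. It assumes the single auto-flagged translation is also human-flagged, which matches the post's note that the auto-flagged region contained one Style-category Major error.

```python
def metric_blindness(total: int, auto_flagged: int,
                     human_flagged: int, both_flagged: int):
    """Return (blind, auto_clean, rate).

    blind      = translations the metrics rated clean but humans flagged
    auto_clean = translations the metrics rated clean
    rate       = blind / auto_clean
    """
    auto_clean = total - auto_flagged
    blind = human_flagged - both_flagged
    return blind, auto_clean, blind / auto_clean


# Counts reported in the post: 84 translations, 1 auto-flagged,
# 60 human-flagged for any error, 13 human-flagged for Major errors.
# The one auto-flagged translation is counted as human-flagged in both cases.
any_blind, any_clean, any_rate = metric_blindness(84, 1, 60, 1)
major_blind, major_clean, major_rate = metric_blindness(84, 1, 13, 1)

print(f"any error: {any_blind}/{any_clean} = {any_rate:.1%}")
print(f"Major only: {major_blind}/{major_clean} = {major_rate:.1%}")
```

The denominator is the 83 auto-clean translations rather than all 84, because the rate measures how often a "clean" verdict from the metrics hides a human-found error.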

Note: This analysis is based on a small sample (21 segments from one video) and one model, so the numbers are directional rather than definitive. They do, however, indicate how these metrics behave at their highest confidence levels.

For more details, see the original thread: [link]

The full benchmark report can be found in the comments.

Key Takeaways

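  • MX and CK still miss errors when they are most confident: 59 of the 83 auto-clean translations (71%) carried at least one human-flagged error, and 12 (14.5%) carried a Major one.
  • Japanese is the starkest case: it had the highest mean COMETKiwi score (0.863) of the four languages, yet accounted for 10 of the 15 mistranslations, all of them rated clean by the metrics.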
  • The human review highlights the need for additional checks beyond automated metrics when evaluating language models’ performance on sensitive tasks like subtitle translation.

This follow-up underscores the importance of both quantitative and qualitative evaluations in assessing AI model performance, especially for applications where accuracy is paramount.


Originally published at reddit.com. Curated by AI Maestro.
