Translation & Alignment Quality
Transparent evaluation of our AI-assisted multilingual pipeline using industry-standard automatic metrics. Updated monthly with full methodology documentation.
Last evaluated: Nov 12, 2025 • Pipeline v4.2.1
BLEU Score (Bilingual Evaluation Understudy)
Measures modified n-gram precision of machine-generated translations against human reference translations, with a brevity penalty for output that is shorter than the reference. Scores range from 0–100. Higher values indicate closer lexical overlap with expert-curated text.
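BLEU = BP · exp(∑ₙ wₙ · log pₙ), where BP is the brevity penalty, pₙ is the modified n-gram precision for n = 1…N, and wₙ = 1/N (typically N = 4)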
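The formula can be illustrated with a minimal pure-Python sketch. It assumes whitespace tokenization, a single reference per segment, uniform weights, and no smoothing, so its numbers will not match SacreBLEU exactly; the function names `ngrams` and `corpus_bleu` are illustrative and not part of the pipeline.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (one reference per hypothesis)."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    # Uniform weights w_n = 1 / max_n.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: 1 if the output is longer than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


print(round(corpus_bleu(["the cat sat on a mat"], ["the cat sat on the mat"]), 2))
```

For published figures, a standardized implementation such as SacreBLEU is preferable because tokenization and smoothing choices change the score.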
COMET Score (Crosslingual Optimized Metric for Evaluation of Translation)
Neural metric that correlates strongly with human Direct Assessment (DA) judgments (correlation > 0.85). Evaluates semantic fidelity, fluency, and cultural adaptation across language pairs.
COMET = f(source, target, reference; θ)
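As a rough sketch of how the three inputs reach the model, the snippet below builds the segment dictionaries that the `unbabel-comet` package expects (`src`, `mt`, `ref`) and scores them. The helper names, the default checkpoint `Unbabel/wmt22-comet-da`, and the multiplication by 100 to match the dashboard's 0-100 column are assumptions, not confirmed details of the pipeline.

```python
def build_comet_inputs(sources, hypotheses, references):
    """Pair each source, machine translation, and reference into one sample."""
    if not (len(sources) == len(hypotheses) == len(references)):
        raise ValueError("sources, hypotheses, and references must align 1:1")
    return [
        {"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)
    ]


def score_with_comet(samples, model_name="Unbabel/wmt22-comet-da", batch_size=8):
    """Return a system-level COMET score rescaled to 0-100 (assumed scaling)."""
    # Imported lazily: requires `pip install unbabel-comet` and a model download.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model(model_name))
    output = model.predict(samples, batch_size=batch_size, gpus=0)  # CPU only
    return 100 * output.system_score


samples = build_comet_inputs(
    ["The library opens at nine."],
    ["La biblioteca abre a las nueve."],
    ["La biblioteca abre a las nueve."],
)
print(samples[0]["mt"])
```

Calling `score_with_comet(samples)` downloads the checkpoint on first use, so it is left out of the example run above.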
📊 Live Evaluation Dashboard
| Language Pair | BLEU (0-100) | COMET (0-100) | Trend (30d) | Status |
|---|---|---|---|---|
| 🇬🇧 EN → ES (English → Spanish) | 87.4 | 91.2 | ▲ +2.1 | Excellent |
| 🇬🇧 EN → ZH (English → Mandarin) | 82.9 | 88.5 | ▲ +1.8 | Excellent |
| 🇬🇧 EN → AR (English → Arabic) | 76.3 | 84.1 | ● 0.0 | Good |
| 🇬🇧 EN → HI (English → Hindi) | 71.8 | 79.4 | ▲ +3.2 | Good |
| 🇬🇧 EN → SW (English → Swahili) | 64.2 | 73.8 | ▼ -0.4 | Review |
Evaluation Pipeline
- Source texts sampled from verified academic & editorial domains
- AI-assisted generation using domain-finetuned LLM v4.2
- Parallel reference corpus from peer-reviewed translations
- Automatic scoring via SacreBLEU and COMET-22-DA
- Monthly re-evaluation with stratified random sampling
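The stratified random sampling step could look roughly like the sketch below. The stratum key (here a `domain` field), the sampling fraction, and the fixed seed are illustrative assumptions; the actual strata and sample sizes used by the pipeline are not stated.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw the same fraction from every stratum, at least one item each."""
    rng = random.Random(seed)  # fixed seed keeps the monthly draw reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for name in sorted(strata):  # sorted for a deterministic stratum order
        group = strata[name]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical corpus: 80 academic and 20 editorial source texts.
corpus = [{"id": i, "domain": "academic"} for i in range(80)]
corpus += [{"id": i, "domain": "editorial"} for i in range(80, 100)]

picked = stratified_sample(corpus, key=lambda doc: doc["domain"], fraction=0.1)
print(len(picked))
```

Sampling per stratum keeps smaller domains represented in each monthly run, which a plain random draw over the whole corpus does not guarantee.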
Human-in-the-Loop Validation
- Pairs with low COMET/BLEU scores (<75) trigger expert review
- Cultural nuance checks by native-speaking editors
- Terminology consistency verified against Aevum glossary
- Feedback loop retrains alignment weights quarterly
- Public audit logs available for academic replication
Note on Metrics: BLEU and COMET are automated statistical measures. Although they correlate well with human judgment, they do not replace editorial oversight. Aevum Encyclopedia maintains a mandatory human verification step for all published multilingual entries. Metric thresholds for auto-publishing: BLEU ≥ 78, COMET ≥ 82. Entries that fall below these thresholds require senior editor approval.
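The threshold rule can be expressed as a small gate. This sketch assumes that both thresholds must be met for auto-publishing and that anything else is routed to a senior editor; the function name and the route labels are hypothetical.

```python
BLEU_MIN = 78.0   # auto-publishing threshold stated in the note
COMET_MIN = 82.0  # auto-publishing threshold stated in the note


def publishing_route(bleu, comet):
    """Route an entry by its scores (assumes both thresholds must hold)."""
    if bleu >= BLEU_MIN and comet >= COMET_MIN:
        return "auto-publish eligible"
    return "senior editor approval"


# Scores taken from the dashboard table.
dashboard = {
    "EN-ES": (87.4, 91.2),
    "EN-ZH": (82.9, 88.5),
    "EN-AR": (76.3, 84.1),
    "EN-HI": (71.8, 79.4),
    "EN-SW": (64.2, 73.8),
}

for pair, (bleu, comet) in dashboard.items():
    print(pair, publishing_route(bleu, comet))
```

Under this reading, EN-AR clears the COMET threshold but misses the BLEU threshold, so it would still need senior editor approval.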