Aevum Encyclopedia — BLEU & COMET Metrics

BLEU Score (Bilingual Evaluation Understudy)

Measures modified n-gram precision of machine-generated translations against human reference translations, with a brevity penalty for output that is too short. Scores range from 0 to 100; higher values indicate closer lexical overlap with the reference text.

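BLEU = BP · exp(∑ₙ wₙ · log pₙ), n = 1…N

Here pₙ is the modified n-gram precision for order n, wₙ is its weight (typically 1/N with N = 4), and BP is the brevity penalty. The result lies between 0 and 1 and is conventionally multiplied by 100 for reporting.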
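The formula can be made concrete with a minimal pure-Python sketch. It assumes a single reference per segment, pre-tokenized input, uniform weights wₙ = 1/4, and no smoothing; the function names `ngram_counts` and `corpus_bleu` are illustrative. SacreBLEU additionally handles tokenization and smoothing, so its scores will generally differ from this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale (one reference per segment)."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches drives the score to 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Uniform weights w_n = 1 / max_n over the log precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: 1 if the hypothesis is longer than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
score = corpus_bleu([hypothesis], [reference])
```

Match counts are accumulated over the whole corpus before the precisions are computed, which is how BLEU is defined at corpus level; averaging per-sentence scores gives a different number.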

COMET Score (Crosslingual Optimized Metric for Evaluation of Translation)

Neural metric that correlates strongly with human judgment as measured by Direct Assessment (DA score >0.85). Evaluates semantic fidelity, fluency, and cultural adaptation across language pairs.

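COMET = f(source, target, reference; θ)

Here target is the machine translation being scored and θ denotes the learned parameters of the neural model.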

📊 Live Evaluation Dashboard

Language Pair	BLEU (0-100)	COMET (0-100)	Trend (30d)	Status
🇬🇧 EN → ES English → Spanish	87.4	91.2	▲ +2.1	Excellent
🇬🇧 EN → ZH English → Mandarin	82.9	88.5	▲ +1.8	Excellent
🇬🇧 EN → AR English → Arabic	76.3	84.1	● 0.0	Good
🇬🇧 EN → HI English → Hindi	71.8	79.4	▲ +3.2	Good
🇬🇧 EN → SW English → Swahili	64.2	73.8	▼ -0.4	Review

Evaluation Pipeline

Source texts sampled from verified academic & editorial domains
AI-assisted generation using domain-finetuned LLM v4.2
Parallel reference corpus from peer-reviewed translations
Automatic scoring via SacreBLEU and COMET-22-DA
Monthly re-evaluation with stratified random sampling
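The stratified random sampling in the last step can be sketched as follows. This is an illustration only: it assumes the strata are source domains (for example "academic" and "editorial"), and the function name, the `fraction` parameter, and the fixed seed are hypothetical choices, not part of the documented pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw the same fraction from every stratum, at least one item each."""
    rng = random.Random(seed)          # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for name in sorted(strata):        # sorted for a deterministic order
        group = strata[name]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical corpus: 10 academic and 20 editorial source texts.
corpus = [{"id": i, "domain": "academic"} for i in range(10)]
corpus += [{"id": i, "domain": "editorial"} for i in range(10, 30)]
monthly_batch = stratified_sample(corpus, key=lambda t: t["domain"], fraction=0.2)
```

Sampling per stratum keeps each domain represented in proportion to its size, so a month-to-month change in score is less likely to reflect a shift in the mix of sampled texts.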

Human-in-the-Loop Validation

Language pairs with low COMET/BLEU scores (<75) trigger expert review
Cultural nuance checks by native-speaking editors
Terminology consistency verified against Aevum glossary
Feedback loop retrains alignment weights quarterly
Public audit logs available for academic replication

Note on Metrics: BLEU and COMET are automated statistical measures. While highly correlated with human judgment, they do not replace editorial oversight. Aevum Encyclopedia maintains a mandatory human verification step for all published multilingual entries. Metric thresholds for auto-publishing: BLEU ≥ 78, COMET ≥ 82. Entries that fall below these thresholds require senior editor approval.
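The threshold rule can be expressed as a small gate. The sketch below assumes that both thresholds must be met for auto-publishing (the note does not say whether one is enough), and the function name and route labels are hypothetical. The scores are the ones shown in the dashboard table.

```python
BLEU_MIN = 78.0    # auto-publishing threshold from the note
COMET_MIN = 82.0   # auto-publishing threshold from the note


def publishing_route(bleu, comet):
    """Return the route for an entry; assumes BOTH thresholds must be met."""
    if bleu >= BLEU_MIN and comet >= COMET_MIN:
        return "auto-publish"
    return "senior editor approval"


# (BLEU, COMET) per language pair, copied from the dashboard.
dashboard = {
    "EN-ES": (87.4, 91.2),
    "EN-ZH": (82.9, 88.5),
    "EN-AR": (76.3, 84.1),
    "EN-HI": (71.8, 79.4),
    "EN-SW": (64.2, 73.8),
}

routes = {pair: publishing_route(b, c) for pair, (b, c) in dashboard.items()}
```

Under this reading, only EN-ES and EN-ZH clear both thresholds; EN-AR meets the COMET threshold but falls short on BLEU.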
