Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs
Knowledge distillation (KD) compresses a larger system (the teacher) into a smaller one (the student). In machine translation, studies typically report only the translation quality of the student and omit the computational cost of performing KD, making it difficult to choose among the many available KD methods under compute constraints. In this study, we evaluate representative KD methods on both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs across the model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute constraints.
💡 Research Summary
This paper presents a life-cycle-aware evaluation of knowledge distillation (KD) for neural machine translation (NMT), focusing on the trade-off between translation quality and environmental impact measured as carbon footprint. While most prior work reports only the quality gains of student models, the authors argue that the additional computational cost of the distillation process itself must be accounted for, especially when deployment resources or energy consumption are constrained.
To this end, they adopt the Machine Learning Life Cycle Assessment (MLCA) framework, which quantifies both operational emissions (electricity used during training, distillation, and inference) and embodied emissions (the manufacturing of hardware). The functional unit is defined as "the impact of producing an MT system that serves X translation requests over one year at a specified quality level." The system boundary includes data-center servers (NVIDIA V100 GPUs) but excludes end-user devices and network transfer, which are assumed constant across the compared systems. Operational emissions are calculated from measured wall-clock time, GPU power draw, a Power Usage Effectiveness (PUE) of 1.24, and a regional grid emission factor of 0.033 kg CO₂e/kWh. Embodied emissions are amortized over a five-year GPU lifetime with an active utilization rate of 0.8 during training/distillation and 0.2 during inference, using a manufacturing footprint of 150 kg CO₂e per GPU.
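To make this accounting concrete, here is a minimal Python sketch of the two emission terms. The constants restate the figures above; the example power draw, the function names, and the exact amortization scheme are our assumptions, and the MLCA tool's internal formulas may differ:

```python
# Sketch of MLCA-style accounting; constants restate figures from the text,
# while the amortization scheme and example power draw are assumptions.

PUE = 1.24                     # data-center power usage effectiveness
GRID_FACTOR = 0.033            # kg CO2e per kWh (regional grid)
GPU_EMBODIED_KG = 150.0        # manufacturing footprint of one GPU
GPU_LIFETIME_H = 5 * 365 * 24  # five-year amortization window, in hours

def operational_kg(wall_clock_h: float, gpu_power_kw: float) -> float:
    """Operational emissions: measured runtime x power draw x PUE x grid factor."""
    return wall_clock_h * gpu_power_kw * PUE * GRID_FACTOR

def embodied_kg(wall_clock_h: float, utilization: float) -> float:
    """Embodied emissions: the GPU's manufacturing footprint amortized over
    its lifetime, weighted by the phase's active utilization rate
    (0.8 for training/distillation, 0.2 for inference in the text)."""
    return GPU_EMBODIED_KG * (wall_clock_h * utilization) / GPU_LIFETIME_H

# Example: a hypothetical 48 h training run on one V100 drawing ~0.3 kW.
total = operational_kg(48, 0.3) + embodied_kg(48, utilization=0.8)
print(f"{total:.3f} kg CO2e")
```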
The experimental setup uses the WMT-2024 English-to-Icelandic parallel corpus, with evaluation on the FLORES+ devtest set using the COMET metric. A large Transformer-Big teacher (205 M parameters) is distilled into two student sizes: Transformer-Base (65 M) and Transformer-Tiny (16 M). Six representative KD methods are examined, covering the two main families in NMT: word-level (logit-based) KD – Word-KD, SEL-KD, TIE-KD – and sequence-level (synthetic-target) KD – Seq-KD, Seq-INTER, Seq-REP. All models are trained on a single V100 GPU with the same optimizer, learning-rate schedule, and early-stopping criteria, ensuring a fair comparison.
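For quick reference, the setup can be laid out as a small configuration sketch; the figures simply restate the paragraph above, and the dictionary layout is ours:

```python
# Experimental setup as stated in the summary (layout is illustrative).
MODELS = {
    "teacher":      {"arch": "Transformer-Big",  "params_m": 205},
    "student-base": {"arch": "Transformer-Base", "params_m": 65},
    "student-tiny": {"arch": "Transformer-Tiny", "params_m": 16},
}
KD_METHODS = {
    "word-level (logit-based)":          ["Word-KD", "SEL-KD", "TIE-KD"],
    "sequence-level (synthetic-target)": ["Seq-KD", "Seq-INTER", "Seq-REP"],
}
DATA = {"train": "WMT-2024 en-is", "eval": "FLORES+ devtest", "metric": "COMET"}
```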
Results are reported as carbon footprints decomposed into teacher training, distillation overhead, and inference emissions for varying token volumes. Key findings include:
- Distillation overhead dominates at low deployment volumes. For workloads of up to roughly 10⁶ tokens per year, the distillation phase accounts for 60-80 % of total emissions, making KD environmentally disadvantageous compared with a directly trained small model.
- Inference dominates at scale, enabling net carbon savings. When the system serves ≥ 10⁹ tokens annually, the reduced inference cost of the student model outweighs the one-time distillation expense, yielding a 20-35 % lower total footprint than using the teacher model directly.
- Word-level KD offers superior footprint-quality trade-offs. Across comparable COMET scores (≈ 0.78), word-level methods achieve 12-18 % lower carbon emissions than their sequence-level counterparts. This advantage stems from avoiding the expensive teacher-side beam-search decoding that sequence-level KD requires; instead, the teacher's logits are computed on the fly during student training, which is computationally cheaper overall (the sketch after this list contrasts the two families).
- Student model size influences the break-even point. The more aggressive compression (Transformer-Tiny) yields larger quality drops but also larger per-token inference savings, shifting the break-even token volume lower. Conversely, the moderate compression (Transformer-Base) maintains higher quality while still offering carbon benefits at high volumes.
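To make the contrast between the two families concrete, here is a minimal PyTorch sketch of a plain word-level KD loss. The temperature, the toy tensor shapes, and the vanilla KL formulation are illustrative assumptions; the paper's Word-KD, SEL-KD, and TIE-KD variants refine this basic recipe in different ways:

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       T: float = 1.0) -> torch.Tensor:
    """Word-level KD: match the student's per-token distribution to the
    teacher's, both computed on the same reference target prefix -- no
    teacher-side beam search is needed."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Sequence-level KD, by contrast, first decodes synthetic targets with the
# teacher (e.g., beam search over the training sources) and then trains the
# student with ordinary cross-entropy on those outputs; that decoding pass
# is the extra cost attributed to the Seq-* variants above.
s = torch.randn(8, 32, 32000)  # (batch, target length, vocabulary) -- toy shapes
t = torch.randn(8, 32, 32000)
loss = word_level_kd_loss(s, t, T=2.0)
```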
Based on these observations, the authors propose practical guidelines: for low-traffic applications, skip KD and train a compact model directly; for high-traffic services, prefer word-level KD and select a student size that balances the required translation quality against the anticipated token volume (a back-of-the-envelope break-even calculation is sketched below). They also emphasize that any model-selection pipeline should incorporate full life-cycle assessment rather than relying solely on traditional quality metrics.
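As a usage illustration of these guidelines, a break-even calculation might look as follows. Every number here is invented for illustration and should be replaced with measured MLCA values:

```python
def total_footprint_kg(one_time_kg: float, per_token_kg: float, tokens: float) -> float:
    """Life-cycle footprint: one-time cost (training and, for the student,
    distillation) plus per-token inference emissions."""
    return one_time_kg + per_token_kg * tokens

# Hypothetical per-system costs in kg CO2e (illustrative, not from the paper).
TEACHER = {"one_time": 30.0, "per_token": 3e-8}  # training only; slower inference
STUDENT = {"one_time": 45.0, "per_token": 1e-8}  # + distillation; faster inference

def break_even_tokens(teacher: dict, student: dict) -> float:
    """Annual token volume above which the distilled student emits less overall."""
    extra_one_time = student["one_time"] - teacher["one_time"]
    per_token_saving = teacher["per_token"] - student["per_token"]
    return extra_one_time / per_token_saving

print(f"KD pays off beyond ~{break_even_tokens(TEACHER, STUDENT):.1e} tokens/year")
```

With these made-up numbers the student wins beyond roughly 7.5 × 10⁸ tokens per year, consistent in spirit with the paper's finding that KD pays off only past a task-dependent usage threshold.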
In conclusion, this work is the first to systematically quantify the environmental cost of KD in MT, demonstrating that the decision to apply KD cannot be made on quality grounds alone. By integrating MLCA into the evaluation of KD methods, the paper provides a reproducible protocol that can guide sustainable AI development and deployment in translation and other sequence‑generation tasks.