Application of Knowledge Distillation in Natural Language Processing
DOI: https://doi.org/10.61173/jkr5zx03

Keywords: Knowledge distillation, Natural language processing, Teacher-student framework, Pre-trained language models

Abstract
As AI technology progresses steadily, natural language processing (NLP) has become an essential field of study in computer science and AI. It encompasses technologies that allow computers to comprehend, analyze, and generate human language. The introduction of large pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) has greatly boosted model performance. However, these models face challenges such as massive parameter counts, high computational costs, and incompatibility with resource-constrained embedded devices. Knowledge distillation (KD), one of the most effective model compression methods, transfers knowledge from a large model to a lightweight model through a teacher-student paradigm and has shown considerable performance-to-efficiency benefits. This paper reviews the fundamental technical strategies of output layer distillation, feature layer distillation, and multi-teacher assisted distillation. It synthesizes pertinent research papers to provide a systematic account of the current state of this technology. It also describes common application cases and experimental findings of knowledge distillation in natural language processing, such as sentiment analysis, text classification, multilingual processing, named entity recognition, and web filtering. The paper concludes by summarizing existing applications of knowledge distillation in natural language processing and outlining future development perspectives.
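To make the teacher-student paradigm concrete, the sketch below shows a minimal output layer distillation loss in PyTorch, following the classic soft-label formulation: the student is trained to match the teacher's temperature-softened output distribution via KL divergence, blended with the usual cross-entropy on ground-truth labels. The temperature and weighting values here are illustrative assumptions, not settings taken from any of the surveyed papers.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Output layer (soft-label) distillation loss.

        Combines KL divergence between the teacher's and student's
        temperature-scaled output distributions with hard-label
        cross-entropy. `temperature` and `alpha` are illustrative
        hyperparameters, chosen here for the example only.
        """
        # Soften both output distributions with the temperature.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

        # The KL term is scaled by T^2 so its gradient magnitude stays
        # comparable to the hard-label term as the temperature varies.
        soft_loss = F.kl_div(log_soft_student, soft_teacher,
                             reduction="batchmean") * temperature ** 2

        # Standard supervised loss on the ground-truth labels.
        hard_loss = F.cross_entropy(student_logits, labels)

        return alpha * soft_loss + (1.0 - alpha) * hard_loss

    if __name__ == "__main__":
        # Toy batch: 4 examples over 3 classes. In practice the logits
        # would come from a large teacher (e.g. BERT) and a small student.
        teacher_logits = torch.randn(4, 3)
        student_logits = torch.randn(4, 3, requires_grad=True)
        labels = torch.tensor([0, 2, 1, 0])
        loss = distillation_loss(student_logits, teacher_logits, labels)
        loss.backward()  # gradients flow only to the student
        print(loss.item())

The teacher's logits are treated as fixed targets, so only the student receives gradients; feature layer and multi-teacher variants extend this same loss with additional terms over intermediate representations or multiple teacher distributions.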