Seeking Neural Nuggets: Knowledge Transfer in LLMs from a Parametric Perspective (ICLR 2024)

1University of Illinois Urbana-Champaign, 2The University of Hong Kong, 3Microsoft Corporation
Large Language Models (LLMs) inherently encode extensive knowledge within their parameters. Previous studies have demonstrated that this parametric knowledge can be detected (e.g., via cloze tests) or modified (e.g., through knowledge editing).

Research Question: Taking this further, can task-specific parametric knowledge be transferred across LLMs of different scales?

Absolutely! Our paper provides empirical evidence supporting the transferability of parametric knowledge.

Different Paradigms of Knowledge Transfer

  • Offline Distillation:
    • Using soft logits from the fine-tuned teacher model to guide the training of the student model (a standard loss of this form is sketched after this list).
  • Online Distillation:
    • Generating a distilled dataset that encapsulates the knowledge of the teacher model to fine-tune the student model.
  • Parametric Knowledge Transfer:
    • Extracting knowledge-specific parameters from the vanilla teacher model and injecting them into the student model.
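
For contrast with parametric transfer, here is a minimal sketch of the soft-logit loss used in offline distillation; the temperature and KL-divergence form are standard knowledge-distillation conventions rather than details taken from this paper.

```python
# Minimal soft-logit distillation loss (standard KD formulation, shown for contrast).
import torch.nn.functional as F

def soft_logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```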

Parametric Knowledge Transfer

  • Knowledge Extraction from Teacher Model:
    • Calculating a sensitivity score for each parameter using a small set of seed samples.
    • Since models of different scales typically differ in both the number of layers and their hidden dimensions, we perform layer selection and dimensionality reduction guided by the sensitivity scores.
  • Knowledge Injection with LoRA:
    • Using a LoRA module as the bridge for knowledge transfer.
    • Initializing the student model's LoRA modules with the parameters extracted from the teacher model (see the sketch after this list).
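
Below is a minimal sketch of the two steps, assuming (i) sensitivity is approximated by the first-order term |θ ⊙ ∇θL| accumulated over seed samples, and (ii) the extracted weight block is reduced to the student's shape and factored with a truncated SVD to initialize the LoRA matrices. The paper's exact dimensionality-reduction and initialization choices may differ, and all names below are illustrative.

```python
# Illustrative sketch only: sensitivity scoring, layer selection, and LoRA
# initialization from extracted teacher weights. Assumes a PyTorch teacher model;
# the function names are hypothetical, not the paper's released code.
import torch

def accumulate_sensitivity(scores, model):
    """Add |theta * grad| to running scores; call after loss.backward() on each seed batch."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is not None:
                contrib = (param.detach() * param.grad).abs()
                scores[name] = scores.get(name, 0.0) + contrib
    return scores

def select_layers(layer_scores, num_student_layers):
    """Keep the teacher layers with the highest aggregate sensitivity (assumed greedy top-k)."""
    ranked = sorted(layer_scores, key=layer_scores.get, reverse=True)
    return sorted(ranked[:num_student_layers])

def lora_factors_from_teacher(teacher_weight, student_out_dim, student_in_dim, rank=16):
    """Shrink a teacher weight block to the student's dimensions, then factor it
    with a truncated SVD into LoRA matrices B (out x rank) and A (rank x in)."""
    # Placeholder dimensionality reduction: keep the leading rows/columns.
    # (The paper reduces dimensions via sensitivity; this is a simplification.)
    block = teacher_weight[:student_out_dim, :student_in_dim].float()
    u, s, vh = torch.linalg.svd(block, full_matrices=False)
    lora_b = u[:, :rank] * s[:rank]   # (out_dim, rank)
    lora_a = vh[:rank, :]             # (rank, in_dim)
    return lora_a, lora_b             # lora_b @ lora_a approximates the extracted block
```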

Experimental Results

  • Student models augmented with parametric knowledge from their respective teacher models exhibit substantial improvements.

  • When parameters are extracted using seed samples from a specific task, the gains are largest on that same task.
  • This highlights that our sensitivity-based extraction targets task-specific knowledge rather than merely capturing generic knowledge.

Analysis

  • Layer Selection Methods:
    • While the layer selection strategy only modestly affects student performance, our sensitivity-based selection outperforms the alternative strategies.
  • Origin of Extracted Parameters:
    • The FFN modules transfer most effectively, suggesting that they store a substantial share of the teacher's knowledge.
    • Optimal results are obtained when transferring knowledge from all available modules.
  • Structure of Extracted Parameters:
    • Maintaining the teacher model’s parameter structure significantly benefits student model performance.

BibTeX

@article{zhong2023seeking,
      title={Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective},
      author={Zhong, Ming and An, Chenxin and Chen, Weizhu and Han, Jiawei and He, Pengcheng},
      journal={arXiv preprint arXiv:2310.11451},
      year={2023}
}