The following is the content of Section 3, "Low-Rank Adaptation," of the paper "LoRA: Low-Rank Adaptation of Large Language Models," together with an explanation:
3 Low-Rank Adaptation

We propose Low-Rank Adaptation (LoRA) to address the challenge of adapting large pre-trained language models to downstream tasks. Our approach is inspired by the observation that the change in a pre-trained language model's weights during adaptation to a new task often has a low intrinsic rank.
We freeze the pre-trained model weights and introduce trainable rank-decomposition matrices into each layer of the Transformer architecture. Specifically, for a dense layer W0 ∈ R^(d×k) in the Transformer, we assume that the change in the weights during adaptation, ΔW, can be approximated by a low-rank product AB, where A ∈ R^(d×r), B ∈ R^(r×k), and r ≪ min(d, k). The adapted weights are then W = W0 + ΔW = W0 + AB.
During training, we update only the matrices A and B, keeping W0 fixed. This significantly reduces the number of trainable parameters. For example, if d = k = 12,288 (as in GPT-3) and r = 8, the number of parameters in AB is (12,288 × 8) + (8 × 12,288) = 196,608, which is much smaller than the number of parameters in W0 (12,288 × 12,288 = 150,994,944).
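The parameter savings above follow directly from the shapes involved: the full weight has d·k entries, while the low-rank factors contribute only d·r + r·k. A quick sketch of the arithmetic (using the GPT-3 dimensions quoted in the text):

```python
# Parameter counting for a LoRA-adapted dense layer.
# d, k: dense layer dimensions; r: LoRA rank (values from the text).
d = k = 12_288
r = 8

lora_params = d * r + r * k   # trainable parameters in A and B
full_params = d * k           # frozen parameters in W0

print(lora_params)                  # 196608
print(full_params)                  # 150994944
print(full_params // lora_params)   # 768  (reduction factor when d == k: d / (2r))
```

Since d = k here, the reduction factor simplifies to d / (2r) = 12,288 / 16 = 768×.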
The simple linear design of LoRA allows us to merge the trainable matrices A and B with the frozen weights W0 at deployment time. This means that, unlike other methods such as adapters, LoRA does not introduce additional inference latency.
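Because the update is a plain matrix product, folding it into the frozen weight is a one-time addition. The following minimal sketch (toy sizes, illustrative variable names; not from any particular library) checks that the merged weight produces the same output as keeping the factors separate:

```python
import numpy as np

# Toy dimensions for illustration only.
d, k, r = 6, 4, 2
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, k))  # frozen pre-trained weight
A = rng.standard_normal((d, r))   # trained rank-decomposition factors
B = rng.standard_normal((r, k))

x = rng.standard_normal(d)        # an input activation

# With the factors kept separate (as during training):
y_unmerged = x @ W0 + (x @ A) @ B

# At deployment, fold AB into the dense weight once...
W = W0 + A @ B
# ...so inference is a single matmul, with no extra latency.
y_merged = x @ W

print(np.allclose(y_unmerged, y_merged))  # True
```

This is the property that distinguishes LoRA from adapter layers, which remain as extra modules in the forward pass at inference time.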
This section introduces the Low-Rank Adaptation (LoRA) method. It is motivated by the observation that the weight updates a pre-trained language model undergoes when adapting to a new task tend to have low rank.
Concretely, the pre-trained model weights are frozen, and trainable rank-decomposition matrices are injected into each layer of the Transformer architecture. For a dense layer W0 ∈ R^(d×k) in the Transformer, the weight change ΔW during adaptation is assumed to be well approximated by a low-rank product AB, where A ∈ R^(d×r), B ∈ R^(r×k), and r ≪ min(d, k); the adapted weight is then W = W0 + ΔW = W0 + AB.
During training, only the matrices A and B are updated while W0 stays fixed, which greatly reduces the number of trainable parameters. For example, in GPT-3 with d = k = 12,288 and r = 8, AB has 196,608 parameters, far fewer than the 150,994,944 in W0.
LoRA's simple linear design allows the trainable matrices A and B to be merged with the frozen weight W0 at deployment time, which means that, unlike adapter-based methods, it introduces no additional inference latency.