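To keep the arithmetic manageable, we make a few simplifying assumptions:
- The embedding dimension is $d = 3$, and $d_k = d_v = 2$.
- After embedding, the five tokens "我" (I), "爱" (love), "自然" (natural), "语言" (language), and "处理" (processing) map to the following vectors:
  - "我": $x_1 = [0.1, 0.2, 0.3]$
  - "爱": $x_2 = [0.4, 0.5, 0.6]$
  - "自然": $x_3 = [0.7, 0.8, 0.9]$
  - "语言": $x_4 = [1.0, 1.1, 1.2]$
  - "处理": $x_5 = [1.3, 1.4, 1.5]$
- The weight matrices $W_Q$, $W_K$, $W_V$ are randomly initialized; for this example they take the values:
  - $W_Q = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \end{bmatrix}$
  - $W_K = \begin{bmatrix} 0.7 & 0.8 \\ 0.9 & 1.0 \\ 1.1 & 1.2 \end{bmatrix}$
  - $W_V = \begin{bmatrix} 1.3 & 1.4 \\ 1.5 & 1.6 \\ 1.7 & 1.8 \end{bmatrix}$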
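The input matrix $X$ stacks the five embedding vectors as rows:

$$
X = \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
$$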
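Compute $Q$:

$$
Q = XW_Q =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
\begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \end{bmatrix}
=
\begin{bmatrix}
0.22 & 0.28 \\
0.49 & 0.64 \\
0.76 & 1.00 \\
1.03 & 1.36 \\
1.30 & 1.72
\end{bmatrix}
$$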
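Compute $K$:

$$
K = XW_K =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
\begin{bmatrix} 0.7 & 0.8 \\ 0.9 & 1.0 \\ 1.1 & 1.2 \end{bmatrix}
=
\begin{bmatrix}
0.58 & 0.64 \\
1.39 & 1.54 \\
2.20 & 2.44 \\
3.01 & 3.34 \\
3.82 & 4.24
\end{bmatrix}
$$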
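Compute $V$:

$$
V = XW_V =
\begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
\begin{bmatrix} 1.3 & 1.4 \\ 1.5 & 1.6 \\ 1.7 & 1.8 \end{bmatrix}
=
\begin{bmatrix}
0.94 & 1.00 \\
2.29 & 2.44 \\
3.64 & 3.88 \\
4.99 & 5.32 \\
6.34 & 6.76
\end{bmatrix}
$$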
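The three projections can be checked with a few lines of Python. This is a minimal sketch using only the standard library, so that every multiplication stays visible. The helper name `matmul` is an illustrative choice, not something from the original text, and the matrices are entered row by row as defined in the assumptions.

```python
# Token embeddings from the example, one row per token (5 x 3).
X = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]
# Projection matrices from the example (3 x 2 each).
W_Q = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_K = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]
W_V = [[1.3, 1.4], [1.5, 1.6], [1.7, 1.8]]


def matmul(a, b):
    """Multiply two matrices given as lists of rows (hypothetical helper)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


Q = matmul(X, W_Q)  # 5 x 2
K = matmul(X, W_K)  # 5 x 2
V = matmul(X, W_V)  # 5 x 2

for name, m in (("Q", Q), ("K", K), ("V", V)):
    print(name, [[round(v, 2) for v in row] for row in m])
```

Each result has one row per token and $d_k = d_v = 2$ columns, which is the shape the later steps rely on.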
The scaled score matrix is $A = \dfrac{QK^T}{\sqrt{d_k}}$. Here $d_k = 2$, so $\sqrt{d_k} = \sqrt{2} \approx 1.414$.
QKT=0.220.490.761.031.30.280.620.961.31.66[0.580.681.331.542.082.42.833.263.584.18]=0.29920.6791.0761.4591.8340.6761.49312.38243.23454.08941.03362.30123.70885.01646.40481.4363.11065.03566.7998.72141.84443.926.36328.581811.0372
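$$
A = \frac{QK^T}{\sqrt{2}} \approx
\begin{bmatrix}
0.2169 & 0.5211 & 0.8253 & 1.1295 & 1.4337 \\
0.4906 & 1.1785 & 1.8665 & 2.5544 & 3.2424 \\
0.7642 & 1.8359 & 2.9076 & 3.9793 & 5.0510 \\
1.0379 & 2.4933 & 3.9488 & 5.4042 & 6.8596 \\
1.3115 & 3.1507 & 4.9899 & 6.8291 & 8.6683
\end{bmatrix}
$$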
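The scaling step is easy to get wrong by hand, so here is a short sketch of it. It assumes `Q` and `K` hold the products $XW_Q$ and $XW_K$ for this example's inputs, typed in as literals so the block runs on its own. Entry $(i, j)$ of `A` is the dot product of query row $i$ with key row $j$, divided by $\sqrt{d_k}$.

```python
import math

# Q and K are the products X W_Q and X W_K for this example's inputs (5 x 2 each).
Q = [[0.22, 0.28], [0.49, 0.64], [0.76, 1.00], [1.03, 1.36], [1.30, 1.72]]
K = [[0.58, 0.64], [1.39, 1.54], [2.20, 2.44], [3.01, 3.34], [3.82, 4.24]]
d_k = len(K[0])  # key dimension, 2 in this example

# Iterating over the rows of K is equivalent to multiplying by K^T.
A = [
    [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
    for q_row in Q
]

for row in A:
    print([round(a, 4) for a in row])
```

The result is a 5 x 5 matrix: one row per query token and one column per key token.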
Apply softmax to each row of $A$. The softmax function is defined as $\mathrm{softmax}(x_i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$.
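Taking the first row as an example:

$$
e^{0.2169} \approx 1.242,\quad
e^{0.5211} \approx 1.684,\quad
e^{0.8253} \approx 2.283,\quad
e^{1.1295} \approx 3.094,\quad
e^{1.4337} \approx 4.194
$$

The denominator is $\sum_{j=1}^{5} e^{x_j} \approx 1.242 + 1.684 + 2.283 + 3.094 + 4.194 = 12.497$.

After softmax, the first row becomes:

$$
\frac{1.242}{12.497} \approx 0.099,\quad
\frac{1.684}{12.497} \approx 0.135,\quad
\frac{2.283}{12.497} \approx 0.183,\quad
\frac{3.094}{12.497} \approx 0.248,\quad
\frac{4.194}{12.497} \approx 0.336
$$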
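The first-row calculation can be reproduced with a small softmax helper. This is a sketch, not the only way to write it. The function name `softmax` and the max-subtraction trick are additions here: subtracting the row maximum before exponentiating leaves the result unchanged but avoids overflow for large scores. The input is the first row of $A$ rounded to four decimals.

```python
import math


def softmax(xs):
    """Softmax of a list of scores, shifted by the max for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the scaled score matrix A from this example (rounded to 4 decimals).
row = [0.2169, 0.5211, 0.8253, 1.1295, 1.4337]
weights = softmax(row)
print([round(w, 3) for w in weights])  # prints [0.099, 0.135, 0.183, 0.248, 0.336]
```

The rounded weights add up to 1.001 rather than exactly 1 because each entry is rounded independently; the unrounded weights sum to 1.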
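The remaining rows are computed the same way, giving the attention weight matrix $W$ (each row sums to 1 up to rounding):

$$
W \approx
\begin{bmatrix}
0.099 & 0.135 & 0.183 & 0.248 & 0.336 \\
0.033 & 0.065 & 0.130 & 0.258 & 0.514 \\
0.009 & 0.027 & 0.077 & 0.226 & 0.661 \\
0.002 & 0.010 & 0.042 & 0.179 & 0.767 \\
0.001 & 0.003 & 0.021 & 0.134 & 0.841
\end{bmatrix}
$$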
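Finally, the output is $O = WV$:

$$
O = WV \approx
\begin{bmatrix}
0.099 & 0.135 & 0.183 & 0.248 & 0.336 \\
0.033 & 0.065 & 0.130 & 0.258 & 0.514 \\
0.009 & 0.027 & 0.077 & 0.226 & 0.661 \\
0.002 & 0.010 & 0.042 & 0.179 & 0.767 \\
0.001 & 0.003 & 0.021 & 0.134 & 0.841
\end{bmatrix}
\begin{bmatrix}
0.94 & 1.00 \\
2.29 & 2.44 \\
3.64 & 3.88 \\
4.99 & 5.32 \\
6.34 & 6.76
\end{bmatrix}
\approx
\begin{bmatrix}
4.430 & 4.723 \\
5.200 & 5.543 \\
5.669 & 6.044 \\
5.934 & 6.327 \\
6.086 & 6.489
\end{bmatrix}
$$

The entries of $O$ are computed from the unrounded weights, so multiplying the three-decimal $W$ shown here reproduces them only approximately. Each output row is a weighted average of the rows of $V$, which is why every entry lies between the smallest and largest value in its column of $V$.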
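As a cross-check on the whole walkthrough, the sketch below chains the steps: projections, scaled scores, row-wise softmax, and the weighted sum over $V$. It uses only the standard library. The helper names `matmul`, `softmax`, and `attention` are illustrative choices rather than anything from the original text, and the code covers a single attention head with no masking.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Row softmax, shifted by the max for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(X, W_Q, W_K, W_V):
    """Return (weights, output) for single-head scaled dot-product attention."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]  # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return weights, matmul(weights, V)


# Inputs from the example: 5 tokens, d = 3, d_k = d_v = 2.
X = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]
W_Q = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_K = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]
W_V = [[1.3, 1.4], [1.5, 1.6], [1.7, 1.8]]

weights, O = attention(X, W_Q, W_K, W_V)
for row in O:
    print([round(o, 3) for o in row])
```

Because the function works on full-precision weights, its output can differ in the last digit from a hand calculation that rounds intermediate results.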
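The calculation above is meant only to illustrate the basic flow of the QKV computation in a Transformer; in practice, the dimensions and parameter counts are far larger.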