1. 通过调整网络的初始化,优化网络的训练过程,使训练更有效(tanh饱和会封锁梯度;初始化不好会导致很多训练迭代被浪费在压缩权重上)

网络权重初始化

fix softmax confidently wrong

初始参数:

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 10), generator=g)
W1 = torch.randn((30, 200), generator=g)
b1 = torch.randn(200, generator=g)
W2 = torch.randn((200, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

训练过程:

for i in range(100):

  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (32,))

  # forward pass
  emb = C[Xtr[ix]] # (32, 3, 10) 32X3是Xtr[ix]的维度
  h = torch.tanh(emb.view(-1, 30) @ W1 + b1) # (32, 200)
  logits = h @ W2 + b2 # (32, 27)
  loss = F.cross_entropy(logits, Ytr[ix]) # Ytr[ix]是长度为32的向量,每一项说明了这条数据的groundtruth是哪个字符
  print(loss.item())

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  lr = lrs[i]
  # lr = 0.1 if i < 100000 else 0.01
  for p in parameters:
    p.data += -lr * p.grad

  # track stats
  lri.append(lre[i])
  # stepi.append(i)
  lossi.append(loss.log10().item())

查看第一轮的loss:

(图: 第一轮训练输出的loss值)

注意这里的loss计算:logits相当于对计数取log之后的结果,cross entropy等价于先做softmax再取-log求平均(softmax先对logits取e^x指数,得到相当于计数的值,再归一化得到概率;最后取出每条数据groundtruth对应的概率,求-log的平均就是loss)
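下面是一个最小的示意代码(logits和targets用随机数代替,仅作演示),验证F.cross_entropy与"先softmax再取-log平均"等价:

import torch
import torch.nn.functional as F

logits = torch.randn(32, 27)          # 相当于对计数取log之后的值
targets = torch.randint(0, 27, (32,)) # 每条数据的groundtruth索引

counts = logits.exp()                          # e^x,相当于计数
probs = counts / counts.sum(1, keepdim=True)   # softmax,得到概率
loss_manual = -probs[torch.arange(32), targets].log().mean()
loss_builtin = F.cross_entropy(logits, targets)
print(loss_manual.item(), loss_builtin.item()) # 两者数值一致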

根据问题和模型的设定,我们期待模型在最开始认为所有的下一个字符出现的概率是相等的,也就是均匀分布,由此我们可以先估计均匀分布下的loss

-torch.tensor(1/27.0).log()

结果是tensor(3.2958),和上面第一轮实际得到的loss相比差距还是很大的。思考cross entropy计算loss的过程:对于每条数据,如果groundtruth y对应项的logits值远大于其他项,经过softmax和-log之后得到的loss就接近0;反之,如果其他项远大于y对应的项,loss就会很大。为了让初始输出接近均匀分布,我们希望每条数据的logits各项都几乎相等,这样softmax之后各字符的概率就大致相同了
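可以用一个小实验验证(沿用上面的import,targets用随机数代替,仅作示意):当每条数据的logits各项完全相等时,cross entropy恰好等于均匀分布下的loss:

logits = torch.zeros(32, 27)            # 每条数据的logits各项完全相等
targets = torch.randint(0, 27, (32,))
print(F.cross_entropy(logits, targets)) # tensor(3.2958),正好等于 -log(1/27)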

logits = h @ W2 + b2 # (32, 27)

直观地,为了让每条数据的logits各项都差不多,可以把W2初始化得足够小,b2直接初始化为0,让logits的各项都逼近0

W2 = torch.randn((200, 27), generator=g) * 0.01
b2 = torch.randn(27, generator=g) * 0

重新运行训练过程:

(图: 调整初始化后重新训练的loss输出)

可以看到loss曲线不再呈现曲棍球棒(hockey stick)的形状。在没有优化初始化之前,前期loss的迅速下降主要归功于对logits的压缩:网络花了很多步去调整参数,只是为了把logits压小。现在这些相对容易的优化步骤被真正有用的优化取代了(我们把更多时间花在优化神经网络本身上,而不是在前几千次迭代里压缩权重)

(图: loss随迭代步数变化的曲线)

通过调整初始化,我们能得到比不调整时更小的loss

(图: 调整初始化前后的最终loss对比)

fix tanh layer too saturated at init

查看训练过程中h的值

for i in range(20000):
  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (32,))

  # forward pass
  emb = C[Xtr[ix]] # (32, 3, 10)
  hpreact = emb.view(-1, 30) @ W1 + b1
  h = torch.tanh(hpreact) # (32, 200)
  logits = h @ W2 + b2 # (32, 27)
  loss = F.cross_entropy(logits, Ytr[ix])

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  for p in parameters:
    p.data += -0.01 * p.grad

  # track stats
  lossi.append(loss.log10().item())
  break # 这里只跑一次迭代,方便查看第一个batch的h取值

print(loss.item())
print(h)

(图: h的取值输出)

可以看到h的值大部分都在-1或者1上,绘制成柱状图:

(图: h值分布柱状图)

tanh的输出大量集中在-1和1上,说明输入tanh的hpreact取值远远超出了[-1,1]这个区间,把hpreact也画出来:

(图: hpreact值分布柱状图)

可以看到hpreact的分布确实很广,这导致h的值大部分落在-1和1上。对于y = tanh(x),反向传播时grad x = (1 - tanh(x)^2) * grad y,当tanh(x)接近1或-1时这个系数接近0,grad y几乎无法传递到x,也就无法继续反向传播到更前面的参数,导致很多步优化都无法有效地改变参数。直观的解决办法是让hpreact的取值变小,也就是把W1、b1的初始化调小。
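先用一个示意性的小实验(输入的数值是随意选的)感受一下tanh饱和时梯度几乎为0:

x = torch.tensor([0.1, 3.0, 15.0], requires_grad=True)
y = torch.tanh(x)
y.sum().backward()
print(y)      # 后两项非常接近1,说明已经饱和
print(x.grad) # 梯度等于1 - tanh(x)^2,饱和处几乎为0

回到网络本身,把W1和b1的初始化调小: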

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 10), generator=g)
W1 = torch.randn((30, 200), generator=g)* 0.1
b1 = torch.randn(200, generator=g) * 0.01
W2 = torch.randn((200, 27), generator=g) * 0.01
b2 = torch.randn(27, generator=g) * 0
parameters = [C, W1, b1, W2, b2]

修改后的h和hpreact:

(图: 修改初始化后的h分布)

(图: 修改初始化后的hpreact分布)

可以看到经过合理的初始化,h的分布合理多了,不再大量饱和。用新的初始化重新训练,loss进一步降低

(图: 重新训练后的loss)

loss log

每完成一步初始化方面的优化之后,训练得到的loss记录如下

(图: 各优化步骤对应的loss记录)

半原则性的参数确定

在之前,我们知道要把W和b调小,但并不确定具体要调到多少,比如参数W1到底该乘以多大的因子。一个原则是:如果输入服从标准高斯分布(均值为0,方差为1),我们希望经过这一层之后,输出也大致保持标准高斯分布

x = torch.randn(1000,10)
w = torch.randn(10,200)
y = x @ w
print(x.mean(), x.std())
print(y.mean(), y.std())
plt.figure(figsize=(20,5))
plt.subplot(121)
plt.hist(x.view(-1).tolist(),50,density=True)
plt.subplot(122)
plt.hist(y.view(-1).tolist(),50,density=True)

对于上面的代码,结果如下:

tensor(0.0048) tensor(0.9914)
tensor(0.0011) tensor(3.1403)

(图: x与y的分布直方图)

在进行矩阵乘法后,输出的标准差变成了3左右。为了让标准差保持为1,可以给w乘上一个因子:半原则性的做法是除以根号fan_in,也就是这里的根号10
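补充一步推导(假设x的各分量独立且方差为1,w与x独立):y_j = Σ_i x_i * w_ij,于是 Var(y_j) = fan_in * Var(w),即 std(y_j) = sqrt(fan_in) * std(w)。要让 std(y_j) = 1,就需要把w的标准差缩小sqrt(fan_in)倍,也就是除以 sqrt(10) ≈ 3.16,正好对应上面观察到的标准差从1变成3左右。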

x = torch.randn(1000,10)
w = torch.randn(10,200) / (10**0.5) # 这里的fan_in就是输入维度,即10
y = x @ w
print(x.mean(), x.std())
print(y.mean(), y.std())
plt.figure(figsize=(20,5))
plt.subplot(121)
plt.hist(x.view(-1).tolist(),50,density=True)
plt.subplot(122)
plt.hist(y.view(-1).tolist(),50,density=True)

效果明显:

tensor(-0.0029) tensor(1.0011)
tensor(-0.0009) tensor(0.9944)

实际上在早期神经网络的研究中,已经有很多人研究过该如何初始化这些参数、该乘以什么样的因子,比如PyTorch中的kaiming_normal_就给出了这类半原则性的建议,本质上都是为了让各层输出的标准差保持稳定
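下面是一个示意性的用法(注意nn.Linear的权重形状是(fan_out, fan_in),和我们手写的W1正好是转置关系):

w = torch.empty(200, 30) # (fan_out, fan_in)
torch.nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='tanh')
print(torch.nn.init.calculate_gain('tanh')) # 5/3 ≈ 1.6667
print(w.std()) # 约等于 gain / sqrt(fan_in) = (5/3) / sqrt(30) ≈ 0.30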

(图: PyTorch kaiming_normal_ 的文档截图)

按照论文中的建议:

(图: 论文中关于各激活函数gain取值的建议)

对于tanh,gain的值是5/3,所以回到我们的mlp中,w1的初始化:

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_embd), generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3)/((n_embd * block_size)**0.5) #* 0.2
#b1 = torch.randn(n_hidden, generator=g) * 0.01
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.randn(vocab_size, generator=g) * 0

现代创新 make life easier

批量归一化 batch normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

思路是我们希望hpreact大致呈标准高斯分布(如果hpreact太接近0,tanh近似于恒等映射,起不到非线性的作用;如果hpreact太大,tanh饱和,梯度无法通过),既然如此,为什么不直接把它归一化成高斯分布呢?

(图: Batch Normalization论文中的归一化公式)

虽然只靠减均值、除以标准差就能把hpreact归一化成标准高斯分布,但这里还额外添加了参数bngain和bnbias,让网络在训练过程中可以根据需要对归一化后的结果进行缩放和平移

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 10), generator=g)
W1 = torch.randn((30, 200), generator=g)* 0.09
b1 = torch.randn(200, generator=g) * 0.01
W2 = torch.randn((200, 27), generator=g) * 0.01
b2 = torch.randn(27, generator=g) * 0
bngain = torch.ones((1,200)) # 初始化的时候保持标准高斯分布
bnbias = torch.zeros((1,200))
parameters = [C, W1, b1, W2, b2, bngain, bnbias]

公式的实现

for i in range(20000):

  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (32,))

  # forward pass
  emb = C[Xtr[ix]] # (32, 3, 10)
  hpreact = emb.view(-1, 30) @ W1 + b1
  hpreact = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias
  h = torch.tanh(hpreact) # (32, 200)
  logits = h @ W2 + b2 # (32, 27)
  loss = F.cross_entropy(logits, Ytr[ix])

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  for p in parameters:
    p.data += -0.01 * p.grad

  # track stats
  lossi.append(loss.log10().item())

print(loss.item())

这样的batch normalization层通常加在线性层和卷积层后,让训练更加稳定

实际上batch normalization还带来了一个奇怪的副作用:因为均值和标准差是按batch统计的,每个样本的输出还会受到同一个batch里其他样本的影响,相当于给数据引入了一些抖动。虽然听上去不太合理,但这种抖动实际上起到了正则化的效果,让模型更难对某个样本过拟合。
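下面是一个示意性的小例子(张量取随机值),直观展示同一个样本在不同batch里归一化后的结果不同:

x1 = torch.randn(4, 5)                       # 一个batch
x2 = torch.cat([x1[:1], torch.randn(3, 5)])  # 第一个样本相同,其余样本换掉
print(((x1 - x1.mean(0)) / x1.std(0))[0])
print(((x2 - x2.mean(0)) / x2.std(0))[0])    # 同一个样本,归一化后的输出却不同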

但现在的实现在测试时会带来一些问题。如果照搬上面的batch normalization,测试代码是这样的:

@torch.no_grad() # this decorator disables gradient tracking
def split_loss(split):
  x,y = {
    'train': (Xtr, Ytr),
    'val': (Xdev, Ydev),
    'test': (Xte, Yte),
  }[split]
  emb = C[x] # (N, block_size, n_embd)
  embcat = emb.view(emb.shape[0], -1) # concat into (N, block_size * n_embd)
  hpreact = embcat @ W1 # + b1
  hpreact = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias
  h = torch.tanh(hpreact) # (N, n_hidden)
  logits = h @ W2 + b2 # (N, vocab_size)
  loss = F.cross_entropy(logits, y)
  print(split, loss.item())

split_loss('train')
split_loss('val')

但是如果只想测试一条数据呢?这时按batch统计均值和标准差就没有意义了。论文给出的方法是在模型训练完成后额外做一步校准:用整个训练集算出的均值和标准差,来固定batch normalization层的均值和标准差

# calibrate the batch norm at the end of training

with torch.no_grad():
  # pass the training set through
  emb = C[Xtr]
  embcat = emb.view(emb.shape[0], -1)
  hpreact = embcat @ W1 # + b1
  # measure the mean/std over the entire training set
  bnmean = hpreact.mean(0, keepdim=True)
  bnstd = hpreact.std(0, keepdim=True)

@torch.no_grad() # this decorator disables gradient tracking
def split_loss(split):
  x,y = {
    'train': (Xtr, Ytr),
    'val': (Xdev, Ydev),
    'test': (Xte, Yte),
  }[split]
  emb = C[x] # (N, block_size, n_embd)
  embcat = emb.view(emb.shape[0], -1) # concat into (N, block_size * n_embd)
  hpreact = embcat @ W1 # + b1
  hpreact = bngain * (hpreact - bnmean) / bnstd + bnbias
  h = torch.tanh(hpreact) # (N, n_hidden)
  logits = h @ W2 + b2 # (N, vocab_size)
  loss = F.cross_entropy(logits, y)
  print(split, loss.item())

split_loss('train')
split_loss('val')

但这种分两步的做法比较麻烦,显式校准需要在训练之后额外再跑一遍训练集,因此另一种方式是在训练的过程中就顺带估计mean和std

为此添加了bnmean_running和bnstd_running两个变量来估计最终的mean和std,它们的更新采用指数滑动平均(平滑)的方式,而不是通过训练时的反向传播

# MLP revisited
n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 200 # the number of neurons in the hidden layer of the MLP

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_embd), generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3)/((n_embd * block_size)**0.5) #* 0.2
#b1 = torch.randn(n_hidden, generator=g) * 0.01
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.randn(vocab_size, generator=g) * 0

# BatchNorm parameters
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))

parameters = [C, W1, W2, b2, bngain, bnbias]
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
  p.requires_grad = True

在训练过程中平滑地更新bnmean_running和bnstd_running。下面的代码还对模型做了简化,删掉了b1这个参数:b1带来的偏置实际上是无用的,它在归一化减去均值时被完全抵消了,而它的功能由bnbias承担

实际上如果batch norm层前面有线性层或者卷积层,这些层的bias都是无效的
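可以用一个示意性的小实验验证这一点(x和b都是随机张量):

x = torch.randn(32, 200)
b = torch.randn(200)
n1 = (x - x.mean(0)) / x.std(0)
n2 = ((x + b) - (x + b).mean(0)) / (x + b).std(0)
print(torch.allclose(n1, n2, atol=1e-5)) # True: bias在减均值时被完全抵消

下面是对应的训练循环: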

# same optimization as last time
max_steps = 200000
batch_size = 32
lossi = []

for i in range(max_steps):

  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y

  # forward pass
  emb = C[Xb] # embed the characters into vectors
  embcat = emb.view(emb.shape[0], -1) # concatenate the vectors
  # Linear layer
  hpreact = embcat @ W1 #+ b1 # hidden layer pre-activation
  # BatchNorm layer
  # -------------------------------------------------------------
  bnmeani = hpreact.mean(0, keepdim=True)
  bnstdi = hpreact.std(0, keepdim=True)
  hpreact = bngain * (hpreact - bnmeani) / bnstdi + bnbias
  with torch.no_grad():
    bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
    bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
  # -------------------------------------------------------------
  # Non-linearity
  h = torch.tanh(hpreact) # hidden layer
  logits = h @ W2 + b2 # output layer
  loss = F.cross_entropy(logits, Yb) # loss function

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  lr = 0.1 if i < 100000 else 0.01 # step learning rate decay
  for p in parameters:
    p.data += -lr * p.grad

  # track stats
  if i % 10000 == 0: # print every once in a while
    print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
  lossi.append(loss.log10().item())

注意网络的模式,线性层/卷积层 + 归一化层 + 激活函数,三者作为一块可以不断堆叠形成更加深的网络
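如果直接用PyTorch自带的模块,这个"线性层 + 归一化层 + 激活函数"的堆叠模式大致可以写成下面的样子(仅作示意,层的宽度沿用前面的设定):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(30, 200, bias=False), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 200, bias=False), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 27),
)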

(视频 1:18:39)

整个代码

下面不仅重新给出了上面的完整实现,还把代码组织得更贴近PyTorch的API风格

一些预处理工作,读取数据集

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
# read in all the words
words = open('names.txt', 'r').read().splitlines()
words[:8]
['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

获取stoi和itos字典,做映射

# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
vocab_size = len(itos)
print(itos)
print(vocab_size)
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
27

构建数据集,划分数据集

# build the dataset
block_size = 3 # context length: how many characters do we take to predict the next one?

def build_dataset(words):
  X, Y = [], []

  for w in words:
    context = [0] * block_size
    for ch in w + '.':
      ix = stoi[ch]
      X.append(context)
      Y.append(ix)
      context = context[1:] + [ix] # crop and append

  X = torch.tensor(X)
  Y = torch.tensor(Y)
  print(X.shape, Y.shape)
  return X, Y

import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))

Xtr, Ytr = build_dataset(words[:n1]) # 80%
Xdev, Ydev = build_dataset(words[n1:n2]) # 10%
Xte, Yte = build_dataset(words[n2:]) # 10%

torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])

按照之前的写法构建的完整代码

# MLP revisited
n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 200 # the number of neurons in the hidden layer of the MLP

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_embd), generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3)/((n_embd * block_size)**0.5) #* 0.2
#b1 = torch.randn(n_hidden, generator=g) * 0.01
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.randn(vocab_size, generator=g) * 0

# BatchNorm parameters
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))

parameters = [C, W1, W2, b2, bngain, bnbias]
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
  p.requires_grad = True
12097

训练的过程

# same optimization as last time
max_steps = 200000
batch_size = 32
lossi = []

for i in range(max_steps):

  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y

  # forward pass
  emb = C[Xb] # embed the characters into vectors
  embcat = emb.view(emb.shape[0], -1) # concatenate the vectors
  # Linear layer
  hpreact = embcat @ W1 #+ b1 # hidden layer pre-activation
  # BatchNorm layer
  # -------------------------------------------------------------
  bnmeani = hpreact.mean(0, keepdim=True)
  bnstdi = hpreact.std(0, keepdim=True)
  hpreact = bngain * (hpreact - bnmeani) / bnstdi + bnbias
  with torch.no_grad():
    bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
    bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
  # -------------------------------------------------------------
  # Non-linearity
  h = torch.tanh(hpreact) # hidden layer
  logits = h @ W2 + b2 # output layer
  loss = F.cross_entropy(logits, Yb) # loss function

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  lr = 0.1 if i < 100000 else 0.01 # step learning rate decay
  for p in parameters:
    p.data += -lr * p.grad

  # track stats
  if i % 10000 == 0: # print every once in a while
    print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
  lossi.append(loss.log10().item())

      0/ 200000: 3.3239
  10000/ 200000: 2.0322
  20000/ 200000: 2.5675
  30000/ 200000: 2.0125
  40000/ 200000: 2.2446
  50000/ 200000: 1.8897
  60000/ 200000: 2.0785
  70000/ 200000: 2.3681
  80000/ 200000: 2.2918
  90000/ 200000: 2.0238
 100000/ 200000: 2.3673
 110000/ 200000: 2.3132
 120000/ 200000: 1.6414
 130000/ 200000: 1.9311
 140000/ 200000: 2.2231
 150000/ 200000: 2.0027
 160000/ 200000: 2.0997
 170000/ 200000: 2.4949
 180000/ 200000: 2.0199
 190000/ 200000: 2.1707

loss的下降过程比较缓慢、曲线也比较嘈杂,但进行的是有效的优化

plt.plot(lossi)

(图: loss变化曲线)

下面是之前保留下来的两步式校准代码(先训练,再用训练集计算mean和std,从而支持单个样本的推理)

# calibrate the batch norm at the end of training

with torch.no_grad():
  # pass the training set through
  emb = C[Xtr]
  embcat = emb.view(emb.shape[0], -1)
  hpreact = embcat @ W1 # + b1
  # measure the mean/std over the entire training set
  bnmean = hpreact.mean(0, keepdim=True)
  bnstd = hpreact.std(0, keepdim=True)

下面是测试部分

@torch.no_grad() # this decorator disables gradient tracking
def split_loss(split):
  x,y = {
    'train': (Xtr, Ytr),
    'val': (Xdev, Ydev),
    'test': (Xte, Yte),
  }[split]
  emb = C[x] # (N, block_size, n_embd)
  embcat = emb.view(emb.shape[0], -1) # concat into (N, block_size * n_embd)
  hpreact = embcat @ W1 # + b1
  #hpreact = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias
  hpreact = bngain * (hpreact - bnmean_running) / bnstd_running + bnbias
  h = torch.tanh(hpreact) # (N, n_hidden)
  logits = h @ W2 + b2 # (N, vocab_size)
  loss = F.cross_entropy(logits, y)
  print(split, loss.item())

split_loss('train')
split_loss('val')
train 2.0674145221710205
val 2.1056840419769287


loss log

original:

train 2.1245384216308594
val 2.168196439743042

fix softmax confidently wrong:

train 2.07
val 2.13

fix tanh layer too saturated at init:

train 2.0355966091156006
val 2.1026785373687744

use semi-principled “kaiming init” instead of hacky init:

train 2.0376641750335693
val 2.106989622116089

add batch norm layer

train 2.0668270587921143
val 2.104844808578491

batch norm 并不比上面的更优,但以一种简单的方式达到了稳定训练的效果

下面是pytorch风格的代码,对网络进行了封装

(图: Batch Normalization论文中的归一化公式)

# SUMMARY + PYTORCHIFYING -----------
# Let's train a deeper network
# The classes we create here are the same API as nn.Module in PyTorch

class Linear:

  def __init__(self, fan_in, fan_out, bias=True):
    self.weight = torch.randn((fan_in, fan_out), generator=g) / fan_in**0.5 # 还是按 1/sqrt(fan_in) 进行了初始化
    self.bias = torch.zeros(fan_out) if bias else None

  def __call__(self, x):
    self.out = x @ self.weight
    if self.bias is not None:
      self.out += self.bias
    return self.out

  def parameters(self):
    return [self.weight] + ([] if self.bias is None else [self.bias])


class BatchNorm1d:

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps # 这个eps在之前的实现中没出现,是为了防止方差恰好为0时除零出错,对应论文公式中的epsilon
    self.momentum = momentum # 平滑更新running统计量时的系数,对应之前手写实现里的0.001
    self.training = True # 区分训练状态和测试状态
    # parameters (trained with backprop)
    self.gamma = torch.ones(dim) # bngain
    self.beta = torch.zeros(dim) # bnbias
    # buffers (trained with a running 'momentum update')
    self.running_mean = torch.zeros(dim)
    self.running_var = torch.ones(dim)

  def __call__(self, x):
    # calculate the forward pass
    if self.training: # train阶段的mean和var按当前batch计算
      xmean = x.mean(0, keepdim=True) # batch mean
      xvar = x.var(0, keepdim=True) # batch variance
    else: # eval阶段用训练中估计出的running mean和var
      xmean = self.running_mean
      xvar = self.running_var
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    # update the buffers
    if self.training:
      with torch.no_grad():
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]


class Tanh:
  def __call__(self, x):
    self.out = torch.tanh(x)
    return self.out
  def parameters(self):
    return []

n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 100 # the number of neurons in the hidden layer of the MLP
g = torch.Generator().manual_seed(2147483647) # for reproducibility

C = torch.randn((vocab_size, n_embd), generator=g)
layers = [
  Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]
# layers = [
#   Linear(n_embd * block_size, n_hidden), Tanh(),
#   Linear(n_hidden, n_hidden), Tanh(),
#   Linear(n_hidden, n_hidden), Tanh(),
#   Linear(n_hidden, n_hidden), Tanh(),
#   Linear(n_hidden, n_hidden), Tanh(),
#   Linear(n_hidden, vocab_size),
# ]

# 下面对参数进行了进一步的调整
with torch.no_grad():
  # last layer: make less confident
  layers[-1].gamma *= 0.1
  #layers[-1].weight *= 0.1
  # all other layers: apply gain
  for layer in layers[:-1]:
    if isinstance(layer, Linear):
      layer.weight *= 1.0 #5/3

parameters = [C] + [p for layer in layers for p in layer.parameters()]
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
  p.requires_grad = True
47024

下面是训练过程,采用随机梯度下降:先前向传播计算loss,再清空grad,反向传播计算grad,最后更新参数

# same optimization as last time
max_steps = 200000
batch_size = 32
lossi = []
ud = []

for i in range(max_steps):

  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb, Yb = Xtr[ix], Ytr[ix] # batch X,Y

  # forward pass
  emb = C[Xb] # embed the characters into vectors
  x = emb.view(emb.shape[0], -1) # concatenate the vectors
  for layer in layers:
    x = layer(x)
  loss = F.cross_entropy(x, Yb) # loss function

  # backward pass
  for layer in layers: #强制PyTorch保留神经网络各层输出的梯度信息，即使这些输出是中间计算过程（非叶子节点）。为了后面的可视化
    layer.out.retain_grad() # AFTER_DEBUG: would take out retain_graph
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  lr = 0.1 if i < 150000 else 0.01 # step learning rate decay
  for p in parameters:
    p.data += -lr * p.grad

  # track stats
  if i % 10000 == 0: # print every once in a while
    print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
  lossi.append(loss.log10().item())
  with torch.no_grad():
    ud.append([((lr*p.grad).std() / p.data.std()).log10().item() for p in parameters])

  if i >= 1000:
    break # AFTER_DEBUG: would take out obviously to run full optimization
      0/ 200000: 3.2870

下面通过一些可视化手段来诊断训练过程

# visualize histograms
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
  if isinstance(layer, Tanh): # 这里只对tanh层进行可视化,因为其输出范围固定在(-1,1),比较方便可视化
    t = layer.out
    print('layer %d (%10s): mean %+.2f, std %.2f, saturated: %.2f%%' % (i, layer.__class__.__name__, t.mean(), t.std(), (t.abs() > 0.97).float().mean()*100)) # 输出的均值,标准差,饱和度(绝对值超过0.97的占比)
    hy, hx = torch.histogram(t, density=True) # density=True按概率密度计算: 频数/(总量*区间长度),与区间长度有关,所以可能大于1
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'layer {i} ({layer.__class__.__name__})')
plt.legend(legends);
plt.title('activation distribution')

下面是输出结果,可以看到mean大致为0,std比较稳定,saturated的比例很小,说明tanh层的状态比较健康

layer 2 (      Tanh): mean -0.00, std 0.63, saturated: 2.78%
layer 5 (      Tanh): mean +0.00, std 0.64, saturated: 2.56%
layer 8 (      Tanh): mean -0.00, std 0.65, saturated: 2.25%
layer 11 (      Tanh): mean +0.00, std 0.65, saturated: 1.69%
layer 14 (      Tanh): mean +0.00, std 0.65, saturated: 1.88%

(图: 各Tanh层激活值分布直方图)
上面是对value进行了可视化,下面对grad进行可视化

# visualize histograms
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
  if isinstance(layer, Tanh):
    t = layer.out.grad
    print('layer %d (%10s): mean %+f, std %e' % (i, layer.__class__.__name__, t.mean(), t.std()))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'layer {i} ({layer.__class__.__name__})')
plt.legend(legends);
plt.title('gradient distribution')
layer 2 (      Tanh): mean -0.000000, std 2.640702e-03
layer 5 (      Tanh): mean +0.000000, std 2.245584e-03
layer 8 (      Tanh): mean -0.000000, std 2.045742e-03
layer 11 (      Tanh): mean +0.000000, std 1.983134e-03
layer 14 (      Tanh): mean -0.000000, std 1.952382e-03

(图: 各Tanh层梯度分布直方图)
下面对各权重参数的梯度分布做了可视化,同时打印了梯度与参数值之比(grad:data ratio),不过单看这个比值的意义并不是很大

# visualize histograms
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i,p in enumerate(parameters):
  t = p.grad
  if p.ndim == 2:
    print('weight %10s | mean %+f | std %e | grad:data ratio %e' % (tuple(p.shape), t.mean(), t.std(), t.std() / p.std()))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'{i} {tuple(p.shape)}')
plt.legend(legends)
plt.title('weights gradient distribution');
weight   (27, 10) | mean +0.000000 | std 8.020534e-03 | grad:data ratio 8.012630e-03
weight  (30, 100) | mean +0.000246 | std 9.241077e-03 | grad:data ratio 4.881091e-02
weight (100, 100) | mean +0.000113 | std 7.132879e-03 | grad:data ratio 6.964619e-02
weight (100, 100) | mean -0.000086 | std 6.234305e-03 | grad:data ratio 6.073741e-02
weight (100, 100) | mean +0.000052 | std 5.742187e-03 | grad:data ratio 5.631483e-02
weight (100, 100) | mean +0.000032 | std 5.672205e-03 | grad:data ratio 5.570125e-02
weight  (100, 27) | mean -0.000082 | std 1.209416e-02 | grad:data ratio 1.160106e-01

(图: 各权重梯度分布直方图)
下面对lr*grad/value(也就是每一步的更新量与参数值之比)取log10后做了可视化。这个比值一般期望在1e-3左右,即log10后在-3附近:明显高于-3说明更新太快,远低于-3则说明更新太慢。通过这些图可以判断训练是过快还是过慢,并据此做出调整。

在加入batch norm层后一切变得简单,我们只需要合理地设置学习率来得到合适的更新尺度

ud是之前就计算过的~

plt.figure(figsize=(20, 4))
legends = []
for i,p in enumerate(parameters):
  if p.ndim == 2:
    plt.plot([ud[j][i] for j in range(len(ud))])
    legends.append('param %d' % i)
plt.plot([0, len(ud)], [-3, -3], 'k') # these ratios should be ~1e-3, indicate on plot
plt.legend(legends);

(图: 各参数update/data比值随训练步数的变化)

@torch.no_grad() # this decorator disables gradient tracking
def split_loss(split):
  x,y = {
    'train': (Xtr, Ytr),
    'val': (Xdev, Ydev),
    'test': (Xte, Yte),
  }[split]
  emb = C[x] # (N, block_size, n_embd)
  x = emb.view(emb.shape[0], -1) # concat into (N, block_size * n_embd)
  for layer in layers:
    x = layer(x)
  loss = F.cross_entropy(x, y)
  print(split, loss.item())

# put layers into eval mode
for layer in layers:
  layer.training = False
split_loss('train')
split_loss('val')
train 2.4002976417541504
val 2.3982467651367188

下面是用训练好的模型进行采样了

# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):

  out = []
  context = [0] * block_size # initialize with all ...
  while True:
    # forward pass the neural net
    emb = C[torch.tensor([context])] # (1,block_size,n_embd)
    x = emb.view(emb.shape[0], -1) # concatenate the vectors
    for layer in layers:
      x = layer(x)
    logits = x
    probs = F.softmax(logits, dim=1)
    # sample from the distribution
    ix = torch.multinomial(probs, num_samples=1, generator=g).item()
    # shift the context window and track the samples
    context = context[1:] + [ix]
    out.append(ix)
    # if we sample the special '.' token, break
    if ix == 0:
      break

  print(''.join(itos[i] for i in out)) # decode and print the generated word
carpah.
qarlileif.
jmrix.
thty.
sacansa.
jazhnte.
dpn.
arciigqeiunellaia.
chriiv.
kalein.
dhlm.
join.
qhinn.
sroin.
arian.
quiqaelogiearyxix.
kaeklinsan.
ed.
ecoia.
gtleley.