
• In Section 11.4, we learned that stochastic gradient descent is more efficient than gradient descent when solving optimization problems.

• In Section 11.5, we learned that using larger sets of observations in a minibatch provides additional efficiency through vectorization. This is the key to efficient multi-machine, multi-GPU, and overall parallel processing.

• In Section 11.6, we added a mechanism for aggregating a history of past gradients to accelerate convergence.

• In Section 11.7, we used per-coordinate scaling to allow for computationally efficient preconditioning.

• In Section 11.8, we decoupled per-coordinate scaling from a learning rate adjustment.

## 11.10.1. Algorithm

One of the key components of Adam is that it uses exponentially weighted moving averages (leaky averages) to estimate both the momentum and the second moment of the gradient, via the state variables

(11.10.1)\begin{split}\begin{aligned} \mathbf{v}_t & \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \\ \mathbf{s}_t & \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2. \end{aligned}\end{split}

Common choices are $\beta_1 = 0.9$ and $\beta_2 = 0.999$, so the second-moment estimate moves much more slowly than the momentum estimate.

Since $\mathbf{v}_0 = \mathbf{s}_0 = 0$, both estimates are biased toward zero for small $t$. This is corrected by normalizing with $1 - \beta_i^t$:

(11.10.2)$\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t} \text{ and } \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}.$

Armed with the corrected estimates, we rescale the gradient on a per-coordinate basis, where a small $\epsilon$ (e.g., $10^{-6}$) ensures numerical stability:

(11.10.3)$\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}.$

This yields the simple update

(11.10.4)$\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'.$
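To see why the bias correction in (11.10.2) matters, consider the very first step, where $\mathbf{v}_0 = \mathbf{s}_0 = 0$. The following scalar sketch (plain Java, independent of the chapter's DJL code; the class and method names are just for illustration) shows that normalizing by $1 - \beta_i^t$ recovers the raw gradient at $t = 1$:

```java
public class AdamStep {
    // One Adam moment update from zero state, followed by bias correction.
    // Returns {vHat, sHat} for gradient g at t = 1.
    static double[] firstStep(double g, double beta1, double beta2) {
        double v = beta1 * 0.0 + (1 - beta1) * g;     // v_1
        double s = beta2 * 0.0 + (1 - beta2) * g * g; // s_1
        double vHat = v / (1 - beta1);                // divide by 1 - beta1^1
        double sHat = s / (1 - beta2);                // divide by 1 - beta2^1
        return new double[] { vHat, sHat };
    }

    public static void main(String[] args) {
        double[] corrected = firstStep(2.0, 0.9, 0.999);
        // vHat equals g and sHat equals g^2; without the correction they
        // would be 0.2 and 0.004, i.e. heavily biased toward zero.
        System.out.println(corrected[0] + " " + corrected[1]);
    }
}
```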

## 11.10.2. Implementation

%load ../utils/djl-imports

NDList initAdamStates(int featureDimension) {
    NDManager manager = NDManager.newBaseManager();
    // One (v, s) pair per parameter, interleaved: (vW, sW) then (vB, sB)
    NDArray vW = manager.zeros(new Shape(featureDimension, 1));
    NDArray vB = manager.zeros(new Shape(1));
    NDArray sW = manager.zeros(new Shape(featureDimension, 1));
    NDArray sB = manager.zeros(new Shape(1));
    return new NDList(vW, sW, vB, sB);
}

public class Optimization {
    public static void adam(NDList params, NDList states, Map<String, Float> hyperparams) {
        float beta1 = 0.9f;
        float beta2 = 0.999f;
        float eps = (float) 1e-6;
        float time = hyperparams.get("time");
        float lr = hyperparams.get("lr");
        for (int i = 0; i < params.size(); i++) {
            NDArray param = params.get(i);
            NDArray velocity = states.get(2 * i);
            NDArray state = states.get(2 * i + 1);
            // Update velocity, state, and parameter in place
            // velocity = beta1 * velocity + (1 - beta1) * param.gradient
            velocity.muli(beta1).addi(param.getGradient().mul(1 - beta1));
            // state = beta2 * state + (1 - beta2) * param.gradient^2
            state.muli(beta2).addi(param.getGradient().mul(param.getGradient()).mul(1 - beta2));
            // vBiasCorr = velocity / (1 - beta1^time)
            NDArray vBiasCorr = velocity.div(1 - Math.pow(beta1, time));
            // sBiasCorr = state / (1 - beta2^time)
            NDArray sBiasCorr = state.div(1 - Math.pow(beta2, time));
            // param -= lr * vBiasCorr / (sBiasCorr^(1/2) + eps)
            param.subi(vBiasCorr.mul(lr).div(sBiasCorr.sqrt().add(eps)));
        }
        hyperparams.put("time", time + 1);
    }
}
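The per-coordinate arithmetic above can be sanity-checked without a DJL engine by running the same updates on a scalar quadratic $f(x) = x^2$ (gradient $2x$). This is a standalone sketch; the class name `AdamScalar`, the learning rate, and the step count are arbitrary choices for illustration:

```java
public class AdamScalar {
    // Minimize f(x) = x^2 with Adam for n steps; returns the final x.
    static double run(double x, int n) {
        double beta1 = 0.9, beta2 = 0.999, lr = 0.01, eps = 1e-8;
        double v = 0.0, s = 0.0;
        for (int t = 1; t <= n; t++) {
            double g = 2 * x;                         // gradient of x^2
            v = beta1 * v + (1 - beta1) * g;          // momentum estimate
            s = beta2 * s + (1 - beta2) * g * g;      // second-moment estimate
            double vHat = v / (1 - Math.pow(beta1, t));
            double sHat = s / (1 - Math.pow(beta2, t));
            x -= lr * vHat / (Math.sqrt(sHat) + eps); // Adam step
        }
        return x;
    }

    public static void main(String[] args) {
        // Starting from x = 1, Adam drives x close to the minimum at 0.
        System.out.println(AdamScalar.run(1.0, 300));
    }
}
```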


AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);

public TrainingChapter11.LossTime trainAdam(float lr, float time, int numEpochs) throws IOException, TranslateException {
    int featureDimension = airfoil.getColumnNames().size();
    Map<String, Float> hyperparams = new HashMap<>();
    hyperparams.put("lr", lr);
    hyperparams.put("time", time);
    return TrainingChapter11.trainCh11(Optimization::adam,
                                       hyperparams, airfoil,
                                       featureDimension, numEpochs);
}

TrainingChapter11.LossTime lossTime = trainAdam(0.01f, 1, 2);

loss: 0.243, 0.103 sec/epoch


Since Adam is one of DJL's built-in optimizers, we can also use the concise implementation (the call to the chapter's `trainConciseCh11` helper is an assumption, reconstructed by analogy with the `trainCh11` helper used above):

Tracker lrt = Tracker.fixed(0.01f);
Optimizer adam = Optimizer.adam().optLearningRateTracker(lrt).build();

TrainingChapter11.trainConciseCh11(adam, airfoil, 2);


INFO Training on: 1 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.088 ms.

Training:    100% |████████████████████████████████████████| Accuracy: 1.00, L2Loss: 0.29
loss: 0.245, 0.150 sec/epoch


## 11.10.3. Yogi

Adam is not without problems: it can fail to converge even in convex settings when the second-moment estimate $\mathbf{s}_t$ blows up. As a fix, [Zaheer et al., 2018] proposed a refined update (and initialization) for $\mathbf{s}_t$, called Yogi. To understand it, first rewrite the Adam update as follows:

(11.10.5)$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \left(\mathbf{g}_t^2 - \mathbf{s}_{t-1}\right).$

Whenever $\mathbf{g}_t^2$ has high variance or updates are sparse, $\mathbf{s}_t$ can forget past values too quickly. Yogi replaces the update with one whose magnitude no longer depends on the size of the deviation:

(11.10.6)$\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \odot \mathop{\mathrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1}).$
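The difference between (11.10.5) and (11.10.6) is easy to see on scalars: when $\mathbf{g}_t^2$ is far below the current state $\mathbf{s}_{t-1}$, Adam shrinks $\mathbf{s}$ in proportion to the (large) deviation, while Yogi's step size is bounded by $\mathbf{g}_t^2$ itself. A plain-Java sketch (the class and method names are just for illustration; $\beta_2 = 0.9$ is chosen deliberately small to make the effect visible):

```java
public class SecondMoment {
    // Adam form (11.10.5): s += (1 - beta2) * (g^2 - s)
    static double adamUpdate(double s, double g, double beta2) {
        return s + (1 - beta2) * (g * g - s);
    }

    // Yogi form (11.10.6): s += (1 - beta2) * g^2 * sign(g^2 - s)
    static double yogiUpdate(double s, double g, double beta2) {
        return s + (1 - beta2) * g * g * Math.signum(g * g - s);
    }

    public static void main(String[] args) {
        double s = 100.0, g = 0.1, beta2 = 0.9;  // g^2 = 0.01 << s
        // Adam forgets the large past estimate almost immediately...
        System.out.println(adamUpdate(s, g, beta2));  // ~90.001
        // ...while Yogi decays it gently, by at most g^2 per step.
        System.out.println(yogiUpdate(s, g, beta2));  // ~99.999
    }
}
```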

public class Optimization {
    public static void yogi(NDList params, NDList states, Map<String, Float> hyperparams) {
        float beta1 = 0.9f;
        float beta2 = 0.999f;
        float eps = (float) 1e-3;
        float time = hyperparams.get("time");
        float lr = hyperparams.get("lr");
        for (int i = 0; i < params.size(); i++) {
            NDArray param = params.get(i);
            NDArray velocity = states.get(2 * i);
            NDArray state = states.get(2 * i + 1);
            // velocity = beta1 * velocity + (1 - beta1) * param.gradient
            velocity.muli(beta1).addi(param.getGradient().mul(1 - beta1));
            /* Rewritten update */
            // state = state + (1 - beta2) * param.gradient^2 * sign(param.gradient^2 - state)
            NDArray g2 = param.getGradient().mul(param.getGradient());
            state.addi(g2.sub(state).sign().mul(g2).mul(1 - beta2));
            // vBiasCorr = velocity / (1 - beta1^time)
            NDArray vBiasCorr = velocity.div(1 - Math.pow(beta1, time));
            // sBiasCorr = state / (1 - beta2^time)
            NDArray sBiasCorr = state.div(1 - Math.pow(beta2, time));
            // param -= lr * vBiasCorr / (sBiasCorr^(1/2) + eps)
            param.subi(vBiasCorr.mul(lr).div(sBiasCorr.sqrt().add(eps)));
        }
        hyperparams.put("time", time + 1);
    }
}

AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);

public TrainingChapter11.LossTime trainYogi(float lr, float time, int numEpochs) throws IOException, TranslateException {
    int featureDimension = airfoil.getColumnNames().size();
    Map<String, Float> hyperparams = new HashMap<>();
    hyperparams.put("lr", lr);
    hyperparams.put("time", time);
    return TrainingChapter11.trainCh11(Optimization::yogi,
                                       hyperparams, airfoil,
                                       featureDimension, numEpochs);
}

TrainingChapter11.LossTime lossTime = trainYogi(0.01f, 1, 2);

loss: 0.245, 0.095 sec/epoch


## 11.10.4. Summary

• For gradients with significant variance we may encounter issues with convergence. They can be amended by using larger minibatches or by switching to an improved estimate for $\mathbf{s}_t$. Yogi offers such an alternative.

## 11.10.5. Exercises

3. Why do you need to reduce the learning rate $\eta$ as we converge?