
$$\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2. \tag{11.9.1}$$

The difference to Section 11.8 is that we perform updates with the rescaled gradient $$\mathbf{g}_t'$$, i.e.,

$$\mathbf{x}_t = \mathbf{x}_{t-1} - \mathbf{g}_t'. \tag{11.9.2}$$

The rescaled gradient $$\mathbf{g}_t'$$ is computed as

$$\mathbf{g}_t' = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t, \tag{11.9.3}$$

where $$\Delta\mathbf{x}_{t-1}$$ is the leaky average of the squared rescaled gradients $$\mathbf{g}_t'$$. We initialize $$\Delta\mathbf{x}_0$$ to be $$0$$ and update it at each step, i.e.,

$$\Delta \mathbf{x}_t = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2, \tag{11.9.4}$$

where $$\epsilon$$ (a small value such as $$10^{-5}$$) is added to maintain numerical stability.
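To make the recursions concrete, here is a minimal plain-Java sketch (scalar case, no DJL; all names are hypothetical) of a single Adadelta step following (11.9.1)–(11.9.4), applied to the gradient of $$f(x) = x^2$$ at $$x = 1$$:

```java
public class AdadeltaStep {
    public static void main(String[] args) {
        double rho = 0.9, eps = 1e-5;
        double s = 0.0, deltaAcc = 0.0; // s_t and Delta x_t, both initialized to 0
        double x = 1.0;                 // parameter
        double g = 2.0 * x;            // gradient of f(x) = x^2

        // (11.9.1): leaky average of squared gradients
        s = rho * s + (1 - rho) * g * g;
        // (11.9.3): rescaled gradient
        double gPrime = Math.sqrt(deltaAcc + eps) / Math.sqrt(s + eps) * g;
        // (11.9.2): parameter update
        x = x - gPrime;
        // (11.9.4): leaky average of squared rescaled gradients
        deltaAcc = rho * deltaAcc + (1 - rho) * gPrime * gPrime;

        System.out.println("x after one step: " + x);
    }
}
```

Note that the initial step is tiny (here $$x$$ moves from $$1.0$$ to roughly $$0.99$$): since $$\Delta\mathbf{x}_0 = 0$$, the numerator $$\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}$$ starts at $$\sqrt{\epsilon}$$ and the effective step size has to build up over time.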

11.9.2. Implementation

Adadelta needs to maintain two state variables for each parameter, $$\mathbf{s}_t$$ and $$\Delta\mathbf{x}_t$$. This yields the following implementation.

%load ../utils/djl-imports

NDList initAdadeltaStates(int featureDimension) {
    NDManager manager = NDManager.newBaseManager();
    NDArray sW = manager.zeros(new Shape(featureDimension, 1));
    NDArray sB = manager.zeros(new Shape(1));
    NDArray deltaW = manager.zeros(new Shape(featureDimension, 1));
    NDArray deltaB = manager.zeros(new Shape(1));
    return new NDList(sW, deltaW, sB, deltaB);
}

public class Optimization {
    public static void adadelta(NDList params, NDList states, Map<String, Float> hyperparams) {
        float rho = hyperparams.get("rho");
        float eps = (float) 1e-5;
        for (int i = 0; i < params.size(); i++) {
            NDArray param = params.get(i);
            NDArray state = states.get(2 * i);
            NDArray delta = states.get(2 * i + 1);
            // Update parameter, state, and delta
            // In-place updates with the '__'i methods (ex. muli)
            // state = rho * state + (1 - rho) * param.gradient^2
            state.muli(rho).addi(param.getGradient().square().mul(1 - rho));
            // rescaledGradient = ((delta + eps)^(1/2) / (state + eps)^(1/2)) * param.gradient
            NDArray rescaledGradient = delta.add(eps).sqrt()
                    .div(state.add(eps).sqrt()).mul(param.getGradient());
            // param -= rescaledGradient
            param.subi(rescaledGradient);
            // delta = rho * delta + (1 - rho) * rescaledGradient^2
            delta.muli(rho).addi(rescaledGradient.square().mul(1 - rho));
        }
    }
}
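As a quick sanity check of the update rule above, independent of DJL (plain Java, hypothetical names), iterating the same scalar recursion on $$f(x) = x^2$$ typically drives $$x$$ toward the minimum without any hand-tuned learning rate:

```java
public class AdadeltaQuadratic {
    // One Adadelta step on (x, s, delta); returns the updated triple
    static double[] step(double x, double s, double delta, double rho, double eps) {
        double g = 2.0 * x;                                  // gradient of x^2
        s = rho * s + (1 - rho) * g * g;                     // state update (11.9.1)
        double gP = Math.sqrt(delta + eps) / Math.sqrt(s + eps) * g; // (11.9.3)
        x -= gP;                                             // parameter update (11.9.2)
        delta = rho * delta + (1 - rho) * gP * gP;           // delta update (11.9.4)
        return new double[]{x, s, delta};
    }

    public static void main(String[] args) {
        double[] st = {1.0, 0.0, 0.0}; // x = 1, s = 0, delta = 0
        for (int t = 0; t < 500; t++) {
            st = step(st[0], st[1], st[2], 0.9, 1e-5);
        }
        System.out.println("x after 500 steps: " + st[0]);
    }
}
```

The effective step size is governed by the ratio $$\sqrt{\Delta x_{t-1} + \epsilon}/\sqrt{s_t + \epsilon}$$, which adapts automatically; this is the same logic the `Optimization.adadelta` routine applies elementwise to NDArrays.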


AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);

public TrainingChapter11.LossTime trainAdadelta(float rho, int numEpochs)
        throws IOException, TranslateException {
    int featureDimension = airfoil.getColumnNames().size();
    Map<String, Float> hyperparams = new HashMap<>();
    hyperparams.put("rho", rho);
    return TrainingChapter11.trainCh11(Optimization::adadelta,
            initAdadeltaStates(featureDimension),
            hyperparams, airfoil,
            featureDimension, numEpochs);
}

trainAdadelta(0.9f, 2);


loss: 0.249, 0.097 sec/epoch


For a concise implementation, we use the Adadelta optimizer from DJL with $$\rho = 0.9$$:

Optimizer adadelta = Optimizer.adadelta().optRho(0.9f).build();

TrainingChapter11.trainConciseCh11(adadelta, airfoil, 2);


INFO Training on: 1 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.088 ms.

Training:    100% |████████████████████████████████████████| Accuracy: 1.00, L2Loss: 0.48
loss: 0.472, 0.169 sec/epoch


11.9.3. Summary

- Adadelta has no learning rate parameter. Instead, it uses the rate of change in the parameters itself to adapt the learning rate.
- Adadelta requires two state variables per parameter to store the second moments of the gradient and of the change in parameters.

11.9.4. Exercises

1. Adjust the value of $$\rho$$. What happens?
2. Show how to implement the algorithm without the use of $$\mathbf{g}_t'$$. Why might this be a good idea?