
# 11.6. Momentum

In Section 11.4 we reviewed what happens when performing stochastic gradient descent, i.e., optimization when only a noisy variant of the gradient is available. With noisy gradients we need to be extra careful when choosing the learning rate: if it decays too quickly, convergence stalls; conversely, if it decays too slowly, we may fail to converge to the optimal solution.

## 11.6.1. Basics

### 11.6.1.1. Leaky Averages

Recall that in minibatch stochastic gradient descent the update is computed from the average gradient over a minibatch $$\mathcal{B}_t$$:

(11.6.1)$\mathbf{g}_{t, t-1} = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w}_{t-1}) = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \mathbf{h}_{i, t-1}.$

Here $$\mathbf{h}_{i, t-1} = \partial_{\mathbf{w}} f(\mathbf{x}_i, \mathbf{w}_{t-1})$$ denotes the gradient for sample $$i$$ at time $$t-1$$.

To benefit from variance reduction beyond the minibatch itself, we can replace the gradient by a *leaky average* for some $$\beta \in (0, 1)$$:

(11.6.2)$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}$

Expanding the recursion shows that $$\mathbf{v}_t$$ aggregates past gradients with exponentially decaying weights:

(11.6.3)\begin{aligned} \mathbf{v}_t = \beta^2 \mathbf{v}_{t-2} + \beta \mathbf{g}_{t-1, t-2} + \mathbf{g}_{t, t-1} = \ldots = \sum_{\tau = 0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau, t-\tau-1}. \end{aligned}
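As a sanity check on (11.6.3), a small plain-Java sketch (with a hypothetical gradient sequence) confirms that the recursive leaky-average update and the expanded exponentially weighted sum agree:

```java
public class LeakyAverage {
    public static void main(String[] args) {
        double beta = 0.9;
        double[] g = {1.0, -2.0, 0.5, 3.0};  // hypothetical gradient sequence

        // Recursive form: v_t = beta * v_{t-1} + g_t
        double v = 0.0;
        for (double grad : g) {
            v = beta * v + grad;
        }

        // Closed form: v_t = sum_{tau=0}^{t-1} beta^tau * g_{t-tau}
        double vSum = 0.0;
        for (int tau = 0; tau < g.length; tau++) {
            vSum += Math.pow(beta, tau) * g[g.length - 1 - tau];
        }

        System.out.println(Math.abs(v - vSum) < 1e-12);  // true: both forms agree
    }
}
```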

### 11.6.1.2. An Ill-conditioned Problem

To see why averaging gradients helps, consider the moderately ill-conditioned objective

(11.6.4)$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2.$

%load ../utils/djl-imports

import org.apache.commons.lang3.ArrayUtils;

float eta = 0.4f;
BiFunction<Float, Float, Float> f2d = (x1, x2) -> 0.1f * x1 * x1 + 2 * x2 * x2;

Function<Float[], Float[]> gd2d = (state) -> {
    Float x1 = state[0], x2 = state[1], s1 = state[2], s2 = state[3];
    // Gradient of f is (0.2 * x1, 4 * x2)
    return new Float[]{x1 - eta * 0.2f * x1, x2 - eta * 4 * x2, 0f, 0f};
};
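Since the contour figure may not render here, a quick standalone trace (plain Java, independent of DJL; the starting point (-5, -2) is an assumption matching the book's usual 2D traces) shows the behavior the figure depicts: with $$\eta = 0.4$$, $$x_2$$ overshoots and oscillates while $$x_1$$ barely moves.

```java
import java.util.Locale;

public class GD2dTrace {
    public static void main(String[] args) {
        float eta = 0.4f;
        // Hypothetical starting point for illustration
        float x1 = -5f, x2 = -2f;
        for (int t = 1; t <= 5; t++) {
            x1 = x1 - eta * 0.2f * x1;  // d f / d x1 = 0.2 * x1
            x2 = x2 - eta * 4f * x2;    // d f / d x2 = 4 * x2, step factor is 1 - 1.6 = -0.6
            System.out.printf(Locale.US, "step %d: x1=%.3f, x2=%.3f%n", t, x1, x2);
        }
    }
}
```

The sign of $$x_2$$ flips on every step (the multiplier is $$-0.6$$), while $$x_1$$ shrinks by a factor of only $$0.92$$ per step.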


Tablesaw does not yet support contour plots and meshgrids; this figure will be added later.


Fig. 11.6.1 Ellipsoid with Flat x1.

float eta = 0.6f;

Tablesaw does not yet support contour plots and meshgrids; this figure will be added later.


Fig. 11.6.2 Ellipsoid with Flat x1 with Large Learning Rate.

### 11.6.1.3. The Momentum Method

Using the leaky average $$\mathbf{v}_t$$ in place of the raw gradient yields the momentum update:

(11.6.5)\begin{split}\begin{aligned} \mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}, \\ \mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t. \end{aligned}\end{split}

float eta = 0.6f;
float beta = 0.5f;

Function<Float[], Float[]> momentum2d = (state) -> {
    Float x1 = state[0], x2 = state[1], v1 = state[2], v2 = state[3];
    // Leaky average of the gradients
    v1 = beta * v1 + 0.2f * x1;
    v2 = beta * v2 + 4 * x2;
    return new Float[]{x1 - eta * v1, x2 - eta * v2, v1, v2};
};


Tablesaw does not yet support contour plots and meshgrids; this figure will be added later.


Fig. 11.6.3 Contour Momentum.

eta = 0.6f;
beta = 0.25f;

Tablesaw does not yet support contour plots and meshgrids; this figure will be added later.


Fig. 11.6.4 Contour Momentum Less.

### 11.6.1.4. Effective Sample Weight

/* Saved in GradDescUtils.java */
public static Figure plotGammas(float[] time, float[] gammas,
                                int width, int height) {
    double[] gamma1 = new double[time.length];
    double[] gamma2 = new double[time.length];
    double[] gamma3 = new double[time.length];
    double[] gamma4 = new double[time.length];

    // Calculate all gammas over time
    for (int i = 0; i < time.length; i++) {
        gamma1[i] = Math.pow(gammas[0], i);
        gamma2[i] = Math.pow(gammas[1], i);
        gamma3[i] = Math.pow(gammas[2], i);
        gamma4[i] = Math.pow(gammas[3], i);
    }

    // Gamma 1 Line
    ScatterTrace gamma1trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                    gamma1)
            .mode(ScatterTrace.Mode.LINE)
            .name(String.format("gamma = %.2f", gammas[0]))
            .build();

    // Gamma 2 Line
    ScatterTrace gamma2trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                    gamma2)
            .mode(ScatterTrace.Mode.LINE)
            .name(String.format("gamma = %.2f", gammas[1]))
            .build();

    // Gamma 3 Line
    ScatterTrace gamma3trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                    gamma3)
            .mode(ScatterTrace.Mode.LINE)
            .name(String.format("gamma = %.2f", gammas[2]))
            .build();

    // Gamma 4 Line
    ScatterTrace gamma4trace = ScatterTrace.builder(Functions.floatToDoubleArray(time),
                    gamma4)
            .mode(ScatterTrace.Mode.LINE)
            .name(String.format("gamma = %.2f", gammas[3]))
            .build();

    Axis xAxis = Axis.builder()
            .title("time")
            .build();

    Layout layout = Layout.builder()
            .height(height)
            .width(width)
            .xAxis(xAxis)
            .build();

    return new Figure(layout, gamma1trace, gamma2trace, gamma3trace, gamma4trace);
}

NDManager manager = NDManager.newBaseManager();

float[] gammas = new float[]{0.95f, 0.9f, 0.6f, 0f};

NDArray timesND = manager.arange(40f);
float[] times = timesND.toFloatArray();

plotGammas(times, gammas, 600, 400)


## 11.6.2. Practical Experiments

### 11.6.2.1. Implementation from Scratch

NDList initMomentumStates(int featureDim) {
    NDManager manager = NDManager.newBaseManager();
    NDArray vW = manager.zeros(new Shape(featureDim, 1));
    NDArray vB = manager.zeros(new Shape(1));
    return new NDList(vW, vB);
}

public class Optimization {
    public static void sgdMomentum(NDList params, NDList states, Map<String, Float> hyperparams) {
        for (int i = 0; i < params.size(); i++) {
            NDArray param = params.get(i);
            NDArray velocity = states.get(i);
            // Update velocity: v <- momentum * v + gradient
            velocity.muli(hyperparams.get("momentum")).addi(param.getGradient());
            // Update param: w <- w - lr * v
            param.subi(velocity.mul(hyperparams.get("lr")));
        }
    }
}


AirfoilRandomAccess airfoil = TrainingChapter11.getDataCh11(10, 1500);

public TrainingChapter11.LossTime trainMomentum(float lr, float momentum, int numEpochs)
        throws IOException, TranslateException {
    int featureDim = airfoil.getColumnNames().size();
    Map<String, Float> hyperparams = new HashMap<>();
    hyperparams.put("lr", lr);
    hyperparams.put("momentum", momentum);
    return TrainingChapter11.trainCh11(Optimization::sgdMomentum,
            initMomentumStates(featureDim), hyperparams, airfoil, featureDim, numEpochs);
}

trainMomentum(0.02f, 0.5f, 2);

loss: 0.250, 0.078 sec/epoch



trainMomentum(0.01f, 0.9f, 2);

loss: 0.244, 0.069 sec/epoch



trainMomentum(0.005f, 0.9f, 2);

loss: 0.243, 0.070 sec/epoch



### 11.6.2.2. Concise Implementation

Tracker lrt = Tracker.fixed(0.005f);
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).optMomentum(0.9f).build();

TrainingChapter11.trainConciseCh11(sgd, airfoil, 2);

INFO Training on: 1 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.069 ms.

Training:    100% |████████████████████████████████████████| Accuracy: 1.00, L2Loss: 0.28
loss: 0.245, 0.141 sec/epoch


## 11.6.3. Theoretical Analysis

The 2D example of $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$ might seem rather contrived. We will now see that it is actually quite representative of real problems, at least in the case of minimizing convex quadratic objective functions.

### 11.6.3.1. Quadratic Convex Functions

Consider a general quadratic function

(11.6.6)$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b.$

For positive definite $$\mathbf{Q}$$ we can complete the square and rewrite this as

(11.6.7)$h(\mathbf{x}) = \frac{1}{2} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})^\top \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) + b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}.$
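The completed-square form can be checked against the minimizer obtained directly from the gradient:

```latex
\nabla h(\mathbf{x}) = \mathbf{Q} \mathbf{x} + \mathbf{c} = 0
\;\Longrightarrow\;
\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c},
\qquad
h(\mathbf{x}^*) = \tfrac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}
  - \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c} + b
  = b - \tfrac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}.
```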

Diagonalizing $$\mathbf{Q}$$ and switching to the corresponding eigenbasis, shifted so that the minimizer sits at the origin, reduces this to

(11.6.8)$h(\mathbf{z}) = \frac{1}{2} \mathbf{z}^\top \boldsymbol{\Lambda} \mathbf{z} + b'.$

Gradient descent then decouples across the coordinates of $$\mathbf{z}$$:

(11.6.9)$\mathbf{z}_t = \mathbf{z}_{t-1} - \eta \boldsymbol{\Lambda} \mathbf{z}_{t-1} = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1}.$

(11.6.10)\begin{split}\begin{aligned} \mathbf{v}_t & = \beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1} \\ \mathbf{z}_t & = \mathbf{z}_{t-1} - \eta \left(\beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}\right) \\ & = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1} - \eta \beta \mathbf{v}_{t-1}. \end{aligned}\end{split}

### 11.6.3.2. Scalar Functions

Restricting attention to a single coordinate with eigenvalue $$\lambda > 0$$ (written as a scalar $$x$$), gradient descent reads

(11.6.11)$x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta \lambda) x_t.$

Whenever $$|1 - \eta \lambda| < 1$$ this optimization converges at an exponential rate, since after $$t$$ steps we have $$x_t = (1 - \eta \lambda)^t x_0$$. This shows how the rate of convergence improves initially as the learning rate $$\eta$$ increases, up to the point $$\eta \lambda = 1$$. Beyond that value the iterates begin to oscillate, and for $$\eta \lambda > 2$$ the optimization problem diverges.

float[] lambdas = new float[]{0.1f, 1f, 10f, 19f};
float eta = 0.1f;

float[] time = new float[0];
float[] convergence = new float[0];
String[] lambda = new String[0];
for (float lam : lambdas) {
    float[] timeTemp = new float[20];
    float[] convergenceTemp = new float[20];
    String[] lambdaTemp = new String[20];
    for (int i = 0; i < timeTemp.length; i++) {
        timeTemp[i] = i;
        // x_t = (1 - eta * lambda)^t for the initialization x_0 = 1
        convergenceTemp[i] = (float) Math.pow(1 - eta * lam, i);
        lambdaTemp[i] = String.format("lambda = %.2f", lam);
    }
    // Append this lambda's curve to the accumulated arrays
    time = ArrayUtils.addAll(time, timeTemp);
    convergence = ArrayUtils.addAll(convergence, convergenceTemp);
    lambda = ArrayUtils.addAll(lambda, lambdaTemp);
}

Table data = Table.create("data",
    DoubleColumn.create("time", Functions.floatToDoubleArray(time)),
    DoubleColumn.create("convergence", Functions.floatToDoubleArray(convergence)),
    StringColumn.create("lambda", lambda)
);

LinePlot.create("convergence vs. time", data, "time", "convergence", "lambda");


Adding momentum to the scalar recursion couples $$v_t$$ and $$x_t$$ into a linear system:

(11.6.12)\begin{split}\begin{bmatrix} v_{t+1} \\ x_{t+1} \end{bmatrix} = \begin{bmatrix} \beta & \lambda \\ -\eta \beta & (1 - \eta \lambda) \end{bmatrix} \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix} = \mathbf{R}(\beta, \eta, \lambda) \begin{bmatrix} v_{t} \\ x_{t} \end{bmatrix}.\end{split}
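Convergence of this linear system is governed by the eigenvalues of $$\mathbf{R}(\beta, \eta, \lambda)$$: the iterates converge exactly when the spectral radius is below 1. A small plain-Java sketch (not part of the book's code; the parameter values are illustrative) computes the spectral radius of the 2×2 map:

```java
public class MomentumSpectrum {
    // Spectral radius of R(beta, eta, lambda) from (11.6.12).
    static double spectralRadius(double beta, double eta, double lambda) {
        double tr = beta + 1 - eta * lambda;  // trace of R
        // det R = beta * (1 - eta * lambda) + eta * beta * lambda = beta
        double det = beta;
        double disc = tr * tr - 4 * det;
        if (disc >= 0) {
            // Two real eigenvalues (tr +- sqrt(disc)) / 2
            double r1 = (tr + Math.sqrt(disc)) / 2;
            double r2 = (tr - Math.sqrt(disc)) / 2;
            return Math.max(Math.abs(r1), Math.abs(r2));
        }
        // Complex conjugate pair: both have modulus sqrt(det)
        return Math.sqrt(det);
    }

    public static void main(String[] args) {
        // Momentum converges iff the spectral radius is below 1.
        System.out.println(spectralRadius(0.5, 0.1, 1.0) < 1);  // moderate step: converges
        System.out.println(spectralRadius(0.5, 4.0, 1.0) < 1);  // step far too large: diverges
    }
}
```

Note that in the complex-eigenvalue regime the modulus is $$\sqrt{\beta} < 1$$ regardless of $$\eta$$, which is one way to see the damping effect of momentum.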

## 11.6.4. Summary

• Momentum replaces the raw gradient with a leaky average over past gradients. This accelerates convergence significantly.

• Momentum is desirable both for noise-free gradient descent and for noisy stochastic gradient descent.

• Momentum prevents the stalling of the optimization process that is much more likely to occur in stochastic gradient descent.

• The effective number of gradients is $$\frac{1}{1-\beta}$$ due to the exponential downweighting of past data.

• In the case of convex quadratic problems the momentum method can be analyzed explicitly in detail.

• Implementing momentum is quite straightforward, but it requires us to store an additional state vector (the momentum $$\mathbf{v}$$).
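The $$\frac{1}{1-\beta}$$ figure above follows from the geometric series $$\sum_{\tau \geq 0} \beta^\tau = \frac{1}{1-\beta}$$; a quick plain-Java check of that identity:

```java
public class EffectiveSamples {
    public static void main(String[] args) {
        double beta = 0.9;
        double sum = 0.0;
        // Partial sum of the geometric series; converges quickly for beta < 1
        for (int tau = 0; tau < 1000; tau++) {
            sum += Math.pow(beta, tau);
        }
        System.out.println(Math.abs(sum - 1.0 / (1.0 - beta)) < 1e-9);  // true
    }
}
```

For $$\beta = 0.9$$ the effective sample size is 10, i.e., the leaky average behaves roughly like an average over the last 10 gradients.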

## 11.6.5. Exercises

1. Use other combinations of momentum hyperparameters and learning rates, and observe and analyze the different experimental results.

2. Try out gradient descent and momentum for a quadratic problem with multiple eigenvalues, i.e., $$f(x) = \frac{1}{2} \sum_i \lambda_i x_i^2$$, e.g., $$\lambda_i = 2^{-i}$$. Plot how the values of $$x$$ decrease for the initialization $$x_i = 1$$.

3. Derive the minimum value and minimizer of $$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b$$.

4. What changes when we perform stochastic gradient descent with momentum? What happens when we use minibatch stochastic gradient descent with momentum? Experiment with the parameters.