
9.4. Bidirectional Recurrent Neural Networks

Consider filling in the blank in the following sentences; the longer the available right-hand context, the more it constrains which word fits:

• I am `___`.

• I am `___` hungry.

• I am `___` hungry, and I could eat half a pig.

9.4.1. Dynamic Programming in Hidden Markov Models

Fig. 9.4.1: A hidden Markov model.

(9.4.1)$P(x_1, \ldots, x_T, h_1, \ldots, h_T) = \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t), \text{ where } P(h_1 \mid h_0) = P(h_1).$

(9.4.2)\begin{split}\begin{aligned} &P(x_1, \ldots, x_T) \\ =& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\ =& \sum_{h_1, \ldots, h_T} \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\ =& \sum_{h_2, \ldots, h_T} \underbrace{\left[\sum_{h_1} P(h_1) P(x_1 \mid h_1) P(h_2 \mid h_1)\right]}_{\pi_2(h_2) \stackrel{\mathrm{def}}{=}} P(x_2 \mid h_2) \prod_{t=3}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\ =& \sum_{h_3, \ldots, h_T} \underbrace{\left[\sum_{h_2} \pi_2(h_2) P(x_2 \mid h_2) P(h_3 \mid h_2)\right]}_{\pi_3(h_3)\stackrel{\mathrm{def}}{=}} P(x_3 \mid h_3) \prod_{t=4}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t)\\ =& \dots \\ =& \sum_{h_T} \pi_T(h_T) P(x_T \mid h_T). \end{aligned}\end{split}

(9.4.3)$\pi_{t+1}(h_{t+1}) = \sum_{h_t} \pi_t(h_t) P(x_t \mid h_t) P(h_{t+1} \mid h_t).$
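As a concrete illustration (not part of the original text), the forward recursion (9.4.3) can be sketched in plain Java for a toy two-state HMM. All parameters below (transition matrix `A`, emission matrix `B`, initial distribution `PI0`) are made-up numbers, and the result is checked against brute-force enumeration of all hidden paths:

```java
public class HmmForward {
    // Toy 2-state HMM with binary observations (all numbers are made up).
    static final double[][] A = {{0.7, 0.3}, {0.4, 0.6}};  // A[i][j] = P(h_{t+1}=j | h_t=i)
    static final double[][] B = {{0.9, 0.1}, {0.2, 0.8}};  // B[i][x] = P(x | h=i)
    static final double[] PI0 = {0.5, 0.5};                // P(h_1)

    // P(x_1, ..., x_T) via pi_{t+1}(h') = sum_h pi_t(h) P(x_t | h) P(h' | h).
    static double likelihood(int[] x) {
        int T = x.length, H = PI0.length;
        double[] pi = PI0.clone();                // pi_1(h) = P(h_1)
        for (int t = 0; t < T - 1; t++) {
            double[] next = new double[H];
            for (int h = 0; h < H; h++)
                for (int h2 = 0; h2 < H; h2++)
                    next[h2] += pi[h] * B[h][x[t]] * A[h][h2];
            pi = next;
        }
        double p = 0;                             // final step: sum_h pi_T(h) P(x_T | h)
        for (int h = 0; h < H; h++) p += pi[h] * B[h][x[T - 1]];
        return p;
    }

    // Brute-force check: sum the joint probability over all H^T hidden paths.
    static double bruteForce(int[] x) {
        int T = x.length, H = PI0.length, paths = (int) Math.pow(H, T);
        double total = 0;
        for (int mask = 0; mask < paths; mask++) {
            int m = mask, prev = -1;
            double p = 1;
            for (int t = 0; t < T; t++) {
                int h = m % H; m /= H;
                p *= (t == 0 ? PI0[h] : A[prev][h]) * B[h][x[t]];
                prev = h;
            }
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] x = {0, 1, 1, 0};
        System.out.println(HmmForward.likelihood(x)); // equals bruteForce(x)
    }
}
```

The recursion does $O(T \cdot H^2)$ work instead of the $O(H^T)$ enumeration, which is the whole point of the dynamic program.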

(9.4.4)\begin{split}\begin{aligned} & P(x_1, \ldots, x_T) \\ =& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\ =& \sum_{h_1, \ldots, h_T} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot P(h_T \mid h_{T-1}) P(x_T \mid h_T) \\ =& \sum_{h_1, \ldots, h_{T-1}} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_T} P(h_T \mid h_{T-1}) P(x_T \mid h_T)\right]}_{\rho_{T-1}(h_{T-1})\stackrel{\mathrm{def}}{=}} \\ =& \sum_{h_1, \ldots, h_{T-2}} \prod_{t=1}^{T-2} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_{T-1}} P(h_{T-1} \mid h_{T-2}) P(x_{T-1} \mid h_{T-1}) \rho_{T-1}(h_{T-1}) \right]}_{\rho_{T-2}(h_{T-2})\stackrel{\mathrm{def}}{=}} \\ =& \ldots \\ =& \sum_{h_1} P(h_1) P(x_1 \mid h_1)\rho_{1}(h_{1}). \end{aligned}\end{split}

(9.4.5)$\rho_{t-1}(h_{t-1})= \sum_{h_{t}} P(h_{t} \mid h_{t-1}) P(x_{t} \mid h_{t}) \rho_{t}(h_{t}),$

(9.4.6)$P(x_j \mid x_{-j}) \propto \sum_{h_j} \pi_j(h_j) \rho_j(h_j) P(x_j \mid h_j).$
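Combining the two recursions, the product $\pi_j(h_j)\, P(x_j \mid h_j)\, \rho_j(h_j)$ appearing in (9.4.6) also yields the smoothed hidden-state posterior $P(h_j \mid x_1, \ldots, x_T)$ after normalization. A minimal Java sketch, using the same style of made-up toy parameters as above:

```java
public class HmmSmoothing {
    // Toy 2-state HMM; all parameters are made up for illustration.
    static final double[][] A = {{0.7, 0.3}, {0.4, 0.6}};  // transition P(h' | h)
    static final double[][] B = {{0.9, 0.1}, {0.2, 0.8}};  // emission P(x | h)
    static final double[] PI0 = {0.5, 0.5};                // initial P(h_1)

    // Smoothed posterior over h_j: proportional to pi_j(h) P(x_j | h) rho_j(h).
    static double[] posterior(int[] x, int j) {
        int T = x.length, H = PI0.length;
        // Forward recursion (9.4.3): pi[t][h]
        double[][] pi = new double[T][H];
        pi[0] = PI0.clone();
        for (int t = 0; t < T - 1; t++)
            for (int h = 0; h < H; h++)
                for (int h2 = 0; h2 < H; h2++)
                    pi[t + 1][h2] += pi[t][h] * B[h][x[t]] * A[h][h2];
        // Backward recursion (9.4.5): rho[t][h], initialized with rho_T = 1
        double[][] rho = new double[T][H];
        java.util.Arrays.fill(rho[T - 1], 1.0);
        for (int t = T - 2; t >= 0; t--)
            for (int h = 0; h < H; h++)
                for (int h2 = 0; h2 < H; h2++)
                    rho[t][h] += A[h][h2] * B[h2][x[t + 1]] * rho[t + 1][h2];
        // Combine per (9.4.6) and normalize.
        double[] post = new double[H];
        double z = 0;
        for (int h = 0; h < H; h++) {
            post[h] = pi[j][h] * B[h][x[j]] * rho[j][h];
            z += post[h];
        }
        for (int h = 0; h < H; h++) post[h] /= z;
        return post;
    }
}
```

Note that both passes are needed: the posterior at step $j$ depends on observations before *and* after $j$, which is exactly the property the bidirectional RNN below imitates.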

9.4.2. Bidirectional Model

Fig. 9.4.2: Architecture of a bidirectional RNN.

9.4.2.1. Definition

(9.4.7)\begin{split}\begin{aligned} \overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}),\\ \overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}), \end{aligned}\end{split}

(9.4.8)$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$
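To make (9.4.7)–(9.4.8) concrete, here is a minimal plain-Java sketch of one bidirectional vanilla-RNN forward pass: a left-to-right pass, a right-to-left pass, and an output computed from the concatenated states. The dimensions (`D`, `H`, `Q`) and random weights are made up for illustration, not taken from the text:

```java
import java.util.Random;

public class BiRnnSketch {
    static final int D = 3, H = 2, Q = 4;   // input, hidden (per direction), output sizes
    static final Random RNG = new Random(0);

    static double[][] randn(int rows, int cols) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) m[i][j] = 0.1 * RNG.nextGaussian();
        return m;
    }

    // One recurrent step: h' = tanh(x W_xh + h W_hh + b), as in (9.4.7).
    static double[] step(double[] x, double[] h, double[][] Wxh, double[][] Whh, double[] b) {
        double[] out = new double[H];
        for (int j = 0; j < H; j++) {
            double s = b[j];
            for (int i = 0; i < D; i++) s += x[i] * Wxh[i][j];
            for (int i = 0; i < H; i++) s += h[i] * Whh[i][j];
            out[j] = Math.tanh(s);
        }
        return out;
    }

    // Returns O of shape (T, Q): concatenated forward/backward states times W_hq, per (9.4.8).
    static double[][] forward(double[][] xs) {
        int T = xs.length;
        double[][] WxhF = randn(D, H), WhhF = randn(H, H);   // forward-direction weights
        double[][] WxhB = randn(D, H), WhhB = randn(H, H);   // backward-direction weights
        double[][] Whq = randn(2 * H, Q);                    // output weights on [h_fwd, h_bwd]
        double[] bF = new double[H], bB = new double[H], bq = new double[Q];

        double[][] hf = new double[T][], hb = new double[T][];
        double[] h = new double[H];
        for (int t = 0; t < T; t++) { h = step(xs[t], h, WxhF, WhhF, bF); hf[t] = h; }       // left to right
        h = new double[H];
        for (int t = T - 1; t >= 0; t--) { h = step(xs[t], h, WxhB, WhhB, bB); hb[t] = h; }  // right to left

        double[][] o = new double[T][Q];
        for (int t = 0; t < T; t++)
            for (int q = 0; q < Q; q++) {
                double s = bq[q];
                for (int i = 0; i < H; i++)
                    s += hf[t][i] * Whq[i][q] + hb[t][i] * Whq[H + i][q];
                o[t][q] = s;
            }
        return o;
    }
}
```

Because $\overleftarrow{\mathbf{H}}_t$ depends on inputs at steps $t, t+1, \ldots, T$, the output at every position uses the *entire* sequence, which is why this architecture cannot be used for next-token prediction.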

9.4.3. (A Wrong Application of Bidirectional RNNs)

%load ../utils/djl-imports


NDManager manager = NDManager.newBaseManager();

// Load the data
int batchSize = 32;
int numSteps = 35;
Device device = manager.getDevice();
TimeMachineDataset dataset =
        new TimeMachineDataset.Builder()
                .setManager(manager)
                .setMaxTokens(10000)
                .setSampling(batchSize, false)
                .setSteps(numSteps)
                .build();
dataset.prepare();
Vocab vocab = dataset.getVocab();

// Define the bidirectional LSTM model by setting .optBidirectional(true)
int vocabSize = vocab.length();
int numHiddens = 256;
int numLayers = 2;
LSTM lstmLayer =
        LSTM.builder()
                .setNumLayers(numLayers)
                .setStateSize(numHiddens)
                .optReturnState(true)
                .optBatchFirst(false)
                .optBidirectional(true)
                .build();

// Train the model
RNNModel model = new RNNModel(lstmLayer, vocabSize);
int numEpochs = Integer.getInteger("MAX_EPOCH", 500);

int lr = 1;
TimeMachine.trainCh8(model, dataset, vocab, lr, numEpochs, device, false, manager);

INFO Training on: 1 GPUs.
INFO Load MXNet Engine Version 1.9.0 in 0.063 ms.

perplexity: 1.0, 45576.8 tokens/sec on gpu(0)
time travellerererererererererererererererererererererererererer
travellerererererererererererererererererererererererererer


9.4.4. Summary

• In bidirectional RNNs, the hidden state at each time step is determined simultaneously by the data before and after the current time step.

• Bidirectional RNNs bear a close resemblance to the forward-backward algorithm in probabilistic graphical models.

• Bidirectional RNNs are mostly useful for sequence encoding and for estimating observations given bidirectional context.

• Bidirectional RNNs are very costly to train due to their long gradient chains.

9.4.5. Exercises

1. If the two directions use different numbers of hidden units, how will the shape of $\mathbf{H}_t$ change?

2. Design a bidirectional RNN with multiple hidden layers.

3. Polysemy is common in natural languages. For example, the word "bank" has different meanings in the contexts "i went to the bank to deposit cash" and "i went to the bank to sit down". How can we design a neural network model such that, given a context sequence and a word, it returns a vector representation of the word in that context? What type of neural architecture is better suited to handling polysemy?