11. deep learning miscellanies

11.1. Derivation of Backpropagation

11.1.1. Conventions

(1)\[z^{l+1}_j =\sum_k w^l_{jk} a^l_k + b^l_j, \quad a^l_j=\sigma(z^l_j) \]

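Here \(z^l_j\) is the pre-activation value of neuron \(j\) in layer \(l\); \(w^l_{jk}\) is the weight connecting neuron \(k\) in layer \(l\) to neuron \(j\) in layer \(l+1\); \(a^l_k\) is the activated value of neuron \(k\) in layer \(l\); \(b^l_j\) is the bias added to neuron \(j\) of layer \(l+1\); and \(\sigma\) is the activation function.

Note that this differs from the convention of equation (23) in the book http://neuralnetworksanddeeplearning.com/chap2.html: here the weights and biases are assigned to the same layer as the neuron values they act on.

In matrix form, equation (1) reads: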

\[z^{l+1}=w^l a^l + b^l, \quad a^l=\sigma(z^l) \]
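As a minimal sketch of the matrix form of equation (1), the forward pass can be written in plain Python. The helper names `sigmoid`, `affine`, and `forward`, the choice of the logistic function for \(\sigma\), the toy weights, and the decision to pass the input directly as \(a^0\) are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Logistic function, an assumed example choice for sigma."""
    return 1.0 / (1.0 + math.exp(-x))

def affine(w, a, b):
    """z_j = sum_k w[j][k] * a[k] + b[j], i.e. z = w a + b."""
    return [sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j
            for row, b_j in zip(w, b)]

def forward(weights, biases, a0):
    """Apply z^{l+1} = w^l a^l + b^l and a^{l+1} = sigma(z^{l+1}) per layer.

    weights[l] has shape (n_{l+1}, n_l); biases[l] has length n_{l+1}.
    Returns zs = [z^1, ..., z^L] and activations = [a^0, ..., a^L].
    """
    zs, activations = [], [a0]
    for w, b in zip(weights, biases):
        z = affine(w, activations[-1], b)
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations

# Hypothetical toy network: 2 inputs -> 3 hidden neurons -> 1 output.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
zs, activations = forward(weights, biases, [0.5, -1.0])
print(zs[0])            # pre-activations z^1 of the hidden layer
print(activations[-1])  # network output a^2
```

Storing every \(z^l\) and \(a^l\) is deliberate: the backward pass needs both.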

11.1.2. Derivation

Let \(C\) denote the loss function, and define:

\[\delta^l = \frac{\partial C}{\partial z^l} \]

Denote the Hadamard (elementwise) product by \(\odot\), defined componentwise (no summation over the repeated index):

\[(u\odot v)_i = u_i v_i \]

From equation (1), the partial derivative (gradient) with respect to the bias \(b\) follows directly:

\[\frac{\partial C}{\partial b^l_j} = \sum_i \frac{\partial C}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial b^l_j} = \frac{\partial C}{\partial z^{l+1}_j} = \delta^{l+1}_j \]

In matrix form:

\[\frac{\partial C}{\partial b^l} = \delta^{l+1} \]

The derivative with respect to the weights \(w\) is:

\[\frac{\partial C}{\partial w^l_{jk}} = \sum_i \frac{\partial C}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^{l+1}_j} a^l_k = \delta^{l+1}_j a^l_k \]

In matrix form:

\[\frac{\partial C}{\partial w^l} = \delta^{l+1} (a^l)^T \]

The relation between \(\delta^l\) of layer \(l\) and \(\delta^{l+1}\) of layer \(l+1\) is:

\[\frac{\partial C}{\partial z^l_j} = \sum_{i,k} \frac{\partial C}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial a^l_k} \frac{\partial a^l_k}{\partial z^l_j} = \sum_i \delta^{l+1}_i w^l_{ij} \sigma^{'}(z^l_j) \]

In matrix form:

\[\delta^l = (w^l)^T \delta^{l+1}\odot\sigma^{'}(z^l) \]

It follows that the gradient with respect to the activations of layer \(l\) is:

\[\nabla_{a^l} C = (w^l)^T \delta^{l+1} \]

11.1.3. Summary of the BP algorithm

The BP algorithm can be summarized by the following four relations:

\[\begin{aligned} \delta^l &= \frac{\partial C}{\partial z^l} = \nabla_z C \\ \frac{\partial C}{\partial w^l} &= \delta^{l+1} (a^l)^T \\ \frac{\partial C}{\partial b^l} &= \delta^{l+1} \\ \delta^l &= (w^l)^T \delta^{l+1}\odot\sigma^{'}(z^l) \end{aligned} \]

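Thus, starting from \(\delta^{l+1}\), one can derive the partial derivatives with respect to the weights and biases of layer \(l\), as well as the partial derivatives with respect to the pre-activation neuron values of layer \(l\).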
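The four relations can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the quadratic cost \(C = \frac{1}{2}\lVert a^L - y\rVert^2\), the logistic \(\sigma\), the toy weights, and all function names are hypothetical choices, and the input is passed directly as \(a^0\). A central finite difference on one weight serves as a sanity check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(weights, biases, a0):
    """Return zs = [z^1, ..., z^L] and activations = [a^0, ..., a^L]."""
    zs, activations = [], [a0]
    for w, b in zip(weights, biases):
        a = activations[-1]
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j
             for row, b_j in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(z_j) for z_j in z])
    return zs, activations

def cost(weights, biases, a0, y):
    """Assumed quadratic cost C = 1/2 * |a^L - y|^2."""
    _, activations = forward(weights, biases, a0)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))

def backprop(weights, biases, a0, y):
    """Return (grad_w, grad_b) with the same shapes as (weights, biases)."""
    zs, activations = forward(weights, biases, a0)
    # delta^L = nabla_a C (elementwise) sigma'(z^L) for the quadratic cost.
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    grad_w, grad_b = [None] * len(weights), [None] * len(biases)
    for l in reversed(range(len(weights))):
        # Here delta holds delta^{l+1}.
        # dC/dw^l = delta^{l+1} (a^l)^T and dC/db^l = delta^{l+1}.
        grad_w[l] = [[d_j * a_k for a_k in activations[l]] for d_j in delta]
        grad_b[l] = list(delta)
        if l > 0:
            # delta^l = (w^l)^T delta^{l+1} (elementwise) sigma'(z^l);
            # zs[l - 1] stores z^l because zs starts at z^1.
            w = weights[l]
            delta = [sum(w[i][j] * delta[i] for i in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][j])
                     for j in range(len(w[0]))]
    return grad_w, grad_b

# Hypothetical toy network: 2 inputs -> 3 hidden neurons -> 1 output.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
a0, y = [0.5, -1.0], [1.0]
grad_w, grad_b = backprop(weights, biases, a0, y)

# Central finite difference on one first-layer weight as a sanity check.
eps = 1e-6
weights[0][2][1] += eps
c_plus = cost(weights, biases, a0, y)
weights[0][2][1] -= 2 * eps
c_minus = cost(weights, biases, a0, y)
weights[0][2][1] += eps  # restore the original weight
numeric = (c_plus - c_minus) / (2 * eps)
print(abs(numeric - grad_w[0][2][1]) < 1e-8)
```

The loop runs from the last layer to the first, so each \(\delta^l\) is computed once and reused for both gradients of the layer below it.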

11.2. convolution arithmetic

reference:

Let \(i\) be the input size, \(k\) the convolution kernel size, \(s\) the stride, and \(p\) the zero padding, all along one axis. Then the output size \(o\) is:

(2)\[o = \left\lfloor{\frac{i + 2p - k}{s}}\right\rfloor + 1 \,. \]

The floor function \({\lfloor}\,{\rfloor}\) is described at https://en.wikipedia.org/wiki/Floor_and_ceiling_functions.
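Equation (2) maps directly onto integer floor division. The sketch below assumes non-negative integer sizes along a single axis; the function name `conv_output_size` and its default arguments are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size o = floor((i + 2p - k) / s) + 1 along one axis."""
    if i + 2 * p < k:
        raise ValueError("kernel does not fit in the padded input")
    return (i + 2 * p - k) // s + 1

print(conv_output_size(5, 3, s=2, p=1))  # 3
print(conv_output_size(7, 3, s=1, p=0))  # 5
print(conv_output_size(6, 3, s=2, p=0))  # 2, since floor(3 / 2) = 1
```

The result equals the number of kernel start positions `0, s, 2s, ...` that keep the kernel inside the padded input.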

For pooling without zero padding, setting \(p = 0\) in (2) gives the output size:

(3)\[o = \left\lfloor{\frac{i-k}{s}}\right\rfloor + 1 \,. \]

explanation:

The convolution operation can be rewritten as a matrix multiplication.
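A one-dimensional sketch of this rewriting: each row of the matrix holds a copy of the kernel shifted by the stride, so multiplying the matrix by the input reproduces the sliding-window result. The function names and sample data are hypothetical, there is no padding, and, as is usual in deep learning, "convolution" here means cross-correlation (the kernel is not flipped).

```python
def conv1d_direct(x, kernel, s=1):
    """Sliding-window cross-correlation without padding."""
    k = len(kernel)
    return [sum(kernel[t] * x[start + t] for t in range(k))
            for start in range(0, len(x) - k + 1, s)]

def conv1d_matrix(n, kernel, s=1):
    """Build the (o x n) matrix whose rows hold the kernel at each offset."""
    k = len(kernel)
    rows = []
    for start in range(0, n - k + 1, s):
        row = [0.0] * n
        row[start:start + k] = kernel
        rows.append(row)
    return rows

x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
matrix = conv1d_matrix(len(x), kernel)
as_matmul = [sum(m * v for m, v in zip(row, x)) for row in matrix]
print(as_matmul)                 # [-2.0, -2.0, -2.0]
print(conv1d_direct(x, kernel))  # [-2.0, -2.0, -2.0]
```

The matrix has \(o\) rows, matching equation (2) with \(p = 0\): here \(\lfloor (5 - 3)/1 \rfloor + 1 = 3\).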

The dilation "rate" is controlled by an additional hyperparameter \(d\). A kernel of size \(k\) dilated by a factor \(d\) has an effective size:

\[\hat{k} = k + (k-1)(d-1) \,. \]

Combining this with (2), the output size is:

(4)\[o = \left\lfloor{\frac{i + 2p - k - (k-1)(d-1)}{s}}\right\rfloor + 1 \,. \]
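A small sketch of the effective kernel size and equation (4), again for a single axis with integer sizes. The function names are hypothetical; with \(d = 1\) the result reduces to equation (2).

```python
def effective_kernel_size(k, d=1):
    """Effective size k + (k - 1)(d - 1) of a kernel dilated by d."""
    return k + (k - 1) * (d - 1)

def dilated_conv_output_size(i, k, s=1, p=0, d=1):
    """Output size floor((i + 2p - k - (k - 1)(d - 1)) / s) + 1."""
    return (i + 2 * p - effective_kernel_size(k, d)) // s + 1

print(effective_kernel_size(3, d=2))        # 5
print(dilated_conv_output_size(7, 3, d=2))  # 3
print(dilated_conv_output_size(7, 3, d=1))  # 5
```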

11.3. NLP

encoder-decoder model architecture:

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.

  • Decoder-only models: Good for generative tasks such as text generation.

  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.

There are three main decoding methods: greedy search, beam search, and sampling.
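The three methods can be compared on a toy example. The sketch below assumes a hypothetical next-token table `MODEL` that depends only on the previous token; real language models condition on the whole prefix. The beam search shown uses plain summed log-probabilities without length normalization, and the sampling is unmodified ancestral sampling (no temperature, top-k, or top-p).

```python
import math
import random

# Hypothetical toy model: next-token probabilities given the previous token.
MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.4, "a": 0.3, "b": 0.3},
    "b": {"</s>": 0.9, "a": 0.05, "b": 0.05},
}

def greedy_search(model, max_len=10):
    """Always pick the most probable next token."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) <= max_len:
        probs = model[tokens[-1]]
        tokens.append(max(probs, key=probs.get))
    return tokens[1:]

def beam_search(model, beam_width=2, max_len=10):
    """Keep the beam_width partial sequences with the highest log-probability."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished: carry over
                continue
            for token, p in model[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0][1][1:]

def sample(model, rng, max_len=10):
    """Draw each next token at random according to its probability."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) <= max_len:
        probs = model[tokens[-1]]
        tokens.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return tokens[1:]

print(greedy_search(MODEL))             # ['a', '</s>']
print(beam_search(MODEL))               # ['b', '</s>']
print(sample(MODEL, random.Random(0)))  # varies with the seed
```

In this table greedy search commits to `a` (probability 0.6) and ends with a sequence of probability 0.24, while a beam of width 2 keeps `b` alive and finds the sequence with probability 0.36.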