11. deep learning miscellanies
11.1. Derivation of backpropagation
11.1.1. Conventions
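Here, \(z^l_j\) denotes the pre-activation value of the \(j\)-th neuron in layer \(l\), \(w^l_{jk}\) is the weight connecting the \(j\)-th neuron of layer \(l\) and the \(k\)-th neuron of layer \(l+1\), \(a^l_k\) denotes the post-activation value of the \(k\)-th neuron in layer \(l\), \(b^l_j\) is the bias, and \(\sigma\) is the activation function.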
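Note that this differs from the convention of equation (23) in the book http://neuralnetworksanddeeplearning.com/chap2.html : here the weights, the biases, and the neuron values are all assigned to the same layer.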
Equation (1) written in matrix form is:
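A minimal NumPy sketch of this forward relation, assuming the matrix form is \(z^{l+1} = w^l a^l + b^l\) with \(a^l = \sigma(z^l)\), a sigmoid activation, and \(w^l\) stored as an array of shape (neurons in layer \(l+1\), neurons in layer \(l\)). The function and variable names are illustrative and do not come from the text above.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation (an assumed choice for sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, a0):
    """Forward pass under the assumed convention z^{l+1} = W^l a^l + b^l.

    The input a0 is used directly as the activation of the first layer.
    Returns the list of pre-activations and the list of activations.
    """
    activations = [a0]
    zs = []
    a = a0
    for W, b in zip(weights, biases):
        z = W @ a + b          # weights and bias of layer l produce z of layer l+1
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical layer widths
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

zs, activations = forward(weights, biases, rng.standard_normal(sizes[0]))
print([z.shape for z in zs])
```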
11.1.2. Derivation
We take \(C\) to be the loss function and define:
We denote the Hadamard product, i.e. elementwise multiplication (no summation over the repeated index), by:
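As a small illustration of the elementwise product, the following NumPy sketch contrasts it with the inner product, which does sum over the repeated index. The arrays `s` and `t` are made-up example values.

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])

hadamard = s * t   # elementwise: component j is s_j * t_j, no sum over j
inner = s @ t      # for contrast: sum over j of s_j * t_j

print(hadamard)    # [ 4. 10. 18.]
print(inner)       # 32.0
```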
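From equation (1), the partial derivative (gradient) with respect to the bias \(b\) follows directly:
In matrix form this reads:
The derivative with respect to the weight \(w\) is:
In matrix form this reads: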
The relation between \(\delta^l\) of layer \(l\) and \(\delta^{l+1}\) of layer \(l+1\) is:
In matrix form this reads:
It follows that:
11.1.3. Summary of the BP algorithm
The BP algorithm can be summarized by the following four relations:
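A sketch of what these four relations look like in matrix form, under the assumptions that equation (1) reads \(z^{l+1} = w^l a^l + b^l\) with \(a^l = \sigma(z^l)\), that \(\delta^l \equiv \partial C / \partial z^l\), that \(\odot\) is the Hadamard product, and that \(L\) labels the last layer. The index placement is an assumption made to match the conventions above and follows the pattern of the corresponding relations in the cited book:

\[
\begin{aligned}
\delta^{L} &= \nabla_{a^{L}} C \odot \sigma'(z^{L}), \\
\delta^{l} &= \left( (w^{l})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l}), \\
\frac{\partial C}{\partial b^{l}} &= \delta^{l+1}, \\
\frac{\partial C}{\partial w^{l}} &= \delta^{l+1} \, (a^{l})^{T}.
\end{aligned}
\]

The second relation comes from the chain rule: \(\partial C / \partial z^l_k = \sum_j \delta^{l+1}_j \, w^l_{jk} \, \sigma'(z^l_k)\), since \(z^{l+1}_j\) depends on \(z^l_k\) only through \(a^l_k = \sigma(z^l_k)\).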
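As can be seen, \(\delta^{l+1}\) yields the partial derivatives with respect to the weights and biases of layer \(l\), as well as the partial derivatives with respect to the pre-activation neuron values of layer \(l\).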
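The backward recursion can be sketched in NumPy and checked against a finite-difference gradient. This is a sketch under stated assumptions, not the derivation's own code: it assumes the forward relation \(z^{l+1} = w^l a^l + b^l\), a sigmoid activation, and a quadratic cost \(C = \frac{1}{2}\lVert a^L - y \rVert^2\). All function names and layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(weights, biases, x, y):
    """Quadratic cost (an assumed choice of C)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(weights, biases, x, y):
    """Return dC/dW and dC/db for every layer."""
    # Forward pass: zs[l] is produced by weights[l] acting on activations[l].
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # Delta of the output layer for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads_b[l] = delta                             # dC/db = delta of next layer
        grads_W[l] = np.outer(delta, activations[l])   # dC/dW = delta a^T
        if l > 0:
            # Propagate delta one layer back (elementwise product with sigma').
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
x, y = rng.standard_normal(3), rng.standard_normal(2)

grads_W, grads_b = backprop(weights, biases, x, y)

# Central finite difference on one weight entry as a sanity check.
eps = 1e-6
weights[0][1, 2] += eps
c_plus = cost(weights, biases, x, y)
weights[0][1, 2] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[0][1, 2] += eps  # restore the original value
numeric = (c_plus - c_minus) / (2 * eps)
print(abs(numeric - grads_W[0][1, 2]))
```

The printed difference should be tiny compared with the gradient itself, which is a quick way to catch an index or transpose mistake in the recursion.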
11.2. convolution arithmetic
Let the input size be \(i\), the convolution kernel size \(k\), the stride \(s\), and the zero padding \(p\). Then the output size \(o\) is:
\[ o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1 \tag{2} \]
The floor function \(\lfloor \cdot \rfloor\) is described at https://en.wikipedia.org/wiki/Floor_and_ceiling_functions.
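A small sketch of the output-size rule, assuming the standard relation \(o = \lfloor (i + 2p - k)/s \rfloor + 1\) along one spatial axis. The function name `conv_output_size` is illustrative.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis for input i, kernel k, stride s, padding p."""
    if s < 1:
        raise ValueError("stride must be at least 1")
    if i + 2 * p < k:
        raise ValueError("kernel does not fit into the padded input")
    # Integer floor division implements the floor in the formula.
    return (i + 2 * p - k) // s + 1

print(conv_output_size(32, 5))            # 28
print(conv_output_size(5, 3, s=2, p=1))   # 3
print(conv_output_size(6, 3, s=2, p=1))   # 3
```

The last two calls give the same output size for different input sizes, which is the effect of the floor when the stride does not divide \(i + 2p - k\) evenly.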
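According to (2), the pooling output size (pooling uses no zero padding, so \(p = 0\)) is:
\[ o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1 \]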
- explanation:
The convolution operation can be rewritten as a matrix multiplication.
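A minimal one-dimensional sketch of this idea, assuming stride 1, no padding, and the deep learning meaning of "convolution" (a sliding dot product without kernel flipping). The helper `conv_matrix_1d` is a hypothetical name. Each row of the matrix holds the kernel shifted by one position.

```python
import numpy as np

def conv_matrix_1d(kernel, input_size):
    """Build the (o, i) matrix whose product with the input equals the convolution."""
    k = len(kernel)
    o = input_size - k + 1          # output size for stride 1, no padding
    M = np.zeros((o, input_size))
    for r in range(o):
        M[r, r:r + k] = kernel      # kernel placed at offset r, zeros elsewhere
    return M

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])

M = conv_matrix_1d(w, len(x))
y_matmul = M @ x
y_direct = np.correlate(x, w, mode="valid")  # sliding dot product for comparison

print(y_matmul)   # [-2. -2. -2.]
print(np.allclose(y_matmul, y_direct))  # True
```

In two dimensions the same construction applies after flattening the input into a vector; the matrix is sparse, with the kernel weights repeated along shifted positions.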
The dilation "rate" is controlled by an additional hyperparameter \(d\). A kernel of size \(k\) dilated by a factor \(d\) has an effective size:
\[ \hat{k} = k + (k - 1)(d - 1) \]
Combined with (2), the output size is:
\[ o = \left\lfloor \frac{i + 2p - k - (k - 1)(d - 1)}{s} \right\rfloor + 1 \]
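The dilated case can be sketched by first computing the effective kernel size and then reusing the ordinary output-size rule. This assumes the standard relations \(\hat{k} = k + (k-1)(d-1)\) and \(o = \lfloor (i + 2p - \hat{k})/s \rfloor + 1\); the function names are illustrative.

```python
def effective_kernel_size(k, d=1):
    """Span covered by a kernel of size k with dilation d (d = 1 means no dilation)."""
    return k + (k - 1) * (d - 1)

def dilated_conv_output_size(i, k, s=1, p=0, d=1):
    """Output size along one axis for a dilated convolution."""
    k_eff = effective_kernel_size(k, d)
    if i + 2 * p < k_eff:
        raise ValueError("dilated kernel does not fit into the padded input")
    return (i + 2 * p - k_eff) // s + 1

print(effective_kernel_size(3, d=2))            # 5
print(dilated_conv_output_size(7, 3, d=2))      # 3
print(dilated_conv_output_size(7, 3, d=1))      # 5
```

With \(d = 1\) the effective size equals \(k\), so the function reduces to the undilated rule.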
11.3. NLP
encoder-decoder model architecture:
Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
Decoder-only models: Good for generative tasks such as text generation.
Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
There are three main decoding methods: greedy search, beam search, and sampling.
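The three methods can be compared on a toy example. The sketch below assumes a hypothetical "language model" in which the next-token distribution depends only on the previous token (the table `P` and the vocabulary are made up for illustration). Real decoders work with a neural model's output distribution instead, and beam search is often combined with length normalization, which is omitted here.

```python
import numpy as np

VOCAB = ["<s>", "a", "b", "</s>"]
EOS = 3
# P[prev, next]: made-up next-token probabilities; each row sums to 1.
P = np.array([
    [0.0, 0.60, 0.40, 0.0],   # after <s>
    [0.0, 0.10, 0.40, 0.5],   # after a
    [0.0, 0.05, 0.05, 0.9],   # after b
    [0.0, 0.00, 0.00, 1.0],   # after </s>
])

def greedy(max_len=5):
    """Always pick the single most probable next token."""
    seq = [0]
    while len(seq) < max_len and seq[-1] != EOS:
        seq.append(int(np.argmax(P[seq[-1]])))
    return seq

def beam_search(width=2, max_len=5):
    """Keep the `width` best partial sequences by total log-probability."""
    beams = [([0], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == EOS:          # finished sequences are carried over
                candidates.append((seq, logp))
                continue
            for tok, p in enumerate(P[seq[-1]]):
                if p > 0:
                    candidates.append((seq + [tok], logp + np.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0][0]

def sample(rng, max_len=5):
    """Draw each next token at random from the model distribution."""
    seq = [0]
    while len(seq) < max_len and seq[-1] != EOS:
        seq.append(int(rng.choice(len(VOCAB), p=P[seq[-1]])))
    return seq

def decode(seq):
    return " ".join(VOCAB[t] for t in seq)

print(decode(greedy()))        # <s> a </s>
print(decode(beam_search()))   # <s> b </s>
print(decode(sample(np.random.default_rng(0))))
```

Greedy search commits to "a" because it is the most probable first token, ending with total probability 0.6 × 0.5 = 0.30, while beam search with width 2 keeps "b" alive and finds the sequence with probability 0.4 × 0.9 = 0.36. Sampling gives different outputs for different random seeds.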