****************************** deep learning miscellanies ****************************** Backpropagation的推导 ===================== 约定 ^^^^ .. math:: z^{l+1}_j =\sum_k w^l_{jk} a^l_k + b^l_j, \quad a^l_j=\sigma(z^l_j) :label: explicitform 其中,:math:`z^l_j` 表示未激活前第 :math:`l` 层、第 :math:`j` 个神经元的值,:math:`w^l_{jk}` 为连接第 :math:`l` 层第 :math:`j` 个神经元和第 :math:`l+1` 层第k个神经元的权重, :math:`a^l_k` 表示激活后的第 :math:`l` 层第 :math:`k` 个神经元的值, :math:`b^l_j` 为偏移量bias,:math:`\sigma` 为激活函数。 注意和书籍 http://neuralnetworksanddeeplearning.com/chap2.html 中公式(23)的约定由区别,我们把weight和bias和神经元的值放到同一层中。 公式 :eq:`explicitform` 写成矩阵形式为: .. math:: z^{l+1}=w^l a^l + b^l, \quad a^l=\sigma(z^l) 公式推导 ^^^^^^^^ 我们约定C为损失函数(loss function),并记: .. math:: \delta^l = \frac{\partial C}{\partial z^l} 约定 **Hadamard product** 或者elementwise相乘为(重复指标不求和): .. math:: u\odot v = u_i * v_i 根据公式 :eq:`explicitform` 可以直接得出对偏移量 :math:`b` 的偏导数(梯度): .. math:: \frac{\partial C}{\partial b^l_j} = \sum_i \frac{\partial C}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial b^l_j} = \frac{\partial C}{z^{l+1}_j} = \delta^{l+1}_j 上式写成矩阵形式为: .. math:: \frac{\partial C}{\partial b^l} = \delta^{l+1} 对权重 :math:`w` 的求导为: .. math:: \frac{\partial C}{\partial w^l_{jk}} = \sum_i \frac{\partial C}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^{l+1}_j} a^l_k = \delta^{l+1}_j a^l_k 上式写成矩阵形式为: .. math:: \frac{\partial C}{\partial w^l} = \delta^{l+1} (a^l)^T :math:`l` 层 :math:`\delta^l` 和 :math:`l+1` 层的 :math:`\delta^{l+1}` 的关系为: .. math:: \frac{\partial C}{\partial z^l_j} = \sum_{i,k} \frac{\partial C}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial a^l_k} \frac{\partial a^l_k}{\partial z^l_j} = \sum_i \delta^{l+1}_i w^l_{ij} \sigma^{'}(z^l_j) 上式写成矩阵形式为: .. math:: \delta^l = (w^l)^T \delta^{l+1}\odot\sigma^{'}(z^l) 可以看出: .. math:: \nabla_a C = (w^l)^T \delta^{l+1} BP算法总结 ^^^^^^^^^^ BP算法可以概括为以下四个关系式: .. math:: \begin{aligned} \delta^l &= \frac{\partial C}{\partial z^l} = \nabla_z C \\ \frac{\partial C}{\partial w^l} &= \delta^{l+1} (a^l)^T \\ \frac{\partial C}{\partial b^l} &= \delta^{l+1} \\ \delta^l &= (w^l)^T \delta^{l+1}\odot\sigma^{'}(z^l) \end{aligned} 可以看出,可以从 :math:`\delta^{l+1}` 的推导出对第 :math:`l` 层的权重和偏移量的偏导,以及第 :math:`l` 层的未激活前的神经元的偏导。 convolution arithmetic ======================== :reference: - https://github.com/vdumoulin/conv_arithmetic - https://arxiv.org/abs/1603.07285 1. convolution -------------- Set input data size :math:`i`, convolution kernel size :math:`k`, stride size :math:`s`, and zero padding size :math:`p`. Then the output size :math:`o` is: .. math:: o = \left\lfloor{\frac{i + 2p - k}{s}}\right\rfloor + 1 \,. :label: conv The floor function :math:`{\lfloor}\,{\rfloor}` can found at https://en.wikipedia.org/wiki/Floor_and_ceiling_functions. 2. pooling ---------- According to :eq:`conv`, pooling output size is: .. math:: o = \left\lfloor{\frac{i-k}{s}}\right\rfloor + 1 \,. :label: pooling 3. tansposed convolution ------------------------ :explanation: The convolution operation can be rewritten to matrix multiplication. 4. dilated convolution ----------------------- The dilation "rate" is controlled by an additional hyperparameter :math:`d`. A kernel of size k dilated by a factor d has an effective size: .. math:: \hat{k} = k + (k-1)(d-1) \,. Combined with :eq:`conv` the output size is: .. math:: o = \left\lfloor{\frac{i + 2p - k - (k-1)(d-1)}{s}}\right\rfloor + 1 \,. :label: dilatedconv NLP ===== encoder-decoder model architecture: * **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition. * **Decoder-only models**: Good for generative tasks such as text generation. * **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization. Decoding Methods ****************** There are mainly tree decoding methods: ``Greedy search``, ``Beam search``, and ``Sampling``.