
Paper Reading

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Lei Yang - SenseTime
Aug 19 2015

Quiz show

BN was proposed by Sergey Ioffe and Christian Szegedy. Which one of the following papers was also published by Christian Szegedy?

A. (DeepID2) Deep Learning Face Representation by Joint Identification-Verification
B. (Joint Bayesian) Bayesian Face Revisited: A Joint Formulation
C. Robust Multi-Resolution Pedestrian Detection in Traffic Scenes
D. RASL: Robust Alignment by Sparse and Low-rank Decomposition for Linearly Correlated Images
E. (GoogLeNet) Going Deeper with Convolutions
  • E

What's normalization?

What's the normalization used before BN?

  • Caffe
    • Local Response Normalization
    • Mean-Variance Normalization
  • MatConvNet
    • Cross-Channel Normalization
    • Spatial Normalization

Cross-Channel Normalization

For each output channel k

G(k) ⊆ {1, 2, ..., D} is a corresponding subset of the input channels:

$$ y_{ijk} = x_{ijk} \left( \kappa + \alpha \sum_{m \in G(k)} x_{ijm}^{2} \right)^{-\beta} $$

Spatial Normalization

For each output channel k

$$ n_{ijk}^{2} = \frac{1}{H'W'} \sum_{1 \le i' \le H',\; 1 \le j' \le W'} x_{\,i + i' - 1 - \lfloor (H'-1)/2 \rfloor,\; j + j' - 1 - \lfloor (W'-1)/2 \rfloor,\; k}^{2} $$

$$ y_{ijk} = x_{ijk} \left( 1 + \alpha\, n_{ijk}^{2} \right)^{-\beta} $$

Local Response Normalization

Two modes:

  • ACROSS_CHANNEL
    • across nearby channels
    • but not over the spatial extent
  • WITHIN_CHANNEL
    • extends spatially
    • but in separate channels

$$ y_i = \frac{x_i}{\left( 1 + \frac{\alpha}{n} \sum_{i} x_i^{2} \right)^{\beta}} $$

where n is the size of each local region.

Mean-Variance Normalization

Two modes:

  • ACROSS_CHANNEL
    • across nearby channels
    • but not over the spatial extent
  • WITHIN_CHANNEL
    • extends spatially
    • but in separate channels

$$ y_i = \frac{x_i - \mu(x)}{\sqrt{\epsilon + \frac{1}{n} \sum_{i} \left( x_i - \mu(x) \right)^{2}}} $$

What's batch normalization?

Motivation

Problem: internal covariate shift

Change in the distribution of network activations due to the change in network parameters during training


Motivation

Idea: ensure the distribution of nonlinearity inputs remains more stable

$$ \hat{x} = \frac{x - E[x]}{\sqrt{\mathrm{Var}[x]}} $$

$$ E[\hat{x}] = 0, \qquad \mathrm{Var}[\hat{x}] = 1 $$

Forward

$$ x, y \in \mathbb{R}^{H \times W \times K \times M}, \qquad \gamma, \beta \in \mathbb{R}^{K} $$

$$ \mu_k = \frac{1}{HWM} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{m=1}^{M} x_{ijkm} $$

$$ \sigma_k^{2} = \frac{1}{HWM} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{m=1}^{M} \left( x_{ijkm} - \mu_k \right)^{2} $$

$$ \hat{x}_{ijkm} = \frac{x_{ijkm} - \mu_k}{\sqrt{\sigma_k^{2} + \epsilon}} $$

$$ y_{ijkm} = \gamma_k \hat{x}_{ijkm} + \beta_k $$


Backward

For one feature map:

$$ N = H \times W \times M $$

$$ \frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{N} \frac{\partial \ell}{\partial y_i} \hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial \ell}{\partial y_i} $$

The parameters of the BN layer (γ and β) can be updated using the equations above.

Backward

For one feature map k:

$$ \frac{\partial \ell}{\partial x_{ijkm}} = \sum_{i', j', m'} \frac{\partial \ell}{\partial y_{i'j'km'}} \frac{\partial y_{i'j'km'}}{\partial x_{ijkm}} $$

$$ \frac{\partial y_{ijkm}}{\partial x_{ijkm}} = \gamma_k \left( \left( 1 - \frac{\partial \mu_k}{\partial x_{ijkm}} \right) \frac{1}{\sqrt{\sigma_k^{2} + \epsilon}} - \frac{1}{2} \left( x_{ijkm} - \mu_k \right) \left( \sigma_k^{2} + \epsilon \right)^{-3/2} \frac{\partial \sigma_k^{2}}{\partial x_{ijkm}} \right) $$

$$ \frac{\partial \mu_k}{\partial x_{ijkm}} = \frac{1}{HWM} = \frac{1}{N} $$

$$ \frac{\partial \sigma_k^{2}}{\partial x_{ijkm}} = \frac{2}{N} \left( x_{ijkm} - \mu_k \right) $$

The gradient (diff) can be propagated down through the BN layer using the equations above.

Parameters of BN in practice

Network: xd_net_12m


Quiz show

The picture below describes what kind of normalization?

A. Cross-Channel Normalization
B. Spatial Normalization
C. Batch Normalization
D. Local Response Normalization
E. Mean-Variance Normalization
  • C

Quiz show

The number of γ parameters in a BN layer equals:

A. Batch size
B. The number of feature maps
C. The number of activations
  • B

Why batch normalization?

Experiments in the paper


Our experiments

  • CIFAR-10
  • Training samples: 50,000

Our experiments

  • DeepID2
  • Training samples: 500,000

Our experiments (from Sun Yi)

  • BN seems to be sensitive to the learning rate and weight initialization?
| Network | lr | iter | PCA dim | accuracy |
| --- | --- | --- | --- | --- |
| (tile conv) sn01, BN | 0.01 | 150000 | 300 | 0.984000 |
| (tile conv) sn02, BN | 0.05 | 150000 | 400/500/700 | 0.991167 |
| (full conv) tn03 7x6x1024(3)->7x6x256(1)->512, BN | 0.03 | 150000 | 300 | 0.989000 |
| (full conv) tn01 7x6x1024(3)->7x6x256(1)->512, BN | 0.05 | 150000 | 400 | 0.992667 |
| (full conv) tn04 7x6x1024(3)->7x6x256(1)->512, BN | 0.1 | 150000 | 800/900 | 0.990167 |
| (full conv) np04 tn01 -> no BN | 0.05 | 150000 | 400 | 0.990500 |
| (full conv) np05 tn01 -> no BN | 0.01 | 150000 | 500/800 | 0.990000 |

Our experiments (from Sun Gang)

  • Accelerates the training of VGG
  • Does not improve the accuracy (even slightly lower)
  • The improvement reported in the paper is inconclusive: the structure used in the BN paper is different from GoogLeNet

Statistics from BN's citations

<Thank You!>