
Paper Reading

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Lei Yang - SenseTime
Aug 19 2015

Quiz show

BN was proposed by Sergey Ioffe and Christian Szegedy. Which one of the following papers was also published by Christian Szegedy?

A. (DeepID2) Deep Learning Face Representation by Joint Identification-Verification
B. (Joint Bayesian) Bayesian Face Revisited: A Joint Formulation
C. Robust Multi-Resolution Pedestrian Detection in Traffic Scenes
D. RASL: Robust Alignment by Sparse and Low-rank Decomposition for Linearly Correlated Images
E. (GoogLeNet) Going Deeper with Convolutions
  • E

What's normalization?

What's the normalization used before BN?

  • Caffe
    • Local Response Normalization
    • Mean-Variance Normalization
  • MatConvNet
    • Cross-Channel Normalization
    • Spatial Normalization

Cross-Channel Normalization

For each output channel k

G(k) ⊆ {1, 2, ..., D} is a corresponding subset of the input channels:

$$ y_{ijk} = x_{ijk} \left( \kappa + \alpha \sum_{m \in G(k)} x_{ijm}^{2} \right)^{-\beta} $$

Spatial Normalization

For each output channel k

$$ n_{ijk}^{2} = \frac{1}{H'W'} \sum_{1 \le i' \le H',\; 1 \le j' \le W'} x_{\,i + i' - 1 - \lfloor (H'-1)/2 \rfloor,\; j + j' - 1 - \lfloor (W'-1)/2 \rfloor,\; k}^{2} $$

$$ y_{ijk} = x_{ijk} \left( 1 + \alpha\, n_{ijk}^{2} \right)^{-\beta} $$

Local Response Normalization

Two modes:

  • ACROSS_CHANNEL
    • across nearby channels
    • but not over the spatial extent
  • WITHIN_CHANNEL
    • extends spatially
    • but in separate channels

$$ y_i = \frac{x_i}{\left( 1 + \frac{\alpha}{n} \sum_{i} x_i^{2} \right)^{\beta}} $$

where n is the size of each local region.

Mean-Variance Normalization

Two modes:

  • ACROSS_CHANNEL
    • across nearby channels
    • but not over the spatial extent
  • WITHIN_CHANNEL
    • extends spatially
    • but in separate channels

$$ y_i = \frac{x_i - \mu(x)}{\sqrt{\epsilon + \frac{1}{n} \sum_{i} \left( x_i - \mu(x) \right)^{2}}} $$

What's batch normalization?

Motivation

Problem: internal covariate shift

Change in the distribution of network activations due to the change in network parameters during training


Motivation

Idea: ensure the distribution of nonlinearity inputs remains more stable

$$ \hat{x} = \frac{x - E[x]}{\sqrt{\mathrm{Var}[x]}} $$

$$ E[\hat{x}] = 0, \qquad \mathrm{Var}[\hat{x}] = 1 $$

Forward

$$ x, y \in \mathbb{R}^{H \times W \times K \times M}, \qquad \gamma, \beta \in \mathbb{R}^{K} $$

$$ \mu_k = \frac{1}{HWM} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{m=1}^{M} x_{ijkm} $$

$$ \sigma_k^{2} = \frac{1}{HWM} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{m=1}^{M} \left( x_{ijkm} - \mu_k \right)^{2} $$

$$ \hat{x}_{ijkm} = \frac{x_{ijkm} - \mu_k}{\sqrt{\sigma_k^{2} + \epsilon}} $$

$$ y_{ijkm} = \gamma_k \hat{x}_{ijkm} + \beta_k $$


Backward

For one feature map:

$$ N = H \times W \times M $$

$$ \frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{N} \frac{\partial \ell}{\partial y_i} \hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial \ell}{\partial y_i} $$

The parameters of the BN layer (γ and β) can be updated using the equations above.

Backward

For one feature map k:

$$ \frac{\partial \ell}{\partial x_{ijkm}} = \sum_{i', j', m'} \frac{\partial \ell}{\partial y_{i'j'km'}} \frac{\partial y_{i'j'km'}}{\partial x_{ijkm}} $$

$$ \frac{\partial y_{ijkm}}{\partial x_{ijkm}} = \gamma_k \left( \left( 1 - \frac{\partial \mu_k}{\partial x_{ijkm}} \right) \frac{1}{\sqrt{\sigma_k^{2} + \epsilon}} - \frac{1}{2} \left( x_{ijkm} - \mu_k \right) \left( \sigma_k^{2} + \epsilon \right)^{-3/2} \frac{\partial \sigma_k^{2}}{\partial x_{ijkm}} \right) $$

$$ \frac{\partial \mu_k}{\partial x_{ijkm}} = \frac{1}{HWM} = \frac{1}{N} $$

$$ \frac{\partial \sigma_k^{2}}{\partial x_{ijkm}} = \frac{2}{N} \left( x_{ijkm} - \mu_k \right) $$

The gradient (diff) can be propagated down through the BN layer using the equations above.

Parameters of BN in practice

Network: xd_net_12m


Quiz show

The picture below describes what kind of normalization?

A. Cross-Channel Normalization
B. Spatial Normalization
C. Batch Normalization
D. Local Response Normalization
E. Mean-Variance Normalization
  • C

Quiz show

The number of γ parameters in a BN layer equals:

A. Batch size
B. The number of feature maps
C. The number of activations
  • B

Why batch normalization?

Experiments in the paper


Our experiments

  • CIFAR-10
  • Training samples: 50,000

Our experiments

  • DeepID2
  • Training samples: 500,000

Our experiments (from Sun Yi)

  • BN seems to be sensitive to the learning rate and weight initialization?
| Network | lr | iter | PCA dim | accuracy |
| --- | --- | --- | --- | --- |
| (tile conv) sn01, BN | 0.01 | 150000 | 300 | 0.984000 |
| (tile conv) sn02, BN | 0.05 | 150000 | 400/500/700 | 0.991167 |
| (full conv) tn03 7x6x1024(3)->7x6x256(1)->512, BN | 0.03 | 150000 | 300 | 0.989000 |
| (full conv) tn01 7x6x1024(3)->7x6x256(1)->512, BN | 0.05 | 150000 | 400 | 0.992667 |
| (full conv) tn04 7x6x1024(3)->7x6x256(1)->512, BN | 0.1 | 150000 | 800/900 | 0.990167 |
| (full conv) np04 tn01 -> no BN | 0.05 | 150000 | 400 | 0.990500 |
| (full conv) np05 tn01 -> no BN | 0.01 | 150000 | 500/800 | 0.990000 |

Our experiments (from Sun Gang)

  • Accelerates the training of VGG
  • Does not improve the accuracy (even slightly lower)
  • The improvement reported in the paper is inconclusive: the structure used in the BN paper is different from GoogLeNet

Statistics from BN's citations

<Thank You!>