BatchNorm and LayerNorm

2025/3/20

20:23

Class torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)

Applies Batch Normalization over a 2D or 3D input.

Parameters
num_features (int) – number of features or channels C of the input
eps (float) – a value added to the denominator for numerical stability. Default: 1e-5
momentum (Optional[float]) – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
affine (bool) – a boolean value that when set to True, this module has learnable affine parameters. Default: True
track_running_stats (bool) – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and initializes statistics buffers running_mean and running_var as None. When these buffers are None, this module always uses batch statistics in both training and eval modes. Default: True

Shape:
Input: (N,C) or (N,C,L), where N is the batch size, C is the number of features or channels, and L is the sequence length
Output: (N,C) or (N,C,L) (same shape as input)
Examples:
>>> # With Learnable Parameters
>>> m = nn.BatchNorm1d(100)
>>> # Without Learnable Parameters
>>> m = nn.BatchNorm1d(100, affine=False)
>>> input = torch.randn(20, 100)
>>> output = m(input)



Class torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)

Applies Batch Normalization over a 4D input.
A 4D input is a mini-batch of 2D inputs with an additional channel dimension.
Because Batch Normalization is done over the C dimension, computing statistics on (N, H, W) slices, it is common terminology to call this Spatial Batch Normalization (the sketch after the example code below reproduces this by hand).

Parameters
num_features (int) – C from an expected input of size (N,C,H,W)
eps (float) – a value added to the denominator for numerical stability. Default: 1e-5
momentum (Optional[float]) – the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
affine (bool) – a boolean value that when set to True, this module has learnable affine parameters. Default: True
track_running_stats (bool) – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and initializes statistics buffers running_mean and running_var as None. When these buffers are None, this module always uses batch statistics in both training and eval modes. Default: True
Shape:
Input: (N,C,H,W)
Output: (N,C,H,W) (same shape as input)

Examples:
>>> # With Learnable Parameters
>>> m = nn.BatchNorm2d(100)
>>> # Without Learnable Parameters
>>> m = nn.BatchNorm2d(100, affine=False)
>>> input = torch.randn(20, 100, 35, 45)
>>> output = m(input)
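To make the "statistics on (N, H, W) slices" concrete, here is a minimal sketch (my own check, not from the PyTorch docs) that reproduces BatchNorm2d's training-mode output by hand using per-channel statistics over the batch and spatial dimensions:

import torch
import torch.nn as nn

x = torch.randn(20, 100, 35, 45)                           # (N, C, H, W)
bn = nn.BatchNorm2d(100, affine=False).train()             # no gamma/beta; use batch statistics
out = bn(x)
mu = x.mean(dim=(0, 2, 3), keepdim=True)                   # one mean per channel over (N, H, W)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # biased variance, as BatchNorm uses for normalization
manual = (x - mu) / torch.sqrt(var + bn.eps)
print(torch.allclose(out, manual, atol=1e-5))              # expected: True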

Class torch.nn.InstanceNorm1d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False, device=None, dtype=None)

Applies Instance Normalization.
This operation applies Instance Normalization over a 2D (unbatched) or 3D (batched) input as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization.
Parameters
num_features (int) – number of features or channels C of the input
eps (float) – a value added to the denominator for numerical stability. Default: 1e-5
momentum (Optional[float]) – the value used for the running_mean and running_var computation. Default: 0.1
affine (bool) – a boolean value that when set to True, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default: False.
track_running_stats (bool) – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: False
Shape:
Input: (N,C,L) or (C,L)
Output: (N,C,L) or (C,L) (same shape as input)
Examples:
>>> # Without Learnable Parameters
>>> m = nn.InstanceNorm1d(100)
>>> # With Learnable Parameters
>>> m = nn.InstanceNorm1d(100, affine=True)
>>> input = torch.randn(20, 100, 40)
>>> output = m(input)



Class torch.nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False, device=None, dtype=None)

Applies Instance Normalization.
This operation applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization (https://arxiv.org/abs/1607.08022).

Parameters
num_features (int) – C from an expected input of size (N,C,H,W) or (C,H,W)
eps (float) – a value added to the denominator for numerical stability. Default: 1e-5
momentum (Optional[float]) – the value used for the running_mean and running_var computation. Default: 0.1
affine (bool) – a boolean value that when set to True, this module has learnable affine parameters, initialized the same way as done for batch normalization. Default: False.
track_running_stats (bool) – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: False
Shape:
Input: (N,C,H,W) or (C,H,W)
Output: (N,C,H,W) or (C,H,W) (same shape as input)
Examples:
>>> # Without Learnable Parameters
>>> m = nn.InstanceNorm2d(100)
>>> # With Learnable Parameters
>>> m = nn.InstanceNorm2d(100, affine=True)
>>> input = torch.randn(20, 100, 35, 45)
>>> output = m(input)


Both BatchNorm and InstanceNorm take num_features as their parameter, i.e. the number of channels (or, for a 2D tensor, the number of features). Both compute the mean and variance independently for each channel. The difference is that for a given channel A, BatchNorm computes the statistics over all samples in the batch, whereas InstanceNorm computes them over a single sample. Both BatchNorm and InstanceNorm have two learnable parameters, each of shape num_features.
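A minimal sketch of this difference (my own illustration; the shapes are assumed): both modules carry per-channel learnable parameters, but BatchNorm pools statistics over the whole batch while InstanceNorm computes them per sample.

import torch
import torch.nn as nn

x = torch.randn(20, 100, 40)                    # (N, C, L)
bn = nn.BatchNorm1d(100, affine=True)
inorm = nn.InstanceNorm1d(100, affine=True)
print(bn.weight.shape, inorm.weight.shape)      # torch.Size([100]) torch.Size([100])
# BatchNorm: for channel A, statistics are computed over the N*L values of the whole batch
# InstanceNorm: for channel A, statistics are computed over the L values of one sample alone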

★ For InstanceNorm, if a feature is a scalar rather than a vector, InstanceNorm cannot be applied: with only one number there is nothing to average; only a vector has a mean. Therefore, for InstanceNorm1d, a 2D input is treated as unbatched, i.e. the 2D input is a single sample. BatchNorm1d does not have this problem: when its input is 2D, the input is treated as batched.
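A small sketch of this point (my own example; it assumes a PyTorch version recent enough that InstanceNorm1d accepts unbatched (C, L) input, as in the shape table above): the same 2D tensor is interpreted as a batch by BatchNorm1d but as a single sample by InstanceNorm1d.

import torch
import torch.nn as nn

x = torch.randn(20, 100)
bn_out = nn.BatchNorm1d(100)(x)       # interprets x as (N=20, C=100): 20 samples, 100 scalar features
in_out = nn.InstanceNorm1d(20)(x)     # interprets x as (C=20, L=100): one unbatched sample, 20 vector features
print(bn_out.shape, in_out.shape)     # both torch.Size([20, 100])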



InstanceNorm cannot be used for NLP; BatchNorm can.
For NLP, the input has shape [batch_size, seq_len, embedding_len (hidden_size)]. To apply BatchNorm1d, flatten the first two dimensions, i.e. reshape into a 2D tensor of shape [batch_size * seq_len, embedding_len], and use BatchNorm1d. There are two ways to do this, and they give the same result (see the verification sketch after Method 2).
Method 1:
feature = torch.randn(4, 2, 5)       # [batch, seq_len, hidden_size]
feature = feature.transpose(1, 2)    # [batch, hidden_size, seq_len]; the input is 3D

bn = nn.BatchNorm1d(5, eps=1e-5)     # hidden_dim
output = bn(feature)                 # [batch, hidden_size, seq_len]
output = output.transpose(1, 2)      # back to [batch, seq_len, hidden_size]

Method 2:
feature = torch.randn(4, 2, 5)       # [batch, seq_len, hidden_size]
feature = feature.reshape(4 * 2, 5)  # [batch * seq_len, hidden_size]; the input is 2D

bn = nn.BatchNorm1d(5, eps=1e-5)     # hidden_dim
output = bn(feature)
output = output.reshape(4, 2, 5)     # back to [batch, seq_len, hidden_size]
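A quick sketch (my own check) that the two methods produce the same result; in training mode BatchNorm1d normalizes with the current batch statistics, so the two paths can be compared directly:

import torch
import torch.nn as nn

torch.manual_seed(0)
feature = torch.randn(4, 2, 5)                            # [batch, seq_len, hidden_size]
bn = nn.BatchNorm1d(5, eps=1e-5)
out1 = bn(feature.transpose(1, 2)).transpose(1, 2)        # Method 1: 3D input (N, C, L)
out2 = bn(feature.reshape(4 * 2, 5)).reshape(4, 2, 5)     # Method 2: 2D input (N*L, C)
print(torch.allclose(out1, out2, atol=1e-6))              # expected: True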

Summary: BatchNorm operates per feature. For each feature it computes the mean and standard deviation over all samples in the batch, then normalizes and applies the affine scale and shift. BatchNorm has four per-feature quantities: gamma, beta, running_mean and running_var, with gamma.shape == beta.shape == running_mean.shape == running_var.shape == (num_features,) (see the sketch after this summary).
For images, the number of features equals the number of channels: one image is one sample, a sample's number of features equals the image's number of channels, and a single feature of a single sample is a 2D tensor.
For NLP, the number of features equals the token embedding size (hidden_size): one token is one sample, a sample's number of features equals hidden_size, and a single feature of a single sample is a scalar.
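A minimal sketch (my own illustration) inspecting these four per-feature quantities:

import torch.nn as nn

bn = nn.BatchNorm2d(100)        # num_features = 100
print(bn.weight.shape)          # gamma: torch.Size([100])
print(bn.bias.shape)            # beta: torch.Size([100])
print(bn.running_mean.shape)    # torch.Size([100])
print(bn.running_var.shape)     # torch.Size([100])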

★ Why isn't BatchNorm used in Transformers?
Within a batch the text lengths differ, so sequences must be padded to max_length. Precisely because one token is one sample, some of these "samples" are padding tokens, and the padding distorts the computation of the mean and variance.
The fundamental issue is that padding tokens, although necessary for alignment, are not part of the original data. They introduce a lot of zeros into the dataset, which can mislead the normalization process. This is why batch normalization is not applied in the context of self-attention in transformers.
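A toy sketch of the effect (my own illustration with made-up shapes): flattening padded sequences into a token batch lets the zero padding positions pull the per-feature statistics toward zero.

import torch

hidden = 5
real = torch.randn(3, 4, hidden)           # 3 sequences, 4 real tokens each
padded = torch.zeros(3, 6, hidden)         # padded to max_length = 6
padded[:, :4] = real
print(real.reshape(-1, hidden).mean(0))    # per-feature mean over real tokens only
print(padded.reshape(-1, hidden).mean(0))  # shrunk toward 0 by the padding tokens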


When BatchNorm computes the per-batch statistics, note Bessel's correction for the variance: the current batch is normalized with the biased variance, but the running_var buffer is updated with the unbiased estimate (equivalent to torch.var(input, unbiased=True) in PyTorch).
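A small sketch (my own check) of this behavior: with momentum=1.0 the running_var buffer after one forward pass equals the current batch's unbiased variance, even though the batch itself was normalized with the biased variance.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3)
bn = nn.BatchNorm1d(3, momentum=1.0).train()   # momentum=1.0: running_var is overwritten by the batch statistic
_ = bn(x)
print(torch.allclose(bn.running_var, x.var(0, unbiased=True)))    # expected: True (Bessel-corrected)
print(torch.allclose(bn.running_var, x.var(0, unbiased=False)))   # expected: False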

Class torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, device=None, dtype=None)

The mean and standard-deviation are calculated over the last D dimensions, where D is the dimension of normalized_shape. For example, if normalized_shape is (3, 5) (a 2-dimensional shape), the mean and standard-deviation are computed over the last 2 dimensions of the input (i.e. input.mean((-2, -1))). γ and β are learnable affine transform parameters of normalized_shape if elementwise_affine is True. The variance is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False).

This layer uses statistics computed from input data in both training and evaluation modes.
Parameters
normalized_shape (int or list or torch.Size) –
input shape from an expected input of size
[∗ × normalized_shape[0] × normalized_shape[1] × … × normalized_shape[−1]]
If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size.
eps (float) – a value added to the denominator for numerical stability. Default: 1e-5
elementwise_affine (bool) – a boolean value that when set to True, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases). Default: True.
bias (bool) – If set to False, the layer will not learn an additive bias (only relevant if elementwise_affine is True). Default: True.
Variables
weight – the learnable weights of the module of shape normalized_shape when elementwise_affine is set to True. The values are initialized to 1.
bias – the learnable bias of the module of shape normalized_shape when elementwise_affine is set to True. The values are initialized to 0.

Shape:
Input: (N,∗)
Output: (N,∗) (same shape as input)
Examples:
>>> # NLP Example
>>> batch, sentence_length, embedding_dim = 20, 5, 10
>>> embedding = torch.randn(batch, sentence_length, embedding_dim)
>>> layer_norm = nn.LayerNorm(embedding_dim)
>>> # Activate module
>>> layer_norm(embedding)
>>>
>>> # Image Example
>>> N, C, H, W = 20, 5, 10, 10
>>> input = torch.randn(N, C, H, W)
>>> # Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
>>> layer_norm = nn.LayerNorm([C, H, W])
>>> output = layer_norm(input)

Unlike BatchNorm, LayerNorm does not track running (global) mean and variance statistics, so train() and eval() have no effect on LayerNorm.

normalized_shape:
If an integer is passed, e.g. 4, it is treated as a single-element list. LayerNorm then normalizes over the last dimension of the input, and this integer must equal the size of the input's last dimension.

Suppose the input has shape [3, 4]. The mean and variance are computed for each of the 3 vectors of length 4, giving 3 means and 3 variances, and each row is normalized separately (the 4 numbers in each row end up with mean 0 and variance 1). LayerNorm's weight and bias each contain 4 numbers, reused across the 3 rows to apply an affine transform to each row (multiply by the corresponding entry of weight, then add the corresponding entry of bias); they are learned during backpropagation.
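A minimal sketch (my own check) of this row-wise behavior on a [3, 4] input:

import torch
import torch.nn as nn

x = torch.randn(3, 4)
ln = nn.LayerNorm(4)
mu = x.mean(-1, keepdim=True)                            # one mean per row
var = x.var(-1, unbiased=False, keepdim=True)            # biased variance, as stated in the docs above
manual = (x - mu) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
print(torch.allclose(ln(x), manual, atol=1e-6))          # expected: True
print(ln.weight.shape, ln.bias.shape)                    # torch.Size([4]) torch.Size([4])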

If a list or torch.Size is passed, e.g. [3, 4] or torch.Size([3, 4]), the last two dimensions of the input are normalized, and the input's last two dimensions are required to have size [3, 4].

Suppose the input again has shape [3, 4]. First the mean and variance of all 12 numbers are computed, and those 12 numbers are normalized together; weight and bias each contain 12 numbers, which apply the affine transform to the 12 normalized numbers (multiply by the corresponding entry of weight, then add the corresponding entry of bias), and they are learned during backpropagation.
Suppose the input has shape [N, 3, 4]. The same operation is applied to each of the N [3, 4] slices; the only difference is that weight and bias are reused N times when applying the affine transform.
Suppose the input has shape [N, T, 3, 4]; it works the same way, and there can be even more leading dimensions.
Note: clearly, the shape of LayerNorm's weight and bias is exactly the normalized_shape that was passed in. For NLP, weight and bias have shape [embedding_dim]; for images (normalizing over [C, H, W]), weight and bias have shape [C, H, W]. A sketch of the list case follows below.
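A sketch of the list case (my own check): with normalized_shape = [3, 4], all 12 values of each sample are normalized jointly, and weight/bias have shape [3, 4] and are reused for every sample.

import torch
import torch.nn as nn

x = torch.randn(2, 3, 4)                                  # N = 2 samples of shape [3, 4]
ln = nn.LayerNorm([3, 4])
mu = x.mean(dim=(-2, -1), keepdim=True)                   # one mean per sample over its 12 values
var = x.var(dim=(-2, -1), unbiased=False, keepdim=True)
manual = (x - mu) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
print(torch.allclose(ln(x), manual, atol=1e-6))           # expected: True
print(ln.weight.shape)                                    # torch.Size([3, 4])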
BatchNorm makes sure that, across the batch dimension, any individual neuron has a unit Gaussian distribution.
★ Take the simplest case for BatchNorm1d and LayerNorm, a 2D input of shape (batch_size, d), where d is the number of features: BatchNorm1d normalizes each column (each feature), while LayerNorm normalizes each row (each sample); see the sketch below.
For NLP, the number of features equals the length of a token's embedding, and one sample is one token.
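A small sketch (my own illustration) contrasting the two on a (batch_size, d) tensor:

import torch
import torch.nn as nn

x = torch.randn(8, 5)              # (batch_size, d)
bn_out = nn.BatchNorm1d(5)(x)      # normalizes each column (feature) across the batch
ln_out = nn.LayerNorm(5)(x)        # normalizes each row (sample) across its features
print(bn_out.mean(dim=0))          # ~0 for every column
print(ln_out.mean(dim=1))          # ~0 for every row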


