Google tuning playbook

April 16, 2025

14:29

https://github.com/google-research/tuning_playbook

1. Guide for starting a new project

1.1 Choosing the model architecture
Note: start with a commonly used model and adjust its hyperparameters (number of layers, etc.); only then try a custom model. When possible, read the related papers.

Summary: When starting a new project, try to reuse a model that already works.

Choose a well established, commonly used model architecture to get working first. It is always possible to build a custom model later.
Model architectures typically have various hyperparameters that determine the model's size and other details (e.g. number of layers, layer width, type of activation function).

Thus, choosing the architecture really means choosing a family of different models (one for each setting of the model hyperparameters).
We will consider the problem of choosing the model hyperparameters in Choosing the initial configuration and A scientific approach to improving model performance.

When possible, try to find a paper that tackles something as close as possible to the problem at hand and reproduce that model as a starting point.

1.2 Choosing the optimizer
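Note: start with the most popular optimizer. There is no best optimizer, only the most suitable one. Pay attention to the optimizer's hyperparameters and tune them instead of blindly using the defaults. For example, in optimizers with momentum, the parameter β controls how many batches of gradients are effectively averaged. In the Kaggle Jane Street competition, one batch was one day of data, and the data had no long-term dependence; short-term fluctuations dominated. The default β amounts to a moving average over roughly the past 100 days (not sure about the exact number), which is too long a window and needs adjusting.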
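To make the momentum note above more concrete, here is a small sketch of the usual rule of thumb that an exponential moving average with decay β averages over roughly 1 / (1 − β) recent steps. The function names are hypothetical and the rule is an approximation, not something stated in the playbook.

```python
def effective_window(beta: float) -> float:
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    return 1.0 / (1.0 - beta)


def beta_for_window(steps: float) -> float:
    """Inverse of the rule of thumb: the decay whose effective window is `steps`."""
    if steps < 1.0:
        raise ValueError("steps must be >= 1")
    return 1.0 - 1.0 / steps


# One batch per day means the window is measured in days.
for beta in (0.9, 0.99, 0.999):
    print(f"beta={beta}: about {effective_window(beta):.0f} batches")
```

Under this approximation β = 0.9 corresponds to about 10 batches, β = 0.99 to about 100, and β = 0.999 to about 1000, so whether the default window is "100 days" depends on which coefficient and which default is meant.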

Summary: Start with the most popular optimizer for the type of problem at hand.

No optimizer is the "best" across all types of machine learning problems and model architectures. Even just comparing the performance of optimizers is a difficult task. 🤖
We recommend sticking with well-established, popular optimizers, especially when starting a new project.

Ideally, choose the most popular optimizer used for the same type of problem.

Be prepared to give attention to *all* hyperparameters of the chosen optimizer.

Optimizers with more hyperparameters may require more tuning effort to find the best configuration.
This is particularly relevant in the beginning stages of a project when we are trying to find the best values of various other hyperparameters (e.g. architecture hyperparameters) while treating optimizer hyperparameters as nuisance parameters.
It may be preferable to start with a simpler optimizer (e.g. SGD with fixed momentum or Adam with fixed ϵ, β1, and β2) in the initial stages of the project and switch to a more general optimizer later.

Well-established optimizers that we like include (but are not limited to):

SGD with momentum (we like the Nesterov variant)
Adam and NAdam, which are more general than SGD with momentum. Note that Adam has 4 tunable hyperparameters and they can all matter!

See How should Adam's hyperparameters be tuned?

1.3 Choosing the batch size
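Note: changing the batch size can speed up training. The batch size should not be treated as a hyperparameter to tune for validation performance; the ideal batch size is often the largest one the hardware supports. When the other hyperparameters are well tuned and the number of training steps is sufficient, any batch size should be able to reach the same performance.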

Summary: The batch size governs the training speed and shouldn't be used to directly tune the validation set performance. Often, the ideal batch size will be the largest batch size supported by the available hardware.

The batch size is a key factor in determining the training time and computing resource consumption.
Increasing the batch size will often reduce the training time. This can be highly beneficial because it, e.g.:

Allows hyperparameters to be tuned more thoroughly within a fixed time interval, potentially resulting in a better final model.
Reduces the latency of the development cycle, allowing new ideas to be tested more frequently.

Increasing the batch size may either decrease, increase, or not change the resource consumption.
The batch size should not be treated as a tunable hyperparameter for validation set performance.

As long as all hyperparameters are well-tuned (especially the learning rate and regularization hyperparameters) and the number of training steps is sufficient, the same final performance should be attainable using any batch size (see Shallue et al. 2018).
Please see Why shouldn't the batch size be tuned to directly improve validation set performance?

Determining the feasible batch sizes and estimating training throughput

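Note: the benefit of a larger batch size comes from higher training throughput. In theory, while accelerator memory is not yet full, doubling the batch size should double the throughput; if it does not, there is probably another bottleneck such as I/O. These steps may need to be repeated every time the model or optimizer changes (e.g. a different model architecture may allow a larger batch size to fit in memory).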

For a given model and optimizer, there will typically be a range of batch sizes supported by the available hardware. The limiting factor is usually accelerator memory.
Unfortunately, it can be difficult to calculate which batch sizes will fit in memory without running, or at least compiling, the full training program.
The easiest solution is usually to run training jobs at different batch sizes (e.g. increasing powers of 2) for a small number of steps until one of the jobs exceeds the available memory.
For each batch size, we should train for long enough to get a reliable estimate of the training throughput

training throughput = (# examples processed per second)

or, equivalently, the time per step.

time per step = (batch size) / (training throughput)

When the accelerators aren't yet saturated, if the batch size doubles, the training throughput should also double (or at least nearly double). Equivalently, the time per step should be constant (or at least nearly constant) as the batch size increases.
If this is not the case then the training pipeline has a bottleneck such as I/O or synchronization between compute nodes. This may be worth diagnosing and correcting before proceeding.
If the training throughput increases only up to some maximum batch size, then we should only consider batch sizes up to that maximum batch size, even if a larger batch size is supported by the hardware.

All benefits of using a larger batch size assume the training throughput increases. If it doesn't, fix the bottleneck or use the smaller batch size.
Gradient accumulation simulates a larger batch size than the hardware can support and therefore does not provide any throughput benefits. It should generally be avoided in applied work.

These steps may need to be repeated every time the model or optimizer is changed (e.g. a different model architecture may allow a larger batch size to fit in memory).
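The throughput check described in this subsection can be sketched as follows. The measurements, the function names, and the 1.8x "nearly doubles" threshold are all hypothetical choices made for illustration; the playbook does not prescribe a threshold.

```python
def time_per_step(batch_size: int, throughput: float) -> float:
    """time per step = (batch size) / (training throughput in examples/sec)."""
    return batch_size / throughput


def largest_scaling_batch_size(throughputs: dict[int, float],
                               min_gain: float = 1.8) -> int:
    """Largest batch size reachable through doublings that each raised
    throughput by at least `min_gain`.

    Assumes the keys are batch sizes in increasing powers of 2.
    """
    sizes = sorted(throughputs)
    best = sizes[0]
    for small, large in zip(sizes, sizes[1:]):
        if throughputs[large] / throughputs[small] < min_gain:
            break  # throughput stopped scaling: likely saturation or a bottleneck
        best = large
    return best


# Hypothetical measurements: batch size -> examples processed per second.
measured = {64: 1000.0, 128: 1950.0, 256: 3800.0, 512: 4100.0}
for bs, tp in measured.items():
    print(bs, round(time_per_step(bs, tp), 4))
print("largest batch size that still scales:", largest_scaling_batch_size(measured))
```

With these made-up numbers the time per step stays close to constant up to 256 and then roughly doubles at 512, so 256 would be the largest batch size worth considering.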

Choosing the batch size to minimize training time

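Note: training time = time per step × total number of steps. The time per step is usually roughly constant across feasible batch sizes. As the batch size increases, the number of steps needed to reach a fixed performance goal typically decreases, provided all relevant hyperparameters are re-tuned when the batch size changes (Shallue et al. 2018). In the ideal case, doubling the batch size halves the total number of steps. So the batch size that minimizes training time is usually the largest one that still reduces the number of steps required. There is no point in using a larger batch size if it ends up increasing the training time.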

Training time = (time per step) x (total number of steps)

We can often consider the time per step to be approximately constant for all feasible batch sizes. This is true when there is no overhead from parallel computations and all training bottlenecks have been diagnosed and corrected (see the previous section for how to identify training bottlenecks). In practice, there is usually at least some overhead from increasing the batch size.
As the batch size increases, the total number of steps needed to reach a fixed performance goal typically decreases (provided all relevant hyperparameters are re-tuned when the batch size is changed; Shallue et al. 2018).

E.g. Doubling the batch size might halve the total number of steps required. This is called perfect scaling.
Perfect scaling holds for all batch sizes up to a critical batch size, beyond which one achieves diminishing returns.
Eventually, increasing the batch size no longer reduces the number of training steps (but never increases it).

Therefore, the batch size that minimizes training time is usually the largest batch size that still provides a reduction in the number of training steps required.

This batch size depends on the dataset, model, and optimizer, and it is an open problem how to calculate it other than finding it experimentally for every new problem. 🤖
When comparing batch sizes, beware the distinction between an example budget/epoch budget (running all experiments while fixing the number of training example presentations) and a step budget (running all experiments with the number of training steps fixed).

Comparing batch sizes with an epoch budget only probes the perfect scaling regime, even when larger batch sizes might still provide a meaningful speedup by reducing the number of training steps required.

Often, the largest batch size supported by the available hardware will be smaller than the critical batch size. Therefore, a good rule of thumb (without running any experiments) is to use the largest batch size possible.

There is no point in using a larger batch size if it ends up increasing the training time.
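As a rough illustration of the trade-off in this subsection, the sketch below applies Training time = (time per step) x (total number of steps) to a set of made-up measurements. All numbers and names are hypothetical.

```python
def training_time(time_per_step_s: float, steps_to_target: int) -> float:
    """Training time = (time per step) x (total number of steps)."""
    return time_per_step_s * steps_to_target


def fastest_batch_size(measurements: dict[int, tuple[float, int]]) -> int:
    """Batch size with the smallest total training time.

    `measurements` maps batch size -> (time per step in seconds,
    steps needed to reach the fixed performance goal).
    """
    return min(measurements, key=lambda bs: training_time(*measurements[bs]))


# Hypothetical numbers: perfect scaling up to 512, diminishing returns after,
# and a time-per-step penalty once the hardware saturates.
runs = {
    256: (0.10, 40_000),
    512: (0.10, 20_000),
    1024: (0.11, 12_000),
    2048: (0.20, 10_000),
}
print(fastest_batch_size(runs))
```

Here 2048 needs the fewest steps but is not the fastest choice, because its time per step has doubled; that is the case where a larger batch size ends up increasing the training time.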

Changing the batch size requires re-tuning most hyperparameters

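Note: the optimal values of most hyperparameters are sensitive to the batch size, so changing the batch size generally means re-tuning. The hyperparameters that interact most strongly with the batch size are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters, so these need to be tuned separately for each batch size.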

The optimal values of most hyperparameters are sensitive to the batch size. Therefore, changing the batch size typically requires starting the tuning process all over again.
The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
Keep this in mind when choosing the batch size at the start of a project. If you need to switch to a different batch size later on, it might be difficult, time consuming, and expensive to re-tune everything for the new batch size.

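How batch norm interacts with the batch size

In general, batch norm statistics should be computed with a different batch size than the one used for the gradient computation. See the batch norm section (4.4) for a detailed discussion. (I first guessed this referred to Bessel's correction; section 4.4 instead describes decoupling the number of examples used for batch norm statistics from the training batch size, as in Ghost batch norm.)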

1.4 Choosing the initial configuration

Before beginning hyperparameter tuning we must determine the starting point. This includes specifying (1) the model configuration (e.g. number of layers), (2) the optimizer hyperparameters (e.g. learning rate), and (3) the number of training steps.
Determining this initial configuration will require some manually configured training runs and trial-and-error.
Our guiding principle is to find a simple, relatively fast, relatively low-resource-consumption configuration that obtains a "reasonable" result.

"Simple" means avoiding bells and whistles wherever possible; these can always be added later. Even if bells and whistles prove helpful down the road, adding them in the initial configuration risks wasting time tuning unhelpful features and/or baking in unnecessary complications.

For example, start with a constant learning rate before adding fancy decay schedules.

Choosing an initial configuration that is fast and consumes minimal resources will make hyperparameter tuning much more efficient.

For example, start with a smaller model.

"Reasonable" performance depends on the problem, but at minimum means that the trained model performs much better than random chance on the validation set (although it might be bad enough to not be worth deploying).

Choosing the number of training steps involves balancing the following tension:

On the one hand, training for more steps can improve performance and makes hyperparameter tuning easier (see Shallue et al. 2018).
On the other hand, training for fewer steps means that each training run is faster and uses fewer resources, boosting tuning efficiency by reducing the time between cycles and allowing more experiments to be run in parallel. Moreover, if an unnecessarily large step budget is chosen initially, it might be hard to change it down the road, e.g. once the learning rate schedule is tuned for that number of steps.

2. A scientific approach to improving model performance

2.1 The incremental tuning strategy

Note: start from a baseline and improve it a little at a time, building up insight into the problem and the data along the way.

Summary: Start with a simple configuration and incrementally make improvements while building up insight into the problem. Make sure that any improvement is based on strong evidence to avoid adding unnecessary complexity.

Our incremental tuning strategy involves repeating the following four steps:

Identify an appropriately-scoped goal for the next round of experiments.
Design and run a set of experiments that makes progress towards this goal.
Learn what we can from the results.
Consider whether to launch the new best configuration.

2.2 Choosing the goal for the next round of experiments

Note: choose a goal for each round of experiments. Goals should mostly favor exploration over exploitation: most of the time, our primary goal is to gain insight into the problem.

Summary: Each round of experiments should have a clear goal and be sufficiently narrow in scope that the experiments can actually make progress towards the goal.

Example goals include:

Try a potential improvement to the pipeline (e.g. a new regularizer, preprocessing choice, etc.).
Understand the impact of a particular model hyperparameter (e.g. the activation function).
Greedily minimize validation error.

2.3 Designing the next round of experiments

Note: decide which hyperparameters are scientific, which are nuisance, and which are fixed.

Summary: Identify which hyperparameters are scientific, nuisance, and fixed hyperparameters for the experimental goal. Create a sequence of studies to compare different values of the scientific hyperparameters while optimizing over the nuisance hyperparameters. Choose the search space of nuisance hyperparameters to balance resource costs with scientific value.

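Note: if we want to determine the effect of the number of layers, the number of layers is the scientific hyperparameter and the learning rate is a nuisance hyperparameter. When we vary the number of layers we should not hold the learning rate fixed; we should use the best learning rate for each depth, because the effects of depth and learning rate on performance are coupled, not independent. The activation function can be a fixed hyperparameter: it can stay the same while the number of layers varies.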

For example, if our goal is to "determine whether a model with more hidden layers will reduce validation error", then the number of hidden layers is a scientific hyperparameter.

The learning rate is a nuisance hyperparameter because we can only fairly compare models with different numbers of hidden layers if the learning rate is tuned separately for each number of layers (the optimal learning rate generally depends on the model architecture).
The activation function could be a fixed hyperparameter if we have determined in prior experiments that the best choice of activation function is not sensitive to model depth, or if we are willing to limit our conclusions about the number of hidden layers to only cover this specific choice of activation function. Alternatively, it could be a nuisance parameter if we are prepared to tune it separately for each number of hidden layers.

Note: whether a hyperparameter is scientific, nuisance, or fixed depends on the experimental goal.

Whether a particular hyperparameter is a scientific hyperparameter, nuisance hyperparameter, or fixed hyperparameter is not inherent to that hyperparameter, but changes depending on the experimental goal.

For example, the choice of activation function could be a scientific hyperparameter (is ReLU or tanh a better choice for our problem?), a nuisance hyperparameter (is the best 5-layer model better than the best 6-layer model when we allow several different possible activation functions?), or a fixed hyperparameter (for ReLU nets, does adding batch normalization in a particular position help?).

Note: when designing the next round of experiments, first identify the scientific hyperparameters and provisionally treat all other hyperparameters as nuisance hyperparameters. With unlimited compute, leaving every other hyperparameter as a nuisance hyperparameter has the advantage that the conclusions we draw from our experiments are free from caveats about fixed hyperparameter values. The risk is that the more nuisance hyperparameters we attempt to tune, the greater the risk we fail to tune them sufficiently well for each setting of the scientific hyperparameters and end up reaching the wrong conclusions from our experiments; and in any case we do not have unlimited compute. So we convert some nuisance hyperparameters into fixed hyperparameters, guided by this principle: The more a given nuisance hyperparameter interacts with the scientific hyperparameters, the more damaging it is to fix its value. For example, the best value of the weight decay strength typically depends on the model size, so comparing different model sizes assuming a single specific value of the weight decay would not be very insightful.

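Although the type of a hyperparameter depends on the experimental goal, there are some rules of thumb.

Learning-rate-related and other optimizer hyperparameters are usually nuisance hyperparameters, because they interact with the scientific hyperparameters.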

Of the various optimizer hyperparameters (e.g. the learning rate, momentum, learning rate schedule parameters, Adam betas etc.), at least some of them will be nuisance hyperparameters because they tend to interact the most with other changes.

The choice of optimizer, on the other hand, is usually a scientific or fixed hyperparameter.

It is a scientific hyperparameter if our experimental goal involves making fair comparisons between two or more different optimizers (e.g. "determine which optimizer produces the lowest validation error in a given number of steps").

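Similarly, the hyperparameters of a regularization technique are nuisance hyperparameters, but whether to use regularization at all, and which kind, is a scientific or fixed hyperparameter.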

Hyperparameters introduced by a regularization technique are typically nuisance hyperparameters, but whether or not we include the regularization technique at all is a scientific or fixed hyperparameter.

For example, dropout adds code complexity, so when deciding whether to include it we would make "no dropout" vs "dropout" a scientific hyperparameter and the dropout rate a nuisance hyperparameter.

Architectural hyperparameters are usually scientific or fixed.

Architectural hyperparameters are often scientific or fixed hyperparameters because architecture changes can affect serving and training costs, latency, and memory requirements.

For example, the number of layers is typically a scientific or fixed hyperparameter since it tends to have dramatic consequences for training speed and memory usage.

In some cases, the sets of nuisance and fixed hyperparameters depend on the values of the scientific hyperparameters.

For example, suppose we are trying to determine which optimizer out of Nesterov momentum and Adam results in the lowest validation error. The scientific hyperparameter is the optimizer, which takes values {"Nesterov_momentum", "Adam"}. The value optimizer="Nesterov_momentum" introduces the nuisance/fixed hyperparameters {learning_rate, momentum}, but the value optimizer="Adam" introduces the nuisance/fixed hyperparameters {learning_rate, beta1, beta2, epsilon}.

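Once the scientific, nuisance, and fixed hyperparameters have been identified, we can specify a set of hyperparameter configurations to run, each of which is called a trial. This involves defining the search space, choosing the number of trials, and choosing an automated search algorithm or searching manually.

The goal of the search is to determine how different values of the scientific hyperparameters affect performance while keeping the nuisance hyperparameters at their best values throughout, so that comparisons between values of the scientific hyperparameters are meaningful.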

The purpose of the studies is to run the pipeline with different values of the scientific hyperparameters, while at the same time "optimizing away" (or "optimizing over") the nuisance hyperparameters so that comparisons between different values of the scientific hyperparameters are as fair as possible.

An example of the simple case:

For example, if our goal is to select the best optimizer out of Nesterov momentum and Adam, we could create one study in which optimizer="Nesterov_momentum" and the nuisance hyperparameters are {learning_rate, momentum}, and another study in which optimizer="Adam" and the nuisance hyperparameters are {learning_rate, beta1, beta2, epsilon}. We would compare the two optimizers by selecting the best performing trial from each study.

We can use any gradient-free optimization algorithm, including methods such as Bayesian optimization or evolutionary algorithms, to optimize over the nuisance hyperparameters, although we prefer to use quasi-random search in the exploration phase of tuning because of a variety of advantages it has in this setting.
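The two-study comparison above might be organized along the following lines. This is only a sketch: the search ranges are invented, `toy_train_fn` is a hypothetical stand-in for a real training run, every range is sampled log-uniformly for brevity (not necessarily the right scale for momentum or the betas), and plain pseudo-random sampling is used where the playbook prefers quasi-random search.

```python
import math
import random

# Hypothetical nuisance search spaces, one study per optimizer.
STUDIES = {
    "Nesterov_momentum": {"learning_rate": (1e-4, 1e-1), "momentum": (0.5, 0.99)},
    "Adam": {
        "learning_rate": (1e-5, 1e-2),
        "beta1": (0.8, 0.99),
        "beta2": (0.9, 0.999),
        "epsilon": (1e-8, 1e-3),
    },
}


def sample_trial(space, rng):
    """Draw one nuisance-hyperparameter setting, log-uniformly within each range."""
    return {
        name: math.exp(rng.uniform(math.log(lo), math.log(hi)))
        for name, (lo, hi) in space.items()
    }


def run_study(optimizer, space, num_trials, train_fn, seed=0):
    """Run `num_trials` trials and return (validation error, hparams) of the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        hparams = sample_trial(space, rng)
        trials.append((train_fn(optimizer, hparams), hparams))
    return min(trials, key=lambda trial: trial[0])


def toy_train_fn(optimizer, hparams):
    """Stand-in for real training: pretends error is lowest near learning_rate = 1e-3."""
    return abs(math.log10(hparams["learning_rate"]) + 3.0)


# Compare the optimizers by the best trial from each study.
best = {opt: run_study(opt, space, 20, toy_train_fn) for opt, space in STUDIES.items()}
for opt, (err, hparams) in best.items():
    print(opt, round(err, 3), hparams)
```

The point of the structure is that each value of the scientific hyperparameter (the optimizer) gets its own tuned nuisance hyperparameters before the comparison is made.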

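An example of the more complicated case: put the scientific hyperparameters in the same search space as the nuisance hyperparameters and search over both together. Quasi-random search is even more attractive here, because it ensures the scientific hyperparameters are sampled relatively uniformly.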

In the more complicated case where we want to compare a large number of values of the scientific hyperparameters and it is impractical to make that many independent studies, we can include the scientific parameters in the same search space as the nuisance hyperparameters and use a search algorithm to sample values of both the scientific and nuisance hyperparameters in a single study.

In this case, our preference for using quasi-random search over fancier black-box optimization tools is even stronger, since it ensures that we obtain a relatively uniform sampling of values of the scientific hyperparameters. Regardless of the search algorithm, we need to make sure somehow that it searches the scientific parameters uniformly.

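When running search experiments, we need to explore enough values of the scientific hyperparameters to gain sufficient insight, and enough of the nuisance hyperparameter space for the nuisance hyperparameters to be tuned well enough. The more we search, the more compute we use, so the two have to be balanced.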

2.4 Extracting insight from experimental results

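Note: besides exploring the effect of the scientific hyperparameters, each round of experiments should also be checked for a few other issues.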

Summary: In addition to trying to achieve the original scientific goal of each group of experiments, go through a checklist of additional questions and, if issues are discovered, revise the experiments and rerun them.

Before analyzing a given set of experiments to make progress toward their original goal, we should ask ourselves the following additional questions:

Is the search space large enough?

A search space is suspicious if the best point sampled from it is close to its boundary. We might find an even better point if we expanded the search range in that direction.
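A quick way to flag a suspicious search space is to check where the best trial sits within each searched range. The sketch below is one possible check; the function name and the 10% margin are assumptions, not part of the playbook.

```python
import math


def near_boundary(best_value: float, low: float, high: float,
                  log_scale: bool = True, margin: float = 0.1) -> bool:
    """True if `best_value` lies within `margin` (as a fraction of the range)
    of either end of the searched interval [low, high]."""
    if log_scale:
        # Measure position on a log scale, e.g. for learning rates.
        best_value, low, high = (math.log(v) for v in (best_value, low, high))
    position = (best_value - low) / (high - low)
    return position < margin or position > 1.0 - margin


# Hypothetical study that searched learning rates in [1e-4, 1e-1].
print(near_boundary(0.09, 1e-4, 1e-1))   # best trial near the upper end
print(near_boundary(3e-3, 1e-4, 1e-1))   # best trial well inside the range
```

If the first case came up in a real study, it would suggest extending the learning rate range upward and rerunning.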

Have we sampled enough points from the search space?

It is usually hard to know whether the search space has been sampled densely enough. In practice we sample what we can afford and use the experimental results to calibrate our intuition about the problem.

What fraction of the trials in each study are infeasible (i.e. trials that diverge, get really bad loss values, or fail to run at all because they violate some implicit constraint)?
Does the model exhibit optimization issues?
What can we learn from the training curves of the best trials?

Do the training curves show problematic overfitting?

Problematic overfitting occurs when the validation error starts increasing at some point during training.

If any of the best trials exhibits problematic overfitting, we usually want to re-run the experiment with additional regularization techniques and/or better tune the existing regularization parameters before comparing the values of the scientific hyperparameters; only then is the comparison fair. Reducing overfitting is often straightforward using common regularization techniques that add minimal code complexity or extra computation (e.g. dropout, label smoothing, weight decay), so it's usually no big deal to add one or more of these to the next round of experiments.
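The definition of problematic overfitting above (validation error starts increasing at some point during training) can be checked mechanically on a logged validation curve. This is a minimal sketch; the function name and the `tolerance` parameter are hypothetical, and a noisy curve would need smoothing or a larger tolerance.

```python
def overfitting_report(val_errors: list[float], tolerance: float = 0.0) -> dict:
    """Compare the final validation error with the best one seen during training."""
    best_eval = min(range(len(val_errors)), key=val_errors.__getitem__)
    rise = val_errors[-1] - val_errors[best_eval]
    return {
        "best_eval": best_eval,          # index of the evaluation with lowest error
        "rise_after_best": rise,         # how much the error went up afterwards
        "problematic": rise > tolerance,
    }


# Hypothetical validation errors from successive periodic evaluations.
curve = [0.50, 0.40, 0.35, 0.33, 0.36, 0.40]
print(overfitting_report(curve, tolerance=0.01))
```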

Is there high step-to-step variance in the training or validation error late in training?

The most likely causes of step-to-step variance are batch variance (from randomly sampling examples from the training set for each batch), small validation sets, and using a learning rate that’s too high late in training.

Possible remedies include increasing the batch size, obtaining more validation data, using learning rate decay, or using Polyak averaging.

Are the trials still improving at the end of training?

If so, this indicates that we are in the "compute bound" regime and we may benefit from increasing the number of training steps or changing the learning rate schedule.

Has performance on the training and validation sets saturated long before the final training step?

If so, this indicates that we are in the "not compute-bound" regime and that we may be able to decrease the number of training steps.

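Based on the answers to the questions above, refine the most recent study (or group of studies) to improve the search space and/or sample more trials, or take some other corrective action.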

2.5 Determining whether to adopt a training pipeline change or hyperparameter configuration

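When deciding whether to change the model or training procedure, or to adopt a new hyperparameter configuration, we need to establish whether the new configuration genuinely improves results or whether the difference comes from other factors. Different random seeds, random initialization, training data shuffling, dropout masks, patterns of data augmentation operations, and the ordering of parallel arithmetic operations are all potential sources of trial variance.

Therefore, before adopting a candidate change, consider running the best trial N times to characterize the run-to-run trial variance.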
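One simple way to use the N repeated runs is to compare the gap between two configurations against the run-to-run spread. The sketch below does this with made-up validation errors; the function names and the "2 standard deviations" rule are assumptions for illustration, not a criterion given in the playbook.

```python
import statistics


def summarize_runs(val_errors: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of validation error across repeated runs."""
    return statistics.mean(val_errors), statistics.stdev(val_errors)


def improvement_exceeds_noise(baseline_runs: list[float],
                              candidate_runs: list[float],
                              num_std: float = 2.0) -> bool:
    """True if the candidate's mean error is lower than the baseline's by more
    than `num_std` times the larger of the two run-to-run standard deviations."""
    base_mean, base_std = summarize_runs(baseline_runs)
    cand_mean, cand_std = summarize_runs(candidate_runs)
    return (base_mean - cand_mean) > num_std * max(base_std, cand_std)


# Hypothetical validation errors from N = 5 runs with different seeds.
baseline = [0.100, 0.102, 0.098, 0.101, 0.099]
candidate = [0.090, 0.091, 0.089, 0.092, 0.088]
print(summarize_runs(baseline))
print(improvement_exceeds_noise(baseline, candidate))
```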

2.6 After exploration concludes

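Summary: Once we have finished exploring for a good search space and have decided which hyperparameters should be tuned, Bayesian optimization tools are a compelling option.

At some point, our priorities will shift from learning more about the tuning problem to producing a single best configuration to launch or otherwise use.

At this point, there should be a refined search space that comfortably contains the local region around the best observed trial and has been adequately sampled.

Our exploration work should have revealed the most important hyperparameters to tune (as well as sensible ranges for them), which we can use to construct a search space for a final automated tuning study using as large a tuning budget as possible.

Since we no longer care about maximizing our insight into the tuning problem, many of the advantages of quasi-random search no longer apply, and Bayesian optimization tools should be used to automatically find the best hyperparameter configuration. (In other words: quasi-random search first, then Bayesian optimization, much like a coarse ranking pass followed by a fine ranking pass.)

If the search space contains a large number of divergent points (points that get NaN training loss, or even training loss many standard deviations worse than the mean), it is important to use black-box optimization tools that properly handle trials that diverge (Bayesian Optimization with Unknown Constraints describes an excellent way to deal with this issue).

At this point, we should also consider checking the performance on the test set.

In principle, we could even fold the validation set into the training set and retrain the best configuration found with Bayesian optimization. However, this is only appropriate if there won't be future launches with this specific workload (e.g. a one-time Kaggle competition).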

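3. Determining the number of steps for each training run (I did not fully understand this part)

Note: whether or not training is compute-bound, methods that increase the gradient variance across batches usually increase the number of training steps required. Examples: a smaller batch size, data augmentation, dropout.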

Regardless of whether a given workload is compute-bound or not, methods that increase the variance of the gradients (across batches) will usually result in slower training progress, and thus may increase the number of training steps required to reach a particular validation loss. High gradient variance can be caused by:

Using a smaller batch size
Adding data augmentation
Adding some types of regularization (e.g. dropout)

3.1 When training is compute-bound

When training is compute-bound, speeding up training is, in principle, equivalent to improving training performance, and the "optimal" training time is always "as long as we can afford."

3.2 When training is not compute-bound

When training is not compute-bound, we can afford to train as long as we would like to, and, at some point, training longer doesn't help much (or even causes problematic overfitting).

4. Additional guidance for the training pipeline

4.1 Optimizing the input pipeline

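Note: possible problems on the input side:

Data are not colocated with the training process, which causes I/O latency (e.g. reading data over a network).
Expensive online data preprocessing.
Synchronization issues.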
Common causes:

Data are not colocated with the training process, causing I/O latency (this might happen when reading training data over a network).
Expensive online data preprocessing (consider doing this once offline and saving).
Unintentional synchronization barriers that interfere with data pipeline prefetching. For example, when synchronizing metrics between the device and host in CommonLoopUtils.

Common tips:

Instrument input pipeline to prefetch examples (e.g. tf.data.Dataset.prefetch).
Remove unused features/metadata from each example as early in the pipeline as possible.
Increase the replication of the number of jobs generating examples for the input pipeline. For example, by using the tf.data service.

4.2 Evaluating model performance

Note: evaluate with a larger batch size than the one used for training, and evaluate at regular step intervals rather than regular time intervals.

Summary: Run evaluation at larger batch sizes than training. Run evaluations at regular step intervals, not regular time intervals.

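We run periodic evaluations during training to monitor progress in real time, to make checkpoint selection possible, and so that we can examine the training curves once training ends.
The simplest configuration runs both training and periodic evaluation in the same compute instance, alternating between the two.

In this case, the batch size used for evaluation should be at least as large as the batch size used for training, because evaluation has lower computational requirements per example (no activations need to be kept for a backward pass).

Periodic evaluations should be run at regular step intervals, not regular time intervals.

Evaluations based on time intervals make the training curves harder to interpret, especially when training may suffer from preemptions of the training job, network latency issues, etc.

Periodic evaluation jobs may not have enough time to run on the full offline validation set, so we need to sample a reasonable subset for them.
We consider the following factors when constructing the sampled dataset:

Sample size

Check that the performance measured on the sampled dataset matches the performance on the whole validation set.
The sampled dataset should be small enough that model predictions over it are fast to generate, but large enough to accurately measure improvements to the model.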
It should be large enough to accommodate multiple such evaluations across trials in sequence, and still produce accurate estimates. That is, to avoid adaptively “fitting” to the validation set over time, in a way that doesn’t generalize to a held-out test set. However, this consideration is rarely a practical concern.

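Imbalanced datasets

For imbalanced datasets, performance on rare classes of examples will often be noisy.

For classes with only a small number of examples, log the number of examples predicted correctly to get more insight into accuracy improvements.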
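Logging raw counts per class, as suggested above, takes only a few lines. The sketch below uses hypothetical labels and predictions and a hypothetical function name.

```python
from collections import Counter


def per_class_correct(labels: list, predictions: list) -> dict:
    """Map each class to (number predicted correctly, number of examples)."""
    totals = Counter(labels)
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    return {cls: (correct[cls], totals[cls]) for cls in totals}


# Hypothetical imbalanced evaluation set: 8 examples of "a", 2 of "b".
labels = ["a"] * 8 + ["b"] * 2
predictions = ["a"] * 7 + ["b"] + ["b", "a"]
for cls, (n_correct, n_total) in per_class_correct(labels, predictions).items():
    print(f"class {cls}: {n_correct}/{n_total} correct")
```

Seeing "1/2 correct" for the rare class makes it obvious that a large change in its accuracy may come from a single example.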

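4.3 Saving checkpoints and retrospectively selecting the best checkpoint

Many deep learning frameworks support model checkpointing. The current state of the model is periodically saved to disk, which makes the training job resilient to compute instance interruptions.
The best checkpoint is often not the last checkpoint, particularly when validation set performance stops improving over time or even degrades slightly.
Set up the pipeline to keep the N best checkpoints. At the end of training, model selection is then a matter of choosing the best checkpoint seen during training. We call this retrospective optimal checkpoint selection.
Early stopping is usually not necessary, because the N best checkpoints are saved.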
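Keeping the N best checkpoints can be sketched with a small heap-based tracker. The class below only tracks (step, validation error) pairs; actually writing and deleting checkpoint files is left out, and the class and method names are hypothetical.

```python
import heapq
from typing import Optional


class BestCheckpoints:
    """Tracks the `keep` checkpoints with the lowest validation error."""

    def __init__(self, keep: int):
        self.keep = keep
        self._heap = []  # entries are (-val_error, step); heap root is the worst kept

    def update(self, step: int, val_error: float) -> Optional[int]:
        """Record a new checkpoint. Returns the step whose checkpoint should be
        deleted (possibly the new one), or None if nothing needs deleting."""
        heapq.heappush(self._heap, (-val_error, step))
        if len(self._heap) > self.keep:
            _, evicted_step = heapq.heappop(self._heap)
            return evicted_step
        return None

    def kept_steps(self) -> list[int]:
        return sorted(step for _, step in self._heap)

    def best_step(self) -> int:
        """Step of the checkpoint with the lowest validation error."""
        return max(self._heap)[1]


tracker = BestCheckpoints(keep=2)
for step, err in [(100, 0.50), (200, 0.40), (300, 0.45), (400, 0.60)]:
    tracker.update(step, err)
print(tracker.kept_steps(), tracker.best_step())
```

At the end of training, `best_step()` gives the retrospectively selected checkpoint.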

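4.4 Batch normalization implementation details

Summary: nowadays batch norm can often be replaced with LayerNorm, but in cases where it cannot, there are tricky details when changing the batch size or number of hosts.

Batch norm normalizes activations using their mean and variance over the current batch, but in a multi-device setting each device sees different data, so these statistics differ across devices unless they are explicitly synchronized.
Computing the batch norm statistics over about 64 examples often works better in practice.
Decoupling the total batch size from the number of examples used to compute batch norm statistics is particularly useful for batch size comparisons.
Ghost batch norm implementations do not always correctly handle the case where the per-device batch size > virtual batch size. In this case we would actually need to subsample the batch on each device to get the proper number of batch norm statistic examples.
During training, batch norm tracks batch statistics and maintains an exponential moving average (EMA) of them for use at test time. Exponential moving averages used in test mode batch norm are just a linear combination of training statistics, so these EMAs only need to be synchronized before saving them in checkpoints. However, some common implementations of batch norm do not synchronize these EMAs and only save the EMA from the first device.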

4.5 Considerations for multi-host pipelines

Summary: for logging, evals, RNGs, checkpointing, and data sharding, multi-host training can make it very easy to introduce bugs!

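Multi-host training makes it much easier to introduce bugs.

Make sure the pipeline logs and checkpoints on only one host.
Make sure batch norm statistics are synchronized across hosts before evaluation or checkpointing.
RNG seeds that must agree (e.g. for model initialization) have to be the same across hosts, while seeds for data shuffling and preprocessing should differ across hosts.
Sharding data files across hosts is usually recommended for improved performance.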
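The seed rule above can be made explicit with a tiny helper. The derivation scheme (adding an offset per host) and all names are hypothetical; real frameworks usually provide their own ways to split or fold in RNG state.

```python
def seeds_for_host(base_seed: int, host_id: int) -> dict[str, int]:
    """Seeds for one host: identical model-init seed on every host,
    a different data shuffling/preprocessing seed on each host."""
    return {
        "model_init": base_seed,                  # must match across hosts
        "data_shuffle": base_seed + 1 + host_id,  # must differ across hosts
    }


all_seeds = [seeds_for_host(base_seed=1234, host_id=h) for h in range(4)]
for host_id, seeds in enumerate(all_seeds):
    print(host_id, seeds)
```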

FAQs

What is the best learning rate decay schedule family?

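Note: there is no known best learning rate decay schedule. What we can be confident of is that the learning rate should not stay fixed; it should change over the course of training.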

Different learning rates work best at different times during the optimization process.

Which learning rate decay should I use as a default?

Our preference is either linear decay or cosine decay, and a bunch of other schedule families are probably good too.
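For reference, the two preferred schedule families can be written in a few lines. This is a minimal sketch without warmup; the parameter names (`base_lr`, `final_lr`, `total_steps`) are my own.

```python
import math


def linear_decay(step: int, total_steps: int, base_lr: float,
                 final_lr: float = 0.0) -> float:
    """Learning rate that moves linearly from base_lr to final_lr."""
    frac = min(step, total_steps) / total_steps
    return base_lr + (final_lr - base_lr) * frac


def cosine_decay(step: int, total_steps: int, base_lr: float,
                 final_lr: float = 0.0) -> float:
    """Learning rate that follows half a cosine period from base_lr to final_lr."""
    frac = min(step, total_steps) / total_steps
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * frac))


for step in (0, 250, 500, 750, 1000):
    print(step,
          round(linear_decay(step, 1000, 0.1), 4),
          round(cosine_decay(step, 1000, 0.1), 4))
```

Both schedules start at `base_lr` and end at `final_lr`; cosine decay stays higher early on and drops faster in the middle of training.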

How should Adam’s hyperparameters be tuned?

If < 10 trials in a study, only tune the (base) learning rate.
If 10-25 trials, tune learning rate and β1.
If 25+ trials, tune the learning rate, β1 and ϵ.
If one can run substantially more than 25 trials, additionally tune β2.
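The rules above can be encoded as a lookup from trial budget to the hyperparameters worth tuning. Two things here are assumptions: a budget of exactly 25 is treated as the "25+" tier (the ranges in the notes overlap at 25), and "substantially more than 25" is left to the caller as a flag because the notes give no number.

```python
def adam_hparams_to_tune(num_trials: int,
                         very_large_budget: bool = False) -> list[str]:
    """Which Adam hyperparameters to tune for a study with `num_trials` trials."""
    if num_trials < 10:
        return ["learning_rate"]
    if num_trials < 25:
        return ["learning_rate", "beta1"]
    tuned = ["learning_rate", "beta1", "epsilon"]
    if very_large_budget:  # caller decides what "substantially more than 25" means
        tuned.append("beta2")
    return tuned


for budget in (5, 15, 40):
    print(budget, adam_hparams_to_tune(budget))
```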

Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?

What are quasi-random numbers? Numbers that look random but are distributed more evenly. Quasi-random refers to a sequence of numbers that are designed to fill a space more uniformly than purely random sequences. Unlike random numbers, which can cluster and leave gaps, quasi-random numbers are generated using deterministic algorithms that ensure a more even distribution across a defined range.
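To see what a deterministic, evenly spread sequence looks like, here is a sketch of the Halton sequence, one common low-discrepancy construction. It is a plain, unscrambled version for illustration; the mapping to a learning rate range at the end uses hypothetical bounds.

```python
def van_der_corput(index: int, base: int) -> float:
    """Radical inverse of `index` in `base`: the digits of index are mirrored
    around the decimal point, giving a value in [0, 1)."""
    result, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        result += digit / denom
    return result


def halton(num_points: int, bases=(2, 3)) -> list[tuple[float, ...]]:
    """First `num_points` Halton points, one coordinate per (coprime) base."""
    return [tuple(van_der_corput(i, b) for b in bases)
            for i in range(1, num_points + 1)]


# Map the first coordinate to a log-uniform learning rate in [1e-4, 1e-1]
# (hypothetical range) and keep the second coordinate in [0, 1).
for u, v in halton(4):
    learning_rate = 10 ** (-4 + 3 * u)
    print(round(learning_rate, 5), round(v, 3))
```

Each new point falls into the largest remaining gap along every axis, which is why a small number of trials already covers the search space fairly evenly.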

Why shouldn't the batch size be tuned to directly improve validation set performance?

Changing the batch size without changing any other details of the training pipeline will often affect the validation set performance.

However, the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.

The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.

Smaller batch sizes introduce more noise into the training algorithm due to sample variance, and this noise can have a regularizing effect. Thus, larger batch sizes can be more prone to overfitting and may require stronger regularization and/or additional regularization techniques.

In addition, the number of training steps may need to be adjusted when changing the batch size.

Once all these effects are taken into account, there is currently no convincing evidence that the batch size affects the maximum achievable validation performance (see Shallue et al. 2018).