Google Tuning Playbook
April 16, 2025
14:29
https://github.com/google-research/tuning_playbook
1. Guide for starting a new project
Summary: When starting a new project, try to reuse a model that already works.
Summary: Start with the most popular optimizer for the type of problem at hand.
Summary: The batch size governs the training speed and shouldn't be used to directly tune the validation set performance. Often, the ideal batch size will be the largest batch size supported by the available hardware.
Determining the feasible batch sizes and estimating training throughput
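The benefit of a larger batch size comes from higher training throughput. In theory, as long as accelerator memory is not saturated, doubling the batch size should double the throughput; if it does not, the pipeline has another bottleneck, such as I/O. These steps may need to be repeated every time the model or optimizer changes (for example, a different model architecture may allow a larger batch size to fit in memory).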
training throughput = (# examples processed per second)
or, equivalently, the time per step.
time per step = (batch size) / (training throughput)
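As a minimal sketch of the check described above, the following Python snippet computes the time per step from measured throughput and finds the first batch size at which throughput stops scaling roughly linearly. The measurements, the function names, and the 0.8 tolerance are illustrative assumptions, not values from the playbook.

```python
def time_per_step(batch_size: int, throughput: float) -> float:
    """Seconds per step, given throughput in examples per second."""
    return batch_size / throughput


def first_saturated_batch_size(measurements, tolerance=0.8):
    """Return the first batch size whose throughput gain falls short of linear scaling.

    `measurements` maps batch size -> measured examples/second.
    A step counts as saturated when the actual throughput ratio is below
    `tolerance` times the batch size ratio. Returns None if scaling never breaks.
    """
    sizes = sorted(measurements)
    for small, large in zip(sizes, sizes[1:]):
        expected_gain = large / small
        actual_gain = measurements[large] / measurements[small]
        if actual_gain < expected_gain * tolerance:
            return large
    return None


# Hypothetical measurements: throughput roughly doubles up to 256, then flattens.
measured = {64: 1000.0, 128: 2000.0, 256: 3900.0, 512: 4200.0}
saturated_at = first_saturated_batch_size(measured)
step_times = {b: time_per_step(b, t) for b, t in measured.items()}
```

With these made-up numbers the time per step stays near 0.064 s up to batch size 256 and then jumps at 512, which is the signal to look for a bottleneck or stop increasing the batch size.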
Choosing the batch size to minimize training time
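Training time is the product of the time per step and the total number of steps. The time per step is often approximately constant across feasible batch sizes (this holds while throughput scales linearly with the batch size). As the batch size increases, the number of steps needed to reach a fixed performance target typically decreases, provided all relevant hyperparameters are re-tuned whenever the batch size changes (Shallue et al. 2018). In the ideal case, doubling the batch size halves the total number of steps. The batch size that minimizes training time is therefore usually the largest batch size that still reduces the number of steps required. A larger batch size is pointless if it ends up increasing training time.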
Training time = (time per step) x (total number of steps)
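The formula can be applied directly to pick a batch size. The sketch below uses hypothetical measurements (seconds per step and steps needed to reach a fixed validation target, each obtained after re-tuning for that batch size); none of the numbers come from the playbook.

```python
def training_time(time_per_step: float, total_steps: int) -> float:
    """Training time = (time per step) x (total number of steps)."""
    return time_per_step * total_steps


# Hypothetical measurements: batch size -> (seconds per step, steps to reach target).
trials = {
    128: (0.10, 40_000),
    256: (0.10, 20_000),   # perfect scaling: steps halve, step time unchanged
    512: (0.11, 11_000),   # still a net win
    1024: (0.20, 9_000),   # step time doubles but steps barely drop
}

times = {b: training_time(t, s) for b, (t, s) in trials.items()}
best_batch_size = min(times, key=times.get)
```

Here the largest batch size is not the fastest: going from 512 to 1024 doubles the time per step while reducing the step count only slightly, so 512 minimizes total training time.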
Changing the batch size requires re-tuning most hyperparameters
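The optimal values of most hyperparameters are sensitive to the batch size, so changing the batch size generally requires re-tuning. The hyperparameters that interact most strongly with the batch size are the optimizer hyperparameters (e.g. learning rate and momentum) and the regularization hyperparameters, so these should be tuned separately for each batch size.
How batch norm interacts with the batch size
In general, batch norm should compute its normalization statistics with a different batch size than the one used for the gradient computation; see the batch norm section for details. Note that this concerns how many examples the statistics are computed over (as in Ghost Batch Norm), not Bessel's correction.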
2.1 The incremental tuning strategy
Start from a baseline and improve it step by step, building insight into the problem and the data along the way.
Summary: Start with a simple configuration and incrementally make improvements while building up insight into the problem. Make sure that any improvement is based on strong evidence to avoid adding unnecessary complexity.
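The incremental tuning strategy involves repeating four steps, covered in sections 2.2 to 2.5: choose the goal for the next round of experiments, design the experiments, extract insight from the results, and decide whether to adopt the change.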
2.2 Choosing the goal for the next round of experiments
Choose a goal for each round of experiments. Favor exploration over exploitation: most of the time, our primary goal is to gain insight into the problem.
Summary: Each round of experiments should have a clear goal and be sufficiently narrow in scope that the experiments can actually make progress towards the goal.
2.3 Designing the next round of experiments
Decide which hyperparameters are scientific, which are nuisance, and which are fixed.
Summary: Identify which hyperparameters are scientific, nuisance, and fixed hyperparameters for the experimental goal. Create a sequence of studies to compare different values of the scientific hyperparameters while optimizing over the nuisance hyperparameters. Choose the search space of nuisance hyperparameters to balance resource costs with scientific value.
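For example, to measure the effect of the number of layers, the number of layers is the scientific hyperparameter and the learning rate is a nuisance hyperparameter. When varying the number of layers we should not hold the learning rate fixed; we should use the best learning rate for each number of layers, because the two interact rather than acting independently. The activation function, by contrast, can be a fixed hyperparameter and held constant while the number of layers varies.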
For example, if our goal is to "determine whether a model with more hidden layers will reduce validation error", then the number of hidden layers is a scientific hyperparameter.
Whether a hyperparameter is scientific, nuisance, or fixed depends on the experimental goal.
Whether a particular hyperparameter is a scientific hyperparameter, nuisance hyperparameter, or fixed hyperparameter is not inherent to that hyperparameter, but changes depending on the experimental goal.
For example, the choice of activation function could be a scientific hyperparameter (is ReLU or tanh a better choice for our problem?), a nuisance hyperparameter (is the best 5-layer model better than the best 6-layer model when we allow several different possible activation functions?), or a fixed hyperparameter (for ReLU nets, does adding batch normalization in a particular position help?).
When designing the next round of experiments, first identify the scientific hyperparameters and provisionally treat all other hyperparameters as nuisance hyperparameters. With unlimited compute we would leave them all as nuisance hyperparameters, so that the conclusions we draw from our experiments are free from caveats about fixed hyperparameter values. However, the more nuisance hyperparameters we attempt to tune, the greater the risk we fail to tune them sufficiently well for each setting of the scientific hyperparameters and end up reaching the wrong conclusions from our experiments, and compute is never unlimited. We therefore convert some nuisance hyperparameters into fixed hyperparameters, guided by this principle: The more a given nuisance hyperparameter interacts with the scientific hyperparameters, the more damaging it is to fix its value. For example, the best value of the weight decay strength typically depends on the model size, so comparing different model sizes assuming a single specific value of the weight decay would not be very insightful.
Although the category of a hyperparameter depends on the experimental goal, some rules of thumb apply.
Learning-rate and other optimizer hyperparameters are usually nuisance hyperparameters, because they interact with the scientific hyperparameters.
Of the various optimizer hyperparameters (e.g. the learning rate, momentum, learning rate schedule parameters, Adam betas etc.), at least some of them will be nuisance hyperparameters because they tend to interact the most with other changes.
The choice of optimizer, in contrast, is usually a scientific or fixed hyperparameter.
It is a scientific hyperparameter if our experimental goal involves making fair comparisons between two or more different optimizers (e.g. "determine which optimizer produces the lowest validation error in a given number of steps").
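Similarly, the hyperparameters of a regularization technique are nuisance hyperparameters, but whether to regularize at all, and with which technique, is a scientific or fixed hyperparameter.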
Hyperparameters introduced by a regularization technique are typically nuisance hyperparameters, but whether or not we include the regularization technique at all is a scientific or fixed hyperparameter.
Architectural hyperparameters are usually scientific or fixed.
Architectural hyperparameters are often scientific or fixed hyperparameters because architecture changes can affect serving and training costs, latency, and memory requirements.
In some cases, the sets of nuisance and fixed hyperparameters depend on the values of the scientific hyperparameters.
For example, suppose we are trying to determine which optimizer out of Nesterov momentum and Adam results in the lowest validation error. The scientific hyperparameter is the optimizer, which takes values {"Nesterov_momentum", "Adam"}. The value optimizer="Nesterov_momentum" introduces the nuisance/fixed hyperparameters {learning_rate, momentum}, but the value optimizer="Adam" introduces the nuisance/fixed hyperparameters {learning_rate, beta1, beta2, epsilon}.
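Once the scientific, nuisance, and fixed hyperparameters are identified, we specify a set of hyperparameter configurations to run; each configuration is called a trial. This involves choosing the search space, the number of trials, and an automated search algorithm (or specifying the trials by hand).
The goal of the search is to measure how different values of the scientific hyperparameters affect performance while keeping the nuisance hyperparameters at their best values for each setting, so that the comparison between scientific values is meaningful.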
The purpose of the studies is to run the pipeline with different values of the scientific hyperparameters, while at the same time "optimizing away" (or "optimizing over") the nuisance hyperparameters so that comparisons between different values of the scientific hyperparameters are as fair as possible.
An example of the simple case:
For example, if our goal is to select the best optimizer out of Nesterov momentum and Adam, we could create one study in which optimizer="Nesterov_momentum" and the nuisance hyperparameters are {learning_rate, momentum}, and another study in which optimizer="Adam" and the nuisance hyperparameters are {learning_rate, beta1, beta2, epsilon}. We would compare the two optimizers by selecting the best performing trial from each study.
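The two-study comparison can be sketched as follows. This is only an illustration under stated assumptions: `toy_validation_error` is a made-up stand-in for actually training a model, the search ranges are arbitrary, and plain uniform random sampling replaces the quasi-random search the playbook prefers (in practice the learning rate would usually be sampled on a log scale).

```python
import random

# One search space per value of the scientific hyperparameter (the optimizer).
# The ranges are illustrative assumptions.
SEARCH_SPACES = {
    "Nesterov_momentum": {"learning_rate": (1e-4, 1e-1), "momentum": (0.8, 0.99)},
    "Adam": {
        "learning_rate": (1e-5, 1e-2),
        "beta1": (0.8, 0.99),
        "beta2": (0.9, 0.9999),
        "epsilon": (1e-10, 1e-6),
    },
}


def sample_trial(space, rng):
    """Draw one configuration of the nuisance hyperparameters."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}


def toy_validation_error(optimizer, params):
    """Hypothetical stand-in for training: error grows with distance from a 'good' LR."""
    target = 0.01 if optimizer == "Nesterov_momentum" else 0.001
    return abs(params["learning_rate"] - target) / target


def run_study(optimizer, num_trials, seed):
    """Run one study: many trials for a single value of the scientific hyperparameter."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        params = sample_trial(SEARCH_SPACES[optimizer], rng)
        trials.append((toy_validation_error(optimizer, params), params))
    return trials


studies = {opt: run_study(opt, num_trials=20, seed=0) for opt in SEARCH_SPACES}
# Compare optimizers using the best trial of each study ("optimizing away" nuisance values).
best = {opt: min(trials, key=lambda t: t[0]) for opt, trials in studies.items()}
```

Each optimizer gets its own nuisance hyperparameters and its own tuning budget, and only the best trial per study enters the comparison.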
We can use any gradient-free optimization algorithm, including methods such as Bayesian optimization or evolutionary algorithms, to optimize over the nuisance hyperparameters, although we prefer to use quasi-random search in the exploration phase of tuning because of a variety of advantages it has in this setting.
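In the more complicated case, the scientific hyperparameters are placed in the same search space as the nuisance hyperparameters and searched jointly. Quasi-random search is the better choice here because it samples the scientific hyperparameters relatively uniformly.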
In the more complicated case where we want to compare a large number of values of the scientific hyperparameters and it is impractical to make that many independent studies, we can include the scientific parameters in the same search space as the nuisance hyperparameters and use a search algorithm to sample values of both the scientific and nuisance hyperparameters in a single study.
In this case, our preference for using quasi-random search over fancier black-box optimization tools is even stronger, since it ensures that we obtain a relatively uniform sampling of values of the scientific hyperparameters. Regardless of the search algorithm, we need to make sure somehow that it searches the scientific parameters uniformly.
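A round of experiments needs to cover enough values of the scientific hyperparameters to yield insight, and to tune the nuisance hyperparameters well enough for each of them. Every additional value searched costs compute, so these goals must be balanced against the available budget.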
2.4 Extracting insight from experimental results
Beyond measuring the effect of the scientific hyperparameters, each round of experiments should be checked for several other issues.
Summary: In addition to trying to achieve the original scientific goal of each group of experiments, go through a checklist of additional questions and, if issues are discovered, revise the experiments and rerun them.
Before analyzing a given set of experiments to make progress toward their original goal, we should ask ourselves the following additional questions:
Bad search space boundaries. A search space is suspicious if the best point sampled from it is close to its boundary. We might find an even better point if we expanded the search range in that direction.
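A boundary check like this is easy to automate. The helper below is a hypothetical sketch: the name `near_boundary`, the 10% margin, and the example range are assumptions chosen for illustration.

```python
import math


def near_boundary(best_value, low, high, margin=0.1, log_scale=False):
    """Return True if best_value lies within `margin` (as a fraction of the
    range) of either end of the search interval [low, high]."""
    if log_scale:
        # Learning rates and similar values are usually searched on a log scale.
        best_value, low, high = (math.log10(v) for v in (best_value, low, high))
    position = (best_value - low) / (high - low)
    return position < margin or position > 1.0 - margin


# Learning rate searched in [1e-4, 1e-1] on a log scale.
suspicious = near_boundary(8e-2, 1e-4, 1e-1, log_scale=True)
comfortable = near_boundary(3e-3, 1e-4, 1e-1, log_scale=True)
```

If the check fires for the best trials, the next round of experiments would extend the range in that direction.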
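It is generally hard to know whether the search space has been sampled densely enough. In practice we sample as much as we can afford and use the experimental results to calibrate our intuition about the problem.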
Do the training curves show overfitting? What can we learn from the training curves of the best trials?
Problematic overfitting occurs when the validation error starts increasing at some point during training.
If there is overfitting, add regularization and rerun the experiment so that the scientific hyperparameters are compared fairly. If any of the best trials exhibits problematic overfitting, we usually want to re-run the experiment with additional regularization techniques and/or better tune the existing regularization parameters before comparing the values of the scientific hyperparameters. Reducing overfitting is often straightforward using common regularization techniques that add minimal code complexity or extra computation (e.g. dropout, label smoothing, weight decay), so it’s usually no big deal to add one or more of these to the next round of experiments.
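One simple way to flag problematic overfitting in a logged validation curve is to compare the final validation error with the best one seen during training. The function name and the `min_rise` threshold below are assumptions; a real check would also look at the curve by eye.

```python
def problematic_overfitting(val_errors, min_rise=0.01):
    """Return (flag, best_step).

    flag is True when the final validation error is more than `min_rise`
    above the lowest validation error reached during training.
    best_step is the index of that lowest error.
    """
    best_step = min(range(len(val_errors)), key=val_errors.__getitem__)
    flag = val_errors[-1] - val_errors[best_step] > min_rise
    return flag, best_step


# Hypothetical curves, one value per evaluation.
overfit_curve = [0.50, 0.30, 0.20, 0.18, 0.21, 0.25]
healthy_curve = [0.50, 0.30, 0.20, 0.18, 0.18, 0.179]

overfit_flag, overfit_best = problematic_overfitting(overfit_curve)
healthy_flag, healthy_best = problematic_overfitting(healthy_curve)
```

For the first curve the error bottoms out at evaluation 3 and then rises, so the trial would be rerun with more regularization before comparing scientific hyperparameters.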
Is there high step-to-step variance in the training or validation error late in training?
The most likely causes of step-to-step variance are batch variance (from randomly sampling examples from the training set for each batch), small validation sets, and using a learning rate that’s too high late in training.
Possible remedies include increasing the batch size, obtaining more validation data, using learning rate decay, or using Polyak averaging.
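Of these remedies, Polyak averaging is the least self-explanatory: evaluation uses an average of the parameter iterates instead of the latest iterate, which smooths out step-to-step noise. The sketch below keeps a running arithmetic mean over plain Python lists; many frameworks use an exponential moving average instead, and the parameter values here are made up.

```python
def polyak_update(average, params, num_updates):
    """Fold `params` into the running mean of all iterates seen so far.

    avg_{t+1} = avg_t + (params - avg_t) / (t + 1), where t = num_updates.
    """
    return [a + (p - a) / (num_updates + 1) for a, p in zip(average, params)]


# Hypothetical noisy iterates of a single parameter oscillating around 1.0.
iterates = [[1.2], [0.8], [1.1], [0.9]]

averaged = [0.0]
for t, params in enumerate(iterates):
    averaged = polyak_update(averaged, params, t)
```

After the loop `averaged` is close to `[1.0]` even though every individual iterate is off by 0.1 to 0.2.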
Are the trials still improving at the end of training?
If so, this indicates that we are in the "compute bound" regime and we may benefit from increasing the number of training steps or changing the learning rate schedule.
Has performance on the training and validation sets saturated long before the final training step?
If so, this indicates that we are in the "not compute-bound" regime and that we may be able to decrease the number of training steps.
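The two questions about the end of training can be turned into a rough automatic check on a logged validation curve. The function name, the thresholds, and the "first half of training" cutoff below are all illustrative assumptions.

```python
def classify_curve(val_errors, tail_steps=3, tolerance=0.005):
    """Rough label for a validation curve (one value per evaluation).

    "still improving": error dropped by more than `tolerance` over the last `tail_steps` evaluations.
    "saturated early": error was already within `tolerance` of its final value
                       during the first half of training.
    """
    tail = val_errors[-tail_steps:]
    if tail[0] - tail[-1] > tolerance:
        return "still improving"
    final = val_errors[-1]
    # The last evaluation always satisfies the condition, so next() cannot fail.
    first_saturated = next(i for i, e in enumerate(val_errors) if e - final <= tolerance)
    if first_saturated < len(val_errors) // 2:
        return "saturated early"
    return "no clear signal"


improving = [1.0 - 0.05 * i for i in range(10)]
saturated = [0.5, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
neither = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.25, 0.25, 0.25]

labels = [classify_curve(c) for c in (improving, saturated, neither)]
```

A "still improving" label suggests more training steps or a different learning rate schedule; "saturated early" suggests the step budget can be cut.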
2.5 Determining whether to adopt a training pipeline change or hyperparameter configuration
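Before changing the model or training procedure, or adopting a new hyperparameter configuration, we need to establish whether the new configuration genuinely improves results or whether the difference comes from other sources of variation. Different random seeds, random initialization, training data shuffling, dropout masks, patterns of data augmentation operations, and the ordering of parallel arithmetic operations are all potential sources of trial variance.
Therefore, before adopting a candidate change, consider running the best trial N times to characterize the run-to-run trial variance.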
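A minimal sketch of characterizing run-to-run variance: repeat the baseline and the candidate with different seeds, then compare the gap in means with the spread of the baseline. The two-standard-deviation rule and all the numbers are assumptions for illustration, not a statistical test prescribed by the playbook.

```python
import statistics


def summarize_runs(errors):
    """Mean and sample standard deviation of validation errors across repeated runs."""
    return statistics.mean(errors), statistics.stdev(errors)


def clearly_better(candidate_runs, baseline_runs, num_stdevs=2.0):
    """Crude check: the candidate's mean error beats the baseline's mean by more
    than `num_stdevs` times the baseline's run-to-run standard deviation."""
    base_mean, base_std = summarize_runs(baseline_runs)
    cand_mean, _ = summarize_runs(candidate_runs)
    return base_mean - cand_mean > num_stdevs * base_std


# Hypothetical validation errors from N = 5 runs each (different seeds).
baseline = [0.100, 0.104, 0.098, 0.102, 0.096]
small_gain = [0.097, 0.099, 0.095, 0.101, 0.098]
large_gain = [0.090, 0.091, 0.089, 0.092, 0.088]

adopt_small = clearly_better(small_gain, baseline)
adopt_large = clearly_better(large_gain, baseline)
```

The small gain sits inside the baseline's noise and would not justify added complexity; the large gain stands out clearly.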
2.6 After exploration concludes
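Summary: Bayesian optimization tools are a compelling option once we are done exploring for good search spaces and have decided which hyperparameters should be tuned.
At some point, our priorities shift from learning more about the tuning problem to producing a single best configuration to launch or otherwise use.
At this point, there should be a refined search space that comfortably contains the local region around the best observed trial and has been adequately sampled.
Our exploration work should have revealed the most important hyperparameters to tune (and sensible ranges for them), which we can use to construct a search space for a final automated tuning study with as large a tuning budget as possible.
Since we no longer care about maximizing our insight into the tuning problem, many of the advantages of quasi-random search no longer apply, and Bayesian optimization tools should be used to find the best hyperparameter configuration automatically. (In other words: quasi-random search first, then Bayesian optimization, much like coarse ranking followed by fine ranking.)
If the search space contains a large number of divergent points (points that get NaN training loss, or training loss many standard deviations worse than the mean), it is important to use black-box optimization tools that handle divergent trials properly (see Bayesian Optimization with Unknown Constraints for an excellent way to deal with this issue).
At this point, we should also consider checking performance on the test set.
In principle, we could even fold the validation set into the training set and retrain the best configuration found with Bayesian optimization. However, this is only appropriate if there will be no future launches with this specific workload (e.g. a one-time Kaggle competition).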
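Whether or not training is compute-bound, methods that increase the variance of the gradients across batches will usually increase the number of training steps required. Examples include a smaller batch size, data augmentation, and dropout.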
3.1 When training is compute-bound
When training is compute-bound, speeding up training is in some sense equivalent to improving training performance, and the "optimal" training time is always "as long as we can afford."
3.2 When training is not compute-bound
When training is not compute-bound, we can afford to train as long as we would like to, and, at some point, training longer doesn't help much (or even causes problematic overfitting).
4.1 Optimizing the input pipeline
Optimize the input pipeline.
Possible problems with the input pipeline:
4.2 Evaluating model performance
Summary: Run evaluation at larger batch sizes than training. Run evaluations at regular step intervals, not regular time intervals.
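Evaluating on a step schedule can be as simple as the check below, called once per training step. The names `should_evaluate`, `eval_every_steps`, and `max_train_steps` are assumptions for this sketch; evaluating at the final step as well ensures the last checkpoint always has a metric.

```python
def should_evaluate(step, eval_every_steps, max_train_steps):
    """Evaluate at fixed step intervals and once more at the final step."""
    return step % eval_every_steps == 0 or step == max_train_steps


# Steps at which a 1000-step run with eval_every_steps=300 would evaluate.
eval_steps = [s for s in range(1, 1001) if should_evaluate(s, 300, 1000)]
```

Because the schedule depends only on the step count, curves from different trials line up on the same x-axis regardless of how fast each trial ran.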
For datasets with few examples, record the number of correctly predicted examples to get more insight into accuracy improvements.
4.3 Saving checkpoints and retrospectively selecting the best checkpoint
4.4 Batch normalization implementation details
Summary: nowadays batch norm can often be replaced with LayerNorm, but in cases where it cannot, there are tricky details when changing the batch size or number of hosts.
4.5 Considerations for multi-host pipelines
Summary: for logging, evals, RNGs, checkpointing, and data sharding, multi-host training can make it very easy to introduce bugs!
What is the best learning rate decay schedule family?
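No learning rate decay schedule family is known to be the best. What is clear is that the learning rate should not be held constant; it should change over the course of training.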
Different learning rates work best at different times during the optimization process.
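To make "a non-constant schedule" concrete, here are two widely used decay families, linear and cosine, written as plain functions of the step. They are shown as common choices rather than as an answer from these notes; `peak_lr`, `final_lr`, and `total_steps` are assumed parameter names, and warmup is omitted.

```python
import math


def linear_decay(step, total_steps, peak_lr, final_lr=0.0):
    """Learning rate that moves linearly from peak_lr to final_lr."""
    frac = min(step, total_steps) / total_steps
    return peak_lr + (final_lr - peak_lr) * frac


def cosine_decay(step, total_steps, peak_lr, final_lr=0.0):
    """Learning rate that follows half a cosine wave from peak_lr to final_lr."""
    frac = min(step, total_steps) / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * frac))


# Compare the two schedules at a few points of a 100-step run.
schedule = [(s, linear_decay(s, 100, 0.1), cosine_decay(s, 100, 0.1)) for s in (0, 25, 50, 100)]
```

Both start at `peak_lr` and end at `final_lr`; cosine decay stays higher early and drops faster in the second half.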
Which learning rate decay should I use as a default?
How should Adam’s hyperparameters be tuned?
Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?
What is a quasi-random sequence? A sequence whose points are spread more evenly than random ones. Quasi-random refers to a sequence of numbers that are designed to fill a space more uniformly than purely random sequences. Unlike random numbers, which can cluster and leave gaps, quasi-random numbers are generated using deterministic algorithms that ensure a more even distribution across a defined range.
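The Halton sequence is one such deterministic construction (used here as an example; the notes do not say which generator to use). The sketch below builds it from van der Corput sequences in pure Python and maps the points onto a hypothetical search space with a log-scale learning rate and an integer number of layers.

```python
def van_der_corput(index, base):
    """index-th element (index >= 1) of the van der Corput sequence in `base`:
    the digits of `index` in that base, mirrored around the radix point."""
    result, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        result += digit / denom
    return result


def halton(num_points, bases=(2, 3)):
    """First `num_points` Halton points in [0, 1)^d, one coprime base per dimension."""
    return [tuple(van_der_corput(i, b) for b in bases) for i in range(1, num_points + 1)]


def to_trial(point):
    """Map a unit-square point to a hypothetical search space."""
    u_lr, u_layers = point
    return {
        "learning_rate": 10 ** (-4 + 3 * u_lr),  # log-uniform in [1e-4, 1e-1)
        "num_layers": 2 + int(u_layers * 5),     # integer in {2, ..., 6}
    }


points = halton(8)
trials = [to_trial(p) for p in points]
first_dim = [p[0] for p in points]
```

The first coordinate visits 0.5, 0.25, 0.75, 0.125, and so on, so after 8 points each quarter of the interval holds exactly two samples, an evenness that independent random draws do not guarantee.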
Why shouldn't the batch size be tuned to directly improve validation set performance?
Changing the batch size without changing any other details of the training pipeline will often affect the validation set performance.
However, the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.
The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
Smaller batch sizes introduce more noise into the training algorithm due to sample variance, and this noise can have a regularizing effect. Thus, larger batch sizes can be more prone to overfitting and may require stronger regularization and/or additional regularization techniques.
In addition, the number of training steps may need to be adjusted when changing the batch size.
Once all these effects are taken into account, there is currently no convincing evidence that the batch size affects the maximum achievable validation performance (see Shallue et al. 2018).