Kfold

2025-03-19

10:19

KFold:

class sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)

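KFold is the simplest case: all samples are divided into n parts; in each round one part serves as the validation set and the remaining n-1 parts form the training set, for n rounds in total.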
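
A minimal sketch of the behaviour described above, assuming scikit-learn and NumPy are installed. The 10-sample toy array and the variable names are made up for illustration; with the default shuffle=False the validation folds are simply consecutive index blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples with a single feature each (illustrative data only)
X = np.arange(10).reshape(-1, 1)

# Default shuffle=False: each validation fold is a contiguous block of indices
kf = KFold(n_splits=5)
splits = list(kf.split(X))

for fold, (train_idx, valid_idx) in enumerate(splits):
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Every sample lands in the validation set exactly once, and each training set holds the other 8 samples.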

 

GroupKFold

class sklearn.model_selection.GroupKFold(n_splits=5, *, shuffle=False, random_state=None)

K-fold iterator variant with non-overlapping groups.

Each group will appear exactly once in the test set across all folds (the number of distinct groups has to be at least equal to the number of folds).

The folds are approximately balanced in the sense that the number of samples is approximately the same in each test fold when shuffle is True.

 

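GroupKFold does not treat samples as independent: every sample belongs to a group, much as every student belongs to a class. GroupKFold splits the data into training and validation sets class by class, and each class appears in the validation set exactly once. For example, with 10 classes (10 groups) and n_splits=5, each fold might hold out two classes for validation. A validation fold does not have to contain exactly two classes, though; that depends on the balancing criterion. Note that the docstring sentence quoted above appears to state the condition backwards. According to the user guide and the implementation, with shuffle=False (the default) groups are assigned greedily so that the number of samples in each test fold is approximately the same, whereas with shuffle=True the groups are shuffled and divided so that the number of distinct groups per fold is approximately the same, without regard to group size.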
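
The class analogy can be sketched as follows, assuming scikit-learn and NumPy are installed. The 10 groups of 3 samples each are hypothetical toy data; the groups are deliberately equal in size so that the result does not depend on which balancing criterion is in effect.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 10 hypothetical classes (groups) with 3 students (samples) each
groups = np.repeat(np.arange(10), 3)
X = np.zeros((len(groups), 1))  # features are irrelevant to the split itself

gkf = GroupKFold(n_splits=5)  # constructor defaults, so shuffle is off
seen_valid_groups = []
valid_sizes = []
for train_idx, valid_idx in gkf.split(X, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    valid_groups = set(groups[valid_idx].tolist())
    # A group is never split between the training and validation sets
    assert train_groups.isdisjoint(valid_groups)
    seen_valid_groups.extend(sorted(valid_groups))
    valid_sizes.append(len(valid_idx))
    print(sorted(valid_groups), len(valid_idx))
```

Across the five folds every group is held out exactly once, and with equal-sized groups each validation fold ends up with two groups (6 samples).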

 

StratifiedKFold

class sklearn.model_selection.StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)

Stratified K-Fold cross-validator.

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

 

Suppose n_splits = 5. StratifiedKFold then divides the samples of each class into 5 parts, and each fold takes one part from every class to make up that fold.
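
A small sketch of this, assuming scikit-learn and NumPy are installed. The imbalanced toy labels (10 samples of class 0, 5 of class 1) are invented for illustration; the point is that every validation fold keeps the 2:1 class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 10 samples of class 0 and 5 samples of class 1
y = np.array([0] * 10 + [1] * 5)
X = np.zeros((len(y), 1))  # features do not influence the stratification

skf = StratifiedKFold(n_splits=5)
valid_counts = []
for train_idx, valid_idx in skf.split(X, y):
    # Count how many samples of each class landed in this validation fold
    valid_counts.append(np.bincount(y[valid_idx], minlength=2).tolist())

print(valid_counts)  # each entry is [count of class 0, count of class 1]
```

Here each class splits evenly into 5 parts, so every validation fold receives 2 samples of class 0 and 1 sample of class 1.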

 

 

StratifiedGroupKFold

class sklearn.model_selection.StratifiedGroupKFold(n_splits=5, shuffle=False, random_state=None)

Stratified K-Fold iterator variant with non-overlapping groups.

This cross-validation object is a variation of StratifiedKFold that attempts to return stratified folds with non-overlapping groups. The folds are made by preserving the percentage of samples for each class.

Each group will appear exactly once in the test set across all folds (the number of distinct groups has to be at least equal to the number of folds).

The difference between GroupKFold and StratifiedGroupKFold is that the former attempts to create balanced folds such that the number of distinct groups is approximately the same in each fold, whereas StratifiedGroupKFold attempts to create folds which preserve the percentage of samples for each class as much as possible given the constraint of non-overlapping groups between splits.

 

 

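StratifiedGroupKFold is first of all a GroupKFold: it guarantees that each group appears in the validation set exactly once. The difference lies in the balancing criterion. GroupKFold only balances the folds by size (the docstring above describes this as keeping the number of distinct groups per fold roughly equal) and ignores the labels, whereas StratifiedGroupKFold tries to keep the class proportions of the training and validation sets as close as possible. This is best-effort only: if class A occurs only in Group 1, then whenever Group 1 is the validation set, the training set contains no class A samples.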

 

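Summary: how should a fold be understood? 5-fold means the data is divided into 5 parts; each part serves as the validation set in turn, while the other parts form the training set.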

 

TimeSeriesSplit

class sklearn.model_selection.TimeSeriesSplit(n_splits=5, *, max_train_size=None, test_size=None, gap=0)

Time Series cross-validator.

Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate.

This cross-validation object is a variation of KFold. In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.
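
A short sketch of the expanding-window behaviour, assuming scikit-learn and NumPy are installed. The 12 time-ordered toy observations and n_splits=2 are chosen only to keep the output small.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 toy observations, already sorted by time (illustrative data only)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=2)
ts_splits = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X)]

for k, (train_idx, test_idx) in enumerate(ts_splits):
    # The training window grows, and the test block always lies after it
    print(f"split {k}: train={train_idx} test={test_idx}")
```

The first split trains on indices 0-3 and tests on 4-7; the second trains on 0-7 and tests on 8-11, so the model is never evaluated on data older than what it was trained on.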

 

In the Jane Street competition, the public 6th-place solution used a 2-fold time series split.

 

 
