I suggest reading this article. It explains:
When upsampling before cross-validation, you will be picking the most oversampled model, because the oversampling is allowing data to leak from the validation folds into the training folds.
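For contrast, here is a minimal sketch of the leaky pattern being warned about, assuming X_train and y_train are an imbalanced training set: SMOTE is run on the whole set before the folds are drawn, so synthetic points interpolated from validation-fold samples end up in the training folds.

# WRONG: oversampling before the folds are drawn
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic minority samples are created from the FULL training set,
# so every validation fold contains points derived from its own data
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
scores = cross_val_score(RandomForestClassifier(random_state=13),
                         X_res, y_res, scoring='recall', cv=5)
# these fold scores are optimistically biased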
Instead, we should first split into training and validation folds. Then, on each fold, we should:
- oversample the minority class
- train the classifier on the training folds
- validate the classifier on the remaining fold (a hand-rolled sketch of this loop follows below)
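As a minimal sketch of that per-fold procedure (assuming X_train and y_train are NumPy arrays), the loop below does by hand what the pipeline approach further down automates:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_recalls = []
for train_idx, val_idx in kf.split(X_train):
    # oversample the minority class on the training folds only
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train[train_idx],
                                                       y_train[train_idx])
    # train the classifier on the oversampled training folds
    clf = RandomForestClassifier(n_estimators=100, random_state=13)
    clf.fit(X_res, y_res)
    # validate on the remaining fold, which SMOTE never saw
    fold_recalls.append(recall_score(y_train[val_idx],
                                     clf.predict(X_train[val_idx])))
print(np.mean(fold_recalls))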
Therefore, to avoid overfitting, try imblearn's make_pipeline so that the upsampling happens as part of the cross-validation, as shown below:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# random_state only takes effect when shuffle=True
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# define parameters for hyperparameter tuning
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 10, 12],
    'random_state': [13]
}

# SMOTE sits inside the pipeline, so it is refit on the training folds only
imba_pipeline = make_pipeline(SMOTE(random_state=42),
                              RandomForestClassifier(n_estimators=100, random_state=13))
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)

# prefix each parameter with the name of the pipeline step it belongs to
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf,
                         scoring='recall', return_train_score=True)
grid_imba.fit(X_train, y_train)

# check recall on the validation folds
grid_imba.best_score_

# check recall on the test set
y_test_predict = grid_imba.predict(X_test)
recall_score(y_test, y_test_predict)
As a result, the recall on the validation folds will be a good estimate of the recall on the test set.
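To check that claim concretely, one can print the two numbers side by side (a small illustrative follow-up to the code above):

# with no leakage, these two values should be close
print('cross-validated recall:', grid_imba.best_score_)
print('test-set recall:', recall_score(y_test, y_test_predict))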