Let's take a step back and look at why we perform these splits in the first place:
- Model selection: estimating the performance of different models in order to choose the best one.
- Model evaluation: having chosen a final model, estimating its prediction error on new data (the generalization error).
(Source: "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Hastie et al.)
For model selection you use the validation set; for model evaluation you use the test set.
A straightforward procedure therefore looks like this:
- Split the data into train/validation/test sets
- Train candidate models on the training set
- Compare the models on the validation set
- Repeat steps 2 and 3 until your stopping criterion is met (e.g., satisfactory performance)
- Pick your final model and retrain it on the combined train and validation sets
- Evaluate your chosen, retrained model on the test set
Below is an SVM example, taken from "Introduction to Machine Learning with Python" by Müller and Guido:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()

# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)
print("Size of training set: {} size of validation set: {} size of test set:"
      " {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters, train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the validation set
        score = svm.score(X_valid, y_valid)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

# rebuild a model on the combined training and validation set,
# and evaluate it on the test set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
test_score = svm.score(X_test, y_test)
print("Best score on validation set: {:.2f}".format(best_score))
print("Best parameters: ", best_parameters)
print("Test set score with best parameters: {:.2f}".format(test_score))
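As a sketch of the more idiomatic route (not part of the book excerpt above): scikit-learn's GridSearchCV can replace the hand-rolled double loop, using cross-validation on the train+validation portion for model selection and keeping the test set untouched for the final evaluation. The parameter grid below simply mirrors the one from the loops above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
# hold out a test set for model evaluation, exactly as before
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
# 5-fold cross-validation on train+validation replaces the single validation split
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_trainval, y_trainval)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
# after fitting, the best model is automatically refit on all of X_trainval,
# so scoring on the test set matches the "retrain, then evaluate" step above
print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))
```

Cross-validation makes better use of the data than a single validation split, at the cost of training each parameter combination several times.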