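Different performance when splitting into test/train data vs. using cross-validation

data-mining python classification scikit-learn multiclass-classification class-imbalance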

2022-03-08 14:01:19

I am training a linear model with the following scikit-learn setup:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

[...]

random_state=786543

# [...] max_iter=5, tol=None)  (tail of a line elided in the original post)
clf = LinearSVC(random_state=random_state, dual=True, C=1.5)

X_train, X_test, y_train, y_test, i_train, i_test = train_test_split(feature_matrix, y, indices, test_size=0.33, random_state=random_state)
clf.fit(X_train, y_train.values)
predicted_train = clf.predict(X_train)
predicted_test = clf.predict(X_test)
print('Train Accuracy: ' + str(np.mean(y_train == predicted_train)))
print('Test Accuracy: ' + str(np.mean(y_test == predicted_test)))
print('Test F1 micro: ' + str(f1_score(y_test, predicted_test, average='micro')))
print('Test F1 macro: ' + str(f1_score(y_test, predicted_test, average='macro')))
print('Test F1 weighted: ' + str(f1_score(y_test, predicted_test, average='weighted')))

Train Accuracy: 0.985129495926343
Test Accuracy: 0.9601936525013448
Test F1 micro: 0.9601936525013448
Test F1 macro: 0.9000889214688401
Test F1 weighted: 0.9590331562500389

But now I run

scores = cross_val_score(clf, feature_matrix, y, cv=5, scoring='f1_macro')
print(scores)

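array([0.65860981, 0.84306338, 0.82113645, 0.83414211, 0.64665942])

How can this discrepancy be explained? I have tested this with different random states.

A few points to consider:

I have multiple classes (but only one label per sample)
The data set is skewed (some classes have many samples, some have very few)
I have 45066 samples, 5222 features and 259 classes

The number of samples per class is: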

sorted(list(np.unique(y, return_counts=True)[1]))

[1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 4, 4, 4, 4, 7, 7, 8, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 18, 18, 18, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 21, 21, 21, 22, 22, 22, 22, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27, 27, 27, 27, 29, 29, 29, 29, 30, 30, 30, 30, 32, 32, 32, 34, 34, 35, 35, 35, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 38, 38, 38, 38, 40, 40, 40, 41, 41, 45, 45, 45, 46, 46, 46, 46, 47, 47, 47, 48, 49, 50, 50, 52, 55, 56, 59, 59, 60, 60, 61, 61, 61, 65, 65, 67, 67, 69, 72, 73, 74, 75, 77, 77, 79, 80, 84, 85, 87, 93, 96, 97, 97, 103, 110, 112, 117, 123, 130, 139, 139, 141, 143, 146, 146, 147, 147, 150, 159, 161, 169, 170, 177, 180, 180, 189, 191, 196, 198, 199, 201, 202, 203, 203, 208, 211, 230, 236, 249, 255, 264, 268, 269, 300, 332, 347, 356, 358, 364, 388, 433, 469, 476, 484, 548, 652, 698, 723, 748, 753, 807, 815, 1013, 1200, 1222, 1243, 1274, 1447, 1643, 1741, 2900, 3909, 4627]

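2 Answers

Reason for the discrepancy

Two aspects of the split need to be considered:

Is the split stratified? (It should be.)
Is the data shuffled? (It should be.)

The line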

X_train, X_test, y_train, y_test, i_train, i_test = train_test_split(feature_matrix, y, indices, test_size=0.33, random_state=random_state)

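shuffles the data by default (see the shuffle parameter). Note that it does not stratify by default: the stratify parameter defaults to None, so the split is only stratified if you pass the labels explicitly via stratify: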

See: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

The line

scores = cross_val_score(clf, feature_matrix, y, cv=5, scoring='f1_macro')

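splits the data in a stratified fashion (see the cv parameter:

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

), but it does not shuffle. Each fold is therefore drawn from the data in its original order, which leads to the poor results for this line.

See: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html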
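To make the effect of the missing shuffle concrete, here is a minimal sketch on toy labels. The arrays y_toy and X_toy and all variable names are made up for illustration and are not taken from the question. It compares the test folds produced by StratifiedKFold with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 10 samples of class 0 followed by 10 of class 1.
y_toy = np.array([0] * 10 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not influence the split

# What cv=5 uses for a classifier: stratified, but not shuffled.
plain = StratifiedKFold(n_splits=5)
plain_folds = [test.tolist() for _, test in plain.split(X_toy, y_toy)]

# Same splitter with shuffling enabled.
shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
shuffled_folds = [test.tolist() for _, test in shuffled.split(X_toy, y_toy)]

print("no shuffle:", plain_folds)
print("shuffle   :", shuffled_folds)
```

Without shuffling, every test fold takes a run of consecutive samples from each class, so any ordering present in the data set (by date, by source, by duplicate groups) carries over into the folds.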

Solution

Option 1: Shuffle the data beforehand:

from sklearn.utils import shuffle
# shuffle features and labels together before cross-validating
scores = cross_val_score(clf, *shuffle(feature_matrix, df.eClass, random_state=42), cv=5, scoring='f1_macro')
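Option 1 relies on sklearn.utils.shuffle applying the same permutation to every array it receives, so features and labels stay aligned. A small sanity check on toy arrays (X_small and y_small are hypothetical names, chosen so that the alignment is easy to verify):

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data (hypothetical): row i of X_small is [2*i, 2*i + 1], and its label is i.
X_small = np.arange(10).reshape(5, 2)
y_small = np.arange(5)

X_s, y_s = shuffle(X_small, y_small, random_state=42)

# After shuffling, the first column of each row must still equal 2 * label.
aligned = bool(np.all(X_s[:, 0] == 2 * y_s))
print(y_s, aligned)
```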

Option 2: Use an appropriate cross-validation object

I also looked into using a proper cross-validation scheme by using a different object:

from sklearn.model_selection import StratifiedKFold
# stratified 5-fold split with shuffling enabled
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    clf.fit(X_train, y_train.values)
    print('--------------------------------------')
    predicted_train = clf.predict(X_train)
    predicted_test = clf.predict(X_test)
    print('Train Accuracy: ' + str(np.mean(y_train == predicted_train)))
    print('Test Accuracy: ' + str(np.mean(y_test == predicted_test)))
    print('Test F1 micro: ' + str(f1_score(y_test, predicted_test, average='micro')))
    print('Test F1 macro: ' + str(f1_score(y_test, predicted_test, average='macro')))
    print('Test F1 weighted: ' + str(f1_score(y_test, predicted_test, average='weighted')))
    print('--------------------------------------')
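If only the per-fold scores are needed, the same kind of splitter object can be passed directly to cross_val_score through its cv parameter instead of writing the loop by hand. A minimal sketch, assuming synthetic stand-in data from make_classification (X_demo, y_demo, clf_demo and cv_scores are illustrative names, not the asker's data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption); replace with feature_matrix and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)

clf_demo = LinearSVC(random_state=0, C=1.5)

# Stratified 5-fold split with shuffling, as in the loop above.
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score accepts the splitter object via cv and returns one score per fold.
cv_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=skf_demo, scoring="f1_macro")
print(cv_scores, cv_scores.mean())
```

This keeps the scoring identical to the original cross_val_score call and changes only how the folds are built.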

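Things to look at:

I would remove the classes with very few samples, since they introduce variance into the model; removing them also helps with bias.
I would try to create new features by combining features that are similar or that have a similar effect on the result, because you have a large number of features relative to the number of samples.
Try logistic regression and look at the results. Logistic regression can handle this kind of data well.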
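The first suggestion can be sketched as follows. The helper drop_rare_classes and the threshold min_count are hypothetical and not part of scikit-learn; a sensible threshold depends on the data (according to the counts listed in the question, 14 classes have fewer than 5 samples, which is fewer than the number of folds).

```python
import numpy as np

def drop_rare_classes(X, y, min_count):
    """Keep only the samples whose class has at least `min_count` members."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Boolean mask: True for samples belonging to a sufficiently frequent class.
    keep = np.isin(y, classes[counts >= min_count])
    return X[keep], y[keep]

# Toy data (hypothetical): classes "c" and "d" have a single sample each.
X_rare = np.arange(16).reshape(8, 2)
y_rare = np.array(["a", "a", "a", "b", "b", "b", "c", "d"])

X_kept, y_kept = drop_rare_classes(X_rare, y_rare, min_count=3)
print(sorted(set(y_kept.tolist())), X_kept.shape)
```

The boolean mask also works for SciPy sparse feature matrices in CSR format; if y is a pandas Series, the returned labels are a plain NumPy array.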
