What are some possible reasons your multiclass classifier puts every sample into a single class?

data-mining machine-learning classification machine-learning-model multilabel-classification class-imbalance
2022-02-26 23:43:50

I have imbalanced classes: Group 1 N = 140, Group 2 N = 35, Group 3 N = 30.

I ran the code on these data and every sample was classified as Group 1. I figured that wasn't surprising, since Group 1 is the majority class. Then I ran the same code with SMOTE, so all groups now have 140 samples, but I still got the same result: everything classified as Group 1. Then I balanced the class weights (without SMOTE) and again got the same result. This confuses me. What am I doing wrong? Can someone help me understand, or suggest what I can do to improve the model? I tried 5 different classifiers (KNN, AdaBoost, SVC, RF, DT), and with 4 of the 5 I got the same result!
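(For context, `class_weight='balanced'` reweights each class inversely to its frequency, so with the group sizes above the minority groups really are upweighted. A quick sketch of what those weights work out to, using sklearn's helper:)

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class sizes from the question: Group 1 = 140, Group 2 = 35, Group 3 = 30
y = np.repeat([1, 2, 3], [140, 35, 30])

# balanced weight for a class = n_samples / (n_classes * n_samples_in_class)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([1, 2, 3]), y=y)
print(dict(zip([1, 2, 3], np.round(weights, 3))))
# → {1: 0.488, 2: 1.952, 3: 2.278}
```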

Here is the code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE

#Splitting data to training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

#Apply StandardScaler for feature scaling
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

#SMOTE
sm = SMOTE(random_state=42)
X_balanced, y_balanced = sm.fit_resample(X_train_std, y_train)  # fit_sample was renamed to fit_resample

#PCA
pca = PCA(random_state=42)

#Classifier regularization (SVC).

svc = SVC(random_state=42, class_weight= 'balanced')
pipe_svc = Pipeline(steps=[('pca', pca), ('svc', svc)])


# Parameters of pipelines can be set using ‘__’ separated parameter names:
parameters_svc = [{'pca__n_components': [2, 5, 20, 30, 40, 50, 60, 70, 80, 90, 100, 140, 150]}, 
                   {'svc__C': [1, 10, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 400, 500], 
                    'svc__kernel': ['rbf', 'linear', 'poly'], 
                    'svc__degree': [1, 2, 3, 4, 5, 6],
                    # a dict literal cannot hold two 'svc__gamma' keys (the second silently
                    # overwrites the first), so all gamma candidates go in one list:
                    'svc__gamma': ['auto', 'scale', 0.05, 0.06, 0.07, 0.08, 0.09,
                                   0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007,
                                   0.008, 0.009, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}]

clfsvc = GridSearchCV(pipe_svc, param_grid=parameters_svc, cv=10,
                      return_train_score=False)  # the iid parameter was removed from GridSearchCV
clfsvc.fit(X_balanced, y_balanced)


# Plot the PCA spectrum (SVC)
pca.fit(X_balanced)

fig1, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6)) #(I added 1 to fig)
ax0.plot(pca.explained_variance_ratio_, linewidth=2)
ax0.set_ylabel('PCA explained variance')

ax0.axvline(clfsvc.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))

# For each number of components, find the best classifier results
results_svc = pd.DataFrame(clfsvc.cv_results_) #(Added _svc to all variable def)
components_col_svc = 'param_pca__n_components'
best_clfs_svc = results_svc.groupby(components_col_svc).apply(
    lambda g: g.nlargest(1, 'mean_test_score'))

best_clfs_svc.plot(x=components_col_svc, y='mean_test_score', yerr='std_test_score',
               legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('n_components')

plt.tight_layout()
plt.show()

#Predicting the test set results (SVC)
y_pred1 = clfsvc.predict(X_test)

# Model Accuracy, how often is the classifier correct?
Accuracyscore_svc = accuracy_score(y_test, y_pred1)

print("Accuracy for SVC on CV data: ", Accuracyscore_svc)

# Making the confusion matrix to describe the performance of a classifier
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(y_test, y_pred1)


#accuracy
# Get accuracy score
accuracy1 = accuracy_score(y_test, y_pred1)
print('Accuracy1: %.2f%%' % (accuracy1 * 100.0))


#Checking shape after confusion matrix
print (X_test)
print (y_pred1)

print (cm1)
1 Answer

I finally found the reason: I was using the non-standardized X_test. Thanks.

Edit:

Previously I defined y_pred like this:

y_pred = clf.predict(xtest)

and then constructed the confusion matrix like this:

cm = confusion_matrix (y_test, y_pred)

However, I had forgotten that earlier I transformed xtest with the StandardScaler, like this:

x_test_std = sc.transform(xtest)

so the test set that should actually be used for prediction is x_test_std, not the raw xtest.

Once I realized this and used the correct x_test_std, everything worked and made much more sense.
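Putting the fix together, here is a minimal self-contained sketch (with synthetic stand-in data, since the original X and y aren't shown): the scaler is fit on the training split only, and the *scaled* test set is the one passed to predict.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy 3-class stand-in for the original data
X, y = make_classification(n_samples=205, n_classes=3, n_informative=5,
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=42)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)   # fit the scaler on training data only
X_test_std = sc.transform(X_test)         # apply the same scaling to the test set

clf = SVC(random_state=42, class_weight='balanced').fit(X_train_std, y_train)

# The fix: predict on X_test_std, not the raw X_test
y_pred = clf.predict(X_test_std)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("accuracy:", accuracy_score(y_test, y_pred))
```

Predicting on unscaled features after training on scaled ones puts the test points far outside the distribution the model saw, which is exactly the kind of mismatch that can push everything into one class.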