I have imbalanced classes: Group 1 N = 140, Group 2 N = 35, Group 3 N = 30.
I ran my code on this data and every sample was classified as Group1. I figured that was not surprising, since Group1 is the majority class. I then ran the same code with SMOTE, so that every group had 140 samples, and I still got the same result: everything is classified as Group1. I then balanced the class weights (without SMOTE) and again got the same result. This confuses me. What am I doing wrong? Can someone help me understand it, or suggest what I could do to improve the model? I tried 5 different classifiers (KNN, AdaBoost, SVC, RF, DT) and with 4 of the 5 I got exactly the same result!
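One thing that might help pin this down is a small diagnostic sketch like the one below. It only reuses the variable names from my code further down (X_train_std, y_train, X_test_std, y_test and the fitted clfsvc) together with Counter and standard imbalanced-learn/scikit-learn calls, so treat it as a sketch rather than part of the pipeline; per-class metrics show the "everything is Group1" failure that plain accuracy hides:

# Diagnostic sketch, not part of the pipeline below: check class counts and per-class metrics.
# Assumes X_train_std, y_train, X_test_std, y_test and a fitted clfsvc as defined later.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, balanced_accuracy_score

print("Class counts before SMOTE:", Counter(y_train))
X_chk, y_chk = SMOTE(random_state=42).fit_resample(X_train_std, y_train)
print("Class counts after SMOTE: ", Counter(y_chk))

# Accuracy alone looks fine when the majority class dominates; a per-class
# report and balanced accuracy make the collapse into one class visible.
y_chk_pred = clfsvc.predict(X_test_std)
print(classification_report(y_test, y_chk_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_chk_pred))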
Here is the code:
# Imports (assuming scikit-learn, imbalanced-learn, pandas and matplotlib)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from imblearn.over_sampling import SMOTE
import pandas as pd
import matplotlib.pyplot as plt

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
#Apply StandardScaler for feature scaling
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# SMOTE (oversampling applied to the scaled training data only)
sm = SMOTE(random_state=42)
X_balanced, y_balanced = sm.fit_resample(X_train_std, y_train)  # fit_resample is the current imbalanced-learn API
#PCA
pca = PCA(random_state=42)
# SVC classifier with balanced class weights
svc = SVC(random_state=42, class_weight='balanced')
pipe_svc = Pipeline(steps=[('pca', pca), ('svc', svc)])
# Parameters of pipelines can be set using ‘__’ separated parameter names:
parameters_svc = [{'pca__n_components': [2, 5, 20, 30, 40, 50, 60, 70, 80, 90, 100, 140, 150]},
                  {'svc__C': [1, 10, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 250, 300, 400, 500],
                   'svc__kernel': ['rbf', 'linear', 'poly'],
                   'svc__degree': [1, 2, 3, 4, 5, 6],
                   # all gamma options in one list: a Python dict cannot hold the key
                   # 'svc__gamma' twice (the second entry would silently win)
                   'svc__gamma': ['scale', 'auto', 0.05, 0.06, 0.07, 0.08, 0.09,
                                  0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009,
                                  0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}]
clfsvc = GridSearchCV(pipe_svc, param_grid=parameters_svc, cv=10,
                      return_train_score=False)
clfsvc.fit(X_balanced, y_balanced)
# Plot the PCA spectrum (SVC)
pca.fit(X_balanced)
fig1, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(6, 6)) #(I added 1 to fig)
ax0.plot(pca.explained_variance_ratio_, linewidth=2)
ax0.set_ylabel('PCA explained variance')
ax0.axvline(clfsvc.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))
# For each number of components, find the best classifier results
results_svc = pd.DataFrame(clfsvc.cv_results_) #(Added _svc to all variable def)
components_col_svc = 'param_pca__n_components'
best_clfs_svc = results_svc.groupby(components_col_svc).apply(
lambda g: g.nlargest(1, 'mean_test_score'))
best_clfs_svc.plot(x=components_col_svc, y='mean_test_score', yerr='std_test_score',
legend=False, ax=ax1)
ax1.set_ylabel('Classification accuracy (val)')
ax1.set_xlabel('n_components')
plt.tight_layout()
plt.show()
# Predicting the test set results (SVC); the test set must use the same scaling as the training set
y_pred1 = clfsvc.predict(X_test_std)
# Model accuracy: how often is the classifier correct?
Accuracyscore_svc = accuracy_score(y_test, y_pred1)
print("Accuracy for SVC on the held-out test data: ", Accuracyscore_svc)
# Confusion matrix to describe the per-class performance of the classifier
cm1 = confusion_matrix(y_test, y_pred1)
# Same accuracy, formatted as a percentage
print('Accuracy1: %.2f%%' % (Accuracyscore_svc * 100.0))
# Sanity check: inspect the test features, the predictions and the confusion matrix
print(X_test)
print(y_pred1)
print(cm1)
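One more detail: with only 35 and 30 samples in Group2 and Group3, a test_size of 0.1 leaves very few (possibly zero) minority-class examples in the test set. A stratified split is one thing that could be tried; the snippet below is only a sketch of that split step, reusing X and y from above, and the larger test_size is an assumption, not a requirement:

# Sketch of a stratified split that keeps the Group1/Group2/Group3 proportions
# in both the training and the test set (assumes the same X and y as above).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)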