Considering a balanced training set, I notice that the classification results depend mainly on the class imbalance of the test set.
As shown in this article, unless the classes are perfectly separable, the performance (precision and recall) of a model on a given class will always degrade with class imbalance. That is: the more imbalanced the test set, the worse the model's ability to classify the minority class.
This means that for any given model, classification performance will always depend mainly on the class balance of the data it is tested on.
How does the imbalance of the test set determine the predictive power of my model once it has been trained? Does a classifier's performance always depend on the class balance of the target population? What is the mathematical reasoning behind this?
Most classification algorithms define a decision boundary between the classes. Class imbalance causes the learned decision boundary to favour the majority class. This bias arises because most loss functions try to minimize the average error, which is most easily achieved by maximizing performance on the majority class.
When the test set is then classified, the minority class keeps performing worse, because the decision boundary was fitted to maximize majority-class performance.
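To see why minimizing the average error favours the majority class, here is a minimal sketch (numbers invented purely for illustration): on a 90/10 dataset, a "classifier" that ignores the minority class entirely still reaches about 90% accuracy, so a learner that only cares about average error has little incentive to model the minority class at all.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustration only: a 90/10 label distribution and a trivial predictor that
# always outputs the majority class. Average error is low, minority recall is zero.
rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # about 0.90
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0 for the minority class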
There is a confusion here between the "true" performance of the classifier, which is indeed fixed once the classifier has been trained, and the performance observed on a particular test set.
The "true" performance can only be estimated, and it should be estimated on a random sample that follows the "true" distribution of the data. Supervised learning always assumes such a "true population"; both the training set and the test set are supposed to be samples drawn from it.
If a test set with a different distribution is used, there is no guarantee that the observed performance matches the true performance. This can be relevant in some specific experiments, but it is not a correct evaluation of the classifier itself.
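As a concrete illustration (a minimal sketch, assuming the available sample is itself representative of the population), a stratified split is the usual way to make the held-out test set follow the same class distribution as the sample it was drawn from, which is what estimating the "true" performance relies on:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Draw a sample with an 80/20 class distribution, then split it with stratify=y
# so that both the train and test parts keep the same proportions as the full sample.
X, y = make_classification(n_samples=10000, weights=(0.8, 0.2), flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
print(Counter(y), Counter(y_tr), Counter(y_te))  # all roughly 80/20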
Intuitively, this can be compared to students taking an exam after having practised on a set of exercises.
Edit: a study of the specific case of a balanced training set with an imbalanced test set (asked by the OP in the comments).
Re-edited after fixing the error found by the OP.
This is an interesting case to study, thanks for the question :)
Using your code as a basis, I tested the following:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
#from scikitplot.metrics import plot_roc
#from scikitplot.metrics import plot_precision_recall
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
import numpy as np
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
import statistics as s
from collections import defaultdict
import random
from sklearn import tree
N_RUNS = 20
OPT_KNN = True
def fit_and_apply(X_train, y_train, X_test, y_test):
    # train the selected classifier, then evaluate it on both the training and the test set
    if OPT_KNN:
        clf = KNeighborsClassifier(n_neighbors=5)
        clf = clf.fit(X_train, y_train)
    else:
        clf = tree.DecisionTreeClassifier().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_tr = clf.predict(X_train)
    # print('train acc. : ', accuracy_score(y_train, y_pred_tr))
    # print('test acc. : ', accuracy_score(y_test, y_pred))
    # print('confusion matrix: \n', confusion_matrix(y_test, y_pred))
    # print(classification_report(y_test, y_pred))
    conf_mat_train = confusion_matrix(y_train, y_pred_tr)
    conf_mat_test = confusion_matrix(y_test, y_pred)
    report_train = classification_report(y_train, y_pred_tr, output_dict=True)
    report_test = classification_report(y_test, y_pred, output_dict=True)
    return conf_mat_train, conf_mat_test, report_train, report_test
def print_results(proportions, perf, summary=True):
    print("")
    for k, v in proportions.items():
        print("Prop. ", k, "=", v)
    for t, d0 in perf.items():
        for c, d1 in d0.items():
            if summary:
                print(t, "class", c, "P,R,F:\t", end='')
            for m, values in d1.items():
                if m != "support":
                    if summary:
                        print("%.3f" % (s.mean(values)), end="\t")
                    else:
                        print(t, "class", c, m, ":", " ".join(["%.3f" % (p) for p in values]), ". MEAN:", s.mean(values))
            if summary:
                print("")
def accu(conf_mat):
    correct = conf_mat[0][0] + conf_mat[1][1]
    incorrect = conf_mat[0][1] + conf_mat[1][0]
    return correct / (correct + incorrect)
# perf[train|test][class][measure] = list of values
perf = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
proportions = {}
avg_conf_mat = defaultdict(lambda: [[0,0],[0,0]])
print("*** BALANCED -", end='')
for i in range(N_RUNS):
    print(i, end=' ', flush=True)
    #creating balanced dataset
    X, y = make_classification(n_samples=10000, n_features=5, n_informative=5, n_redundant=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0, class_sep=0.5, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    proportions["data"] = Counter(y)
    #splitting data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None, stratify=y)
    proportions["train"] = Counter(y_train)
    proportions["test"] = Counter(y_test)
    conf_mat_train, conf_mat_test, report_train, report_test = fit_and_apply(X_train, y_train, X_test, y_test)
    for c in range(2):
        for m, v in report_train[str(c)].items():
            perf["train"][c][m].append(report_train[str(c)][m])
            perf["test"][c][m].append(report_test[str(c)][m])
    for i in range(2):
        for j in range(2):
            avg_conf_mat["train"][i][j] += conf_mat_train[i][j] / N_RUNS
            avg_conf_mat["test"][i][j] += conf_mat_test[i][j] / N_RUNS
print_results(proportions, perf)
print("avg confusion matrix train: ", avg_conf_mat["train"], " avg accuracy=", accu(avg_conf_mat["train"]))
print("avg confusion matrix test: ", avg_conf_mat["test"], " avg accuracy=", accu(avg_conf_mat["test"]))
print("")
perf = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
proportions = {}
avg_conf_mat = defaultdict(lambda: [[0,0],[0,0]])
print("*** IMBALANCED A -", end='')
for i in range(N_RUNS):
    print(i, end=' ', flush=True)
    #making imbalanced data set (80%-20%)
    imbalance = (0.8, 0.2)
    X, y = make_classification(n_samples=10000, weights=imbalance, n_features=5, n_informative=5, n_redundant=0, n_classes=2, n_clusters_per_class=2, flip_y=0, class_sep=0.5, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    # print(Counter(y))
    proportions["data"] = Counter(y)
    #splitting data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None, stratify=y)
    #undersampling majority class to obtain balanced training set
    res = RandomUnderSampler()
    X_train_res, y_train_res = res.fit_resample(X_train, y_train)
    # print("y_train_res:",Counter(y_train_res))
    # print("y_test:",Counter(y_test))
    proportions["train"] = Counter(y_train_res)
    proportions["test"] = Counter(y_test)
    # train on the resampled (balanced) training set, evaluate on the imbalanced test set
    conf_mat_train, conf_mat_test, report_train, report_test = fit_and_apply(X_train_res, y_train_res, X_test, y_test)
    for c in range(2):
        for m, v in report_train[str(c)].items():
            perf["train"][c][m].append(report_train[str(c)][m])
            perf["test"][c][m].append(report_test[str(c)][m])
    for i in range(2):
        for j in range(2):
            avg_conf_mat["train"][i][j] += conf_mat_train[i][j] / N_RUNS
            avg_conf_mat["test"][i][j] += conf_mat_test[i][j] / N_RUNS
print_results(proportions, perf)
print("avg confusion matrix train: ", avg_conf_mat["train"], " avg accuracy=", accu(avg_conf_mat["train"]))
print("avg confusion matrix test: ", avg_conf_mat["test"], " avg accuracy=", accu(avg_conf_mat["test"]))
print("")
perf = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
proportions = {}
avg_conf_mat = defaultdict(lambda: [[0,0],[0,0]])
print("*** IMBALANCED B -", end='')
for i in range(N_RUNS):
    print(i, end=' ', flush=True)
    #creating balanced dataset
    X, y = make_classification(n_samples=10000, n_features=5, n_informative=5, n_redundant=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0, class_sep=0.5, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    # print(Counter(y))
    proportions["data"] = Counter(y)
    #splitting data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None, stratify=y)
    #undersampling class 1 to obtain imbalanced test set
    X_test_res = []
    y_test_res = []
    for j, c in enumerate(y_test):
        # pick value in [0,1]; keep every class-0 instance but only ~10% of class 1
        p = random.uniform(0, 1)
        if c == 0 or p < 0.1:
            X_test_res.append(X_test[j])
            y_test_res.append(y_test[j])
    # the training set here is the original balanced one (no resampling in case B)
    proportions["train"] = Counter(y_train)
    proportions["test"] = Counter(y_test_res)
    conf_mat_train, conf_mat_test, report_train, report_test = fit_and_apply(X_train, y_train, X_test_res, y_test_res)
    for c in range(2):
        for m, v in report_train[str(c)].items():
            perf["train"][c][m].append(report_train[str(c)][m])
            perf["test"][c][m].append(report_test[str(c)][m])
    for i in range(2):
        for j in range(2):
            avg_conf_mat["train"][i][j] += conf_mat_train[i][j] / N_RUNS
            avg_conf_mat["test"][i][j] += conf_mat_test[i][j] / N_RUNS
print_results(proportions, perf)
print("avg confusion matrix train: ", avg_conf_mat["train"], " avg accuracy=", accu(avg_conf_mat["train"]))
print("avg confusion matrix test: ", avg_conf_mat["test"], " avg accuracy=", accu(avg_conf_mat["test"]))
print("")
The two main modifications are:

- The whole procedure is repeated N_RUNS times, so as to obtain a reliable estimate of the performance in each case. Apart from also including the generation of the data, this is the same principle as cross-validation. I also set random_state to None everywhere to avoid any bias.
- There seems to be something slightly asymmetric between the two classes produced by make_classification (I don't know the details). This is visible in the fact that the two classes do not perform identically on the training set, which should not happen if the training data is balanced.

I think your version (called A in my code) is an interesting illustration of the point I made above: the performance can only be estimated correctly if both the training set and the test set follow the "true" distribution of the data. By the way, there is an ambiguity when we talk about "the distribution of the data": it is often understood as just the distribution of the classes, but in general it means the distribution of the full instances (features + class), since otherwise the statistical relationship between the features and the class may be lost. In the case of version A, the training set does not follow the "true distribution" of the data, whereas the test set does.
[Edit] Now, if we compare the performance obtained in option B with the imbalanced test set against the performance obtained with a balanced test set, the F1-score is still different. Let's look in detail at what happens.
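To make the mechanism concrete, here is a small sketch with invented per-class recalls (the recall values are hypothetical, only the test-set proportions roughly match option B). Recall for each class is a property of the trained model, but the precision of class 1 also depends on how many class-0 instances are available to be misclassified as class 1, so it drops when class 1 becomes the minority in the test set:

# Hypothetical numbers: both classes have a fixed recall of 0.8; only the class
# proportions of the test set change between the two calls.
def precision_class1(n0, n1, recall0=0.8, recall1=0.8):
    tp1 = recall1 * n1        # class-1 instances correctly labelled as 1
    fp1 = (1 - recall0) * n0  # class-0 instances wrongly labelled as 1
    return tp1 / (tp1 + fp1)

print(precision_class1(n0=1500, n1=1500))  # balanced test set          -> 0.80
print(precision_class1(n0=1500, n1=150))   # roughly option B (90/10)   -> about 0.29

With these hypothetical values, the F1-score of class 1 would be 0.80 on the balanced test set but only about 0.42 on the 90/10 test set, even though the recall of 0.8 never changed.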
This means that, compared to the balanced case, the difference in precision (and F1-score) is an artifact of the new class distribution: even though the model has exactly the same chance of correctly identifying an instance of either class, its F1-score is lower for class 1 and higher for class 0. By the way, this is a good example of the difficulty of choosing a global performance metric: accuracy (or equivalently micro F1-score) is the same as in the balanced case, but macro F1-score differs. In this case I would consider that the performance is actually the same, but strictly speaking it can indeed be seen as different.
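On that metric remark: for single-label classification, micro-averaged F1 equals accuracy by construction, whereas macro F1 gives both classes equal weight and therefore reacts when one class's precision degrades. A quick illustrative check (toy labels, not the output of the experiment above):

from sklearn.metrics import accuracy_score, f1_score

# Toy example: 6 instances of class 0, 4 of class 1, one error in each direction.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))             # 0.8
print(f1_score(y_true, y_pred, average='micro'))  # 0.8, identical to accuracy
print(f1_score(y_true, y_pred, average='macro'))  # about 0.79, the mean of the per-class F1 scores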