测试集上的概率分布

数据挖掘 数据 预言 可能性
2022-02-24 01:05:42

如何解释测试集目标的预测概率分布?例如,如果我们想解释下面的图,我们可以说它是过度拟合的吗?

这里 x 轴显示目标为 1 的概率,y 轴是具有该概率的实例数。

在此处输入图像描述

训练集目标的概率如下:

在此处输入图像描述

1个回答

预计在训练集中有“更有信心”的预测概率,即您的模型将为用于训练的样本分配更接近 0 和 1 的概率,并且在中间更小。

如上所述,您几乎无法仅通过概率的直方图来说明您的模型性能,但您可以通过将它们按正负类拆分并添加诸如ks之类的分离度量来添加更多信息

代码示例:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import ks_2samp

import matplotlib.pyplot as plt

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = .2, random_state = 42)

clf = RandomForestClassifier(random_state = 42).fit(X_train,y_train)


preds_test = clf.predict_proba(X_test)[:,1]
preds_train = clf.predict_proba(X_train)[:,1]

ks_train = round(ks_2samp(preds_train[y_train == 1], preds_train[y_train == 0])[0],4)
ks_test = round(ks_2samp(preds_test[y_test == 1], preds_test[y_test == 0])[0],4)


fig, ax = plt.subplots(1,2,figsize = (10,5))

ax[0].hist(preds_train[y_train == 1], color = "darkred",bins = "scott", alpha = .5, edgecolor = "red")
ax[0].hist(preds_train[y_train == 0], color = "darkgreen",bins = "scott", alpha = .5, edgecolor = "green")
ax[0].set_title(f"Class separation in train set\nks: {ks_train}")

ax[1].hist(preds_test[y_test == 1], color = "darkred",bins = "scott", alpha = .5, edgecolor = "red")
ax[1].hist(preds_test[y_test == 0], color = "darkgreen",bins = "scott", alpha = .5, edgecolor = "green")
ax[1].set_title(f"Class separation in test set\nks: {ks_test}");

在此处输入图像描述