Choosing a threshold to get a classifier with 90% precision - ML binary classification problem

data-mining machine-learning classification scikit-learn
2022-03-02 00:14:23

I chose a threshold with the following code to get a classifier with 90% precision:

import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

z_scores = cross_val_predict(sgd_clf, X_train, y_train, method='decision_function')

precisions, recalls, thresholds = precision_recall_curve(y_train_pred, z_scores)

threshold_90_precision = thresholds[np.argmax(precisions >= 0.9)]

y_train_pred_90percent_precision = (z_scores >= threshold_90_precision)
print(precision_score(y_train, y_train_pred_90percent_precision))

I expected precision_score to be 90%, but it returns 95%. Is this expected? Is there anything incorrect in my code? If it is expected, could you explain why?

1 Answer
threshold_90_precision = thresholds[np.argmax(precisions >= 0.9)]

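The snippet above does not do what you expect: np.argmax(precisions >= 0.9) returns the first index at which precision reaches 0.9, which is not necessarily the index whose precision is closest to 0.9.

Try these changes: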

precisions[precisions < 0.9] = 1
threshold_90_precision = thresholds[np.argmin(precisions)]
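
To make the difference between the two selections concrete, here is a minimal sketch on made-up numbers. The precisions and thresholds arrays are invented for illustration and only mimic the layout that precision_recall_curve returns (one more precision than thresholds, ending in 1.0). Because precision is not monotonic in the threshold, the two approaches can land on different indices.

```python
import numpy as np

# Invented values, ordered by increasing threshold; the final 1.0 has no threshold.
precisions = np.array([0.50, 0.92, 0.85, 0.91, 0.95, 1.00])
thresholds = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
target = 0.9

# Question's approach: first index where precision reaches the target.
first_idx = int(np.argmax(precisions >= target))

# Answer's approach: mask values below the target, then take the
# smallest remaining precision (work on a copy to keep the original).
masked = precisions.copy()
masked[masked < target] = 1
lowest_idx = int(np.argmin(masked))

print(first_idx, float(thresholds[first_idx]))    # 1 -1.0
print(lowest_idx, float(thresholds[lowest_idx]))  # 3 1.0
```

Both indices satisfy the 0.9 target here; they differ in which qualifying threshold is picked, and so in how much recall is given up.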

Also, I am not sure you are computing the precision correctly, because z_scores holds decision-function values, not class labels.
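
On that point, comparing decision-function values against a threshold does yield class predictions, so precision_score can be applied to the result. One thing that may be worth checking instead is the first argument of precision_recall_curve, which is documented as the true labels. The sketch below uses hand-made toy data (not the asker's) and a hypothetical helper, pick_threshold, to show how building the curve against predicted labels can shift the chosen threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score

# Hand-made toy data, purely illustrative.
scores = np.array([-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0])
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1])
y_pred = (scores >= 0).astype(int)  # labels predicted at the default threshold


def pick_threshold(labels, scores, target=0.9):
    """Hypothetical helper: first threshold whose precision against `labels` reaches target."""
    precisions, _, thresholds = precision_recall_curve(labels, scores)
    return float(thresholds[np.argmax(precisions >= target)])


def precision_at(threshold):
    """Precision against the true labels when predicting positive for scores >= threshold."""
    return float(precision_score(y_true, (scores >= threshold).astype(int)))


thr_from_pred = pick_threshold(y_pred, scores)  # curve built against predicted labels
thr_from_true = pick_threshold(y_true, scores)  # curve built against true labels

print(thr_from_pred, precision_at(thr_from_pred))  # 0.5 0.75
print(thr_from_true, precision_at(thr_from_true))  # 2.0 1.0
```

In this toy case the curve built against predicted labels simply recovers the default decision boundary, so the reported precision is that of the default classifier rather than the 90% target. Whether this explains the 95% in the question depends on the asker's data.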

Here is a working example that uses method='predict_proba' with a 40% precision target; change 0.4 to 0.9 for 90%.

import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score
from sklearn.model_selection import cross_val_predict

# Optional here: cross_val_predict fits its own clones of the model.
model.fit(x_train, y_train)

# Out-of-fold class probabilities; column 1 is the positive class (model.classes_[1]).
z_scores = cross_val_predict(model, x_train, y_train, cv=3, method='predict_proba')[:, 1]

# The first argument must be the true labels, not predicted labels.
precisions, recalls, thresholds = precision_recall_curve(y_train, z_scores)

# Mask everything below the 40% target, then take the lowest remaining precision.
precisions[precisions < 0.4] = 1
threshold_40_precision = thresholds[np.argmin(precisions)]

y_train_pred_40percent_precision = z_scores >= threshold_40_precision
print(precision_score(y_train, y_train_pred_40percent_precision))
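
For readers who want something runnable end to end, the following is a minimal sketch under stated assumptions: the data is synthetic (make_classification), the classifier is an SGDClassifier scored with decision_function as in the question, and threshold_for_precision is a hypothetical helper, not a scikit-learn function. It returns None when no threshold reaches the target instead of silently falling back to index 0.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve, precision_score
from sklearn.model_selection import cross_val_predict


def threshold_for_precision(y_true, scores, target):
    """Hypothetical helper: lowest threshold whose precision reaches target, or None."""
    precisions, _, thresholds = precision_recall_curve(y_true, scores)
    reached = precisions[:-1] >= target  # drop the final 1.0, which has no threshold
    if not reached.any():
        return None
    return thresholds[np.argmax(reached)]


# Synthetic stand-in for X_train / y_train (assumption, not the asker's data).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = SGDClassifier(random_state=0)

# Out-of-fold decision scores, using the same cv as any other cross-validated output.
scores = cross_val_predict(clf, X, y, cv=3, method="decision_function")

threshold = threshold_for_precision(y, scores, target=0.9)
if threshold is not None:
    y_pred = (scores >= threshold).astype(int)
    achieved = precision_score(y, y_pred)
    print(f"threshold={threshold:.3f}, precision={achieved:.3f}")
```

The achieved precision will usually sit slightly above the target rather than exactly on it, since precision changes in discrete steps as the threshold moves past individual samples.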