Data mining - Scikit-learn's implementation of AdaBoost

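I am trying to implement the AdaBoost algorithm in pure Python (using NumPy where necessary).

I loop over all weak classifiers (decision stumps in this case), then over all features, then over all possible values of each feature, to see which one splits the dataset best. Here is my code: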

for _ in range(self.n_classifiers):
    classifier = BaseClassifier()
    min_error = np.inf

    # greedy search to find best threshold and feature
    for feature_i in range(n_features):
        thresholds = np.unique(X[:, feature_i])

        for threshold in thresholds:
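            # here we find the best stump; `predictions` holds this stump's
            # outputs for (feature_i, threshold), computed in code not shown
            error = np.sum(w[y != predictions])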
            if error < min_error:
                min_error = error

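The first two loops are not a problem, since we usually have at most a few dozen classifiers and features. The third loop, however, makes the code very inefficient: a continuous feature can have as many distinct values as there are samples, and each candidate threshold requires a full pass over the data to compute the weighted error.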
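
One common way to avoid the per-threshold pass is to sort each feature once and read every candidate's weighted error off cumulative sums. The following is only a sketch under assumptions that are not in my original code: labels are in {-1, +1}, a stump predicts `polarity` where `x > threshold` and `-polarity` otherwise, and the names `best_stump_for_feature` and `best_stump` are hypothetical helpers.

```python
import numpy as np


def best_stump_for_feature(x, y, w):
    """Return (error, threshold, polarity) of the best stump on one feature.

    Assumed stump rule: predict `polarity` where x > threshold, else `-polarity`.
    """
    order = np.argsort(x, kind="stable")
    xs, ys, ws = x[order], y[order], w[order]

    # Running weight of positive / negative samples with x <= current value.
    cum_pos = np.cumsum(ws * (ys == 1))
    cum_neg = np.cumsum(ws * (ys == -1))

    # Candidate thresholds: the last position of each distinct value.
    last = np.flatnonzero(np.r_[xs[1:] != xs[:-1], True])

    # polarity +1 misclassifies positives at or below the threshold
    # and negatives above it; polarity -1 misclassifies everything else.
    err_plus = cum_pos[last] + (cum_neg[-1] - cum_neg[last])
    err_minus = ws.sum() - err_plus

    i_plus, i_minus = np.argmin(err_plus), np.argmin(err_minus)
    if err_plus[i_plus] <= err_minus[i_minus]:
        return err_plus[i_plus], xs[last[i_plus]], 1
    return err_minus[i_minus], xs[last[i_minus]], -1


def best_stump(X, y, w):
    """Return (error, feature index, threshold, polarity) over all features."""
    best = (np.inf, None, None, None)
    for feature_i in range(X.shape[1]):
        err, thr, pol = best_stump_for_feature(X[:, feature_i], y, w)
        if err < best[0]:
            best = (err, feature_i, thr, pol)
    return best
```

With this layout the cost per feature is dominated by the sort, so it grows roughly as n log n instead of n times the number of distinct values. The inner Python loop over thresholds disappears entirely; only the loop over features remains.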

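One way to address this is to give up on finding the best weak classifier and simply pick one that performs slightly better than a random classifier (as suggested in Boosting: Foundations and Algorithms by Robert E. Schapire and Yoav Freund, p. 6):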

for _ in range(self.n_classifiers):
    classifier = BaseClassifier()
    min_error = np.inf

    # greedy search to find best threshold and feature
    for feature_i in range(n_features):
        thresholds = np.unique(X[:, feature_i])

        for threshold in thresholds:
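            # accept the first stump that beats random guessing by a margin gamma
            # (`predictions` is computed in code not shown)
            error = np.sum(w[y != predictions])
            if error < 0.5 - gamma:
                min_error = error
                break  # exits only the threshold loop; the feature loop continues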

But in this case my model's accuracy is lower than Scikit-learn's, and the running time is still 3 times as long.
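
One plausible reason for the lower accuracy, assuming the standard discrete AdaBoost vote weight alpha = 0.5 * ln((1 - error) / error) is used: a stump that is only barely better than random gets a very small weight, so each round contributes little and many more rounds are needed. The helper name `stump_weight` below is hypothetical and only illustrates the formula.

```python
import numpy as np


def stump_weight(error):
    """Vote weight of a weak classifier in discrete AdaBoost."""
    return 0.5 * np.log((1.0 - error) / error)


# Compare a barely-better-than-random stump with stronger ones.
for error in (0.45, 0.30, 0.10):
    print(f"error={error:.2f} -> alpha={stump_weight(error):.3f}")
```

The weight is zero at an error of 0.5 and grows as the error shrinks, which is why accepting the first stump under `0.5 - gamma` tends to trade accuracy per round for search time.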

I tried to look at how Scikit-learn implements AdaBoost in its code, but it was not clear to me. I would really appreciate any comments.
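
For comparison, here is a small sketch of the Scikit-learn side. By default `AdaBoostClassifier` boosts depth-1 decision trees, and `DecisionTreeClassifier.fit` accepts a `sample_weight` argument, so a single weighted stump can be fitted and inspected directly. The synthetic data and the uniform weights are assumptions made for illustration only. As far as I understand, the tree builder is compiled code that scans split points over sorted feature values, which would explain most of the speed gap, but I have not verified the details.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): the label depends on two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# One weighted decision stump, the kind of weak learner AdaBoost uses by default.
w = np.full(len(y), 1.0 / len(y))
stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
print("stump splits on feature", stump.tree_.feature[0],
      "at threshold", stump.tree_.threshold[0])

# The full ensemble with default stumps, as a baseline to compare against.
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("training accuracy:", model.score(X, y))
```

Timing my own stump search against `DecisionTreeClassifier(max_depth=1).fit` on the same weights is probably the fairest way to see where the 3x difference comes from.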