Random forest implementation with per-column selection probabilities or a guaranteed set of selected columns

data-mining, algorithms, random-forest

2021-10-03 02:25:55

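Is there a random forest implementation that lets you specify a set of columns that is always selected for every tree in the forest? Or one that lets you specify the probability of selecting each column?

Both cases can be simulated by selecting the variables (partly or fully at random), building a tree that uses all of those variables on a new temporary dataset, repeating this until the desired number of trees is reached, and then merging the trees into a forest.

This approach has drawbacks, though. Large amounts of data have to be moved around in a scripting language before they reach the forest's low-level implementation that builds the trees, and there may be no merge procedure, which makes a loose collection of trees harder to work with.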

2 Answers

The ranger package in R can handle the first request. Its always.split.variables parameter defines which columns should always be included.
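For readers working in Python, the first request (a set of columns that every tree is guaranteed to see) can be approximated by forcing those columns into each tree's column mask. This is only a sketch: `sample_column_mask` and its `always` argument are hypothetical names, and it fixes columns per tree, whereas ranger's always.split.variables acts on the candidate variables tried at each split.

```python
import numpy as np


def sample_column_mask(p_col, always=(), rng=None):
    """Draw a boolean column mask for one tree.

    p_col  : per-column inclusion probabilities (hypothetical argument name)
    always : indices of columns that must appear in every tree (hypothetical)
    """
    rng = np.random.default_rng() if rng is None else rng
    p_col = np.asarray(p_col, dtype=float)
    forced = np.zeros(p_col.shape[0], dtype=bool)
    forced[np.asarray(always, dtype=int)] = True
    if not forced.any() and not (p_col > 0).any():
        raise ValueError("no column can ever be selected")
    while True:
        # column k is drawn with probability p_col[k]; forced columns are OR-ed in
        mask = (rng.random(p_col.shape[0]) < p_col) | forced
        if mask.any():  # a tree needs at least one column
            return mask


mask = sample_column_mask([0.5, 0.0, 0.2, 0.3], always=[1],
                          rng=np.random.default_rng(0))
print(mask)  # column 1 is always True; the others depend on the seed
```

The returned mask can be used directly as `X[:, mask]` when fitting a tree and must be stored so the same columns are used at prediction time.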

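As far as I know, neither xgboost nor sklearn offers what you want. I also checked R and did not find it there.

But a random forest is an easy model to implement, so you can build one yourself.

Here is one in Python, using sklearn: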

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import mode


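class MyRandomForest:
    """Bagged decision trees where each column has its own inclusion probability."""

    def __init__(self, Pcol, Pobs, n_estimators=10):
        self.n_estimators = n_estimators
        self.Pcol = Pcol  # vector: probability of including each column in a tree
        self.Pobs = Pobs  # scalar: fraction of observations sampled per tree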

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        assert len(self.Pcol) == X.shape[1]
        self.cols = []  # boolean column mask used by each tree
        self.ms = []    # fitted trees
        while len(self.ms) < self.n_estimators:
            # include column k with probability Pcol[k]
            j = np.random.rand(X.shape[1]) <= self.Pcol
            if not np.any(j):  # at least one column must be chosen!
                continue
            x = X[:, j]
            # sample a fraction Pobs of the rows, without replacement
            i = np.random.choice(len(X), int(len(X) * self.Pobs), replace=False)
            self.cols.append(j)
            self.ms.append(DecisionTreeClassifier().fit(x[i], y[i]))
        return self

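    def predict(self, X):
        # one row of predictions per tree, each using only that tree's columns
        yp = np.array([m.predict(X[:, cs]) for cs, m in zip(self.cols, self.ms)])
        # majority vote per sample; ravel() copes with both the (1, n) and (n,)
        # result shapes that different SciPy versions return
        return np.ravel(mode(yp, axis=0)[0])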

if __name__ == '__main__':  # TEST
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score
    iris = load_iris()
    X = iris.data
    y = iris.target
    for tr, ts in StratifiedKFold(n_splits=3).split(X, y):
        m = MyRandomForest([0.5, 1, 0.2, 0.3], 1).fit(X[tr], y[tr])
        print(accuracy_score(y[ts], m.predict(X[ts])))

If you are on Linux, you can easily make this run in parallel with multiprocessing.
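A parallel version could look like the sketch below. It rests on several assumptions: it requests the "fork" start method explicitly (available on Linux and other POSIX systems, which is probably why the answer singles out Linux), and the names `_fit_one` and `fit_trees_parallel` are hypothetical helpers, not part of the class in the answer.

```python
import multiprocessing as mp

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def _fit_one(args):
    """Train a single tree on a random subset of columns and rows."""
    X, y, p_col, p_obs, seed = args
    rng = np.random.default_rng(seed)  # independent, reproducible stream per tree
    while True:
        cols = rng.random(X.shape[1]) < p_col
        if cols.any():  # at least one column must be chosen
            break
    rows = rng.choice(len(X), int(len(X) * p_obs), replace=False)
    tree = DecisionTreeClassifier(random_state=seed).fit(X[rows][:, cols], y[rows])
    return cols, tree


def fit_trees_parallel(X, y, p_col, p_obs=1.0, n_estimators=10, n_jobs=2, seed=0):
    """Fit n_estimators trees in n_jobs worker processes."""
    p_col = np.asarray(p_col, dtype=float)
    if not (p_col > 0).any():
        raise ValueError("at least one column needs a non-zero probability")
    tasks = [(X, y, p_col, p_obs, seed + k) for k in range(n_estimators)]
    ctx = mp.get_context("fork")  # assumption: POSIX system with fork available
    with ctx.Pool(n_jobs) as pool:
        results = pool.map(_fit_one, tasks)
    cols, trees = zip(*results)
    return list(cols), list(trees)
```

Each task carries its own copy of `X` and `y` to the worker, so for large datasets the pickling overhead may outweigh the speed-up; sharing the data through a module-level variable inherited on fork is one way around that.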

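All I do is:

1. Train a series of individual trees using your probabilities.
2. Save each model together with the columns that were sampled for its training, because only those columns may be used at prediction time (otherwise we would get an error).
3. Use the mode to get the prediction with the most votes.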
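The voting step can also be written with plain NumPy, which avoids depending on the result shape of scipy.stats.mode (it differs across SciPy versions). A minimal sketch, assuming the class labels are sortable; `majority_vote` is a hypothetical helper name:

```python
import numpy as np


def majority_vote(votes, classes):
    """Return the most frequent label per sample.

    votes   : array of shape (n_trees, n_samples) with predicted labels
    classes : sorted array of all possible labels (e.g. np.unique(y))
    """
    votes = np.asarray(votes)
    classes = np.asarray(classes)
    idx = np.searchsorted(classes, votes)  # label -> position in classes
    # counts[c, s] = number of trees that voted class c for sample s
    counts = np.stack([(idx == c).sum(axis=0) for c in range(len(classes))])
    return classes[counts.argmax(axis=0)]  # ties go to the smallest label


votes = [[0, 1, 2],
         [0, 2, 2],
         [1, 1, 2]]
print(majority_vote(votes, classes=[0, 1, 2]))  # [0 1 2]
```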

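Note: Pobs is the fraction of observations to use. You may also want to change np.random.choice(..., ..., False) to np.random.choice(..., ..., True), or make it configurable, so that rows are resampled with replacement (bootstrapping). Usually, a random forest is trained with Pobs=1 and resample=True.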
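One way to make the row sampling configurable is a small helper with a flag. This is a sketch only: `sample_rows` and its `bootstrap` flag are hypothetical and not part of the class above.

```python
import numpy as np


def sample_rows(n_rows, p_obs=1.0, bootstrap=True, rng=None):
    """Pick row indices for one tree.

    bootstrap=True  draws with replacement (the usual random forest setting);
    bootstrap=False draws a subsample without replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    size = int(n_rows * p_obs)
    if size < 1:
        raise ValueError("p_obs is too small: no rows would be sampled")
    return rng.choice(n_rows, size=size, replace=bootstrap)


rng = np.random.default_rng(0)
boot = sample_rows(10, p_obs=1.0, bootstrap=True, rng=rng)   # rows may repeat
sub = sample_rows(10, p_obs=0.5, bootstrap=False, rng=rng)   # 5 distinct rows
print(boot, sub)
```

Inside fit, a call such as `sample_rows(len(X), self.Pobs, bootstrap)` could then stand in for the np.random.choice line.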
