The default RF classification aggregates the trees by majority vote. To change this, you must either modify the class vote distribution of the trees (see solution A) or change the aggregation rule itself (see solution B). Solution A can be achieved by stratification/downsampling or by class weights. I mention it mainly because it is possible, as it will likely lower overall predictive performance (AUC of the ROC on test-set predictions).

Solution B is to modify the aggregation rule. Any sample predicted by the forest receives some number of votes (possibly 0) for each class. The plurality of votes can be understood as a pseudo-estimate of the predicted probabilities, where the predicted probability of the k'th class is the votes for class k divided by all votes. The vote threshold can be modified with the cutoff parameter, either during training or at prediction time: the predicted class is essentially the one whose vote fraction, divided by its class cutoff, is largest. With cutoff = c(.5,.5) nothing changes; with cutoff = c(.1,.9) class 1 wins with far fewer votes, so it is predicted much more often. There is a gotcha in randomForest such that OOB-CV predictions only reflect the cutoff if it was set during training, whereas for predictions on new data or a test set the cutoff can be modified after training.
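To make the aggregation rule concrete, here is a toy sketch of the winner-selection step (the helper vote.winner and the example numbers are mine, not part of randomForest; the documented rule is that the winning class has the maximum ratio of vote proportion to cutoff):

#toy illustration of the cutoff rule: the winner maximizes votes/cutoff
vote.winner = function(votes, cutoff=rep(1/length(votes),length(votes))) {
  names(votes)[which.max(votes/cutoff)]
}
votes = c("-1"=.30,"200"=.70)          #pseudo-probabilities from the forest
vote.winner(votes)                     #default cutoff c(.5,.5) -> "200"
vote.winner(votes,cutoff=c(.17,.83))   #.30/.17 > .70/.83       -> "-1"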
library(randomForest)
#simulate a non-linear binary classification problem
make.data = function(obs=1000,vars=6) {
  X = data.frame(replicate(vars,rnorm(obs)))
  noise = rnorm(obs)
  y.value = with(X,X1^2+sin(X2)+X3*X4) + noise
  #dichotomize at the median; the class labels are arbitrary
  y.class = factor(y.value>median(y.value),labels=c("-1","200"))
  return(data.frame(y=y.class,X=X))
}
train.data = make.data()
test.data = make.data()
#native RF
RF.default = randomForest(y~.,data=train.data)
print(RF.default)
>Confusion matrix:
>     -1 200 class.error
>-1  386 114       0.228  (~22% false positives for class 200)
>200 131 369       0.262
#solution A: unbalancing the data by stratification.
#It works, but is not recommended.
#stratified RF with downsampling; false positives for class "200" drop to ~6%
RF.stratify = randomForest(y~.,data=train.data,
sampsize=c(500,140),
strata=train.data$y)
print(RF.stratify)
>Confusion matrix:
>     -1 200 class.error
>-1  468  32       0.064  (~6% false positives for class 200)
>200 236 264       0.472
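The other route to solution A mentioned above is class weights. Below is a minimal sketch using the classwt argument of randomForest; the weights c(10,1) (upweighting class "-1" to discourage class "200" predictions) are arbitrary and would need tuning, and classwt is often reported to have a weaker, less predictable effect in this implementation than sampsize, so verify the OOB confusion matrix:

#solution A, variant: class weights instead of stratification (a sketch)
RF.classwt = randomForest(y~.,data=train.data,
                          classwt=c(10,1)) #arbitrary weights, tune and verify
print(RF.classwt)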
#solution B:
#change the vote-aggregation rule with cutoff
RF.default$forest$cutoff=c(.17,.83)
#gotcha: predict.randomForest ignores this cutoff for OOB-CV predictions!
preds.train = predict(RF.default)
table(trainClass=train.data$y,
predClass=preds.train)
> predClass
>trainClass -1 200
> -1 389 111 (OOB predictions ignore a cutoff set after training)
> 200 108 392
#but it does work for predictions on new data
preds.test = predict(RF.default,newdata=test.data)
table(testClass=test.data$y,
predClass=preds.test)
> predClass
>testClass -1 200
> -1 487 13 (~3% false positives for class 200)
> 200 362 138
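Equivalently, instead of overwriting RF.default$forest$cutoff, the cutoff can be passed directly as an argument to predict.randomForest (again this only affects new-data predictions; preds.test2 should match preds.test here since the same cutoff is used):

#equivalent: pass cutoff directly to predict instead of editing the forest object
preds.test2 = predict(RF.default,newdata=test.data,cutoff=c(.17,.83))
all(preds.test2==preds.test) #expected TRUE, same cutoff either way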
#the 'gotcha' found in the randomForest source:
#cutoff only modifies OOB predictions if it is set during training
RF.default = randomForest(y~.,data=train.data,cutoff=c(.17,.83))
preds.train = predict(RF.default)
table(trainClass=train.data$y,
predClass=preds.train)
> predClass
>trainClass -1 200
> -1 490 10 (OOB now reflects the cutoff, ~2% false positives)
> 200 366 134
#extra tip: use a ROC plot to investigate the trade-off between false positives and false negatives, to help choose your favorite cutoff.
library(AUC)
plot(roc(predict(RF.default,type="vote")[,2],train.data$y))
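And, as a complement to the ROC plot, the OOB confusion counts can be tabulated over a grid of candidate cutoffs directly from the vote fractions, using the votes/cutoff rule described at the top (the grid values below are arbitrary):

#scan candidate cutoffs on the OOB vote fractions (grid values are arbitrary)
votes = predict(RF.default,type="vote") #OOB vote fractions, unaffected by cutoff
for (c1 in c(.5,.3,.17,.1)) {
  cut = c(c1,1-c1)
  pred = factor(ifelse(votes[,1]/cut[1] >= votes[,2]/cut[2],"-1","200"),
                levels=levels(train.data$y))
  cat("cutoff:",cut,"\n")
  print(table(trainClass=train.data$y,predClass=pred))
}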