Statistical model suggestion given my dataset

data-mining machine-learning r random-forest cross-validation
2022-02-26 13:08:53

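I am running into a problem when using k-fold cross-validation with the random forest method. One of the outputs is the error "Error in randomForest.default(x, y, mtry = param$mtry, ...) : Need at least two classes to do classification." However, I already have two classes for the classification, namely "Normal" and "Failure". When I posted this question at https://stackoverflow.com/questions/60643415/error-in-randomforest-defaultx-y-mtry-parammtry-need-at-least-two?noredirect=1#comment107290940_60643415, I was directed here to ask for "advice on statistical models given my data and your prediction/estimation/modeling needs".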

Can anyone help me?

R script:

library(caret)    
library(randomForest)

data_failures <- read.csv('OUTPUT.csv', header = TRUE, sep = ",", stringsAsFactors = TRUE)

train.control <- trainControl(method = "cv", number = 10)
model <- train(Period_1 ~., data = data_failures, method = "rf",
                                   trControl = train.control)
print(model)

print(class(str(data_failures)))

Output:

Random Forest

112 samples
 11 predictor
  2 classes: 'Failure', 'Normal'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
Resampling results across tuning parameters:

  mtry  Accuracy  Kappa
   2    1         NaN
   6    1         NaN
  11    1         NaN

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
'data.frame':   112 obs. of  12 variables:
 $ Period_1 : Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_2 : Factor w/ 2 levels "Failure","Normal": 2 2 2 1 2 2 2 2 2 1 ...
 $ Period_3 : Factor w/ 2 levels "Failure","Normal": 2 2 1 2 2 2 2 2 2 2 ...
 $ Period_4 : Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_5 : Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_6 : Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_7 : Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_8 : Factor w/ 2 levels "Failure","Normal": 2 2 2 1 2 2 2 2 2 2 ...
 $ Period_9 : Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_10: Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_11: Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
 $ Period_12: Factor w/ 2 levels "Failure","Normal": 2 2 2 2 2 2 2 2 2 2 ...
[1] "NULL"
Warning messages:
1: model fit failed for Fold08: mtry= 2 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
  Need at least two classes to do classification.

2: model fit failed for Fold08: mtry= 6 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
  Need at least two classes to do classification.

3: model fit failed for Fold08: mtry=11 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
  Need at least two classes to do classification.

4: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Data sample:

    Period_1 Period_2 Period_3 Period_4 Period_5 Period_6 Period_7 Period_8
1     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
2     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
3     Normal   Normal  Failure   Normal   Normal   Normal   Normal   Normal
4     Normal  Failure   Normal   Normal   Normal   Normal   Normal  Failure
5     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
6     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
7     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
8     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
9     Normal   Normal   Normal   Normal   Normal   Normal   Normal   Normal
10    Normal  Failure   Normal   Normal   Normal   Normal   Normal   Normal
1 Answer

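My guess is that one of your two classes (Failure or Normal) has far fewer instances than the other. As a result, for some fold (say the n-th one), the data used for fitting contains instances of only one class. You can try oversampling the under-represented class to prevent this, or use stratified k-fold so that both classes appear in every fold.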
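To make this concrete, here is a minimal base R sketch of the diagnosis and of both suggestions. It does not use the asker's OUTPUT.csv: the vector `y` is a hypothetical stand-in for `data_failures$Period_1`, and the 111/1 class split is an assumption chosen only to reproduce the symptom (the real counts are not shown in the question). All variable names are illustrative.

```r
# Hypothetical stand-in for data_failures$Period_1 (assumed counts, not the real data):
# 111 "Normal" rows and a single "Failure" row.
set.seed(1)
y <- factor(c(rep("Normal", 111), "Failure"), levels = c("Failure", "Normal"))

# Step 1: count the classes. A class with only a handful of rows is the warning sign.
class_counts <- table(y)
print(class_counts)

# Step 2: assign rows to k folds at random and count how many distinct classes
# remain in the training portion of each fold (all rows NOT in that fold).
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = length(y)))
classes_in_training <- sapply(seq_len(k), function(f) {
  length(unique(y[fold_id != f]))
})
print(classes_in_training)  # a 1 marks a fold whose training data has one class only

# Step 3: oversample the minority class up to the size of the majority class.
# Indexing via sample.int avoids the pitfall of sample() on a length-one vector.
minority <- names(which.min(class_counts))
minority_rows <- which(y == minority)
n_extra <- max(class_counts) - length(minority_rows)
extra <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
y_bal <- y[c(seq_along(y), extra)]
print(table(y_bal))

# Step 4: stratified fold assignment, spreading each class evenly over the k folds.
fold_bal <- integer(length(y_bal))
for (lv in levels(y_bal)) {
  idx <- which(y_bal == lv)
  fold_bal[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}
classes_bal <- sapply(seq_len(k), function(f) {
  length(unique(y_bal[fold_bal != f]))
})
print(classes_bal)
```

Two caveats on this sketch. First, duplicating rows before splitting puts copies of the same row in both the training and held-out parts, so the resampled accuracy will likely be optimistic. Second, as far as I know caret's `createFolds` already samples within the levels of a factor outcome, so if only one fold fails (as with Fold08 in the warnings), it may be that `Period_1` contains just a single Failure row; in that case collecting more Failure examples is probably the more honest fix.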