All of the data for my binary classification problem is held in X and y. I run stratified cross-validation on this data as follows:
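from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate
# Metrics to compute on each test fold
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score),
           'f1_score': make_scorer(f1_score)}
model = RandomForestClassifier(n_estimators=50, random_state=10)
# With a classifier and an integer cv, scikit-learn uses StratifiedKFold (no shuffling)
results = cross_validate(estimator=model, X=X, y=y, cv=10, scoring=scoring)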
Running this code gives the following results:
Accuracy : 0.5436815489342804
Precision : 0.020165565854870747
Recall : 0.11013513513513513
F1_score : 0.03315023853741518
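For reference, here is a minimal sketch of how per-fold scores from cross_validate can be reduced to one number per metric like the values above. It assumes the reported figures are fold means, which the post does not state, and it uses synthetic data from make_classification (X_demo and y_demo are placeholders, not the original X and y).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real data (an assumption for illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0)

scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score),
           'f1_score': make_scorer(f1_score)}
model = RandomForestClassifier(n_estimators=50, random_state=10)
results = cross_validate(estimator=model, X=X_demo, y=y_demo, cv=10, scoring=scoring)

# cross_validate stores one array per metric under the key "test_<name>",
# with one entry per fold; average each array to get a single number.
summary = {name: float(np.mean(results[f"test_{name}"])) for name in scoring}
for name, value in summary.items():
    print(f"{name.capitalize()} : {value}")
```

Printing the per-fold arrays themselves, rather than only their means, would also show whether a few folds are pulling the average down.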
Next, I split X and y as follows:
from sklearn.model_selection import train_test_split
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
# Split the training data further to hold out a validation set
X_val_train, X_val_test, y_val_train, y_val_test = train_test_split(X_train, y_train, test_size=0.1, random_state=0, stratify=y_train)
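As a quick sanity check on what these two calls produce, the sketch below runs the same splits on hypothetical placeholder data (1000 rows with 10% positives, not the original dataset) and reports the resulting subset sizes and class balance. Note that the first call has no random_state, so the rows that end up in each subset change from run to run, although the sizes do not.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 1000 rows, 10% positives.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 100 + [0] * 900)

# Same two-stage split as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo)
X_val_train, X_val_test, y_val_train, y_val_test = train_test_split(
    X_train, y_train, test_size=0.1, random_state=0, stratify=y_train)

# 20% test, then 10% of the remaining 80% for validation: 72% / 8% / 20%.
sizes = {"train": len(X_val_train), "validation": len(X_val_test), "test": len(X_test)}
print(sizes)

# Stratification should keep the positive rate close to 10% in every subset.
positive_rates = {
    "train": float(y_val_train.mean()),
    "validation": float(y_val_test.mean()),
    "test": float(y_test.mean()),
}
print(positive_rates)
```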
Now, if I run the same cross-validation procedure as before on X_train and y_train, I get the following results:
Accuracy : 0.8424393681243558
Precision : 0.47658195862621017
Recall : 0.1964997354963851
F1_score : 0.2773991741912054
I don't understand why the results are so different or what causes this.
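One difference between the two runs that may be worth checking (a possible cause, not a confirmed diagnosis): with cv=10 and a classifier, cross_validate builds its folds with StratifiedKFold without shuffling, whereas train_test_split shuffles rows by default, so X_train is already in random order. The sketch below compares unshuffled and shuffled folds; X_demo, y_demo and the helper mean_accuracy are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real X and y (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=10)


def mean_accuracy(cv):
    """Return the mean test accuracy over the folds produced by `cv`."""
    results = cross_validate(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    # With a single string scorer, the per-fold scores are under "test_score".
    return float(np.mean(results["test_score"]))


# Folds taken in the original row order (what cv=10 does for a classifier).
unshuffled = mean_accuracy(StratifiedKFold(n_splits=10))
# Folds taken after shuffling the rows.
shuffled = mean_accuracy(StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

print(f"unshuffled folds: {unshuffled:.3f}")
print(f"shuffled folds:   {shuffled:.3f}")
```

On this synthetic data the two numbers should be close. If the same comparison on the real X and y shows a large gap, that would suggest the rows are ordered in some meaningful way (for example by time or by group), which changes what each fold sees.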
