Data mining - Adding extra variables to an XGBoost model lowers both training and test accuracy

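I am fitting a multiclass model with XGBoost. I get 96% accuracy on the training set and 95% on the test set, using an 80-20 train/test split. However, when I add two new features, training accuracy drops to 92% and test accuracy drops to 89%.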

Doesn't XGBoost:

Select the most important variables for splitting nodes and ignore the rest?
Handle multicollinearity?

I am not using cross-validation. Could I still be overfitting the data?

Here is the code I am using:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# df (features), labels2 (integer-encoded labels) and label_encoder are
# assumed to be defined earlier (not shown).
df_new_train, df_new_test, y_train, y_test = train_test_split(df, labels2, test_size=0.2)

dtrain = xgb.DMatrix(df_new_train, label=y_train)
dtest = xgb.DMatrix(df_new_test, label=y_test)

param = {
    'max_depth': 10,
    'eta': 0.01,
    'subsample': 0.6,
    'colsample_bytree': 0.5,
    # 'alpha': 0.5,
    # 'lambda': 0.5,
    'gamma': 10,
    'min_child_weight': 1,
    'objective': 'multi:softprob',  # output per-class probabilities
    'num_class': 4,  # the number of classes that exist in this dataset
}
num_round = 1500

# The watchlist and early_stopping_rounds are arguments of xgb.train, not
# booster parameters; placed inside `param` they have no effect.
watchlist = [(dtrain, 'train'), (dtest, 'valid')]
bst = xgb.train(param, dtrain, num_round, evals=watchlist, early_stopping_rounds=10)

preds = bst.predict(dtest)
preds_train = bst.predict(dtrain)

# Pick the most probable class for each row.
best_preds_train = np.argmax(preds_train, axis=1)
best_preds = np.argmax(preds, axis=1)

print(classification_report(y_test, best_preds, target_names=label_encoder.classes_))
```