我正在尝试对 kaggle 的 Titanic 数据集的决策树分类器进行交叉验证。清理数据后的第一步是拆分成训练集和测试集:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train, Y, test_size=0.2, random_state=0)
然后将数字转换为缩放值:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
此外:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
对于网格搜索,我使用了 GridSearchCV:
#Make a grid search
from sklearn.model_selection import GridSearchCV
tree_param = [{'criterion': ['entropy', 'gini'], 'max_depth': [2,3, 4]}]
最后将 GridSearchCV 分类器放入数据中:
clf = GridSearchCV(classifier, tree_param, cv=4)
clf.fit(X=x_train, y=y_train)
我得到的错误如下:
ValueError:输入包含 NaN、无穷大或对于 dtype('float32') 来说太大的值。
我检查了我的x_train和y_train集合,它们都包含特定范围内的数值:
