How do I get a DataFrame after a scikit-learn pipeline?

Data mining Python scikit-learn python-3.x pipeline
2022-02-24 23:02:26

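I'm using a scikit-learn pipeline to apply a number of data transformations and fit a model, but I need to extract X_train and X_test right after the transformations (imputer, encoding, etc.) so that I can use them for other analysis. How can I get them?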

Here is my pipeline:

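from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from imblearn.ensemble import BalancedBaggingClassifier, BalancedRandomForestClassifier

# encoder, numericas_all and categoricas_all are defined elsewhere (not shown here)

imputer_num = SimpleImputer(strategy='median')
imputer_cat = SimpleImputer(strategy='most_frequent')

XGB = XGBClassifier()
BBC = BalancedBaggingClassifier()
BRC = BalancedRandomForestClassifier()

models = [XGB, BBC, BRC]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encod', encoder),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numericas_all),
        ('cat', categorical_transformer, categoricas_all),
    ])

for item in models:
    pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', item)])
    model = pipe.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    test_probs = model.predict_proba(X_test)
    print(model)
    print(balanced_accuracy_score(y_test, y_pred))
    # ROC AUC is computed from scores, not hard labels (assumes a binary target)
    print(roc_auc_score(y_test, test_probs[:, 1]))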

1 Answer

You can try applying the preprocessor on its own to your X_train and X_test (it has to be fitted on X_train first):

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numericas_all),
        ('cat', categorical_transformer, categoricas_all),
    ])

# Fit on the training data only, then reuse the fitted preprocessor on the test data
X_train_pipe = preprocessor.fit_transform(X_train)
X_test_pipe = preprocessor.transform(X_test)
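
For anyone who wants to try this end to end, here is a minimal sketch on toy data. The DataFrames, the column names (`age`, `city`) and the choice of `OrdinalEncoder` are my own assumptions, since the question does not show the data or say which encoder is used. It fits the preprocessor on the training set, transforms both sets, and wraps the resulting arrays back into DataFrames.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data standing in for the real X_train / X_test.
X_train = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": ["a", "b", np.nan, "b"],
})
X_test = pd.DataFrame({"age": [30.0, np.nan], "city": ["a", np.nan]})

numericas_all = ["age"]
categoricas_all = ["city"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encod", OrdinalEncoder()),  # assumed encoder, not taken from the question
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numericas_all),
    ("cat", categorical_transformer, categoricas_all),
])

# Fit on the training data only, then reuse the fitted transformer on the test data.
X_train_pipe = preprocessor.fit_transform(X_train)
X_test_pipe = preprocessor.transform(X_test)

# Output columns follow the transformer order: numeric first, then categorical.
names = numericas_all + categoricas_all
X_train_df = pd.DataFrame(X_train_pipe, columns=names, index=X_train.index)
X_test_df = pd.DataFrame(X_test_pipe, columns=names, index=X_test.index)
```

Passing `index=X_train.index` keeps the original row labels, which makes it easier to join the transformed frame back to other data.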

Edit:

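Since you are not using any transformer that creates new columns, such as OneHotEncoder, getting the feature names is straightforward: they are the same as the input columns of X, in the order the ColumnTransformer lists them (numericas_all first, then categoricas_all). If you do use an encoder like the one mentioned above, you can call the get_feature_names_out method instead (get_feature_names in older scikit-learn versions).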

To do it all in one step, I would add an extra step after the preprocessor:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

# Output columns follow the order of the transformers, not X_train.columns
names = list(numericas_all) + list(categoricas_all)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numericas_all),
        ('cat', categorical_transformer, categoricas_all),
    ])

# The final step wraps the transformed array in a DataFrame
pipe_preprocessor = Pipeline([
    ("preprocessor", preprocessor),
    ("pandarizer", FunctionTransformer(lambda x: pd.DataFrame(x, columns=names))),
]).fit(X_train)

X_train_pipe = pipe_preprocessor.transform(X_train)
X_test_pipe = pipe_preprocessor.transform(X_test)
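
If the full pipeline (preprocessor plus classifier) is already fitted, as in the loop from the question, the transformed data can also be taken from that pipeline without fitting anything again. A rough sketch, with toy data and `LogisticRegression` as a stand-in classifier (both are assumptions, used only to keep the example self-contained):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data standing in for the real training set.
X_train = pd.DataFrame({"age": [20.0, 30.0, 50.0, 60.0],
                        "city": ["a", "b", "a", "b"]})
y_train = [0, 0, 1, 1]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age"]),
    ("cat", OrdinalEncoder(), ["city"]),
])
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression()),  # stand-in for XGB / BBC / BRC
])
pipe.fit(X_train, y_train)

# pipe[:-1] is a sub-pipeline with every step except the final estimator;
# it reuses the already fitted steps, so nothing is refit here.
X_train_pipe = pipe[:-1].transform(X_train)

# Equivalent: address the fitted step by its name.
same = pipe.named_steps["preprocessor"].transform(X_train)

X_train_df = pd.DataFrame(X_train_pipe, columns=["age", "city"],
                          index=X_train.index)
```

This way the data you analyse is exactly what the classifier saw during training.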