Rescaling predictions from a regression model fitted on scaled predictors

Cross Validated · r · regression · prediction
2022-03-17 19:59:39

Suppose I fit a model to data that have been standardized to z-scores:

#load data
data(cars)

#standardize variables so that they have mean 0 and st. dev 1
st.cars <- scale(cars)
st.cars <- as.data.frame(st.cars)

mod <- lm(speed ~ dist,data=st.cars)

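I would like to use this model to make predictions:

predict(mod, st.cars)

However, I want the predictions on the original scale of the dependent variable in the dataset.

Is there a way to rescale the predictions back to the original scale?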

2 Answers

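The scale function stores the scale and center values it used as attributes of the scaled data. These can be used to convert predictions made on the scaled data back to the original data scale.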
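In symbols, this conversion is just the inverse of the z-score transform applied to the response. The notation here is introduced only for illustration: $\bar{y}$ and $s_y$ are the stored center and scale of the response, and $\hat{y}_z$ is a prediction on the standardized scale.

```latex
z = \frac{y - \bar{y}}{s_y}
\quad\Longrightarrow\quad
\hat{y} = \hat{y}_z \, s_y + \bar{y}
```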

# Scale cars data:
scars <- scale(cars)
# Save scaled attributes:
scaleList <- list(scale = attr(scars, "scaled:scale"),
    center = attr(scars, "scaled:center"))
# scars is a matrix, make it a data frame like cars for modeling:
scars <- as.data.frame(scars) 
smod <- lm(speed ~ dist, data = scars)
# Predictions on scaled data:
sp <- predict(smod, scars)
# Fit the same model to the original cars data:
omod <- lm(speed ~ dist, data = cars)
op <- predict(omod, cars)
# Convert scaled prediction to original data scale:
usp <- sp * scaleList$scale["speed"] + scaleList$center["speed"]
# Compare predictions:
all.equal(op, usp)

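To predict new data with the smod model object, scale the newdata values using the appropriate values from the scaleList object (do not call scale() directly on newdata, since that would standardize with the new data's own mean and standard deviation).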
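A minimal sketch of that newdata step, rebuilding the scaleList and smod objects from the code above so it runs on its own. The data frame new_cars and its stopping distances are made-up values for illustration only.

```r
# Rebuild the objects from the answer so the sketch is self-contained
data(cars)
scars <- scale(cars)
scaleList <- list(scale = attr(scars, "scaled:scale"),
                  center = attr(scars, "scaled:center"))
smod <- lm(speed ~ dist, data = as.data.frame(scars))

# Hypothetical new observations on the ORIGINAL scale (illustrative values)
new_cars <- data.frame(dist = c(10, 50, 100))

# Scale the predictor with the TRAINING center and scale, not with scale(new_cars)
new_scaled <- data.frame(
  dist = (new_cars$dist - scaleList$center["dist"]) / scaleList$scale["dist"]
)

# Predict on the z-score scale, then convert back to the original speed units
new_sp <- predict(smod, new_scaled)
new_usp <- new_sp * scaleList$scale["speed"] + scaleList$center["speed"]

# Sanity check against the model fitted to the unscaled data
omod <- lm(speed ~ dist, data = cars)
all.equal(unname(new_usp), unname(predict(omod, new_cars)))
```

The final comparison is there as a check: if the training center and scale were applied correctly, the back-transformed predictions should agree with those from the unscaled model.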

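I have built on skaluzny's answer, in case you want a more intuitive way of doing this that does not save the scale attributes but instead relies on knowing what the scale() function does by default (you really only need the last few lines of this answer).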

The scale function centers (subtracts the mean) and then scales (divides by the standard deviation of the data):

sdist <- scale(cars$dist)
head(sdist)

           [,1]
[1,] -1.5902596
[2,] -1.2798136
[3,] -1.5126481
[4,] -0.8141446
[5,] -1.0469791
[6,] -1.2798136

sdist2<-(cars$dist-mean(cars$dist))/sd(cars$dist)
head(sdist2)

[1] -1.5902596 -1.2798136 -1.5126481 -0.8141446 -1.0469791 -1.2798136
# Note: this prints as a row only because it is a plain vector, whereas scale() returns a matrix:

sdist2<-as.matrix(sdist2)
head(sdist2)
# The output now looks identical

           [,1]
[1,] -1.5902596
[2,] -1.2798136
[3,] -1.5126481
[4,] -0.8141446
[5,] -1.0469791
[6,] -1.2798136

So we can simply use the mean and standard deviation of the original data, rather than storing the attributes in a list.

# Scale cars data:
scars <- scale(cars)

# Save scaled attributes:
scaleList <- list(scale = attr(scars, "scaled:scale"),
                  center = attr(scars, "scaled:center"))

scaleList
$`scale`
    speed      dist 
 5.287644 25.769377 

$center
speed  dist 
15.40 42.98 

> sapply(cars,mean) # note that these values are the same as the `center` values above
speed  dist 
15.40 42.98

> sapply(cars,sd)  # note that these values are the same as the `scale` values above
    speed      dist 
 5.287644 25.769377 

So now we can check whether the predicted values are all the same if we simply use mean() and sd() instead of the scaled attributes:

# scars is a matrix, make it a data frame like cars for modeling:
scars <- as.data.frame(scars) 
smod <- lm(speed ~ dist, data = scars)

# Predictions on scaled data:
sp <- predict(smod, scars)

# Fit the same model to the original cars data:
omod <- lm(speed ~ dist, data = cars)
op <- predict(omod, cars)

# Now the original answer was to use these stored attributes to modify the predictions:
usp1 <- sp * scaleList$scale["speed"] + scaleList$center["speed"]

# We can also simply use the standard deviation and mean from the original dataset:
usp2 <- sp * sd(cars$speed) + mean(cars$speed)

identical(usp1,usp2)
[1] TRUE

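# all.equal() compares two objects at a time, so check each version separately:
all.equal(op, usp1)
[1] TRUE

all.equal(op, usp2)
[1] TRUE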

Doing it the following way may be faster/more efficient, since there is no need to create extra data frames/objects:

Mod <- lm(scale(speed) ~ scale(dist), data = cars) # add scale() function directly to model

Unscaled_Pred <- predict(Mod, cars) * sd(cars$speed) + mean(cars$speed)


all.equal(op, Unscaled_Pred)
[1] TRUE                      # predictions are the same as the model that was never scaled
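
When predicting for new observations with this formula-based model, my understanding is that predict() reuses the center and scale computed from the training data for scale() terms in the formula (R handles this through makepredictcall()), so only the response needs converting back by hand. The sketch below checks that assumption against the unscaled model. The data frame new_cars holds made-up values for illustration.

```r
data(cars)
Mod <- lm(scale(speed) ~ scale(dist), data = cars)
omod <- lm(speed ~ dist, data = cars)

# Hypothetical new stopping distances on the original scale (illustrative values)
new_cars <- data.frame(dist = c(10, 50, 100))

# as.numeric() drops names/dimensions so the comparison below is on values only
pred_new <- as.numeric(predict(Mod, newdata = new_cars)) * sd(cars$speed) + mean(cars$speed)

# Compare with predictions from the model fitted to the unscaled data
all.equal(pred_new, as.numeric(predict(omod, newdata = new_cars)))
```

If the comparison does not return TRUE in your R version, fall back to scaling newdata by hand with the training mean and standard deviation.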