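I have been suspicious of my logistic model because of its very large coefficients, so I tried cross-validating it, and also cross-validating a simplified model, to confirm that the original model is overspecified, as James suggested. However, I don't know how to interpret the results (this is the model from the linked question):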
> summary(m5)
Call:
glm(formula = cbind(ml, ad) ~ rok + obdobi + kraj + resid_usili2 +
rok:obdobi + rok:kraj + obdobi:kraj + kraj:resid_usili2 +
rok:obdobi:kraj, family = "quasibinomial")
[... see https://stats.stackexchange.com/q/48739/5509 for complete summary output ]
> cv.glm(na.omit(data.frame(orel, resid_usili2)), m5, K = 10)
$call
cv.glm(data = na.omit(data.frame(orel, resid_usili2)), glmfit = m5,
K = 10)
$K
[1] 10
$delta
[1] 0.2415355 0.2151626
$seed
[1] 403 271 1234892862 -1124595763 -489713400 1566924080 147612843
[8] 1879282918 -694084381 1171051622 2063023839 -1307030905 -477709428 1248673977
[15] -746898494 420363755 -890078828 460552896 -758793089 -913500073 -882355605
[....]
Warning message:
glm.fit: algorithm did not converge
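I guess delta is the mean fitting error, but how should I interpret it? Is it good or bad? By the way, the algorithm did not converge, perhaps because of the huge coefficients (?)
I tried a simplified model: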
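For reference, here is a minimal sketch of what the first `delta` value appears to measure: the mean squared difference between the observed proportion and the predicted probability on held-out folds, which is the default cost of `cv.glm` as far as the `boot` documentation describes it. The data are simulated, and the names `d`, `succ`, `fail`, `x`, and `cv_mse` are hypothetical; nothing here comes from the `orel` data.

```r
# Manual 10-fold cross-validation of a quasibinomial GLM (base R only).
# All data are simulated; this is an illustration, not the original analysis.
set.seed(42)
n      <- 300
x      <- rnorm(n)
trials <- rep(10, n)
succ   <- rbinom(n, size = trials, prob = plogis(-0.5 + x))
d      <- data.frame(x = x, succ = succ, fail = trials - succ)

K    <- 10
fold <- sample(rep(seq_len(K), length.out = n))  # random fold label for each row

sq_err <- numeric(n)
for (k in seq_len(K)) {
  test  <- fold == k
  # Fit on the K - 1 training folds only
  fit   <- glm(cbind(succ, fail) ~ x, family = quasibinomial, data = d[!test, ])
  # Predicted probabilities for the held-out fold
  p_hat <- predict(fit, newdata = d[test, ], type = "response")
  # Observed proportions in the held-out fold
  obs   <- d$succ[test] / (d$succ[test] + d$fail[test])
  sq_err[test] <- (obs - p_hat)^2
}

# Mean squared prediction error over all held-out rows
cv_mse <- mean(sq_err)
cv_mse
```

Under this reading, `delta` is on the scale of squared proportions, so it is only meaningful relative to some reference value, such as the error of a model with no predictors.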
> summary(m)
Call:
glm(formula = cbind(ml, ad) ~ rok + obdobi + kraj, family = "quasibinomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7335 -1.2324 -0.1666 1.0866 3.1788
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -107.60761 48.06535 -2.239 0.025335 *
rok 0.05381 0.02393 2.249 0.024683 *
obdobinehn -0.26962 0.10372 -2.599 0.009441 **
krajJHC 0.68869 0.27617 2.494 0.012761 *
krajJHM -0.26607 0.28647 -0.929 0.353169
krajLBK -1.11305 0.55165 -2.018 0.043828 *
krajMSK -0.61390 0.37252 -1.648 0.099593 .
krajOLK -0.49704 0.32935 -1.509 0.131501
krajPAK -1.18444 0.35090 -3.375 0.000758 ***
krajPLK -1.28668 0.44238 -2.909 0.003691 **
krajSTC 0.01872 0.27806 0.067 0.946322
krajULKV -0.41950 0.61647 -0.680 0.496315
krajVYS -1.17290 0.39733 -2.952 0.003213 **
krajZLK -0.38170 0.36487 -1.046 0.295698
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 1.304775)
Null deviance: 2396.8 on 1343 degrees of freedom
Residual deviance: 2198.6 on 1330 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 4
Here is the cross-validation:
> cv.glm(orel, m, K = 10)
$call
cv.glm(data = orel, glmfit = m, K = 10)
$K
[1] 10
$delta
[1] 0.2156313 0.2154078
$seed
[1] 403 526 300751243 -244464717 1066448079 1971573706 -1154513152
[8] 634841816 -1521293072 -1040655077 505710009 -323431793 -1218609191 1060964279
[15] 1349082996 -32847357 -1387496845 821178952 -971482876 1295018851 1380491861
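Now it converges. But delta looks more or less the same, even though this model looks much saner! I am now confused about cross-validation... Please give me a hint on how to interpret it.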
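One way to put a `delta` value on a scale is to compare it with the cross-validated error of an intercept-only model fitted to the same data. The following is a minimal sketch under stated assumptions: the data are simulated, the names `d`, `x`, `y`, `fit_full`, and `fit_null` are hypothetical, and the default squared-error cost of `boot::cv.glm` is used.

```r
library(boot)  # provides cv.glm

# Simulated binary data with a real effect of x (illustration only)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(2 * x))
d <- data.frame(x = x, y = y)

fit_full <- glm(y ~ x, family = binomial, data = d)  # model with a predictor
fit_null <- glm(y ~ 1, family = binomial, data = d)  # intercept-only baseline

# Reset the seed before each call so both models see the same fold split
set.seed(2)
cv_full <- cv.glm(d, fit_full, K = 10)
set.seed(2)
cv_null <- cv.glm(d, fit_null, K = 10)

# delta[1] is the raw CV estimate, delta[2] the bias-adjusted one
cv_full$delta
cv_null$delta
```

If a model's `delta` is barely below that of the intercept-only baseline, the predictors add little out-of-sample accuracy under this cost. This is a rough heuristic rather than a formal test.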