类固醇的零膨胀:在泊松、负二项式和零膨胀回归中进行选择

机器算法验证 r 回归 泊松分布 负二项分布 零通胀
2022-04-11 21:04:41

我正在努力将替代计数模型融入我的数据中。我想我的问题是太多的零。

这是我的数据

> summary(smpl)
    response        predict1          predict2        
 Min.   :0.000   Min.   :   1.00   Min.   :    22005  
 1st Qu.:0.000   1st Qu.:   3.00   1st Qu.:  4669705  
 Median :0.000   Median :   8.00   Median : 12540318  
 Mean   :0.017   Mean   :  23.27   Mean   : 20382574  
 3rd Qu.:0.000   3rd Qu.:  20.00   3rd Qu.: 25468156  
 Max.   :3.000   Max.   :1584.00   Max.   :145348049

> table(smpl$response)
  0   1   2   3 
987  10   2   1 

我尝试了三种回归:基本泊松、负二项式和零膨胀,但唯一没有警告的返回系数的公式是泊松:

> summary(glm(response ~ ., data = smpl, family = poisson))

Call:
glm(formula = response ~ ., family = poisson, data = smpl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3871  -0.2214  -0.1722  -0.1148   4.7861  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.472e+00  3.521e-01  -9.862  < 2e-16 ***
predict1     3.229e-03  7.271e-04   4.442 8.93e-06 ***
predict2    -6.258e-08  3.060e-08  -2.045   0.0409 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 150.67  on 999  degrees of freedom
Residual deviance: 135.84  on 997  degrees of freedom
AIC: 170.06

Number of Fisher Scoring iterations: 8

负二项式返回关于收敛和交替限制的警告

summary(glm.nb(response ~ ., data = smpl))

Call:
glm.nb(formula = response ~ ., data = smpl, init.theta = 0.04901296596, 
    link = log)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.28844  -0.17677  -0.14542  -0.09808   2.38314  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.899e+00  4.587e-01  -8.499  < 2e-16 ***
predict1     1.226e-02  2.144e-03   5.720 1.06e-08 ***
predict2    -5.982e-08  3.407e-08  -1.756   0.0791 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.049) family taken to be 1)

    Null deviance: 69.927  on 999  degrees of freedom
Residual deviance: 55.940  on 997  degrees of freedom
AIC: 152.37

Number of Fisher Scoring iterations: 1


              Theta:  0.0490 
          Std. Err.:  0.0251 
Warning while fitting theta: alternation limit reached 

 2 x log-likelihood:  -144.3700 
Warning messages:
1: glm.fit: algorithm did not converge 
2: In glm.nb(response ~ ., data = smpl) : alternation limit reached

并且零膨胀(来自pscl包装)根本不返回任何东西

> summary(zeroinfl(response ~ ., data = smpl, dist = "negbin"))

Call:
zeroinfl(formula = response ~ ., data = smpl, dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.45252 -0.08817 -0.05515 -0.04210 19.56118 

Count model coefficients (negbin with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.477e+00         NA      NA       NA
predict1     2.678e-03         NA      NA       NA
predict2    -1.160e-07         NA      NA       NA
Log(theta)  -1.241e+00         NA      NA       NA

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.869e+00         NA      NA       NA
predict1    -1.329e-01         NA      NA       NA
predict2    -1.346e-07         NA      NA       NA
Error in if (getOption("show.signif.stars") & any(rbind(x$coefficients$count,  : 
  missing value where TRUE/FALSE needed

然后我的问题是:

  1. 在使用负二项式(以避免警告)和零膨胀(以获得系数)方面,我能做些什么?
  2. 只看上面的结果(因此包括收敛和交替限制的问题)我应该选择负二项式模型,因为从 AIC 来看,在我的数据中似乎比泊松更适合?
1个回答

您有 13 个非零观测值,只有 3 个大于 1 的观测值。因此,我相信您能做的最好的事情就是仅使用二元回归(0 vs > 0)。可能它也有助于为此添加某种形式的正则化,例如减少偏差(参见包 brglm),因为非零非常罕见。

此外,我建议将预测变量带到类似的规模,例如,将 predict2 除以 1e6 左右。否则,数值优化算法就会出现问题。这可能会提高 glm.nb 和 zeroinfl 的收敛性(顺便说一句,它不会返回任何内容 - 有系数,但协方差矩阵估计似乎无效)。