Combining $\chi^2$ tests

Cross Validated, chi-squared test

2022-04-11 20:59:54

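I have a statistics question. I am studying whether changes in the concentration of a certain substance affect how strongly insects are attracted to it. I do this by baiting insect traps with different concentrations of the substance. I set the traps out in pairs, and the two traps in each pair have different concentrations. My data look something like this: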

More : Less
   6 : 0  
   1 : 0   
   3 : 1  
   1 : 0  
  15 : 3

What I need to do is show that the insects prefer the bait with the higher concentration.

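Pooling the data and using a $\chi^{2}$ test is one option, but I do not know whether it is the best one. The trapping was done at different sites over the course of a month, so confounding factors may be involved.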

Thanks in advance!

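Edit: this is only a small part of my full data. In total I have about 40 pairs of values to compare, and about 300 insects were caught. The sample above is small but representative.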

2 Answers

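If the traps were all considered independent, then this would not be a variation of a $\chi^2$ test, it simply would be one. But because the traps are used in pairs, and are therefore not independent, what you are looking for is a variation of McNemar's test. Unfortunately, any modification of that test for a table larger than 2x2 will still suffer, because there are so few items in the Less column.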
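For comparison only, here is a minimal sketch of the pooled $\chi^2$ test the question mentions, applied to the five example pairs. It assumes every trapped insect is an independent observation, which is exactly the assumption the pairing violates, so treat the result as a rough reference point rather than a recommended analysis. The variable names are illustrative.

```r
# Pooled counts from the five example pairs (More column, Less column)
pooled <- c(more = 6 + 1 + 3 + 1 + 15, less = 0 + 0 + 1 + 0 + 3)

# Goodness-of-fit test against a 50:50 split between the two trap types.
# This ignores the pairing and any site or date effects.
pooled_test <- chisq.test(pooled)

pooled_test$statistic
pooled_test$p.value
```

With the full data set of about 40 pairs the same call would apply to the pooled totals, but the independence caveat remains.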

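Your effect is so strong that I would be tempted to just report the data. What would be wrong with that? When the data are this strong and there is almost no noise in the effect, it is hard to see why you would not simply report what you found and state it as fact. Statistics are not as powerful as you might think, and they would not make this compelling data story any better. I worry they would only be used to hide the small-sample problem.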

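If you really do want a probability, resampling is probably your best option. You can perform a permutation or randomization test. Calculate the mean difference between the More and Less conditions. Then randomly shuffle the samples, keeping the pairs together, and calculate the new mean difference. Do this thousands of times on a computer, then find where your observed difference falls in the distribution of the differences you sampled. The probability of an effect that large or larger is the p-value you would report.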
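With only five example pairs, the shuffle-within-pairs procedure described above can be enumerated exactly instead of sampled. The sketch below is one way to do that, assuming that shuffling within a pair amounts to swapping the More and Less labels, which flips the sign of that pair's difference. It compares total differences rather than mean differences; the two are equivalent because the number of pairs is fixed. All names are illustrative.

```r
more <- c(6, 1, 3, 1, 15)
less <- c(0, 0, 1, 0, 3)
d <- more - less                 # per-pair difference, More - Less
obs <- sum(d)                    # observed total difference

# Every possible way of swapping labels within pairs: 2^5 = 32 sign patterns
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(d))))
perm_totals <- as.vector(signs %*% d)

# Share of relabelings with a difference at least as extreme as observed
p_two_sided <- mean(abs(perm_totals) >= abs(obs))
p_one_sided <- mean(perm_totals >= obs)
```

For the full data set of about 40 pairs, full enumeration is impractical, and random sign flips would be sampled instead.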

Here is some basic R that estimates such a p-value by simulation.

dat <- matrix(c(6, 1, 3, 1, 15, 0, 0, 1, 0, 3), ncol = 2)  # column 1 = More, column 2 = Less
n <- nrow(dat)
eff <- diff(colSums(dat))   # observed effect: total Less - total More
samps <- rowSums(dat)       # insects caught per pair
nsamp <- 5000

# simulate nsamp replications of your experiment under the null
y <- replicate(nsamp, {
    # get a random amount for each location and put it in the more trap
    # (every count from 0 to the pair total is equally likely here)
    more <- sapply(1:n, function(i) sample(samps[i] + 1, 1) - 1)
    # of course, the rest is in the less trap
    less <- samps - more
    # calculate effect (less - more might be backwards of what you want
    # but it's what the diff command did above for the original effect so
    # we keep calculating in the same direction)
    sum(less) - sum(more)
})
# two-sided p-value: share of simulated effects as large or larger than observed
mean(abs(y) >= abs(eff))

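That p-value is the probability of the data producing an effect as large as or larger than yours, given the null hypothesis and the (always implicit) assumption of a representative sample. Think of it as asking what would happen if the null were true. The traps would just catch insects at random. Imagine you caught as many insects as you did, at as many locations as you did, and then look at how those insects would fall into the two traps by chance. If your effect is unlikely to occur when the null is true, we conclude that the null is not true.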
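One possible variation on this null model, offered as an assumption rather than as what the answer's code does: if each insect caught at a pair independently lands in either trap with probability 0.5, the count in the More trap is binomial rather than uniform over 0 to the pair total. A minimal sketch under that assumption, keeping the same effect direction (Less minus More):

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

dat <- matrix(c(6, 1, 3, 1, 15, 0, 0, 1, 0, 3), ncol = 2)
samps <- rowSums(dat)          # insects caught per pair
eff <- diff(colSums(dat))      # observed effect: total Less - total More
nsamp <- 5000

y <- replicate(nsamp, {
  # assumed null: each insect picks either trap of its pair with probability 0.5
  more <- rbinom(length(samps), size = samps, prob = 0.5)
  less <- samps - more
  sum(less) - sum(more)
})

# two-sided Monte Carlo p-value
p_binom <- mean(abs(y) >= abs(eff))
```

If no simulated effect is as extreme as the observed one, the estimate is 0, and it is usually reported as p < 1/nsamp rather than as exactly zero.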

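Alternatively, one could sample from the distribution of effects with replacement. Doing so lets you bootstrap a confidence interval for the effect.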

# get each separate effect (Less - More for each pair)
effs <- dat[, 2] - dat[, 1]
nsamp <- 1000
# bootstrap nsamp replications of your experiment
y <- replicate(nsamp, {
    # randomly sample from the distribution of effects
    effSamp <- sample(effs, replace = TRUE)
    # get total sample effect
    sum(effSamp)
})
# 95% percentile CI for the total effect
quantile(y, c(0.025, 0.975))

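A super-robust and super-conservative approach would be the nonparametric signed-rank test or the matched-pairs sign test. In the first test, you assume that the distributions are the same in both groups; in the second, you explicitly incorporate the coupling of the two distributions.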

. signrank trapped0 = trapped1

Wilcoxon signed-rank test

        sign |      obs   sum ranks    expected
-------------+---------------------------------
    positive |        0           0         7.5
    negative |        5          15         7.5
        zero |        0           0           0
-------------+---------------------------------
         all |        5          15          15

unadjusted variance       13.75
adjustment for ties       -0.13
adjustment for zeros       0.00
                     ----------
adjusted variance         13.63

Ho: trapped0 = trapped1
             z =  -2.032
    Prob > |z| =   0.0422

. signtest trapped0 = trapped1

Sign test

        sign |    observed    expected
-------------+------------------------
    positive |           0         2.5
    negative |           5         2.5
        zero |           0           0
-------------+------------------------
         all |           5           5

One-sided tests:
  Ho: median of trapped0 - trapped1 = 0 vs.
  Ha: median of trapped0 - trapped1 > 0
      Pr(#positive >= 0) =
         Binomial(n = 5, x >= 0, p = 0.5) =  1.0000

  Ho: median of trapped0 - trapped1 = 0 vs.
  Ha: median of trapped0 - trapped1 < 0
      Pr(#negative >= 5) =
         Binomial(n = 5, x >= 5, p = 0.5) =  0.0313

Two-sided test:
  Ho: median of trapped0 - trapped1 = 0 vs.
  Ha: median of trapped0 - trapped1 != 0
      Pr(#positive >= 5 or #negative >= 5) =
         min(1, 2*Binomial(n = 5, x >= 5, p = 0.5)) =  0.0625

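Basically, both tests say that if the null of no difference is true, the probability that all five differences fall on the same side (with the higher count in the "more" condition) is 1/2^5 = 1/32. However, with a sample size of 5 these tests are unlikely to have much power, and you can get stronger results by making stronger assumptions.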
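The 1/32 figure can be checked with an exact binomial calculation. The sketch below uses R's `binom.test` on the five example pairs, counting a pair as a "success" when the More trap caught more insects; the variable names are illustrative.

```r
n_pairs <- 5          # pairs with a nonzero difference
n_more_higher <- 5    # pairs where the More trap caught more insects

# One-sided: probability of 5 out of 5 under a fair coin, i.e. 1/2^5
one_sided <- binom.test(n_more_higher, n_pairs, p = 0.5,
                        alternative = "greater")$p.value

# Two-sided: all five on either side
two_sided <- binom.test(n_more_higher, n_pairs, p = 0.5)$p.value
```

These correspond to the 0.0313 and 0.0625 values in the sign test output above.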

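Since you are dealing with counts, the appropriate tool for your problem would be a generalized linear model, namely a Poisson model for counts. Denoting the sites as block and the conditions as treat, I get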

Poisson regression                                Number of obs   =         10
                                                  LR chi2(5)      =      47.17
                                                  Prob > chi2     =     0.0000
Log likelihood =  -11.51985                       Pseudo R2       =     0.6718

------------------------------------------------------------------------------
     trapped |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       block |
          2  |   .1666667   .1800206    -1.66   0.097     .0200653    1.384368
          3  |   .6666667   .4303315    -0.63   0.530     .1881311    2.362419
          4  |   .1666667   .1800206    -1.66   0.097     .0200653    1.384368
          5  |          3   1.414214     2.33   0.020     1.190861    7.557558
             |
     1.treat |        6.5    3.49106     3.49   0.000     2.268531    18.62438
       _cons |         .8   .4953114    -0.36   0.719     .2377266    2.692168
------------------------------------------------------------------------------

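That is, the "more" condition attracts on average 6.5 times as many insects as the "less" condition, with a confidence interval of [2.27x, 18.62x]. The p-value here, p = 0.0005, is much stronger than those from the nonparametric tests.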

Stata code:

clear
input trapped block treat
6 1 1
0 1 0
1 2 1
0 2 0
3 3 1
1 3 0
1 4 1
0 4 0
15 5 1
3 5 0
end
poisson trapped i.block i.treat, irr
testparm i.treat
reshape wide trapped, i(block) j(treat)
signrank trapped0 = trapped1
signtest trapped0 = trapped1
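For readers working in R rather than Stata, the Poisson model above can be sketched with `glm`. This is an assumed equivalent, not part of the original answer: it uses the same ten observations, treats block as a factor, and exponentiates the treat coefficient and its Wald interval to obtain an incidence rate ratio. The data frame and variable names are illustrative.

```r
trap <- data.frame(
  trapped = c(6, 0, 1, 0, 3, 1, 1, 0, 15, 3),
  block   = factor(rep(1:5, each = 2)),   # site / pair identifier
  treat   = rep(c(1, 0), times = 5)       # 1 = More, 0 = Less
)

# Poisson regression with a block effect and a treatment effect
fit <- glm(trapped ~ block + treat, family = poisson, data = trap)

# Incidence rate ratio for treat and its Wald 95% confidence interval
irr <- exp(coef(fit)["treat"])
wald_ci <- exp(confint.default(fit, "treat"))

irr
wald_ci
```

`confint.default` is used here because it gives a Wald interval, which is the kind Stata reports in its table; `confint` on a `glm` fit would give a profile-likelihood interval instead, and the bounds would differ somewhat.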
