机器算法验证 - 嵌套方差分析：样本量不等？方差分量？ - 吾爱随笔录

嵌套方差分析：样本量不等？方差分量？

机器算法验证 r 方差分析嵌套数据分区

2022-03-24 22:16:04

我对此完全不了解，我尝试做的所有阅读都让我感到困惑。我希望你能以一种有意义的方式向我解释事情。（似乎总是如此，“不应该这么难！”）

我正在尝试帮助一个正在研究社会系统对各种犬科动物宿主疾病流行的影响的学生。我们想将社会系统（例如，群居与独居）视为一种固定效应，而将宿主物种视为嵌套在社会系统中的随机效应（即，每个物种只有一种社会系统类型）。

我的理解是，最好的方法是进行混合效应逻辑回归。我们已经做到了，它奏效了，我们很高兴。不幸的是，她的顾问坚持要她计算由于社会系统、宿主物种和残差引起的变异量。我不知道如何通过混合效应逻辑回归来做到这一点，而且我之前关于这个主题的问题没有得到解答。

她的顾问建议改为进行方差分析，将疾病流行率值（每个被感染人群的比例）转换为对数。这带来了一个问题，因为一些流行值是 0 或 1，这将导致或一旦 logit 转换。她的顾问的“解决方案”是分别用和替换或。这感觉真的很笨拙，让我非常畏缩。但他是给她评分的人，在这一点上，我只想完成这个，所以如果他对它没意见，那就随便了。 $-\infty$ $\infty$ $-5$ $5$ $-\infty$ $\infty$

我们使用 R 进行此分析。代码可以在这里下载，输入数据在这里。数据文件包含两种不同病原体（A 和 B）的数据，我们分别对其进行分析（如代码所示）。

这是我们为病原体 B 制作的 ANOVA 设置：

mod1.lm <- lm(Seroprevalence_logit ~ Social.System + Social.System/Host.Species,
              data = prev_B)
print(mod1.anova <- anova(mod1.lm))

这引出了我的第一个问题：这是正确和适当的吗？需要考虑的因素：

我们希望将模型 II（随机效应）变量嵌套在模型 I（固定效应）变量中。
并非每个社会系统都嵌套了相同数量的宿主物种。
并非每个宿主物种都具有相同数量的受检种群。
并非每个检查的人群都有相同数量的个体（mydata.csv 中的 N_indiv 列）。我认为，这更像是一个权重问题，而不是更基本的问题。

我的下一个问题，也是这篇文章的主要问题，是：如何划分方差？这是我们的想法：

MS_A <- mod1.anova$"Mean Sq"[1]
MS_BinA <- mod1.anova$"Mean Sq"[2]
MS_resid <- mod1.anova$"Mean Sq"[3]
n <- length(unique(prev_A$Social.System))
r <- length(unique(prev_A$Host.Species))
VC_A <- (MS_A - MS_BinA)/(n*r)
VC_BinA <- (MS_BinA - MS_resid)/n
VC_resid <- MS_resid

不幸的是，使用我上面详述的 ANOVA 规范会导致悲伤。以下是病原体 B 的结果：

VC_A（即，Social.System）：- $-1.48$
VC_BinA（即宿主.物种）： $13.8$
VC_resid: $5.57$

研究使我相信这应该导致方差分量百分比分别为 0%、71.3% 和 28.7%。但是，这并不令人满意，原因有两个：

来自 ANOVA 的 Social.System 的 p 值为 ~，这表明它应该至少解释了一些观察到的方差。（Host.Species 的 p 值为 ~。） $0.025$ $3*10^{-5}$
我担心负方差分量可能是某事的危险信号。

请，您可以就这些问题中的任何一个提供任何帮助，我们将不胜感激。我完成了生物统计学的本科课程，所以我有一些背景，但我似乎无法弄清楚这些具体问题。提前致谢。

1个回答

希望您的朋友现在已经毕业，但如果没有，以下内容可能会有所帮助。

您在原始帖子Partitioning variance from logistic regression中走在了正确的轨道上，glmer()用于混合效应逻辑回归。

我建议反对：顾问的“解决方案”，使用 lm() 进行逻辑回归，并平均加权行（您应该按 N_indiv 加权）。

广义线性混合模型很难。 http://glmm.wikidot.com/faq有一些很好的信息 - 包括您需要多个级别的随机因子来估计其方差分量的事实。

我下面的代码需要 lme4 包和链接中的数据。

# Seroprevalance has been rounded, that's not OK
# to do logistic regression, (proportion * weight) must equal an integer
prev$seroexact <- round(prev$Seroprevalence * prev$N_indiv)/prev$N_indiv

# Host.Species is nested within Social.system, but you didn't reuse 
# species letters between Social.systems, so you can specify 
# Host.Species as a random effect without explicitly nesting it

# First random effect model
prev1.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species),
                  family=binomial(link="logit"), weights=N_indiv, data = prev)
summary(prev1.glmer)

## Fixed effects:
# Intercept is pathogen A and social.system A.  
# The z-test of the intercept is testing if the logit=0
# I.e. it's testing whether the combination of
# pathogen A and social.system A has prob=0.5.
# The other z-tests are testing whether other levels of the factors
# yield different probabilities than pathogen A and social.system A

## Random effects:
# This doesn't give you separate Host.Species and residual variances,
# Host.Species is treated as a random effect, so this model is the same as if
# you had summed the results of all studies with identical values of
# Host.Species, Pathogen, and Social.System. I.e. sum the results of the
# first 8 rows and create a single proportion and N_indiv, like so:

prevsum<-aggregate(cbind(N_indiv, prop=(seroexact*N_indiv)) ~ 
                   Social.System+Host.Species+Pathogen, data=prev, sum)
prevsum$prop<-prevsum$prop/prevsum$N_indiv

# which gives the same model:
prevsum.glmer = glmer(prop ~ Pathogen + Social.System + (1|Host.Species),
                      family=binomial(link="logit"), weights=N_indiv, data = prevsum)
summary(prevsum.glmer)

# So why are they broken up into multiple rows?  If each row represents
# one geographic area/time/litter/study/etc. then animals in one row
# might be more similar to eachother than they are to animals in
# another row that has the same values of Social, Species, & Pathogen.
# I think this is what the advisor wants as a "residual".

# To allow a random component for each row:
prev2<-cbind(resid=paste("Row_", row.names(prev), sep=""), prev)

prev2.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                   family=binomial(link="logit"), weights=N_indiv, data = prev2)
summary(prev2.glmer)

# This isn't a bad start, but I'm not comfortable with it because:
table(prev2[,2:3])

# Social.Sytstem D is only observed in Species F.
# This is called confounding, and it makes it hard to draw conclusions
# about Social Sytstem D.  How do you know what is caused by social
# system D and what is caused by species F?  If your friend really wants to
# make inferences about Social System D, she should collect data from
# another host species that uses Social System D.

# Leave out Soc_D:
prev3.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|Host.Species) + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev3.glmer)

# Even though Host Species is conceptually a random factor, you really need to observe
# more than 2 species per social system for a mixed model to accurately estimate
# the species variance.  As far as species variance is concerned, each species is a
# single sample (not animals or even litters), and you can't hope to estimate variance
# accurately with only two samples.

# We can fit the model with species as a fixed effect, but we don't have
# enough degrees of freedom to estimate all levels of Species:
prev4.glmer = glmer(seroexact ~ Pathogen + Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# Your friend doesn't need to estimate the level of each species in order to test
# whether species has any noticeable effect at all.  Unfortunately, we can't just
# Use the F statistic from anova() because calculating the denominator df for a
# GLMM is not straightforward.
anova(prev4.glmer) #Gives you an F statistic, but no denominator df or p-value

# Instead we fit a simpler model without Species:
prev5.glmer = glmer(seroexact ~ Pathogen + Social.System + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D",])

# And we'll compare the two models With a Likelihood-Ratio test using anova()
anova(prev5.glmer,prev4.glmer)

# With a p-value of 0.01331 we can say it's worth keeping Species in the model.

# Now let's check the pathogen * social system interaction:
prev6.glmer = glmer(seroexact ~ Pathogen * Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, nAGQ=2, 
                    data = prev2[prev2$Social.System != "Soc_D",])
summary(prev6.glmer) #Neither interaction term is significant
anova(prev6.glmer)
# We don't need a denominator df to know that the F statistic of 0.0774 for
# the interaction is insignificant.

# Since the interaction between Pathogen and Social System was not significant,
# we don't need to include the interaction term.  Similarly, I don't see a 
# statistical reason to  split the model into two separate 'pathogen specific'
# models, but maybe there's a scientific reason to do so:

# Separate tests for each pathogen:
prev7A.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev$Pathogen == "Path_A",])
summary(prev7A.glmer)
# Social System B looks different from Social System A in pathogen A prevalance:

# Calculate the odds of having Pathogen A for Social System A vs B
beta7A<-fixef(prev7A.glmer)
exp(-beta7A[2]) #negative sign means odds of A:B instead of B:A
# So animals with Social System A have about 25 times the odds of
# animals with social system B of having Pathogen A

# Test for Pathogen B:
prev7B.glmer = glmer(seroexact ~ Social.System + Host.Species + (1|resid),
                    family=binomial(link="logit"), weights=N_indiv, 
                    data = prev2[prev2$Social.System != "Soc_D" & prev$Pathogen == "Path_B",])
summary(prev7B.glmer)
# The only significant effects are species-specific, which are not of interest

# Let's return to prev4.glmer, which models both pathogens:
summary(prev4.glmer)

# The only significant fixed effect in prev4.glmer is Pathogen.
beta4<-fixef(prev4.glmer)

# For a randomly selected animal, the odds of having Pathogen B to having Pathogen A are:
exp(beta4[2])

# That's about as much as you can interpret with the data she has.

# To answer the Advisor's request for variance components:
# Residual variance is:
getME(prev4.glmer, "theta")^2

# You can't do a good job of estimating species variance with these data.
# If her advisor won't listen, then you can tell him that your estimate is:
getME(prev3.glmer, "theta")[2]^2
# But it's a really crappy estimate.

# There is no such thing as a variance component for Social System because
# it's a fixed effect.  But you can get its sum of squares:
anova(prev4.glmer)

其它你可能感兴趣的问题

上一篇残差与拟合图的异方差下一篇对为什么各种统计测试有效的直觉有多重要？