Distribution of the ratio of sample means from two independent normal variables?

Cross Validated: distributions, normal-distribution, sampling

2022-03-28 23:07:41

Question

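We have a sample of size $N$, with mean $\bar{x}$ and SD $\bar{\sigma_x}$, from a random variable $X \sim \mathcal{N} (\mu, \sigma^2)$.

We have a sample of size $M$, with mean $\bar{y}$ and SD $\bar{\sigma_y}$, from a random variable $Y \sim \mathcal{N} (c\mu, c^2\sigma^2)$.

We want to find estimates of $\mu$ and $c$, along with the distributions of those estimates.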

My progress so far

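$\bar{x}$ is an obvious estimate of $\mu$, and we know that, once studentized, it follows a t distribution, but using this statistic alone ignores the information about $\mu$ contained in $\bar{y}$.
$\frac{\bar{y}}{\bar{x}}$ would give us an estimate of $c$, but what distribution would it have? $\frac{Y}{X}$ has a Cauchy distribution, but what is the analogous distribution when a ratio of sample means is used? To put it another way:

Normal : t-distribution :: Cauchy : ???

Once we have an estimate of $c$, we can divide $\bar{y}$ by that estimate to get another estimate of $\mu$, extracting the additional information about $\mu$ contained in $\bar{y}$. But what is the distribution of that estimate, and how do we combine it with our $\bar{x}$ estimate? Things seem to get confusing, because now we have a t-distributed random variable divided by our sample equivalent of a Cauchy random variable... Is there a more straightforward way to do this?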
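As a reference point for the question, the sampling distributions of the two means can be written out directly. This is only a restatement under the model given above (independent samples, known to be normal), not an answer to what the ratio's distribution is:

$$
\bar{x} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right), \qquad
\bar{y} \sim \mathcal{N}\!\left(c\mu, \frac{c^2\sigma^2}{M}\right), \qquad
\bar{x} \text{ and } \bar{y} \text{ independent.}
$$

So $\frac{\bar{y}}{\bar{x}}$ is a ratio of two independent normal variables with means $c\mu$ and $\mu$. A ratio of independent normals is exactly Cauchy only when both means are zero, so for $\mu \neq 0$ neither $\frac{Y}{X}$ nor $\frac{\bar{y}}{\bar{x}}$ is exactly Cauchy.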

2 Answers

This framework is a special case of the Cox model:

http://www.jstor.org/stable/2530661

It is studied here:

http://onlinelibrary.wiley.com/doi/10.1002/bimj.200310009/abstract

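If only you could divide Y by c, then all your data would come from $N(\mu, \sigma^2)$. This suggests an iterative approach to me. Estimate c, then use the pooled data to estimate $\mu$ and $\sigma^2$; then use those improved estimates to get a better estimate of c, and repeat until convergence. This sidesteps the question of the theoretically best estimator, but it may still be a useful approach.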
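One way to write this iteration down, read off from the R function given later in this answer. The superscript $(k)$ and the symbols $s_x^2$, $s_y^2$ (sample variances of $x$ and $y$) are notation introduced here for illustration:

$$
\hat\mu^{(0)} = \bar{x}, \qquad \hat\sigma^{2\,(0)} = s_x^2
$$

$$
\hat{c}^{(k)} = \frac{1}{2}\left( \frac{\bar{y}}{\hat\mu^{(k-1)}} + \sqrt{\frac{s_y^2}{\hat\sigma^{2\,(k-1)}}} \right)
$$

$$
\hat\mu^{(k)} = \frac{1}{N+M}\left( \sum_{i=1}^{N} x_i + \sum_{j=1}^{M} \frac{y_j}{\hat{c}^{(k)}} \right)
$$

with $\hat\sigma^{2\,(k)}$ taken as the sample variance of the same pooled values $x_1, \dots, x_N, y_1/\hat{c}^{(k)}, \dots, y_M/\hat{c}^{(k)}$. The estimate of $c$ averages a mean-based ratio and an SD-based ratio, since under the model both $E[Y]/\mu$ and $\mathrm{SD}(Y)/\sigma$ equal $c$ (for $c > 0$).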

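You can use simulation based on your model (if you are confident in it) to work out the approximate distribution of any estimator, or combination of estimators, that you choose.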

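I would then use the bootstrap to estimate the variance of your estimates of $\mu$ and $c$. This has the advantage of not depending on the distributional assumptions of the model.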

It is easier for me to illustrate this general approach than to try to explain it:

###
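# Create a function that does the iterative thing
RatioEst <- function(x,y, verbose=FALSE){
    # Start from the x-only estimates of mu and sigma^2
    mu_latest <- mean(x)
    sigma2_latest <- var(x)
    # Fixed number of passes (5) rather than a convergence test
    for (i in 1:5){
        # c: average of a mean-based and an SD-based estimate
        c_latest <- mean(c(
            mean(y / mu_latest),
            sqrt(var(y)/sigma2_latest)))
        # Re-estimate mu and sigma^2 from x pooled with the rescaled y
        mu_latest <- mean(c(x, y/c_latest))
        sigma2_latest <- var(c(x, y/c_latest))
        if(verbose){print(c(mu_latest, c_latest, sigma2_latest))}
    }
    return(c(mu_latest, c_latest))
}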
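The function above always runs five passes, while the description says to repeat until convergence. A minimal sketch of a variant that stops on a tolerance might look like this. The name `RatioEstConv` and the arguments `tol` and `max_iter` are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical variant of RatioEst: iterate until the estimates stabilise.
# tol and max_iter are illustrative defaults, not values from the answer.
RatioEstConv <- function(x, y, tol = 1e-8, max_iter = 1000) {
    mu_latest <- mean(x)
    sigma2_latest <- var(x)
    c_latest <- Inf          # no previous c yet, so the first pass never "converges"
    converged <- FALSE
    for (i in seq_len(max_iter)) {
        # Same update as RatioEst: average of a mean-based and an SD-based estimate of c
        c_new <- mean(c(mean(y) / mu_latest, sqrt(var(y) / sigma2_latest)))
        pooled <- c(x, y / c_new)
        mu_new <- mean(pooled)
        # Stop when both mu and c change by less than tol between passes
        converged <- abs(mu_new - mu_latest) < tol && abs(c_new - c_latest) < tol
        mu_latest <- mu_new
        sigma2_latest <- var(pooled)
        c_latest <- c_new
        if (converged) break
    }
    list(mu = mu_latest, c = c_latest, iterations = i, converged = converged)
}

# Example call on simulated data with true mu = 30 and c = 2
set.seed(123)
x <- rnorm(20, 30, 5)
y <- rnorm(30, 60, 10)
est <- RatioEstConv(x, y)
print(est)
```

Returning a list with a `converged` flag makes it possible to spot bootstrap resamples or simulated data sets where the iteration hit `max_iter` without settling.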

#### Simulation to get an idea of the distribution of estimates.
# Simulate data many times and see the results of our estimation technique. 
# True values of mu and c are 30 and 2

reps <- 10000
results <- matrix(0, nrow=reps, ncol=2)

for (i in 1:reps){
    x <- rnorm(20,30,5)
    y <- rnorm(30,60,10)
    results[i,] <- RatioEst(x,y, verbose=FALSE)
}

summary(results)

par(mfrow=c(1,2))
plot(density(results[,1]), bty="l", main="Simulated estimates of mu",
    xlab="True value=30")
plot(density(results[,2]), bty="l", main="Simulated estimates of c",
    xlab="True value=2")

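This gives the results below, which show that my chosen estimator is biased (upward for mu; downward for c), although the median of the repeated estimates is very good.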
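To put a number on that bias, the simulated estimates can be compared with the true values, together with a Monte Carlo standard error. The helper `mc_bias` below is a hypothetical name introduced here, and the demo uses a deliberately simple stand-in estimator ($\bar{x}$ and $\bar{y}/\bar{x}$) so that the block runs on its own; it is a sketch, not the answer's estimator.

```r
# Hypothetical helper: summarise Monte Carlo bias of a matrix of estimates.
# estimates: one row per replication, one column per parameter.
# truth:     true parameter values, in column order.
mc_bias <- function(estimates, truth) {
    stopifnot(ncol(estimates) == length(truth))
    rbind(
        mean_bias   = colMeans(estimates) - truth,
        mc_se       = apply(estimates, 2, sd) / sqrt(nrow(estimates)),
        median_bias = apply(estimates, 2, median) - truth
    )
}

# Self-contained demo with a simple stand-in estimator (not RatioEst)
set.seed(1)
demo <- t(replicate(2000, {
    x <- rnorm(20, 30, 5)
    y <- rnorm(30, 60, 10)
    c(mean(x), mean(y) / mean(x))
}))
colnames(demo) <- c("mu", "c")
bias_table <- mc_bias(demo, truth = c(30, 2))
print(bias_table)
```

With the `results` matrix from the simulation above, the equivalent call would be `mc_bias(results, truth = c(30, 2))`. A mean bias that is several times larger than its Monte Carlo standard error is unlikely to be simulation noise.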

       mu              c        
 Min.   :24.43   Min.   :0.5937  
 1st Qu.:28.85   1st Qu.:1.8256  
 Median :30.01   Median :2.0072  
 Mean   :31.21   Mean   :1.9340  
 3rd Qu.:31.87   3rd Qu.:2.1284  
 Max.   :73.57   Max.   :2.6688

[Figure: density plots of the simulated estimates of mu (true value 30) and c (true value 2)]

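So that was a simulation to show the properties of my chosen estimator (you will see that it includes a slightly odd estimate of c, which is the average of two estimates). If you go with this approach, here is how you would do the actual estimation: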

#### Actual estimation
set.seed(123)
x <- rnorm(20,30,5)
y <- rnorm(30,60,10)

# point estimates
RatioEst(x, y, verbose=TRUE)

This gives these results (including a trace showing how the iteration proceeds):

[1] 31.12087  1.89926 22.66501
[1] 31.050508  1.906381 22.529121
[1] 31.001155  1.911407 22.438041
[1] 30.967360  1.914864 22.377693
[1] 30.944615  1.917198 22.337999
[1] 30.944615  1.917198

To get confidence intervals, use the bootstrap:

# bootstrap
# Simulate data *once* and then resample from it many times.
# Has the advantage that will work even if original specification
# of distribution is incorrect
reps <- 699
boot.results <- matrix(0, nrow=reps, ncol=2)
for (i in 1:reps){
    boot.results[i,] <- RatioEst(
        x=sample(x, replace=TRUE), 
        y=sample(y, replace=TRUE))
}
summary(boot.results)
apply(boot.results, 2, quantile, probs=c(0.025, 0.975))

This gives the following results for the (asymmetric) 95% confidence intervals:

            mu        c
2.5%  28.02008 1.109987
97.5% 44.38868 2.236229
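The same resampling can also give bootstrap standard errors, which is the variance estimate mentioned earlier in this answer. The sketch below wraps both in one function. `boot_two_sample` and its arguments are hypothetical names, and the demo again uses a simple stand-in estimator so that the block is self-contained.

```r
# Hypothetical helper: bootstrap standard errors and percentile intervals
# for any two-sample estimator(x, y) that returns a numeric vector.
boot_two_sample <- function(x, y, estimator, reps = 999, level = 0.95) {
    # Resample x and y separately, with replacement, as in the answer's loop
    boot <- do.call(rbind, lapply(seq_len(reps), function(i) {
        estimator(sample(x, replace = TRUE), sample(y, replace = TRUE))
    }))
    alpha <- (1 - level) / 2
    list(
        se = apply(boot, 2, sd),
        ci = apply(boot, 2, quantile, probs = c(alpha, 1 - alpha))
    )
}

# Self-contained demo with a simple stand-in estimator (not RatioEst)
simple_est <- function(x, y) c(mu = mean(x), c = mean(y) / mean(x))

set.seed(123)
x <- rnorm(20, 30, 5)
y <- rnorm(30, 60, 10)
boot_out <- boot_two_sample(x, y, simple_est)
print(boot_out)
```

Passing the estimator as an argument means the answer's `RatioEst` could be used in place of `simple_est` without changing the helper.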
