
Sampling from a custom continuous distribution in R

Cross Validated: r, distributions, sampling, continuous-data

2022-04-03 05:40:19

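I have a probability density function in R and I want to draw a sample from it. How do I do that?

My current solution (and the one Google keeps giving me) is to evaluate the function over a dense set of values (x), which gives the associated probabilities (px), and then use sample(x, size=1, prob=px). Since I draw thousands of samples this way, it is computationally heavy, and the simulated distribution is discrete even though it should be continuous.

Specifically, I am writing my own Gibbs sampler to infer which means and standard deviations could have produced a vector of observations (Bayesian inference). I sample from a distribution formed by likelihood * prior. Here is a minimal example of the loop version of what I am currently doing: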

# Grid of values to assign posterior probabilities to; 1000 points are used to approximate continuity.
mu.x = seq(from=0.001, to=20, length.out=1000)

# Likelihood of mu, called for varying mu with sigma and the data D held fixed.
likelihoodMu = function(mu, sigma, D) ((2 * pi * sigma^2) ^ (-length(D) / 2)) * exp(-1 / (2 * sigma^2) * sum((D - mu)^2))

# Will collect the samples (one slot per draw)
mySamples = rep(NA, 5000)

# Draw 5000 samples
for(i in 1:5000) {
   # Loop over mu.x and get the likelihood for each value
   mu.likelihoodDistribution = sapply(mu.x, likelihoodMu, sigma=2, D=c(1,2,3))

   # Draw one sample from the discretized likelihood and store it
   mySamples[i] = sample(mu.x, size=1, prob=mu.likelihoodDistribution)
}

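I am looking for a way to sample directly from the likelihood function in R, rather than through the discrete and computationally heavy seq-sapply-sample approximation. In the example above, something like sample('mu', FUN=likelihoodMu, sigma=2, D=c(1,2,3), size=1) would be nice. Ideally it would be general, since I sample from several different kinds of distributions.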

2 Answers

If you want to sample from a given pdf, you can use rejection sampling or inverse transform sampling.

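Rejection sampling only needs the density function and an upper bound that is at least as large as the maximum of the density function. The downside is that it can end up being a very inefficient way to sample, depending on the shape of the density.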
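As a minimal sketch of the idea just described, the following wraps rejection sampling in a reusable function and applies it to the likelihood from the question. The helper name `rejection_sample`, its arguments, and the uniform proposal on a bounded interval are assumptions made for illustration; they are not part of base R or of the original answer.

```r
# Hypothetical helper: rejection sampling from an unnormalised density `f`
# on the bounded interval [lower, upper], using a uniform proposal.
# `fmax` must be >= max(f) on that interval (the caller has to guarantee this).
rejection_sample <- function(n, f, lower, upper, fmax, batch = 1000) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(batch, min = lower, max = upper)  # proposed values
    u <- runif(batch, min = 0, max = fmax)       # uniform heights below the bound
    out <- c(out, x[u <= f(x)])                  # keep proposals that fall under f
  }
  out[seq_len(n)]
}

# Likelihood of mu from the question, vectorised over mu
likelihoodMu <- function(mu, sigma, D) {
  sapply(mu, function(m) {
    (2 * pi * sigma^2)^(-length(D) / 2) * exp(-sum((D - m)^2) / (2 * sigma^2))
  })
}

set.seed(1)
D <- c(1, 2, 3)
f <- function(mu) likelihoodMu(mu, sigma = 2, D = D)

# The Gaussian likelihood peaks at mean(D), so f(mean(D)) is a valid upper bound
samples <- rejection_sample(5000, f, lower = 0.001, upper = 20, fmax = f(mean(D)))
```

The draws are continuous rather than restricted to a grid, but they are still limited to the interval [0.001, 20] used in the question, and most proposals are rejected because the likelihood is concentrated in a small part of that interval.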

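If the inverse of the distribution function is known, inverse transform sampling is the preferred way. This is the case for the likelihood in your example, because it is Gaussian and the associated quantile function (= inverse distribution function) is available in R. In general, inverse transform sampling works by drawing from the uniform distribution on the interval [0, 1] and using the obtained values as arguments to the quantile function. The resulting values then follow the specified probability distribution.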
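A short sketch of inverse transform sampling, first with R's built-in Gaussian quantile function `qnorm` and then with a distribution whose inverse CDF can be written by hand. The parameter values and variable names are illustrative assumptions only.

```r
set.seed(1)
n <- 5000
u <- runif(n)   # uniform draws on [0, 1]

# Gaussian case: qnorm() is the quantile function that ships with R
# (mean and sd here are illustrative values)
x_norm <- qnorm(u, mean = 2, sd = 2 / sqrt(3))

# Case with a closed-form inverse: exponential with rate `lambda`.
# CDF: F(x) = 1 - exp(-lambda * x), so F^{-1}(u) = -log(1 - u) / lambda
lambda <- 0.5
x_exp <- -log(1 - u) / lambda
```

The hand-written inverse for the exponential case should agree with R's own `qexp(u, rate = lambda)`, which is a quick way to check such a derivation.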

To elaborate on the example: since the likelihood is Gaussian, it is maximized by setting $\mu$ to the arithmetic mean of the $D$ values,

$\mu = \frac{1}{n}\sum_{i=1}^{n} D_i$

from which the variance can be computed as

$\text{Var}(\mu)=\frac{1}{n^2}\sum_{i=1}^{n} \text{Var}(D_i)=\frac{\sigma^2}{n}$

So the $\mu$ values follow a Gaussian distribution $N(\mu,\sigma/\sqrt{n})$, where the second parameter is the standard deviation. Instead of implementing inverse transform sampling yourself, you can also use the rnorm function.
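Applied to the data in the question (D = c(1, 2, 3), sigma = 2), this could look like the sketch below. It relies on the identity $\sum_i (D_i-\mu)^2 = \sum_i (D_i-\bar D)^2 + n(\mu-\bar D)^2$, which shows that the likelihood, viewed as a function of $\mu$, is proportional to a Gaussian density centered at $\bar D$ with standard deviation $\sigma/\sqrt{n}$. The variable names are assumptions for illustration.

```r
set.seed(1)
D <- c(1, 2, 3)   # observations from the question
sigma <- 2        # fixed standard deviation from the question
n <- length(D)

# Direct, continuous draws: mean(D) = 2, sd = sigma / sqrt(n)
direct <- rnorm(5000, mean = mean(D), sd = sigma / sqrt(n))

# The same distribution via inverse transform sampling, for comparison
inverse <- qnorm(runif(5000), mean = mean(D), sd = sigma / sqrt(n))

c(mean(direct), sd(direct))   # should be close to 2 and 1.155
```

Unlike the grid-based approach, these draws are not restricted to the interval [0.001, 20], so the two will differ slightly in the lower tail.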

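There are a few posts about rejection sampling on the internet, but I found this one the most helpful. My example is taken from there, slightly modified. If you need to speed things up, you can use the data.table package, which can make you dizzy from the speed gain. I did not bother, because this particular implementation runs instantly for me.

Here is my implementation for a function defined on $x \geq 0$.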

kindaSortaLikeAcdfWeibullButNotReally <- function(x, sigma, b, mx) {
  exp(-(x/sigma)^(-b)) * mx
}

xrange <- 400 # function range from 0 (implicit) to x
N <- 100000 # number of samples
b <- -2.16
mx <- 35.48
sigma <-  147.17

xy <- data.frame(proposed = runif(N, min = 0, max = xrange))

xy$fit <- kindaSortaLikeAcdfWeibullButNotReally(x = xy$proposed, 
                                                sigma = sigma, b = b, mx = mx)
xy$random <- runif(N, min = 0, max = 1)

maxDens <- max(xy$fit)

xy$accepted <- with(xy, random <= fit/maxDens)
# retain only those values which are "below" the custom distribution
xy <- xy[xy$accepted, ]

hist(xy$proposed, freq = FALSE, breaks = 100, col = "light grey")
# divide by maxDens * 130 so the curve overlays the histogram;
# 130 is roughly the area under fit/maxDens on [0, 400]
curve(kindaSortaLikeAcdfWeibullButNotReally(x, sigma = sigma, b = b, mx = mx)/(maxDens * 130),
      from = 0, to = 400, add = TRUE, col = "red", lwd = 2)

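Here is a plot showing how the algorithm works. You evaluate the distribution (fit, black dots), throw a bunch of values into a rectangle around the distribution (column random), and check for each one whether it lies above or below the fit.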

# rerun the example above with the following changes
mx <- 1
xrange <- 300
# xy <- xy[xy$accepted, ] # skip this step - you'll see why if you don't

library(ggplot2)

xys <- xy[order(xy$proposed), ]
xys <- xys[seq(1, nrow(xys), by = 11), ]

ggplot(xy, aes(x = proposed, y = fit/maxDens)) +
  theme_bw() +
  scale_color_brewer(palette = "Set1") +
  geom_line(alpha = 0.5) +
  geom_point(data = xys, aes(y = random, color = accepted), alpha = 0.5) +
  geom_point(data = xys, aes(x = proposed, y = fit/maxDens), alpha = 0.5) +
  geom_segment(data = xys, aes(x = proposed, y = random, xend = proposed, yend = fit/maxDens), alpha = 0.3)
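The accept/reject picture also explains the efficiency of the method: the share of proposals that are kept should be close to the area under the curve divided by the area of the bounding rectangle. A self-contained sketch of that check follows; the short function name `f` and the use of `integrate` are assumptions added for illustration, while the parameter values are those of the example above.

```r
# Unnormalised density-like function from the example above
f <- function(x, sigma = 147.17, b = -2.16, mx = 35.48) {
  exp(-(x / sigma)^(-b)) * mx
}

set.seed(1)
xrange <- 400
N <- 100000
proposed <- runif(N, min = 0, max = xrange)
fit <- f(proposed)
maxDens <- max(fit)
accepted <- runif(N) <= fit / maxDens

# Observed share of proposals that were kept
observed_rate <- mean(accepted)

# Expected share: area under the curve / area of the bounding rectangle
area <- integrate(f, lower = 0, upper = xrange)$value
expected_rate <- area / (xrange * maxDens)

c(observed = observed_rate, expected = expected_rate)
```

If the two numbers differ noticeably, the upper bound is probably too small, or the proposal range cuts off part of the distribution.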
