
Generating a distribution from descriptive statistics

Cross Validated · distributions · normal distribution · descriptive statistics

2022-03-31 05:31:17

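I have a hidden variable that I would like to approximate. I know several summary statistics: minimum, maximum, mean, median, standard deviation, and n; and I know the variable is approximately normal.

Given the mean and standard deviation I could obviously just use a plain normal distribution, but I know the variable is slightly skewed and has bounded tails. My approximation will clearly not be perfect, but I would like it to be closer than that. I will probably implement this in R, but platform-independent suggestions are appreciated.

Example: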

Xbar <- 101.73
Xmedian <- 97.7
Xs <- 20.45  # standard deviation
Xmin <- 50
Xmax <- 160
n <- 148

3 Answers

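You have to specify a model. Summary statistics alone do not let you estimate a model or produce a distribution function. If you had the data, the most you could do without a model is a nonparametric estimate, such as the bootstrap or density estimation. Without the actual data you cannot carry out any nonparametric procedure, so you have to specify a parametric model. Since you have sample moments, I suggest you choose a model and estimate it by the method of moments. If all you know is that the variable is approximately normal, just use a normal distribution, because you have no reason to use anything else.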
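As a minimal sketch of the method-of-moments route, suppose one picks a beta distribution rescaled to the interval [50, 160]. Both the choice of the beta family and the use of the sample minimum and maximum as the true bounds are assumptions made purely for illustration; nothing in the summary statistics establishes them. The names `shape1`, `shape2`, `fit_mean`, `fit_sd`, and `fit_median` are hypothetical.

```r
# Summary statistics from the question
xbar <- 101.73; xs <- 20.45; xmin <- 50; xmax <- 160; n <- 148

# ASSUMPTION: a beta distribution rescaled to [xmin, xmax], with the
# sample extremes treated as the true bounds of the variable.
mu <- (xbar - xmin) / (xmax - xmin)   # mean on the unit interval
v  <- (xs / (xmax - xmin))^2          # variance on the unit interval

# Method-of-moments estimates of the two beta shape parameters
k      <- mu * (1 - mu) / v - 1
shape1 <- mu * k
shape2 <- (1 - mu) * k

# Moments implied by the fitted model, back on the original scale
s <- shape1 + shape2
fit_mean   <- xmin + (xmax - xmin) * shape1 / s
fit_sd     <- (xmax - xmin) * sqrt(shape1 * shape2 / (s^2 * (s + 1)))
fit_median <- xmin + (xmax - xmin) * qbeta(0.5, shape1, shape2)

# Draw a sample of the same size from the fitted model
set.seed(1)
draws <- xmin + (xmax - xmin) * rbeta(n, shape1, shape2)
```

The fit reproduces the mean and standard deviation by construction. The median is not used in the fit, so comparing `fit_median` with the reported 97.7 gives a rough sense of how well this particular model captures the skew.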

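If you just want a distribution that looks roughly normal and satisfies your descriptive statistics, here is one possible approach. Start with a sample of 148 numbers from a normal distribution, then apply a series of transformations so that it (approximately) matches the descriptive statistics. Of course, many distributions would satisfy these requirements...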

# function for descriptive stats
stats = function(x)  c(min(x),max(x),median(x),mean(x),sd(x))

# simple power transformation (hold min and max constant)
pow = function(x,lam) {
   t = (x-min(x))^lam
   (t/max(t))*(max(x)-min(x))+min(x)
}

# power transform of upper and lower halves of data (hold min,max,median constant)
# note: x must be sorted in increasing order and have even length
pow2 = function(par, x) {
    m = median(x)
    h = length(x) / 2
    t1 = pow(m-x[1:h], par[1])
    t2 = pow(x[(h+1):length(x)]-m, par[2])
    c(m-t1, t2+m)
}


# transformation to fit minimum and maximum (50 and 160, so the range is 110)
t1 = function(x) {
   ((x-min(x))/diff(range(x)) *110) + 50
}

# optimise power transformation to match the median
t2 = function(x) {
   l = optimise(function(l) { (median(pow(x,l))-97.7)^2 }, c(-5,5))$minimum
   pow(x,l)
}

# optimise power transformation of upper and lower halves to fit mean and sd
t3 = function(x) {
    l2 = optim(c(1,1), function(par) { 
       r = pow2(par,x); (mean(r)-101.73)^2 + (sd(r)-20.45)^2 })$par
    pow2(l2, x)
}

d = t1(sort(rnorm(148)))
stats(d)
d = t2(d)
stats(d)
d = t3(d)
stats(d) # result should match your descriptive stats
hist(d)  # looks normal-ish

# repeat and plot many distributions that satisfy requirements
plot(d,cumsum(d), type="l")
for(n in 1:500) { 
   d = t3(t2(t1(sort(rnorm(148)))))
   lines(d,cumsum(d), col=rgb(1,0,0,0.05))
}
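To see how close a generated sample comes to the targets, a small comparison table can help. The sketch below uses a hypothetical helper, `compare_stats`, which is not part of the code above. So that it runs on its own, it uses a stand-in sample (a normal sample rescaled to [50, 160], the same idea as `t1`); in practice you would pass in the final `d`.

```r
# Target values taken from the question
targets <- c(min = 50, max = 160, median = 97.7, mean = 101.73, sd = 20.45)

# Hypothetical helper: tabulate target vs. achieved descriptive statistics
compare_stats <- function(x, targets) {
  achieved <- c(min = min(x), max = max(x), median = median(x),
                mean = mean(x), sd = sd(x))
  rbind(target = targets, achieved = achieved, difference = achieved - targets)
}

# Stand-in sample so the snippet is self-contained:
# a sorted normal sample rescaled to the range [50, 160]
set.seed(1)
z <- sort(rnorm(148))
x <- (z - min(z)) / diff(range(z)) * 110 + 50

result <- compare_stats(x, targets)
print(round(result, 2))
```

For the stand-in sample only the minimum and maximum match; the median, mean, and standard deviation rows show what the later transformation steps still have to correct.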

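You could use a mixture of normals. Choose the smallest number of components that gets you close enough to the distribution you have in mind. What counts as "close enough" is a matter for your judgment. Here is an example.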

# Parameters of the mixture
p1 = 0.6
m1 = 95
s1 = 6
m2 = 103
s2 = 26

# Number of obs.
n = 148

# Draw the component indicators
set.seed(31337)
mix_indicator = rep(1,n)
mix_indicator[which(runif(n) > p1)] = 2

# Draw the normals
draws = rnorm(n)*s1 + m1
draws[which(mix_indicator==2)] = rnorm(sum(mix_indicator==2))*s2 + m2

print(mean(draws))    # 100.9
print(median(draws))  # 97.1
print(sqrt(var(draws)))  # 18.4
print(min(draws))     # 49
print(max(draws))     # 175
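The printed values are for one particular seed. As a cross-check, the population mean and standard deviation of a two-component normal mixture follow directly from its parameters, and the median can be found numerically from the mixture CDF. The sketch below applies the standard mixture-moment formulas to the parameters above; the names `mix_mean`, `mix_var`, `mix_sd`, `mix_cdf`, and `mix_median` are hypothetical, and the search interval of 50 to 160 is an assumption taken from the question's range.

```r
# Mixture parameters from the example above
p1 <- 0.6; m1 <- 95;  s1 <- 6
p2 <- 1 - p1; m2 <- 103; s2 <- 26

# Population mean: weighted average of the component means
mix_mean <- p1 * m1 + p2 * m2                      # 98.2

# Population variance: within-component part plus between-component part
mix_var <- p1 * s1^2 + p2 * s2^2 + p1 * p2 * (m1 - m2)^2
mix_sd  <- sqrt(mix_var)                           # about 17.53

# Mixture CDF; the median is the point where it crosses 0.5
mix_cdf <- function(x) p1 * pnorm(x, m1, s1) + p2 * pnorm(x, m2, s2)
mix_median <- uniroot(function(x) mix_cdf(x) - 0.5, interval = c(50, 160))$root

print(c(mean = mix_mean, sd = mix_sd, median = mix_median))
```

Comparing these population values with the targets (101.73, 20.45, 97.7) suggests which parameters to adjust, for example raising `m2` or `s2`, before drawing samples.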
