Problems fitting a distribution to heavy-tailed data

Cross Validated  r  distributions  fitting  fat-tails  l-moments
2022-03-18 21:04:17

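I am currently trying to fit a distribution to some heavy-tailed data sets (see the data set below) and am having a hard time producing good results: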

v1 <- c(11.87,14.04,11.86,179.88,13.09,14.68,13.54,32.25,54.52,66.28,15.16,39.83,89.81,116.87,298.94,427,6249.42,3334.9,4503.93,4933.9,15.72,12.33,13.64,15.12,47.86,12.3,12.79,12.34,14.44,12.17,120.17,12.82,13.07,47.48,13.72,12.19,30.48,129.16,191.41,282.96,1076.53,4354.01,7882.12,12.45,13.9,12.12,16.63,12.26,12.01,17.87,12.62,11.81,12.68,12.33,12.04,15.64,12,13.1,62.44,13.21,13.28,12.99,13.52,23.47,13.36,11.81,11.86,13.74,58.72,12.72,12.02,11.92,44.75,77.96,23.53,11.81,23.46,12.52,29.47,12.31,12.54,11.86,12.42,11.89,11.94,12.44,13.48,12.37,16.3,11.93,28.1,18.85,11.96,11.94,18.64,11.79,12.1,13.82,29.8,19.6,11.79,12.77,11.77,14.92,20.59,12.24,19.39,12.79,20.97,13.01,32.79,12.73,24.56,11.92,13.08,12.41,14.46,27.26,25.51,13.8,32.35,15.32,27.82,15.26,13.71,29.8,12.72,14.7,12.28,14.55,12.13,12.86,18.36,36.18,12.1,14.56,29.34,11.82,12,12.21,27.98,17.86,14.13,13.45,17.53,14,13.08,12.23,14.94,14.12,13.19,20.14,40.22,21.19,14.46,35.38,14.89,31.77,3057.58,12.86,12.15,20.52,132.7,17.5,31.34,15.27,13.45,15.32,24.75,12.45,13.45,29.4,13.82,13.48,83.84,100.4,12.78,31.58,16.59,12.51,59.34,66.09,232.43,12.74,12.21,73.53,109.48,13.53,17.65,87.09,18.93,12.43,12.32,15.55,14.1,12.15,12.43,11.82,12.87,12.28,140,240.49,12.76,25.97,13.6,18.32,117.4,242.05,13.94,111,161.53,247.33,13.51,15.49,65.64,14.27,35.17,11.83,30.21,29.14,12.53,11.76,14.08,49.06,212.09,258.35,13,13.74,29.08,60.23,12.16,142.66,202.93,74.79,12.88,27.48,48.91,64.79,49.25,224.59,299.4,29.24,68.88,15.12,34.8,23.68,43.55,12.4,17.61,18.1,15.15,11.92,14.17,13.45,14.51,44.46,14.24,34.15,258.84,12.75,12.77,34.44,13.12,22.49,53.46,14.96,13.75,11.77,16.2,12.52,12.19,17.69,34.83,13.25,12.39,29.59,56.69,82.38,12.13,27.69,15.12,50.21,68.42,16.84,14.96,11.81,15.53,168.74,797.01,52.84,67.02,15.83,167.27,240.05,12.03,48.64,30.45,28.81,54.1,17.73,33.99,19.93,37.21,35.3,122.36,44.94,15.2,26.46,217.48,257.06,14.69,13.22,55.8,26.95,55.05,16.71,44.58,20.71,14.24,41.69,58.3,108.43,137.71,13.89,19.53,46.72,22.45,36.93,20.72,17.39,15.32,28.83,
16.34,26.04,44.12,17.84,14.23,14.17,13.63,13.12,12.91,12.72,36.33,18.25,14.06,14.67,27.51,18.38,12.69,14.14,16.19,11.87,12.26,31.92,14.09,19.07,32.24,19.29,34.24,21.39,13.05,17.57,5651.61,6635.33,1666.81,6692.4,2161.37,15.63,37.85,61.85,68.92,252.31,16.45,28.21,57.45,93.8,70.53,178.19,239.22,270.67,419.6,11.93,11.88,14.38,51.44,54.91,81.9,112.63,3911.01,8625.72,9144.85,11.9,19.59,39.06,3153.42,8628.67,17.58,12.7,11.91,17.08,11.92,18.83,12.09,13.19,14.02,11.74,42.91,225.66,257.56,18.97,58.93,150.21,249.29,262.74,20.67,48.07,239.44,283.07,777.53,866.46,2570.59,5306.95,7773.85,8706.43,8730.16,21.43,86.28,12.22,103.45,120.04,197.63,502.12,580.07,19.02,18.98,12.3,13.49,50.26,76.13,14.69,44.07,73.74,180.95,13.37,15.37,58.62,60.12,228.92,251.56,268.03,11.77,16.83,50.36,63.02,107.3,234.99,261.7,18.09,58.17,75.96,220.08,250.4,16.36,14.1,61.36,140.59,278.06,417.68,797.12,1633.51,3911.59,3463.77,32.29,59.93,17.8,70.88,88.52,244.83,282.94,312.01,658.95,828.67,15.23)

I tried to fit a generalized extreme value (GEV) distribution and a skewed Student-t distribution to it:

library(fExtremes)  # gevFit(), rgev()
library(fGarch)     # sstdFit(), rsstd()

# generalized extreme value distribution
empFit <- gevFit(v1)
coefs <- slot(empFit, 'fit')$par.ests
qqplot(v1, rgev(1e4, xi = coefs['xi'], mu = coefs['mu'], beta = coefs['beta']),ylim=c(0,1e4))
abline(0,1)


# skewed-student t
empFit <- sstdFit(v1)
vpars <- empFit$estimate
qqplot(v1, rsstd(1e4, vpars[1], vpars[2], vpars[3], vpars[4]), ylim = c(0, 1e4))
abline(0,1)

The Q-Q plots do not look promising. The GEV distribution is completely off, and the Student-t does not capture the tail well.

I would greatly appreciate any help or comments on this, as I am not an expert in heavy-tailed distributions.

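I have already tried fitting a generalized Pareto distribution to the values above the 0.95 quantile and a different distribution to the rest. But I am not sure whether this is a good approach.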
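The threshold approach mentioned above can be sketched in base R roughly as follows. This is only a minimal sketch, not the fit actually used in this post: the helper names `gpd_nll` and `fit_gpd_tail`, the starting values for the optimizer, and the synthetic data are all assumptions made for illustration.

```r
# Negative log-likelihood of the generalized Pareto distribution (GPD)
# for exceedances y = x - u > 0, with shape xi and scale beta.
gpd_nll <- function(par, y) {
  xi <- par[1]
  beta <- par[2]
  if (beta <= 0) return(Inf)            # scale must be positive
  if (abs(xi) < 1e-8) {                 # xi -> 0: exponential limit
    return(length(y) * log(beta) + sum(y) / beta)
  }
  z <- 1 + xi * y / beta
  if (any(z <= 0)) return(Inf)          # outside the GPD support
  length(y) * log(beta) + (1 / xi + 1) * sum(log(z))
}

# Hypothetical helper: fit a GPD to the exceedances over the empirical p-quantile.
fit_gpd_tail <- function(x, p = 0.95) {
  u <- unname(quantile(x, p))
  y <- x[x > u] - u
  fit <- optim(c(0.1, mean(y)), gpd_nll, y = y)   # Nelder-Mead by default
  list(threshold = u, xi = fit$par[1], beta = fit$par[2],
       n_exceed = length(y), convergence = fit$convergence)
}

# Synthetic check (not the v1 data): exact GPD draws via inverse transform.
set.seed(42)
xi_true <- 0.5
beta_true <- 2
x_sim <- beta_true / xi_true * (runif(5000)^(-xi_true) - 1)
tail_fit <- fit_gpd_tail(x_sim, p = 0.5)
```

Checking `tail_fit$convergence` (0 means the optimizer reported success) and refitting at several thresholds is a cheap way to see whether the shape estimate is stable before trusting it on real data.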

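Another observation: I also tried some other distributions and packages, but often found that the optimizers used to fit the distributions had a hard time with the likelihood (which is why I also question the parameters obtained in the examples above).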

4 Answers

I applied the Tukey-Lambda PPCC to your data to obtain the following plot: [figure: Tukey-Lambda PPCC plot]

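You can see a local maximum at λ = 0.47, which means the tails are heavier than those of the normal distribution. The normal is at λ = 0.14 and the Cauchy at λ = -1. So, if you take this view, your data's tails are not as "fat" as the Cauchy's, but much fatter than the normal's.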

However, the global maximum is at λ = 33.17.

Here I plotted the Tukey lambda distribution for the two maxima above: [figure] [figure]

So you have two very different views of the distribution. One is fat-tailed, the other is very convex.

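I think the convex distribution fits better. The Cauchy and other fat-tailed distributions do not fit here; look for these odd-looking distributions like the one in my last plot.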

Below is what the Tukey lambda distribution looks like for that λ: [figure]

I use this tool to gauge the shape of my data.
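For readers who want to reproduce a PPCC scan like the one described in this answer, here is a minimal base-R sketch. It is not the code behind the plots above: the function names are hypothetical, the λ grid is an arbitrary choice, and the order-statistic medians use Filliben's approximation. Note that the Tukey lambda family is symmetric, so this scan only speaks to tail weight.

```r
# Quantile function of the (symmetric) Tukey lambda distribution.
tukey_lambda_q <- function(p, lambda) {
  if (abs(lambda) < 1e-10) {
    log(p / (1 - p))                       # lambda = 0: logistic
  } else {
    (p^lambda - (1 - p)^lambda) / lambda
  }
}

# Filliben's approximation to the medians of uniform order statistics.
filliben_medians <- function(n) {
  m <- (seq_len(n) - 0.3175) / (n + 0.365)
  m[n] <- 0.5^(1 / n)
  m[1] <- 1 - m[n]
  m
}

# PPCC: correlation between the sorted data and the theoretical
# quantiles, one value per candidate lambda.
ppcc_tukey <- function(x, lambdas) {
  xs <- sort(x)
  m <- filliben_medians(length(xs))
  sapply(lambdas, function(l) cor(xs, tukey_lambda_q(m, l)))
}

# Synthetic usage (normal data, not v1): scan a grid and pick the best lambda.
set.seed(1)
x_norm <- rnorm(2000)
lambdas <- seq(-1, 1, by = 0.01)
r <- ppcc_tukey(x_norm, lambdas)
best_lambda <- lambdas[which.max(r)]
# plot(lambdas, r, type = "l") would show the PPCC curve.
```

For the question's data one would replace `x_norm` with `v1` and widen the grid (for example up to λ = 40) to see both maxima mentioned above.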

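Your data set is really long-tailed. Consider running a first test on less extreme data (to check for this kind of behavior, e.g. via the sample kurtosis). L-moments are not suitable for very long-tailed data; for the Cauchy distribution, for example, not all L-moments exist.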

Bye, Stephen

One approach is to build an L-moment diagram from your data. There is an R package for this (with a complete manual) at http://cran.r-project.org/web/packages/lmom/index.html.

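An L-moment diagram plots the L-skewness and L-kurtosis of your data on a chart and shows where different distribution families fall on the same chart, so that you can try to pick a known distribution.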
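To make the two coordinates of such a diagram concrete, here is a minimal base-R sketch that computes the sample L-skewness (t_3) and L-kurtosis (t_4) from unbiased probability-weighted moments. The function name `sample_lmoments` is hypothetical and the data are synthetic. With the lmom package installed, `samlmu(x)` should return the same quantities and `lmrd(samlmu(x))` should draw the point on the diagram, but treat those calls as something to confirm against the package manual.

```r
# Sample L-moments from the unbiased probability-weighted moments b0..b3.
sample_lmoments <- function(x) {
  xs <- sort(x)
  n <- length(xs)
  i <- seq_len(n)
  b0 <- mean(xs)
  b1 <- sum((i - 1) / (n - 1) * xs) / n
  b2 <- sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * xs) / n
  b3 <- sum((i - 1) * (i - 2) * (i - 3) /
            ((n - 1) * (n - 2) * (n - 3)) * xs) / n
  l1 <- b0
  l2 <- 2 * b1 - b0
  l3 <- 6 * b2 - 6 * b1 + b0
  l4 <- 20 * b3 - 30 * b2 + 12 * b1 - b0
  c(l_1 = l1, l_2 = l2, t_3 = l3 / l2, t_4 = l4 / l2)
}

# Evenly spaced values: symmetric, so t_3 is zero up to rounding.
lm_even <- sample_lmoments(1:50)

# Exponential sample: theory gives t_3 = 1/3 and t_4 = 1/6.
set.seed(7)
lm_exp <- sample_lmoments(rexp(5000))
```

The pair (t_3, t_4) computed from the data is then compared with the curves or points that each candidate family traces in the diagram.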

I tried a mixture of some distributions, but I am still far from a satisfactory result.

The data set is the same as above (v1). The maximum likelihood estimator for the Pareto distribution is given by:

pareto.MLE <- function(X)
{
  n <- length(X)
  m <- min(X)                      # MLE of the scale (minimum) parameter
  a <- n / sum(log(X) - log(m))    # MLE of the shape parameter
  return(c(m, a))
}
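As a sanity check on the closed form: the Pareto log-likelihood is n·log(a) + n·a·log(m) − (a + 1)·Σ log(x_i), which increases in m up to min(x) and is maximized in a at n / Σ(log x_i − log m). The sketch below verifies this on simulated data. The names `pareto_mle` and `rpareto_inv` are hypothetical; the sampler is written by inverse transform so the check does not depend on any package's `rpareto()`.

```r
# Closed-form Pareto MLE, same formula as pareto.MLE in the post.
pareto_mle <- function(x) {
  m <- min(x)
  a <- length(x) / sum(log(x) - log(m))
  c(scale = m, shape = a)
}

# Hypothetical helper: inverse-transform sampler for a Pareto
# distribution with scale m and shape a, i.e. P(X > x) = (m / x)^a.
rpareto_inv <- function(n, m, a) m * runif(n)^(-1 / a)

# Simulate with known parameters and see whether they are recovered.
set.seed(123)
est <- pareto_mle(rpareto_inv(1e4, m = 12, a = 1.5))
```

If the estimator recovers the parameters on simulated data but the Q-Q plot on real data is still poor, the problem is more likely the model choice than the fitting step.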

Now I fit the upper tail of the data set separately:

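# rpareto() is not part of base R, and the post does not say which package
# supplies it. The (n, scale, shape) argument order used below matches
# VGAM::rpareto(); loading VGAM here is an assumption.
library(VGAM)

pquant <- .95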

idx <- which(v1>quantile(v1,pquant))
v1a <- v1[idx]
vpars1 <- pareto.MLE(v1a)

qqplot(v1a, rpareto(1e4, vpars1[1],vpars1[2]),ylim=c(0,1e4))
abline(0,1)

v1b <- v1[-idx]
vpars2 <- pareto.MLE(v1b)

qqplot(v1b, rpareto(1e4, vpars2[1],vpars2[2]))
abline(0,1)

v3 <- c(rpareto(1e5*.95, vpars2[1],vpars2[2]),rpareto(1e5*.05,vpars1[1],vpars1[2]))
qqplot(v1,v3,ylim=c(0,1e4))
abline(0,1)

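The Q-Q plots look poor. Moreover, whenever I fit the upper tail, the probability of absurdly high values comes out too large. The rest of the distribution is not fitted well either. I welcome any further suggestions on this!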

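Another question: what kind of "tail measure" is sensible for assessing goodness of fit in the tail? RMSE does not seem to be the right choice here. So far I can only judge qualitatively by inspecting Q-Q plots.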
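One way to turn the Q-Q inspection into a number, offered only as a sketch and not as an established answer from this thread, is to average the absolute log-ratio between empirical and model quantiles above some tail probability. Working on the log scale keeps a handful of huge values from dominating, which is the main objection to plain RMSE. The function name `tail_log_qq_error`, the cutoff `p_min`, and the plotting positions are all assumptions; the measure requires strictly positive data and quantiles.

```r
# Mean absolute log-ratio between empirical and model quantiles,
# restricted to plotting positions at or above p_min (hypothetical helper).
tail_log_qq_error <- function(x, qfun, p_min = 0.9) {
  n <- length(x)
  p <- (seq_len(n) - 0.5) / n           # plotting positions
  keep <- p >= p_min
  mean(abs(log(sort(x)[keep]) - log(qfun(p[keep]))))
}

# Synthetic illustration: lognormal data scored against the true model
# and against a badly matched exponential model.
set.seed(99)
x_ln <- rlnorm(5000, meanlog = 0, sdlog = 2)
err_right <- tail_log_qq_error(x_ln, function(p) qlnorm(p, 0, 2))
err_wrong <- tail_log_qq_error(x_ln, function(p) qexp(p, rate = 1 / mean(x_ln)))
```

A smaller value means the fitted quantiles track the empirical ones more closely in the tail; the number is only comparable across models scored on the same data and the same `p_min`.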

Best