数据挖掘 - 多项式回归没有像预期的那样过度拟合 - 吾爱随笔录

多项式回归没有像预期的那样过度拟合

数据挖掘回归线性回归

2022-03-15 05:50:46

我正在尝试在 Python 中实现多项式回归并使其大部分工作，但对于更高次的多项式来说，它还不够过拟合。我知道，这是一件令人不安的奇怪事情，但这可能表明我的实现存在潜在问题。

在这种情况下，我使用了一个玩具示例，我在该区域中均匀采样 $[0,1]$ 并将输出值设置为 $\sin(2\pi x) + \mathcal{N}(0, 0.3)$ , 即只是用方差增加高斯噪声 $0.3$ .

我使用的是正规方程形式而不是梯度下降，因此解决方案应该是最佳最小二乘解决方案。根据拉格朗日插值定理，我们应该有一个 n 次多项式恰好适合 n + 1 个点。但是，对于 10 个样本和 9 次多项式（甚至更高次），我的实现并不完全适合每个点。

您会注意到，在我的代码中，我并没有准确地表示图表中 x 的所有值（即，我可能在万分之一之后关闭），但是如果那样的话，效果应该不会那么明显导致它。因为我正在对 x 值进行均匀采样 $[0, 1]$ ，我的错误区域内可能有两个重叠的 x 值，这可能会导致图表中的拟合不佳，但这对于低样本量来说是极不可能的（请参阅我自己的问题和此问题的解决方案）

    import numpy as np
    import scipy as sp
    import matplotlib.pyplot as plt
    import math
    import random
    plt.rcParams['figure.figsize'] = [10, 5]
    
    def sample(n_samples):
        ''' Samples uniformly from x axis and returns y values given by sin(2pi)
        with Gaussian noise'''
        
        x = np.random.uniform(0, 1, size=n_samples)
        y = np.sin(2*np.pi*x) + np.random.normal(0, .3, n_samples)
        
        return [x,y]
    
    def poly_regress(data, n_param):
        ''' Implementation of polynomial regression with n powers. Uses normal
        equation rather than gradient descent.
        
        data: 2 by n matrix with data[0] the x values and data[1] the y values.
        n_param: degree of returned polynomial
        '''
        
        coeff = np.zeros(n_param) # Coefficient vector initialization
        
        # Calculate the vandermonde matrix of input x
        vand_x = np.vander(data[0],N = n_param + 1, increasing=True)
        
        # Computes the coefficient vector via the normal equation
        coeff = np.linalg.pinv(vand_x.T @ vand_x) @ (vand_x.T @ data[1])
    
        # This is for plotting purposes. If we wanted to predict a particular x
        # or range of values, then we'd use an almost identical form.
        x = np.arange(0,1, .0001)
        y = 0
        for i in range(n_param + 1):
            y += coeff[i]*np.power(x, i)
            
        # plots the true value of x. This only works for this toy example.
        true_x = np.arange(0,1, .01)
        true_y = np.sin(2*np.pi*true_x)
        
        plt.plot(true_x, true_y, color='g', label= "True Curve")
        plt.plot(x, y, color='b', label="Prediction")
        plt.plot(data[0], data[1], 'r+', label="samples")
        
        plt.ylim(-2,2)
        plt.legend()
        plt.show()
        return [x,y]

任何帮助是极大的赞赏。这只是在扩展到更一般的回归形式之前的基本案例测试，所以我希望它在这一点上能够理想地运行。

立方案例：

高度无噪音的情况：

对于低次多项式具有类似的结果。

1个回答

好消息是您的代码在理论上似乎是正确的。

坏消息是numpy.linalg.pinv在拟合高阶多项式时存在数值稳定性问题。该问题是由 AT*A 矩阵的行列式递减引起的。

替代方法是使用numpy.linalg.lstsq倾向于提供稍微更好的结果的方法。

但是，根据我对您的样本数据的测试（如下所示），numpy.linalg.lstsq一旦点数超过 10，甚至无法始终准确地拟合这些点。

例如，只有 15 个点（和 14 次多项式），AT*A 的行列式减小到 ca 10E-150，使得矩阵几乎是奇异的。

在下图中，我绘制了行列式的 log10 作为 1000 次实验中 x 点之间平均距离的函数：模式非常清晰，平均距离越小，行列式越小。这似乎是合理的：x 点越接近，方程的线性相关性就越高。

然而，在实践中，我们经常对多项式的阶数明显低于观察次数的情况感兴趣。在这种情况下，结果在数值上更稳定。

import numpy as np
import matplotlib.pyplot as plt
import math
import random
plt.rcParams['figure.figsize'] = [10, 5]
#*************************************************************
# added random seed for reproducibility
np.random.seed(123456)
#*************************************************************
##############################################################
def sample(n_samples):
    x = np.random.uniform(0, 1, size=n_samples)
    y = np.sin(2*np.pi*x) + np.random.normal(0, 0.3, n_samples)
        
    return [x,y]
##############################################################
def poly_regress(data, n_param):
    #coeff = np.zeros(n_param) # Coefficient vector initialization
    # Calculate the vandermonde matrix of input x
    vand_x = np.vander(data[0],N = n_param + 1, increasing=True)
    # Computes the coefficient vector via the normal equation
    coeff_pinv = np.linalg.pinv(vand_x.T @ vand_x) @ (vand_x.T @ data[1])
    #*************************************************************
    # added alternative more precise approach 
    coeff_lstsq = np.linalg.lstsq(vand_x, data[1])[0]
    #*************************************************************
    pred_pinv = 0
    pred_lstsq = 0
    for i in range(n_param + 1):
       pred_pinv += coeff_pinv[i]*np.power(data[0], i)
       pred_lstsq += coeff_lstsq[i]*np.power(data[0], i)
    return np.array([
        np.average(np.diff(np.sort(data[0]))), np.linalg.det(vand_x.T @ vand_x),
        np.average(abs(pred_pinv - data[1])), np.average(abs(pred_lstsq - data[1]))])            
##############################################################
# evaluate results
for num_pts in [3, 5, 15]:
    iter = 1000
    res = np.zeros(iter*4)
    res.shape = (iter,4)
    poly_order = num_pts - 1
    for ii in range(iter):
        # this function returns:
        # [0] average distance between x points: np.average(np.diff(np.sort(data[0])))
        # [1] determinant of the A^T * A matrix: np.linalg.det(vand_x.T @ vand_x)
        # [2] MAE of linalg.pinv-based approach
        # [3] MAE of linalg.lstq-based approach
        res[ii] = poly_regress(sample(num_pts), poly_order)
    plt.plot(res.T[0], np.log10(abs(res.T[1])), 'x', color='tab:grey', label= "log10(abs(Det))")
    plt.xlabel("avg(diff(sort(data[0])))")
    plt.legend()
    plt.title("Num data points: " + str(num_pts) + ", poly degree: " + str(poly_order) + 
              ". Calculations repeated " + str(iter) + " times.")
    plt.show()
    plt.plot(res.T[0], (res.T[2]), 'rd', color='tab:red', label= "MAE linalg.pinv")
    plt.plot(res.T[0], (res.T[3]), 'rP', color='tab:blue', label= "MAE linalg.lstsq")
    plt.xlabel("avg(diff(sort(data[0])))")
    plt.legend()
    plt.title("Num data points: " + str(num_pts) + ", poly degree: " + str(poly_order) + 
              ". Calculations repeated " + str(iter) + " times.")
    plt.show()

其它你可能感兴趣的问题

上一篇为什么在不是神经网络的东西中使用优化器？下一篇Keras中的序列预处理和文本预处理有什么区别？