How to improve a regression model with high training performance and low test performance

data-mining machine-learning scikit-learn regression machine-learning-model ridge-regression
2022-03-12 23:58:15

I am running a regression analysis on some data. I keep getting very high training scores and low test scores. My code is below; what can I do to improve it? Thanks in advance.

# coding: utf-8

# In[1]:

#Importing modules
import sys
import math 
import itertools
import numpy as np
import pandas as pd
from numpy import genfromtxt
from matplotlib import style
import matplotlib.pyplot as plt
from sklearn import linear_model
from matplotlib import style, figure
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


# In[2]:


#Importing data
df = np.genfromtxt('/Users/Studies/Machine_learning/reactivity/main_us.csv', delimiter=',')
#To skip the header row, add skip_header=1


# In[3]:


X = df[0:,1:306]
y = df[0:,0]


# In[4]:


print(X.shape)
print(y.shape)
display(X)
display(y)
print(y)


# In[5]:


X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=4)


# In[6]:


#Apply StandardScaler for feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(len(X_test), len(y_test))


# In[7]:


#Applying PCA for dimensionality reduction

from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

#Checking shape after PCA
print("Checking shape after PCA")
print (X_train.shape)
print (X_test.shape)


#Variance/Values
print("Explained_variance_ratio")
print(pca.explained_variance_ratio_)
print("Singular_values")
print(pca.singular_values_)


#Plotting
print ("Graph")
plt.scatter (X_train[:,0], X_train[:,1], c=y_train, edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('rainbow',6))
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.colorbar();

print('You are looking at high-dimensional data explained by 2 components')
print('Even though these components hold some information, it is not enough to separate the components apart')


print(pca.explained_variance_ratio_)
print(pca.singular_values_)

#Checking shapes after PCA
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)


# In[8]:


alphas = 10**np.linspace(10,-2,100)*0.5
alphas


# In[9]:


from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge, Lasso

for Model in [Ridge, Lasso]:
    model = Model()
    print('%s: %s' % (Model.__name__,
                      cross_val_score(model, X, y).mean()))

# Out[9]:

Ridge: -1.3841312374053019
Lasso: -1.164517926682712

# In[10]:


import numpy as np
from matplotlib import pyplot as plt

alphas = np.logspace(-3, -1, 30)

plt.figure(figsize=(5, 3))

for Model in [Lasso, Ridge]:
    scores = [cross_val_score(Model(alpha), X, y, cv=3).mean()
            for alpha in alphas]
    plt.plot(alphas, scores, label=Model.__name__)

plt.legend(loc='lower left')
plt.xlabel('alpha')
plt.ylabel('cross validation score')
plt.tight_layout()
plt.show()


# In[11]:


# alpha = 0.1
model = Ridge(alpha=0.1)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

# alpha = 0.01
model1 = Ridge(alpha=0.01)
model1.fit(X_train, y_train)
print(model1.score(X_train, y_train))
print(model1.score(X_test, y_test))

# alpha = 0.001
model2 = Ridge(alpha=0.001)
model2.fit(X_train, y_train)
print(model2.score(X_train, y_train))
print(model2.score(X_test, y_test))

# alpha = 0.0001
model3 = Ridge(alpha=0.0001)
model3.fit(X_train, y_train)
print(model3.score(X_train, y_train))
print(model3.score(X_test, y_test))

# Out[11]:

0.9999996833724945
-0.4120322763917558
0.9999996833724945
-0.4120322763917558
0.9999996833724945
-0.4120322763917558
0.9999996833724945
-0.4120322763917558


# In[12]:


modelCV = RidgeCV(alphas=[0.1, 0.01, 0.001, 0.0001], store_cv_values=True)
modelCV.fit(X_train, y_train)
modelCV.alpha_  # giving 0.1
print(modelCV.score(X_train, y_train))  # the same score as ridge regression with alpha = 0.1
print(modelCV.score(X_test, y_test))

# Out[12]:

0.9999996833724951
-0.41203227638984496
1 Answer

I won't go into your code too much, since as far as I can see you have mostly just imported all the libraries: you are facing an overfitting problem. Here are a few things I do when I run into the same situation:

  1. Build multiple models, check the goodness of fit, and then implement the best one.
  2. Cross-validation is something you should look into to make sure you are choosing the right model (a short sketch follows this list).
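As an illustration of these two points, here is a minimal sketch, not part of the original notebook, that compares a few candidate regressors with 5-fold cross-validation. It assumes the X and y arrays built in the question, and the candidate estimators and alpha values are only placeholders:

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Candidate models; the alpha values are placeholders, not tuned choices.
candidates = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
}

for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so each CV fold is scaled using
    # only its own training portion.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
    print('%s: mean R^2 = %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))

A model whose score holds up across folds, rather than only on the training split, is the one worth keeping.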

How to handle it (a sketch of points 2 and 4 follows this list):

  1. Train with more data. (It does not work every time, but training on more data can help the algorithm detect the signal better.)
  2. Remove features. (Every variable carries some variance, so even if a feature is not significant it will try to explain some of the variance of the dependent variable during training, but it will fail on the test set because it is not significant enough.)
  3. Early stopping. (Early stopping means halting the training process before the learner passes the point at which it starts to overfit.)
  4. Regularization. (This is one way to obtain a more stable model.)
  5. Ensembling (my favorite). Ensembles are machine learning methods that combine the predictions of several individual models. There are several different ensemble methods, but the two most common, bagging and boosting, are described after the sketch below.
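As a rough sketch of points 2 and 4 taken together: Lasso regularization shrinks uninformative coefficients all the way to zero, so tuning its alpha by cross-validation both regularizes the model and effectively removes features. This assumes the scaled X_train/X_test/y_train/y_test arrays from the question's notebook; the alpha grid, fold count, and iteration limit are arbitrary illustrative choices:

import numpy as np
from sklearn.linear_model import LassoCV

# LassoCV picks alpha by cross-validation on the training data only.
lasso = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso.fit(X_train, y_train)

kept = np.sum(lasso.coef_ != 0)
print('chosen alpha: %.4f' % lasso.alpha_)
print('features kept: %d of %d' % (kept, X_train.shape[1]))
print('train R^2: %.3f' % lasso.score(X_train, y_train))
print('test  R^2: %.3f' % lasso.score(X_test, y_test))

A smaller gap between the train and test R^2, even at the cost of a lower train score, is the sign that overfitting is coming under control.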

Bagging attempts to reduce the chance of overfitting complex models.

It trains a large number of "strong" learners in parallel. A strong learner is a relatively unconstrained model. Bagging then combines all the strong learners together in order to "smooth out" their predictions.

Boosting attempts to improve the predictive flexibility of simple models. It trains a large number of "weak" learners in sequence. A weak learner is a constrained model (for example, you can limit the maximum depth of each decision tree). Each one in the sequence focuses on learning from the mistakes of the one before it. Boosting then combines all the weak learners into a single strong learner.

While bagging and boosting are both ensemble methods, they approach the problem from opposite directions: bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses simple base models and tries to "boost" their combined complexity.
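As a rough, non-authoritative sketch of the two flavors, the snippet below compares bagging (scikit-learn's BaggingRegressor uses an unconstrained decision tree as its default base estimator, trained many times in parallel and averaged) with boosting (many shallow trees fitted in sequence) on the X and y from the question, scored by cross-validation as suggested above; the hyperparameter values are purely illustrative:

from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Bagging: parallel "strong" learners (unconstrained decision trees by default),
# whose predictions are averaged to smooth them out.
bagging = BaggingRegressor(n_estimators=200, random_state=4)

# Boosting: sequential "weak" learners (shallow trees), each one focusing on
# the errors of the ones fitted before it.
boosting = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                     learning_rate=0.05, random_state=4)

for name, estimator in [('Bagging', bagging), ('Boosting', boosting)]:
    scores = cross_val_score(estimator, X, y, cv=5, scoring='r2')
    print('%s: mean R^2 = %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))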