Trouble understanding the regression line learned by SGDRegressor

data mining, machine learning, Python, linear regression, online learning

2022-03-10 01:14:13

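I'm putting together a demo notebook to better understand online (incremental) learning. I read in the sklearn documentation that the number of regression models supporting online learning through the partial_fit() method is fairly limited: only SGDRegressor and PassiveAggressiveRegressor are available. XGBoost also supports the same functionality through its xgb_model parameter. For now, I chose to experiment with SGDRegressor.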

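I created a sample dataset (the generation code is below). The dataset looks like this:

Although this dataset is clearly not a good candidate for a linear model such as SGDRegressor, the point of this snippet is only to demonstrate how the learned parameters (coef_, intercept_) and the regression line change as the model sees more and more data points.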

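My approach:

1. Take the first 100 data points after sorting the data
2. Train an initial model on those first 100 observations and retrieve the learned parameters
3. Plot the learned regression line
4. Iterate: take N "new" observations, call partial_fit(), retrieve the updated parameters, and plot the updated regression line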
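The steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the function incremental_fit, the batch size of 50, and the toy data (a noisy line with x in [0, 1]) are all assumptions chosen to keep the example small, and plotting is left out.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def incremental_fit(X, y, n_initial=100, batch_size=50, random_state=0):
    """Fit on the first n_initial rows, then update one batch at a time.

    Returns the model and a list of (rows_seen, coef, intercept) snapshots,
    one per partial_fit call. Hypothetical helper for illustration.
    """
    model = SGDRegressor(random_state=random_state)
    history = []
    start, stop = 0, n_initial
    while start < len(X):
        # Each call makes a single pass over the rows it is given.
        model.partial_fit(X[start:stop], y[start:stop])
        rows_seen = min(stop, len(X))
        history.append((rows_seen, float(model.coef_[0]), float(model.intercept_[0])))
        start, stop = stop, stop + batch_size
    return model, history


# Assumed toy data: sorted by x, as in the approach described above.
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0.0, 1.0, size=500)).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + 2.0 + rng.normal(scale=0.1, size=500)

model, history = incremental_fit(X_demo, y_demo)
for rows_seen, coef, intercept in history:
    print(f"rows seen: {rows_seen:3d}, coef: {coef:.2f}, intercept: {intercept:.2f}")
```

Each snapshot in history could be used to redraw the regression line after the corresponding update.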

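The problem is that after training on the first 100 observations, the learned parameters and the regression line do not look right at all. I tried adjusting the max_iter and eta0 parameters of SGDRegressor(), because I suspected SGD was failing to converge to the optimum due to a learning rate that was too low and/or a maximum number of iterations that was too small. That did not seem to help, though.

Here is my plot:

My full code: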

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

random_state = 1

# generating first section: x rescaled to [0, 20], y to [100, 300]
x1, y1 = datasets.make_regression(n_samples=1000, n_features=1, noise=20, random_state=random_state)
x1 = np.interp(x1, (x1.min(), x1.max()), (0, 20))
y1 = np.interp(y1, (y1.min(), y1.max()), (100, 300))

# generating second section: x rescaled to [15, 25], y to [275, 550]
x2, y2 = datasets.make_regression(n_samples=1000, n_features=1, noise=20, random_state=random_state)
x2 = np.interp(x2, (x2.min(), x2.max()), (15, 25))
y2 = np.interp(y2, (y2.min(), y2.max()), (275, 550))

# generating third section: x rescaled to [24, 50], y to [500, 600]
x3, y3 = datasets.make_regression(n_samples=1000, n_features=1, noise=20, random_state=random_state)
x3 = np.interp(x3, (x3.min(), x3.max()), (24, 50))
y3 = np.interp(y3, (y3.min(), y3.max()), (500, 600))

# combining three sections into X and y
X = np.concatenate([x1, x2, x3])
y = np.concatenate([y1, y2, y3])

# plotting the combined dataset
plt.figure(figsize=(15, 5))
plt.plot(X, y, '.')
plt.show()

# organizing and sorting data in dataframe
df = pd.DataFrame([])
df['X'] = X.flatten()
df['y'] = y.flatten()
df = df.sort_values(by='X')
df = df.reset_index(drop=True)

# train model on first 100 observations
model = linear_model.SGDRegressor()
model.partial_fit(df.X[:100].to_numpy().reshape(-1, 1), df.y[:100])
print(f"model coef: {model.coef_[0]:.2f}, intercept: {model.intercept_[0]:.2f}")

# plot the learned regression line over the full dataset
regression_line = model.predict(df.X[:100].to_numpy().reshape(-1, 1))
plt.figure(figsize=(15, 5))
plt.plot(X, y, '.')
plt.plot(df.X[:100], regression_line, linestyle='-', color='r')
plt.title("SGDRegressor on first 100 observations with default arguments")
plt.show()

What am I misunderstanding or overlooking here?

1 Answer

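A single call to partial_fit is unlikely to give you a good fit, because it performs only one epoch of stochastic gradient descent, that is, a single pass over the samples you pass in. As the documentation puts it:

"Internally, this method uses max_iter = 1. Therefore, it is not guaranteed that a minimum of the cost function is reached after calling it once. Matters such as objective convergence and early stopping should be handled by the user."

Source: the scikit-learn documentation for SGDRegressor.partial_fit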
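A small experiment can make this concrete. The sketch below assumes made-up toy data and a fixed random_state so that runs are comparable; the helper one_partial_fit is hypothetical. It compares the parameters after one partial_fit call for two very different max_iter settings, and also reports how many epochs a regular fit() call runs on the same data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Assumed toy data: a noisy line with x in [0, 1].
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(100, 1))
y_demo = 3.0 * X_demo.ravel() + 2.0 + rng.normal(scale=0.1, size=100)


def one_partial_fit(max_iter):
    """Build a model with the given max_iter and call partial_fit exactly once."""
    model = SGDRegressor(max_iter=max_iter, random_state=0)
    model.partial_fit(X_demo, y_demo)
    return model.coef_.copy(), model.intercept_.copy()


coef_small, intercept_small = one_partial_fit(max_iter=5)
coef_large, intercept_large = one_partial_fit(max_iter=10_000)

# If partial_fit ignores the constructor's max_iter, both runs should agree.
print("same coef:", np.allclose(coef_small, coef_large))
print("same intercept:", np.allclose(intercept_small, intercept_large))

# fit(), by contrast, keeps iterating until its own stopping rule triggers.
full = SGDRegressor(max_iter=10_000, random_state=0).fit(X_demo, y_demo)
print("epochs run by fit():", full.n_iter_)
```

This would be consistent with changing max_iter having no visible effect when the model is only ever trained through partial_fit.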

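I'm not very familiar with online learning and partial fitting, but to make this work it seems you need some kind of loop. After playing around for a bit, I found that this simple modification already improves the result considerably: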

# train incrementally: call partial_fit on growing prefixes of the sorted data
# (reuses df, X, y, linear_model and plt from the question's code)
model = linear_model.SGDRegressor()
amount = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300]
for a in amount:
    model.partial_fit(df.X[:a].to_numpy().reshape(-1, 1), df.y[:a])
    print(f"amount: {a}, model coef: {model.coef_[0]:.2f}, intercept: {model.intercept_[0]:.2f}")

# draw the fitted line across the first 800 sorted observations
regression_line = model.predict(df.X[:800].to_numpy().reshape(-1, 1))
plt.figure(figsize=(15, 15))
plt.plot(X, y, '.')
plt.plot(df.X[:800], regression_line, linestyle='-', color='r')
plt.title("SGDRegressor after repeated partial_fit calls on up to 300 observations")
plt.show()

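In the output you can see the intercept increasing and the coefficient decreasing, which is what we would expect as the fit gets better.
I hope this is enough to get your project going again!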
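Since the documentation leaves convergence to the user, a variant of the loop above is to repeat partial_fit over the same batch and stop once the training error stops improving. The following is a rough sketch under stated assumptions: fit_by_epochs, its tol and max_epochs arguments, and the toy data are all invented for illustration, and the stopping rule is a simple hand-rolled check, not scikit-learn's own criterion.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error


def fit_by_epochs(X, y, max_epochs=200, tol=1e-4, random_state=0):
    """Call partial_fit once per epoch until the training MSE plateaus.

    Stops when the MSE improves by less than tol between two consecutive
    epochs, or after max_epochs passes. Hypothetical helper for illustration.
    """
    model = SGDRegressor(random_state=random_state)
    losses = []
    for _ in range(max_epochs):
        model.partial_fit(X, y)  # one pass over the batch
        losses.append(mean_squared_error(y, model.predict(X)))
        if len(losses) > 1 and losses[-2] - losses[-1] < tol:
            break
    return model, losses


# Assumed toy data: a noisy line with x in [0, 1].
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 1))
y_demo = 3.0 * X_demo.ravel() + 2.0 + rng.normal(scale=0.1, size=200)

model, losses = fit_by_epochs(X_demo, y_demo)
print(f"epochs: {len(losses)}, first MSE: {losses[0]:.4f}, last MSE: {losses[-1]:.4f}")
print(f"coef: {model.coef_[0]:.2f}, intercept: {model.intercept_[0]:.2f}")
```

For data on a much larger scale than this toy example, such as the question's x values up to 50, the default learning rate may behave differently, so the numbers here should not be read as representative.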
