Handling a large number of machine learning features with pandas and sklearn

data-mining machine-learning scikit-learn pandas
2022-03-14 01:54:12

I'm relatively new to data science and I'm working with a large dataset. After dropping features with many NaN values and encoding the categorical features, it has many rows and about 270 features. When I run a logistic regression with sklearn, my computer runs out of memory and crashes. How do I handle a huge dataset like this?

3 Answers

I'll assume you have already done feature selection, so that all of your ~270 features genuinely describe your target.

In that case, especially for models trained with SGD, you can fit the model in batches, i.e., keep feeding it new observations (incremental / out-of-core learning).

In your case, using Python, you can optimize the logistic regression cost function with SGDClassifier and loss='log', and train it incrementally via its partial_fit method.

You would do something like the following:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

chunksize = 5
# loss='log' gives logistic regression (renamed to 'log_loss' in scikit-learn >= 1.1)
clf = SGDClassifier(loss='log', penalty='l2', random_state=42)

# features_columns is your list of feature column names. The first call to
# partial_fit must receive every class label, since a chunk may not contain all.
for train in pd.read_csv("train.csv", chunksize=chunksize):
    X = train[features_columns]
    y = train["target"]
    clf.partial_fit(X, y, classes=np.array([0, 1]))  # assumes a binary 0/1 target
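
After the loop, clf behaves like any fitted scikit-learn classifier, so you can evaluate it with clf.predict or clf.score on a held-out chunk.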

Running the model on all available variables may itself be the problem: many of them are not directly related to the outcome, and you can lose performance (you flood the model with unnecessary information, so it gets lost and cannot find the direct, necessary signal). You should start by building a model from only the few variables you believe are most relevant to your problem. Then do feature selection (choosing which variables go into the model) so you learn which set gives the best model. There are many feature selection algorithms, and you can easily find more information about them elsewhere.
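
As a minimal sketch of one such algorithm (univariate selection), assuming your cleaned data sits in a DataFrame df with a "target" column; the k=50 here is an illustrative choice, not something from the question:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns=["target"])  # df is assumed to be your cleaned DataFrame
y = df["target"]

# Keep the 50 features with the strongest univariate ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=50)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])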

Run PCA or LDA on your dataset. Here is some sample code (PCA; an LDA sketch follows at the end).

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df.head()

X = df.values
X.shape

from sklearn.preprocessing import StandardScaler

# It is essential to scale the features before running PCA if their scales differ
# significantly; for example, one feature ranging between 0 and 1 and another
# between 100 and 1,000. PCA is very sensitive to the relative ranges of the
# original features. We can apply z-score standardization to bring all features
# onto the same scale with the StandardScaler class from sklearn.preprocessing.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

from sklearn.decomposition import PCA
pca_30 = PCA(n_components=30, random_state=2020)
pca_30.fit(X_scaled)
X_pca_30 = pca_30.transform(X_scaled)

print('variance explained by all 30 components = ', sum(pca_30.explained_variance_ratio_ * 100))

# The first component alone captures about 44.27% of the variability in the
# dataset, the second component about 18.97%, and so on.
pca_30.explained_variance_ratio_ * 100

np.cumsum(pca_30.explained_variance_ratio_ * 100)


plt.plot(np.cumsum(pca_30.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('explained variance')


print(np.cumsum(pca_30.explained_variance_ratio_ * 100)[0])
print(np.cumsum(pca_30.explained_variance_ratio_ * 100)[1])
print(np.cumsum(pca_30.explained_variance_ratio_ * 100)[2])


# You can see that the first 10 principal components keep about 95.1% of the variability in the 
# dataset while reducing 20 (30–10) features in the dataset. That’s great. The remaining 20 features 
# only contain less than 5% of the variability in data.


# two principal components
pca_2 = PCA(n_components=2, random_state=2020)
pca_2.fit(X_scaled)
X_pca_2 = pca_2.transform(X_scaled)


plt.figure(figsize=(10,10))
sns.scatterplot(x=X_pca_2[:,0], y=X_pca_2[:,1], s=70, hue=cancer.target, palette=['blue','red'])
plt.title('2D Scatterplot of 63% of Variability Captured')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')


# three principal components
pca_3 = PCA(n_components=3, random_state=2020)
pca_3.fit(X_scaled)
X_pca_3 = pca_3.transform(X_scaled)

from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(12, 9))
ax = plt.axes(projection='3d')
sctt = ax.scatter3D(X_pca_3[:,0], X_pca_3[:,1], X_pca_3[:,2], c=cancer.target, s=50, alpha=0.6)
ax.set_title('3D Scatterplot of 72% of Variability Captured')
ax.set_xlabel('First Principal Component')
ax.set_ylabel('Second Principal Component')
ax.set_zlabel('Third Principal Component')



pca_95 = PCA(n_components=.95, random_state=2020)
pca_95.fit(X_scaled)
X_pca_95 = pca_95.transform(X_scaled)

# This means that the algorithm has found 10 principal components to preserve 95% of the variability in 
# the data. The X_pca_95 array holds the values of all 10 principal components. 
X_pca_95.shape



# Name the columns after however many components were kept (10 here).
df_new = pd.DataFrame(X_pca_95, columns=[f'PC{i+1}' for i in range(X_pca_95.shape[1])])
df_new['label'] = cancer.target
df_new
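
The answer mentions LDA as an alternative; a minimal sketch on the same data follows (my addition, reusing the X_scaled array from above). Unlike PCA, LDA is supervised and yields at most n_classes - 1 components, so only one for this binary dataset:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA uses the class labels; with 2 classes it projects onto a single axis.
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, cancer.target)
X_lda.shape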

Related link:

https://towardsdatascience.com/principal-component-analysis-pca-with-scikit-learn-1e84a0c731b0