Background
I am trying to determine the number of principal components (k) to keep when applying PCA to MNIST, targeting 95% explained variance.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
# Keep the first 60,000 samples as the training set (the remainder would be the test set)
X, y = mnist["data"], mnist["target"]
X_train, y_train = X[:60000], y[:60000]
COVERAGE = 0.95  # target fraction of explained variance
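(A caveat on the loader: depending on the scikit-learn version, fetch_openml may return a pandas DataFrame by default; if that trips up any of the array operations below, as_frame=False forces plain NumPy arrays:)

# Hedge for newer scikit-learn versions: return NumPy arrays rather than a DataFrame
mnist = fetch_openml('mnist_784', version=1, as_frame=False)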
If I follow the PCA algorithm from the Coursera Machine Learning course, it is 67:
import numpy as np
import scipy as sp
import scipy.linalg  # ensures sp.linalg is available on older SciPy versions
from sklearn.preprocessing import StandardScaler

# Standardize the training data (StandardScaler centers and scales each
# feature to unit variance; the explicit mean subtraction is redundant)
X_centered = StandardScaler().fit_transform(X_train - X_train.mean(axis=0))
# Unnormalized covariance matrix (missing the 1/(n-1) factor)
covariance_matrix = X_centered.T.dot(X_centered)
U, s, Vt = sp.linalg.svd(covariance_matrix)
calculated_coverages = ((s ** 2) / (len(s) - 1)).cumsum()
calculated_coverages = calculated_coverages / calculated_coverages[-1]
k = np.argmax(calculated_coverages >= COVERAGE)
print("k-th component to cover {0} is {1}".format(calculated_coverages[k], k))
k-th component to cover 0.9507022719172283 is 66
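For reference, the two approaches are tied together by the identity lambda_i = s_i**2 / (n - 1) between the singular values s_i of the centered data matrix and the covariance eigenvalues lambda_i. A minimal cross-check sketch of the cumulative ratio computed that way, assuming the same standardized matrix as above:

# Cross-check sketch: covariance eigenvalues from the singular values
# of the data matrix itself, lambda_i = s_i**2 / (n - 1)
_, s_data, _ = sp.linalg.svd(X_centered, full_matrices=False)
eigenvalues = (s_data ** 2) / (len(X_centered) - 1)
ratios = eigenvalues.cumsum() / eigenvalues.sum()
print(np.argmax(ratios >= COVERAGE) + 1)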
However, if I use explained_variance_ratio_ from scikit-learn, it is 154:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
contributions = pca.explained_variance_ratio_  # per-component variance ratios
coverages = pca.explained_variance_ratio_.cumsum()
k = np.argmax(coverages >= COVERAGE)
print("k-th principal component for 95% coverage is {}".format(k + 1))
k-th principal component for 95% coverage is 154
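For what it's worth, scikit-learn can also pick k directly: passing a float in (0, 1) as n_components keeps the smallest number of components reaching that cumulative ratio, which I would expect to agree with the 154 above:

from sklearn.decomposition import PCA

# A float n_components in (0, 1) makes PCA keep the smallest number of
# components whose cumulative explained variance reaches that fraction
pca95 = PCA(n_components=0.95)
pca95.fit(X_train)
print(pca95.n_components_)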
But when I look at scikit-learn/sklearn/decomposition/_pca.py, the logic looks the same:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
singular_values_ = S.copy() # Store the singular values.
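To take the library itself out of the picture, here is a minimal sketch of the same computation done by hand (centering only, no scaling, as in the source above), which I would expect to reproduce the 154:

import numpy as np

# Replicating the library logic by hand: center only (no scaling),
# SVD of the data matrix itself, then normalize the variances
Xc = np.asarray(X_train, dtype=float)
Xc = Xc - Xc.mean(axis=0)
_, S, _ = np.linalg.svd(Xc, full_matrices=False)
explained_variance = (S ** 2) / (len(Xc) - 1)
ratio = explained_variance / explained_variance.sum()
print(np.argmax(ratio.cumsum() >= COVERAGE) + 1)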
Question
Please help me understand why the two results differ.
