运行我的数据的无监督图,我注意到一个双曲线(“回旋镖”)形状:
dim=2
vectorizer = TfidfVectorizer(min_df=5, max_df = 0.4, stop_words = 'english')
train_tf_idf = vectorizer.fit_transform(bunch_train.data)
svd = TruncatedSVD(n_components=dim,random_state=42)
svd_train = svd.fit_transform(train_tf_idf)
svd_train = Normalizer().fit_transform(svd_train)
labels = {
0:'alt.atheism',
1:'comp.graphics',
2:'sci.med'}
y = np.vectorize(labels.get)(bunch_train.target)
with plt.style.context('seaborn-whitegrid'):
plt.figure(figsize=(12, 8))
for lab, col in zip(('alt.atheism', 'comp.graphics', 'sci.med'),
('blue', 'red', 'green')):
plt.scatter(svd_train[y==lab, 0],
svd_train[y==lab, 1],
label=lab,
c=col)
plt.title('2D SVD on TF-IDF - 3-NewsGroups',size=28)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(loc='upper left',prop={'size': 20})
plt.tight_layout()
plt.show()
我怀疑它与 Normalizer 有关 - 删除以下行时:
svd_train = Normalizer().fit_transform(svd_train)
数据图如下:

