除了 LDA,您还可以使用带有K-Means 的潜在语义分析。它不是神经网络,而是“经典”聚类,但效果很好。
sklearn 中的示例(取自此处):
dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
labels = dataset.target
true_k = np.unique(labels).shape[0]
vectorizer = TfidfTransformer()
X = vectorizer.fit_transform(dataset.data)
svd = TruncatedSVD(true_k)
lsa = make_pipeline(svd, Normalizer(copy=False))
X = lsa.fit_transform(X)
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)
km.fit(X)
现在集群分配标签可用于km.labels_
例如,这些是从 20 个带有 LSA 的新闻组中提取的主题:
Cluster 0:  space  shuttle  alaska  edu  nasa  moon  launch  orbit  henry  sci
Cluster 1:  edu  game  team  games  year  ca  university  players  hockey  baseball
Cluster 2:  sale  00  edu  10  offer  new  distribution  subject  lines  shipping
Cluster 3:  israel  israeli  jews  arab  jewish  arabs  edu  jake  peace  israelis
Cluster 4:  cmu  andrew  org  com  stratus  edu  mellon  carnegie  pittsburgh  pa
Cluster 5:  god  jesus  christian  bible  church  christ  christians  people  edu  believe
Cluster 6:  drive  scsi  card  edu  mac  disk  ide  bus  pc  apple
Cluster 7:  com  ca  hp  subject  edu  lines  organization  writes  article  like
Cluster 8:  car  cars  com  edu  engine  ford  new  dealer  just  oil
Cluster 9:  sun  monitor  com  video  edu  vga  east  card  monitors  microsystems
Cluster 10:  nasa  gov  jpl  larc  gsfc  jsc  center  fnal  article  writes
Cluster 11:  windows  dos  file  edu  ms  files  program  os  com  use
Cluster 12:  netcom  com  edu  cramer  fbi  sandvik  408  writes  article  people
Cluster 13:  armenian  turkish  armenians  armenia  serdar  argic  turks  turkey  genocide  soviet
Cluster 14:  uiuc  cso  edu  illinois  urbana  uxa  university  writes  news  cobb
Cluster 15:  edu  cs  university  posting  host  nntp  state  subject  organization  lines
Cluster 16:  uk  ac  window  mit  server  lines  subject  university  com  edu
Cluster 17:  caltech  edu  keith  gatech  technology  institute  prism  morality  sgi  livesey
Cluster 18:  key  clipper  chip  encryption  com  keys  escrow  government  algorithm  des
Cluster 19:  people  edu  gun  com  government  don  like  think  just  access
您还可以应用Non-Negative Matrix Factorization,这可以解释为聚类。您需要做的就是在转换后的空间中获取每个文档的最大组件 - 并将其用作集群分配。
在 sklearn 中:
nmf = NMF(n_components=k, random_state=1).fit_transform(X)
labels = nmf.argmax(axis=1)