TF-IDF and topic modeling are not suitable on their own because they do not take class labels into account. One approach is to train a basic classifier and then extract the most important features for each class.
Steps:
- Create a TF-IDF matrix for the text corpus.
- Train a basic classifier using the TF-IDF matrix as the feature matrix and the classes as the target. (A decent accuracy is enough.)
- Get feature_importances_ from the trained classifier.
- Sort to get the most important features and their corresponding classes.
import numpy as np
from collections import defaultdict
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
# Loading sample data
categories = ['comp.sys.mac.hardware', 'rec.autos', 'sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)
# 1. Fit corpus to tfidf vectorizer
tfidf = TfidfVectorizer(min_df=15, max_df=0.95, max_features=5_000)
tfidf_matrix = tfidf.fit_transform(newsgroups.data)
# 2. Train classifier
clf = RandomForestClassifier()
clf.fit(tfidf_matrix, newsgroups.target)
# 3. Get feature importances
feature_importances = clf.feature_importances_
# 4. Sort and get important features
word_indices = np.argsort(feature_importances)[::-1] # using argsort we get indices of important features
feature_names = tfidf.get_feature_names_out() # Lookup to get words from index
top_n = 50 # Top N features to be considered
top_words_per_class = defaultdict(list)
for word_idx in word_indices[:top_n]:
    word = feature_names[word_idx]
    # Treat the word as a one-word document and let the classifier pick its class
    word_class = newsgroups.target_names[clf.predict(tfidf.transform([word]))[0]]
    top_words_per_class[word_class].append(word)
top_words_per_class will look something like this:
{
"rec.autos": ["car", "cars", "engine", "ford", "like", "dealer", "oil", "toyota"],
"sci.space": ["space", "nasa", "orbit", "launch", "earth", "moon", "shuttle", "thanks", "program", "project", "spacecraft"],
"comp.sys.mac.hardware": ["mac", "apple", "drive", "scsi", "centris", "video", "quadra", "monitor", "se", "card", "powerbook", "use", "problem", "simms", "software", "modem"],
"rec.sport.baseball": ["baseball", "game", "team", "games", "season", "players", "year", "league", "runs", "hit", "player", "braves", "teams", "pitching"]
}
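Step 2 only requires "decent" accuracy, but it is still worth a quick check before trusting the importances: if the classifier is near chance level, its feature_importances_ are noise. A minimal sketch using cross-validation on a tiny synthetic corpus (the docs and labels below are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical repeated toy docs spanning three classes
docs = ["car engine oil", "car dealer ford", "nasa orbit launch",
        "space shuttle moon", "baseball game team", "season players league"] * 5
labels = np.array([0, 0, 1, 1, 2, 2] * 5)

X = TfidfVectorizer().fit_transform(docs)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=3)
print(scores.mean())  # mean cross-validated accuracy
```

On real data, replace the toy corpus with your TF-IDF matrix and targets; anything far above the majority-class baseline is a reasonable green light for inspecting the importances.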