Methods for finding words characteristic of one set of documents compared with another?

data-mining nlp text-mining
2022-03-15 22:43:04

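I am working on an anomaly detection problem. Once detection has finished, I will have a set of documents containing the title of each object flagged as anomalous.

I will also have a second set of documents containing the title of each object that was not flagged as anomalous.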

anomalous_titles =[[Product:A - sub_group:X1 - pod: P1 - function: M1], [Product:B...],..]
not_anomalous_titles =[[Product:R - type:TX - producer: XX], [Product:B...],..]

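What I want to find out is whether the anomalous documents share any words or patterns that are uncommon in the non-anomalous group.

Which method would be a good fit here? I know about TF-IDF and topic modeling, but I am not sure whether either makes sense for this use case.

Any input is appreciated!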

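1 Answer

TF-IDF and topic modeling are not a good fit on their own, because neither takes the class labels into account. One approach is to train a basic classifier and extract the important features for each class.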

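Steps:

  1. Create a TF-IDF matrix for the text corpus.
  2. Train a basic classifier using the TF-IDF matrix as the feature matrix and the classes as the target. (A decent accuracy is enough.)
  3. Get feature_importances from the trained classifier.
  4. Sort them to get the most important features and their corresponding classes.

import numpy as np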
from collections import defaultdict
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Loading sample data
categories = ['comp.sys.mac.hardware', 'rec.autos', 'sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers', 'quotes'),
    categories=categories,
)

# 1. Fit corpus to tfidf vectorizer
tfidf = TfidfVectorizer(min_df=15, max_df=0.95, max_features=5_000)
tfidf_matrix = tfidf.fit_transform(newsgroups.data)

# 2. Train classifier
clf = RandomForestClassifier()
clf.fit(tfidf_matrix, newsgroups.target)

# 3. Get feature importances
feature_importances = clf.feature_importances_

# 4. Sort and get important features
word_indices = np.argsort(feature_importances)[::-1] # using argsort we get indices of important features
# Lookup to get words from index (get_feature_names() was removed in scikit-learn 1.2)
feature_names = tfidf.get_feature_names_out()

top_n = 50 # Top N features to be considered
top_words_per_class = defaultdict(list)
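for word_idx in word_indices[:top_n]:
    word = feature_names[word_idx]
    # Heuristic: feature_importances_ carries no class information, so assign each
    # word to the class predicted for a one-word document containing only that word
    word_class = newsgroups.target_names[clf.predict(tfidf.transform([word]))[0]]
    top_words_per_class[word_class].append(word)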

top_words_per_class will look something like this:

{
  "rec.autos": ["car", "cars", "engine", "ford", "like", "dealer", "oil", "toyota"],
  "sci.space": ["space", "nasa", "orbit", "launch", "earth", "moon", "shuttle", "thanks", "program", "project", "spacecraft"], 
  "comp.sys.mac.hardware": ["mac", "apple", "drive", "scsi", "centris", "video", "quadra", "monitor", "se", "card", "powerbook", "use", "problem", "simms", "software", "modem"],
  "rec.sport.baseball": ["baseball", "game", "team", "games", "season", "players", "year", "league", "runs", "hit", "player", "braves", "teams", "pitching"]
}