Besides Tf-Idf, what else can I use with K-Means for text clustering?

data-mining python clustering k-means unsupervised-learning

2021-10-16 03:47:21

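I am working on a text clustering problem. My goal is to create clusters of posts with similar context and similar talk. I have around 40 million posts from social media. As a first attempt, I wrote the clustering using K-Means and Tf-Idf. The following code shows what I am doing.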

Here are the main steps:

Do some pre-processing
Create tfidf_matrix, using tokenization and stemming
Run K-Means on the tf-idf matrix

Get the results

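import csv
import re
import sys

import joblib
import nltk
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

csvRows = []
nltk.download('stopwords')
# the tokenizers used below also need the NLTK 'punkt' data
# (the exact resource name depends on the NLTK version)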

title = []
synopses = []
filename = "cc.csv"
num_clusters = 20
pkl_file = "doc_cluster.pkl"
generate_pkl = False

if len(sys.argv) == 1:
    print("Will use "+pkl_file + " to cluster")
elif sys.argv[1] == '--generate-pkl':
    print("Will generate a new pkl file")
    generate_pkl = True


# pre-process data
with open(filename, 'r') as csvfile:
    # creating a csv reader object
    csvreader = csv.reader(csvfile)

    # extracting field names through first row
    fields = next(csvreader)

    # extracting each data row one by one
    duplicates = 0
    for row in csvreader:
        # NOTE: the statement that built `line` from `row` (and removed the
        # specified characters) is missing from the post.
        # Assumption: the post text is in the second column.
        line = row[1]
        if line not in synopses:
            synopses.append(line)
            title.append(row[0])
        else:
            duplicates += 1



stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    # first tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search(r'[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search(r'[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens


totalvocab_stemmed = []
totalvocab_tokenized = []

for i in synopses:
    # for each item in 'synopses', tokenize/stem
    allwords_stemmed = tokenize_and_stem(i)
    # extend the 'totalvocab_stemmed' list
    totalvocab_stemmed.extend(allwords_stemmed)

    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame(
    {'words': totalvocab_tokenized}, index=totalvocab_stemmed)

print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')


# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.0, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)
# get_feature_names() was removed in scikit-learn 1.2
terms = tfidf_vectorizer.get_feature_names_out()
# dist = 1 - cosine_similarity(tfidf_matrix)


# note: num_clusters (20) defined above is never used; this run asks for 10 clusters
km = KMeans(n_clusters=10, max_iter=1000,
            verbose=1).fit(tfidf_matrix)


clusters = km.labels_.tolist()

# save the fitted model when --generate-pkl is passed;
# otherwise a previously saved model is loaded from the pickle below

if generate_pkl:
    joblib.dump(km, pkl_file)
    print("Generated pkl file " + pkl_file)

km = joblib.load(pkl_file)

clusters = km.labels_.tolist()


films = {'title': title,  'synopsis': synopses, 'cluster': clusters, }

total_count = len(films['synopsis'])

csvRows = []

for idx in range(total_count):
    csvRows.append({
        'title': films['title'][idx],
        'cluster': films['cluster'][idx]
    })

print('Creating cluster.csv')

with open('cluster.csv', 'w', newline='') as output:
    writer = csv.DictWriter(output, csvRows[0].keys())
    writer.writeheader()
    writer.writerows(csvRows)
    print("\ncreated cluster.csv")

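The results are not very satisfactory. They are very average. What can be done to improve my clustering algorithm? I still want to use K-Means, but what else can I use in place of Tf-Idf?

Also, if you think there are better alternatives to K-Means, please suggest them. It would be even more helpful if you could point me to sources or examples where something similar has already been done.

I will always be running the clustering on volumes close to 40 million.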

3 Answers

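You may see an improvement by using an algorithm such as GloVe in place of Tf-Idf. Like Tf-Idf, GloVe lets you represent a group of words as a vector. Unlike Tf-Idf, which is a bag-of-words approach, GloVe and similar techniques learn their vectors from the words that occur around each word in a tweet. Knowing which words come before or after a word of interest is valuable information for assigning meaning. This article walks through the different techniques and gives a good description of each. Also, this script on Kaggle shows how to use pre-trained word vectors to represent tweets.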
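One simple way to turn word vectors into a matrix that K-Means can consume is to average the vectors of the words in each post. The sketch below assumes a tiny, made-up 3-dimensional vocabulary (`toy_vectors`) in place of real pre-trained GloVe vectors, and the helper name `embed` is hypothetical. Averaging is only one possible pooling strategy, and it discards word order within the post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 3-dimensional word vectors for illustration only.
# Real GloVe vectors would be loaded from a pre-trained file instead.
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "pet": np.array([0.85, 0.15, 0.05]),
    "stock": np.array([0.0, 0.9, 0.3]),
    "market": np.array([0.1, 0.8, 0.4]),
    "price": np.array([0.05, 0.85, 0.35]),
}


def embed(tokens, vectors, dim=3):
    """Average the vectors of the known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


posts = [
    "my cat and dog",
    "dog is a pet",
    "stock market price",
    "market price falls",
]

# One dense row per post, replacing the sparse tf-idf matrix.
doc_matrix = np.vstack([embed(p.lower().split(), toy_vectors) for p in posts])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_matrix)
labels = km.labels_
print(labels)
```

The resulting matrix has one row per post and as many columns as the word vectors have dimensions, so it is far smaller than a tf-idf matrix with up to 200,000 features.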

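For your clustering, I recommend looking at density-based clustering. K-means is a decent general-purpose algorithm, but it is a partitioning method and relies on assumptions that may not hold, such as clusters being of roughly equal size. That is almost certainly not the case here. This blog has a good discussion of text clustering. If you go density-based and use Python, I highly recommend HDBSCAN by Leland McInnes.

Good luck!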

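You could try using n-grams.

n-gram

Character n-grams are a feature-extraction technique for language-based data. They segment a string into overlapping pieces so that word roots can be matched while verb endings, plurals, and so on are largely ignored.

The segmentation works as follows:

String: Hello World

2-gram: "He", "el", "ll", "lo", "o ", " W", "Wo", "or", "rl", "ld"
3-gram: "Hel", "ell", "llo", "lo ", "o W", " Wo", "Wor", "orl", "rld"
4-gram: "Hell", "ello", "llo ", "lo W", "o Wo", " Wor", "Worl", "orld"

So, in this example, if we use 4-grams, truncated or slightly altered forms of the word Hello still share most of their 4-grams with it. This similarity would be captured by your features.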
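The segmentation above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `char_ngrams` is made up for illustration.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("Hello World", 2)
print(bigrams)

# Variants of a word share n-grams, which is what makes them look similar.
shared = set(char_ngrams("Hello", 4)) & set(char_ngrams("Helloo", 4))
print(sorted(shared))
```

With scikit-learn, the same idea is available without custom code by passing `analyzer='char'` and an `ngram_range` to `TfidfVectorizer`.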

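K-means takes a matrix as input, so to improve the results you should look at how you create that matrix.

TF-IDF is one way to build that matrix; Word2Vec is another.

"What can be done to improve my clustering algorithm?"

Hyperparameters should be tuned, for example whether to include bigrams.

"I still want to use K-Means, but what else can I use in place of TF-IDF?"

Word2Vec/CBOW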
