I am working on a text clustering problem. My goal is to create clusters of posts with similar context, i.e. similar conversations. I have about 40 million posts from social media. As a first attempt, I wrote a clustering pipeline using K-Means and Tf-Idf. The code below shows what I am doing.

These are the main steps:
- do some pre-processing
- create the `tfidf_matrix` while tokenizing and stemming
- run K-Means on the tf-idf matrix
- collect the results
```python
import csv
import re
import sys

import joblib
import nltk
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

csvRows = []
nltk.download('stopwords')
title = []
synopses = []
filename = "cc.csv"
num_clusters = 20
pkl_file = "doc_cluster.pkl"
generate_pkl = False

if len(sys.argv) == 1:
    print("Will use " + pkl_file + " to cluster")
elif sys.argv[1] == '--generate-pkl':
    print("Will generate a new pkl file")
    generate_pkl = True

# pre-process data
with open(filename, 'r') as csvfile:
    # creating a csv reader object
    csvreader = csv.reader(csvfile)

    # extracting field names through first row
    fields = next(csvreader)

    # extracting each data row one by one, skipping duplicate posts
    duplicates = 0
    for row in csvreader:
        line = row[1]  # post text; adjust the column index to your CSV layout
        if line not in synopses:
            synopses.append(line)
            title.append(row[0])
        else:
            duplicates += 1

stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens


totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
    # for each item in 'synopses', tokenize/stem
    allwords_stemmed = tokenize_and_stem(i)
    # extend the 'totalvocab_stemmed' list
    totalvocab_stemmed.extend(allwords_stemmed)

    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame(
    {'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.0, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)
terms = tfidf_vectorizer.get_feature_names()
# dist = 1 - cosine_similarity(tfidf_matrix)

km = KMeans(n_clusters=10, max_iter=1000, verbose=1).fit(tfidf_matrix)
clusters = km.labels_.tolist()

# uncomment the below to save your model
# since I've already run my model I am loading from the pickle
if generate_pkl:
    joblib.dump(km, pkl_file)
    print("Generated pkl file " + pkl_file)

km = joblib.load(pkl_file)
clusters = km.labels_.tolist()

films = {'title': title, 'synopsis': synopses, 'cluster': clusters}
total_count = len(films['synopsis'])

csvRows = []
for idx in range(total_count):
    csvRows.append({
        'title': films['title'][idx],
        'cluster': films['cluster'][idx]
    })

print('Creating cluster.csv')
with open('cluster.csv', 'w') as output:
    writer = csv.DictWriter(output, csvRows[0].keys())
    writer.writeheader()
    writer.writerows(csvRows)
print("\ncreated cluster.csv")
```
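To see why the clusters feel so generic, I also print the cluster sizes and the highest-weighted terms per centroid. This is just a small diagnostic sketch I bolted on, reusing `km`, `terms`, and `clusters` from the script above; the variable names here are my own:

```python
from collections import Counter

# how many posts ended up in each cluster
print(Counter(clusters))

# top 10 tf-idf terms closest to each cluster centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for cluster_id, term_indices in enumerate(order_centroids):
    top_terms = [terms[idx] for idx in term_indices[:10]]
    print("Cluster %d: %s" % (cluster_id, ", ".join(top_terms)))
```

The top terms per cluster look very similar to each other, which is what I mean by "generic" below.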
The results are not very satisfying: the clusters come out very generic and hard to tell apart. What can I do to improve my clustering? I would still like to use K-Means, but is there something I could use instead of Tf-Idf?
Also, if you think there is a better alternative to K-Means, please suggest it. It would help even more if you could point me to sources/examples where someone has done something similar.
I will always be running the clustering on a volume of close to 40 million posts.
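For that volume I have started experimenting with a streaming variant, since the in-memory tf-idf + KMeans above will not scale. This is only a sketch of the direction I am considering (it swaps in scikit-learn's `HashingVectorizer` and `MiniBatchKMeans`; the feature count and chunk size are arbitrary values I picked), in case it helps frame the question:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

# stateless vectorizer, so no vocabulary has to be held for 40M posts
vectorizer = HashingVectorizer(n_features=2**18, stop_words='english',
                               alternate_sign=False, norm='l2')
km = MiniBatchKMeans(n_clusters=20, batch_size=10_000, random_state=0)

def post_chunks(posts, chunk_size=100_000):
    """Yield the posts in fixed-size chunks so everything fits in memory."""
    for start in range(0, len(posts), chunk_size):
        yield posts[start:start + chunk_size]

for chunk in post_chunks(synopses):
    X = vectorizer.transform(chunk)  # hashing is stateless, no fit needed
    km.partial_fit(X)                # update the centroids incrementally

labels = km.predict(vectorizer.transform(synopses[:1000]))  # sanity check on a sample
```

Is this a reasonable direction, or is there a better representation/algorithm for this scale?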