Matching 2 keyword lists using NLP

data-mining classification nlp text

2022-03-03 01:07:23

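I have two lists, and I want to determine which elements across them are related (the same or similar in meaning or context). Which NLP algorithm should we use?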

list-1= [US, Apple, Trump, Biden, Mango, French, German]

list-2= [State, iphone, ipad, ipod, president, person, Fruit, Language, Country]

1 Answer

The simplest implementation uses the following steps:

Step 1 : Iterate through both lists
Step 2 : Calculate the cosine similarity between each word in list1 and each word in list2
Step 3 : Decide on a threshold for the cosine similarity. A higher threshold means stricter matching
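
As a rough illustration of steps 2 and 3, the sketch below computes cosine similarity by hand and keeps the pairs that clear a threshold. The three-dimensional vectors and the 0.8 threshold are invented purely for illustration; real word2vec vectors have far more dimensions, and a sensible threshold has to be tuned on your own data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration only (not real embeddings)
toy_vectors = {
    "Apple": [0.9, 0.1, 0.0],
    "iphone": [0.8, 0.2, 0.1],
    "Country": [0.0, 0.1, 0.9],
}

THRESHOLD = 0.8  # assumed value; higher means stricter matching

def matches(words_1, words_2, vectors, threshold):
    """Return (word_1, word_2, score) for every pair at or above threshold."""
    result = []
    for w1 in words_1:
        for w2 in words_2:
            score = cosine_similarity(vectors[w1], vectors[w2])
            if score >= threshold:
                result.append((w1, w2, score))
    return result

pairs = matches(["Apple"], ["iphone", "Country"], toy_vectors, THRESHOLD)
```

With these toy vectors only the Apple/iphone pair clears the threshold, because the two vectors point in nearly the same direction.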

The code is as follows:

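# Install gensim and obtain a trained Word2Vec model first
from gensim.models import Word2Vec

list_1 = ["US", "Apple", "Trump", "Biden", "Mango", "French", "German"]
list_2 = ["State", "iphone", "ipad", "ipod", "president", "person", "Fruit", "Language", "Country"]

# Load the model once, outside the loops (replace the placeholder path)
model = Word2Vec.load("path/to/your/model")

# Map each (list_1 word, list_2 word) pair to its cosine similarity
similarity_dict = {}
for word_list1 in list_1:
    for word_list2 in list_2:
        # similarity() raises KeyError if either word is missing from the vocabulary
        similarity_dict[(word_list1, word_list2)] = model.wv.similarity(word_list1, word_list2)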
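
Once pairwise scores are collected in a dictionary keyed by `(word_list1, word_list2)`, one way to read off matches is to keep, for each word in the first list, its highest-scoring partner. This is only a minimal sketch: the helper name `best_matches`, the 0.5 cutoff, and the scores in the usage lines are placeholders, not output from a real model.

```python
def best_matches(similarity_dict, threshold=0.5):
    """For each first-list word, keep its best second-list word if the score clears threshold."""
    best = {}
    for (w1, w2), score in similarity_dict.items():
        if score >= threshold and (w1 not in best or score > best[w1][1]):
            best[w1] = (w2, score)
    return best

# Placeholder scores for illustration only (not produced by a model)
example_scores = {
    ("Apple", "iphone"): 0.7,
    ("Apple", "Fruit"): 0.6,
    ("Mango", "Fruit"): 0.8,
    ("Mango", "iphone"): 0.1,
    ("Trump", "Fruit"): 0.2,
}
result = best_matches(example_scores)
```

Words whose best score stays below the cutoff, such as "Trump" in the placeholder data, are simply left out of the result.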

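Pros:

Easy to implement
Word2vec is fairly reliable here, because its embeddings are learned from the contexts in which words appear
Easy to understand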

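Cons:

The time complexity is O(n*m) for lists of length n and m (O(n*n) when they are the same size), since every pair is compared
It does not work for words that are missing from the word2vec vocabulary (out-of-vocabulary words)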
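
One possible workaround for the out-of-vocabulary problem is to check each word against the model's vocabulary first and skip (or report) the ones it does not know. The sketch below assumes gensim, where `word in model.wv` tests vocabulary membership; the helper name `split_by_vocabulary` is hypothetical. It accepts any container that supports `in`, so it can be tried without a model.

```python
def split_by_vocabulary(words, vocabulary):
    """Split words into (known, unknown) according to membership in vocabulary."""
    known = [w for w in words if w in vocabulary]
    unknown = [w for w in words if w not in vocabulary]
    return known, unknown

# With a loaded gensim model you would pass model.wv as the vocabulary.
# A plain set stands in here so the example runs on its own.
stand_in_vocab = {"Apple", "Mango", "French"}
known, unknown = split_by_vocabulary(["Apple", "Biden", "Mango"], stand_in_vocab)
```

Only the `known` words would then be passed to the similarity loop, which avoids the `KeyError` raised for unknown words.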
