Matching two keyword lists using NLP

data-mining classification nlp text
2022-03-03 01:07:23

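I have two lists, and I want to determine which elements of one list are related to elements of the other (the same or similar in meaning or context). Which NLP algorithm should we use?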

list-1= [US, Apple, Trump, Biden, Mango, French, German]

list-2= [State, iphone, ipad, ipod, president, person, Fruit, Language, Country]
1 Answer

The simplest implementation would use the following steps:

Step 1 : Iterate through both lists
Step 2 : Calculate the cosine similarity between each word in list 1 and each word in list 2
Step 3 : Choose a threshold on the cosine similarity. A higher threshold means stricter matching
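
For reference, the cosine similarity in Step 2 is the dot product of two vectors divided by the product of their lengths. Below is a minimal sketch in plain Python. The 3-dimensional vectors in `toy_vectors` are made-up values for illustration only and are not taken from any model; real word2vec embeddings usually have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "Mango": [0.9, 0.1, 0.0],
    "Fruit": [0.8, 0.2, 0.1],
    "president": [0.0, 0.1, 0.9],
}

# Vectors pointing in similar directions score close to 1,
# nearly orthogonal vectors score close to 0.
print(cosine_similarity(toy_vectors["Mango"], toy_vectors["Fruit"]))
print(cosine_similarity(toy_vectors["Mango"], toy_vectors["president"]))
```

With these toy values, "Mango" scores much higher against "Fruit" than against "president", which is the behavior the threshold in Step 3 relies on.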

The code is as follows:

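list_1 = ["US", "Apple", "Trump", "Biden", "Mango", "French", "German"]
list_2 = ["State", "iphone", "ipad", "ipod", "president", "person", "Fruit", "Language", "Country"]


# Requires the gensim package and a trained Word2Vec model:
from gensim.models import Word2Vec

# Load the model once, outside the loops (loading is expensive).
model = Word2Vec.load("path/to/your/model")

similarity_dict = {}
for word_list1 in list_1:
    for word_list2 in list_2:
        try:
            cosine_similarity = model.wv.similarity(word_list1, word_list2)
        except KeyError:
            # One of the words is not in the model's vocabulary; skip the pair.
            continue
        similarity_dict[(word_list1, word_list2)] = cosine_similarity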
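
The loop above only computes scores. Step 3 (applying a threshold) could look like the sketch below. The scores in `example_scores` and the 0.5 cutoff are made-up assumptions for illustration, and `matches_above_threshold` is a hypothetical helper name. A suitable threshold depends on the model and should be tuned on real data.

```python
def matches_above_threshold(scores, threshold):
    """Group list-2 words under each list-1 word whose score meets the threshold."""
    matches = {}
    for (word_1, word_2), score in scores.items():
        if score >= threshold:
            matches.setdefault(word_1, []).append((word_2, score))
    # Sort each word's matches from most to least similar.
    for word_1 in matches:
        matches[word_1].sort(key=lambda pair: pair[1], reverse=True)
    return matches

# Hypothetical similarity scores keyed by (list_1 word, list_2 word).
example_scores = {
    ("Apple", "iphone"): 0.71,
    ("Apple", "Fruit"): 0.58,
    ("Apple", "Language"): 0.12,
    ("Mango", "Fruit"): 0.66,
    ("Mango", "president"): 0.05,
}

print(matches_above_threshold(example_scores, threshold=0.5))
```

Raising the threshold keeps fewer pairs, which is what "higher means stricter" in Step 3 refers to.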
         

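Pros:

  1. Easy to implement

  2. Word2vec is a fairly reliable choice, because its embeddings are learned from the contexts in which words appear

  3. Easy to understand

Cons:

  1. The time complexity is O(n*n), since every word in one list is compared with every word in the other

  2. It does not work for words that are missing from the word2vec model's vocabulary (out-of-vocabulary words)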