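I have two lists, and I want to determine which elements across the two lists are common (the same or similar in meaning or context). Which NLP algorithm should we use?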
list-1= [US, Apple, Trump, Biden, Mango, French, German]
list-2= [State, iphone, ipad, ipod, president, person, Fruit, Language, Country]
The simplest implementation would use the following steps:
Step 1 : Iterate through both lists
Step 2 : Calculate the cosine similarity between each word in list 1 and each word in list 2
Step 3 : Decide on a threshold for the cosine similarity. A higher threshold is stricter
The code is as follows:
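# Install gensim and obtain a trained Word2Vec model first.
from gensim.models import Word2Vec

list_1 = ["US", "Apple", "Trump", "Biden", "Mango", "French", "German"]
list_2 = ["State", "iphone", "ipad", "ipod", "president", "person", "Fruit", "Language", "Country"]

# Load the model once, outside the loops (the path is a placeholder).
model = Word2Vec.load("path/to/your/model")

similarity_dict = {}
for word_list1 in list_1:
    for word_list2 in list_2:
        # similarity() raises KeyError if either word is missing from the vocabulary.
        cosine_similarity = model.wv.similarity(word_list1, word_list2)
        similarity_dict[(word_list1, word_list2)] = cosine_similarity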
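The loop above only computes the scores; step 3 (the threshold) is not shown. A minimal sketch of that step follows. To keep it runnable without a trained model, it uses tiny made-up vectors (`toy_vectors`), a hand-written `cosine` function, and an assumed threshold of 0.8. None of these come from the answer itself; with a real model, `model.wv.similarity` would take the place of `cosine`.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "Apple": [0.9, 0.1, 0.0],
    "Mango": [0.8, 0.3, 0.1],
    "Fruit": [0.85, 0.2, 0.05],
    "Trump": [0.0, 0.2, 0.9],
    "president": [0.1, 0.1, 0.95],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_pairs(words_1, words_2, vectors, threshold):
    """Return (word_1, word_2, score) for every pair scoring at or above threshold."""
    pairs = []
    for w1 in words_1:
        for w2 in words_2:
            score = cosine(vectors[w1], vectors[w2])
            if score >= threshold:
                pairs.append((w1, w2, score))
    return pairs

THRESHOLD = 0.8  # assumed value; tune it on your own data
matches = similar_pairs(
    ["Apple", "Mango", "Trump"], ["Fruit", "president"], toy_vectors, THRESHOLD
)
for w1, w2, score in matches:
    print(f"{w1} ~ {w2}: {score:.3f}")
```

Raising `THRESHOLD` keeps fewer, closer pairs; lowering it admits looser matches. The right value depends on the model and the data, so it is worth checking a handful of pairs by hand.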
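Pros:
Easy to implement
Word2vec is a reliable choice here, because its vectors are learned from the contexts in which words occur
Easy to understand
Cons:
The time complexity is O(n*n), since every word in one list is compared with every word in the other
It does not work for words that are missing from the word2vec model's vocabulary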
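One way to soften the vocabulary drawback is to separate out the words that have no vector before comparing, so the lookup does not fail partway through. The sketch below uses a plain dictionary (`toy_vocabulary`, invented for illustration) as a stand-in for the model's vocabulary, and `split_by_vocabulary` is a hypothetical helper. With gensim, the equivalent membership check should be `word in model.wv`, but verify that against the version you have installed.

```python
def split_by_vocabulary(words, vocabulary):
    """Separate words into those with a vector and those without."""
    known = [w for w in words if w in vocabulary]
    missing = [w for w in words if w not in vocabulary]
    return known, missing

# Stand-in vocabulary, invented for illustration; a real model has its own.
toy_vocabulary = {"Apple": [0.9, 0.1], "Mango": [0.8, 0.3], "Fruit": [0.85, 0.2]}

known_1, missing_1 = split_by_vocabulary(["US", "Apple", "Mango"], toy_vocabulary)
known_2, missing_2 = split_by_vocabulary(["Fruit", "ipod"], toy_vocabulary)

# Only the known words are passed on to the pairwise comparison.
pairs_to_score = [(w1, w2) for w1 in known_1 for w2 in known_2]

print("skipped:", missing_1 + missing_2)
print("to score:", pairs_to_score)
```

Reporting the skipped words, instead of dropping them silently, makes it clear which items were never compared.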