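I'm trying to compare a list of names (copied into one clean file and one messy file) by matching the two files against each other. My problem is that it only returns the top result for each name, and that result is the name itself (the same record exists in both files). What I want to capture is the second result, which would be the closest match other than the name itself.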
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# ngrams (the analyzer callable) is defined elsewhere and not shown here

# Messy names
names = pd.read_csv('C:/Temp/messynames.txt', sep='\t')
org_names = names['VariationName'].unique()
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(org_names)

# Clean names: fit the vectorizer and the nearest-neighbour index on these
clean_org_names = pd.read_csv('C:/Temp/cleannames.txt', sep='\t')
org_name_clean = clean_org_names['StandardName'].unique()
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(org_name_clean)
nbrs = NearestNeighbors(n_neighbors=3, n_jobs=-1).fit(tfidf)

unique_org = set(names['VariationName'].values)

def getNearestN(query):
    # Vectorize the query names and look up their nearest clean names
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

distances, indices = getNearestN(unique_org)
unique_org = list(unique_org)  # need to convert back to a list

matches = []
for i, j in enumerate(indices):
    # Only the first (closest) neighbour of each query is kept here
    temp = [round(distances[i][0], 2), clean_org_names.values[j][0][0], unique_org[i]]
    matches.append(temp)

matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched name', 'Original name'])
matches.to_csv('C:/Temp/matchednames.txt', sep='\t', encoding='utf-8', index=False, quoting=3)
For a file with the following four names:
NOKIA
NOKIAA
NOKIA LMD
NOKIA LTD
the results look like this:
   Match confidence  Matched name  Original name
0  0.0               NOKIA LMD     NOKIA LMD
1  0.0               NOKIAA        NOKIAA
2  0.0               NOKIA         NOKIA
3  0.0               NOKIA LTD     NOKIA LTD
I'm trying to get something more like this:
   Match confidence  Matched name  Original name
0  0.1               NOKIA LTD     NOKIA LMD
1  0.1               NOKIA         NOKIAA
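
A minimal sketch of one way this kind of output might be produced, assuming each name is compared against the same list it came from: ask for more than one neighbour and keep the first neighbour whose name differs from the query. The `ngrams` function below is a hypothetical stand-in for the analyzer that is not shown in the post (plain character trigrams), and `nearest_other` is an illustrative name, not part of scikit-learn. The distances it produces depend on the analyzer, so they will not necessarily look like the 0.1 values above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def ngrams(text, n=3):
    """Hypothetical analyzer: split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def nearest_other(names, n_neighbors=2):
    """For each name, return the closest name in the same list that is not itself."""
    names = list(names)
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
    tfidf = vectorizer.fit_transform(names)
    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(tfidf)
    distances, indices = nbrs.kneighbors(tfidf)

    rows = []
    for i, name in enumerate(names):
        # Neighbours come back sorted by distance; skip the self-match
        for dist, j in zip(distances[i], indices[i]):
            if names[j] != name:
                rows.append([round(float(dist), 2), names[j], name])
                break
    return pd.DataFrame(
        rows,
        columns=['Match confidence (lower is better)', 'Matched name', 'Original name'],
    )


matches = nearest_other(['NOKIA', 'NOKIAA', 'NOKIA LMD', 'NOKIA LTD'])
print(matches)
```

With a separate clean file and messy file, the same idea would apply to the loop in the post: iterate over `zip(distances[i], indices[i])` instead of reading only `distances[i][0]`, and skip any neighbour whose name equals the original.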