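I want to remove stop words from a large amount of text data. I already have the stop word data provided at the link below. It seems to me that a pre-built tree of stop words would save a lot of time: I would look up each word of the text in this pre-built tree, delete the word from the text if it is in the tree, and keep it otherwise.
That should take the lookup cost from O(n * l) to O(n*log(l)).
If you have a better suggestion than searching a pre-built tree, I would be grateful if you shared it with me.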
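To make the complexity claim concrete, here is a minimal sketch, assuming n is the number of words in the text and l is the number of stop words (the post does not define either). A sorted list searched with `bisect` stands in for a balanced search tree; the function names and the sample data are made up for illustration.

```python
from bisect import bisect_left


def remove_stopwords_linear(words, stopword_list):
    # O(n * l) comparisons: `in` on a list scans it element by element
    return [w for w in words if w not in stopword_list]


def remove_stopwords_sorted(words, sorted_stopwords):
    # O(n * log(l)) comparisons: binary search in a sorted list
    kept = []
    for w in words:
        i = bisect_left(sorted_stopwords, w)
        # w is a stop word only if the slot found holds exactly w
        if i == len(sorted_stopwords) or sorted_stopwords[i] != w:
            kept.append(w)
    return kept


# Hypothetical sample data, not the real STOPWORDS file
stopword_list = ["the", "a", "of", "and"]
words = "the history of a small town and its people".split()

print(remove_stopwords_linear(words, stopword_list))
print(remove_stopwords_sorted(words, sorted(stopword_list)))
```

Both calls should keep the same words; only the number of comparisons per lookup differs.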
In the end I found this answer using a trie, but I would like to know whether you have a better option:
Reading the data:
# reading stopword data
import pandas as pd
stopwords = pd.read_csv('STOPWORDS',header=None)
The trie:
# creating the trie
class TrieNode:
    # Trie node class
    def __init__(self):
        # one slot per possible character index (see Trie._charToIndex)
        self.children = [None]*15000
        # isEndOfWord is True if node represents the end of a word
        self.isEndOfWord = False

class Trie:
    # Trie data structure class
    def __init__(self):
        self.root = self.getNode()

    def getNode(self):
        # Returns new trie node (children initialized to None)
        return TrieNode()

    def _charToIndex(self,ch):
        # private helper function
        # Converts the current character of the key into an index:
        # its code point offset from '!', which must fall in 0..14999
        return ord(ch)-ord('!')

    def insert(self,key):
        # If not present, inserts key into trie
        # If the key is a prefix of an existing word,
        # just marks its last node as end of word
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])
            # if current character is not present
            if not pCrawl.children[index]:
                pCrawl.children[index] = self.getNode()
            pCrawl = pCrawl.children[index]
        # mark last node as end of word
        pCrawl.isEndOfWord = True

    def search(self, key):
        # Search key in the trie
        # Returns True if key is present
        # in trie, else False
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])
            if not pCrawl.children[index]:
                return False
            pCrawl = pCrawl.children[index]
        return pCrawl is not None and pCrawl.isEndOfWord
Usage example:
# Input keys: the stop words in the first column of the file
keys = list(stopwords.loc[:,0])
output = ["Not present in trie",
          "Present in trie"]
# Trie object
t = Trie()
# Construct trie
for key in keys:
    t.insert(key)
print("{} ---- {}".format("از",output[t.search("از")]))
Output:
از ---- Present in trie
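The example above only looks up a single word. To apply a trie to the task described at the top (dropping every stop word from a text), something like the following sketch could work. It is self-contained, so it uses a small dict-based trie instead of the array-based one; `DictTrie`, `remove_stopwords`, the whitespace tokenization, and the sample stop words are all assumptions made for illustration.

```python
class DictTrie:
    """Hypothetical dict-based trie: children exist only for characters that occur."""

    _END = ""  # sentinel key; cannot collide with a one-character child key

    def __init__(self):
        self.root = {}

    def insert(self, key):
        node = self.root
        for ch in key:
            # create the child node on first use
            node = node.setdefault(ch, {})
        node[self._END] = True  # mark end of word

    def search(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node


def remove_stopwords(text, trie):
    # Assumes words are separated by whitespace; keeps words not found in the trie
    return " ".join(w for w in text.split() if not trie.search(w))


# Made-up stop words, standing in for the contents of the STOPWORDS file
trie = DictTrie()
for word in ["از", "به", "the", "of"]:
    trie.insert(word)

print(remove_stopwords("the end of the road", trie))  # prints "end road"
```

Storing children in a dict avoids allocating a fixed-size list per node and does not depend on the character's code point falling inside a fixed range.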