
How do I build a searchable tree over Persian text?

Data mining, Python, text mining

2022-02-20 13:23:59

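I want to remove stop words from a large amount of text data. I already have the stop-word list provided at the link below. It seems to me that a prebuilt tree of stop words could save a lot of time: I would look up every word of the text in that tree, delete the word from the text if it is found, and keep it otherwise.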

This would take the search cost from O(n * l) to O(n * log(l)).

Here are my stop words

If you have a better suggestion than searching a prebuilt tree, I would appreciate it if you shared it with me.
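The two complexity figures in the question can be illustrated with a small sketch. It assumes that n is the number of words in the text and l the number of stop words (the question does not define them), and it uses a short made-up stop-word list rather than the real file. A linear scan costs O(l) per word, while binary search over a sorted list costs O(log(l)) per word:

```python
from bisect import bisect_left

# Hypothetical sample data; the real list would come from the stop-word file.
stop_words = sorted(["از", "به", "که", "در", "و"])

def is_stop_word_linear(word, words):
    """O(l) lookup: compare against every stop word in turn."""
    for candidate in words:
        if candidate == word:
            return True
    return False

def is_stop_word_sorted(word, sorted_words):
    """O(log l) lookup: binary search over a sorted list."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

# Split on whitespace (a simplifying assumption) and keep non-stop-words.
tokens = "من از خانه به مدرسه رفتم".split()
kept = [t for t in tokens if not is_stop_word_sorted(t, stop_words)]
print(kept)
```

Both functions return the same answers; only the cost per lookup differs.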

1 Answer

In the end I arrived at this answer using a trie, but I would like to know whether you have a better option:

Reading the data:

import pandas as pd

# Read the stop-word list (one word per row, no header row)
stopwords = pd.read_csv('STOPWORDS', header=None)

The trie:

# Build a trie of stop words
class TrieNode: 

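    # Trie node class
    def __init__(self):
        # One child slot per character index (see Trie._charToIndex);
        # 15000 slots per node is memory-hungry but covers Persian letters
        self.children = [None]*15000

        # isEndOfWord is True if the node marks the end of a word
        self.isEndOfWord = False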

class Trie: 

    # Trie data structure class 
    def __init__(self): 
        self.root = self.getNode() 

    def getNode(self): 

        # Returns new trie node (initialized to NULLs) 
        return TrieNode() 

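    def _charToIndex(self,ch):

        # Private helper: maps a character to its slot index, i.e. its
        # code-point offset from '!' (U+0021). Characters below '!' give a
        # negative index, and code points of 15033 or above overflow the
        # 15000-slot child array.

        return ord(ch)-ord('!')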


    def insert(self,key): 

        # Inserts key into the trie if it is not already present.
        # If the key is a prefix of an existing entry,
        # only the end-of-word flag is set
        pCrawl = self.root 
        length = len(key) 
        for level in range(length): 
            index = self._charToIndex(key[level]) 

            # if current character is not present 
            if not pCrawl.children[index]: 
                pCrawl.children[index] = self.getNode() 
            pCrawl = pCrawl.children[index] 

        # mark last node as leaf 
        pCrawl.isEndOfWord = True

    def search(self, key): 

        # Searches for key in the trie.
        # Returns True if the key is present
        # in the trie, else False
        pCrawl = self.root 
        length = len(key) 
        for level in range(length): 
            index = self._charToIndex(key[level]) 
            if not pCrawl.children[index]: 
                return False
            pCrawl = pCrawl.children[index] 

        return pCrawl is not None and pCrawl.isEndOfWord
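Each node above allocates a 15000-slot list, which can add up quickly in memory. As a possible variation (a sketch only, not part of the original answer), the children can be kept in a dict keyed by character, so a node stores only the branches that actually exist and no character-to-index mapping is needed. The names `DictTrieNode` and `DictTrie` are hypothetical:

```python
class DictTrieNode:
    """Trie node that stores only the children that actually exist."""
    __slots__ = ("children", "is_end_of_word")

    def __init__(self):
        self.children = {}           # maps a character to its child node
        self.is_end_of_word = False  # True if a stored word ends here


class DictTrie:
    def __init__(self):
        self.root = DictTrieNode()

    def insert(self, key):
        """Add key to the trie, creating nodes along the way as needed."""
        node = self.root
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                child = DictTrieNode()
                node.children[ch] = child
            node = child
        node.is_end_of_word = True

    def search(self, key):
        """Return True only if key was inserted as a whole word."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end_of_word


# Small made-up sample instead of the real stop-word file.
trie = DictTrie()
for word in ["از", "به", "که"]:
    trie.insert(word)

print(trie.search("از"))    # a stored word
print(trie.search("ا"))     # only a prefix of a stored word
print(trie.search("خانه"))  # not stored at all
```

Because the dict accepts any character as a key, this version does not depend on the code-point range of the input.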

Usage example:

# Input keys: the stop words from the first column of the file
keys = list(stopwords.loc[:,0])

output = ["Not present in trie", 
        "Present in trie"] 

# Trie object 
t = Trie() 

# Construct trie 
for key in keys: 
    t.insert(key) 


print("{} ---- {}".format("از",output[t.search("از")]))

Output:

از ---- Present in trie
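On the question of a better option: one commonly used alternative is Python's built-in `set`, whose membership test is O(1) on average, so no hand-written tree is needed. The sketch below is only an illustration under stated assumptions: the function name `remove_stop_words` is hypothetical, the stop-word sample is made up, and tokens are obtained by splitting on whitespace, which may be too simple for real Persian text.

```python
def remove_stop_words(text, stop_words):
    """Return text with every whitespace-separated token found in stop_words dropped."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

# Hypothetical sample; with the data above this could be set(stopwords.loc[:, 0]).
stop_word_set = {"از", "به", "که", "در", "و"}

cleaned = remove_stop_words("من از خانه به مدرسه رفتم", stop_word_set)
print(cleaned)
```

The same function also works with the trie by replacing `tok not in stop_words` with a call to its `search` method.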
