Collapsing categorical data into 3 or more categories

data-mining pandas categorical-data categorical-encoding
2022-03-01 21:12:41

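I have a bunch of categorical part-of-speech data that I want to collapse into fewer categories. np.where() won't do, because I want to end up with 6 categories: noun, verb, adjective, adverb, preposition, and other.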

I found that I can do this by passing a dictionary to pandas' replace() method (Series.replace() or DataFrame.replace()).

So I built the following dictionary:

mappings = {"NN" : "noun", "NNS" : "noun", "NNP" : "noun",
            "VB" : "verb", "VBD" : "verb", "VBG" : "verb", "VBN" : "verb", "VBP" : "verb", "VBZ" : "verb",
            "JJ" : "adj", "JJR" : "adj", "JJS" : "adj",
            "RB" : "adv", "RBR" : "adv", "RBS" : "adv",
            "IN" : "prep"}

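The problem is that the data contains more part-of-speech tags than these. Is there a way to push every remaining tag into an "other" category, or do I have to type out all the other possible tags by hand?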
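To make the issue concrete, here is a minimal sketch of what replace() does with tags that are missing from the dictionary. The shortened dictionary and the sample tags ("DT", "CC") are made up for illustration and are not taken from the actual data.

```python
import pandas as pd

# Shortened, illustrative version of the mapping (assumption: one tag per category)
mappings = {"NN": "noun", "VB": "verb", "JJ": "adj", "RB": "adv", "IN": "prep"}

# Hypothetical sample tags; "DT" and "CC" are not keys in the dictionary
tags = pd.Series(["NN", "VB", "DT", "CC", "IN"])

# replace() only touches values that appear as keys; everything else passes through
replaced = tags.replace(mappings)
print(replaced.tolist())  # ['noun', 'verb', 'DT', 'CC', 'prep']
```

The unmapped tags survive unchanged, so replace() on its own does not produce an "other" bucket.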

1 Answer

You can use NumPy's select function.

You will need to adapt it to your data, but it would look something like this:

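import numpy as np

# Assumes df is an existing DataFrame whose tag column is named my_column.
nouns = ["NN", "NNS", "NNP"]
verbs = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
adjs = ["JJ", "JJR", "JJS"]
advs = ["RB", "RBR", "RBS"]
preps = ["IN"]

# One boolean mask per target category, in the same order as choicelist.
condlist = [
    df.my_column.isin(nouns),
    df.my_column.isin(verbs),
    df.my_column.isin(adjs),
    df.my_column.isin(advs),
    df.my_column.isin(preps),
]

choicelist = ["noun", "verb", "adj", "adv", "prep"]

# Rows that match none of the conditions fall back to the default, "other".
df["group"] = np.select(condlist=condlist, choicelist=choicelist, default="other")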
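If you would rather keep the dictionary from the question, a possible alternative is Series.map() followed by fillna(). This is only a sketch: the toy DataFrame and the column names pos_tag and pos_group are hypothetical, and it swaps np.select for a different technique rather than restating the approach above.

```python
import pandas as pd

# Dictionary from the question: fine-grained tag -> coarse category
mappings = {"NN": "noun", "NNS": "noun", "NNP": "noun",
            "VB": "verb", "VBD": "verb", "VBG": "verb",
            "VBN": "verb", "VBP": "verb", "VBZ": "verb",
            "JJ": "adj", "JJR": "adj", "JJS": "adj",
            "RB": "adv", "RBR": "adv", "RBS": "adv",
            "IN": "prep"}

# Hypothetical toy data; "DT" and "CC" have no entry in the dictionary
df = pd.DataFrame({"pos_tag": ["NN", "VBD", "JJ", "RB", "IN", "DT", "CC"]})

# map() yields NaN for tags missing from the dictionary;
# fillna() then turns those NaN values into the catch-all category
df["pos_group"] = df["pos_tag"].map(mappings).fillna("other")

print(df["pos_group"].tolist())
# ['noun', 'verb', 'adj', 'adv', 'prep', 'other', 'other']
```

One caveat with this variant: rows where the tag itself is missing (NaN) also end up as "other", which may or may not be what you want.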