Retrieving food groups from a list of foods

data-mining python pandas
2022-02-27 21:04:23

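I have a dataframe of foods, shown below. I need to create a food_group list giving the food group each item belongs to; for example, all types of yoghurt should go into a single group named yogurt.

I used a snippet that takes the first segment of the comma-separated name, but that does not put all the yoghurts in one group: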

food_group_0 = [i.split(',') for i in data['name']]

food_group = [item[0] for item in food_group_0]


#To count how many of each entry there are in the list you can use the Counter class in the collections module:
from collections import Counter
c = Counter(food_group) 
print(c)

Dataframe

0                                          4-Grain Flakes
1                             4-Grain Flakes, Gluten Free
2                  4-Grain Flakes, Riihikosken Vehnämylly
3                                                  Almond
4                          Almond Drink, Sweetened, Alrpo
5                        Almond Drink, Unsweetened, Alrpo
6                                         Amaranth Flakes
7                                                 Anchovy
8                               Apple, Average, With Skin
9                           Apple, Domestic, Without Skin
10                             Apple, Domestic, With Skin
11                                           Apple, Dried
12                          Apple, Imported, Without Skin
13                             Apple, Imported, With Skin
14                                            Apple Chips
15                 Apple Crisp Delight, Apple, Oat Flakes
16                                              Apple Jam
17                    Apple Juice, Unsweetened, Vitamin C
18                 Apple Kissel, Apple Soup, Dried Apples
19                 Apple Kissel, Apple Soup, Fresh Apples
20      Apple Pie, Basic Sweet Dough, Gluten-Free, Con...
21             Apple Pie, Basic Sweet Dough, Low-Fat Milk
22      Apple Pie, Basic Sweet Dough, Naturally Gluten...
23               Apple Pie, Basic Sweet Dough, Whole Milk
24                            Apple Pie, Shortbread Crust
25      Apple Pie, Shortbread Crust, Gluten-Free, Cont...
26      Apple Pie, Shortbread Crust, Naturally Gluten-...
27             Apple Pie, Shortbread Crust With Sour Milk
28                          Apple Pie, Soft, Low-Fat Milk
29         Apple Pie With Quark Filling, Shortbread Crust
                              ...                        
4068    Yoghurt, Plain, A+, Fat 2.5%, 1 Ug Vitamin D, ...
4069    Yoghurt, Plain, A+, Fat 2.5%, Lactose-Free, 1 ...
4070    Yoghurt, Plain, A+, Fat 4%, 1 Ug Vitamin D, La...
4071    Yoghurt, Plain, A+, Fatfree, 1 Ug Vitamin D, L...
4072    Yoghurt, Plain, A+ Greek, 2 % Fat, Lactose-Fre...
4073             Yoghurt, Plain, Ab, 0.2% Fat, Probiotics
4074             Yoghurt, Plain, Ab, 2.5% Fat, Probiotics
4075                    Yoghurt, Plain, Activia, 3.4% Fat
4076    Yoghurt, Plain, Arla Protein, 1% Fat, Lactose-...
4077                    Yoghurt, Plain, Bulgarian, 9% Fat
4078                             Yoghurt, Plain, Fat-Free
4079    Yoghurt, Plain, Fat-Free, Lactose-Free, 1 Ug V...
4080    Yoghurt, Plain, Fat-Free, Low-Lactose, 0.5 Ug ...
4081          Yoghurt, Plain, Greek, 7% Fat, Lactose-Free
4082                      Yoghurt, Plain, Organic, 3% Fat
4083    Yoghurt, Plain, Pirkka Reducol, 2.5% Fat, Low-...
4084                      Yoghurt, Turkish/Greek, 10% Fat
4085        Yoghurt, Turkish/Greek, 10% Fat, Lactose-Free
4086                                        Yoghurt Sauce
4087                           Yoghurt With Jam, Fat-Free
4088       Yoghurt With Muesli, A+, Fat 3.5%, Low-Lactose
4089    Yoghurt With Quark, Flavoured, Arla, 1.4% Fat,...
4090    Yoghurt With Quark, Flavoured, Luonto+, 1.2% F...
4091    Yoghurt With Quark, Flavoured, Valio, 1.7% Fat...
4092                                   Zander, Pike-Perch
4093                        Zucchini, Boiled Without Salt
4094                              Zucchini, Summer Squash
4095                     Zucchini Filled With Minced Meat
4096                   Zucchini Filled With Soya And Rice
4097                      Zucchini Filled With Vegetables
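1 Answer

You can do the string splitting and indexing on the column itself; there is no need to extract the column and use list comprehensions.

Below, I put whatever comes before the first comma into a column named food_group, then take the first field after that comma and put it in a new column named sub_cat (sub-category):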

df["food_group"] = df.name.str.split(",").str[0]
df["sub_cat"] = df.name.str.split(",").str[1]
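One detail worth noting: splitting on "," keeps the space that follows each comma, so the later fields come out as " Plain" rather than "Plain". The following is a minimal sketch on a few names taken from the question (the three-row frame is made up for illustration) that strips the whitespace and splits only once:

```python
import pandas as pd

# Small illustrative frame; the real data has about 4,000 rows.
df = pd.DataFrame({"name": [
    "Yoghurt, Plain, Organic, 3% Fat",
    "Yoghurt, Turkish/Greek, 10% Fat",
    "Yoghurt Sauce",
]})

# Split once and reuse the result for both columns.
parts = df["name"].str.split(",")

# .str.strip() removes the space left behind after each comma.
df["food_group"] = parts.str[0].str.strip()
df["sub_cat"] = parts.str[1].str.strip()

print(df[["food_group", "sub_cat"]])
print(df["food_group"].value_counts())
```

Here value_counts() plays the role of the Counter from the question, without leaving pandas.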

Here is some sample output for the yoghurt data:

    id                                               name      food_group     sub_cat

44  4082                    Yoghurt, Plain, Organic, 3% Fat    Yoghurt        Plain
45  4083  Yoghurt, Plain, Pirkka Reducol, 2.5% Fat, Low-...    Yoghurt        Plain
46  4084                    Yoghurt, Turkish/Greek, 10% Fat    Yoghurt        Turkish/Greek
47  4085      Yoghurt, Turkish/Greek, 10% Fat, Lactose-Free    Yoghurt        Turkish/Greek
48  4086                                      Yoghurt Sauce    Yoghurt Sauce  NaN

Note that any empty field is filled with NaN. This happens when your name column contains only a single field (i.e. no comma).
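If the NaN values get in the way later, one possible way to handle them is sketched below. The column name sub_cat_filled and the choice of an empty string as the fill value are assumptions for illustration, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Almond", "Almond Drink, Sweetened, Alrpo"]})
df["sub_cat"] = df["name"].str.split(",").str[1].str.strip()

# Boolean mask: True where the name had at least one comma.
has_sub_cat = df["sub_cat"].notna()

# Hypothetical extra column with NaN replaced by an empty string.
df["sub_cat_filled"] = df["sub_cat"].fillna("")

print(df)
```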

Edit

Here is the top of my dataframe after the operations above:

In [13]: df.head(10)                                                                                                                                                   
Out[13]: 
   id                                    name       food_group                  sub_cat
0   0                          4-Grain Flakes   4-Grain Flakes                      NaN
1   1             4-Grain Flakes, Gluten Free   4-Grain Flakes              Gluten Free
2   2  4-Grain Flakes, Riihikosken Vehnämylly   4-Grain Flakes   Riihikosken Vehnämylly
3   3                                  Almond           Almond                      NaN
4   4          Almond Drink, Sweetened, Alrpo     Almond Drink                Sweetened
5   5        Almond Drink, Unsweetened, Alrpo     Almond Drink              Unsweetened
6   6                         Amaranth Flakes  Amaranth Flakes                      NaN
7   7                                 Anchovy          Anchovy                      NaN
8   8               Apple, Average, With Skin            Apple                  Average
9   9           Apple, Domestic, Without Skin            Apple                 Domestic

Edit

To replace a row's value with another string whenever that string occurs in the row, you can do the following:

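df["new_col"] = df.name  # start from the original name
for keyword in keywords:
    # update new_col cumulatively so earlier matches are not overwritten
    df["new_col"] = df.new_col.apply(lambda x: keyword if keyword in x else x)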

where keywords could be a list like this:

keywords = ["Yogurt", "chicken", "Drink"]

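This still requires defining the keyword list manually and looping over it. Note that the test is a plain substring match, so "Yogurt" will not match names spelled "Yoghurt" as in the data above. You can also make the matching case-insensitive by lowercasing everything, e.g.: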

lower_keywords = ["yogurt", "chicken", "drink"]

df["new_col"] = df.name  # start from the original name
for keyword in lower_keywords:
    df["new_col"] = df.new_col.apply(lambda x: keyword if keyword in x.lower() else x)
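The same idea can be written without apply. This is only a sketch under a few assumptions: the keyword_to_group dictionary is hypothetical and maps a lowercase substring to the label you want (so "yoghurt" in the data can end up in a group called "yogurt", as the question asks), and names that match no keyword fall back to the text before the first comma:

```python
import pandas as pd

df = pd.DataFrame({"name": [
    "Yoghurt, Plain, Organic, 3% Fat",
    "Yoghurt Sauce",
    "Almond Drink, Sweetened, Alrpo",
    "Anchovy",
]})

# Hypothetical mapping: lowercase substring to look for -> group label.
keyword_to_group = {"yoghurt": "yogurt", "drink": "drink"}

# Default group: the text before the first comma.
df["food_group"] = df["name"].str.split(",").str[0].str.strip()

# Lowercase once, then overwrite the default wherever a keyword matches.
lower_names = df["name"].str.lower()
for keyword, group in keyword_to_group.items():
    mask = lower_names.str.contains(keyword, regex=False)
    df.loc[mask, "food_group"] = group

print(df)
```

If a name matches several keywords, the last matching entry in the dictionary wins, so the order of the entries matters.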

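You could go on to create a MultiIndex from these two new columns, but that may not be necessary; it depends on what you want to do with the data afterwards.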
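For reference, a MultiIndex built from the two columns might look like the sketch below. The sample rows come from the question; the variable names indexed and apples are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"name": [
    "Apple, Average, With Skin",
    "Apple, Domestic, Without Skin",
    "Yoghurt, Plain, Organic, 3% Fat",
]})
parts = df["name"].str.split(",")
df["food_group"] = parts.str[0].str.strip()
df["sub_cat"] = parts.str[1].str.strip()

# Two-level index: food_group on the outside, sub_cat on the inside.
indexed = df.set_index(["food_group", "sub_cat"]).sort_index()

# Selecting on the outer level returns every row in that group.
apples = indexed.loc["Apple"]
print(apples)
```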