KeyError: selecting text from a dataframe based on values from another dataframe

data-mining | python | pandas | nlp | dataframe
2022-03-07 16:21:42

I have the following two dataframes, badges and comments. I created a list of "gold users" from the badges dataframe where Class == 1.

Here Name means the name of the badge and Class its level (1 = Gold, 2 = Silver, 3 = Bronze).

I have already preprocessed the text in comments['Text'], and now want to select the gold users' comments from comments['Text'].

I tried the code below, but I get this error:

"KeyError: "None of [Index(['1532', '290', '1946', '1459', '6094', '766', '10446', '3106', '1',\n       '1587',\n       ...\n       '35760', '45979', '113061', '35306', '104330', '40739', '4181', '58888',\n       '2833', '58158'],\n      dtype='object', length=1708)] are in the [index]"

How can I fix this?

Dataframe 1 (badges)

   Id | UserId |  Name          |        Date              |Class | TagBased
   2  | 23     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   3  | 22     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   4  | 21     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   5  | 20     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False
   6  | 19     | Autobiographer | 2016-01-12T18:44:49.267  |   3  | False

Dataframe 2 (comments)

   Id|                    Text                             |    UserId  
    6|  [2006, course, allen, knutsons, 2001, course, ...  |    3   
    8|  [also, theo, johnsonfreyd, note, mark, haimans...  |    1

Code

for index,rows in comments.iterrows():
  gold_comments = rows[comments.Text.loc[gold_users]]
  Counter(gold_comments)
1 Answer

You can work through this simple example and then apply the same idea to your problem. I have a dataset of quotes about animals and fruits, and I need to find the most frequent words in each category. CountVectorizer is useful here.

Consider the data:

(screenshot of the example quotes dataset, not reproduced here)
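Since the original screenshot is lost, here is a minimal sketch of what such a dataset might look like. The column names `Type` and `description` are inferred from the code below; the actual rows are hypothetical and will not reproduce the exact word counts shown in the outputs further down.

```python
import pandas as pd

# Hypothetical reconstruction of the quotes dataset (the original image is lost);
# the column names 'Type' and 'description' are inferred from the helper below.
data = pd.DataFrame({
    'Type': ['Animal', 'Animal', 'Fruits', 'Fruits'],
    'description': [
        'my cats love my cats more than dogs',
        'love is what my cats give more than anything',
        'the fruit of patience is the fruit we all want',
        'we share fruit because fruit is the best snack',
    ],
})
print(data)
```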

Code snippet:

from sklearn.feature_extraction.text import CountVectorizer

def return_word_count_segment_wise(data, category):
    # Keep only the 5 most frequent words in the given category.
    count_vec = CountVectorizer(max_features=5)
    docs = data[data['Type'] == category].description

    # Fit the vocabulary and build the document-term matrix in one step.
    counts = count_vec.fit_transform(docs)

    feature_list = count_vec.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    count_list = counts.toarray().sum(axis=0)
    return dict(zip(feature_list, count_list))


return_word_count_segment_wise(data, 'Animal') 

Output: {'cats': 3, 'is': 2, 'love': 4, 'my': 4, 'than': 3}

return_word_count_segment_wise(data, 'Fruits')

Output: {'fruit': 8, 'of': 5, 'that': 2, 'the': 3, 'we': 3}


Answering the question raised in the comments:

Try merging the two dataframes, then call the function once per user group, filtering by Class (1/2/3):

merged_df = pd.merge(badges, comments, on='UserId')

return_word_count_segment_wise(merged_df, 1)  # top words for the Gold class
return_word_count_segment_wise(merged_df, 2)  # top words for the Silver class
return_word_count_segment_wise(merged_df, 3)  # top words for the Bronze class

And in case you cannot merge, you can filter one dataframe using the other:

to_check = comments[comments['UserId'].isin(badges[badges['Class'] == 1].UserId)]
return_word_count_segment_wise(to_check, 1)
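Since `comments['Text']` already holds tokenized lists after your preprocessing, the original goal (the most common words among gold users' comments) can also be reached without a vectorizer at all, using `isin` plus the `Counter` you were already importing. A minimal sketch, with toy frames standing in for your badges and comments:

```python
from collections import Counter

import pandas as pd

# Toy frames mirroring the structure of the question's badges/comments dataframes.
badges = pd.DataFrame({'UserId': [1, 3, 5], 'Class': [1, 1, 3]})
comments = pd.DataFrame({
    'UserId': [1, 3, 5],
    'Text': [['course', 'allen'], ['course', 'note'], ['bronze', 'talk']],
})

# Select the comments whose authors hold a gold (Class == 1) badge.
gold_ids = badges.loc[badges['Class'] == 1, 'UserId']
gold_comments = comments[comments['UserId'].isin(gold_ids)]

# Flatten the token lists and count word frequencies across all gold comments.
word_counts = Counter(tok for toks in gold_comments['Text'] for tok in toks)
print(word_counts.most_common(10))
```

This avoids the original `KeyError`, which came from using user ids as `.loc` labels on `comments.Text`: the ids are values of the `UserId` column, not labels of the index, so filtering with `isin` on that column is the way to match them.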