Finding the mode and frequency in R

data-mining r statistics data data-cleaning dplyr

2022-02-16 12:52:57

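I'm trying to come up with a function in R that gives the mode of a column along with the number of times that value occurs (its frequency). I'd like it to exclude missing (or blank) values and to handle ties by showing both values. When there are no repeated values, I'd like it to return the first value that appears, with a frequency of 1.

 Name         Color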
 Drew         Blue
 Drew         Green
 Drew         Red
 Bob          Green
 Bob          Green
 Bob          Green
 Bob          Blue
 Jim          Red 
 Jim          Red
 Jim          blue
 Jim          blue

mode of Drew = Blue, 1
mode of Bob = Green, 3
mode of Jim = Red, Blue, 2

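Here is the function I have so far. It excludes NAs, but it doesn't return both values when there is a tie, and it doesn't show the frequency. Any help is appreciated!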

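mode <- function(x) {
  if (anyNA(x)) x <- x[!is.na(x)]          # drop missing values
  ux <- unique(x)                          # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]    # first value with the highest count
}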
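To make the missing pieces concrete, here is a minimal base R sketch that extends the same `unique`/`tabulate`/`match` idea so that it returns every tied value together with its count. The name `mode_with_freq` and the returned column names `Mode` and `Count` are assumptions made for illustration, not part of the original post. Blank strings are treated as missing, as the question asks.

```r
# Hypothetical helper (not from the original post): returns every value tied
# for the highest count, together with that count.
mode_with_freq <- function(x) {
  x <- as.character(x)                      # treat factors and strings alike
  x <- x[!is.na(x) & trimws(x) != ""]       # drop NA and blank entries
  if (length(x) == 0) {
    return(data.frame(Mode = character(0), Count = integer(0)))
  }
  ux <- unique(x)                           # values in order of first appearance
  counts <- tabulate(match(x, ux))          # frequency of each unique value
  top <- max(counts)
  if (top == 1) {
    keep <- 1L                              # no repeats: keep the first value only
  } else {
    keep <- which(counts == top)            # all values tied for the maximum
  }
  data.frame(Mode = ux[keep], Count = counts[keep], stringsAsFactors = FALSE)
}

mode_with_freq(c("Red", "Red", "Blue", "Blue", NA, ""))   # tie between Red and Blue
mode_with_freq(c("Blue", "Green", "Red"))                 # no repeats
```

Because `unique` preserves the order of appearance, the "no repeats" case returns the value that occurs first in the input, which is what the question asks for. To apply it per name, one option would be `lapply(split(df$Color, df$Name), mode_with_freq)`.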

1 Answer

You don't need a custom function to do this. Let dplyr handle it. Assuming your data is in a data frame named df, it could look like this:

library(dplyr)

df %>%                                         # Start the pipe
  subset(complete.cases(df)) %>%               # Remove rows with NA values
  group_by(Name) %>%                           # Group by the Name column
  count(Color) %>%                             # Count each Color within Name; adds column n
  mutate(max = max(n)) %>%                     # Add the per-group maximum of n
  subset(n == max) %>%                         # Keep only rows where n equals the group maximum
  mutate(Keep = case_when(                     # Add a logical helper column named Keep
    n > 1 ~ TRUE,                              # TRUE for n > 1, so ties are kept
    n == 1 & Color == head(Color, 1) ~ TRUE,   # TRUE for the first row when n = 1
    TRUE ~ FALSE)) %>%                         # FALSE in all other cases
  subset(Keep) %>%                             # Keep only rows where Keep is TRUE
  select(Name, Mode = Color, Count = n)        # Keep Name, Color, and n, renaming
                                               # Color to Mode and n to Count

Here is the output:

 # A tibble: 4 x 3
 # Groups:   Name [3]
   Name  Mode   Count
   <fct> <fct>  <int>
 1 Bob   Green      3
 2 Drew  Blue       1
 3 Jim   Blue       2
 4 Jim   Red        2

If you want a function, wrap it in a function definition:

my_mode_func <- function(df){
df %>% 
   subset(complete.cases(df)) %>%
   group_by(Name) %>%
   count(Color) %>%
   mutate(max = max(n)) %>%
   subset(n == max) %>%
   mutate(Keep = case_when(
      n > 1 ~ TRUE,
      n == 1 & Color == head(Color,1) ~ TRUE,
      TRUE ~ FALSE)) %>%
   subset(Keep) %>%
   select(Name, Mode = Color, Count = n)
}
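As an alternative sketch, the same result can probably be reached with fewer steps by letting dplyr's grouped `filter()` compute the per-group maximum directly. The sample data frame below is rebuilt from the question, with the assumption that the lowercase "blue" entries are meant to be "Blue"; the object name `modes` is also just illustrative.

```r
library(dplyr)

# Sample data from the question (assumption: colour spelling normalised to "Blue").
df <- data.frame(
  Name  = rep(c("Drew", "Bob", "Jim"), times = c(3, 4, 4)),
  Color = c("Blue", "Green", "Red",
            "Green", "Green", "Green", "Blue",
            "Red", "Red", "Blue", "Blue")
)

modes <- df %>%
  filter(!is.na(Color), Color != "") %>%        # drop missing and blank colours
  count(Name, Color, name = "Count") %>%        # one row per Name/Color pair
  group_by(Name) %>%
  filter(Count == max(Count),                   # rows at the per-group maximum
         Count > 1 | row_number() == 1) %>%     # all ties, or one row if nothing repeats
  ungroup() %>%
  rename(Mode = Color)

modes
```

On this data it should give the same four rows as the output shown above. One caveat: `count()` sorts its result, so when nothing repeats, the row that is kept is the first in sorted order rather than the first to appear in the data. For Drew the two happen to coincide.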
