我想在熊猫数据框中做子组

数据挖掘 Python 熊猫 数据框 python-3.x
2022-02-25 22:04:27

我有一个如下的数据框,我想先按trip_number 分组,然后按模式分组。

trip_number         mode    Km          time
    1               0       10      2021-10-01 01:44:50+00:00   
    1               0       10      2021-10-01 01:44:51+00:00
    1               0       10      2021-10-01 01:44:52+00:00
    1               1       10.2    2021-10-01 01:44:53+00:00
    1               1       10.3    2021-10-01 01:44:54+00:00
    1               1       10.5    2021-10-01 01:44:55+00:00
    1               1       11      2021-10-01 01:44:56+00:00
    1               1       11.2    2021-10-01 01:44:57+00:00
    1               0       11.2    2021-10-01 01:44:58+00:00
    1               0       11.2    2021-10-01 01:44:59+00:00
    1               0       11.2    2021-10-01 01:45:00+00:00
    1               0       11.2    2021-10-01 01:45:01+00:00
    2               0       11.2    2021-10-01 01:45:02+00:00
    2               0       11.2    2021-10-01 01:45:03+00:00
    2               0       11.2    2021-10-01 01:45:04+00:00   
    2               0       11.2    2021-10-01 01:45:05+00:00
    2               1       11.2    2021-10-01 01:45:06+00:00
    2               1       11.5    2021-10-01 01:45:07+00:00
    2               1       11.6    2021-10-01 01:45:08+00:00
    2               0       11.6    2021-10-01 01:45:09+00:00
    2               0       11.6    2021-10-01 01:45:10+00:00
    2               0       11.6    2021-10-01 01:45:11+00:00
    2               0       11.6    2021-10-01 01:45:12+00:00
    2               0       11.6    2021-10-01 01:45:13+00:00
    2               0       11.6    2021-10-01 01:45:14+00:00 

有人可以帮助我或指导我如何实现这一目标。我想要的输出数据框是这样的:

trip_number     mode    startkm     endkm           starttime                   endtime                     subtrip
    1           0       10          10      2021-10-01 01:44:50+00:00       2021-10-01 01:44:52+00:00           1
    1           1       10.2        11.2    2021-10-01 01:44:53+00:00       2021-10-01 01:44:57+00:00           2
    1           0       11.2        11.2    2021-10-01 01:44:58+00:00       2021-10-01 01:45:01+00:00           3
    2           0       11.2        11.2    2021-10-01 01:45:02+00:00       2021-10-01 01:45:05+00:00           1
    2           1       11.2        11.6    2021-10-01 01:45:06+00:00       2021-10-01 01:45:08+00:00           2
    2           0       11.6        11.6    2021-10-01 01:45:09+00:00       2021-10-01 01:45:14+00:00           3
1个回答

首先,您必须创建一个指示子行程的列,这可以通过将模式列的值与它之前的行中的值进行比较来完成。然后,您可以简单地对 、 和 列进行分组trip_numbersubtrip聚合mode以获得所需列的最小值和最大值。

(
    df
    # compare value for mode column with previous row
    .assign(subtrip = lambda x: (x["mode"] != x.groupby("trip_number")["mode"].shift(1))) 
    # use cumsum() to get the subtrip number within each group
    .assign(subtrip = lambda x: x.groupby("trip_number")["subtrip"].transform(pd.Series.cumsum))
    .groupby(["trip_number", "subtrip", "mode"])
    .agg(
        startkm = ("Km", "min"),
        endkm = ("Km", "max"),
        starttime = ("time", "min"),
        endtime = ("time", "max"),
    )
    .reset_index()
)
行程号码 次行程 模式 开始公里 终点公里 开始时间 时间结束
1 1 0 10 10 2021-10-01 01:44:50+00:00 2021-10-01 01:44:52+00:00
1 2 1 10.2 11.2 2021-10-01 01:44:53+00:00 2021-10-01 01:44:57+00:00
1 3 0 11.2 11.2 2021-10-01 01:44:58+00:00 2021-10-01 01:45:01+00:00
2 1 0 11.2 11.2 2021-10-01 01:45:02+00:00 2021-10-01 01:45:05+00:00
2 2 1 11.2 11.6 2021-10-01 01:45:06+00:00 2021-10-01 01:45:08+00:00
2 3 0 11.6 11.6 2021-10-01 01:45:09+00:00 2021-10-01 01:45:14+00:00