Generating a 3D "matrix" with Pandas by comparing two DataFrames [Python]

Computational Science Python
2021-12-14 04:27:42

Good morning, everyone. I am working with Python and Pandas.

I have two DataFrames of the following kind:

df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])

   C1  C2  C3
0  -3  -1  -1
1   5   3   3
2   3   3   1
3  -1  -1  -3
4  -3  -1  -1
5   2   3   1
6   1   1   1


df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])

   F1  F2  F3  F4  F5
0  -1   1  -1  -1  -1
1   1   1   1   1   1
2   1   1   1  -1   1
3   1  -1  -1  -1   1
4  -1   0   0  -1  -1
5   1   1   1  -1   0
6   1   1  -1   1  -1

I would like to "cross" these two DataFrames to produce a 3D structure like the one shown below:

[Image: 3D matrix]

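The new data must be generated by comparing the values of df_F with the values of df_C, according to the following rules:

  • If both values are positive, generate 1
  • If both values are negative, generate 1
  • If one value is positive and the other is negative, generate 0
  • If either value is zero, generate None (NaN)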

Truth table

Comparison of df_C and df_F values

df_C    df_F    3D
  +       +     1
  +       -     0
  +       0     None
  -       +     0
  -       -     1
  -       0     None
  0       +     None
  0       -     None
  0       0     None
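
The truth table above reduces to the sign of the product of the two values. Here is a minimal scalar sketch of that idea. The helper name `compare_signs` is hypothetical and is not part of the original post:

```python
def compare_signs(c, f):
    """Return 1 for equal non-zero signs, 0 for opposite signs, None if either value is zero."""
    product = c * f
    if product > 0:      # (+, +) or (-, -)
        return 1
    if product < 0:      # (+, -) or (-, +)
        return 0
    return None          # at least one of the values is zero


# Walk through all nine rows of the truth table.
for c in (1, -1, 0):
    for f in (1, -1, 0):
        print(c, f, compare_signs(c, f))
```

The vectorized answers below rely on this same product-sign observation, applied to whole arrays instead of single values.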

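You are the programming experts, so could you please guide me on how to compare the values while generating this matrix? I would like to do it with Pandas. I have already done it with loops (for) and conditionals (if), but the result is visually unpleasant, and I think that using Pandas would be more efficient and elegant.

Thank you.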

1 Answer

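Pandas does not currently provide a 3D data structure. I would sometimes like to have one, but that option (the former Panel class) was deprecated and has since been removed. However, such data can be represented in long format (also known as EAV, entity-attribute-value) with three key columns (or index levels).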
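
To make the long-format idea concrete, here is a small sketch on toy data. The column names `row`, `c_col`, `f_col`, and `value` are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data: one value per (row, c_col, f_col) triple.
long_df = pd.DataFrame({
    "row":   [0, 0, 1, 1],
    "c_col": ["C1", "C1", "C1", "C1"],
    "f_col": ["F1", "F2", "F1", "F2"],
    "value": [1, 0, 1, 1],
})

# Turning the three key columns into index levels gives 3D-style lookups.
cube = long_df.set_index(["row", "c_col", "f_col"])["value"]

print(cube.loc[(0, "C1", "F2")])   # a single cell
print(cube.xs(1, level="row"))     # the 2D "slice" that belongs to row 1
```

Each index level plays the role of one axis of the 3D matrix, so no dedicated 3D container is needed.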

With that in mind, Jan Šimbera from the DataScience group suggested the following code:

(
    df_C
    # Transform to long format (two columns: former column names under `variable`
    # and corresponding values under `value`) plus the original index.
    .melt(ignore_index=False)
    # Join with the other dataframe, similarly transformed. join() implicitly joins
    # on indexes, so this will generate all combinations of the `variable` column values.
    .join(df_F.melt(ignore_index=False), lsuffix='_C', rsuffix='_F')
    # Make the index a regular column.
    .rename_axis('index')
    .reset_index()
    # Your rules can be expressed by multiplying the two value columns and examining the sign.
    .assign(combined=lambda df: df.value_C * df.value_F)
    .assign(output=lambda df:
        # Uses the Pandas nullable boolean type (three values: True, False, NA).
        pd.Series(pd.NA, index=df.index, dtype='boolean')
        # If combined is positive, both values were non-zero with the same sign.
        .mask(df.combined > 0, True)
        # If combined is negative, both values were non-zero with opposite signs.
        .mask(df.combined < 0, False)
        # If combined is zero, either of the values was zero, and the NA is retained.
    )
    # Remove intermediary values. The first three columns can also be transformed
    # to a MultiIndex.
    [['index', 'variable_C', 'variable_F', 'output']]
)

The complete code is:

import pandas as pd

df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])

df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])

# The chained calls must be wrapped in parentheses to span several lines.
eav = (
    df_C.melt(ignore_index=False)
    .join(df_F.melt(ignore_index=False), lsuffix='_C', rsuffix='_F')
    .rename_axis('index')
    .reset_index()
    .assign(combined=lambda df: df.value_C * df.value_F)
    .assign(output=lambda df:
        pd.Series(pd.NA, index=df.index, dtype='boolean')
        .mask(df.combined > 0, True)
        .mask(df.combined < 0, False)
    )
    [['index', 'variable_C', 'variable_F', 'output']]
)

As a result, we get a DataFrame with the following structure:

>>> eav

     index variable_C variable_F  output
0        0         C1         F1    True
1        0         C1         F2   False
2        0         C1         F3    True
3        0         C1         F4    True
4        0         C1         F5    True
..     ...        ...        ...     ...
100      6         C3         F1    True
101      6         C3         F2    True
102      6         C3         F3   False
103      6         C3         F4    True
104      6         C3         F5   False
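
The comments in the answer mention that the three key columns can also become a MultiIndex. The following is only a sketch of that step, rebuilding the same long-format frame and then extracting one 2D slice. The names `cube` and `slice_4` are assumptions made for this example:

```python
import pandas as pd

df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])

# Same long-format construction as in the answer, written in two steps.
eav = (
    df_C.melt(ignore_index=False)
    .join(df_F.melt(ignore_index=False), lsuffix='_C', rsuffix='_F')
    .rename_axis('index')
    .reset_index()
)
product = eav.value_C * eav.value_F
eav['output'] = (
    pd.Series(pd.NA, index=eav.index, dtype='boolean')
    .mask(product > 0, True)
    .mask(product < 0, False)
)

# Move the three key columns into a MultiIndex ...
cube = eav.set_index(['index', 'variable_C', 'variable_F'])['output']

# ... and pull out the C-by-F table that belongs to original row 4.
slice_4 = cube.xs(4, level='index').unstack('variable_F')
print(slice_4)
```

With the MultiIndex in place, `xs` selects along one "axis" and `unstack` turns the remaining two levels back into an ordinary 2D table.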

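Another possible approach is to use NumPy. In this case we have two possible solutions, one "long" and one short.

First solution [NumPy]

This solution comes courtesy of Cassandra Sinclair from the Computer Science group, who suggested: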

# The first step is to observe that the relationship can be expressed as a multiplication of signs.
# numpy.sign(x) gives 0 if x is zero, 1 if x is positive and -1 if x is negative.
# Since the rule checks for sign equality: equal signs multiply to 1, opposite signs
# multiply to -1, and anything multiplied by 0 yields 0.

import numpy as np
import pandas as pd

df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])

df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])


Cs = np.sign(df_C.values)
Fs = np.sign(df_F.values)

# The next step is to make the correct kind of broadcast. Using A[:, None], we introduce a new dimension after the first:

assert Cs[:, None].shape == (7, 1, 3)

# So we expand Fs with an additional axis in the middle (shape (7, 1, 5)) and Cs with an additional trailing axis (shape (7, 3, 1)). Broadcasting then multiplies, row by row, every value of C with every value of F.

F2 = Fs[:, None]
C2 = Cs[:,:, None]

# Finally, we multiply and keep the intermediate result so that np.where can replace 0 with None and -1 with 0. However, consider keeping -1, 0, 1 instead: it uses less memory, avoids extra copies and is easier to work with.

S = F2*C2
assert S.shape == (7,3,5)
S = np.where(S==0, None, S)
S = np.where(S==-1, 0, S)

As a result, we get a numpy.ndarray with the following structure:

>>> S
array([[[1, 0, 1, 1, 1],
        [1, 0, 1, 1, 1],
        [1, 0, 1, 1, 1]],

       [[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]],

       [[1, 1, 1, 0, 1],
        [1, 1, 1, 0, 1],
        [1, 1, 1, 0, 1]],

       [[0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0]],

       [[1, None, None, 1, 1],
        [1, None, None, 1, 1],
        [1, None, None, 1, 1]],

       [[1, 1, 1, 0, None],
        [1, 1, 1, 0, None],
        [1, 1, 1, 0, None]],

       [[1, 1, 0, 1, 0],
        [1, 1, 0, 1, 0],
        [1, 1, 0, 1, 0]]], dtype=object)
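
The code comments above recommend keeping the -1, 0, 1 encoding. The sketch below shows what that might look like. The `int8` dtype and the names `S_compact` and `S_object` are choices made for this example, not part of the original answer:

```python
import numpy as np
import pandas as pd

df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])

# Signs as small integers: 1 (same sign), -1 (opposite signs), 0 (a zero was involved).
Cs = np.sign(df_C.to_numpy()).astype(np.int8)
Fs = np.sign(df_F.to_numpy()).astype(np.int8)
S_compact = Cs[:, :, None] * Fs[:, None, :]   # shape (7, 3, 5)

# Convert to the 1 / 0 / None encoding only when it is really needed.
# (S + 1) // 2 maps 1 -> 1 and -1 -> 0; zeros are replaced by None.
S_object = np.where(S_compact == 0, None, (S_compact + 1) // 2)

# nbytes of an object array counts only the pointers, not the objects they refer to.
print(S_compact.dtype, S_compact.nbytes)
print(S_object.dtype, S_object.nbytes)
```

The compact array stays numeric, so further arithmetic or comparisons on it remain vectorized, which is not the case for an object array containing None.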

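And finally...

Second solution [NumPy]

Shubham Sharma from the stackoverflow group proposed a more compact way to solve it. He tells us:

NumPy broadcasting and np.select

Broadcast and multiply the values in df_C by the values in df_F so that the resulting product matrix has shape (3, 7, 5). Then test whether each value in the product matrix is positive, negative or zero, and assign the corresponding value 1, 0 or NaN where that condition is True.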

import numpy as np
import pandas as pd

df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
    
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])

a = df_C.values.T[:, :, None] * df_F.values
a = np.select([a > 0, a < 0], [1, 0], np.nan)

This gives us a numpy.ndarray with the following structure:

>>> a
array([[[ 1.,  0.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  0.,  1.],
        [ 0.,  1.,  1.,  1.,  0.],
        [ 1., nan, nan,  1.,  1.],
        [ 1.,  1.,  1.,  0., nan],
        [ 1.,  1.,  0.,  1.,  0.]],

       [[ 1.,  0.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  0.,  1.],
        [ 0.,  1.,  1.,  1.,  0.],
        [ 1., nan, nan,  1.,  1.],
        [ 1.,  1.,  1.,  0., nan],
        [ 1.,  1.,  0.,  1.,  0.]],

       [[ 1.,  0.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  0.,  1.],
        [ 0.,  1.,  1.,  1.,  0.],
        [ 1., nan, nan,  1.,  1.],
        [ 1.,  1.,  1.,  0., nan],
        [ 1.,  1.,  0.,  1.,  0.]]])
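
Since the question asked for a Pandas result, one possible follow-up is to attach labels to this array. The sketch below flattens the first two axes into a two-level row index. The names `rows` and `labeled` are assumptions made for this example:

```python
import numpy as np
import pandas as pd

df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])

a = df_C.to_numpy().T[:, :, None] * df_F.to_numpy()   # shape (3, 7, 5)
a = np.select([a > 0, a < 0], [1, 0], np.nan)

# Axis 0 runs over the C columns and axis 1 over the original rows, so a
# (C column, original row) MultiIndex matches the row order of the reshaped array.
rows = pd.MultiIndex.from_product([df_C.columns, df_F.index], names=['variable_C', 'index'])
labeled = pd.DataFrame(a.reshape(-1, a.shape[-1]), index=rows, columns=df_F.columns)

print(labeled.loc['C2'])              # the 7 x 5 block for column C2
print(labeled.loc[('C1', 4), 'F2'])   # a single cell
```

This keeps the fast NumPy computation while giving each cell the C column, row, and F column labels it came from.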

To everyone, thank you very much for your help! You are excellent programmers, and the solutions you gave us are outstanding. Thank you so much!