Pandas 目前不提供 3D 数据结构,有时我确实喜欢这种结构,但该选项(截至今天)已过时并已被删除。但是,可以使用具有三个键列(或索引级别)的长格式(也称为 EAV)来表示此类数据。
牢记这一点;来自 DataScience 小组的Jan Šimbera建议使用以下代码:
(
df_C
# Transform to long format (two columns: former column names under `variable`
# and corresponding values under `value`) plus the original index.
.melt(ignore_index=False)
# Join with the other dataframe, similarly transformed. join() implicitly joins
# on indexes, so this will generate all combinations of the `variable` column values.
.join(df_F.melt(ignore_index=False), lsuffix='_C', rsuffix='_F')
# Make the index a regular column.
.rename_axis('index')
.reset_index()
# Your rules can be expressed by multiplying the two value columns and examining the sign.
.assign(combined=lambda df: df.value_C * df.value_F)
.assign(output=lambda df:
# Uses the Pandas nullable boolean type (three values: True, False, NA).
pd.Series(pd.NA, index=df.index, dtype='boolean')
# If combined is positive, both values were non-zero with the same sign.
.mask(df.combined > 0, True)
# If combined is negative, both values were non-zero with opposite signs.
.mask(df.combined < 0, False)
# If combined is zero, either of the values was zero, and the NA is retained.
)
# Remove intermediary values. The first three columns can also be transformed
# to a MultiIndex.
[['index', 'variable_C', 'variable_F', 'output']]
)
完整的代码是:
import pandas as pd
df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])
eav = df_C.melt(ignore_index=False)
.join(df_F.melt(ignore_index=False), lsuffix='_C', rsuffix='_F')
.rename_axis('index')
.reset_index()
.assign(combined=lambda df: df.value_C * df.value_F)
.assign( output=lambda df:
pd.Series(pd.NA, index=df.index, dtype='boolean')
.mask(df.combined > 0, True)
.mask(df.combined < 0, False) )
[['index', 'variable_C', 'variable_F', 'output']]
结果,我们得到了一个 DataFrame,其结构如下:
>>> eav
index variable_C variable_F output
0 0 C1 F1 True
1 0 C1 F2 False
2 0 C1 F3 True
3 0 C1 F4 True
4 0 C1 F5 True
.. ... ... ... ...
100 6 C3 F1 True
101 6 C3 F2 True
102 6 C3 F3 False
103 6 C3 F4 True
104 6 C3 F5 False
另一种可能的解决方案是使用 Numpy ...在这种情况下,我们有两种可能的解决方案,一种是“长”,另一种是短。
第一个。解决方案 [Numpy]
这个解决方案感谢计算机科学元组的Cassandra Sinclair,她建议:
# The first step is to observe that the relationship can be achieved by multiplication of signs. With numpy.sign(x) we get 0 if x is zero, 1 if positive and -1 if negative, since you check for sign equality, multiplication by the same sign value will always be 1, multiplication by 0 always yields 0 and multiplication by opposite signs yields -1.
import numpy as np
import pandas as pd
df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])
Cs = np.sign(df_C.values)
Fs = np.sign(df_F.values)
# The next step is to make the correct kind of broadcast. Using A[:, None], we introduce a new dimension after the first:
assert Cs[:, None].shape == (7, 1, 3)
# So we will expand F with an additional dimension in the middle, so that we can do element wise multiplication of every value in the column of F with one value in the column of C. We also need to expand C, so that the last axis has just a single value.
F2 = Fs[:, None]
C2 = Cs[:,:, None]
#Finally, we multiply and cache the intermediate values so that we can use np.where to replace 0 with None and -1 with 0. However, you should keep -1,0,1 as it uses less memory, avoids multiple copies and is easier to work with.
S = F2*C2
assert S.shape == (7,3,5)
S = np.where(S==0, None, S)
S = np.where(S==-1, 0, S)
结果,我们得到一个“numpy.ndarray”,具有以下结构:
>>> S
array([[[1, 0, 1, 1, 1],
[1, 0, 1, 1, 1],
[1, 0, 1, 1, 1]],
[[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]],
[[1, 1, 1, 0, 1],
[1, 1, 1, 0, 1],
[1, 1, 1, 0, 1]],
[[0, 1, 1, 1, 0],
[0, 1, 1, 1, 0],
[0, 1, 1, 1, 0]],
[[1, None, None, 1, 1],
[1, None, None, 1, 1],
[1, None, None, 1, 1]],
[[1, 1, 1, 0, None],
[1, 1, 1, 0, None],
[1, 1, 1, 0, None]],
[[1, 1, 0, 1, 0],
[1, 1, 0, 1, 0],
[1, 1, 0, 1, 0]]], dtype=object)
最后...
第二。解决方案 [Numpy]
来自 stackoverflow 团队的Shubham Sharma并没有提出一种非常优雅的方法来解决它……他告诉我们:
Numpy 广播和 np.select
广播并将 df_C 中的值与 df_F 中的值相乘,使得结果乘积矩阵的形状为 (3, 7, 5),然后测试乘积矩阵中的值为正、负的条件或零并分配相应的值 1、0 和 NaN,条件为 True
import numpy as np
import pandas as pd
df_C = pd.DataFrame(data=[[-3,-1,-1], [5,3,3], [3,3,1], [-1,-1,-3], [-3,-1,-1], [2,3,1], [1,1,1]], columns=['C1','C2','C3'])
df_F = pd.DataFrame(data=[[-1,1,-1,-1,-1],[1,1,1,1,1],[1,1,1,-1,1],[1,-1,-1,-1,1],[-1,0,0,-1,-1],[1,1,1,-1,0],[1,1,-1,1,-1]], columns=['F1','F2','F3','F4','F5'])
a = df_C.values.T[:, :, None] * df_F.values
a = np.select([a > 0, a < 0], [1, 0], np.nan)
这向我们抛出了一个“numpy.ndarray”,具有以下结构:
>>> a
array([[[ 1., 0., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 0., 1.],
[ 0., 1., 1., 1., 0.],
[ 1., nan, nan, 1., 1.],
[ 1., 1., 1., 0., nan],
[ 1., 1., 0., 1., 0.]],
[[ 1., 0., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 0., 1.],
[ 0., 1., 1., 1., 0.],
[ 1., nan, nan, 1., 1.],
[ 1., 1., 1., 0., nan],
[ 1., 1., 0., 1., 0.]],
[[ 1., 0., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 0., 1.],
[ 0., 1., 1., 1., 0.],
[ 1., nan, nan, 1., 1.],
[ 1., 1., 1., 0., nan],
[ 1., 1., 0., 1., 0.]]])
对所有人,非常感谢您的帮助!你们是出色的程序员,您为我们提供的解决方案非常出色。完全感谢!