I have the following df:
x1 x2 x3 x4
1000 5000 0.8 restaurant1
2000 7000 0.75 restaurant1
500 1000 0.5 restaurant2
700 1400 0.6 restaurant2
1000 5000 0.8 restaurant2
100 600 0.9 restaurant3
200 1200 0.9 restaurant3
50 1000 0.9 restaurant3
Applying the KMeans algorithm with 2 clusters produces the following Y:
x1 x2 x3 x4 Y
1000 5000 0.8 restaurant1 1
2000 7000 0.75 restaurant1 1
500 1000 0.5 restaurant2 2
700 1400 0.6 restaurant2 2
1000 5000 0.8 restaurant2 1
100 600 0.9 restaurant3 2
200 1200 0.9 restaurant3 2
50 1000 0.9 restaurant3 2
Possible desired outputs:
x1 x2 x3 x4 Y
1000 5000 0.8 restaurant1 1
2000 7000 0.75 restaurant1 1
500 1000 0.5 restaurant2 2
700 1400 0.6 restaurant2 2
1000 5000 0.8 restaurant2 2
100 600 0.9 restaurant3 2
200 1200 0.9 restaurant3 2
50 1000 0.9 restaurant3 2
or
x1 x2 x3 x4 Y
1000 5000 0.8 restaurant1 1
2000 7000 0.75 restaurant1 1
500 1000 0.5 restaurant2 1
700 1400 0.6 restaurant2 1
1000 5000 0.8 restaurant2 1
100 600 0.9 restaurant3 2
200 1200 0.9 restaurant3 2
50 1000 0.9 restaurant3 2
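I want to impose this constraint: each restaurant must belong to one and only one cluster.
I understand why I get this output (KMeans assigns every row independently, so the rows of restaurant2 can end up in different clusters), but how can I avoid it and fix it?
Here is the code I use in my notebook: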
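import numpy as np
from sklearn.cluster import KMeans

# Convert each feature column to a NumPy array
x1 = df['x1'].to_numpy()
x2 = df['x2'].to_numpy()
x3 = (df['x5'] / df['x2']).to_numpy()  # ratio feature
# Encode the restaurant name as integer category codes (column must be categorical)
x4 = df_joint_raw['x4'].cat.codes.to_numpy()
# Stack the features into an (n_samples, 4) matrix
X = np.stack((x1, x2, x3, x4), axis=1)

# Get cluster labels
y_pred = KMeans(n_clusters=2, random_state=0).fit_predict(X)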
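
One possible way to enforce the constraint, sketched here rather than offered as a definitive fix, is to cluster restaurants instead of rows: aggregate the rows of each restaurant into a single feature vector, run KMeans on those vectors, and copy each restaurant's label back to its rows. The sketch below rebuilds the sample table from the question. Using the per-restaurant mean as the aggregate, standardizing the features with StandardScaler, and leaving the restaurant code out of the feature matrix are all assumptions of mine, not part of the original notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "x1": [1000, 2000, 500, 700, 1000, 100, 200, 50],
    "x2": [5000, 7000, 1000, 1400, 5000, 600, 1200, 1000],
    "x3": [0.8, 0.75, 0.5, 0.6, 0.8, 0.9, 0.9, 0.9],
    "x4": ["restaurant1"] * 2 + ["restaurant2"] * 3 + ["restaurant3"] * 3,
})

# One row per restaurant. The mean is an assumed choice of aggregate;
# a median or a sum may suit the real data better.
per_restaurant = df.groupby("x4")[["x1", "x2", "x3"]].mean()

# Standardize so that the large-valued columns do not dominate the
# Euclidean distance (an assumption, not in the original code).
X = StandardScaler().fit_transform(per_restaurant)

# Cluster the restaurants themselves, not the individual rows.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Broadcast each restaurant's label back to all of its rows.
restaurant_to_cluster = pd.Series(labels, index=per_restaurant.index)
df["Y"] = df["x4"].map(restaurant_to_cluster)
```

Because every row inherits the label of its restaurant, no restaurant can be split across clusters. Note that the labels come out as 0 and 1 rather than 1 and 2, and which restaurants end up together depends on the aggregate and scaling chosen.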
