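K-modes solves your problem.
K-means is well suited to clustering large data sets, but it is limited to numeric values.
K-modes, on the other hand, extends the k-means paradigm to categorical domains and can also be extended to cluster mixed data, as described in the paper A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining.
k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the better-known k-means algorithm, which clusters numeric data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numeric/categorical data.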
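To make the "number of matching categories" idea concrete, here is a minimal sketch of a simple matching dissimilarity in plain Python. The helper name `matching_dissimilarity` is hypothetical and written only for illustration; it is not the kmodes library's API, which may compute this differently internally.

```python
def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ.

    Hypothetical helper for illustration; assumes both records have
    the same attributes in the same order.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(u != v for u, v in zip(a, b))


x = ["Dog", "Blue", "Female", "Sad"]
z = ["Sheep", "Yellow", "Male", "Happy"]
a = ["Sheep", "Yellow", "Female", "Happy"]

print(matching_dissimilarity(x, a))  # 3: only "Female" matches
print(matching_dissimilarity(z, a))  # 1: only the third attribute differs
```

A smaller count means the two records are more alike, so here `z` is closer to `a` than `x` is.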
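The k-modes algorithm makes three major modifications to the k-means algorithm, i.e.,
- using a different dissimilarity measure,
- replacing the k means with k modes, and
- using a frequency-based method to update the modes.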
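The frequency-based update in the last point can be sketched as follows: for each attribute, the mode of a cluster is the category that occurs most often among its members. The function `column_modes` is a hypothetical helper for illustration, not part of the kmodes package, and it breaks ties by first occurrence, which may differ from what the library does.

```python
from collections import Counter


def column_modes(records):
    """Return the most frequent category of each attribute in a cluster.

    Hypothetical helper for illustration; ties go to the category
    that appears first in the data.
    """
    columns = zip(*records)  # transpose rows into per-attribute columns
    return [Counter(col).most_common(1)[0][0] for col in columns]


cluster = [
    ["Cat", "Yellow", "Male", "Happy"],
    ["Sheep", "Yellow", "Male", "Happy"],
    ["Sheep", "Yellow", "Female", "Happy"],
]

print(column_modes(cluster))  # ['Sheep', 'Yellow', 'Male', 'Happy']
```

Note that the resulting mode need not be an actual record of the cluster; it is assembled attribute by attribute.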
Usage:
import numpy as np
from kmodes.kmodes import KModes
# random categorical data
data = np.random.choice(20, (100, 10))
km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
clusters = km.fit_predict(data)
# Print the cluster centroids
print(km.cluster_centroids_)
Reference
- Example 1: applied directly to categorical data
import pandas as pd  # needed for the DataFrame examples below
x = ["Dog", "Blue", "Female", "Sad"]
y = ["Cat", "Yellow", "Male", "Happy"]
z = ["Sheep", "Yellow", "Male", "Happy"]
a = ["Sheep", "Yellow", "Female", "Happy"]
df2 = pd.DataFrame([x,y,z,a], columns= ["Pet", "Sky", "Gender", "Feeling"])
km_2 = KModes(n_clusters=2, init="Huang")
km_2.fit_predict(df2)
km_2.cluster_centroids_
- Example 2: numeric data
x = [0,1,0]
y = [0,1,1]
z = [1,0,1]
a = [1,0,1]
b = [1,0,0]
df = pd.DataFrame([x,y,z, a, b], columns= ["Pet", "Sky", "Gender"])
km = KModes(n_clusters=2, init='Huang')
result = km.fit_predict(df)
km.cluster_centroids_
Out[14]:
array([[1, 0, 1],
       [0, 1, 0]])
In [15]:
km.labels_
- Example 3: mixed categorical and numeric data
iris_df = pd.read_csv("../input_data/iris.csv")
iris_df.head()
from kmodes.kprototypes import KPrototypes
kP = KPrototypes(n_clusters=3, init='Huang', n_init=1, verbose=True)
kP.fit_predict(iris_df, categorical=[5])
kP.cluster_centroids_
Out[28]:
[array([[125.5 , 6.588, 2.974, 5.552, 2.026],
        [ 25.5 , 5.006, 3.428, 1.462, 0.246],
        [ 75.5 , 5.936, 2.77 , 4.26 , 1.326]]),
 array([['virginica'],
        ['setosa'],
        ['versicolor']], dtype='<U10')]
iris_df["cluster_id"] = kP.labels_
# testing to confirm
iris_df[iris_df.Species == 'versicolor']