Image clustering K-means algorithm implementation

Image clustering K-means algorithm implementation

0, image clustering

0.1 What is image clustering?

Clustering is a widely used exploratory data analysis technique. Intuitively speaking, clustering is a task of grouping objects, grouping similar objects into one category, and dissimilar objects into different categories.

When the clustering object is an image, it is called image clustering.

For a more detailed introduction, please refer to the reference material [1] at the end of the article.

Image clustering is to classify images according to similarity in a given image set, according to the content of the image, without prior knowledge. This makes the classified images have high intra-class similarity and low inter-class similarity. There is a Chinese saying that "things gather by kind, and people are divided by groups", which is probably what it means.

0.2 Classification of clustering algorithms

As mentioned earlier, clustering is to gather objects with high similarity into one category. How to measure the similarity between objects is a key issue.

According to the scale of clustering, clustering methods can be divided into the following three types:

  • Distance-based clustering algorithm: Use a variety of distances to measure the similarity between data objects
  • Density-based clustering algorithm: According to the appropriate density function to classify
  • Clustering algorithm based on interconnectivity: usually based on graph or hypergraph model , clustering highly connected objects into one category.

The following part mainly introduces the K-meansclustering method.

1. K-means clustering algorithm

K-meansThe algorithm is a distance-based clustering algorithm, also called K or K , and often called Lloyd's ( Lloyd) algorithm. Iteratively divides each point in the data set into the cluster closest to it, and the distance refers to the distance from the data point to the cluster center.

1.1 Basic principles of K-means

K-meansThe idea of the algorithm is very simple. For a given sample set, the samples are divided into Kclusters according to the distance between the samples . Connect the data in the clusters as closely as possible, and make the distance between the clusters as large as possible.

KmeansSteps :

  1. Randomly initialize kcluster center coordinates
  2. Calculate kthe distance from all objects in the data set to the center of a cluster, and divide the data points into the nearest cluster
  3. For each cluster, recalculate the centroid of the cluster, which is the average value of the coordinates of the nodes in the current cluster
  4. Repeat steps 2 and 3 until convergence

The termination conditions are:

  • No more redistribution
  • Maximum number of iterations reached
  • All class centers move less than a certain value


  1. Greedy algorithm: often falls into a local optimal solution

    • KSelection of the number of class centers
    • Initial point selection
  2. Sensitive to noise or outliers

    It is impossible to distinguish which are noise or outliers, and only one category can be judged for each data point. This will cause the sample centroid to shift, resulting in misjudgment or reduced clustering tightness.

  3. The influence of sample point shape on clustering

    K-meansThe algorithm has a good effect on convex data. It can divide the data into spherical clusters based on clustering, but it can't do anything for data points with non-convex shapes, such as circular data. The left side of the figure below shows K-meansthe clustering effect of the method.

1.2 Parameters in K-means

1. K value

The number of cluster centers Kneeds to be given in advance. KThe selection of the value directly affects the final clustering effect.

Selection method:

  1. elbow methodBy plotting Kthe relationship with the loss function, select the Kvalue at the inflection point . That is, try using multiple values, and take the turning point where the clustering index is optimal or improved.
  2. Experience selection. According to manual experience K, select a few first , and randomly initialize the center for many times to select the most suitable empirically.

It is usually selected based on experience, because the inflection point is not obvious in actual operation and the efficiency is too low, so it is not allowed to do this in practice.

elbow methodAlso called the "elbow method", it uses the change trend of the sum of squares of errors (SSE) as Kan indicator of the selected value.

Among them, is the first cluster, is the sample point in the middle, is the centroid, is the clustering error of all samples, and indicates the quality of the clustering effect.

As shown below, when Kthe time ranging from 2 to 7, corresponding to the clustering result, when the K=5effect of the best.

2. Initial cluster center (centroid)

K-meansThe final classification results obtained with different initial points may also be different. In actual use, we do not know which of the data sets to be clustered are of our concern, labeland it is unrealistic to manually specify the centroid in advance.

The general method of selecting the initial centroid is:

  • random selection
  • Kmeans++the way
    • First class center -> random selection
    • Recorded as the distance between the data point and the nearest cluster center
    • Select the next cluster center, the selected probability is proportional to
    • And so on, to the first Kone.

3. Distance measurement

Distance isK-means an index to measure the similarity of sample points in a cluster.

K-meansThe more commonly used distance measurement methods are:

  • Euclidean distance
  • Cosine similarity
  • Manhattan distance

2. Sklearn implemented by K-means

PythonThe clustering implementation method sklearnis provided in the library in K-means, we can call it directly.

For image clustering, we need to extract the features that represent the content of the image, which is a dimensional feature vector. There is an image whose feature vector is expressed as and the dimension is.


from sklearn.cluster import KMeans
import numpy as np

# X 
X = np.array([[1, 2], [1, 4], [1, 0],
           [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# array([0, 0, 0, 1, 1, 1], dtype=int32)
kmeans.predict([[0, 0], [4, 4]])
# array([0, 1], dtype=int32)
# array([[ 1.,  2.],
#        [ 4.,  2.]])

2.1 KMeans class

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')


parameter meaning
n_clsuters int, Optional, the default value is 8. The number of cluster centers, that is, the number of clusters
init { k-means++ , 'random'Or an ndarray}, the method of initializing the centroid, the default is 'k-means++'to random randomly select the initial centroid from the training data, if you pass an ndarray, it should behave as such (n_clusters, n_features)and give the initial centroid
n_init int, By default 10, the number of times of running the algorithm is initialized with different centroids, and the final solution is the optimal result selected in the meaning of inertia
max_iter int, By default 300, K-meansthe maximum number of iterations to execute an algorithm
tol floatType, default0.0001
precompute_distances { auto, True, False}Pre-calculate the distance value (faster, but occupies more memory). When only running clustering results for a set of data, there is no need for pre-selection calculation.
verbose intType, default 0, whether to print the intermediate process, 0 is not to print
random_state intType, RandomStateinstance of or None, optional, default None. If it is int, it random_stateis the seed used by the random number generator, if it is an RandomStateinstance, it random_stateis the random number generator, if it is None, the random number generator is np.randoman RandomStateinstance of
n_jobs intType, the amount of computing power used, is achieved by calculating each n_init running in parallel. If it is -1, all CPUs are used. If it is specified 1, parallel code is not used, which is convenient for debugging. If the value is less than -1, use (n_cpus + 1 + n_jobs). For n_jobs = -2, use n_cpus-1.
algorithm Optional 'auto', 'full','elkan'. 'full' is the traditional K-means algorithm,'elkan' is the elkan K-means algorithm, the default value of'auto' will decide how to choose between'full' and'elkan' according to whether the data value is sparse. Generally, choose'elkan' for dense data, otherwise it is'full'.

Main attributes:

Attributes meaning
cluster_centers_ Vector [n_clsuters, n_features], the coordinates of the center of each cluster
Labels_ The classification label of each data, starting from 0
inertia_ floatType, the sum of the distances from each data point to the centroid of its cluster, used to evaluate whether the number of clusters is appropriate

2.2 KMeans class method

1. fit()

Cluster the data set after Kmeans determines the category.


def fit(self, X, y=None)
    random_state = check_random_state(self.random_state)
    X = self._check_fit_data(X)

    self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ =/
            X, n_clusters=self.n_clusters, init=self.init,
            n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
            tol=self.tol, random_state=random_state, copy_x=self.copy_x,
            n_jobs=self.n_jobs, algorithm=self.algorithm,
    return self

Call the k_meansfunction internally to cluster and return self.

Call k_means()function, returns self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_.

  • self.cluster_centers_: Cluster center, shape is (k, n_features)
  • self.labels_: int, Cluster index value, shape is(n_samples,)
  • self.inertia_: Clustering distortion value (the sum of the squares of all observed distances in the training set)
  • self.n_iter_: Optimal number of iterations corresponding to the results, only when return_n_iterset Trueon return.

2. predict()

According to the clustering results, determine the category

def predict(self, X)
  • X: {array, sparse matrix}, Shape is[n_samples, n_features] Return value:
  • labels: array, Shape is [n_samples,]. Each sample belongs to the category index of the cluster.

3. fit_predict

def fit_predict(self, X, y=None)

return value:

  • labels: array, Shape is [n_samples,]. Each sample belongs to the category index value of the cluster.

Calculate the cluster and predict the cluster index of each sample.

It is equivalent to fit(X)calling the predict(X)function after calling the method .

Note: In this function, the attribute is returned.

4. transform

def transform(self, X)

Convert X into cluster-distance space

return value:

  • X_new: array, Shape is[n_samples, k]

5. fit_transform

def fit_transform(self, X, y=None)

Perform clustering operations and Xtransform them into distance space.

It is equivalent to fit(X)calling the transform(X)function after calling the method , but it is more effective.

Important, the Kmeans method in sklearn cannot specify the distance measurement method, and the Euclidean distance is used by default

K-meansEuclidean distance is used by default, which is the basis of measurement at the beginning of the algorithm design. The reason is that the calculation of the average is involved.

From: Cluster Analysis-What kind of distance metric does sklearn's kmeans use? -IT House-Programmer Software Development Technology Sharing Community

3. scipy implemented by K-means

scipyK-meansAlgorithms are also implemented in the library .

The center index or cluster index is also called "code" , and the mapping table from code to center is called "code book" .

3.1 kmeans function

kmeansClustering using functions requires two steps to achieve

  1. Use kmeansfunction to generate codebookand distortion values
  2. Use the vqfunction to codebookassign each observation data, and get the distance of each observation data to its nearest center point.


import scipy
from scipy.cluster.vq import kmeans, vq, whiten
import numpy as np

# , 20 , 4 :

data=whiten(points) #  
codebook, variance = kmeans(data, 4) 
#  vq vq label 
code, distance = vq(data, codebook)

>>> codebook # (4,4)
array([[-1.227829  , -0.41256122, -0.1342359 , -0.98257834],
       [ 1.01190005, -0.34999089, -0.13180372,  0.06394479],
       [ 0.01156929, -0.39212056,  1.86893218, -0.34921357],
       [ 0.21946277,  1.36809613,  0.87196001,  0.9213216 ]])

>>> code
array([2, 0, 0, 2, 0, 2, 1, 3, 1, 1, 3, 0, 1, 0, 1, 1, 3, 2, 3, 2],
array([1.32927696, 0.99594691, 1.38351644, 1.22323281, 1.12605626,
       2.04444249, 0.55554746, 2.06947197, 1.44928466, 1.09481098,
       1.60957745, 1.07210177, 1.3848659 , 0.6393925 , 0.69392457,
       1.06200234, 1.09091552, 0.87726365, 0.76938663, 1.96214695])

kmeans function definition:

def kmeans(obs, k_or_guess, iter=20, thresh=1e-5, check_finite=True)


  • obs: ndarray, MxNArray, each row represents an observation vector. Feature must be inwhiten processed function
  • k_or_guess: intOr ndarray, the number of generated center points, each center point is assigned one code, which is also code_bookthe row index of the center of mass in the generated matrix, and the initial k center is selected by randomly selecting observation values from the observation matrix. You can also kxNspecify the initial k center point by passing in an array.
  • iter: int, Optional value. The number of times the k-means algorithm is run , and the one with the lowest distortion is returned code book. If k_or_guessthe initial centroid is specified for the parameter array, this parameter will be ignored. This parameter does not represent the number of iterations of the k-means algorithm .
  • thresh: float, Optional value. If the distortion change since the last k-means iteration is less than or equal to the threshold, the k-means algorithm is terminated.
  • check_finite: boolOptional value, Truedefault: . Whether to check that the input matrix contains only finite numbers. Disabling may improve performance, but if the input does contain infinity or NaN, it may cause problems (crash, termination).


  • codebook: ndarray, (k,N)An array of dimensions consisting of k centroids
  • distortion: floatType, the average Euclidean distance (not squared) between the observed value and the generated center point . Please note that K-meansthe standard definition of distortion in the algorithm is the sum of squared distances.

Note: 1. In the kmeans function, the iter parameter is used to specify the number of times to run the K-means algorithm, not the number of iterations. The algorithm termination condition can only be specified by the thresh parameter. 2. The distance metric uses the average Euclidean distance (not squared)

vq function definition:

def vq(obs, code_book, check_finite=True)

Assign code bookeach of them codeto observations. MXNEach observation vector in the array is compared code bookwith the centroid in and is assigned the closest centroidcode .

obsThe features in should have unit variance, which can be achieved by passing them to the whitenfunction. code bookIt can be created using K-meansalgorithms or other encoding algorithms.


  • obs: MxNArray, each row represents an observation value. Features must be whitenprocessed by functions.
  • code_book: ndarray. Usually generated using the k-means algorithm, each row represents a different feature value coderepresented codeby the column .
  • check_finite: bool, Optional value, default is True. Whether to check that the input array contains only limited values. Disabling may improve performance, but if the input does contain infinity or NaN, it may cause problems (crash, termination).

return value:

  • code: ndarray, An array of length M, used to store each observation data code book.
  • dist: ndarray, (M,)The distortion value of each observed data to its latest code.

3.2 kmeans2 function

This function is also used to implement K-meansalgorithms. The algorithm attempts to minimize the Euclidean distance between the observation and the centroid , and includes several initialization methods.

scipy.cluster.vq.kmeans2(data, k, iter=10, thresh=1e-05, minit='random', missing='warn', check_finite=True)


  • data: ndarray, (M,N)The array contains M observation data with N dimensions.
  • k: int or ndarray, The number of clusters. If the minitparameter is matrix, or if one ndarrayis given , it is interpreted as the initial cluster and used instead.
  • iter: int, Optional value, k-meansthe number of iterations for the algorithm to run , note that the meaning is different from the kmeansfunction iterparameters.
  • thresh: float, Optional value, not used
  • minit: str, Optional value, initialization method. Alternatively random, points, ++and matrix.
    • random: Generate k centroids from Gauss, the mean and variance are estimated based on the data
    • points: Randomly select k observations (rows) from the data as the initial center
    • ++: kmeans++Select k observations according to the method
    • matrix: Interpret the k parameter as the (k,M) array of the initial centroid
  • missing: str, Optional value, used to solve the empty clustering method, the available methods are warnsum raise.
    • warn: Give a warning and continue
    • raise: Trigger ClusterErrorand terminate the algorithm
  • check_finite: bool, Optional value, whether to check that the input matrix contains only finite numbers, defaultTrue

return value:

  • centroid:: ndarrayAn array of (k, N), representing k-meansthe center point of the last iteration of the method
  • label: ndarray, The code or index value of each observation. label [i] is the code or index of the centroid that is closest to the i-th observation


import scipy
from scipy.cluster.vq import kmeans2, whiten
import numpy as np

# , 20 , 4 :

data=whiten(points) #  
centroid, label = kmeans2(data, 5)

>>> centroid
array([[ 0.52132816,  0.97577703, -0.30863464, -1.30546523],
       [-0.27344139, -0.81129939, -0.59560322,  0.47788319],
       [ 1.99658961, -0.10701021,  1.09921144,  0.51397034],
       [-0.37598454,  1.72180727, -0.18127439,  0.58114466],
       [ 0.25895367, -0.01881385,  1.25681737,  0.03119893]])
>>> label
array([1, 0, 3, 0, 1, 1, 2, 4, 0, 1, 1, 0, 4, 4, 0, 3, 1, 4, 3, 2],

4. Reference materials

[1] A preliminary study on various clustering algorithms-Zheng Han Andrew.Hann-

[2] Feature extraction method: Kmeans clustering-Jack_Kuo's blog-CSDN blog

[3] sklearn.cluster.KMeans scikit-learn 0.17 documentation

[4] K-means clustering and vector quantization (scipy.cluster.vq) SciPy v1.3.1 Reference Guide