# 0. Image clustering

## 0.1 What is image clustering?

Clustering is a widely used exploratory data analysis technique. Intuitively, clustering is the task of grouping objects so that similar objects end up in the same category and dissimilar objects in different categories.

When the objects being clustered are images, the task is called image clustering.

For a more detailed introduction, please refer to the reference material at the end of the article.

Image clustering partitions a given set of images by similarity, based only on image content and without prior knowledge, so that the resulting groups have high intra-class similarity and low inter-class similarity. As the Chinese saying goes, "birds of a feather flock together," which is roughly the idea.

## 0.2 Classification of clustering algorithms

As mentioned earlier, clustering gathers objects with high similarity into the same category, so how to measure the similarity between objects is a key issue.

According to the scale of clustering, clustering methods can be divided into the following three types:

• Distance-based clustering: uses a distance metric to measure the similarity between data objects
• Density-based clustering: partitions the data according to an appropriate density function
• Connectivity-based clustering: usually built on a graph or hypergraph model; highly connected objects are clustered together

The rest of this article mainly introduces the `K-means` clustering method.

# 1. K-means clustering algorithm

The `K-means` algorithm is a distance-based clustering algorithm, often called Lloyd's algorithm. It iteratively assigns each point in the data set to the cluster whose center is closest to it, where distance means the distance from the data point to the cluster center.

## 1.1 Basic principles of K-means

The idea behind `K-means` is very simple: for a given sample set, divide the samples into `K` clusters according to the distances between them, keeping the data within each cluster as close together as possible and the clusters as far apart as possible.

Steps of `K-means`:

1. Randomly initialize the coordinates of `k` cluster centers
2. Compute the distance from every object in the data set to each of the `k` cluster centers, and assign each data point to its nearest cluster
3. For each cluster, recompute the centroid as the mean of the coordinates of the points currently in the cluster
4. Repeat steps 2 and 3 until convergence

Common termination conditions are:

• Assignments no longer change
• The maximum number of iterations is reached
• All cluster centers move less than a given threshold
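The steps and termination conditions above can be sketched in a few lines of NumPy (a minimal illustration written for this article, not a library implementation; all names are our own):

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal K-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # step 1: randomly pick k data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step 4: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
centers, labels = kmeans_sketch(X, k=2)
print(centers, labels)
```

Note that the result still depends on the random initialization, which is exactly the local-optimum problem discussed next.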

Problems with `K-means`:

1. Greedy algorithm: it often falls into a local optimum, which makes the result sensitive to:

• the choice of the number of cluster centers `K`
• the choice of the initial points

2. Sensitive to noise and outliers

The algorithm cannot tell which points are noise or outliers; every data point is assigned to exactly one cluster. Outliers therefore shift the cluster centroids, causing misassignments or reducing cluster tightness.

3. The influence of cluster shape

`K-means` works well on convex data: it partitions the data into roughly spherical clusters. It fails on non-convex shapes, however, such as ring-shaped data.
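This limitation is easy to reproduce: on two concentric rings, `K-means` cuts the plane in half instead of separating the rings (a small sketch using `sklearn.datasets.make_circles`; the dataset parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# two concentric rings: the true grouping is not convex
X, y_true = make_circles(n_samples=400, factor=0.4, noise=0.04, random_state=0)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# agreement with the ring labels is close to chance level
ari = adjusted_rand_score(y_true, labels)
print(ari)
```

Density- or connectivity-based methods (e.g. DBSCAN or spectral clustering) handle this kind of data much better.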

## 1.2 Parameters in K-means

### 1. K value

The number of cluster centers `K` must be given in advance, and its choice directly affects the final clustering result.

Selection method:

1. Elbow method (`elbow method`): plot the loss as a function of `K` and choose the `K` at the inflection point. That is, try several values of `K` and take the turning point where the clustering metric stops improving noticeably.
2. Empirical selection: based on experience, pick a few candidate values of `K`, run the algorithm several times with random initial centers, and keep the value that works best.

In practice, `K` is usually chosen empirically: the inflection point is often not obvious, and sweeping many values of `K` is too inefficient for real workloads.

The elbow method uses the trend of the sum of squared errors (SSE) as the indicator for choosing `K`:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $C_i$ is the $i$-th cluster, $x$ is a sample point in $C_i$, and $\mu_i$ is the centroid of $C_i$. $SSE$ is the total clustering error over all samples and reflects the quality of the clustering.

For example, computing the SSE for `K` ranging from 2 to 7 and comparing the clustering results, the elbow appears at `K=5`, which gives the best result.
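The elbow method can be sketched with sklearn's `inertia_` attribute, which is exactly the SSE after fitting (a minimal illustration; the blob data and its parameters are made up for the demo):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest centroid

for k, v in sse.items():
    print(k, round(v, 1))
# SSE decreases as K grows; with well-separated blobs it typically
# drops sharply up to the true K and only slowly afterwards (the "elbow")
```

Plotting `sse` against `k` makes the inflection point visible.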

### 2. Initial cluster center (centroid)

Different initial points may lead `K-means` to different final clusterings. In practice we do not know in advance which labels in the data set we care about, so manually specifying the centroids beforehand is unrealistic.

The usual ways of selecting the initial centroids are:

• Random selection
• The `K-means++` method:
  • Choose the first cluster center at random
  • For each data point x, record D(x), the distance from x to its nearest already-chosen cluster center
  • Choose the next cluster center with probability proportional to D(x)²
  • Repeat until `K` centers have been chosen
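The `K-means++` selection rule above can be sketched directly (an illustrative implementation for this article, not sklearn's internal one):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Pick k initial centers: the first uniformly at random, each later one
    with probability proportional to D(x)^2, the squared distance to the
    nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform random
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()            # selection probability ∝ D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
init_centers = kmeanspp_init(X, k=2)
print(init_centers)
```

Because far-away points get higher probability, the initial centers tend to spread out over the data, which reduces the chance of a bad local optimum.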

### 3. Distance measurement

Distance is the index `K-means` uses to measure the similarity of sample points within a cluster.

The distance measures most commonly used with `K-means` are:

• Euclidean distance
• Cosine similarity
• Manhattan distance
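For reference, the three measures computed on two feature vectors with plain NumPy (cosine similarity is usually turned into a distance as 1 − cos):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

euclidean = np.linalg.norm(a - b)    # sqrt(1 + 4 + 4) = 3.0
manhattan = np.abs(a - b).sum()      # 1 + 2 + 2 = 5.0
# cosine similarity: dot product over the product of norms
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, round(cosine_sim, 4))
```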

# 2. K-means implemented in sklearn

Python's `sklearn` library provides an implementation of `K-means` clustering that we can call directly.

For image clustering, we first need to extract features that represent the content of each image, so that every image is described by a fixed-length feature vector; the image set then becomes a matrix of such vectors.

Example:

```python
from sklearn.cluster import KMeans
import numpy as np

# sample data: 6 two-dimensional points
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
# array([0, 0, 0, 1, 1, 1], dtype=int32)
kmeans.predict([[0, 0], [4, 4]])
# array([0, 1], dtype=int32)
kmeans.cluster_centers_
# array([[ 1.,  2.],
#        [ 4.,  2.]])
```

## 2.1 KMeans class

```python
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
```

Parameters:

| Parameter | Meaning |
| --- | --- |
| `n_clusters` | `int`, optional, default `8`. The number of cluster centers, i.e. the number of clusters. |
| `init` | {`'k-means++'`, `'random'`, or an ndarray}, the method for initializing the centroids, default `'k-means++'`. `'random'` picks the initial centroids at random from the training data; if an ndarray is passed, it must have shape `(n_clusters, n_features)` and gives the initial centroids. |
| `n_init` | `int`, default `10`. The number of times the algorithm is run with different centroid initializations; the final result is the best run in terms of inertia. |
| `max_iter` | `int`, default `300`. The maximum number of iterations of a single `K-means` run. |
| `tol` | `float`, default `0.0001`. Convergence tolerance. |
| `precompute_distances` | {`'auto'`, `True`, `False`}. Precompute distances (faster, but uses more memory). When clustering a data set only once, precomputation is unnecessary. |
| `verbose` | `int`, default `0`. Whether to print intermediate progress; `0` prints nothing. |
| `random_state` | `int`, `RandomState` instance, or `None`, optional, default `None`. If an `int`, it is the seed of the random number generator; if a `RandomState` instance, it is the generator itself; if `None`, the generator is the `RandomState` instance used by `np.random`. |
| `n_jobs` | `int`. The amount of parallelism, achieved by running the `n_init` initializations in parallel. `-1` uses all CPUs; `1` disables parallel code (convenient for debugging); for values below `-1`, `(n_cpus + 1 + n_jobs)` CPUs are used, so `n_jobs=-2` uses `n_cpus - 1`. |
| `algorithm` | One of `'auto'`, `'full'`, `'elkan'`. `'full'` is the classical K-means algorithm, `'elkan'` is the elkan K-means variant, and the default `'auto'` chooses between them depending on whether the data is sparse: `'elkan'` for dense data, otherwise `'full'`. |

Main attributes:

| Attribute | Meaning |
| --- | --- |
| `cluster_centers_` | Array of shape `(n_clusters, n_features)`: the coordinates of each cluster center. |
| `labels_` | The cluster label of each data point, starting from 0. |
| `inertia_` | `float`, the sum of squared distances from each data point to the centroid of its cluster; useful for judging whether the number of clusters is appropriate. |

## 2.2 Methods of the KMeans class

1. fit()

Runs `K-means` clustering on the data set with the configured number of clusters.

Definition:

```python
def fit(self, X, y=None):
    random_state = check_random_state(self.random_state)
    X = self._check_fit_data(X)

    self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \
        k_means(
            X, n_clusters=self.n_clusters, init=self.init,
            n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
            precompute_distances=self.precompute_distances,
            tol=self.tol, random_state=random_state, copy_x=self.copy_x,
            n_jobs=self.n_jobs, algorithm=self.algorithm,
            return_n_iter=True)
    return self
```

Internally, `fit` calls the `k_means` function to do the clustering and returns `self`. The `k_means()` call produces `self.cluster_centers_`, `self.labels_`, `self.inertia_`, and `self.n_iter_`:

• `self.cluster_centers_`: the cluster centers, shape `(k, n_features)`
• `self.labels_`: `int`, the cluster index of each sample, shape `(n_samples,)`
• `self.inertia_`: the clustering distortion (the sum of squared distances of all observations in the training set)
• `self.n_iter_`: the number of iterations of the best run; only returned when `return_n_iter` is set to `True`

2. predict()

Predicts, for each sample, the index of the nearest cluster according to the fitted clustering.

```python
def predict(self, X)
```

Parameters:

• `X`: `{array, sparse matrix}`, shape `[n_samples, n_features]`

Return value:

• `labels`: `array`, shape `[n_samples,]`; the index of the cluster each sample belongs to.

3. fit_predict()

```python
def fit_predict(self, X, y=None):
    return self.fit(X).labels_
```

Return value:

• `labels`: `array`, shape `[n_samples,]`; the index of the cluster each sample belongs to.

Computes the clustering and returns the cluster index of each sample. It is equivalent to calling `fit(X)` followed by `predict(X)`.

Note: this function simply returns the `labels_` attribute of `self.fit(X)`.

4. transform()

```python
def transform(self, X)
```

Transforms `X` into cluster-distance space.

Return value:

• `X_new`: `array`, shape `[n_samples, k]`; entry `(i, j)` is the distance from sample `i` to cluster center `j`.
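For example, continuing the sample data from the beginning of section 2, `transform` maps each sample to its distances to the `k` centers, and the column with the smallest distance agrees with the predicted label:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

D = kmeans.transform(X)  # shape (n_samples, k): distance to each center
print(D.shape)           # (6, 2)
# the index of the nearest center matches labels_
print((D.argmin(axis=1) == kmeans.labels_).all())  # True
```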

5. fit_transform()

```python
def fit_transform(self, X, y=None)
```

Performs the clustering and transforms `X` into cluster-distance space. It is equivalent to calling `fit(X)` followed by `transform(X)`, but implemented more efficiently.

Important: the `KMeans` implementation in sklearn does not let you choose the distance metric; it always uses Euclidean distance. This is built into the design of the algorithm: updating a centroid as the mean of its points only minimizes the within-cluster error under squared Euclidean distance.
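If cosine similarity is the metric you actually want, a common workaround (an approximation, not exact spherical k-means) is to L2-normalize the feature vectors first: for unit vectors, the squared Euclidean distance is `2 − 2·cos(u, v)`, so nearest-by-Euclidean coincides with nearest-by-cosine. A sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # stand-in for image feature vectors

X_unit = normalize(X)          # scale each row to unit L2 norm
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_unit)

# identity on unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
u, v = X_unit[0], X_unit[1]
assert np.isclose(np.sum((u - v) ** 2), 2 - 2 * (u @ v))
```

Caveat: the centroids are plain means and are not re-normalized between iterations, so this only approximates clustering by cosine similarity.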

Source: "Cluster analysis: what kind of distance metric does sklearn's kmeans use?", IT House programmer community.

# 3. K-means implemented in scipy

The `scipy` library also implements the `K-means` algorithm, in `scipy.cluster.vq`.

A center index or cluster index is also called a "code", and the table mapping codes to centers is called the "code book".

## 3.1 kmeans function

Clustering with the `kmeans` function takes two steps:

1. Use the `kmeans` function to generate the `codebook` and the distortion value
2. Use the `vq` function to assign each observation to a code in the `codebook`, obtaining the distance from each observation to its nearest center

Example:

```python
from scipy.cluster.vq import kmeans, vq, whiten
import numpy as np

# random data: 20 observations with 4 features each
points = np.random.randn(20, 4)

data = whiten(points)  # normalize each feature to unit variance
# generate the code book (4 centers) and the average distortion
codebook, variance = kmeans(data, 4)
# vq assigns each observation the code of its nearest center
code, distance = vq(data, codebook)

# example output:
>>> codebook  # shape (4, 4)
array([[-1.227829  , -0.41256122, -0.1342359 , -0.98257834],
       [ 1.01190005, -0.34999089, -0.13180372,  0.06394479],
       [ 0.01156929, -0.39212056,  1.86893218, -0.34921357],
       [ 0.21946277,  1.36809613,  0.87196001,  0.9213216 ]])
>>> variance
1.221658211170926

>>> code
array([2, 0, 0, 2, 0, 2, 1, 3, 1, 1, 3, 0, 1, 0, 1, 1, 3, 2, 3, 2],
      dtype=int32)
>>> distance
array([1.32927696, 0.99594691, 1.38351644, 1.22323281, 1.12605626,
       2.04444249, 0.55554746, 2.06947197, 1.44928466, 1.09481098,
       1.60957745, 1.07210177, 1.3848659 , 0.6393925 , 0.69392457,
       1.06200234, 1.09091552, 0.87726365, 0.76938663, 1.96214695])
```

Definition of the `kmeans` function:

```python
def kmeans(obs, k_or_guess, iter=20, thresh=1e-5, check_finite=True)
```

Parameters:

• `obs`: `ndarray`, an `M x N` array in which each row is an observation vector. The features must first be normalized with the `whiten` function.
• `k_or_guess`: `int` or `ndarray`. The number of centers to generate; each center is assigned a code, which is also the row index of that centroid in the generated `code_book` matrix. The initial k centers are chosen by randomly selecting observations from the observation matrix. Alternatively, passing a `k x N` array specifies the initial k centers.
• `iter`: `int`, optional. The number of times to run the k-means algorithm; the `code book` with the lowest distortion is returned. This parameter is ignored if `k_or_guess` specifies the initial centroids as an array. Note that it is not the number of iterations of the k-means algorithm.
• `thresh`: `float`, optional. The algorithm terminates when the change in distortion since the last k-means iteration is less than or equal to this threshold.
• `check_finite`: `bool`, optional, default `True`. Whether to check that the input matrix contains only finite numbers. Disabling the check may improve performance, but if the input does contain infinities or NaNs it may cause problems (crashes or non-termination).

Return values:

• `codebook`: `ndarray` of shape `(k, N)`, consisting of the k centroids
• `distortion`: `float`, the mean (non-squared) Euclidean distance between the observations and the generated centers. Note that the standard definition of distortion in the `K-means` literature is the sum of squared distances.

Note:

1. In the `kmeans` function, the `iter` parameter specifies the number of times to run the K-means algorithm, not the number of iterations; the termination condition can only be controlled through the `thresh` parameter.
2. The distance reported is the mean (non-squared) Euclidean distance.

Definition of the `vq` function:

```python
def vq(obs, code_book, check_finite=True)
```

Assigns a code from the `code book` to each observation. Each observation vector in the `M x N` `obs` array is compared with the centroids in the `code book` and assigned the code of the closest centroid.

The features in `obs` should have unit variance, which can be achieved by passing them through the `whiten` function. The `code book` can be created with the `K-means` algorithm or another encoding algorithm.

Parameters:

• `obs`: an `M x N` array in which each row is an observation. Features must be processed with the `whiten` function.
• `code_book`: `ndarray`, usually generated with the k-means algorithm; each row is a distinct centroid, and the columns are its feature values.
• `check_finite`: `bool`, optional, default `True`. Whether to check that the input array contains only finite values. Disabling the check may improve performance, but if the input does contain infinities or NaNs it may cause problems (crashes or non-termination).

Return values:

• `code`: `ndarray`, an array of length M storing the `code book` code of each observation.
• `dist`: `ndarray` of shape `(M,)`, the distortion of each observation to its nearest code.

## 3.2 kmeans2 function

This function also implements the `K-means` algorithm. It attempts to minimize the Euclidean distance between the observations and the centroids, and it supports several initialization methods.

```python
scipy.cluster.vq.kmeans2(data, k, iter=10, thresh=1e-05, minit='random', missing='warn', check_finite=True)
```

Parameters:

• `data`: `ndarray` of shape `(M, N)` containing M observations with N dimensions each.
• `k`: `int` or `ndarray`, the number of clusters. If `minit='matrix'`, or if an `ndarray` is given, it is interpreted as the initial clusters and used directly.
• `iter`: `int`, optional. The number of iterations the k-means algorithm runs. Note that this differs from the meaning of the `iter` parameter of the `kmeans` function.
• `thresh`: `float`, optional. Not used.
• `minit`: `str`, optional, the initialization method: one of `random`, `points`, `++`, and `matrix`.
  • `random`: generate k centroids from a Gaussian whose mean and variance are estimated from the data
  • `points`: randomly choose k observations (rows) from the data as the initial centers
  • `++`: choose k observations according to the `kmeans++` method
  • `matrix`: interpret the `k` parameter as a `(k, N)` array of initial centroids
• `missing`: `str`, optional, how to handle empty clusters: one of `warn` and `raise`.
  • `warn`: issue a warning and continue
  • `raise`: raise a `ClusterError` and terminate the algorithm
• `check_finite`: `bool`, optional, default `True`. Whether to check that the input matrix contains only finite numbers.

Return values:

• `centroid`: `ndarray` of shape `(k, N)`, the centers produced by the last iteration of k-means
• `label`: `ndarray`, the code or index of each observation; `label[i]` is the code or index of the centroid closest to the i-th observation

Example:

```python
from scipy.cluster.vq import kmeans2, whiten
import numpy as np

# random data: 20 observations with 4 features each
points = np.random.randn(20, 4)

data = whiten(points)  # normalize each feature to unit variance
centroid, label = kmeans2(data, 5)

# example output:
>>> centroid
array([[ 0.52132816,  0.97577703, -0.30863464, -1.30546523],
       [-0.27344139, -0.81129939, -0.59560322,  0.47788319],
       [ 1.99658961, -0.10701021,  1.09921144,  0.51397034],
       [-0.37598454,  1.72180727, -0.18127439,  0.58114466],
       [ 0.25895367, -0.01881385,  1.25681737,  0.03119893]])
>>> label
array([1, 0, 3, 0, 1, 1, 2, 4, 0, 1, 1, 0, 4, 4, 0, 3, 1, 4, 3, 2],
      dtype=int32)
```