SUMMARY TASK-10

Mangal Hansdah
6 min read · Aug 12, 2021

TASK DESCRIPTION:- Create a blog/article/video explaining k-means clustering and its real use case in the security domain.

If you don’t know what a cluster is, or what k-means clustering is, then this blog is for you. It will briefly cover all of it.

WHAT IS CLUSTERING ?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

TYPES OF CLUSTERING

Broadly speaking, clustering can be divided into two subgroups :

  • Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. Hard clustering is a binary kind of grouping — either a data point belongs to a certain group, or it does not. So when sorting through a dataset of cars and animals, a data point will be either a car (1) or not a car (0), either an animal (1) or not an animal (0).
  • Soft Clustering: In soft clustering, instead of putting each data point into exactly one cluster, each data point is assigned a probability or likelihood of belonging to each cluster. For example, given a dataset of cats and dogs, an algorithm may evaluate a specific data point as: a cat (0.3) or a dog (0.7). Rather than grouping in a yes-or-no fashion, there is much more ambiguity and leeway in this method of clustering.

WHAT IS K-MEANS CLUSTERING ALGORITHM ?

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. As an unsupervised learning method, clustering attempts to identify relationships between n observations (data points) without being trained on a response variable.

The intent is to make data points within the same class as similar as possible, and data points in separate classes as dissimilar as possible.

Basically, in the process of clustering, one can identify which observations are alike and classify them accordingly. Keeping this perspective in mind, k-means clustering is the most straightforward and frequently practised clustering method for partitioning a dataset into k classes (groups).

MATHEMATICAL FORMULATION FOR K-MEANS ALGORITHM:

D = {x1, x2, …, xi, …, xm} → data set of m records

xi = (xi1, xi2, …, xin) → each record is an n-dimensional vector

Finding Cluster Centers that Minimize Distortion:

The solution can be found by setting the partial derivative of the distortion with respect to each cluster center to zero.
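For completeness, the distortion referred to here is typically written as the within-cluster sum of squared distances, and setting its partial derivative with respect to each center to zero gives the familiar result that each optimal center is the mean of its cluster:

```latex
\text{Distortion} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2,
\qquad
\frac{\partial\, \text{Distortion}}{\partial c_j}
  = -2 \sum_{x_i \in C_j} (x_i - c_j) = 0
\;\Longrightarrow\;
c_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Here C_j is the set of records assigned to cluster j and c_j is its center, consistent with the notation above.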

As we increase k, the distortion drops sharply at first and then levels off; the point after which further increases in k leave the distortion nearly constant is called the “Elbow”.

This is the ideal value of k, for the clusters created.
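To make the elbow concrete, here is a small sketch on toy 1-D data of my own, with near-optimal splits chosen by hand for each k (purely illustrative): the distortion, i.e. the within-cluster sum of squared distances, collapses once k reaches the true number of groups and barely improves after that.

```python
# Toy 1-D dataset with two obvious groups, split by hand into k = 1..4
# clusters (each split is close to the best possible for that k).
splits = {
    1: [[1.0, 1.2, 0.8, 9.0, 9.1, 8.9]],
    2: [[1.0, 1.2, 0.8], [9.0, 9.1, 8.9]],
    3: [[1.0, 1.2], [0.8], [9.0, 9.1, 8.9]],
    4: [[1.0, 1.2], [0.8], [9.0, 9.1], [8.9]],
}

def distortion(clusters):
    """Within-cluster sum of squared distances to each cluster's mean."""
    total = 0.0
    for c in clusters:
        mu = sum(c) / len(c)
        total += sum((x - mu) ** 2 for x in c)
    return total

for k, clusters in splits.items():
    print(k, round(distortion(clusters), 3))
```

The distortion plunges from k = 1 to k = 2 (roughly 96 down to 0.1) and then barely moves, so the elbow sits at k = 2, matching the two groups in the data.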

How Does the K-means clustering algorithm work?

k-means clustering tries to group similar kinds of items in the form of clusters. It finds the similarity between items and groups them into clusters. The k-means clustering algorithm works in three steps. Let’s see what these three steps are.

  1. Select the value of k.
  2. Initialize the centroids.
  3. Assign each point to a group and find the group’s average.
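The three steps above can be sketched in plain Python. This is a minimal sketch on toy data of my own; for determinism the centroids are initialized from the first k points, whereas real implementations usually pick random points or use k-means++ initialization.

```python
# A minimal k-means on 2-D points. Steps: (1) k is given, (2) initialize
# centroids, (3) repeatedly assign points to groups and re-average.
def kmeans(points, k, iters=100):
    centroids = points[:k]                       # step 2: initialize centroids
    for _ in range(iters):
        # step 3a: put each point in the group of its nearest centroid
        groups = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - centroids[i][0]) ** 2
                                + (y - centroids[i][1]) ** 2)
            groups[j].append((x, y))
        # step 3b: the average of each group becomes the new centroid
        new = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
               if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:                     # stop once nothing moves
            break
        centroids = new
    return centroids

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
print(sorted(kmeans(data, k=2)))
```

Each iteration performs step 3 twice over: assign every point to its nearest centroid, then move each centroid to the mean of its group, stopping when the centroids no longer change.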

Let us understand the above steps with the help of the figures, because a good picture is better than a thousand words.

We will understand each figure one by one.

  • Figure 1 shows the representation of data for two different items: the first item is shown in blue and the second in red. Here I am choosing the value of k as 2; there are different methods by which we can choose the right value of k.
  • In figure 2, we join the two selected points and draw a perpendicular line through the middle of the segment. Each point is then assigned to its nearest centroid. If you look closely, you will see that some of the red points have now moved over to the blue side; these points now belong to the group of blue items.
  • The same process continues in figure 3: we join the two centroids, draw the perpendicular line, and recompute the centroids. The points move to their nearest centroid, and again some of the red points become blue.
  • The same thing happens in figure 4. This process continues until we get two completely separated clusters.

Applications

The k-means algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. The goal when we undertake a cluster analysis is usually one of the following:

  1. To get a meaningful intuition of the structure of the data we’re dealing with.
  2. Cluster-then-predict, where different models are built for different subgroups if we believe there is wide variation in the behavior of those subgroups. An example is clustering patients into subgroups and building a model for each subgroup to predict the probability of the risk of a heart attack.

Drawbacks

The k-means algorithm is good at capturing the structure of the data if the clusters have a spherical-like shape: it always tries to construct a nice spherical region around each centroid. This means that the moment the clusters take on complicated geometric shapes, k-means does a poor job of clustering the data. We’ll illustrate a case where k-means does not perform well.

For one, the k-means algorithm doesn’t let data points that are far away from each other share a cluster, even when they obviously belong to the same one. Consider data points lying on two different horizontal lines: k-means ends up grouping half of the data points of each line together.
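This failure mode can be checked numerically. For two long, closely spaced horizontal lines (toy coordinates of my own below), splitting the data into left halves vs. right halves actually achieves a lower distortion than grouping each line as its own cluster, which is exactly why k-means prefers the "wrong" split.

```python
# Two horizontal lines of points: y = 0 and y = 1, with x = 0..9.
line_a = [(float(x), 0.0) for x in range(10)]
line_b = [(float(x), 1.0) for x in range(10)]

def distortion(clusters):
    """Within-cluster sum of squared distances to each cluster's mean."""
    total = 0.0
    for c in clusters:
        mx = sum(x for x, _ in c) / len(c)
        my = sum(y for _, y in c) / len(c)
        total += sum((x - mx) ** 2 + (y - my) ** 2 for x, y in c)
    return total

by_line = [line_a, line_b]              # the clustering a human would pick
halves = [line_a[:5] + line_b[:5],      # left halves of both lines together
          line_a[5:] + line_b[5:]]      # right halves of both lines together
print(distortion(by_line), distortion(halves))  # 165.0 45.0
```

Because the lines are much longer than the gap between them, the left/right split has far lower distortion (45 vs. 165), so k-means converges to it even though each resulting cluster mixes points from both lines.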

Conclusion:-

K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply when solving a clustering task, to get an idea of the structure of the dataset. The goal of k-means is to group data points into distinct, non-overlapping subgroups. It does a very good job when the clusters have roughly spherical shapes, but it suffers as the geometry of the clusters deviates from spherical. Moreover, it doesn’t learn the number of clusters from the data and requires it to be predefined. To be a good practitioner, it helps to know the assumptions behind your algorithms and methods, so that you have a good idea of each method’s strengths and weaknesses; this will help you decide when to use each method and under what circumstances. In this post, we covered the strengths and weaknesses of k-means, along with some methods for evaluating it.

Below are the main takeaways:

  • Scale/standardize the data when applying the k-means algorithm.
  • The elbow method for selecting the number of clusters doesn’t always work, because the error function decreases monotonically for all k.
  • K-means gives more weight to bigger clusters.
  • K-means assumes spherical cluster shapes (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters have other shapes, such as elliptical clusters.
  • If clusters overlap, k-means has no intrinsic measure of uncertainty for the examples belonging to the overlapping region, to decide which cluster each data point should be assigned to.
  • K-means may still cluster the data even if it can’t meaningfully be clustered, such as data drawn from a uniform distribution.
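The first takeaway is easy to demonstrate. When features live on very different scales (the income/age numbers below are toy values of my own), the large-scale feature dominates the squared distance, and standardizing each feature, shown here as a z-score sketch assuming non-zero spread, puts them back on equal footing:

```python
import statistics

# Toy data: (income in dollars, age in years) — income's range dwarfs age's.
people = [(30_000.0, 25.0), (31_000.0, 58.0), (88_000.0, 27.0), (90_000.0, 60.0)]

# Without scaling, the income term swamps the squared distance between people:
a, b = people[0], people[1]
print((a[0] - b[0]) ** 2, (a[1] - b[1]) ** 2)  # 1000000.0 vs 1089.0

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its std. dev."""
    stats = [(statistics.mean(col), statistics.stdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

scaled = standardize(people)  # each feature now has mean 0 and std. dev. 1
```

After standardization, both features contribute on comparable scales, so the clusters k-means finds reflect both income and age rather than income alone.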

🗨THANK YOU FOR READING THIS BLOG🙏🙏🙏….
