
Unsupervised Machine Learning

 

🔍 1. What is Unsupervised Machine Learning?

Unsupervised learning is a type of machine learning where the model is trained on data without labeled outputs. The goal is to find hidden patterns, structure, or relationships in the data.
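For example, a clustering model is fit on the feature matrix alone, with no target column passed to it. A minimal sketch (using scikit-learn's KMeans on random data, purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: only features, no target column
X = np.random.rand(100, 2)

# The model is fit on X alone and discovers cluster structure by itself
model = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = model.fit_predict(X)  # one inferred cluster label per sample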


🧠 2. Types of Unsupervised Learning

Type | Description | Example Use Case
Clustering | Grouping similar data points into clusters | Customer segmentation
Dimensionality Reduction | Reducing the number of features while preserving information | Image compression, PCA
Association | Discovering rules that describe large portions of the data | Market basket analysis
Anomaly Detection | Identifying rare items or events that differ from the norm | Fraud detection

🧪 3. Top Unsupervised Learning Algorithms

  1. K-Means Clustering

  2. Hierarchical Clustering

  3. DBSCAN

  4. Principal Component Analysis (PCA)

  5. Autoencoders

  6. Apriori (Association Rule Mining)


🧰 4. Real-World Dataset Example: Customer Segmentation (Mall Customers)

Dataset: Mall Customers Dataset (Kaggle)


5. Algorithm 1: K-Means Clustering

🔧 Step-by-Step Code in Python


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load dataset
data = pd.read_csv("Mall_Customers.csv")
X = data.iloc[:, [3, 4]].values  # Annual Income and Spending Score

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal number of clusters using the Elbow Method
inertia = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertia.append(km.inertia_)

plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()

📊 Final Clustering and Visualization


# Apply KMeans with k=5
kmeans = KMeans(n_clusters=5, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# Plot each cluster
plt.scatter(X_scaled[y_kmeans == 0, 0], X_scaled[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1, 0], X_scaled[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2, 0], X_scaled[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_scaled[y_kmeans == 3, 0], X_scaled[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_scaled[y_kmeans == 4, 0], X_scaled[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')

# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.show()

✅ Evaluate using Silhouette Score


score = silhouette_score(X_scaled, y_kmeans)
print(f"Silhouette Score: {score:.2f}")

🧭 6. Algorithm 2: Hierarchical Clustering


from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linked)
plt.title("Dendrogram")
plt.xlabel("Samples")
plt.ylabel("Euclidean distances")
plt.show()

# Apply clustering
hc = AgglomerativeClustering(n_clusters=5)
y_hc = hc.fit_predict(X_scaled)

# Plot each cluster
plt.scatter(X_scaled[y_hc == 0, 0], X_scaled[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_hc == 1, 0], X_scaled[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_hc == 2, 0], X_scaled[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_scaled[y_hc == 3, 0], X_scaled[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_scaled[y_hc == 4, 0], X_scaled[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.title('Hierarchical Clustering')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.show()

⚙️ 7. Algorithm 3: DBSCAN


from sklearn.cluster import DBSCAN

# eps: neighborhood radius, min_samples: points required to form a dense region
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)  # noise points are labeled -1

# Plot
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='rainbow')
plt.title('DBSCAN Clustering')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.show()

📉 8. Algorithm 4: PCA (Dimensionality Reduction)


from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("PCA - Reduced to 2D")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show()

explained_variance = pca.explained_variance_ratio_
print("Explained variance by each component:", explained_variance)

📈 9. Accuracy Metrics for Unsupervised Learning

Unsupervised models have no ground-truth labels, so accuracy in the traditional sense does not apply. Common evaluation metrics include the following (a short code sketch follows the list):

  • Silhouette Score (KMeans, DBSCAN)

  • Davies–Bouldin Index

  • Calinski-Harabasz Index

  • Reconstruction Loss (Autoencoders)

  • Explained Variance (PCA)
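The first three of these are available directly in scikit-learn. A short sketch, assuming the X_scaled features and y_kmeans labels produced in Section 5:

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Higher Silhouette and Calinski-Harabasz values are better; lower Davies-Bouldin is better
print("Silhouette:", silhouette_score(X_scaled, y_kmeans))
print("Davies-Bouldin:", davies_bouldin_score(X_scaled, y_kmeans))
print("Calinski-Harabasz:", calinski_harabasz_score(X_scaled, y_kmeans))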


🧠 Summary Table

Algorithm | Best For | Key Parameter | Evaluation Metric
K-Means | Spherical clusters | k (clusters) | Silhouette Score
Hierarchical | Hierarchical groupings | Distance metric | Dendrogram, Silhouette Score
DBSCAN | Arbitrary-shape clusters | eps, min_samples | Noise detection, Silhouette Score
PCA | Reducing features | n_components | Explained Variance


🧪 10. Generating Synthetic Data with make_blobs()

The make_blobs function from sklearn.datasets generates synthetic datasets for clustering and is especially useful for testing unsupervised learning algorithms such as K-Means and DBSCAN.


📦 Syntax


from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, n_features=2, cluster_std=1.0, random_state=42)

🔧 Parameters of make_blobs

Parameter | Type | Description
n_samples | int or tuple | Total number of points (or number per cluster if a tuple)
n_features | int | Number of features (dimensions)
centers | int or array | Number of clusters, or coordinates of cluster centers
cluster_std | float or array | Standard deviation (spread) of the clusters
center_box | tuple (min, max) | Range for randomly generated cluster centers when centers is an int
shuffle | bool | Whether to shuffle the samples
random_state | int or None | Seed for reproducibility
return_centers | bool | If True, also returns the centers used to generate the blobs (default is False)

📊 Example: Generate and Plot Clusters


import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, n_features=2, cluster_std=1.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50)
plt.title("Synthetic Dataset using make_blobs")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()

🧪 Useful Variants

1. Different std per cluster


X, y = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 0.8], random_state=42)

2. Custom Centers


import numpy as np

centers = np.array([[1, 1], [-1, -1], [1, -1]])
X, y = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)
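3. Return the Generating Centers

The return_centers flag from the parameter table above can be combined with either variant. A small sketch:

# Ask make_blobs to also return the true centers used to generate the data
X, y, true_centers = make_blobs(n_samples=300, centers=3, cluster_std=0.6,
                                return_centers=True, random_state=42)
print(true_centers)  # array of shape (3, 2): one row per generating center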

Related generators such as make_moons and make_circles produce non-linearly separable clusters and are useful for testing algorithms beyond K-Means.



🧾 Default Output

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)
print(X.shape)  # (100, 2)
  • X is a 2D array: shape (100, 2) → Two unnamed features

  • y is a 1D array: shape (100,) → Labels (used for evaluation or visualization only)


🐼 Add Column Names (Optional)

You can assign column names like this:

import pandas as pd

df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['Label'] = y
print(df.head())

Sample Output:

Feature1 | Feature2 | Label
-2.69 | 8.32 | 2
-9.47 | 9.03 | 1
-9.22 | 8.01 | 1
4.57 | 1.22 | 0
3.33 | 0.39 | 0


🔢 11. K-Means Clustering on a make_blobs() Dataset – Step by Step

Here is a step-by-step detailed example of K-Means Clustering using the make_blobs() dataset in Python, including dataset creation, visualization, clustering, and evaluation.


🧪 Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

🎯 Step 2: Generate Synthetic Dataset using make_blobs()

# Generate 300 samples, 3 clusters, 2 features
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Optional: Create a DataFrame for better readability
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['TrueLabel'] = y_true

# View first few rows
print(df.head())

📊 Step 3: Visualize the Original Data

plt.scatter(df['Feature1'], df['Feature2'], c=df['TrueLabel'], cmap='viridis', s=50)
plt.title('Generated Dataset (True Labels)')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.grid(True)
plt.show()

📌 Step 4: Apply K-Means Clustering

# Create a KMeans instance with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model
kmeans.fit(X)

# Predict clusters
pred_labels = kmeans.predict(X)

# Add predictions to DataFrame
df['KMeansLabel'] = pred_labels

📍 Step 5: Visualize K-Means Clusters with Centroids

# Plot the clustered data
plt.scatter(df['Feature1'], df['Feature2'], c=df['KMeansLabel'], cmap='rainbow', s=50)

# Plot the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.6, marker='X', label='Centroids')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.grid(True)
plt.show()

📈 Step 6: Evaluate Using Silhouette Score

score = silhouette_score(X, pred_labels)
print(f'Silhouette Score: {score:.2f}')

The Silhouette Score ranges from -1 to 1. A value close to 1 means clusters are well separated.
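If a per-point view is needed, silhouette_samples from scikit-learn returns one score per sample; low or negative values flag poorly assigned points. A brief sketch, reusing X and pred_labels from the steps above:

from sklearn.metrics import silhouette_samples

per_point = silhouette_samples(X, pred_labels)  # one silhouette value per sample
print(per_point.min(), per_point.mean(), per_point.max())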


🧠 Step 7: Determine Optimal Number of Clusters (Elbow Method)

inertia = []
K = range(1, 11)
for k in K:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)  # Sum of squared distances to closest cluster center

# Plot Elbow
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-cluster Sum of Squares)')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()

Look for the "elbow" point in the curve — the point after which inertia decreases slowly.


📌 Summary

Step | Description
1 | Import libraries
2 | Generate synthetic dataset with make_blobs()
3 | Visualize initial data
4 | Apply K-Means clustering
5 | Plot predicted clusters with centroids
6 | Evaluate with silhouette score
7 | Use Elbow method to select optimal k


📉 12. The Elbow Method in Detail

The Elbow Method is a popular technique for determining the optimal number of clusters (k) in K-Means clustering.


📌 What is the Elbow Method?

The Elbow Method helps to choose the optimal number of clusters (k) by plotting the inertia (within-cluster sum of squares) as a function of the number of clusters and finding the "elbow point"—where the rate of decrease sharply slows down.


📐 Inertia: What Are We Measuring?

  • Inertia is the sum of squared distances between each point and its closest cluster center.

  • It measures how tight the clusters are.

\text{Inertia} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2

Where:

  • C_i = cluster i

  • \mu_i = centroid of cluster i

  • x_j \in C_i = the data points assigned to cluster i

Lower inertia is better, but too many clusters will always give you lower inertia — so we must find the balance.
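As a sanity check, the formula can be reproduced in a few lines of NumPy and compared against the inertia_ attribute reported by scikit-learn. A small self-contained sketch (the dataset and variable names here are illustrative only):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Fit K-Means on a small synthetic dataset
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42).fit(X_demo)

# Manual inertia: sum of squared distances from each point to its assigned centroid
centroids = km.cluster_centers_
labels = km.labels_
manual_inertia = sum(
    np.sum((X_demo[labels == i] - centroids[i]) ** 2) for i in range(3)
)

print(manual_inertia)  # value computed from the formula above
print(km.inertia_)     # value reported by scikit-learn; the two should agree closely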


🎯 Goal of the Elbow Method

Find the value of k where adding more clusters doesn't significantly reduce inertia — this is the elbow point.


📉 How to Implement the Elbow Method in Python

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Step 1: Generate sample data
X, y = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)

# Step 2: Calculate inertia for different values of k
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Step 3: Plot the Elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method to Determine Optimal k')
plt.grid(True)
plt.show()

🧠 Interpreting the Plot

  • The X-axis = number of clusters (k)

  • The Y-axis = inertia

  • You'll notice:

    • Inertia decreases sharply at first

    • Then the curve flattens out

  • The point where the decrease slows down is the “elbow” — the optimal number of clusters


✅ Example

k | Inertia
1 | 2500
2 | 1500
3 | 900
4 | 650
5 | 600
6 | 580

In this case, the elbow is around k=3 or 4 — adding more clusters reduces inertia only slightly.


📦 Pros and Cons of Elbow Method

✅ Pros:

  • Simple to implement

  • Helps avoid overfitting (too many clusters)

❌ Cons:

  • Elbow point is sometimes ambiguous

  • Doesn’t work well when clusters aren't clearly separated

  • Doesn't account for non-spherical shapes (K-Means limitation)


💡 Alternatives to Elbow Method

  • Silhouette Score

  • Gap Statistic

  • Davies–Bouldin Index
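For instance, the Silhouette Score alternative can be scripted in much the same loop as the elbow method. A brief self-contained sketch (the dataset and variable names here are illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)

# Silhouette Score is defined only for k >= 2, so the loop starts at 2
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print("Best k by Silhouette Score:", best_k)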



🌳 13. Hierarchical Clustering with make_blobs() and Comparison with K-Means

Here's an in-depth tutorial on Hierarchical Clustering using the make_blobs() dataset in Python. This includes step-by-step implementation, visualizations (dendrogram), clustering, and a comparison with K-Means using Silhouette Score.


🧠 Hierarchical Clustering – In Depth

📌 What is Hierarchical Clustering?

Hierarchical Clustering builds a tree-like structure (dendrogram) of nested clusters. It comes in two types:

1. Agglomerative (Bottom-Up) — Most common

  • Each point starts as a cluster.

  • Pairs of clusters are merged until one cluster remains.

2. Divisive (Top-Down)

  • Starts with one cluster of all data points.

  • Recursively splits the cluster into sub-clusters.


🔢 Distance Metrics

Hierarchical clustering uses a distance metric to compute similarity:

  • Euclidean (default)

  • Manhattan, Cosine, etc.

🔗 Linkage Criteria

Used to define the distance between clusters (a quick comparison sketch follows this list):

  • Single Linkage: min distance between points

  • Complete Linkage: max distance

  • Average Linkage: mean distance

  • Ward’s Method: minimizes variance (recommended for Euclidean data)
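As referenced above, the linkage criteria can be compared on the same data by their Silhouette Scores. A small illustrative sketch (the X_demo dataset here is an assumption, generated only for this comparison):

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Compare the four linkage criteria on the same data
for link in ['single', 'complete', 'average', 'ward']:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X_demo)
    print(f"{link:>9}: silhouette = {silhouette_score(X_demo, labels):.3f}")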


✅ Step-by-Step Example using make_blobs()

✅ Step 1: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage

✅ Step 2: Generate Dataset

# Create dataset with 3 true clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Optional: Create DataFrame
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['TrueLabel'] = y_true

✅ Step 3: Plot Original Data

plt.scatter(df['Feature1'], df['Feature2'], c=y_true, cmap='viridis', s=50)
plt.title("Original Dataset (True Labels)")
plt.xlabel("Feature1")
plt.ylabel("Feature2")
plt.grid(True)
plt.show()

✅ Step 4: Plot Dendrogram

linked = linkage(X, method='ward')  # 'ward' minimizes variance

plt.figure(figsize=(12, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=False)
plt.title("Dendrogram (Ward's Linkage)")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.grid(True)
plt.show()

✅ Step 5: Apply Agglomerative Clustering

# Create Agglomerative model with 3 clusters
# Note: newer scikit-learn versions replaced 'affinity' with 'metric';
# Ward linkage uses Euclidean distance by default, so it is omitted here
agglo = AgglomerativeClustering(n_clusters=3, linkage='ward')
agglo_labels = agglo.fit_predict(X)

# Add to DataFrame
df['AggloLabel'] = agglo_labels

✅ Step 6: Visualize Hierarchical Clustering

plt.scatter(df['Feature1'], df['Feature2'], c=agglo_labels, cmap='rainbow', s=50)
plt.title("Agglomerative Hierarchical Clustering")
plt.xlabel("Feature1")
plt.ylabel("Feature2")
plt.grid(True)
plt.show()

✅ Step 7: Compare with K-Means

# Apply KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
df['KMeansLabel'] = kmeans_labels

✅ Step 8: Evaluate Using Silhouette Score

kmeans_score = silhouette_score(X, kmeans_labels)
agglo_score = silhouette_score(X, agglo_labels)
print(f"K-Means Silhouette Score: {kmeans_score:.3f}")
print(f"Agglomerative Clustering Silhouette Score: {agglo_score:.3f}")

✅ Step 9: Final Comparison Table

Method | Silhouette Score (0-1) | Notes
K-Means | e.g. 0.71 | Fast, works well for spherical clusters
Agglomerative (Ward) | e.g. 0.70 | Works well with hierarchical structure

⚠ Note: Scores vary slightly based on dataset randomness. The closer to 1, the better the clustering.


🧠 When to Use Hierarchical Clustering?

✅ Use it when:

  • You want a hierarchical view of clusters (dendrogram)

  • You don't know how many clusters exist upfront

  • The dataset is small to medium

❌ Avoid it when:

  • Dataset is very large (slow performance)

  • You need real-time clustering (it's non-incremental)
