
Unsupervised Machine Learning

 

🔍 1. What is Unsupervised Machine Learning?

Unsupervised learning is a type of machine learning where the model is trained on data without labeled outputs. The goal is to find hidden patterns, structure, or relationships in the data.


🧠 2. Types of Unsupervised Learning

Type                     | Description                                                   | Example Use Case
-------------------------|---------------------------------------------------------------|------------------------
Clustering               | Grouping similar data points into clusters                    | Customer segmentation
Dimensionality Reduction | Reducing the number of features while preserving information  | Image compression, PCA
Association              | Discovering rules that describe large portions of your data   | Market basket analysis
Anomaly Detection        | Identifying rare items or events that differ from the norm    | Fraud detection

🧪 3. Top Unsupervised Learning Algorithms

  1. K-Means Clustering

  2. Hierarchical Clustering

  3. DBSCAN

  4. Principal Component Analysis (PCA)

  5. Autoencoders

  6. Apriori (Association Rule Mining); a short sketch follows this list
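
Apriori and Autoencoders are not demonstrated later in this post, so here is a minimal sketch of association rule mining with Apriori. It assumes the third-party mlxtend package (pip install mlxtend), and the transaction list is purely illustrative.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions (illustrative only)
transactions = [['milk', 'bread', 'butter'],
                ['bread', 'butter'],
                ['milk', 'bread'],
                ['milk', 'butter']]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.5, then rules with confidence >= 0.7
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])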


🧰 4. Real-World Dataset Example: Customer Segmentation (Mall Customers)

Dataset: Mall Customers Dataset (Kaggle)


5. Algorithm 1: K-Means Clustering

🔧 Step-by-Step Code in Python


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load dataset
data = pd.read_csv("Mall_Customers.csv")
X = data.iloc[:, [3, 4]].values  # Annual Income and Spending Score

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find the optimal number of clusters using the Elbow Method
inertia = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertia.append(km.inertia_)

plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()

📊 Final Clustering and Visualization


# Apply KMeans with k=5
kmeans = KMeans(n_clusters=5, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# Plot each cluster
plt.scatter(X_scaled[y_kmeans == 0, 0], X_scaled[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1, 0], X_scaled[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2, 0], X_scaled[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_scaled[y_kmeans == 3, 0], X_scaled[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_scaled[y_kmeans == 4, 0], X_scaled[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')

# Plot the centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.show()

✅ Evaluate using Silhouette Score


score = silhouette_score(X_scaled, y_kmeans)
print(f"Silhouette Score: {score:.2f}")

🧭 6. Algorithm 2: Hierarchical Clustering


from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linked)
plt.title("Dendrogram")
plt.xlabel("Samples")
plt.ylabel("Euclidean distances")
plt.show()

# Apply clustering
hc = AgglomerativeClustering(n_clusters=5)
y_hc = hc.fit_predict(X_scaled)

# Plot the clusters
plt.scatter(X_scaled[y_hc == 0, 0], X_scaled[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_hc == 1, 0], X_scaled[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_hc == 2, 0], X_scaled[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_scaled[y_hc == 3, 0], X_scaled[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_scaled[y_hc == 4, 0], X_scaled[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.title('Hierarchical Clustering')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.show()

⚙️ 7. Algorithm 3: DBSCAN


from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

# Plot the clusters (noise points get the label -1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='rainbow')
plt.title('DBSCAN Clustering')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.show()
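
DBSCAN marks outliers with the label -1 instead of forcing them into a cluster. As a hedged follow-up (assuming X_scaled and labels from the block above), the sketch below counts clusters and noise points and computes the silhouette score on the clustered points only:

import numpy as np
from sklearn.metrics import silhouette_score

n_noise = int(np.sum(labels == -1))                         # DBSCAN marks noise with label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")

# Silhouette is only defined for clustered points and needs at least 2 clusters
mask = labels != -1
if n_clusters >= 2:
    print("Silhouette (noise excluded):", silhouette_score(X_scaled[mask], labels[mask]))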

📉 8. Algorithm 4: PCA (Dimensionality Reduction)


from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("PCA - Reduced to 2D")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show()

explained_variance = pca.explained_variance_ratio_
print("Explained variance by each component:", explained_variance)
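
With only two input features this example is mainly illustrative. On higher-dimensional data, a common way to choose n_components is the cumulative explained variance; the minimal, self-contained sketch below uses 6-feature synthetic data and a 95% threshold purely as illustrative assumptions:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative higher-dimensional data (6 features)
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=6, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

pca_full = PCA().fit(X_demo)                        # keep all components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative explained variance:", np.round(cumvar, 3))

# Smallest number of components preserving at least 95% of the variance
n_components_95 = int(np.argmax(cumvar >= 0.95)) + 1
print("Components needed for 95% variance:", n_components_95)

Equivalently, passing a float such as PCA(n_components=0.95) keeps just enough components to reach that threshold.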

📈 9. Evaluation Metrics for Unsupervised Learning

Unsupervised models don't have accuracy in the traditional sense, because there are no ground-truth labels to compare against. Common evaluation metrics include (a short sketch computing several of them follows this list):

  • Silhouette Score (KMeans, DBSCAN)

  • Davies–Bouldin Index

  • Calinski-Harabasz Index

  • Reconstruction Loss (Autoencoders)

  • Explained Variance (PCA)
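
A minimal sketch computing three of these metrics on the K-Means result from Section 5 (it assumes X_scaled and y_kmeans from that section; higher is better for Silhouette and Calinski-Harabasz, lower is better for Davies-Bouldin):

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

print("Silhouette:        ", silhouette_score(X_scaled, y_kmeans))
print("Davies-Bouldin:    ", davies_bouldin_score(X_scaled, y_kmeans))
print("Calinski-Harabasz: ", calinski_harabasz_score(X_scaled, y_kmeans))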


🧠 Summary Table

Algorithm    | Best For                 | Key Parameter     | Evaluation Metric
-------------|--------------------------|-------------------|------------------------------------
K-Means      | Spherical clusters       | k (clusters)      | Silhouette Score
Hierarchical | Hierarchical groupings   | Linkage, distance metric | Dendrogram, Silhouette Score
DBSCAN       | Arbitrary-shape clusters | eps, min_samples  | Noise detection, Silhouette Score
PCA          | Reducing features        | n_components      | Explained Variance


🧩 Generating Synthetic Data with make_blobs()

The make_blobs function from sklearn.datasets is used to generate synthetic datasets for clustering, especially for testing unsupervised learning algorithms like K-Means, DBSCAN, etc.


📦 Syntax


from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, n_features=2, cluster_std=1.0, random_state=42)

🔧 Parameters of make_blobs

Parameter      | Type              | Description
---------------|-------------------|------------------------------------------------------------------------------
n_samples      | int or tuple      | Total number of points (or number per cluster if a tuple)
n_features     | int               | Number of features (dimensions)
centers        | int or array      | Number of clusters, or coordinates of cluster centers
cluster_std    | float or array    | Standard deviation (spread) of the clusters
center_box     | tuple (min, max)  | Range of cluster centers when centers is an int
shuffle        | bool              | Whether to shuffle the samples
random_state   | int or None       | Seed for reproducibility
return_centers | bool              | If True, also returns the centers used to generate the blobs (default False)

📊 Example: Generate and Plot Clusters


import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate a synthetic dataset with 4 clusters
X, y = make_blobs(n_samples=300, centers=4, n_features=2, cluster_std=1.0, random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50)
plt.title("Synthetic Dataset using make_blobs")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()

🧪 Useful Variants

1. Different std per cluster


X, y = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 0.8], random_state=42)

2. Custom Centers


import numpy as np

centers = np.array([[1, 1], [-1, -1], [1, -1]])
X, y = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)
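
3. Returning the generating centers

The return_centers parameter from the table above can be demonstrated too; a minimal sketch:

from sklearn.datasets import make_blobs

# return_centers=True additionally returns the centers used to generate the blobs
X, y, true_centers = make_blobs(n_samples=300, centers=3, cluster_std=0.7,
                                return_centers=True, random_state=42)
print(true_centers)  # shape (3, 2): one row per generated cluster center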

For non-linearly separable cluster shapes, sklearn.datasets also provides make_moons and make_circles, which are often used to test algorithms such as DBSCAN.



🧾 Default Output

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)
print(X.shape)  # (100, 2)
  • X is a 2D array: shape (100, 2) → Two unnamed features

  • y is a 1D array: shape (100,) → Labels (used for evaluation or visualization only)


🐼 Add Column Names (Optional)

You can assign column names like this:

import pandas as pd

df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['Label'] = y
print(df.head())

Sample Output:

Feature1 | Feature2 | Label
---------|----------|------
-2.69    | 8.32     | 2
-9.47    | 9.03     | 1
-9.22    | 8.01     | 1
4.57     | 1.22     | 0
3.33     | 0.39     | 0

🚀 K-Means Clustering with make_blobs(): Step-by-Step Example

Here is a step-by-step detailed example of K-Means Clustering using the make_blobs() dataset in Python, including dataset creation, visualization, clustering, and evaluation.


🧪 Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

🎯 Step 2: Generate Synthetic Dataset using make_blobs()

# Generate 300 samples, 3 clusters, 2 features
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Optional: create a DataFrame for better readability
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['TrueLabel'] = y_true

# View the first few rows
print(df.head())

📊 Step 3: Visualize the Original Data

plt.scatter(df['Feature1'], df['Feature2'], c=df['TrueLabel'], cmap='viridis', s=50)
plt.title('Generated Dataset (True Labels)')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.grid(True)
plt.show()

📌 Step 4: Apply K-Means Clustering

# Create a KMeans instance with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model
kmeans.fit(X)

# Predict clusters
pred_labels = kmeans.predict(X)

# Add predictions to the DataFrame
df['KMeansLabel'] = pred_labels

📍 Step 5: Visualize K-Means Clusters with Centroids

# Plot the clustered data
plt.scatter(df['Feature1'], df['Feature2'], c=df['KMeansLabel'], cmap='rainbow', s=50)

# Plot the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.6, marker='X', label='Centroids')

plt.title('K-Means Clustering Results')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.grid(True)
plt.show()

📈 Step 6: Evaluate Using Silhouette Score

score = silhouette_score(X, pred_labels)
print(f'Silhouette Score: {score:.2f}')

The Silhouette Score ranges from -1 to 1. A value close to 1 means clusters are well separated.
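
The average score can hide weakly separated clusters, so it can also help to inspect per-sample values. A small sketch using silhouette_samples (assuming X and pred_labels from the steps above):

import numpy as np
from sklearn.metrics import silhouette_samples

# Per-sample silhouette values; silhouette_score is simply their mean
sample_scores = silhouette_samples(X, pred_labels)
for c in np.unique(pred_labels):
    print(f"Cluster {c}: mean silhouette = {sample_scores[pred_labels == c].mean():.2f}")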


🧠 Step 7: Determine Optimal Number of Clusters (Elbow Method)

inertia = []
K = range(1, 11)
for k in K:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)  # Sum of squared distances to the closest cluster center

# Plot the elbow curve
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-cluster Sum of Squares)')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()

Look for the "elbow" point in the curve — the point after which inertia decreases slowly.


📌 Summary

Step | Description
-----|------------------------------------------------
1    | Import libraries
2    | Generate a synthetic dataset with make_blobs()
3    | Visualize the initial data
4    | Apply K-Means clustering
5    | Plot the predicted clusters with centroids
6    | Evaluate with the silhouette score
7    | Use the Elbow Method to select the optimal k

📐 The Elbow Method in Detail

Here's a detailed explanation of the Elbow Method, a popular technique for determining the optimal number of clusters (k) in K-Means clustering.


📌 What is the Elbow Method?

The Elbow Method helps to choose the optimal number of clusters (k) by plotting the inertia (within-cluster sum of squares) as a function of the number of clusters and finding the "elbow point"—where the rate of decrease sharply slows down.


📐 Inertia: What Are We Measuring?

  • Inertia is the sum of squared distances between each point and its closest cluster center.

  • It measures how tight the clusters are.

$$\text{Inertia} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$

Where:

  • $C_i$ = cluster $i$

  • $\mu_i$ = centroid of cluster $i$

  • $x_j \in C_i$ = the data points assigned to cluster $i$

Lower inertia is better, but too many clusters will always give you lower inertia — so we must find the balance.
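
To make the formula concrete, the sketch below (self-contained; the data and k are illustrative) computes the sum of squared distances by hand and checks it against KMeans.inertia_:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Sum of squared distances of each point to its assigned centroid
manual = sum(np.sum((X_demo[km.labels_ == i] - c) ** 2)
             for i, c in enumerate(km.cluster_centers_))
print(manual, km.inertia_)  # the two values agree up to floating-point error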


🎯 Goal of the Elbow Method

Find the value of k where adding more clusters doesn't significantly reduce inertia — this is the elbow point.


📉 How to Implement the Elbow Method in Python

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Step 1: Generate sample data
X, y = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)

# Step 2: Calculate inertia for different values of k
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Step 3: Plot the elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method to Determine Optimal k')
plt.grid(True)
plt.show()

🧠 Interpreting the Plot

  • The X-axis = number of clusters (k)

  • The Y-axis = inertia

  • You'll notice:

    • Inertia decreases sharply at first

    • Then the curve flattens out

  • The point where the decrease slows down is the “elbow” — the optimal number of clusters


✅ Example

k | Inertia
--|--------
1 | 2500
2 | 1500
3 | 900
4 | 650
5 | 600
6 | 580

In this case, the elbow is around k=3 or 4 — adding more clusters reduces inertia only slightly.


📦 Pros and Cons of Elbow Method

✅ Pros:

  • Simple to implement

  • Helps avoid overfitting (too many clusters)

❌ Cons:

  • Elbow point is sometimes ambiguous

  • Doesn’t work well when clusters aren't clearly separated

  • Doesn't account for non-spherical shapes (K-Means limitation)


💡 Alternatives to Elbow Method

  • Silhouette Score (a sweep over k using this score is sketched after this list)

  • Gap Statistic

  • Davies–Bouldin Index
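
Of these, the Silhouette Score is the easiest to sweep over k: pick the k with the highest average score. A minimal sketch (assuming X from the implementation block above):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sil_scores = {}
for k in range(2, 11):                      # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)

best_k = max(sil_scores, key=sil_scores.get)
print("Best k by silhouette:", best_k)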



Here's an in-depth tutorial on Hierarchical Clustering using the make_blobs() dataset in Python. This includes step-by-step implementation, visualizations (dendrogram), clustering, and a comparison with K-Means using Silhouette Score.


🧠 Hierarchical Clustering – In Depth

📌 What is Hierarchical Clustering?

Hierarchical Clustering builds a tree-like structure (dendrogram) of nested clusters. It comes in two types:

1. Agglomerative (Bottom-Up) — Most common

  • Each point starts as a cluster.

  • Pairs of clusters are merged until one cluster remains.

2. Divisive (Top-Down)

  • Starts with one cluster of all data points.

  • Recursively splits the cluster into sub-clusters.


🔢 Distance Metrics

Hierarchical clustering uses a distance metric to compute similarity:

  • Euclidean (default)

  • Manhattan, Cosine, etc.

🔗 Linkage Criteria

Used to define the distance between clusters (a short comparison sketch follows this list):

  • Single Linkage: min distance between points

  • Complete Linkage: max distance

  • Average Linkage: mean distance

  • Ward’s Method: minimizes variance (recommended for Euclidean data)
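
The linkage choice can noticeably change the result. As a rough illustration (the variable names and data here are assumptions, not part of the tutorial's main example), the sketch below compares the four criteria on the same synthetic data using the silhouette score:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
for link in ['ward', 'complete', 'average', 'single']:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X_demo)
    print(f"{link:>8}: silhouette = {silhouette_score(X_demo, labels):.3f}")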


✅ Step-by-Step Example using make_blobs()

✅ Step 1: Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage

✅ Step 2: Generate Dataset

# Create a dataset with 3 true clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Optional: create a DataFrame
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['TrueLabel'] = y_true

✅ Step 3: Plot Original Data

plt.scatter(df['Feature1'], df['Feature2'], c=y_true, cmap='viridis', s=50)
plt.title("Original Dataset (True Labels)")
plt.xlabel("Feature1")
plt.ylabel("Feature2")
plt.grid(True)
plt.show()

✅ Step 4: Plot Dendrogram

linked = linkage(X, method='ward')  # 'ward' minimizes within-cluster variance

plt.figure(figsize=(12, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=False)
plt.title("Dendrogram (Ward's Linkage)")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.grid(True)
plt.show()

✅ Step 5: Apply Agglomerative Clustering

# Create an Agglomerative model with 3 clusters.
# Ward linkage uses Euclidean distance by default; the old 'affinity' argument
# was renamed to 'metric' and removed in recent scikit-learn versions.
agglo = AgglomerativeClustering(n_clusters=3, linkage='ward')
agglo_labels = agglo.fit_predict(X)

# Add to the DataFrame
df['AggloLabel'] = agglo_labels

✅ Step 6: Visualize Hierarchical Clustering

plt.scatter(df['Feature1'], df['Feature2'], c=agglo_labels, cmap='rainbow', s=50)
plt.title("Agglomerative Hierarchical Clustering")
plt.xlabel("Feature1")
plt.ylabel("Feature2")
plt.grid(True)
plt.show()

✅ Step 7: Compare with K-Means

# Apply KMeans for comparison
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
df['KMeansLabel'] = kmeans_labels

✅ Step 8: Evaluate Using Silhouette Score

kmeans_score = silhouette_score(X, kmeans_labels)
agglo_score = silhouette_score(X, agglo_labels)

print(f"K-Means Silhouette Score: {kmeans_score:.3f}")
print(f"Agglomerative Clustering Silhouette Score: {agglo_score:.3f}")

✅ Step 9: Final Comparison Table

Method               | Silhouette Score (0-1) | Notes
---------------------|------------------------|------------------------------------------
K-Means              | e.g. 0.71              | Fast, works well for spherical clusters
Agglomerative (Ward) | e.g. 0.70              | Works well with hierarchical structure

⚠ Note: Scores vary slightly based on dataset randomness. The closer to 1, the better the clustering.
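
Because make_blobs also returns the ground-truth labels, the two methods can additionally be compared against those labels with the Adjusted Rand Index (a sketch, assuming y_true, kmeans_labels, and agglo_labels from the steps above):

from sklearn.metrics import adjusted_rand_score

print("K-Means ARI:       ", adjusted_rand_score(y_true, kmeans_labels))
print("Agglomerative ARI: ", adjusted_rand_score(y_true, agglo_labels))

An ARI of 1.0 means the clustering matches the true grouping exactly, while values near 0 indicate agreement no better than chance.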


🧠 When to Use Hierarchical Clustering?

✅ Use it when:

  • You want a hierarchical view of clusters (dendrogram)

  • You don't know how many clusters exist upfront

  • The dataset is small to medium

❌ Avoid it when:

  • Dataset is very large (slow performance)

  • You need real-time clustering (it's non-incremental)
