🔍 1. What is Unsupervised Machine Learning?
Unsupervised learning is a type of machine learning where the model is trained on data without labeled outputs. The goal is to find hidden patterns, structure, or relationships in the data.
🧠 2. Types of Unsupervised Learning
Type | Description | Example Use Case |
---|---|---|
Clustering | Grouping similar data points into clusters | Customer segmentation |
Dimensionality Reduction | Reducing the number of features while preserving information | Image compression, PCA |
Association | Discovering rules that describe large portions of your data | Market basket analysis |
Anomaly Detection | Identifying rare items or events that differ from the norm | Fraud detection |
🧪 3. Top Unsupervised Learning Algorithms
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Principal Component Analysis (PCA)
- Autoencoders
- Apriori (Association Rule Mining), sketched briefly after this list
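Of these, Apriori is the only algorithm not demonstrated later in this post. As a minimal, illustrative sketch (not part of the original example), the snippet below uses the third-party mlxtend library on a tiny made-up basket dataset:
# Minimal Apriori sketch using the mlxtend library (assumed installed via `pip install mlxtend`).
# The basket data below is made up purely for illustration.
import pandas as pd
from mlxtend.frequent_patterns import apriori
# One-hot encoded transactions: each row is a basket, each column an item
baskets = pd.DataFrame(
    [[True, True, False, True],
     [True, False, True, False],
     [True, True, True, False],
     [False, True, False, True]],
    columns=['bread', 'milk', 'butter', 'eggs'])
# Frequent itemsets appearing in at least 50% of the baskets
frequent_itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
Association rules (e.g. bread → milk) can then be derived from these frequent itemsets with mlxtend's association_rules helper.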
🧰 4. Real-Time Dataset Example: Customer Segmentation (Mall Customers)
Dataset: Mall Customers Dataset (Kaggle)
✅ 5. Algorithm 1: K-Means Clustering
🔧 Step-by-Step Code in Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Load dataset
data = pd.read_csv("Mall_Customers.csv")
X = data.iloc[:, [3, 4]].values # Annual Income and Spending Score
# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find optimal number of clusters using Elbow Method
inertia = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertia.append(km.inertia_)
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()
📊 Final Clustering and Visualization
# Apply KMeans with k=5
kmeans = KMeans(n_clusters=5, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
# Plot
plt.scatter(X_scaled[y_kmeans == 0, 0], X_scaled[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_kmeans == 1, 0], X_scaled[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_kmeans == 2, 0], X_scaled[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_scaled[y_kmeans == 3, 0], X_scaled[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_scaled[y_kmeans == 4, 0], X_scaled[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.show()
✅ Evaluate using Silhouette Score
score = silhouette_score(X_scaled, y_kmeans)
print(f"Silhouette Score: {score:.2f}")
🧭 6. Algorithm 2: Hierarchical Clustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linked)
plt.title("Dendrogram")
plt.xlabel("Samples")
plt.ylabel("Euclidean distances")
plt.show()
# Apply clustering
hc = AgglomerativeClustering(n_clusters=5)
y_hc = hc.fit_predict(X_scaled)
# Plot
plt.scatter(X_scaled[y_hc == 0, 0], X_scaled[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X_scaled[y_hc == 1, 0], X_scaled[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X_scaled[y_hc == 2, 0], X_scaled[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X_scaled[y_hc == 3, 0], X_scaled[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(X_scaled[y_hc == 4, 0], X_scaled[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.title('Hierarchical Clustering')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.show()
⚙️ 7. Algorithm 3: DBSCAN
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)
# Plot
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='rainbow')
plt.title('DBSCAN Clustering')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.show()
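Unlike K-Means, DBSCAN labels outliers as -1 rather than forcing them into a cluster. A small follow-up sketch (reusing labels and X_scaled from the code above) counts the discovered clusters and noise points:
import numpy as np
# DBSCAN assigns the label -1 to noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")
# Silhouette Score is only defined for 2+ clusters; exclude the noise points
if n_clusters >= 2:
    mask = labels != -1
    print(f"Silhouette (excluding noise): {silhouette_score(X_scaled[mask], labels[mask]):.2f}")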
📉 8. Algorithm 4: PCA (Dimensionality Reduction)
from sklearn.decomposition import PCA
# Note: X_scaled has only 2 features here, so PCA essentially rotates the axes;
# on a higher-dimensional dataset it would genuinely reduce dimensionality
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("PCA - Reduced to 2D")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show()
explained_variance = pca.explained_variance_ratio_
print("Explained variance by each component:", explained_variance)
📈 9. Accuracy Metrics for Unsupervised Learning
Unsupervised models don't use accuracy in the traditional sense, because there are no ground-truth labels. Common evaluation metrics include (two of them are computed in the short sketch after this list):
- Silhouette Score (K-Means, DBSCAN)
- Davies–Bouldin Index
- Calinski–Harabasz Index
- Reconstruction Loss (Autoencoders)
- Explained Variance (PCA)
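As a quick illustration, the two index-based metrics are available directly in scikit-learn; the sketch below reuses X_scaled and y_kmeans from the Mall Customers example above:
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score
# Lower Davies-Bouldin is better; higher Calinski-Harabasz is better
print(f"Davies-Bouldin Index:    {davies_bouldin_score(X_scaled, y_kmeans):.2f}")
print(f"Calinski-Harabasz Index: {calinski_harabasz_score(X_scaled, y_kmeans):.2f}")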
🧠 Summary Table
Algorithm | Best For | Key Parameter | Evaluation Metric |
---|---|---|---|
K-Means | Spherical clusters | k (clusters) | Silhouette Score |
Hierarchical | Hierarchical groupings | Distance metric | Dendrogram, Score |
DBSCAN | Arbitrary shape clusters | eps, min_samples | Noise detection, Score |
PCA | Reducing features | n_components | Explained Variance |
🧩 Generating Synthetic Data with make_blobs
The make_blobs function from sklearn.datasets generates synthetic datasets for clustering and is especially useful for testing unsupervised learning algorithms such as K-Means and DBSCAN.
📦 Syntax
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3, n_features=2, cluster_std=1.0, random_state=42)
🔧 Parameters of make_blobs
Parameter | Type | Description |
---|---|---|
n_samples | int or tuple | Total number of points (or number per cluster if tuple) |
n_features | int | Number of features (dimensions) |
centers | int or array | Number of clusters, or coordinates of cluster centers |
cluster_std | float or array | Standard deviation (spread) of the clusters |
center_box | tuple (min, max) | Range of cluster centers if centers is int |
shuffle | bool | Whether to shuffle samples |
random_state | int or None | Seed for reproducibility |
return_centers | bool | If True, also returns the centers used to generate the blobs (default is False) |
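A quick sketch of the return_centers option (a small illustration, not part of the original example):
from sklearn.datasets import make_blobs
# With return_centers=True, the true generating centers are returned as well
X, y, centers = make_blobs(n_samples=100, centers=3, n_features=2,
                           return_centers=True, random_state=42)
print(centers.shape)  # (3, 2): one (x, y) center per cluster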
📊 Example: Generate and Plot Clusters
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, n_features=2, cluster_std=1.0, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50)
plt.title("Synthetic Dataset using make_blobs")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
🧪 Useful Variants
1. Different std per cluster
X, y = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 0.8], random_state=42)
2. Custom Centers
import numpy as np
centers = np.array([[1, 1], [-1, -1], [1, -1]])
X, y = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)
For non-linear cluster shapes, sklearn.datasets also provides make_moons and make_circles.
🧾 Default Output
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)
print(X.shape) # (100, 2)
- X is a 2D array of shape (100, 2) → two unnamed features
- y is a 1D array of shape (100,) → labels (used for evaluation or visualization only)
🐼 Add Column Names (Optional)
You can assign column names like this:
import pandas as pd
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['Label'] = y
print(df.head())
Sample Output:
Feature1 | Feature2 | Label |
---|---|---|
-2.69 | 8.32 | 2 |
-9.47 | 9.03 | 1 |
-9.22 | 8.01 | 1 |
4.57 | 1.22 | 0 |
3.33 | 0.39 | 0 |
🎯 K-Means Clustering on a make_blobs Dataset
Here is a step-by-step example of K-Means Clustering using a make_blobs() dataset in Python, including dataset creation, visualization, clustering, and evaluation.
🧪 Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
🎯 Step 2: Generate Synthetic Dataset using make_blobs()
# Generate 300 samples, 3 clusters, 2 features
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
# Optional: Create a DataFrame for better readability
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['TrueLabel'] = y_true
# View first few rows
print(df.head())
📊 Step 3: Visualize the Original Data
plt.scatter(df['Feature1'], df['Feature2'], c=df['TrueLabel'], cmap='viridis', s=50)
plt.title('Generated Dataset (True Labels)')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.grid(True)
plt.show()
📌 Step 4: Apply K-Means Clustering
# Create a KMeans instance with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
# Fit the model
kmeans.fit(X)
# Predict clusters
pred_labels = kmeans.predict(X)
# Add predictions to DataFrame
df['KMeansLabel'] = pred_labels
📍 Step 5: Visualize K-Means Clusters with Centroids
# Plot the clustered data
plt.scatter(df['Feature1'], df['Feature2'], c=df['KMeansLabel'], cmap='rainbow', s=50)
# Plot the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.6, marker='X', label='Centroids')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.legend()
plt.grid(True)
plt.show()
📈 Step 6: Evaluate Using Silhouette Score
score = silhouette_score(X, pred_labels)
print(f'Silhouette Score: {score:.2f}')
The Silhouette Score ranges from -1 to 1. A value close to 1 means clusters are well separated.
🧠 Step 7: Determine Optimal Number of Clusters (Elbow Method)
inertia = []
K = range(1, 11)
for k in K:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)  # Sum of squared distances to closest cluster center
# Plot Elbow
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-cluster Sum of Squares)')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()
Look for the "elbow" point in the curve — the point after which inertia decreases slowly.
📌 Summary
Step | Description |
---|---|
1 | Import libraries |
2 | Generate synthetic dataset with make_blobs() |
3 | Visualize initial data |
4 | Apply K-Means clustering |
5 | Plot predicted clusters with centroids |
6 | Evaluate with silhouette score |
7 | Use Elbow method to select optimal k |
📉 The Elbow Method in Depth
The Elbow Method is a popular technique for determining the optimal number of clusters (k) in K-Means clustering.
📌 What is the Elbow Method?
The Elbow Method helps to choose the optimal number of clusters (k) by plotting the inertia (within-cluster sum of squares) as a function of the number of clusters and finding the "elbow point"—where the rate of decrease sharply slows down.
📐 Inertia: What Are We Measuring?
- Inertia is the sum of squared distances between each point and its closest cluster center.
- It measures how tight the clusters are.
Inertia = Σᵢ Σ_{xⱼ ∈ Cᵢ} ‖xⱼ − μᵢ‖²
Where:
- Cᵢ = cluster i
- μᵢ = centroid of cluster i
- xⱼ ∈ Cᵢ = data points in cluster i
Lower inertia is better, but adding more clusters always lowers inertia, so we must find a balance; the short sketch below checks this definition against scikit-learn's inertia_ attribute.
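Here is a small self-contained sketch that computes inertia by hand and compares it with scikit-learn's inertia_ attribute (using a fresh make_blobs dataset, so the exact numbers are illustrative):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42).fit(X)
# Sum of squared distances from each point to its assigned centroid
manual_inertia = sum(np.sum((X[km.labels_ == i] - center) ** 2)
                     for i, center in enumerate(km.cluster_centers_))
print(manual_inertia, km.inertia_)  # the two values match (up to floating-point rounding)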
🎯 Goal of the Elbow Method
Find the value of k where adding more clusters doesn't significantly reduce inertia; this is the elbow point.
📉 How to Implement the Elbow Method in Python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Step 1: Generate sample data
X, y = make_blobs(n_samples=300, centers=5, cluster_std=0.60, random_state=0)
# Step 2: Calculate inertia for different values of k
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
# Step 3: Plot the Elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method to Determine Optimal k')
plt.grid(True)
plt.show()
🧠 Interpreting the Plot
- The X-axis = number of clusters (k)
- The Y-axis = inertia
- You'll notice that inertia decreases sharply at first, then the curve flattens out
- The point where the decrease slows down is the "elbow", i.e. the optimal number of clusters
✅ Example
k | Inertia |
---|---|
1 | 2500 |
2 | 1500 |
3 | 900 |
4 | 650 |
5 | 600 |
6 | 580 |
In this case, the elbow is around k=3 or 4 — adding more clusters reduces inertia only slightly.
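One way to make that judgment less subjective is to look at how much each extra cluster actually reduces inertia; a rough sketch using the illustrative numbers from the table above:
# Illustrative inertia values from the table above (not from a real run)
inertia = {1: 2500, 2: 1500, 3: 900, 4: 650, 5: 600, 6: 580}
# How much inertia drops when moving from k-1 to k clusters
for k in range(2, 7):
    drop = inertia[k - 1] - inertia[k]
    print(f"k={k}: inertia drops by {drop}")
# Drops: 1000, 600, 250, 50, 20 -> the large gains stop around k=3 or 4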
📦 Pros and Cons of Elbow Method
✅ Pros:
- Simple to implement
- Helps avoid overfitting (too many clusters)
❌ Cons:
- The elbow point is sometimes ambiguous
- Doesn't work well when clusters aren't clearly separated
- Doesn't account for non-spherical shapes (a K-Means limitation)
💡 Alternatives to Elbow Method
- Silhouette Score (sketched below)
- Gap Statistic
- Davies–Bouldin Index
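For instance, the Silhouette Score can be computed for each candidate k and the best-scoring k chosen; a minimal sketch reusing the X generated in the elbow code above:
from sklearn.metrics import silhouette_score
# Silhouette needs at least 2 clusters, so start k at 2
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
# Choose the k with the highest silhouette score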
🧠 Hierarchical Clustering – In Depth
Here's an in-depth tutorial on Hierarchical Clustering using a make_blobs() dataset in Python, including step-by-step implementation, a dendrogram visualization, clustering, and a comparison with K-Means using the Silhouette Score.
📌 What is Hierarchical Clustering?
Hierarchical Clustering builds a tree-like structure (dendrogram) of nested clusters. It comes in two types:
1. Agglomerative (Bottom-Up), the most common approach:
- Each point starts as its own cluster.
- Pairs of clusters are merged until one cluster remains.
2. Divisive (Top-Down):
- Starts with one cluster containing all data points.
- Recursively splits the cluster into sub-clusters.
🔢 Distance Metrics
Hierarchical clustering uses a distance metric to compute similarity:
- Euclidean (default)
- Manhattan, Cosine, etc.
🔗 Linkage Criteria
Used to define the distance between clusters (compared in the short sketch after this list):
- Single Linkage: minimum distance between points
- Complete Linkage: maximum distance
- Average Linkage: mean distance
- Ward's Method: minimizes variance (recommended for Euclidean data)
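To see how the linkage choice affects merge distances, here is a small self-contained sketch (on a fresh make_blobs sample, so the numbers are illustrative):
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs
X_demo, _ = make_blobs(n_samples=50, centers=3, random_state=42)
# Same data, different linkage criteria -> different merge heights in the dendrogram
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X_demo, method=method)
    print(f"{method:>8}: height of final merge = {Z[-1, 2]:.2f}")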
✅ Step-by-Step Example using make_blobs()
✅ Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage
✅ Step 2: Generate Dataset
# Create dataset with 3 true clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
# Optional: Create DataFrame
df = pd.DataFrame(X, columns=['Feature1', 'Feature2'])
df['TrueLabel'] = y_true
✅ Step 3: Plot Original Data
plt.scatter(df['Feature1'], df['Feature2'], c=y_true, cmap='viridis', s=50)
plt.title("Original Dataset (True Labels)")
plt.xlabel("Feature1")
plt.ylabel("Feature2")
plt.grid(True)
plt.show()
✅ Step 4: Plot Dendrogram
linked = linkage(X, method='ward')  # 'ward' minimizes variance
plt.figure(figsize=(12, 6))
dendrogram(linked,
orientation='top',
distance_sort='descending',
show_leaf_counts=False)
plt.title("Dendrogram (Ward's Linkage)")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.grid(True)
plt.show()
✅ Step 5: Apply Agglomerative Clustering
# Create an Agglomerative model with 3 clusters (Euclidean distance is the default;
# the old affinity= argument was renamed to metric= in recent scikit-learn versions)
agglo = AgglomerativeClustering(n_clusters=3, linkage='ward')
agglo_labels = agglo.fit_predict(X)
# Add to DataFrame
df['AggloLabel'] = agglo_labels
✅ Step 6: Visualize Hierarchical Clustering
plt.scatter(df['Feature1'], df['Feature2'], c=agglo_labels, cmap='rainbow', s=50)
plt.title("Agglomerative Hierarchical Clustering")
plt.xlabel("Feature1")
plt.ylabel("Feature2")
plt.grid(True)
plt.show()
✅ Step 7: Compare with K-Means
# Apply KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
df['KMeansLabel'] = kmeans_labels
✅ Step 8: Evaluate Using Silhouette Score
kmeans_score = silhouette_score(X, kmeans_labels)
agglo_score = silhouette_score(X, agglo_labels)
print(f"K-Means Silhouette Score: {kmeans_score:.3f}")
print(f"Agglomerative Clustering Silhouette Score: {agglo_score:.3f}")
✅ Step 9: Final Comparison Table
Method | Silhouette Score (−1 to 1) | Notes |
---|---|---|
K-Means | e.g. 0.71 | Fast, works well for spherical clusters |
Agglomerative (Ward) | e.g. 0.70 | Works well with hierarchical structure |
⚠ Note: Scores vary slightly based on dataset randomness. The closer to 1, the better the clustering.
🧠 When to Use Hierarchical Clustering?
✅ Use it when:
- You want a hierarchical view of clusters (dendrogram)
- You don't know how many clusters exist upfront
- The dataset is small to medium
❌ Avoid it when:
- The dataset is very large (slow performance)
- You need real-time clustering (it's non-incremental)