
Classification Tutorial: Linear and Non-Linear Models with Heart Disease Dataset

Introduction

Classification is a machine learning task where we assign a category (class) to an input based on its features. For example, predicting whether a patient has heart disease (yes/no) based on medical data. Classification algorithms are divided into:

  • Linear Models: Assume classes can be separated by a straight line (or hyperplane in higher dimensions).
  • Non-Linear Models: Capture complex, curved boundaries between classes.

For Beginners: Think of classification like sorting fruits into apples and oranges. Linear models draw a straight line to separate them, while non-linear models draw wiggly lines to handle trickier cases where apples and oranges are mixed in complex patterns.

For Professionals: Classification involves learning a function \( f(X) \to y \), where \( X \) is the feature matrix and \( y \) is the class label (e.g., 0 or 1). Linear models assume \( f \) is a linear combination of features, while non-linear models use transformations or local patterns to model complex decision boundaries.

This tutorial covers seven algorithms:

  • Linear: Logistic Regression, Support Vector Machine (SVM, Linear Kernel).
  • Non-Linear: K-Nearest Neighbors (K-NN), Kernel SVM (RBF Kernel), Naive Bayes, Decision Tree, Random Forest.
  • Dataset: UCI Heart Disease dataset.
  • Tools: Python with scikit-learn, matplotlib, seaborn.

We’ll explain each algorithm’s theory, implement it, and visualize decision boundaries to show how it separates classes.

Dataset: UCI Heart Disease

The UCI Heart Disease dataset is a standard dataset for binary classification, available from the UCI Machine Learning Repository. It contains medical data to predict whether a patient has heart disease.

  • Features: 13 numerical and categorical features, e.g.:
    • age: Age in years.
    • chol: Serum cholesterol in mg/dl.
    • thalach: Maximum heart rate achieved.
    • cp: Chest pain type (1–4).
    • (See UCI link for full details.)
  • Target: Binary (0 = no heart disease, 1 = heart disease).
  • Size: 303 samples (165 no disease, 138 disease).
  • Source: We’ll load it via pandas from a processed CSV file (Cleveland dataset).

To simplify visualization, we’ll select two features (age and chol) for decision boundary plots. Later, we’ll use all features for performance evaluation.

For Beginners: Imagine this dataset as a table where each row is a patient, and columns are their health measurements (like age or cholesterol). We want to predict if they have heart disease based on these measurements.

For Professionals: The dataset is a matrix \( X \in \mathbb{R}^{303 \times 13} \) with labels \( y \in \{0, 1\}^{303} \). Features include continuous (e.g., chol) and ordinal (e.g., cp) variables, requiring preprocessing like standardization for continuous features and encoding for categorical ones.

Prerequisites

  • Python 3.x

  • Libraries: scikit-learn, numpy, pandas, matplotlib, seaborn

  • Install dependencies:

    pip install scikit-learn numpy pandas matplotlib seaborn
    
  • Dataset: Download the processed Cleveland dataset from UCI or use a hosted version (e.g., via Kaggle or GitHub).

Step 1: Theoretical Background

Before diving into code, let’s understand each algorithm’s theory, explained for both beginners and professionals.

1.1 Logistic Regression (Linear)

Beginner Explanation:

  • Imagine you’re trying to predict if a patient has heart disease. Logistic Regression looks at their features (like age and cholesterol) and gives a probability (e.g., 80% chance of disease). It draws a straight line to separate “disease” from “no disease” patients.
  • It’s like deciding whether a student passes an exam based on study hours: more hours, higher chance of passing.

Professional Explanation:

  • Logistic Regression models the probability of class 1 (e.g., heart disease) using the logistic function: \[ P(y=1|X) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}} \] where \( w_i \) are weights, \( x_i \) are features, and \( w_0 \) is the intercept.
  • The model is trained by maximizing the log-likelihood: \[ \mathcal{L} = \sum_i \left[ y_i \log P(y_i=1) + (1-y_i) \log\left(1-P(y_i=1)\right) \right] \]
  • Strengths: Interpretable, fast, works well for linearly separable data.
  • Weaknesses: Cannot capture non-linear relationships.
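
To make the formula concrete, here is a minimal NumPy sketch with made-up weights and features (not values learned from the dataset) that computes the logistic probability and the log-likelihood term for a single patient:

import numpy as np

w0 = -0.5                          # intercept (hypothetical)
w = np.array([0.8, 0.6])           # weights for age and cholesterol (hypothetical)
x = np.array([1.2, 0.4])           # one patient's standardized age and chol

score = w0 + w @ x                 # linear combination w0 + w1*x1 + w2*x2
p = 1.0 / (1.0 + np.exp(-score))   # logistic function -> P(y = 1 | x)
print(round(p, 3))                 # ~0.668: predict class 1 if p > 0.5

y = 1                              # suppose the true label is "disease"
log_likelihood_term = y * np.log(p) + (1 - y) * np.log(1 - p)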

1.2 Support Vector Machine (Linear Kernel)

Beginner Explanation:

  • SVM draws the best straight line to separate two groups (e.g., disease vs. no disease). It tries to keep the line as far away as possible from points in both groups, like a road with wide shoulders.
  • Think of it as finding the widest path between two crowds at a festival.

Professional Explanation:

  • Linear SVM finds the hyperplane \( w^T x + b = 0 \) that maximizes the margin between classes. The margin is the distance to the nearest points (support vectors).
  • It solves: \[ \min_{w, b} \frac{1}{2} \|w\|^2 \text{ subject to } y_i(w^T x_i + b) \geq 1 \]
  • For soft margins (allowing some misclassifications), it minimizes: \[ \min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \] where \( \xi_i \) are slack variables and \( C \) controls the trade-off.
  • Strengths: Robust to outliers, effective in high dimensions.
  • Weaknesses: Sensitive to scaling, limited to linear boundaries.
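
As a quick illustration of the decision rule and the margin (the weights below are invented, not fitted), the class of a new point comes from the sign of \( w^T x + b \) and the margin width is \( 2 / \|w\| \):

import numpy as np

w = np.array([1.5, -0.5])                # hypothetical learned weights
b = -0.2                                 # hypothetical learned intercept

x_new = np.array([0.8, 1.0])             # a new (standardized) patient
decision = w @ x_new + b                 # positive -> class 1, negative -> class 0
pred = int(decision >= 0)
margin_width = 2.0 / np.linalg.norm(w)   # wider margin = more separation between classes
print(pred, round(margin_width, 3))      # 1 1.265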

1.3 K-Nearest Neighbors (K-NN, Non-Linear)

Beginner Explanation:

  • K-NN is like asking your five closest friends for advice. It looks at the ( k ) nearest patients (based on features like age and cholesterol) and predicts the majority class among them.
  • If most of your neighbors have heart disease, it predicts you do too.

Professional Explanation:

  • K-NN classifies a point \( x \) by finding its \( k \) nearest neighbors in the feature space (using Euclidean distance) and assigning the majority class: \[ y = \text{mode}(y_i \text{ for } i \in \text{nearest } k \text{ points}) \]
  • Distance metric: \[ d(x, x') = \sqrt{\sum_i (x_i - x'_i)^2} \]
  • Hyperparameter: \( k \) (number of neighbors).
  • Strengths: Simple, captures local patterns, non-parametric.
  • Weaknesses: Slow for large datasets, sensitive to noise and scaling.
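
The whole algorithm fits in a few lines of NumPy. The toy sketch below (five invented patients with raw age and cholesterol) classifies a new point from its three nearest neighbors; note how cholesterol dominates the distance when features are not standardized, which is why we scale them later:

import numpy as np

X_toy = np.array([[63, 233], [37, 250], [41, 204], [56, 236], [57, 354]])  # [age, chol] for 5 patients
y_toy = np.array([1, 0, 0, 1, 1])                                          # 1 = disease, 0 = no disease
x_new = np.array([58, 240])
k = 3

dists = np.sqrt(((X_toy - x_new) ** 2).sum(axis=1))   # Euclidean distance to every patient
nearest = np.argsort(dists)[:k]                       # indices of the k closest patients
pred = np.bincount(y_toy[nearest]).argmax()           # majority class among them
print(pred)                                           # 1 (two of the three neighbors have disease)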

1.4 Kernel SVM (RBF Kernel, Non-Linear)

Beginner Explanation:

  • Kernel SVM is like Linear SVM but can draw curved lines. It transforms the data into a “magic space” where a straight line works, then brings it back to our world as a curved boundary.
  • Imagine bending a piece of paper to separate two groups easily.

Professional Explanation:

  • Kernel SVM maps data to a higher-dimensional space using a kernel function, allowing non-linear boundaries. The RBF (Radial Basis Function) kernel is: \[ K(x, x') = \exp(-\gamma \|x - x'\|^2) \]
  • The decision function is: \[ f(x) = \text{sign} \left( \sum_i \alpha_i y_i K(x, x_i) + b \right) \]
  • Hyperparameters: \( C \) (regularization), \( \gamma \) (kernel width).
  • Strengths: Flexible, captures complex patterns.
  • Weaknesses: Computationally intensive, requires tuning.
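
A small sketch of the kernel itself: it returns a similarity close to 1 for nearby points and close to 0 for distant ones, with \( \gamma \) controlling how quickly that similarity decays (the value 0.5 below is arbitrary):

import numpy as np

def rbf_kernel(x, x_prime, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

a = np.array([0.0, 0.0])
print(rbf_kernel(a, np.array([0.1, 0.1])))   # ~0.990: nearby points are very similar
print(rbf_kernel(a, np.array([3.0, 3.0])))   # ~0.0001: distant points contribute almost nothing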

1.5 Naive Bayes (Gaussian, Non-Linear)

Beginner Explanation:

  • Naive Bayes is like a detective using clues. It calculates the probability of heart disease based on each feature (e.g., high cholesterol increases the chance) and combines them.
  • It assumes features don’t affect each other, like assuming age and cholesterol are unrelated.

Professional Explanation:

  • Naive Bayes applies Bayes’ theorem with the “naive” assumption of feature independence: \[ P(y|x) = \frac{P(y) \prod_i P(x_i|y)}{P(x)} \]
  • For Gaussian Naive Bayes, each feature \( x_i \) follows a Gaussian distribution: \[ P(x_i|y) = \frac{1}{\sqrt{2\pi \sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \]
  • Strengths: Fast, effective for small or text data.
  • Weaknesses: Independence assumption often unrealistic.
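
The sketch below works through Bayes’ theorem for a single feature (cholesterol) with invented class priors, means, and standard deviations, just to show how the pieces combine into a posterior probability:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

prior = {0: 0.55, 1: 0.45}        # P(y): hypothetical class priors
mu    = {0: 240.0, 1: 260.0}      # hypothetical mean cholesterol per class
sigma = {0: 45.0, 1: 50.0}        # hypothetical std of cholesterol per class

x = 275.0                         # a new patient's cholesterol
unnorm = {c: prior[c] * gaussian_pdf(x, mu[c], sigma[c]) for c in (0, 1)}
posterior_1 = unnorm[1] / (unnorm[0] + unnorm[1])   # P(y = 1 | x)
print(round(posterior_1, 3))                        # ~0.49 with these made-up numbers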

1.6 Decision Tree Classification (Non-Linear)

Beginner Explanation:

  • A Decision Tree is like a flowchart. It asks questions (e.g., “Is cholesterol > 200?”) and follows a path to decide “disease” or “no disease.”
  • It’s like a game of 20 questions to classify a patient.

Professional Explanation:

  • Decision Trees recursively split the feature space based on thresholds, forming a tree where leaves represent classes.
  • Splits are chosen to maximize the reduction in impurity (e.g., measured by Gini impurity): \[ \text{Gini} = 1 - \sum_i p_i^2 \]
  • Strengths: Interpretable, handles non-linear data.
  • Weaknesses: Prone to overfitting without pruning.
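
Gini impurity is easy to compute by hand. In the sketch below, a mixed node (impurity 0.5) is split by a hypothetical threshold into two pure children (impurity 0.0), which is exactly the kind of split the tree prefers:

import numpy as np

def gini(labels):
    # Gini = 1 - sum(p_i^2); 0.0 means the node contains a single class
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # mixed node
left   = np.array([0, 0, 0, 0])               # e.g. patients below some cholesterol threshold
right  = np.array([1, 1, 1, 1])               # patients above the threshold
print(gini(parent), gini(left), gini(right))  # 0.5 0.0 0.0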

1.7 Random Forest Classification (Non-Linear)

Beginner Explanation:

  • Random Forest is like a team of Decision Trees voting together. Each tree makes a guess, and the majority wins.
  • It’s like asking multiple doctors for a diagnosis and trusting the most common answer.

Professional Explanation:

  • Random Forest is an ensemble of decision trees, each trained on a random subset of data and features (bagging and feature randomness).
  • The final prediction is the majority vote: \[ y = \text{mode}(y_{\text{tree}_1}, y_{\text{tree}_2}, \dots) \]
  • Hyperparameters: number of trees (\( n_\text{estimators} \)), max depth.
  • Strengths: Robust, reduces overfitting, high accuracy.
  • Weaknesses: Less interpretable, slower than single trees.
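
The voting step itself is simple. With hypothetical predictions from five trees for four patients, the forest’s output is just the per-patient majority:

import numpy as np

tree_preds = np.array([     # rows = trees, columns = patients (1 = disease)
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
])
forest_pred = (tree_preds.mean(axis=0) > 0.5).astype(int)   # majority vote per patient
print(forest_pred)                                          # [1 0 1 1]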

Step 2: Setup and Data Preparation

We load the Heart Disease dataset, preprocess it, and prepare for training.

Code: Data Loading and Preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from matplotlib.colors import ListedColormap

# Load dataset
url = "https://archive.ics.uci.edu/ml/
machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang',
 'oldpeak', 'slope', 'ca', 'thal', 'target']
data = pd.read_csv(url, names=columns)

# Preprocess: Handle missing values and convert target
data = data.replace('?', np.nan).dropna()
data['target'] = data['target'].apply(lambda x: 1 if x > 0 else 0)  # Binary: 0 = no disease
, 1 = disease

# Select two features for visualization
X_2d = data[['age', 'chol']].values
y = data['target'].values
feature_names = ['Age', 'Cholesterol']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_2d, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# For the full feature set (use a separate scaler so the 2D scaler above is untouched)
X_full = data.drop('target', axis=1).values
X_full_train, X_full_test, y_full_train, y_full_test = train_test_split(
    X_full, y, test_size=0.3, random_state=42)
scaler_full = StandardScaler()
X_full_train_scaled = scaler_full.fit_transform(X_full_train)
X_full_test_scaled = scaler_full.transform(X_full_test)

# Function to plot decision boundaries
def plot_decision_boundary(model, X, y, title):
    h = 0.02  # Step size in mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']), alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['#FF0000', '#0000FF']),
 edgecolors='k')
    plt.xlabel('Age (standardized)')
    plt.ylabel('Cholesterol (standardized)')
    plt.title(title)
    plt.show()

Explanation:

  • Data Loading: The dataset is loaded from UCI’s repository as a CSV.
  • Preprocessing: Missing values (marked as ‘?’) are dropped, and the target is converted to binary (0 = no disease, 1 = disease).
  • Feature Selection: age and chol are used for 2D visualization; all 13 features are used later.
  • Splitting: 70% training, 30% testing.
  • Standardization: Features are scaled to mean=0, std=1 for algorithms like SVM and K-NN.
  • Visualization Function: Creates a mesh grid to plot decision boundaries; with the color maps above, class 0 (no disease) is shown in red and class 1 (disease) in blue.
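
For simplicity, the pipeline above scales every column as if it were continuous. A more careful, optional variant would one-hot encode the nominal features and scale only the continuous ones; the sketch below shows one way to do that, where the column grouping is our assumption based on the UCI feature descriptions:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nominal = ['cp', 'restecg', 'slope', 'thal']                   # assumed nominal codes
continuous = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), nominal),
     ('scale', StandardScaler(), continuous)],
    remainder='passthrough')   # sex, fbs, exang, ca are left as-is

X_alt = preprocess.fit_transform(data.drop('target', axis=1))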

Step 3: Linear Models

We implement and explain Logistic Regression and Linear SVM.

3.1 Logistic Regression

Theory Recap:

  • Beginners: Logistic Regression gives a probability score (0 to 1) for heart disease. It uses a straight line to separate patients with and without disease.
  • Professionals: It models ( P(y=1|X) ) using the logistic function and optimizes weights via gradient descent to maximize log-likelihood.

How Logistic Regression Works Internally

Let’s break down Logistic Regression in a simple way.

What is Logistic Regression?

Logistic Regression is a machine learning algorithm used to predict the probability that something belongs to one of two categories (e.g., “Yes/No,” “Healthy/Sick,” “Spam/Not Spam”). For example, it can predict the chance someone has heart disease based on their age, weight, and other factors.

How Does It Work?

  1. Inputs (Features): You give the algorithm some data, like a person’s age, cholesterol level, or blood pressure. These are called features (denoted \( x_1, x_2, \dots \)).

  2. Weights and Intercept: Each feature gets a weight (\( w_1, w_2, \dots \)) that shows how important it is. For example, high cholesterol might get a larger weight for predicting heart disease. There is also an intercept (\( w_0 \)), a baseline term that shifts the prediction.

  3. Combining Features: The algorithm multiplies each feature by its weight, adds them up, and includes the intercept: \[ \text{Score} = w_0 + w_1 x_1 + w_2 x_2 + \dots \] This score can be any number, positive or negative.

  4. Logistic Function: To turn the score into a probability (a number between 0 and 1), it applies the logistic function: \[ P(\text{Class 1}) = \frac{1}{1 + e^{-\text{Score}}} \] A large positive score gives a probability close to 1 (e.g., “likely has heart disease”); a large negative score gives a probability close to 0 (e.g., “unlikely has heart disease”).

  5. Output: The result is a probability. For example, 0.75 means a 75% chance of heart disease. You set a threshold (commonly 0.5) to decide: if the probability exceeds 0.5, predict Class 1; otherwise, predict Class 0.

How Does It Learn?

  • The algorithm needs to find the best weights (\( w_0, w_1, w_2, \dots \)) to make accurate predictions.
  • It does this by looking at training data (e.g., people with known heart disease status) and adjusting the weights to maximize the log-likelihood, i.e., to make the predicted probabilities match the actual outcomes as closely as possible.
  • For example, if someone has heart disease (Class 1), the algorithm wants the predicted probability to be close to 1; if someone doesn’t (Class 0), it wants the probability close to 0.
  • The weights are adjusted iteratively, typically with gradient descent, until the predictions stop improving.

Why Is It Useful?

  • It is simple and works well for binary classification (two categories).
  • It gives probabilities, not just “yes/no,” which conveys how confident each prediction is.
  • It is fast and easy to interpret (e.g., you can see which features, like cholesterol, matter most).

Example:

Imagine predicting whether someone will buy a product based on their age and income:

  • Features: Age = 30, Income = $50,000.
  • The algorithm learns weights, say \( w_0 = -2 \), \( w_1 = 0.1 \) (age), \( w_2 = 0.00005 \) (income).
  • Score: \( -2 + 0.1 \cdot 30 + 0.00005 \cdot 50000 = -2 + 3 + 2.5 = 3.5 \).
  • Probability: \( \frac{1}{1 + e^{-3.5}} \approx 0.97 \).
  • Result: a 97% chance they’ll buy the product.
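
You can verify the arithmetic of this example in a couple of lines:

import math

score = -2 + 0.1 * 30 + 0.00005 * 50000   # -2 + 3 + 2.5
prob = 1 / (1 + math.exp(-score))
print(score, round(prob, 2))              # 3.5 0.97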


Implementation:

from sklearn.linear_model import LogisticRegression

# Train model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = log_reg.predict(X_test_scaled)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot decision boundary
plot_decision_boundary(log_reg, X_train_scaled, y_train, "Logistic Regression Decision Boundary")

Expected Output:

  • Accuracy: ~0.80–0.85 (with 2 features, improves with full set).
  • Decision Boundary: A straight line separating red (no disease) and blue (disease) points.
  • Classification Report: Balanced precision/recall, e.g., ~0.80 for both classes.

Explanation:

  • The model learns a linear boundary based on age and cholesterol.
  • The plot shows a clear, straight line, reflecting the linear assumption.
  • Standardization ensures stable optimization.
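
Because the model is linear, its learned parameters are directly readable. Assuming the log_reg model and feature_names from the code above, a quick inspection looks like this:

# Positive coefficients push the prediction toward class 1 (disease)
for name, coef in zip(feature_names, log_reg.coef_[0]):
    print(f"{name}: {coef:+.3f}")
print(f"Intercept: {log_reg.intercept_[0]:+.3f}")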

3.2 Support Vector Machine (Linear Kernel)

Theory Recap:

  • Beginners: Linear SVM draws a straight line to separate groups, keeping it as far as possible from points on both sides.
  • Professionals: It maximizes the margin by solving a constrained optimization problem, using support vectors to define the boundary.

Implementation:

from sklearn.svm import SVC

# Train model
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = svm_linear.predict(X_test_scaled)
print("Linear SVM Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot decision boundary
plot_decision_boundary(svm_linear, X_train_scaled, y_train, "Linear SVM Decision Boundary")

Expected Output:

  • Accuracy: ~0.80–0.85.
  • Decision Boundary: A straight line, slightly different from Logistic Regression due to margin maximization.
  • Classification Report: Similar to Logistic Regression.

Explanation:

  • The linear kernel ensures a linear boundary.
  • The plot highlights the margin, with support vectors near the boundary.
  • Scaling is critical for SVM’s optimization.
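
To see what defines the boundary, you can inspect the fitted model (assuming svm_linear from the block above); only the support vectors influence the hyperplane:

print("Support vectors per class:", svm_linear.n_support_)
print("Total support vectors:", svm_linear.support_vectors_.shape[0])
print("Hyperplane: w =", svm_linear.coef_[0], " b =", svm_linear.intercept_[0])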

Step 4: Non-Linear Models

We implement and explain K-NN, Kernel SVM, Naive Bayes, Decision Tree, and Random Forest.

4.1 K-Nearest Neighbors (K-NN)

Theory Recap:

  • Beginners: K-NN looks at the 5 closest patients and predicts the majority class.
  • Professionals: It uses Euclidean distance to find neighbors and assigns the mode of their labels.

Implementation:

from sklearn.neighbors import KNeighborsClassifier

# Train model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
print("K-NN Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot decision boundary
plot_decision_boundary(knn, X_train_scaled, y_train, "K-NN Decision Boundary (k=5)")

Expected Output:

  • Accuracy: ~0.75–0.80.
  • Decision Boundary: Irregular, patchy regions based on local neighbors.
  • Classification Report: Slightly lower performance due to noise sensitivity.

Explanation:

  • K-NN adapts to local patterns, creating a non-linear boundary.
  • The plot shows a complex, region-based separation.
  • \( k=5 \) is a reasonable default; tuning \( k \) can improve results.
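
A quick sweep over k (reusing the scaled splits from Step 2) shows the bias-variance trade-off in practice; the exact best value will depend on the split:

for k in [1, 3, 5, 7, 11, 15, 21]:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, knn_k.predict(X_test_scaled))
    print(f"k={k:2d}  accuracy={acc:.3f}")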

4.2 Kernel SVM (RBF Kernel)

Theory Recap:

  • Beginners: Kernel SVM draws curved lines by transforming data into a higher-dimensional space.
  • Professionals: The RBF kernel maps data to a space where a linear boundary is possible, solved via dual optimization.

Implementation:

# Train model
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = svm_rbf.predict(X_test_scaled)
print("Kernel SVM (RBF) Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot decision boundary
plot_decision_boundary(svm_rbf, X_train_scaled, y_train, "Kernel SVM (RBF) Decision Boundary")

Expected Output:

  • Accuracy: ~0.80–0.85.
  • Decision Boundary: Smooth, curved regions adapting to data clusters.
  • Classification Report: Improved over Linear SVM due to non-linear flexibility.

Explanation:

  • The RBF kernel creates a flexible boundary.
  • The plot shows smooth curves, capturing non-linear patterns.
  • Tuning \( C \) and \( \gamma \) can enhance performance.
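
A hedged tuning sketch using GridSearchCV (the grid values below are illustrative, not recommendations) searches over C and gamma with 5-fold cross-validation on the training set:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf', random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)
print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", round(grid.best_score_, 3))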

4.3 Naive Bayes (Gaussian)

Theory Recap:

  • Beginners: Naive Bayes calculates probabilities for each feature and combines them to predict the class.
  • Professionals: It assumes feature independence and uses Gaussian distributions for continuous features.

Implementation:

from sklearn.naive_bayes import GaussianNB

# Train model
nb = GaussianNB()
nb.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = nb.predict(X_test_scaled)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot decision boundary
plot_decision_boundary(nb, X_train_scaled, y_train, "Naive Bayes Decision Boundary")

Expected Output:

  • Accuracy: ~0.75–0.80.
  • Decision Boundary: Elliptical or quadratic, based on Gaussian assumptions.
  • Classification Report: Moderate performance due to independence assumption.

Explanation:

  • The model assumes age and cholesterol are independent, which may not hold.
  • The plot shows a smooth, non-linear boundary.
  • Naive Bayes is fast and effective for small datasets.

4.4 Decision Tree Classification

Theory Recap:

  • Beginners: Decision Trees ask a series of yes/no questions to classify patients.
  • Professionals: They split the feature space to minimize impurity (e.g., Gini).

Implementation:

from sklearn.tree import DecisionTreeClassifier

# Train model
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = dt.predict(X_test_scaled)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot decision boundary
plot_decision_boundary(dt, X_train_scaled, y_train, "Decision Tree Decision Boundary")

Expected Output:

  • Accuracy: ~0.70–0.80.
  • Decision Boundary: Rectangular, axis-aligned regions.
  • Classification Report: Prone to overfitting without depth control.

Explanation:

  • The tree splits on features like age and cholesterol.
  • The plot shows blocky regions, reflecting the tree’s structure.
  • max_depth=5 prevents overfitting.
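
The learned rules can be printed as text (assuming dt from the block above), which makes the blocky boundary easy to interpret:

from sklearn.tree import export_text

print(export_text(dt, feature_names=feature_names))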

4.5 Random Forest Classification

Theory Recap:

  • Beginners: Random Forest combines many Decision Trees for a more reliable prediction.
  • Professionals: It uses bagging and feature randomness to reduce variance and improve accuracy.

Implementation:

from sklearn.ensemble import RandomForestClassifier

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test_scaled)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot decision boundary
plot_decision_boundary(rf, X_train_scaled, y_train, "Random Forest Decision Boundary")

Expected Output:

  • Accuracy: ~0.80–0.85.
  • Decision Boundary: Smoother, complex regions from ensemble averaging.
  • Classification Report: Among the best performers.

Explanation:

  • Random Forest averages multiple trees, reducing overfitting.
  • The plot shows a refined boundary.
  • n_estimators=100 ensures robustness.
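
Random Forest also reports how much each feature contributed to its splits. With only two features this is trivial, but the same attribute ranks all 13 predictors when the model is trained on the full set (assuming rf from the block above):

for name, importance in zip(feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")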

Step 5: Performance Comparison (Full Feature Set)

We train all models on the full 13 features and compare performance.

Code: Comparison

models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Linear SVM": SVC(kernel='linear', random_state=42),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Kernel SVM (RBF)": SVC(kernel='rbf', random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42)
}

# Train and evaluate
results = []
for name, model in models.items():
    model.fit(X_full_train_scaled, y_full_train)
    y_pred = model.predict(X_full_test_scaled)
    acc = accuracy_score(y_full_test, y_pred)
    results.append({"Model": name, "Accuracy": acc})
    print(f"{name} (Full Features) Accuracy: {acc:.3f}")

# Plot results
results_df = pd.DataFrame(results)
plt.figure(figsize=(10, 6))
sns.barplot(x="Accuracy", y="Model", data=results_df.sort_values("Accuracy", ascending=False))
plt.title("Model Accuracy Comparison (Full Feature Set)")
plt.show()

Expected Output:

  • Accuracy (Full Features):
    • Random Forest: ~0.85–0.90
    • Kernel SVM: ~0.83–0.88
    • Logistic Regression: ~0.82–0.87
    • Linear SVM: ~0.82–0.87
    • K-NN: ~0.80–0.85
    • Naive Bayes: ~0.80–0.85
    • Decision Tree: ~0.75–0.80
  • Bar Plot: Shows Random Forest and Kernel SVM as top performers.

Explanation:

  • Using all 13 features improves accuracy, as they capture more information (e.g., thalach, cp).
  • Random Forest excels due to its robustness to noise and feature interactions.
  • Linear models perform well, indicating some linear separability in the data.
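
Because a single 70/30 split of roughly 300 patients gives a noisy estimate, a 5-fold cross-validation sketch (reusing the models dictionary and the scaled full-feature training set from above) gives a steadier comparison; the exact numbers will differ somewhat from the table above:

from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X_full_train_scaled, y_full_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")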

Step 6: Visualizing Decision Boundaries

The decision boundary plots show how each algorithm separates patients with and without heart disease:

  • Logistic Regression, Linear SVM: Straight lines, assuming linear separability.
  • K-NN: Patchy, local regions based on neighbors.
  • Kernel SVM: Smooth, curved boundaries.
  • Naive Bayes: Elliptical boundaries from Gaussian assumptions.
  • Decision Tree: Rectangular, axis-aligned splits.
  • Random Forest: Complex, smoothed boundaries from ensemble voting.

For Beginners: These plots are like maps showing where the model thinks “disease” or “no disease” patients are. Linear models use straight lines, while non-linear models use wiggly shapes.

For Professionals: The boundaries reflect each model’s hypothesis space. Linear models are constrained to \( w^T x + b = 0 \), while non-linear models rely on kernel-induced feature mappings (Kernel SVM), local neighborhoods (K-NN), or class-conditional probability models (Naive Bayes).

Step 7: Insights and Choosing an Algorithm

  • Linear Models:
    • Logistic Regression: Ideal for interpretable models and quick prototyping. Use when you need probabilities or linear separability is likely.
    • Linear SVM: Best for high-dimensional data with clear margins. Good for balanced datasets.
  • Non-Linear Models:
    • K-NN: Suitable for small datasets with local patterns. Avoid for large datasets due to computational cost.
    • Kernel SVM: Excellent for complex, non-linear patterns. Requires careful tuning of ( C ) and ( \gamma ).
    • Naive Bayes: Fast and effective for small or sparse data. Less accurate if features are correlated.
    • Decision Tree: Interpretable but prone to overfitting. Use with pruning or as a baseline.
    • Random Forest: Robust, high-performing, and handles feature interactions well. Use for high accuracy with moderate interpretability.
  • Heart Disease Dataset:
    • The dataset has a mix of linear and non-linear patterns, making Random Forest and Kernel SVM top performers.
    • Linear models perform well due to some linearly separable features (e.g., chol, thalach).
    • Naive Bayes and Decision Trees may underperform due to feature correlations or overfitting.
  • Practical Tips:
    • Preprocessing: Standardize continuous features; encode categorical features (e.g., cp, thal).
    • Tuning: Use GridSearchCV for hyperparameters (e.g., ( k ), ( C ), max_depth).
    • Evaluation: Use cross-validation and metrics like F1-score for imbalanced data.
    • Computational Cost: Naive Bayes and Logistic Regression are fastest; Random Forest and Kernel SVM are slower.

For Beginners: Start with Logistic Regression (easy to understand) or Random Forest (works well without much tweaking). Think of tuning as adjusting a recipe to taste better.

For Professionals: Random Forest and Kernel SVM are robust choices for this dataset, but consider feature selection (e.g., using thalach, cp) to reduce dimensionality. Evaluate models with ROC-AUC for medical applications, as false negatives are critical.
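
For a medical setting, a threshold-free metric such as ROC-AUC is often more informative than accuracy. A minimal sketch (using Random Forest on the full-feature split purely as an example) looks like this:

from sklearn.metrics import roc_auc_score

rf_full = RandomForestClassifier(n_estimators=100, random_state=42)
rf_full.fit(X_full_train_scaled, y_full_train)
probs = rf_full.predict_proba(X_full_test_scaled)[:, 1]   # predicted P(disease) for each test patient
print("ROC-AUC:", round(roc_auc_score(y_full_test, probs), 3))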

Conclusion

This tutorial provided a detailed guide to classification using linear and non-linear models on the UCI Heart Disease dataset. We covered:

  • Theory: Intuitive analogies for beginners and mathematical formulations for professionals.
  • Implementation: Python code for seven algorithms, with preprocessing and visualization.
  • Results: Random Forest and Kernel SVM achieved the highest accuracy (~0.85–0.90), while Logistic Regression and Linear SVM offered interpretability.
  • Visualizations: Decision boundary plots clarified each model’s behavior.

Next Steps:

  • Experiment with hyperparameter tuning (e.g., GridSearchCV).
  • Try feature selection or engineering (e.g., interaction terms).
  • Explore other datasets (e.g., Diabetes, Titanic).
  • Consider advanced models like XGBoost or neural networks for comparison.

For further learning, visit the UCI Machine Learning Repository or scikit-learn’s documentation.
