In-Depth Tutorial: Linear Regression for Machine Learning
This tutorial is structured to take you from understanding the theory to implementing and evaluating
linear regression models. We’ll also explore variations like multiple and polynomial regression, and
I’ll provide practical tips to avoid common pitfalls.
Top Machine Learning Algorithms:
Regression:
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Ridge and Lasso Regression
- Support Vector Regression
Classification:
- Linear Models
  - Logistic Regression
  - Support Vector Machines
- Non-linear Models
  - K-Nearest Neighbours
  - Kernel SVM
  - Naive Bayes
  - Decision Tree Classification
  - Random Forest Classification
Table of Contents
1. Understanding Linear Regression
- What is Linear Regression?
- The Math Behind It
- Linear Regression in Supervised Learning
2. Key Concepts and Assumptions
- Assumptions of Linear Regression
- Cost Function and Optimization
3. Practical Implementation: Simple Linear Regression
- Step-by-Step Python Example
- Visualizing and Interpreting Results
4. Multiple Linear Regression
- Theory and Use Cases
- Python Example with Multiple Features
5. Polynomial Regression
- Handling Nonlinear Relationships
- Python Example
6. Regularized Regression (Ridge and Lasso)
- Preventing Overfitting
- Python Examples
7. Evaluating Regression Models
- Metrics: MSE, RMSE, R²
- Cross-Validation
8. Practical Tips and Common Pitfalls
- Data Preprocessing
- Handling Outliers and Multicollinearity
9. Real-World Example: House Price Prediction
- End-to-End Python Project
1. Understanding Linear Regression
What is Linear Regression?
Linear regression predicts a continuous numerical value (e.g., house price, temperature) based on one or
more input features (e.g., house size, weather conditions).
It models the relationship as a straight line (or hyperplane in higher dimensions).
- Simple Linear Regression: One feature.
  Equation: y = mx + b
  - y: Predicted value (target).
  - x: Input feature.
  - m: Slope (weight of the feature).
  - b: Intercept (bias term).
- Multiple Linear Regression: Multiple features.
  Equation: y = m₁x₁ + m₂x₂ + ... + mₙxₙ + b
The Math Behind It
Linear regression finds the best-fit line by minimizing the error between predicted and actual values.
The error is measured using a cost function, typically the Mean Squared Error (MSE):
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
- yᵢ: Actual value.
- ŷᵢ: Predicted value (mx + b).
- n: Number of data points.
The model adjusts m and b to make MSE as small as possible using an optimization technique called Gradient Descent (a minimal sketch follows these steps):
1. Start with random m and b.
2. Compute the gradient (the direction that reduces MSE).
3. Update m and b iteratively until MSE converges.
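To make those steps concrete, here is a minimal NumPy sketch of gradient descent for simple linear regression. The toy data, learning rate, and iteration count are illustrative choices, not anything scikit-learn uses internally.
import numpy as np

# Toy data: y is roughly 3x + 2 plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

m, b = 0.0, 0.0          # start with arbitrary parameters
learning_rate = 0.01
n = len(x)

for _ in range(5000):
    y_hat = m * x + b                      # current predictions
    # Gradients of MSE with respect to m and b
    dm = (-2 / n) * np.sum(x * (y - y_hat))
    db = (-2 / n) * np.sum(y - y_hat)
    m -= learning_rate * dm                # step against the gradient
    b -= learning_rate * db

print(f"m ≈ {m:.2f}, b ≈ {b:.2f}")         # approaches roughly 3 and 2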
Linear Regression in Supervised Learning
Linear regression is a supervised learning algorithm because it uses labeled data (input features paired with known
outputs). Supervised learning includes:
- Regression: Predicting numbers (e.g., linear regression).
- Classification: Predicting categories (e.g., logistic regression).
In linear regression:
1. You provide a dataset with features (X) and labels (y).
2. The model learns the relationship by optimizing m and b.
3. You can then predict y for new X.
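In scikit-learn terms, that three-step workflow is only a few lines. The numbers below are made up purely for illustration; Section 3 walks through a full example.
from sklearn.linear_model import LinearRegression

X = [[2], [4], [6], [8]]               # 1. features (study hours, one column)
y = [50, 60, 70, 80]                   #    paired with labels (grades)
model = LinearRegression().fit(X, y)   # 2. the model learns m and b
print(model.predict([[5]]))            # 3. predict y for new X -> [65.]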
2. Key Concepts and Assumptions
Assumptions of Linear Regression
For linear regression to work well, these assumptions should hold:
1. Linearity: The relationship between features and target is linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: Constant variance of errors across all levels of features.
4. Normality: Errors are normally distributed (important for statistical tests).
5. No multicollinearity: Features are not highly correlated (for multiple regression).
If these assumptions are violated, the model may perform poorly. We’ll discuss how to check them later.
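As a preview of those checks, a residual plot is a quick visual diagnostic for linearity and homoscedasticity: residuals should scatter randomly around zero with roughly constant spread. A minimal sketch with made-up data (substitute your own X and y):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Illustrative data only: replace with your own X and y
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)   # the errors the assumptions describe

# Random scatter around 0 with constant spread suggests
# linearity and homoscedasticity are plausible
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.title('Residual Plot')
plt.show()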
Cost Function and Optimization
The cost function (MSE) quantifies how wrong the predictions are. The goal is to find the m and b that minimize it. Scikit-learn's LinearRegression solves this with Ordinary Least Squares (OLS); gradient descent is the alternative used by estimators such as SGDRegressor:
- OLS: Solves for m and b analytically (fast for small datasets).
- Gradient Descent: Iteratively updates m and b (better for large datasets).
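For intuition, here is the analytical OLS route computed with NumPy's least-squares solver on toy numbers; scikit-learn's LinearRegression reaches the same m and b.
import numpy as np

x = np.array([2, 3, 4, 5, 6], dtype=float)
y = np.array([50, 55, 60, 65, 70], dtype=float)

# Design matrix with a column of ones for the intercept b
A = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)   # solves min ||A·[m, b] - y||²
print(f"m = {m:.2f}, b = {b:.2f}")               # m = 5.00, b = 40.00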
3. Practical Implementation: Simple Linear Regression
Let’s implement simple linear regression to predict student grades based on study hours.
Python Code
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Create a dataset
data = {
    'Hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Grade': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
}
df = pd.DataFrame(data)

# Step 2: Prepare data
X = df[['Hours']]  # Feature (2D array)
y = df['Grade']    # Target

# Step 3: Train the model
model = LinearRegression()
model.fit(X, y)

# Step 4: Make predictions
y_pred = model.predict(X)

# Step 5: Evaluate the model
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Slope (m): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Step 6: Visualize
plt.scatter(X, y, color='blue', label='Actual Grades')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel('Study Hours')
plt.ylabel('Grade')
plt.title('Linear Regression: Study Hours vs Grade')
plt.legend()
plt.show()

# Step 7: Predict for new data
new_hours = np.array([[6.5]])
predicted_grade = model.predict(new_hours)
print(f"Predicted grade for 6.5 hours: {predicted_grade[0]:.2f}")
Explanation
- Dataset: 10 students with study hours and grades.
- Training: `model.fit(X, y)` computes m and b.
- Evaluation:
- MSE: Measures prediction error.
- R²: Tells how much of the grade variation is explained (0 to 1).
- Visualization: Scatter plot with a red line showing the fit.
- Prediction: Estimates grade for 6.5 hours.
Expected Output
- Slope (5.0): Each extra hour adds 5 points to the grade.
- Intercept (40.0): Base grade without studying.
- MSE (0.0): This synthetic data is perfectly linear, so the line fits exactly.
- R² (1.0): A perfect fit for this dataset.
- Plot: Red line passing through the blue points.
- Prediction: 72.5 for 6.5 hours.
---
4. Multiple Linear Regression
Theory and Use Cases
Multiple linear regression extends simple linear regression to multiple features.
Equation: y = m₁x₁ + m₂x₂ + ... + mₙxₙ + b
Use Cases:
- Predicting house prices using size, bedrooms, and location.
- Estimating sales based on advertising budget and market trends.
Python Example
Predict house prices using size and bedrooms.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create dataset
data = {
    'Size': [1400, 1600, 1700, 1875, 1100, 1550, 2350],
    'Bedrooms': [3, 3, 4, 4, 2, 3, 5],
    'Price': [245, 312, 279, 308, 199, 219, 405]
}
df = pd.DataFrame(data)

# Prepare data
X = df[['Size', 'Bedrooms']]
y = df['Price']

# Train model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Coefficients: Size = {model.coef_[0]:.2f}, Bedrooms = {model.coef_[1]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Predict for a new house
new_house = np.array([[1500, 3]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sq ft, 3 bedrooms: ${predicted_price[0]:.2f} thousand")
Explanation
- Coefficients: Show how much each feature (size, bedrooms) affects price.
- Intercept: Base price when size and bedrooms are 0 (theoretical).
- Evaluation: MSE and R² assess model quality.
- Prediction: Combines both features for a new house.
5. Polynomial Regression
Handling Nonlinear Relationships
When the relationship isn’t linear, polynomial regression fits a curve by adding polynomial terms (e.g., x², x³).
Equation: y = m₁x + m₂x² + ... + b
Python Example
Predict sales based on advertising budget (nonlinear pattern).
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Create dataset
data = {
    'Ad_Budget': [10, 20, 30, 40, 50],
    'Sales': [100, 250, 450, 550, 600]
}
df = pd.DataFrame(data)

# Prepare data
X = df[['Ad_Budget']]
y = df['Sales']

# Create polynomial regression model (degree 2)
polyreg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
polyreg.fit(X, y)

# Predict over a fine grid for a smooth curve
X_plot = np.linspace(10, 50, 100).reshape(-1, 1)
y_plot = polyreg.predict(X_plot)

# Evaluate
y_pred = polyreg.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Visualize
plt.scatter(X, y, color='blue', label='Actual Sales')
plt.plot(X_plot, y_plot, color='red', label='Polynomial Fit')
plt.xlabel('Ad Budget ($)')
plt.ylabel('Sales')
plt.title('Polynomial Regression: Ad Budget vs Sales')
plt.legend()
plt.show()

# Predict for a new budget
new_budget = np.array([[35]])
predicted_sales = polyreg.predict(new_budget)
print(f"Predicted sales for $35 budget: {predicted_sales[0]:.2f}")
Explanation
- PolynomialFeatures: Transforms X into [X, X²].
- Pipeline: Combines transformation and regression.
- Curve: Fits nonlinear data better than a straight line.
- Evaluation: MSE and R² show fit quality.
6. Regularized Regression (Ridge and Lasso)
Preventing Overfitting
Overfitting happens when the model fits the training data too closely, including noise, and performs poorly
on new data. Ridge and Lasso regression add penalties to prevent this:
- Ridge: Penalizes large coefficients (L2 regularization).
- Lasso: Penalizes coefficients and can set some to zero (L1 regularization).
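Concretely, both methods minimize the usual error term plus a penalty controlled by a strength parameter α (scikit-learn's alpha); ignoring scaling constants that differ between implementations, the objectives look like:
\[
\text{Ridge: } \text{MSE} + \alpha \sum_{j} m_j^2
\qquad
\text{Lasso: } \text{MSE} + \alpha \sum_{j} |m_j|
\]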
Python Example (Ridge)
Using the house price dataset.
# Import libraries
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# House price dataset (a subset of the multiple regression data)
data = {
    'Size': [1400, 1600, 1700, 1875, 1100],
    'Bedrooms': [3, 3, 4, 4, 2],
    'Price': [245, 312, 279, 308, 199]
}
df = pd.DataFrame(data)
X = df[['Size', 'Bedrooms']]
y = df['Price']

# Train Ridge model
ridge = Ridge(alpha=1.0)  # alpha controls penalty strength
ridge.fit(X, y)

# Evaluate
y_pred = ridge.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Coefficients: Size = {ridge.coef_[0]:.2f}, Bedrooms = {ridge.coef_[1]:.2f}")
print(f"Intercept: {ridge.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
Explanation
- Alpha: Higher values shrink coefficients more, reducing overfitting.
- Coefficients: Smaller than in regular regression, making the model more robust.
Try Lasso by replacing `Ridge` with `Lasso` in the code above; Lasso may set some coefficients to exactly zero, effectively selecting features (see the sketch below).
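A minimal Lasso sketch, reusing X and y from the Ridge example above; alpha=1.0 is just an illustrative value to tune for your data.
from sklearn.linear_model import Lasso

# Reuses X and y from the Ridge example above
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)  # with unscaled features you may see a convergence warning; scaling helps

print(f"Coefficients: Size = {lasso.coef_[0]:.2f}, Bedrooms = {lasso.coef_[1]:.2f}")
print(f"Intercept: {lasso.intercept_:.2f}")
# A coefficient printed as 0.00 means Lasso dropped that feature entirely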
---
7. Evaluating Regression Models
Metrics
1. Mean Squared Error (MSE):
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
Lower is better.
2. Root Mean Squared Error (RMSE):
\[
\text{RMSE} = \sqrt{\text{MSE}}
\]
Easier to interpret (same units as target).
3. R² Score:
\[
R^2 = 1 - \frac{\text{Sum of Squared Errors}}{\text{Total Sum of Squares}}
\]
- 0: Model explains none of the variance.
- 1: Perfect fit.
- Negative: Worse than predicting the mean.
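In practice each metric is a one-liner; the actual and predicted values below are made up just to show the calls.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
y_pred = np.array([2.8, 5.3, 6.9, 9.4])   # model predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R² = {r2:.3f}")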
Cross-Validation
Splitting data into training and test sets ensures the model generalizes well. K-Fold Cross-Validation splits
data into k parts, trains on k-1, and tests on the remaining part, repeating k times.
from sklearn.model_selection import cross_val_score

# Use the house price dataset (X, y from earlier); each test fold needs
# at least two samples for R² to be defined, so keep cv small on tiny data
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-Validated R² Scores: {scores}")
print(f"Average R²: {scores.mean():.2f}")
8. Practical Tips and Common Pitfalls
Data Preprocessing
1. Handle Missing Values: Fill with mean/median or drop rows.
df.fillna(df.mean(numeric_only=True), inplace=True)
2. Feature Scaling: Standardize features (mean=0, std=1) for better performance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
3. Encode Categorical Features: Convert categories (e.g., location) to numbers using one-hot encoding.
df = pd.get_dummies(df, columns=['Location'])
Handling Outliers
Outliers can skew the line. Detect them using z-scores or IQR, then remove or transform them.
# Remove outliers using the IQR rule
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['Price'] < (Q1 - 1.5 * IQR)) | (df['Price'] > (Q3 + 1.5 * IQR)))]
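The z-score alternative mentioned above works the same way; a threshold of 3 standard deviations is a common but arbitrary choice.
# Keep only rows whose Price is within 3 standard deviations of the mean
z = (df['Price'] - df['Price'].mean()) / df['Price'].std()
df = df[z.abs() <= 3]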
Multicollinearity
If features are highly correlated (e.g., size and bedrooms), the model becomes unstable. Check with
Variance Inflation Factor (VIF):
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
- VIF > 5–10 indicates multicollinearity. Remove or combine correlated features.
---
9. Real-World Example: House Price Prediction
Let’s build an end-to-end project using a realistic dataset.
Python Code
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Step 1: Create a realistic dataset (or load from CSV)
data = {
    'Size': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'Bedrooms': [3, 3, 4, 4, 2, 3, 5, 4, 3, 3],
    'Age': [5, 10, 2, 8, 20, 15, 3, 7, 12, 4],
    'Price': [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
}
df = pd.DataFrame(data)

# Step 2: Preprocess data
# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Features and target
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Step 4: Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Predict and evaluate
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Metrics
print("Training MSE:", mean_squared_error(y_train, y_pred_train))
print("Test MSE:", mean_squared_error(y_test, y_pred_test))
print("Training R²:", r2_score(y_train, y_pred_train))
print("Test R²:", r2_score(y_test, y_pred_test))

# Step 6: Visualize predictions vs actuals (test set)
plt.scatter(y_test, y_pred_test, color='blue', label='Predicted vs Actual')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Perfect Fit')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('House Price Prediction: Actual vs Predicted')
plt.legend()
plt.show()

# Step 7: Predict for a new house
new_house = scaler.transform(np.array([[1500, 3, 5]]))
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sq ft, 3 bedrooms, 5 years old: ${predicted_price[0]:.2f} thousand")
Explanation
- Preprocessing: Checks for missing values and scales features.
- Train-Test Split: 80% training, 20% testing to evaluate generalization.
- Evaluation: Compares MSE and R² for training and test sets.
- Visualization: Scatter plot shows how close predictions are to actuals.
- Prediction: Scales new data before predicting.
Post your answers to the questions and ask if you have any doubts.