In-Depth Tutorial: Linear Regression for Machine Learning
This tutorial is structured to take you from understanding the theory to implementing and evaluating
linear regression models. We’ll also explore variations like multiple and polynomial regression, and
I’ll provide practical tips to avoid common pitfalls.
Top Machine Learning Algorithms:
Regression:
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Ridge and Lasso Regression
- Support Vector Regression
Classification:
- Linear Models
  - Logistic Regression
  - Support Vector Machines
- Non-linear Models
  - K-Nearest Neighbours
  - Kernel SVM
  - Naive Bayes
  - Decision Tree Classification
  - Random Forest Classification
Table of Contents
1. Understanding Linear Regression
- What is Linear Regression?
- The Math Behind It
- Linear Regression in Supervised Learning
2. Key Concepts and Assumptions
- Assumptions of Linear Regression
- Cost Function and Optimization
3. Practical Implementation: Simple Linear Regression
- Step-by-Step Python Example
- Visualizing and Interpreting Results
4. Multiple Linear Regression
- Theory and Use Cases
- Python Example with Multiple Features
5. Polynomial Regression
- Handling Nonlinear Relationships
- Python Example
6. Regularized Regression (Ridge and Lasso)
- Preventing Overfitting
- Python Examples
7. Evaluating Regression Models
- Metrics: MSE, RMSE, R²
- Cross-Validation
8. Practical Tips and Common Pitfalls
- Data Preprocessing
- Handling Outliers and Multicollinearity
9. Real-World Example: House Price Prediction
- End-to-End Python Project
1. Understanding Linear Regression
What is Linear Regression?
Linear regression predicts a continuous numerical value (e.g., house price, temperature) based on one or
more input features (e.g., house size, weather conditions).
It models the relationship as a straight line (or hyperplane in higher dimensions).
- Simple Linear Regression: One feature.
  Equation: y = mx + b
  - y: Predicted value (target).
  - x: Input feature.
  - m: Slope (weight of the feature).
  - b: Intercept (bias term).
- Multiple Linear Regression: Multiple features.
  Equation: y = m₁x₁ + m₂x₂ + ... + mₙxₙ + b
The Math Behind It
Linear regression finds the best-fit line by minimizing the error between predicted and actual values.
The error is measured using a cost function, typically the Mean Squared Error (MSE):
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
- yᵢ: Actual value.
- ŷᵢ: Predicted value (mx + b).
- n: Number of data points.
The model adjusts m and b to make MSE as small as possible using an optimization technique called Gradient Descent (a minimal sketch follows these steps):
1. Start with random m and b.
2. Compute the gradient (the direction that reduces MSE).
3. Update m and b iteratively until MSE converges.
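To make those steps concrete, here is a minimal NumPy sketch of gradient descent for simple linear regression. The toy data, learning rate, and iteration count are illustrative choices, not anything scikit-learn uses internally.
import numpy as np

# Toy data: y is roughly 3x + 2 plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

m, b = 0.0, 0.0          # start with arbitrary parameters
learning_rate = 0.01
n = len(x)

for _ in range(5000):
    y_hat = m * x + b                      # current predictions
    # Gradients of MSE with respect to m and b
    dm = (-2 / n) * np.sum(x * (y - y_hat))
    db = (-2 / n) * np.sum(y - y_hat)
    m -= learning_rate * dm                # step against the gradient
    b -= learning_rate * db

print(f"m ≈ {m:.2f}, b ≈ {b:.2f}")         # approaches roughly 3 and 2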
Linear Regression in Supervised Learning
Linear regression is a supervised learning algorithm because it uses labeled data (input features paired with known
outputs). Supervised learning includes:
- Regression: Predicting numbers (e.g., linear regression).
- Classification: Predicting categories (e.g., logistic regression).
In linear regression:
1. You provide a dataset with features (X) and labels (y).
2. The model learns the relationship by optimizing m and b.
3. You can then predict y for new X.
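In scikit-learn terms, that three-step workflow is only a few lines. The numbers below are made up purely for illustration; Section 3 walks through a full example.
from sklearn.linear_model import LinearRegression

X = [[2], [4], [6], [8]]               # 1. features (study hours, one column)
y = [50, 60, 70, 80]                   #    paired with labels (grades)
model = LinearRegression().fit(X, y)   # 2. the model learns m and b
print(model.predict([[5]]))            # 3. predict y for new X -> [65.]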
2. Key Concepts and Assumptions
Assumptions of Linear Regression
For linear regression to work well, these assumptions should hold:
1. Linearity: The relationship between features and target is linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: Constant variance of errors across all levels of features.
4. Normality: Errors are normally distributed (important for statistical tests).
5. No multicollinearity: Features are not highly correlated (for multiple regression).
If these assumptions are violated, the model may perform poorly. We’ll discuss how to check them later.
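As a preview of those checks, a residual plot is a quick visual diagnostic for linearity and homoscedasticity: residuals should scatter randomly around zero with roughly constant spread. A minimal sketch with made-up data (substitute your own X and y):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Illustrative data only: replace with your own X and y
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)   # the errors the assumptions describe

# Random scatter around 0 with constant spread suggests
# linearity and homoscedasticity are plausible
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.title('Residual Plot')
plt.show()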
Cost Function and Optimization
The cost function (MSE) quantifies how wrong the predictions are. The goal is to find the m and b that minimize it. Scikit-learn's LinearRegression solves this with Ordinary Least Squares (OLS); gradient descent is the alternative used by estimators such as SGDRegressor:
- OLS: Solves for m and b analytically (fast for small datasets).
- Gradient Descent: Iteratively updates m and b (better for large datasets).
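For intuition, here is the analytical OLS route computed with NumPy's least-squares solver on toy numbers; scikit-learn's LinearRegression reaches the same m and b.
import numpy as np

x = np.array([2, 3, 4, 5, 6], dtype=float)
y = np.array([50, 55, 60, 65, 70], dtype=float)

# Design matrix with a column of ones for the intercept b
A = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(A, y, rcond=None)   # solves min ||A·[m, b] - y||²
print(f"m = {m:.2f}, b = {b:.2f}")               # m = 5.00, b = 40.00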
3. Practical Implementation: Simple Linear Regression
Let’s implement simple linear regression to predict student grades based on study hours.
Python Code
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Create a dataset
data = {
    'Hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'Grade': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
}
df = pd.DataFrame(data)

# Step 2: Prepare data
X = df[['Hours']]  # Feature (2D array)
y = df['Grade']    # Target

# Step 3: Train the model
model = LinearRegression()
model.fit(X, y)

# Step 4: Make predictions
y_pred = model.predict(X)

# Step 5: Evaluate the model
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Slope (m): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Step 6: Visualize
plt.scatter(X, y, color='blue', label='Actual Grades')
plt.plot(X, y_pred, color='red', label='Regression Line')
plt.xlabel('Study Hours')
plt.ylabel('Grade')
plt.title('Linear Regression: Study Hours vs Grade')
plt.legend()
plt.show()

# Step 7: Predict for new data
new_hours = np.array([[6.5]])
predicted_grade = model.predict(new_hours)
print(f"Predicted grade for 6.5 hours: {predicted_grade[0]:.2f}")
Explanation
- Dataset: 10 students with study hours and grades.
- Training: `model.fit(X, y)` computes m and b.
- Evaluation:
- MSE: Measures prediction error.
- R²: Tells how much of the grade variation is explained (0 to 1).
- Visualization: Scatter plot with a red line showing the fit.
- Prediction: Estimates grade for 6.5 hours.
Expected Output
- Slope (5.0): Each extra hour adds 5 points to the grade.
- Intercept (40.0): Base grade without studying.
- MSE (0.0): This synthetic data is perfectly linear, so the line fits exactly.
- R² (1.0): A perfect fit for this dataset.
- Plot: Red line passing through the blue points.
- Prediction: 72.5 for 6.5 hours.
---
4. Multiple Linear Regression
Theory and Use Cases
Multiple linear regression extends simple linear regression to multiple features.
Equation: y = m₁x₁ + m₂x₂ + ... + mₙxₙ + b
Use Cases:
- Predicting house prices using size, bedrooms, and location.
- Estimating sales based on advertising budget and market trends.
Python Example
Predict house prices using size and bedrooms.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create dataset
data = {
    'Size': [1400, 1600, 1700, 1875, 1100, 1550, 2350],
    'Bedrooms': [3, 3, 4, 4, 2, 3, 5],
    'Price': [245, 312, 279, 308, 199, 219, 405]
}
df = pd.DataFrame(data)

# Prepare data
X = df[['Size', 'Bedrooms']]
y = df['Price']

# Train model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Coefficients: Size = {model.coef_[0]:.2f}, Bedrooms = {model.coef_[1]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Predict for a new house
new_house = np.array([[1500, 3]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sq ft, 3 bedrooms: ${predicted_price[0]:.2f} thousand")
Explanation
- Coefficients: Show how much each feature (size, bedrooms) affects price.
- Intercept: Base price when size and bedrooms are 0 (theoretical).
- Evaluation: MSE and R² assess model quality.
- Prediction: Combines both features for a new house.
5. Polynomial Regression
Handling Nonlinear Relationships
When the relationship isn’t linear, polynomial regression fits a curve by adding polynomial terms (e.g., x², x³).
Equation: y = m₁x + m₂x² + ... + b
Python Example
Predict sales based on advertising budget (nonlinear pattern).
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Create dataset
data = {
    'Ad_Budget': [10, 20, 30, 40, 50],
    'Sales': [100, 250, 450, 550, 600]
}
df = pd.DataFrame(data)

# Prepare data
X = df[['Ad_Budget']]
y = df['Sales']

# Create polynomial regression model (degree 2)
polyreg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
polyreg.fit(X, y)

# Predict over a fine grid for a smooth curve
X_plot = np.linspace(10, 50, 100).reshape(-1, 1)
y_plot = polyreg.predict(X_plot)

# Evaluate
y_pred = polyreg.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Visualize
plt.scatter(X, y, color='blue', label='Actual Sales')
plt.plot(X_plot, y_plot, color='red', label='Polynomial Fit')
plt.xlabel('Ad Budget ($)')
plt.ylabel('Sales')
plt.title('Polynomial Regression: Ad Budget vs Sales')
plt.legend()
plt.show()

# Predict for a new budget
new_budget = np.array([[35]])
predicted_sales = polyreg.predict(new_budget)
print(f"Predicted sales for $35 budget: {predicted_sales[0]:.2f}")
Explanation
- PolynomialFeatures: Transforms X into [X, X²].
- Pipeline: Combines transformation and regression.
- Curve: Fits nonlinear data better than a straight line.
- Evaluation: MSE and R² show fit quality.
6. Regularized Regression (Ridge and Lasso)
Preventing Overfitting
Overfitting happens when the model fits the training data too closely, including noise, and performs poorly
on new data. Ridge and Lasso regression add penalties to prevent this:
- Ridge: Penalizes large coefficients (L2 regularization).
- Lasso: Penalizes coefficients and can set some to zero (L1 regularization).
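Concretely, both methods minimize the usual error term plus a penalty controlled by a strength parameter α (scikit-learn's alpha); ignoring scaling constants that differ between implementations, the objectives look like:
\[
\text{Ridge: } \text{MSE} + \alpha \sum_{j} m_j^2
\qquad
\text{Lasso: } \text{MSE} + \alpha \sum_{j} |m_j|
\]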
Python Example (Ridge)
Using the house price dataset.
# Import libraries
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# House price dataset (a subset of the multiple regression data)
data = {
    'Size': [1400, 1600, 1700, 1875, 1100],
    'Bedrooms': [3, 3, 4, 4, 2],
    'Price': [245, 312, 279, 308, 199]
}
df = pd.DataFrame(data)
X = df[['Size', 'Bedrooms']]
y = df['Price']

# Train Ridge model
ridge = Ridge(alpha=1.0)  # alpha controls penalty strength
ridge.fit(X, y)

# Evaluate
y_pred = ridge.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Coefficients: Size = {ridge.coef_[0]:.2f}, Bedrooms = {ridge.coef_[1]:.2f}")
print(f"Intercept: {ridge.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
Explanation
- Alpha: Higher values shrink coefficients more, reducing overfitting.
- Coefficients: Smaller than in regular regression, making the model more robust.
Try Lasso by replacing `Ridge` with `Lasso` in the code above; Lasso may set some coefficients to exactly zero, effectively selecting features (see the sketch below).
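A minimal Lasso sketch, reusing X and y from the Ridge example above; alpha=1.0 is just an illustrative value to tune for your data.
from sklearn.linear_model import Lasso

# Reuses X and y from the Ridge example above
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)  # with unscaled features you may see a convergence warning; scaling helps

print(f"Coefficients: Size = {lasso.coef_[0]:.2f}, Bedrooms = {lasso.coef_[1]:.2f}")
print(f"Intercept: {lasso.intercept_:.2f}")
# A coefficient printed as 0.00 means Lasso dropped that feature entirely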
---
7. Evaluating Regression Models
Metrics
1. Mean Squared Error (MSE):
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
Lower is better.
2. Root Mean Squared Error (RMSE):
\[
\text{RMSE} = \sqrt{\text{MSE}}
\]
Easier to interpret (same units as target).
3. R² Score:
\[
R^2 = 1 - \frac{\text{Sum of Squared Errors}}{\text{Total Sum of Squares}}
\]
- 0: Model explains none of the variance.
- 1: Perfect fit.
- Negative: Worse than predicting the mean.
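In practice each metric is a one-liner; the actual and predicted values below are made up just to show the calls.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
y_pred = np.array([2.8, 5.3, 6.9, 9.4])   # model predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R² = {r2:.3f}")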
Cross-Validation
Splitting data into training and test sets ensures the model generalizes well. K-Fold Cross-Validation splits
data into k parts, trains on k-1, and tests on the remaining part, repeating k times.
from sklearn.model_selection import cross_val_score

# Use the house price dataset (X, y from earlier); each test fold needs
# at least two samples for R² to be defined, so keep cv small on tiny data
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-Validated R² Scores: {scores}")
print(f"Average R²: {scores.mean():.2f}")
8. Practical Tips and Common Pitfalls
Data Preprocessing
1. Handle Missing Values: Fill with mean/median or drop rows.
df.fillna(df.mean(numeric_only=True), inplace=True)
2. Feature Scaling: Standardize features (mean=0, std=1) for better performance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
3. Encode Categorical Features: Convert categories (e.g., location) to numbers using one-hot encoding.
df = pd.get_dummies(df, columns=['Location'])
Handling Outliers
Outliers can skew the line. Detect them using z-scores or IQR, then remove or transform them.
# Remove outliers using the IQR rule
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['Price'] < (Q1 - 1.5 * IQR)) | (df['Price'] > (Q3 + 1.5 * IQR)))]
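The z-score alternative mentioned above works the same way; a threshold of 3 standard deviations is a common but arbitrary choice.
# Keep only rows whose Price is within 3 standard deviations of the mean
z = (df['Price'] - df['Price'].mean()) / df['Price'].std()
df = df[z.abs() <= 3]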
Multicollinearity
If features are highly correlated (e.g., size and bedrooms), the model becomes unstable. Check with
Variance Inflation Factor (VIF):
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
- VIF > 5–10 indicates multicollinearity. Remove or combine correlated features.
---
9. Real-World Example: House Price Prediction
Let’s build an end-to-end project using a realistic dataset.
Python Code
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Step 1: Create a realistic dataset (or load from CSV)
data = {
    'Size': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'Bedrooms': [3, 3, 4, 4, 2, 3, 5, 4, 3, 3],
    'Age': [5, 10, 2, 8, 20, 15, 3, 7, 12, 4],
    'Price': [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
}
df = pd.DataFrame(data)

# Step 2: Preprocess data
# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Features and target
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Step 4: Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Predict and evaluate
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Metrics
print("Training MSE:", mean_squared_error(y_train, y_pred_train))
print("Test MSE:", mean_squared_error(y_test, y_pred_test))
print("Training R²:", r2_score(y_train, y_pred_train))
print("Test R²:", r2_score(y_test, y_pred_test))

# Step 6: Visualize predictions vs actuals (test set)
plt.scatter(y_test, y_pred_test, color='blue', label='Predicted vs Actual')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Perfect Fit')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('House Price Prediction: Actual vs Predicted')
plt.legend()
plt.show()

# Step 7: Predict for a new house
new_house = scaler.transform(np.array([[1500, 3, 5]]))
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sq ft, 3 bedrooms, 5 years old: ${predicted_price[0]:.2f} thousand")
Explanation
- Preprocessing: Checks for missing values and scales features.
- Train-Test Split: 80% training, 20% testing to evaluate generalization.
- Evaluation: Compares MSE and R² for training and test sets.
- Visualization: Scatter plot shows how close predictions are to actuals.
- Prediction: Scales new data before predicting.
Post your answers to the questions and ask if you have any doubts.