date: 2026-03-03
๐ House Price Prediction using Linear & Ridge Regression
๐ Project Overview
This project demonstrates a complete end-to-end Machine Learning workflow for predicting house prices using Linear Regression and Ridge Regression.
The goal was not just to train a model, but to build a clean and professional ML pipeline including:
- Proper data cleaning
- Handling extreme outliers
- Avoiding data leakage
- Feature preprocessing using Pipeline & ColumnTransformer
- Model comparison (Linear vs Ridge)
- Alpha tuning for Ridge Regression
- Prediction on unseen data
๐ Dataset
The dataset contains approximately 4600 house records with features such as:
- Bedrooms
- Bathrooms
- Square footage (living & lot)
- Floors
- Waterfront
- View
- Condition
- Year built / renovated
- City
- Price (Target Variable)
๐งน Data Cleaning
The following preprocessing steps were performed:
Dropped unnecessary and high-cardinality columns:
datecountrystatezipstreet
Removed extreme outliers using 99th percentile filtering:
upper_limit = df["price"].quantile(0.99) df = df[df["price"] <= upper_limit]
This stabilized the model performance significantly.
๐ Train-Test Split
Data leakage was avoided by splitting the dataset before preprocessing:
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
๐๏ธ Preprocessing Pipeline
Used:
SimpleImputer(median for numeric, most_frequent for categorical)StandardScaler(important for Ridge regularization)OneHotEncoder(handle_unknown="ignore")ColumnTransformerPipeline
This ensures clean, reproducible, production-style ML workflow.
๐ Model Performance
๐น Linear Regression
- Rยฒ Score โ 0.65
- RMSE โ 171,000
๐น Ridge Regression (Best Alpha = 0.01)
- Rยฒ Score โ 0.65
- RMSE โ 171,000
After proper cleaning, Linear Regression performed almost as well as Ridge, showing that the dataset became well-conditioned after outlier handling.
๐ฏ Key Insights
- Linear Regression is sensitive to extreme outliers.
- Removing top 1% price values significantly improved stability.
- Ridge regularization helps when multicollinearity or large coefficients exist.
- Proper preprocessing is more important than blindly switching models.
- Avoiding data leakage is critical for trustworthy evaluation.
๐ Unseen House Prediction Example
new_house = pd.DataFrame({
"bedrooms": [3],
"bathrooms": [2],
"sqft_living": [1800],
"sqft_lot": [4000],
"floors": [1],
"waterfront": [0],
"view": [0],
"condition": [3],
"sqft_above": [1500],
"sqft_basement": [300],
"yr_built": [2005],
"yr_renovated": [0],
"city": ["Seattle"]
})
Predicted Price โ \$455,000
๐ Tech Stack
- Python
- Pandas
- NumPy
- Scikit-Learn
- Matplotlib
๐ง What This Project Demonstrates
- Strong understanding of regression fundamentals
- Regularization concepts (L2 Penalty)
- Model evaluation (Rยฒ & RMSE)
- Clean ML pipeline architecture
- Real-world prediction workflow
๐ Conclusion
This project emphasizes building models the right way --- clean data, proper validation, and structured experimentation.
Instead of chasing unrealistic accuracy, the focus was on building a stable and interpretable regression pipeline.