House Price Prediction Pipeline

Data Science

An end-to-end Machine Learning project to predict residential real estate prices using a regression-based pipeline. This project covers the full lifecycle from data cleaning and outlier removal to comparative analysis between Linear and Tree-based models.

Project Overview

The goal of this project is to build a reliable predictive model for house prices using the Washington State dataset. We addressed common real-world data issues such as skewed distributions, outliers, and categorical encoding to find the most accurate estimator.

Tech Stack

  • Language: Python 3.x
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn
  • Models Tested: Linear (Ridge) Regression, Decision Tree, Random Forest

The Pipeline

1. Data Cleaning & Preprocessing

  • Outlier Removal: Applied the IQR (Interquartile Range) method to filter out extreme price outliers that skew model learning.
  • Target Analysis: Identified that house prices were heavily right-skewed.
  • Log Transformation: Applied np.log1p to the target variable (price) to correct the skewness and bring the distribution closer to normal for better regression performance.
  • Feature Engineering:
    * Created is_renovated as a binary flag.
    * Dropped redundant columns (sqft_above, sqft_basement) to prevent multicollinearity.
    * Removed low-variance or high-cardinality noise (street, country, date).
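The cleaning steps above can be sketched as follows. The remove_price_outliers helper and the toy prices are illustrative assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

def remove_price_outliers(df: pd.DataFrame, col: str = "price") -> pd.DataFrame:
    """Keep only rows within 1.5 * IQR of `col` (hypothetical helper)."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[df[col].between(lower, upper)]

# Toy data: one extreme price that the IQR filter should drop.
df = pd.DataFrame({"price": [200_000, 350_000, 410_000, 5_000_000]})
df = remove_price_outliers(df)           # the $5M row falls outside the fences
df["log_price"] = np.log1p(df["price"])  # compress the right tail of the target
```

Note that log1p is used instead of a plain log so that a zero price would not produce -inf; predictions are mapped back to dollars with np.expm1.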

2. Preprocessing Pipeline

We implemented a ColumnTransformer to automate the preprocessing:

  • Numerical Features: Scaled using StandardScaler (mean = 0, std = 1).
  • Categorical Features: Transformed city and statezip using OneHotEncoder to handle 100+ unique locations.
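A minimal sketch of that ColumnTransformer. The exact column lists are assumptions based on the features described above, and the two-row demo frame is purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assumed feature lists; the real project may use more columns.
numeric_features = ["sqft_living", "bedrooms", "bathrooms", "is_renovated"]
categorical_features = ["city", "statezip"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        # handle_unknown="ignore" keeps unseen cities from crashing inference
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Tiny demo frame: 4 scaled numeric columns + one-hot columns per category.
df = pd.DataFrame({
    "sqft_living": [1500, 2400], "bedrooms": [3, 4], "bathrooms": [2, 3],
    "is_renovated": [0, 1], "city": ["Seattle", "Bellevue"],
    "statezip": ["WA 98103", "WA 98004"],
})
X = preprocessor.fit_transform(df)
```

Wrapping both steps in one transformer means the identical scaling and encoding are applied at train and predict time, avoiding leakage from fitting the scaler on test data.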

3. Model Comparison & Metrics

We split the data into 80% training and 20% testing sets.

  Model            R² Score   Mean Absolute Error (MAE)   Performance
  Linear (Ridge)   0.78       ~$73,000                    Best Baseline
  Random Forest    0.72       ~$78,000                    Good (Low Variance)
  Decision Tree    0.46       ~$107,000                   Poor (Overfitting)
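The comparison loop can be sketched as below. The synthetic feature matrix is a stand-in assumption (the real project uses the preprocessed housing features), so the scores it produces will not match the table above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-in for the preprocessed feature matrix and log-price target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=1000)

# 80/20 split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (r2_score(y_test, pred), mean_absolute_error(y_test, pred))
```

Because the target was log-transformed, MAE in dollars requires converting predictions back with np.expm1 before computing the metric.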

Critical Thinking: Why These Results?

  • Random Forest vs. Decision Tree: A single Decision Tree overfitted the training data by memorizing specific house details. The Random Forest used an Ensemble approach (Bagging), averaging 100 trees to cancel out noise and significantly improve the R² score.

  • Linear Model Success: Because one-hot encoding the zip codes produced a large number of sparse features, the Linear Ridge model handled the high-dimensional data more efficiently than the tree-based models.


Future Improvements

  • Implement gradient boosting (XGBoost or LightGBM) to improve performance.
  • Hyperparameter tuning using GridSearchCV.
  • Feature Importance analysis to identify the top price drivers.
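As a starting point for the tuning item above, a GridSearchCV sketch for the Random Forest. The parameter grid and the small synthetic training set are assumptions chosen to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 5))
y_train = X_train[:, 0] * 3 + rng.normal(scale=0.1, size=120)

# Assumed grid; widen the ranges for a real search.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # matches the MAE metric reported above
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best = search.best_params_  # the lowest-MAE combination across the folds
```

The fitted forest's feature_importances_ attribute can then serve the last item on the list, ranking the top price drivers.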

Author

Suraj Singh

Data Science | Machine Learning