House Price Prediction Pipeline

Data Science

An end-to-end Machine Learning project to predict residential real estate prices using a regression-based pipeline. This project covers the full lifecycle from data cleaning and outlier removal to comparative analysis between Linear and Tree-based models.

Project Overview

The goal of this project is to build a reliable predictive model for house prices using the Washington State dataset. We addressed common real-world data issues such as skewed distributions, outliers, and categorical encoding to find the most accurate estimator.

Tech Stack

  • Language: Python 3.x
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn
  • Models Tested: Linear (Ridge) Regression, Decision Tree, Random Forest

The Pipeline

1. Data Cleaning & Preprocessing

  • Outlier Removal: Applied the IQR (Interquartile Range) method to filter out extreme price outliers that skew model learning.
  • Target Analysis: Identified that house prices were heavily right-skewed.
  • Log Transformation: Applied np.log1p to the target variable (price) to correct the skewness and bring the distribution closer to normal for better regression performance.
  • Feature Engineering:
    * Created is_renovated as a binary flag.
    * Dropped redundant columns (sqft_above, sqft_basement) to prevent multicollinearity.
    * Removed low-variance or high-cardinality noise (street, country, date).
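The cleaning steps above can be sketched as follows. The remove_price_outliers helper and the toy prices are illustrative assumptions, not the project's actual code:

```python
import numpy as np
import pandas as pd

def remove_price_outliers(df: pd.DataFrame, col: str = "price") -> pd.DataFrame:
    """Keep only rows within 1.5 * IQR of `col` (hypothetical helper)."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[df[col].between(lower, upper)]

# Toy data: one extreme price that the IQR filter should drop.
df = pd.DataFrame({"price": [200_000, 350_000, 410_000, 5_000_000]})
df = remove_price_outliers(df)           # the $5M row falls outside the fences
df["log_price"] = np.log1p(df["price"])  # compress the right tail of the target
```

Note that log1p is used instead of a plain log so that a zero price would not produce -inf; predictions are mapped back to dollars with np.expm1.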

2. Preprocessing Pipeline

We implemented a ColumnTransformer to automate the preprocessing:

  • Numerical Features: Scaled using StandardScaler (mean = 0, std = 1).
  • Categorical Features: Transformed city and statezip using OneHotEncoder to handle 100+ unique locations.
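A minimal sketch of that ColumnTransformer. The exact column lists are assumptions based on the features described above, and the two-row demo frame is purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assumed feature lists; the real project may use more columns.
numeric_features = ["sqft_living", "bedrooms", "bathrooms", "is_renovated"]
categorical_features = ["city", "statezip"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        # handle_unknown="ignore" keeps unseen cities from crashing inference
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Tiny demo frame: 4 scaled numeric columns + one-hot columns per category.
df = pd.DataFrame({
    "sqft_living": [1500, 2400], "bedrooms": [3, 4], "bathrooms": [2, 3],
    "is_renovated": [0, 1], "city": ["Seattle", "Bellevue"],
    "statezip": ["WA 98103", "WA 98004"],
})
X = preprocessor.fit_transform(df)
```

Wrapping both steps in one transformer means the identical scaling and encoding are applied at train and predict time, avoiding leakage from fitting the scaler on test data.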

3. Model Comparison & Metrics

We split the data into 80% training and 20% testing sets.

  Model            R² Score   Mean Absolute Error (MAE)   Performance
  Linear (Ridge)   0.78       ~$73,000                    Best Baseline
  Random Forest    0.72       ~$78,000                    Good (Low Variance)
  Decision Tree    0.46       ~$107,000                   Poor (Overfitting)
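The comparison loop can be sketched as below. The synthetic feature matrix is a stand-in assumption (the real project uses the preprocessed housing features), so the scores it produces will not match the table above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-in for the preprocessed feature matrix and log-price target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=1000)

# 80/20 split, as in the project.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (r2_score(y_test, pred), mean_absolute_error(y_test, pred))
```

Because the target was log-transformed, MAE in dollars requires converting predictions back with np.expm1 before computing the metric.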

Critical Thinking: Why These Results?

  • Random Forest vs. Decision Tree: A single Decision Tree overfitted the training data by memorizing specific house details. The Random Forest used an Ensemble approach (Bagging), averaging 100 trees to cancel out noise and significantly improve the R² score.

  • Linear Model Success: Because one-hot encoding the zip codes produced a large number of sparse features, the Linear Ridge model handled the high-dimensional data more efficiently than the tree-based models.


Future Improvements

  • Implement gradient boosting (XGBoost or LightGBM) to improve performance.
  • Hyperparameter tuning using GridSearchCV.
  • Feature Importance analysis to identify the top price drivers.
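As a starting point for the tuning item above, a GridSearchCV sketch for the Random Forest. The parameter grid and the small synthetic training set are assumptions chosen to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 5))
y_train = X_train[:, 0] * 3 + rng.normal(scale=0.1, size=120)

# Assumed grid; widen the ranges for a real search.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # matches the MAE metric reported above
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best = search.best_params_  # the lowest-MAE combination across the folds
```

The fitted forest's feature_importances_ attribute can then serve the last item on the list, ranking the top price drivers.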

Author

Suraj Singh

Data Science | Machine Learning