House Price Prediction King County
This article is about my project in R as part of my Master’s course to predict the house prices in King County, USA.
King County is located in the U.S. state of Washington. The population was 2,274,315 in the 2020 census estimate, making it the most populous county in Washington, and the 12th-most populous in the United States. The county seat is Seattle, also the state’s most populous city.
We took the dataset from Kaggle. It contains a large number of records, each describing a house in King County through a set of features along with its sale price. Through this project, we aim to build several predictive models and compare them to choose the one that predicts house prices most accurately. We also aim to identify the features of a house that play the largest role in determining its price.
BUSINESS UNDERSTANDING
The industries that stand to benefit most from this data are real estate, architecture, and building and construction companies whose primary focus is investing in houses and/or renovating and selling them.
The firm we are primarily focusing on is Coldwell Banker, a real estate firm that focuses on educating its clients on all matters of buying a home and helping them through the purchase. Alongside residential brokerage, it also provides commercial real estate, corporate brokerage, relocation, and concierge services. We aim to help the firm predict the prices that would be most suitable for its customers.
Coldwell Banker was established in San Francisco, CA in 1906. It has approximately 3,000 offices in 49 countries and territories. The founders of this real estate franchise were Colbert Coldwell and Arthur Banker. One of the largest services offered by Coldwell Banker is residential brokerage, whose current Chief Executive Officer is M. Ryan Gorman. Coldwell Banker’s revenue in residential services is around 1 billion dollars, with about 55,000 employees. The headquarters for this line of business is located in New Jersey, United States.
Besides the residential sector, Coldwell Banker also provides corporate, commercial, relocation, and concierge services. In relocation and concierge services, it has revenue of about 5 million with 21 employees, and the headquarters is located in Ontario, Canada. Its commercial services subsidiary has revenue of around 100 million to 500 million with 10,000 employees, and the headquarters is located in New Jersey, United States. The line of business we are focusing on is the residential service offered by Coldwell Banker.
SWOT ANALYSIS:
Strengths:
1. Valuation for different capacities (low range, moderate range, higher range, luxury apartments, and houses).
2. High resources and property available.
3. Franchising across the globe in a residential estate.
4. Great social media presence (very important in a technology-driven world).
5. Pros of a local franchise: the company understands the local needs of people while offering the security of a global company.
6. In business since 1906 (more credibility).
7. Has 55k employees.
8. Multiple services offered apart from residential real estate.
Weaknesses:
1. Difficult to roll out changes across all the franchises around the world.
2. Rules and policies differ from place to place, so no single rule can apply to all.
3. With the brand’s reputation at stake, maintaining all the teams across the globe is a difficult task.
4. Less presence in the lower-income segment of the real estate market.
Opportunities:
1. Bettering the business by implementing new technologies such as smart homes and virtual reality tours.
2. International residential real estate strategy (rules that are beneficial in particular countries can be used to the company’s advantage, since it is so widespread).
3. Strategic residential real estate visibility and brand development
4. Even if the market slows down, it seems to be a never-ending business.
5. Due to its wide reach, learnings about real-estate strategy from different places.
Threats:
1. Higher levels of risks.
2. Climate change could affect residential real estate.
3. Changes in potential buyers’ preferences.
4. Policies and rules changing from time to time.
5. Politics and tax rules can affect real-estate business.
6. Fluctuating market and economy
7. High competition
DATA ANALYSIS AND UNDERSTANDING (Data description)
Our data comes from Kaggle. The dataset consists of 21,613 records of houses with 21 variables/features pertaining to each house. The following fields are listed in the dataset.
- id — Unique ID for each home sold
- date — The date of the day the house was sold.
- price — Price of each house sold.
- bedrooms — Number of bedrooms in the house
- bathrooms — Number of bathrooms, where .5 accounts for a room with a toilet but no shower.
- sqft_living — Area (Sq ft) of the interior living space/room of the house.
- sqft_lot — Area (Sq ft) of the entire lot.
- floors — Number of floors
- waterfront — A dummy variable for whether the property was overlooking a waterfront or not
- condition — An index from 1 to 5 on the condition of the property.
- view — An index from 0 to 4 of how good the view of the property was.
- grade — An index from 1 to 13, where 1–3 falls short of building construction and design, 7 has an average level of construction and design, and 11–13 has a high-quality level of construction and design.
- sqft_above — The square footage of the interior housing space that is above ground level.
- sqft_basement — The square footage of the interior housing space that is below ground level.
- yr_built — The year the house was initially built
- yr_renovated — The year of the house’s last renovation.
- zipcode — What zip code area the house is in
- lat — Latitude
- long — Longitude
- sqft_living15 — The square footage of interior housing living space for the nearest 15 neighbors.
- sqft_lot15 — The square footage of the land lots of the nearest 15 neighbors.
The data covers a span of one year, from 2nd May 2014 to 2nd May 2015. In this dataset, our target variable of interest is the price of the house.
DATA ANALYSIS AND UNDERSTANDING — SUMMARY OF THE DATA SET(EDA)
Outliers, missing values, and transformations
- Before beginning the analysis, we found an outlier: a house with 33 bedrooms in a space too small to practically fit them. There was also another outlier, a house with 11 bedrooms that did not fit its area well either.
- After running the regression, all three models showed a notably high error between the predicted and actual values in the 21st bin. To get better RMSE values, we omitted the outliers and then performed the analysis again.
- The dataset did not contain any missing values.
- The analysis was performed on raw untransformed data.
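The bedroom outliers described above can be filtered out with a simple rule. The project itself was done in R, so the following Python sketch is only an illustration: the rows are hypothetical, and the cutoff of 10 bedrooms is an assumption chosen so that the 11-bedroom and 33-bedroom houses are dropped, not a value taken from the original analysis.

```python
# Hypothetical rows; only the fields needed for the illustration are shown.
houses = [
    {"id": 1, "bedrooms": 3, "sqft_living": 1800},
    {"id": 2, "bedrooms": 33, "sqft_living": 1600},
    {"id": 3, "bedrooms": 11, "sqft_living": 3000},
    {"id": 4, "bedrooms": 4, "sqft_living": 2400},
]


def drop_bedroom_outliers(rows, max_bedrooms=10):
    """Keep rows whose bedroom count does not exceed max_bedrooms (assumed cutoff)."""
    return [row for row in rows if row["bedrooms"] <= max_bedrooms]


cleaned = drop_bedroom_outliers(houses)
print([row["id"] for row in cleaned])
```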
Summary Statistics.
We performed exploratory data analysis on our dataset, which produced the following results.
CORRELATION MATRIX OF THE PLOT
There is a strong correlation between price and most of the variables in question, especially between price and sqft_living.
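Each cell of a correlation matrix like this one is a pairwise correlation coefficient. As a minimal sketch (in Python for illustration, although the project used R, and with made-up numbers instead of the Kaggle data), the Pearson correlation between two columns can be computed as follows.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, not taken from the dataset.
sqft_living = [1000, 1500, 2000, 2500, 3000]
price = [250000, 340000, 410000, 560000, 600000]
print(pearson(sqft_living, price))
```

A value close to 1 indicates a strong positive linear relationship, and a value close to -1 a strong negative one.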
HOUSE PRICE VS BEDROOMS (BEFORE REMOVING OUTLIERS)
HOUSE PRICE VS BEDROOMS (AFTER REMOVING OUTLIERS)
These graphs show that houses with 9 bedrooms typically cost more on average than the others.
HOUSE PRICE VS BATHROOMS
This graph shows that a house with 8 bathrooms, for example, clearly has a higher price, at about 5 million.
SQFT_LIVING VS PRICE
WATERFRONT VS HOUSE PRICE
The sqft_living vs price plot shows a positive, roughly linear relationship: as the living area increases, the price of the house increases. Meanwhile, the box plot shows that a house with a waterfront typically costs more.
FLOORS VS PRICE and CONDITION VS HOUSE PRICE
These box plots compare prices across the number of floors and the condition of the house. Typically, houses with 2.5 floors sold at a higher price than the rest. Meanwhile, a house in condition ‘5’ costs more than one in a lower condition.
VIEW VS HOUSE PRICE and GRADE VS HOUSE PRICE
These box plots show that a house with a view value of 4 typically costs more than the others, and that houses with a grade of 13 cost the most.
SQFT_ABOVE VS HOUSE PRICE and SQFT_BASEMENT VS HOUSE PRICE
Both these graphs depict a roughly linear relationship with price: as the above-ground area (sqft_above) or the basement area (sqft_basement) of a house increases, its price tends to increase for the most part. We say for the most part because some houses with 0 sq ft of basement area have higher prices than others, which could be due to other factors such as renovation.
YR_BUILT VS HOUSE PRICE and SQFT_LIVING 15 VS HOUSE PRICE
As sqft_living15, the living space of the nearest 15 houses, increases, the price also increases. The age of a house tends to be negatively related to its price: the older the house, the lower its price, due to various conditions. However, some older houses get renovated and are sold at a high price.
Data Visualization
We observe that houses near the waterfront have higher prices.
DATA MODELING
Our main goal with the model is to accurately predict the prices of houses in King County given the features of the house. With this model, we also aim to find the top features that are most useful when predicting the price.
Our target variable in this dataset is the Price of the house. We plan to divide the dataset into Training data and Validation data to compare the accuracy of the prediction across various models.
Model building and evaluation
The model we plan to use is the multiple linear regression model. The dependent variable in our model is “Price”, whereas the independent variables are the remaining features of the house.
The 3 regression models we have chosen are
1. MODEL 1: Multiple linear regression model including all variables.
In this model, we ran multiple linear regression on all variables excluding id, date, lat, long, and zipcode. The data set was divided into 70% training data and 30% validation data, on which the model was tested.
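A 70/30 split like the one described here can be sketched as below. This is an illustrative Python version rather than the R code used in the project; the function name, the shuffling step, and the fixed seed are assumptions made for reproducibility.

```python
import random


def train_valid_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and split them into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 21,613 house records.
train, valid = train_valid_split(range(21613))
print(len(train), len(valid))
```

With 21,613 records, this gives roughly 15,129 training rows and 6,484 validation rows.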
Upon running this model, we obtain an adjusted R squared value of 65.85% and a multiple R squared value of 65.88%. After running predictions with the model, we get an RMSE of 227088.7. The RMSE is the root mean square error between the predicted and the actual prices. Moreover, the MAPE, or mean absolute percentage error, came to 29%.
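For reference, the two error metrics quoted above can be computed as in the following sketch. It is written in Python for illustration (the project used R), and the prices are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Hypothetical prices, not taken from the dataset.
actual_prices = [200000, 400000]
predicted_prices = [220000, 360000]
print(rmse(actual_prices, predicted_prices), mape(actual_prices, predicted_prices))
```

RMSE is expressed in the same unit as the price and penalizes large misses heavily, while MAPE expresses the typical miss relative to the actual price.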
Next, we plotted actual vs predicted prices to study the model and noticed a considerable difference between the two in the 21st bin.
When the predicted and actual prices were exported and examined, the error in the 21st bin stood out: the 1st bin showed about 27% error, but the 21st bin had about 41% error, which was very high.
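A per-bin error comparison of this kind can be sketched as follows. The article does not say how the bins were formed, so this Python illustration assumes that the records are sorted by actual price and cut into equal-sized bins; the function name and the sample values are hypothetical.

```python
def mape_by_bin(actual, predicted, n_bins):
    """Sort records by actual price, split them into n_bins groups, return MAPE per group."""
    pairs = sorted(zip(actual, predicted), key=lambda pair: pair[0])
    bins = [[] for _ in range(n_bins)]
    for i, (a, p) in enumerate(pairs):
        # Integer arithmetic maps position i to a bin index between 0 and n_bins - 1.
        bins[i * n_bins // len(pairs)].append(abs((a - p) / a))
    return [100 * sum(errs) / len(errs) if errs else None for errs in bins]


# Hypothetical values: the most expensive house is predicted poorly.
actual_prices = [100, 200, 300, 400]
predicted_prices = [110, 180, 300, 200]
print(mape_by_bin(actual_prices, predicted_prices, n_bins=2))
```

A last bin with a much larger error than the first, as reported here, suggests that the model struggles with the most expensive houses.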
To try to improve the model, we omitted the outliers and ran it again. This time it produced fairly better results.
Adjusted R square: 57.9%
RMSE: 135275.3
MAPE: 25.9%
We notice that RMSE decreases by almost 92k and MAPE by about 3 percentage points.
We were able to predict prices below 1 million fairly well.
2. MODEL 2: Multiple linear regression model using exhaustive search- Best 8 features.
In this model, we ran multiple linear regression using an exhaustive search and chose the best 8 features. id, date, lat, long, and zipcode were excluded, and the data set was divided into 70% training data and 30% validation data.
The best 8 features that we chose were view, grade, sqft_living, yr_built, bathrooms, waterfront, bedrooms, and sqft_living15.
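The idea behind an exhaustive search can be illustrated with a small brute-force sketch. This is not the R routine used in the project: it is a pure-Python illustration that fits ordinary least squares on every feature subset of a given size and ranks the subsets by residual sum of squares, which is an assumed selection criterion. The toy data and all function names are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_rss(rows, target, features):
    """Fit OLS with an intercept on the given features; return the residual sum of squares."""
    xs = [[1.0] + [row[f] for f in features] for row in rows]
    ys = [row[target] for row in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((y - sum(b * v for b, v in zip(beta, x))) ** 2 for x, y in zip(xs, ys))


def best_subset(rows, target, candidates, size):
    """Try every subset of the given size and return the one with the lowest RSS."""
    return min(combinations(candidates, size), key=lambda fs: fit_rss(rows, target, fs))


# Toy data: price depends on a and b only, c is irrelevant.
toy_rows = [
    {"a": a, "b": b, "c": c, "price": 10 + 2 * a + 3 * b}
    for a, b, c in zip([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [5, 3, 6, 1, 2, 4])
]
print(best_subset(toy_rows, "price", ["a", "b", "c"], size=2))
```

With 15 candidate predictors and subsets of size 8, such a search has to fit 6,435 models, which is why dedicated routines are normally used instead of a naive loop.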
Upon running this model, we got an adjusted R square value of 65.62%
RMSE: 236163.4
MAPE: 29.45%
The model with fewer features predicted almost as well as the one considering everything.
After plotting predicted vs actual, we get similar results, where the 21st bin shows a huge difference in error. To correct this, we omit the outlier values and run the model again.
Upon doing that, we get a reduced RMSE and MAPE:
RMSE: 136230.7
MAPE: 26.18%
Adjusted R square: 57.37%
Both the graph and the result of the exhaustive search change once the outliers are removed.
The best 8 features are different this time: view, grade, floors, sqft_living15, yr_built, sqft_living, condition, and bathrooms.
3. MODEL 3: Multiple linear regression model running on significant variables.
In this model, we ran multiple linear regression on the significant variables only. id, date, lat, long, and zipcode were excluded, and the data set was divided into 70% training data and 30% validation data. Around 11 features turned out to be significant.
Upon running this model, we got an adjusted R square value of 65.93%
RMSE: 234939.9
MAPE: 29.11%
The model with only the significant features predicts about as well as the full, potentially overfit, model.
After plotting predicted vs actual, we get similar results, where the 21st bin shows a huge difference in error. To correct this, we omit the outlier values and run the model again.
Upon doing that, we get a reduced RMSE and MAPE:
- RMSE: 135862.4
- MAPE: 26.07%
- Adjusted R square: 57.69%
- The graph changes to reflect the values mentioned above.
The chosen Model
The final model that we chose was the best 8 features model after outliers were omitted.
Description:
RMSE: 136230.7
MAPE: 26.18%
Adjusted R square: 57.37%
After analyzing all models, we can conclude that the best 8 features model achieves approximately the same adjusted R squared value as the models where all variables or only the significant ones are considered. The advantage is that we need to consider only those 8 variables rather than all of them. This helps cut down time while still predicting prices well. R2 measures goodness of fit, or how well the model captured the training data. RMSE and MAPE help in analyzing predictive performance and in comparing the models with each other.
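The adjusted R squared values compared here penalize plain R squared for the number of predictors in the model. As a small Python sketch (the project used R), the standard formula can be applied to the figures reported for Model 1. The inputs are assumptions: 15 predictors (the 21 columns minus price, id, date, lat, long, and zipcode) and about 15,129 training rows (70% of 21,613).

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R squared for a model with an intercept and n_predictors predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Multiple R squared of Model 1 as reported in the article; n and p are assumed.
print(round(adjusted_r_squared(0.6588, n_obs=15129, n_predictors=15), 4))
```

Under these assumptions the result is roughly 65.85%, in line with the adjusted value reported for Model 1. With this many observations the penalty is tiny, which is why the adjusted and multiple R squared values are almost identical.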
Limitations of the model include that it predicts prices well only within a certain range of values. About 5–6% of the data was omitted to improve model performance. Very expensive homes tend to follow a different set of pricing rules than moderately priced houses.
MODEL EQUATION
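The final model equation can be written as Y = β0 + β1x1 + β2x2 + … + βpxp, with x1 to x8 standing for view, grade, floors, sqft_living15, yr_built, sqft_living, condition, and bathrooms, where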
β0 = 4.675* 10^(6)
β1 = 2.626* 10^(4)
β2= 9.189*10^(4)
β3= 4.492* 10^(4)
β4= 5.056* 10^(4)
β5= -2.682* 10^(3)
β6= 5.518* 10^(1)
β7= 1.806* 10^(4)
β8=2.760*10^(4)
The final model regression equation = 4.675*10^(6) + (2.626*10^(4)*view) + (9.189*10^(4)*grade) + (4.492*10^(4)*floors) + (5.056*10^(4)*sqft_living15) - (2.682*10^(3)*yr_built) + (5.518*10^(1)*sqft_living) + (1.806*10^(4)*condition) + (2.760*10^(4)*bathrooms)
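To show how such an equation is read, the sketch below evaluates it in Python (for illustration only, since the project used R). The coefficients are copied as reported in the list above and are not re-verified here; the function names are hypothetical.

```python
# Coefficients β0 to β8 as reported above, in the order of the written equation.
INTERCEPT = 4.675e6
COEFFICIENTS = {
    "view": 2.626e4,
    "grade": 9.189e4,
    "floors": 4.492e4,
    "sqft_living15": 5.056e4,
    "yr_built": -2.682e3,
    "sqft_living": 5.518e1,
    "condition": 1.806e4,
    "bathrooms": 2.760e4,
}


def predict_price(house):
    """Evaluate the linear equation for one house given as a feature dictionary."""
    return INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())


def effect_of_change(feature, delta):
    """Change in predicted price when one feature moves by delta, others held fixed."""
    return COEFFICIENTS[feature] * delta


print(effect_of_change("grade", 1), effect_of_change("yr_built", 1))
```

Under these coefficients, one extra grade point raises the predicted price by 91,890, while each additional year in yr_built lowers it by 2,682, with all other features held fixed.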
RECOMMENDATIONS
Our approach with this model can be used by Coldwell Banker’s residential business to identify the main features that play a vital role in determining the price of a house. In our analysis, we conclude that view, grade, floors, condition, the year the house was built, the square footage of the living space, the square footage of the living space of the nearest 15 neighbors, and bathrooms play a vital role in determining the price. Instead of having to go through all 20 variables, the firm can attain similar results using just these features. This helps with faster prediction.
In the future, the model will be useful for classifying prospective customers into categories by house price range (high range or modest range) based on the customer’s budget. This in turn can be used for creating attractive pricing offers along with providing mortgage offers from partner banks.
The target firm can further provide consultation to builders and architects on designing houses around the prominent features, and give a selling price estimate for a house/lot based on features like view, grade, and floors from the model.
The firm can use this model to find out the feasibility of a greenfield residential real estate project in the future.