In this project, I use multiple linear regression modeling to analyze house sales in a northwestern county.
Real estate company Royal Homes wants to better understand the King County housing market before finalizing its business strategy. This analysis addresses two questions: What types of homes should the company sell to make the most profit? What features are associated with higher sale prices?
This project uses the King County House Sales dataset, which can be found in kc_house_data.csv in the data folder of this GitHub repository.
The data contains the following columns:
- 'bedrooms': number of bedrooms
- 'bathrooms': number of bathrooms
- 'floors': number of floors (levels) in the house
- 'waterfront': whether the house is on the waterfront
- 'view': quality of the view from the house
- 'condition': overall condition of the house; related to how well the house has been maintained
- 'grade': overall grade of the house; related to the construction and design of the house
- 'zipcode': zip code
- 'sqft_living': square footage of living space
- 'sqft_lot': square footage of the lot
- 'sqft_above': square footage of the house apart from the basement
- 'yr_built': year house was built
- 'sqft_living15': square footage of interior living space for the nearest 15 neighbors
- 'sqft_lot15': square footage of the land lots of the nearest 15 neighbors
- 'mth_sold': month house was sold
- 'yr_sold': year house was sold
- 'renovated': whether a house had a year populated in the yr_renovated column
After three iterations, the final model includes log-transformed sqft_living and categorical (dummy) variables for condition, view, zipcode, and waterfront. The dependent variable, price, is also log-transformed.
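As a rough illustration of the transformations described above, the sketch below builds the model inputs for a single house by hand: it takes the natural log of sqft_living and price and expands a categorical column into 0/1 dummy variables. The helper names, the sample house, the list of condition levels, and the choice to drop the first level as the baseline are all assumptions made for illustration; this is not the project's actual preprocessing code.

```python
import math

def dummy_encode(value, levels, prefix):
    """Return 0/1 indicators for every level except the first (the assumed baseline)."""
    return {f"{prefix}_{level}": int(value == level) for level in levels[1:]}

def build_features(row, condition_levels):
    """Log-transform sqft_living and dummy-encode condition for one house."""
    features = {"log_sqft_living": math.log(row["sqft_living"])}
    features.update(dummy_encode(row["condition"], condition_levels, "condition"))
    return features

# Hypothetical house and hypothetical condition levels, for illustration only.
house = {"sqft_living": 2000, "condition": 4, "price": 500000}
levels = [1, 2, 3, 4, 5]

x = build_features(house, levels)
y = math.log(house["price"])  # the dependent variable is log(price)
```

Dropping one level per categorical variable keeps the dummy columns from being perfectly collinear with the intercept, so each remaining coefficient is read relative to the dropped level.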
While it is not perfect, the final model shows a marked improvement in normality of residuals, homoscedasticity, and linearity while maintaining an R^2 similar to that of the second model. About 85 percent of the variation in log-transformed price is explained by the model. The skew of 0.053 is the closest to 0 of the three models, and the kurtosis of 4.86 is the lowest of the three. The final model also passes multicollinearity checks.
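For readers who want to see what the skew and kurtosis figures measure, here is a minimal sketch. It assumes they are moment-based statistics of the model residuals, with kurtosis on the scale where a normal distribution scores 3 (so values above 3 indicate heavier tails). The function names and the placeholder residuals are hypothetical and are not taken from the project.

```python
def central_moment(values, k):
    """k-th central moment of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth standardized moment (not excess); a normal distribution gives 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Placeholder residuals, NOT the project's residuals.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
sample_skew = skewness(residuals)
sample_kurtosis = kurtosis(residuals)
```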
In the final model, a house's zip code is highly influential on its sale price, judging by the magnitude of the coefficients of several zip codes, all of which are statistically significant. For example, a house in zip code 98039 is associated with a natural log of sale price that is 1.50 higher relative to the reference zip code, or a sale price about 4.48 times as high (exp(1.50) ≈ 4.48).
Square footage of a house's living space is also positively associated with sale price. Because both sqft_living and price are log-transformed, the 0.68 coefficient means that a 1 percent increase in sqft_living is associated with a price increase of about 0.68 percent.
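The "1 percent gives 0.68 percent" reading is an approximation that holds for small changes. A short sketch of the exact log-log calculation follows; the function name is hypothetical, and only the 0.68 coefficient comes from the results above.

```python
def pct_change_in_price(coef, pct_change_in_x):
    """Exact percent change in price implied by a log-log coefficient."""
    ratio = (1 + pct_change_in_x / 100) ** coef
    return (ratio - 1) * 100

small = pct_change_in_price(0.68, 1)   # roughly 0.68 percent
large = pct_change_in_price(0.68, 10)  # roughly 6.7 percent, a bit under 10 * 0.68
```

For larger changes in living space the exact figure drifts below the simple "coefficient times percent change" rule, so the shortcut is best reserved for small changes.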
Whether a house is on the waterfront, the quality of its view, and its condition also affect sale price. A waterfront property is associated with a log of sale price that is 0.45 higher, or a sale price about 1.57 times as high (roughly 57 percent higher).
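Because price is log-transformed, a dummy-variable coefficient b corresponds to multiplying the price by exp(b) rather than adding a dollar amount. The sketch below converts the two coefficients quoted above; the function names are hypothetical.

```python
import math

def price_multiplier(coef):
    """Factor by which price is multiplied when a dummy variable switches from 0 to 1."""
    return math.exp(coef)

def pct_difference(coef):
    """The same effect expressed as a percent difference in price."""
    return (math.exp(coef) - 1) * 100

zip_98039 = price_multiplier(1.50)   # about 4.48 times the reference price
waterfront = price_multiplier(0.45)  # about 1.57 times the non-waterfront price
```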
Given the regression results, I would recommend the following:
- Focus on finding properties in advantageous zip codes
- Focus on larger houses, particularly those with more living space
- Prioritize waterfront properties and properties with good views, which tend to yield higher prices
- Pay attention to condition: a house in poor condition tends to sell for less