
krugergui/house-prices-balneario-camboriu



Problem description

At the moment there are more than 15,000 houses and apartments for sale in the small and beautiful city of Balneário Camboriú, Brazil. With an area of just 45 km², this means there are plenty of options no matter where you want to live, near the Interpraias Park or near the Big Wheel.

GasparRocha @ pixabay.com

We all know that when we are overwhelmed with options we tend to struggle to choose one. So, to help out a friend who was searching for a place to buy in this city, I decided to analyse and sort all the options he had and find one that wasn't overpriced, or even better, was underpriced.

For this I decided to use statistical analysis and ML models to predict the price of each place based on all the other houses, then find the ones whose real value is below the predicted price and present these as his choices.

Collecting the data

To begin this search I scraped a website's listings and stored all the data in a relational database (MySQL). On the first day alone there were 9,616 scraped listings; this would grow to 15,469 over the next 20 days.

Website scraping in progress

Note

scrape_data.ipynb
In this code you’ll find:
🐍 Python - Web scraping with Selenium in the Chrome browser and JavaScript injection
🗃 SQL Database - Data retrieval from different tables, unification, updates and alterations (new columns are inserted)
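The notebook holds the full scraper; below is a minimal sketch of the Selenium approach. The URL and CSS selectors are illustrative placeholders, not the ones actually used.

```python
# Minimal sketch of the scraping loop (not the project's actual code);
# the URL and CSS selectors are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example-listings-site.com/balneario-camboriu?page=1")

# JavaScript injection, e.g. to force lazy-loaded content to render
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

listings = []
for card in driver.find_elements(By.CSS_SELECTOR, ".listing-card"):
    listings.append({
        "title": card.find_element(By.CSS_SELECTOR, "h2").text,
        "price": card.find_element(By.CSS_SELECTOR, ".price").text,
        "address": card.find_element(By.CSS_SELECTOR, ".address").text,
    })
driver.quit()
```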

The listings included the address, description, price, area, number of bedrooms and bathrooms, and sometimes a condominium fee and floor. There was also a huge list of amenities, 167 different types in total, so I created 2 SQL tables inside the main database, one for the listings and one for the amenities, with a many-to-many link table between them:

Schema of the Database
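A minimal sketch of that schema; the table and column names are my assumptions based on the description above, and while the project used MySQL, sqlite3 is used here so the example runs standalone.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE listings (
    id        INTEGER PRIMARY KEY,
    address   TEXT, description TEXT,
    price     REAL, area REAL,
    bedrooms  INTEGER, bathrooms INTEGER,
    condo_fee REAL, floor INTEGER
);
CREATE TABLE amenities (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE              -- 167 distinct amenity types
);
-- many-to-many link table between listings and amenities
CREATE TABLE listing_amenities (
    listing_id INTEGER REFERENCES listings(id),
    amenity_id INTEGER REFERENCES amenities(id),
    PRIMARY KEY (listing_id, amenity_id)
);
""")
```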

Exploring the data

Note

Check the code!
data_exploration.ipynb In this code you’ll find:
🐍 Python

  • Data manipulation with Pandas
  • Data visualization with Matplotlib and Seaborn
  • Statistics with Scipy
  • Machine Learning with scikit-learn

🗃 SQL Database - Data retrieval

Using Pandas and Matplotlib we can take a quick look at our pool of options:

Price overview: histogram and boxplots of the listing prices

Our price range is huge. On the right we can see the outliers as dots; the cheapest option is R$250K, while at the top we can see our most expensive place at a staggering R$55M, 220 times more expensive than the cheapest one.

Analysing this boxplot (right) and the histogram on the top left we can take a look at all the prices; they are heavily concentrated below the R$5M mark (more than 83% of all entries).

In the middle of the left-hand graphs we can take a look at the most expensive options; there are 74 of them above R$20M.

And at the bottom we see the least expensive options. There are quite a few choices here: my friend stipulated a budget around the R$1M mark, and from R$800K to R$1.2M there are 766 options in our training set.

Cleaning up the data

Duplicates

After scraping the data we have to clean it before analysing it. Even though each entry was unique, sometimes the same place was put up several times; this can happen because several units of the same building are for sale, or because the same place was entered into the website's system more than once, so the duplicated places were deleted.
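In Pandas this boils down to a drop_duplicates call; a sketch, where the matching columns are my assumption:

```python
import pandas as pd

df = pd.read_sql("SELECT * FROM listings", con)  # con: the database connection

# Treat identical address/area/price/room counts as the same place
# posted more than once (the exact matching columns are an assumption)
df = df.drop_duplicates(subset=["address", "area", "price", "bedrooms", "bathrooms"])
```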

Creating new features

In the title of each entry there was the type of property for sale; extracting this from the string results in the following types:

Counts of each property type extracted from the titles

The first three are Apartment, House and Penthouse, respectively; after that we have commercial spots and free lots, which we are not interested in, so we'll exclude those. Gated community houses (Casa de Condomínio) will also be included in the research.
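A sketch of the extraction and filtering; the title format ("&lt;type&gt; à venda ...") is an assumption:

```python
# Extract the property type from the start of the title (assumed format)
df["type"] = df["title"].str.split(" à venda").str[0].str.strip()

# Keep apartments, houses, penthouses and gated community houses;
# drop commercial spots, free lots, etc.
keep = ["Apartamento", "Casa", "Cobertura", "Casa de Condomínio"]
df = df[df["type"].isin(keep)]
```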

This left us with this distribution:

Count of entries by property type

District

The address also contains the district the place is in; to find out which districts belong to Balneário Camboriú I retrieved a table from Wikipedia. After the first run it was evident that there was also an unofficial district - Barra Sul - which had very high prices.
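Pandas can pull such a table directly from the page; a sketch, where the article URL, table index and column name are assumptions:

```python
import pandas as pd

# read_html returns every <table> on the page as a DataFrame
url = "https://pt.wikipedia.org/wiki/Balneário_Camboriú"  # assumed article
tables = pd.read_html(url)
districts = tables[0]["Bairro"].tolist()  # "Bairro" = district (assumed header)

# Barra Sul is unofficial, so it has to be added by hand
districts.append("Barra Sul")
```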

Count of entries by district

Below we can see the price distribution among these districts:

Price distribution by district

Centro (City Center) has the widest range - as it has most of the entries - while Barra Sul has a higher average than the others. Vila Real has the cheapest entries, as it is located behind the highway, far away from the beach, making it undesirable for most who come here looking to enjoy the sea.

Google Maps

Readying the data for the models

To get better precision from the ML models, a few things were checked in the data:

Outliers

Outliers can heavily influence the outcome of a prediction. A few causes of outliers are:

  • Wrongly inputted/collected data
  • Wrong unit of measure
  • True outliers, which should be analysed individually to decide whether they belong in the training data

This analysis was carried out in a few different ways; the first was to analyse the boxplots:

Boxplots of the numerical features; area on the left

Here, in the area box plot on the left, we see that the top data point really stands out. Analysing the entry, we see that it was an empty lot wrongly entered as an apartment, so it will be removed.

In the lower range there were some outliers as well. Taking a closer look at them, we saw many values of exactly 10 m², which is obviously impossible for an apartment. After cross-referencing with the website's listings it became clear that, when no value was given, the listing defaulted to 10 m². These listings will also not be included in the training data.

To set a numerical threshold for what constitutes an outlier, the standard rule from the literature was used: anything below the first quartile (Q1) minus 1.5 times the interquartile range, which removed 44 of the entries.
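As a sketch, the lower-fence rule applied to the area feature:

```python
# Lower fence: Q1 - 1.5 * IQR (the rule described above)
q1, q3 = df["area"].quantile([0.25, 0.75])
lower_fence = q1 - 1.5 * (q3 - q1)

df = df[df["area"] >= lower_fence]  # in the real data this removed 44 entries
```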

On the price box plot below we also see a very strong outlier:

Boxplot of the price feature

With a price of R$55M it is 70% more expensive than the second most expensive, but this wasn't a mistake: it really is a unique condo, so it will be included in the training data.

The same analysis was carried out on all numerical features.

Normality

As seen above, the price distribution is heavily right-skewed, which usually biases our models. We can easily check this with a Quantile-Quantile plot, and when non-normality is found, we can check whether the log of the data becomes normal:

Q-Q plots of the price and of log(price)

On the top left we can see that the data is non-normal, as it deviates heavily from the red line; on the top right we can see that the log of the price is very close to normal, and we can use this to get a better prediction from our models.

The same analysis was carried out on all numerical features.
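A sketch of the Q-Q check with SciPy, assuming the price column from the earlier steps:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Q-Q plots of the raw price and of log(price) against a normal distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(df["price"], dist="norm", plot=ax1)
ax1.set_title("Price")
stats.probplot(np.log(df["price"]), dist="norm", plot=ax2)
ax2.set_title("log(Price)")
plt.show()
```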

Empty values

Sometimes the data is not complete: the Garages feature had a few empty values, 388 to be exact, and to be able to use this feature in our models some value must be given. Just using the average would skew the data, since big apartments would be left with too few garages and small ones would get too many, so a ratio of garages per m² was computed and used to fill the training data.
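A sketch of that imputation (column names are assumptions):

```python
# Garages-per-m² ratio, computed only on rows where garages are known
known = df["garages"].notna()
ratio = df.loc[known, "garages"].sum() / df.loc[known, "area"].sum()

# Fill the 388 missing values proportionally to each apartment's area
df.loc[~known, "garages"] = (df.loc[~known, "area"] * ratio).round()
```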

Which features to use?

Using too many features could clog up the models and make them biased; having too few would create the same problem. To address this, a few methods were used.

Correlation

The stronger a feature's relationship with the target (Price), and the less it merely duplicates another feature, the better it can train the models. We can easily check both with a correlation matrix:

Pearson and Spearman correlation matrices of the numerical features

With the exception of the condominium fee - which had many empty values to begin with - all the numerical features seem to have a good linear relationship with the target feature and can be used to describe it. Furthermore, none of the features seem to completely explain one another; if that were the case we'd have to remove one of them so as not to bias the models.

The difference between Spearman's and Pearson's correlations indicates that the condominium fee does not have a linear relationship with price; analysing a scatterplot of price against condominium fee, we see a second-degree relationship, which should be applied to the feature in case it is used in the models.
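A sketch of computing both matrices side by side (the feature names are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["price", "area", "bedrooms", "bathrooms", "garages", "condo_fee"]

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, method in zip(axes, ["pearson", "spearman"]):
    sns.heatmap(df[num_cols].corr(method=method), annot=True, cmap="coolwarm", ax=ax)
    ax.set_title(f"{method.title()} correlation")
plt.show()
```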

Categorical feature selection

To fine-tune the categorical feature selection, a statistical analysis was conducted. It consisted of verifying whether the populations with and without a specific feature had very different characteristics; this analysis specifically compared means and standard deviations.

This was done with A/B testing using various methods (Shapiro-Wilk to check for normality, Levene for homogeneity of variances, the t-test for normal distributions and Mann-Whitney U for non-normal distributions), which resulted in p-values defining whether the groups with and without a feature were too similar: a high p-value (above 0.05, at the 95% confidence level) indicates that the groups are similar and the feature shouldn't be used, while a low p-value meant the feature would be used. A sketch of this pipeline is shown below.
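The sketch below applies that decision chain to a single amenity; the column layout, with one 0/1 column per amenity and a log_price target, is an assumption.

```python
from scipy.stats import levene, mannwhitneyu, shapiro, ttest_ind

def keep_feature(df, amenity, alpha=0.05):
    """True if the groups with/without the amenity differ enough to keep it."""
    with_it = df.loc[df[amenity] == 1, "log_price"]
    without = df.loc[df[amenity] == 0, "log_price"]

    # Shapiro-Wilk: are both groups normally distributed?
    if shapiro(with_it).pvalue > alpha and shapiro(without).pvalue > alpha:
        # Levene: can we assume equal variances for the t-test?
        equal_var = levene(with_it, without).pvalue > alpha
        p = ttest_ind(with_it, without, equal_var=equal_var).pvalue
    else:
        p = mannwhitneyu(with_it, without).pvalue

    return p < alpha  # low p-value: groups differ, the feature carries signal
```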

Example of features that were considered too similar and weren’t used in the models:

Price distributions with and without the feature (too similar)

Example of features that were deemed good for the models:

Price distributions with and without the feature (clearly different)

Training the models

Linear regression

As a baseline for model selection we'll use linear regression, which fits a simple straight line for each feature. We'll try the plain regressor, which has no extra parameters, as well as the Lasso and Ridge regressors, both of which have a regularization parameter (alpha) that reduces overfitting; a few values for this parameter were tried.

Code for the linear regressor
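The full code is in the notebook; a minimal sketch of the comparison might look like this, where X and y_log (the log-transformed price) are assumed to come from the preparation steps above and the alpha grid is an assumption.

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42)

models = {"Linear": LinearRegression()}
for alpha in (0.001, 0.01, 0.1, 1.0):     # assumed alpha grid
    models[f"Lasso(alpha={alpha})"] = Lasso(alpha=alpha)
    models[f"Ridge(alpha={alpha})"] = Ridge(alpha=alpha)

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, mean_squared_error(y_test, model.predict(X_test)))
```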

Predicted vs. real values (top) and residuals (bottom) for the linear regressors

In the picture above we see that the best linear regressor found was the Lasso regressor, with a mean squared error of 0.75 for both the training and test sets. The top two graphs show the predicted value (x-axis) plotted against the real value (y-axis); a perfect prediction would see all values falling on the red line.

In the bottom graphs we can see the residuals (how much the predictions differ from the real values). If there were a pattern in these residuals, it would mean there was a relationship in the prices that the model hadn't found, but there is no pattern to be seen.

Random Forest Regressor

A better regressor is a random forest. It builds many decision trees, each using different features and leaf nodes, with different conditions to reach each leaf node; all these trees then “vote” on the predicted value.

💡 Click below to check the code!

Code for random forest regressor
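The full code is in the notebook; a minimal sketch with the default hyperparameters:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)  # default hyperparameters
rf.fit(X_train, y_train)
print("train R²:", rf.score(X_train, y_train))
print("test  R²:", rf.score(X_test, y_test))
```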

Predicted vs. real values and residuals for the random forest regressor

With the default values we can see that the random forest regressor already yields a much better result in predicting the true value; the residuals still have no discernible pattern, and the data comes a lot closer to a straight line.

Hyperparameter tuning was conducted, and this resulted in a slightly improved score of 0.838 on the test data set.

💡 Click below to check the hyperparameter tuning code!

Hyperparameter tuning code for the random forest regressor
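The actual search is in the notebook; a sketch of one common approach, with an assumed search space:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {                 # assumed search space
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20, 40],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", 0.5, 1.0],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions, n_iter=30, cv=5, scoring="r2",
    n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```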

Other tested models

Models that were also tested but performed worse were LightGBM, XGBoost and GradientBoostingRegressor. These can be seen in the data_exploration.ipynb notebook.

Chosen model: CatBoost

CatBoost had the best score with default parameters, so a more detailed hyperparameter search was conducted.

Almost all of the default values already led to the best model, with the exception of the learning rate, which managed to slightly improve the R² score.

Results of the CatBoost hyperparameter search
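A minimal sketch of the final model setup (the learning rate value here is an assumption, not the tuned one):

```python
from catboost import CatBoostRegressor

# Defaults were already near-optimal; only the learning rate was tuned
model = CatBoostRegressor(learning_rate=0.05, random_seed=42, verbose=0)
model.fit(X_train, y_train, eval_set=(X_test, y_test))
print("test R²:", model.score(X_test, y_test))
```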

Further residual analysis

When visualizing the residuals, it became clear that the bigger residuals were in the entries with the bigger areas. To see if there was a pattern in the predictions, the residuals were also plotted against the features:

Residuals plotted against the features

As the plots show, the larger the area, the more inaccurate our models are; if greater precision is needed, a new model could be trained on the smaller areas.

Finding the best deals

Back to the original problem: we wanted to find the best bang for the buck in the R$800K to R$1.2M range. For this we'll search for apartments whose real price is below the predicted value, and analyse them manually.
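A sketch of that filter, where feature_cols and the log-price model are assumed to come from the steps above:

```python
import numpy as np

# Restrict to the friend's budget and predict prices (undo the log transform)
budget = df[df["price"].between(800_000, 1_200_000)].copy()
budget["predicted"] = np.exp(model.predict(budget[feature_cols]))
budget["residual"] = budget["predicted"] - budget["price"]

# Positive residual: the model thinks it's worth more than the asking price
deals = budget[budget["residual"] > 0].sort_values("residual", ascending=False)
```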

Predicted vs. real prices; entries below the dashed line are candidates

The entries below the dashed line, 19 in total, are our objects of interest; if the model gave a good prediction, they should be undervalued.

Looking at the biggest residual of them all, we see something interesting: this apartment is beachfront, with a huge area and 4 bathrooms, undervalued by R$3.3M. A great deal, right? Well, it's a scam.

The next 3 points, and most of these “best deals”, show another problem with the models: they are really old apartments, something the collected data wasn't able to capture, further proving that a model is only as good as the data it receives.

Another problem found when viewing these entries is the district feature: “Centro” includes beachfront apartments as well as “highway-front” ones, as seen below. A better predictor would be the distance to the beach.

Map of the Centro district, from the beachfront to the highway

Also, while reviewing the options, some pictures seemed familiar: 3 of the listings were using the same pictures for apartments in different parts of the city, with different areas and bedroom counts, also indicating that these could be scams.

Here’s a list of all potential places sorted by residual value:

List of candidate listings sorted by residual value

Data leakage test

To be sure that no data was leaked during the first data analysis, 3,000 new rows of data were acquired and run through the previous models; this is the result:

Model scores on the 3,000 new rows

CatBoost had a minuscule drop, which shows how robust CatBoost is and that our analysis and modelling were consistent; they could still be improved with more data, for example latitude and longitude, year of construction, floor, quality of materials, number of suites, etc.
