Update 08.decision_tree.md
sofia-frenk authored Nov 15, 2024
1 parent efc7aa3 commit 6b3a020
Showing 1 changed file with 56 additions and 5 deletions.
61 changes: 56 additions & 5 deletions content/08.decision_tree.md
@@ -1,7 +1,7 @@
# Decision Tree Analysis

This section is dedicated to decision tree analysis. Because the dependent variable is continuous rather than categorical, the DecisionTreeRegressor from scikit-learn was employed.
After the first decision tree was fitted on the original dataset (with Duration_hours and Duration_min combined into a single variable, Total_Duration), the $R^2$ value was 0.999977. This value seemed suspiciously perfect.
The effect of this high $R^2$ value is also visible in the figure below, a plot of actual vs. predicted values, in which the predicted values fall almost perfectly along the actual values.

<p align="center">
@@ -10,14 +10,65 @@
<strong>Figure 7:</strong> Actual vs predicted values using the original dataset.
</p>
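
As an illustration of how a near-perfect score of this kind can arise, the sketch below fits a DecisionTreeRegressor to synthetic data, not the project's dataset, in which the target is almost fully determined by two features. The feature names, value ranges, noise level, and the 80/20 train/test split are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the project's dataset): emissions are built as
# duration times a hypothetical fuel-burn rate plus a little noise, so the
# target is almost fully determined by the features.
rng = np.random.default_rng(0)
n = 2000
total_duration = rng.uniform(1.0, 15.0, size=n)  # hours (assumed range)
fuel_rate = rng.uniform(2.0, 4.0, size=n)        # hypothetical rate
co2_emitted = total_duration * fuel_rate + rng.normal(0.0, 0.05, size=n)

X = np.column_stack([total_duration, fuel_rate])
X_train, X_test, y_train, y_test = train_test_split(
    X, co2_emitted, test_size=0.2, random_state=0
)

# Fit an unconstrained regression tree and score it on held-out rows.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)
r2 = r2_score(y_test, tree.predict(X_test))
print(f"Test R^2: {r2:.4f}")
```

When the target is close to a deterministic function of the inputs, a very high $R^2$ is expected rather than alarming, which is why checking the relationship between the features and the target is a reasonable next step.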



To understand the origin of this $R^2$ value, a correlation matrix was first created from the original dataset. It is shown below in Figure 8:

<p align="center">
<img src="images/Correlation_Mat_Original_Data.png" alt="Correlation matrix created using the original dataset" width="600px">
<br>
<strong>Figure 8:</strong> Correlation matrix created using the original dataset.
</p>

As the figure above shows, the highest correlation is between Total_Duration and CO2_Emitted (US Ton), the dependent variable. This makes sense, because the longer the plane is in flight, the more $CO_2$ will be emitted.
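
A correlation check of this kind can be sketched with pandas as follows. The data are synthetic, the column Unrelated_Feature is hypothetical, and the linear relationship between duration and emissions is an assumption chosen only to make the pattern visible. This is not the code that produced Figure 8.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the project's dataset).
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "Total_Duration": rng.uniform(1.0, 15.0, size=n),  # hours (assumed range)
    "Unrelated_Feature": rng.normal(size=n),           # hypothetical column
})
# Assumed relationship: emissions grow linearly with flight duration.
df["CO2_Emitted (US Ton)"] = 3.0 * df["Total_Duration"] + rng.normal(0.0, 1.0, size=n)

# Pearson correlation matrix of all numeric columns.
corr = df.corr()

# Rank the features by absolute correlation with the dependent variable.
target = "CO2_Emitted (US Ton)"
target_corr = corr[target].drop(target)
strongest = target_corr.abs().idxmax()

print(corr.round(3))
print("Strongest correlate of the target:", strongest)
```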

To probe this suspicious result, we divided the original dependent variable, CO2_Emitted (US Ton), by Total_Duration to create a new dependent variable called CO2_Emitted/Hour.
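
The normalization described above amounts to a single column division. The sketch below uses made-up values, and the choice to drop the original emissions column from the feature set is an assumption for the example, not something the text confirms.

```python
import pandas as pd

# Made-up values for illustration only (NOT the project's dataset).
df = pd.DataFrame({
    "Total_Duration": [2.0, 4.0, 8.0],           # hours
    "CO2_Emitted (US Ton)": [10.0, 22.0, 36.0],
})

# New dependent variable: emissions per hour of flight.
df["CO2_Emitted/Hour"] = df["CO2_Emitted (US Ton)"] / df["Total_Duration"]

# Assumption: the original emissions column is excluded from the features,
# since the new target is computed directly from it.
X = df.drop(columns=["CO2_Emitted (US Ton)", "CO2_Emitted/Hour"])
y = df["CO2_Emitted/Hour"]

print(df)
```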

<p align="center">
<img src="images/Actual_Predicted_CO2_Emission/Hour.png" alt="Actual vs predicted values using CO2_Emitted/Hour as a dependent variable" width="600px">
<br>
<strong>Figure 9:</strong> Actual vs predicted values using CO2_Emitted/Hour as a dependent variable.
</p>

<p align="center">
<img src="images/Correlation_Mat_CO2_Emissions/Hour.png" alt="Correlation matrix created using CO2_Emitted/Hour as a dependent variable" width="600px">
<br>
<strong>Figure 10:</strong> Correlation matrix created using CO2_Emitted/Hour as a dependent variable.
</p>

Note the appearance of binomial data in the distribution below.

<p align="center">
<img src="images/Distribution_CO2_Emitted_Hour.png" alt="Distribution of the dependent variable CO2_Emitted/Hour" width="600px">
<br>
<strong>Figure 11:</strong> Distribution of the dependent variable CO2_Emitted/Hour.
</p>

<p align="center">
<img src="images/Actual_Predicted_CO2_Emissions_Fuel.png" alt="Actual vs predicted values using CO2_Emitted/Fuel_Usage_Rate as a dependent variable" width="600px">
<br>
<strong>Figure 12:</strong> Actual vs predicted values using CO2_Emitted/Fuel_Usage_Rate as a dependent variable.
</p>

<p align="center">
<img src="images/Correlation_Mat_CO2_Emissions_Fuel.png" alt="Correlation matrix created using CO2_Emitted/Fuel_Usage_Rate as a dependent variable" width="600px">
<br>
<strong>Figure 13:</strong> Correlation matrix created using CO2_Emitted/Fuel_Usage_Rate as a dependent variable.
</p>

Note the appearance of binomial data in the distribution below.

<p align="center">
<img src="images/Distribution_Fuel_Consuption.png" alt="Distribution of Fuel_Usage_Rate" width="600px">
<br>
<strong>Figure 14:</strong> Distribution of Fuel_Usage_Rate.
</p>

Data mimicry: note that Total_Duration and Fuel_Consumption_Rate are the most influential independent variables.

<p align="center">
<img src="images/Distribution_CO2_Emission_Fuel.png" alt="Distribution of the dependent variable CO2_Emitted/Fuel_Usage_Rate" width="600px">
<br>
<strong>Figure 15:</strong> Distribution of the dependent variable CO2_Emitted/Fuel_Usage_Rate.
</p>

<p align="center">
<img src="images/Distribution_Total_Duration.png" alt="Distribution of Total_Duration" width="600px">
<br>
<strong>Figure 16:</strong> Distribution of Total_Duration.
</p>
