Based on data extracted from our literature review, this project performs data analysis and visualisation. The main pipeline identifies key features for enablers and entry points and produces co-occurrence visualisations.
The pipeline includes data preprocessing, random forest analysis, co-occurrence analysis, and visualization of results through heatmaps.
root/
├── src/
│   ├── data_processing/
│   │   └── general_preprocessing.py
│   ├── analysis/
│   │   ├── random_forest.py
│   │   └── Co_occurrence.py
│   └── visualization/
│       └── heatmap.py
├── Technology_Books/
│   ├── Transport.py
│   ├── Coal.py
│   └── Renewables.py
└── README.md
- Data Preprocessing: The `general_preprocessing.py` module loads and preprocesses the input data, cleaning and vectorizing the 'Enabler' and 'Entry' columns.
- Random Forest Analysis: The `random_forest.py` module performs random forest analysis to identify the most important features (enablers and entries) for each technology sector.
- Co-occurrence Analysis: The `Co_occurrence.py` module calculates co-occurrence matrices and identifies secular enablers across clusters.
- Visualization: The `heatmap.py` module creates and saves heatmaps to visualize the co-occurrence data.
- Sector-Specific Analysis: The `Transport.py`, `Coal.py`, and `Renewables.py` files in the `Technology_Books` folder run the complete analysis pipeline for each respective sector.
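The vectorization and co-occurrence steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in `general_preprocessing.py` or `Co_occurrence.py`: the toy records, the `;` separator, the `vectorize` helper, and the `enabler:` / `entry:` prefixes are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy records; the real column contents and separator may differ.
records = pd.DataFrame({
    "Enabler": ["subsidy; grid access", "subsidy", "grid access; public support"],
    "Entry": ["pilot project", "pilot project; tax reform", "tax reform"],
})


def vectorize(column: pd.Series, prefix: str) -> pd.DataFrame:
    """One-hot encode a column of separator-delimited labels (assumed format)."""
    cleaned = (
        column.fillna("")
        .str.lower()
        .str.replace(r"\s*;\s*", ";", regex=True)  # normalise spacing around ';'
        .str.strip()
    )
    return cleaned.str.get_dummies(sep=";").add_prefix(prefix)


# One binary column per distinct enabler and entry label.
features = pd.concat(
    [vectorize(records["Enabler"], "enabler:"), vectorize(records["Entry"], "entry:")],
    axis=1,
)

# Co-occurrence matrix: cell (i, j) counts the records containing both labels.
# The diagonal holds the number of records in which each label appears.
cooccurrence = features.T.dot(features)
print(cooccurrence)
```

Multiplying the binary feature matrix by its own transpose is one common way to obtain symmetric co-occurrence counts; the actual module may compute them differently, for example per cluster.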
- Ensure you have all the required dependencies installed (pandas, numpy, matplotlib, scikit-learn, etc.).
- Place your input data files in the appropriate location and update the `INPUT_FILE` path in the respective sector-specific Python files.
- Run the sector-specific analysis by executing the corresponding Python file:
  python Technology_Books/Transport.py
  python Technology_Books/Coal.py
  python Technology_Books/Renewables.py
- The results, including heatmaps and analysis outputs, will be saved in the `output` directory.
- You can adjust the number of top enablers and entries to analyze by modifying the `n_enablers` and `n_entries` parameters in the sector-specific files. This alters the behaviour of the random forest model, because it is trained according to the total number of features (number of enablers plus entries). Diminishing returns were observed beyond 12.
- The `detailed` and `cluster_specific` parameters in the `run_random_forest_analysis` function can be toggled to change the analysis approach. `detailed` ensures class-balanced sampling when cluster sizes are unbalanced; it works best where one cluster is substantially bigger than the others. `cluster_specific`, on the other hand, trains a separate random forest model for each cluster. In theory this extracts more cluster-specific features per cluster than the other approaches, but it requires large cluster populations: it works for some of the clusters in Renewables but less well for the smaller clusters. `cluster_specific` cannot be used in conjunction with `detailed`.
- Color palettes for heatmaps can be customized in the sector-specific files.
- The threshold in `heatmap.py` can be changed to also plot co-occurrences that fall below the current threshold value. In general, `heatmap.py` is safer to edit than the other files, as it is purely post-processing.
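The feature-selection options above can be illustrated with a small sketch. It is not the implementation of `run_random_forest_analysis`: the `top_features` helper, the synthetic data, and the use of scikit-learn's `class_weight="balanced_subsample"` as a stand-in for the balanced sampling behind `detailed` are assumptions. The per-cluster loop shows one plausible reading of `cluster_specific`, namely a one-vs-rest model for each cluster.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_features(X, y, feature_names, n_top, balanced=False, seed=0):
    """Rank features by random forest importance and return the n_top names."""
    model = RandomForestClassifier(
        n_estimators=200,
        # Assumed stand-in for `detailed`: reweight classes in each bootstrap sample.
        class_weight="balanced_subsample" if balanced else None,
        random_state=seed,
    )
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1][:n_top]
    return [feature_names[i] for i in order]


# Synthetic, imbalanced data: the cluster label depends only on f0 and f1.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 6))
y = X[:, 0] & X[:, 1]  # roughly a quarter of the rows fall in cluster 1
names = [f"f{i}" for i in range(X.shape[1])]

# Here n_top plays the role of n_enablers + n_entries (an assumption).
top_plain = top_features(X, y, names, n_top=2)
top_balanced = top_features(X, y, names, n_top=2, balanced=True)

# One-vs-rest model per cluster, an assumed reading of `cluster_specific`.
per_cluster = {
    int(c): top_features(X, (y == c).astype(int), names, n_top=2)
    for c in np.unique(y)
}
print(top_plain, top_balanced, per_cluster)
```

Because each per-cluster model sees only that cluster versus the rest, small clusters give it very few positive examples, which is consistent with the note above that the approach needs large cluster populations.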
The pipeline generates the following outputs:
- Random Forest analysis results (saved as joblib files)
- Co-occurrence heatmaps (saved as PNG files)
- Lists of top enablers and entries
- Secular enablers across clusters
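As a rough illustration of how a thresholded co-occurrence heatmap could be written to a PNG file, the sketch below masks cells under a threshold before plotting. The `plot_heatmap` function, its parameters, the toy matrix, and the output file name are hypothetical and do not reflect the actual interface of `heatmap.py`; hiding cells below the threshold is likewise an assumption about how the threshold is applied.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np


def plot_heatmap(matrix, labels, threshold, path, cmap="viridis"):
    """Save a heatmap in which cells below `threshold` are left blank."""
    masked = np.ma.masked_less(matrix, threshold)  # mask values < threshold
    fig, ax = plt.subplots(figsize=(4, 4))
    image = ax.imshow(masked, cmap=cmap)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax, label="co-occurrence count")
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return masked


# Toy symmetric co-occurrence counts for three hypothetical labels.
counts = np.array([[3, 1, 0], [1, 2, 2], [0, 2, 4]])
labels = ["subsidy", "grid access", "pilot project"]
out_path = os.path.join(tempfile.mkdtemp(), "cooccurrence_heatmap.png")
masked = plot_heatmap(counts, labels, threshold=2, path=out_path)
print(out_path)
```

Lowering `threshold` in this sketch makes weaker co-occurrences visible, and swapping `cmap` changes the color palette without touching the underlying counts.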
- The Renewables analysis (Wind & Solar) is split into two batches with different cluster groups.
- The Coal and Transport analyses process all clusters together.
- Make sure to update file paths and adjust parameters as needed for your specific use case.
For more detailed information on each module and its functions, please refer to the docstrings and comments within the individual Python files.