Code repository for "Machine Learning and the Implementable Efficient Frontier" by Jensen, Kelly, Malamud, and Pedersen (2024)

Saarialho/ml-and-the-implementable-efficient-frontier

 
 

Overview

This repository contains the code used for the paper Machine Learning and the Implementable Efficient Frontier by Jensen, Kelly, Malamud, and Pedersen (2024). Please cite this paper if you are using the code:

@article{JensenKellyMalamudPedersen2024,
	author = {Jensen, Theis Ingerslev and Kelly, Bryan and Malamud, Semyon and Pedersen, Lasse Heje},
	title = {Machine Learning and the Implementable Efficient Frontier},
	year = {2024}
}

Please send questions about the code to Theis I. Jensen at [email protected].

How to run the code

To run the code, clone this repo to your local computing environment, and follow the steps explained below. We note that replicating our analysis requires substantial computational resources, and the code is set up to be executed on a high performance computing cluster with a SLURM scheduler.

Data

You need eight data sets to run the code.

  • usa.csv
    • Firm characteristics at a monthly frequency from the paper Is There a Replication Crisis in Finance? by Jensen, Kelly, and Pedersen (2023)
    • Download from WRDS. To get US data, require that the column excntry is equal to "USA"
  • usa_dsf.csv
    • Stock returns at a daily frequency
    • The data can be generated by following the instructions from the GitHub repository from "Is There a Replication Crisis in Finance." Alternatively, you can request the data from us by sending an email to [email protected]
  • world_ret_monthly.csv
    • Stock returns at a monthly frequency
    • The data can be generated by following the instructions from the GitHub repository from "Is There a Replication Crisis in Finance." Alternatively, you can request the data from us by sending an email to [email protected]
  • Factor Details.xlsx
  • Cluster Labels.csv
  • market_returns.csv
    • Market returns from "Is There a Replication Crisis in Finance"
    • Download from Dropbox
  • ff3_m.csv
  • short_fees
    • Short-selling fees based on the Markit Securities Finance Analytics - American Equities database. You can run the vast majority of the code without this data set (the exception being 6 - Short selling fees.R)
    • Download from WRDS

These data sets should be saved in the Data folder with the exact names used above.
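Before submitting any jobs, it can help to confirm that all eight inputs are in place. The sketch below is one way to do that, not part of the repository: the function name `missing_data_files` and its default folder argument are illustrative assumptions, and only the file names come from the list above.

```shell
# Print the name of every required input that is missing from a data folder.
# The file names are taken from the list above; the function name and the
# default folder are illustrative assumptions, not part of the repository.
missing_data_files() (
  data_dir="${1:-Data}"
  for name in \
    "usa.csv" \
    "usa_dsf.csv" \
    "world_ret_monthly.csv" \
    "Factor Details.xlsx" \
    "Cluster Labels.csv" \
    "market_returns.csv" \
    "ff3_m.csv" \
    "short_fees"
  do
    # -e accepts a file or a folder, since the list does not say which
    # of the two short_fees is.
    if [ ! -e "$data_dir/$name" ]; then
      printf '%s\n' "$name"
    fi
  done
)

# Usage: no output means every input was found.
# missing_data_files Data
```

Because short_fees is only needed for 6 - Short selling fees.R, a report that lists only that entry does not block the rest of the pipeline.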

Data generation

In this section, we go through the steps needed to estimate the return prediction models and implement the portfolio choice methods used in the paper. This step is by far the most computationally intensive. We used the dSQ module to submit multiple jobs at the same time to a Slurm scheduler. Below, we include our dSQ calls to give you a sense of the computational resources required to run each step.
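Two small checks can save a failed submission. Slurm does not create a missing folder for `--output`, so the `slurm_output` folder named in the dSQ calls should exist before you submit. It is also useful to confirm how many jobs a job file will start. The sketch below assumes that dSQ treats each line of the job file as one job and skips blank lines and lines beginning with `#`; the helper names `count_jobs` and `prepare_output_dir` are hypothetical.

```shell
# Count the jobs in a dSQ job file: one job per line, skipping blank lines
# and lines that begin with '#'. This mirrors how dSQ is assumed to read
# the file; verify against your installed dSQ version.
count_jobs() {
  grep -v -e '^[[:space:]]*$' -e '^#' "$1" | wc -l | tr -d ' '
}

# Create the folder used by the --output patterns in the dSQ calls.
# Slurm does not create missing output folders on its own.
prepare_output_dir() {
  mkdir -p "${1:-slurm_output}"
}

# Usage:
# prepare_output_dir slurm_output
# count_jobs Joblists/joblist_models.txt
```

If the count differs from the number of jobs stated for a step, inspect the job file before submitting.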

Return prediction models

  • What: Estimate the 12 models used to predict realized returns at time t+1, t+2, ..., t+12
  • dSQ call: dsq --job-file Joblists/joblist_models.txt --cpus-per-task=32 --mem=100G --partition=day -t 06:00:00 --mail-type ALL --output slurm_output/dsq-joblist_models-%A_%1a-%N.out. This call will start 12 independent jobs, which for us took a maximum of 5 hours and required approximately 75GB RAM for each job
  • Main R script: slurm_fit_models.R
  • Output folder: Data/Generated/Models

Portfolios: base case

  • What: Implement portfolio choice methods with the base case parameters used for tables 2-4 and figures 2-4 and D.4
  • dSQ call: dsq --job-file Joblists/joblist_pfchoice_base.txt --cpus-per-task=48 --mem=60G --partition=week -t 1-10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-base-%A_%1a-%N.out. This call will start 1 job, which for us took a maximum of 6 hours and required approximately 40GB RAM
  • Main R script: slurm_build_portfolios.R
  • Output folder: Data/Generated/Portfolios

Portfolios: all

  • What: Implement the portfolio choice methods for all stocks used for the top-left panel in Figure 8
  • dSQ call: dsq --job-file Joblists/joblist_pfchoice_all.txt --cpus-per-task=32 --mem=100G --partition=week -t 5-00:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-all-%A_%1a-%N.out. This call will start 1 job, which for us took a maximum of 2 days and 16 hours and required approximately 70GB RAM
  • Main R script: slurm_build_portfolios.R
  • Output folder: Data/Generated/Portfolios

Portfolios: size groups

  • What: Implement the portfolio choice methods for stocks in different size groups used for the remaining panels in Figure 8
  • dSQ call: dsq --job-file Joblists/joblist_pfchoice_size.txt --cpus-per-task=16 --mem=50G --partition=day -t 8:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-size-%A_%1a-%N.out. This call will start 5 jobs, which for us took a maximum of 5 hours and required approximately 30GB RAM for each job
  • Main R script: slurm_build_portfolios.R
  • Output folder: Data/Generated/Portfolios

Implementable efficient frontier

  • What: Implement portfolio choice methods for different combinations of wealth and risk aversion to generate the implementable efficient frontier from Figure 1
  • dSQ call: dsq --job-file Joblists/joblist_pfchoice_ief.txt --cpus-per-task=16 --mem=50G --partition=day -t 10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-ief-%A_%1a-%N.out. This call will start 20 independent jobs, which for us took a maximum of 7 hours and required approximately 40GB RAM for each job
  • Main R script: slurm_build_portfolios.R
  • Output folder: Data/Generated/Portfolios

Economic feature importance

  • What: Implement the permutation-based feature importance analysis used for figures 5, 6, and D.3
  • dSQ call: dsq --job-file Joblists/joblist_pfchoice_fi.txt --cpus-per-task=48 --mem=70G --partition=day -t 23:00:00 --mail-type ALL --output slurm_output/dsq-joblist_pfchoice-fi-%A_%1a-%N.out. This call will start 3 independent jobs, which for us took a maximum of 3 hours and required approximately 35GB RAM for each job
  • Main R script: slurm_build_portfolios.R
  • Output folder: Data/Generated/Portfolios

Simulations

  • What: Implement simulations from Appendix Section E
  • dSQ call: dsq --job-file Joblists/joblist_simulations.txt --cpus-per-task=32 --mem=50G --partition=day -t 10:00:00 --mail-type ALL --output slurm_output/dsq-joblist_simulations-%A_%1a-%N.out. This call will start 15 independent jobs, which for us took a maximum of 9 hours and required approximately 25GB RAM for each job
  • Main R script: simulations/simulations.R
  • Output folder: simulations/results

Data analysis

After generating the data from the previous section, you can analyze it on your local PC. Specifically, you can generate all figures and tables from the paper by running the scripts below. Importantly, you need to go through each script and check that it points to the correct files, because the names of the files generated in the previous section depend on when the code was submitted.
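Since the generated file names depend on the submission time, a quick way to find the name to paste into the analysis scripts is to list the newest entry in an output folder. The sketch below is only a convenience under the assumption that the most recently modified entry is the run you want; the function name `latest_entry` is hypothetical, and the default path is the output folder named in the previous section.

```shell
# Print the most recently modified entry (file or folder) in an output
# folder. Assumes the newest modification time identifies the run you want;
# check the result by hand if you have rerun or edited older output.
latest_entry() (
  dir="${1:-Data/Generated/Portfolios}"
  ls -t "$dir" | head -n 1
)

# Usage:
# latest_entry Data/Generated/Portfolios
# latest_entry Data/Generated/Models
```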

Start by running the main.R script to load the relevant packages and settings.

Next, run the scripts below to create the figures and tables:

  • 6 - Implementable efficient frontier.R
  • 6 - Base analysis.R
  • 6 - Performance across size distribution.R
  • 6 - Feature Importance.R
  • 6 - Economic intuition.R
  • 6 - Short selling fees.R
  • 6 - RF Example.R

Finally, run the scripts below to save the figures and tables, as well as generate various numbers mentioned in the paper:

  • 7 - Figures.R
  • 7 - Tables.R
  • 7 - Numbers.R

After running these scripts, you should have the figures from the paper in the Figures folder, and you should be able to copy-paste the tables (in LaTeX format) and the numbers from the console.
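If you prefer to run the whole analysis from a terminal rather than interactively, the run order above can be sketched as a single R session that sources main.R first and then each script in turn. This is an illustration under assumptions, not the authors' workflow: it assumes the scripts sit in the current working directory, that they have already been edited to point to the correct files, and that one shared session is enough for the settings from main.R to carry over. The function name `analysis_expr` is hypothetical.

```shell
# Print an R expression that sources main.R and then every analysis script
# in the order listed above. print.eval = TRUE makes source() print values
# that would otherwise only appear when a script is run interactively, so
# tables and numbers still reach the console.
analysis_expr() (
  for f in \
    "main.R" \
    "6 - Implementable efficient frontier.R" \
    "6 - Base analysis.R" \
    "6 - Performance across size distribution.R" \
    "6 - Feature Importance.R" \
    "6 - Economic intuition.R" \
    "6 - Short selling fees.R" \
    "6 - RF Example.R" \
    "7 - Figures.R" \
    "7 - Tables.R" \
    "7 - Numbers.R"
  do
    printf 'source("%s", print.eval = TRUE); ' "$f"
  done
)

# Dry run: inspect the expression first.
# analysis_expr
# Execute (requires R on the PATH):
# Rscript -e "$(analysis_expr)"
```

Remove the "6 - Short selling fees.R" line if you do not have the short_fees data set.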

Languages

  • R 99.4%
  • C++ 0.6%