Chinese blog about this project: 量化系列2 - 众包数据集 (Quant Series 2: Crowdsourced Dataset)
To dmnsn7, who provided the Tushare token and made the daily update possible.
- Download the tarball from the latest release page on GitHub
- Extract the tar file to the default qlib directory
wget https://github.com/chenditc/investment_data/releases/download/2023-10-08/qlib_bin.tar.gz
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=1
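After extraction it can help to confirm that the files landed at the right level, since a wrong `--strip-components` value would nest them one folder too deep. The following is a minimal sketch, assuming the extracted data follows the usual qlib binary layout with `calendars`, `features`, and `instruments` folders; the helper name `missing_qlib_subdirs` is hypothetical and not part of this project.

```python
from pathlib import Path

# Assumption: a qlib binary dataset has these top-level folders.
EXPECTED_SUBDIRS = ("calendars", "features", "instruments")


def missing_qlib_subdirs(data_dir):
    """Return the expected sub-directories that are absent from data_dir."""
    root = Path(data_dir).expanduser()
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Same target directory as the tar command above.
    missing = missing_qlib_subdirs("~/.qlib/qlib_data/cn_data")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Data directory looks complete.")
```

An empty result only means the folder structure is in place; it does not check the data itself.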
If you want to contribute to the set of scripts or the data, here is what you should do to set up a dev environment.
Install dolt by following https://github.com/dolthub/dolt
Raw data hosted on dolt: https://www.dolthub.com/repositories/chenditc/investment_data
To download as dolt database:
dolt clone chenditc/investment_data
docker run \
  -v /<some output directory>:/output \
  -it --rm chenditc/investment_data \
  bash -c "bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"
You can use the following parameter to mount an existing dolt chenditc/investment_data folder to the container.
-v /<dolt directory>:/dolt
You will need a Tushare token to use the Tushare API. Get a token from https://tushare.pro/
export TUSHARE=<Token>
bash daily_update.sh
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash daily_update.sh && bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=1
- Try to fill in missing data by combining data from multiple data sources, for example, data for delisted companies.
- Try to correct data by cross-validating against multiple data sources.
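The two goals above can be illustrated with a small sketch. It is not the project's actual merge logic, which lives in the repository scripts; the function name `merge_sources`, the priority ordering, and the 1% tolerance are all assumptions chosen for illustration. Gaps are filled from lower-priority sources, and dates where sources disagree are reported for review instead of being silently overwritten.

```python
def merge_sources(sources, tolerance=0.01):
    """Merge per-date close prices from several data sources.

    sources: list of (name, {date: close}) pairs in priority order.
    Returns (merged, conflicts), where conflicts lists the dates on
    which a lower-priority source disagrees with the merged value by
    more than the relative tolerance.
    """
    merged = {}
    conflicts = []
    for name, prices in sources:
        for date, close in prices.items():
            if date not in merged:
                # Fill a gap that higher-priority sources left open.
                merged[date] = close
            elif abs(merged[date] - close) > tolerance * abs(merged[date]):
                # Keep the higher-priority value, but record the mismatch.
                conflicts.append((date, name, merged[date], close))
    return dict(sorted(merged.items())), conflicts


# Hypothetical prices for one stock from two sources.
ts_prices = {"2023-01-03": 10.0, "2023-01-05": 10.4}
ak_prices = {"2023-01-03": 10.0, "2023-01-04": 10.2, "2023-01-05": 11.0}
merged, conflicts = merge_sources([("ts", ts_prices), ("ak", ak_prices)])
print(merged)
print(conflicts)
```

Here the second source supplies the missing 2023-01-04 row, while the disagreement on 2023-01-05 is flagged for manual inspection.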
The database tables on dolthub are named with the data source as a prefix, for example ts_a_stock_eod_price. The meaning of each prefix:
- w(wind): high quality static data source. Only available till 2019.
- c(caihui): high quality static data source. Only available till 2019.
- ts: Tushare data source
- ak: Akshare data source
- yahoo: Use Qlib's yahoo collector https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo
- baostock: Baostock
- final: Merged final data with validation and correction
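The prefix convention above can be captured in a small lookup. This is only an illustrative sketch: the dictionary mirrors the list above, and the helper name `data_source_of` is hypothetical, not something the project provides.

```python
# Prefix -> data source, taken from the list above.
DATA_SOURCES = {
    "w": "wind",
    "c": "caihui",
    "ts": "Tushare",
    "ak": "Akshare",
    "yahoo": "Qlib yahoo collector",
    "baostock": "Baostock",
    "final": "merged final data",
}


def data_source_of(table_name):
    """Return the data source for a table such as 'ts_a_stock_eod_price'."""
    # The prefix is everything before the first underscore.
    prefix, _, rest = table_name.partition("_")
    if not rest or prefix not in DATA_SOURCES:
        raise ValueError(f"Unknown data source prefix in {table_name!r}")
    return DATA_SOURCES[prefix]


print(data_source_of("ts_a_stock_eod_price"))
print(data_source_of("final_a_stock_eod_price"))
```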
The initial date for each stock might be different. When we calculate the adjusted price, we use the first date's price as the reference, with adjust factor = 1.0.
In order to merge different data sources, we need to rescale the adjust factor, so that each data source will have the same adjusted price.
Each data source will have a dedicated link table, which is generated by:
- If the final_a_stock_eod_price already has this stock, adjust_ratio = final_a_stock_eod_price.adjust_price / current_data_source.adjust_price
- If the stock is new to final_a_stock_eod_price, adjust_ratio = 1.
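The two rules above could be sketched as follows. The text does not say which date is used to compare prices, so this sketch assumes the latest date shared by both tables, and it treats a stock with no shared date as new. The function name `build_link_table` and the in-memory dict layout are hypothetical; the real link tables live in the dolt database.

```python
def build_link_table(final_prices, source_prices):
    """Compute adjust_ratio per stock for one data source.

    Both arguments map symbol -> {date: adjust_price}.
    """
    link = {}
    for symbol, prices in source_prices.items():
        final = final_prices.get(symbol, {})
        common_dates = sorted(set(final) & set(prices))
        if common_dates:
            # Assumption: compare on the latest date both tables share.
            date = common_dates[-1]
            link[symbol] = final[date] / prices[date]
        else:
            # Stock is new to the final table (or has no overlap).
            link[symbol] = 1.0
    return link


# Hypothetical adjusted prices.
final_prices = {"SH600000": {"2023-01-03": 20.0, "2023-01-04": 21.0}}
source_prices = {
    "SH600000": {"2023-01-04": 10.5, "2023-01-05": 10.6},
    "SZ000001": {"2023-01-04": 12.0},
}
print(build_link_table(final_prices, source_prices))
```

In this example the existing stock gets a ratio of 21.0 / 10.5, and the stock that is not yet in the final table gets 1.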
Data validation needs to run to verify that the adjust factors match across data sources:
- data_source_1.adjust_ratio * data_source_1.adjust_price = final_a_stock_eod_price.adjust_price
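A minimal sketch of this check for a single stock, assuming prices are held as `{date: adjust_price}` dicts and that a small relative tolerance is acceptable for floating-point rounding. The function name `find_adjust_mismatches` and the tolerance value are illustrative assumptions, not the project's validation code.

```python
import math


def find_adjust_mismatches(final_prices, source_prices, adjust_ratio, rel_tol=1e-6):
    """Return dates where adjust_ratio * source price differs from final.

    final_prices and source_prices map date -> adjust_price for one stock.
    Each mismatch is (date, rescaled_source_price, final_price).
    """
    mismatches = []
    # Only dates present in both tables can be compared.
    for date in sorted(set(final_prices) & set(source_prices)):
        rescaled = adjust_ratio * source_prices[date]
        if not math.isclose(rescaled, final_prices[date], rel_tol=rel_tol):
            mismatches.append((date, rescaled, final_prices[date]))
    return mismatches


# Hypothetical prices: the last date does not line up after rescaling.
final_prices = {"2023-01-03": 20.0, "2023-01-04": 21.0, "2023-01-05": 22.0}
source_prices = {"2023-01-03": 10.0, "2023-01-04": 10.5, "2023-01-05": 12.0}
print(find_adjust_mismatches(final_prices, source_prices, adjust_ratio=2.0))
```

An empty list means the rescaled source agrees with the final table on every shared date.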
To add a new stock index, we need to change:
- Add the index weight download script. Change the tushare/dump_index_eod_price.py script to dump the index info. If the index is not available in Tushare, write a new script and add it to the daily_update.sh script. Example commit
- Add the price download script. Change tushare/dump_index_eod_price.py to add the index price. Example commit
- Modify the export script. Change the qlib dump script qlib/dump_index_weight.py#L13 so that the index is dumped and renamed to a txt file for use. Example commit
Please raise an issue to discuss the plan. Example issue: #11
It should include:
- Why do we want this data?
- How do we do regular update?
- Which data source would we use?
- When should we trigger update?
- How do we validate regular update complete correctly?
- From which data source should we get historical data?
- How do we plan to validate the historical data?
- Is the data source complete? How did we verify this?
- Is the data source accurate? How did we verify this?
- If we see error in validation, how will we deal with them?
- Are we changing an existing table or adding a new table?
If the data is not clean, we may work hard to extract insights from it and end up with incorrect ones. So we want high-quality data, not just data.