Commit b6c3154

add function to automatically update daily frequency data
1 parent a4f6e04 commit b6c3154

6 files changed (+189, -21 lines)

README.md (+22)

@@ -159,6 +159,28 @@ Users could create the same dataset with it.
 *Please pay **ATTENTION** that the data is collected from [Yahoo Finance](https://finance.yahoo.com/lookup), and the data might not be perfect.
 We recommend users to prepare their own data if they have a high-quality dataset. For more information, users can refer to the [related document](https://qlib.readthedocs.io/en/latest/component/data.html#converting-csv-format-into-qlib-format)*.
 
+### Automatic update of daily frequency data(from yahoo finance)
+> It is recommended that users update the data manually once (--trading_date 2021-05-25) and then set it to update automatically.
+
+> For more information refer to: [yahoo collector](https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#Automatic-update-of-daily-frequency-data)
+
+* Automatic update of data to the "qlib" directory each trading day(Linux)
+* use *crontab*: `crontab -e`
+* set up timed tasks:
+
+```
+* * * * 1-5 python <script path> update_data_to_bin --qlib_data_1d_dir <user data dir>
+```
+* **script path**: *qlib/scripts/data_collector/yahoo/collector.py*
+
+* Manual update of data
+```
+python qlib/scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
+```
+* *trading_date*: start of trading day
+* *end_date*: end of trading day(not included)
+
+
 <!--
 - Run the initialization code and get stock data:
 
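After an update lands (via the cron entry or the manual command above), a quick way to confirm that the new trading days are present is to query the updated directory with Qlib itself. This is a minimal sketch adapted from the usage snippets in scripts/data_collector/yahoo/README.md; the provider_uri below is an assumed example path, not something fixed by this commit.

```python
# Minimal sanity check after an update; the data directory is an assumed example path.
import qlib
from qlib.data import D

qlib.init(provider_uri="~/.qlib/qlib_data/cn_data", region="cn")
df = D.features(D.instruments("all"), ["$close"], freq="day")
print(df.tail())  # the most recent rows should now include the newly updated trading days
```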
docs/component/data.rst (+28)

@@ -67,6 +67,34 @@ After running the above command, users can find china-stock and us-stock data in
 
 When ``Qlib`` is initialized with this dataset, users could build and evaluate their own models with it. Please refer to `Initialization <../start/initialization.html>`_ for more details.
 
+Automatic update of daily frequency data
+----------------------------------------
+
+**It is recommended that users update the data manually once (\-\-trading_date 2021-05-25) and then set it to update automatically.**
+
+For more information refer to: `yahoo collector <https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#Automatic-update-of-daily-frequency-data>`_
+
+- Automatic update of data to the "qlib" directory each trading day(Linux)
+- use *crontab*: `crontab -e`
+- set up timed tasks:
+
+.. code-block:: bash
+
+* * * * 1-5 python <script path> update_data_to_bin --qlib_data_1d_dir <user data dir>
+
+- **script path**: *qlib/scripts/data_collector/yahoo/collector.py*
+
+- Manual update of data
+
+.. code-block:: bash
+
+python qlib/scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
+
+- *trading_date*: start of trading day
+- *end_date*: end of trading day(not included)
+
+
+
 Converting CSV Format into Qlib Format
 -------------------------------------------
 
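The trading_date/end_date pair documented above is a half-open interval: end_date itself is excluded. A small sketch of the documented default (end_date one day after trading_date), mirroring the expression used later in collector.py; the concrete date is just an example.

```python
# Default end_date computation as documented: one day after trading_date,
# with the interval treated as half-open (end_date itself is not collected).
import pandas as pd

trading_date = "2021-05-25"  # example date
end_date = (pd.Timestamp(trading_date) + pd.Timedelta(days=1)).strftime("%Y-%m-%d")
print(end_date)  # 2021-05-26 -> only 2021-05-25 is updated
```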
scripts/data_collector/cn_index/collector.py (+1, -1)

@@ -295,7 +295,7 @@ def get_instruments(
 $ python collector.py --index_name CSI300 --qlib_dir ~/.qlib/qlib_data/cn_data --method save_new_companies
 
 """
-_cur_module = importlib.import_module("collector")
+_cur_module = importlib.import_module("data_collector.cn_index.collector")
 obj = getattr(_cur_module, f"{index_name.upper()}")(
 qlib_dir=qlib_dir, index_name=index_name, request_retry=request_retry, retry_sleep=retry_sleep
 )

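The one-line change above (and the matching one in us_index below) switches the dynamic import from a bare module name to the fully qualified package path, so it resolves the same way whether the script is run directly or reached through the data_collector package, as it now is from update_data_to_bin. A rough sketch of the pattern, assuming the repository's scripts directory is on sys.path:

```python
# Sketch of the dynamic-import pattern used by get_instruments; assumes
# qlib/scripts is on sys.path so the data_collector package is importable.
import importlib

index_name = "CSI300"
# Fully qualified path: unambiguous regardless of the current working directory.
_cur_module = importlib.import_module("data_collector.cn_index.collector")
index_cls = getattr(_cur_module, index_name.upper())  # e.g. the CSI300 collector class
```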
scripts/data_collector/us_index/collector.py (+1, -1)

@@ -271,7 +271,7 @@ def get_instruments(
 $ python collector.py --index_name SP500 --qlib_dir ~/.qlib/qlib_data/cn_data --method save_new_companies
 
 """
-_cur_module = importlib.import_module("collector")
+_cur_module = importlib.import_module("data_collector.us_index.collector")
 obj = getattr(_cur_module, f"{index_name.upper()}Index")(
 qlib_dir=qlib_dir, index_name=index_name, request_retry=request_retry, retry_sleep=retry_sleep
 )

scripts/data_collector/yahoo/README.md (+52, -9)

@@ -1,3 +1,19 @@
+
+- [Collector Data](#collector-data)
+- [Automatic update data](#automatic-update-of-daily-frequency-data(from-yahoo-finance))
+- [CN Data](#CN-Data)
+- [1d from yahoo](#1d-from-yahoocn)
+- [1d from qlib](#1d-from-qlibcn)
+- [using data(1d)](#using-data1d-cn)
+- [1min from yahoo](#1min-from-yahoocn)
+- [1min from qlib](#1min-from-qlibcn)
+- [using data(1min)](#using-data1min-cn)
+- [US Data](#CN-Data)
+- [1d from yahoo](#1d-from-yahoous)
+- [1d from qlib](#1d-from-qlibus)
+- [using data(1d)](#using-data1d-us)
+
+
 # Collect Data From Yahoo Finance
 
 > *Please pay **ATTENTION** that the data is collected from [Yahoo Finance](https://finance.yahoo.com/lookup) and the data might not be perfect. We recommend users to prepare their own data if they have high-quality dataset. For more information, users can refer to the [related document](https://qlib.readthedocs.io/en/latest/component/data.html#converting-csv-format-into-qlib-format)*
@@ -18,10 +34,37 @@ pip install -r requirements.txt
 
 ## Collector Data
 
+### Automatic update of daily frequency data(from yahoo finance)
+> It is recommended that users update the data manually once (--trading_date 2021-05-25) and then set it to update automatically.
+
+* Automatic update of data to the "qlib" directory each trading day(Linux)
+* use *crontab*: `crontab -e`
+* set up timed tasks:
+
+```
+* * * * 1-5 python <script path> update_data_to_bin --qlib_data_1d_dir <user data dir>
+```
+* **script path**: *qlib/scripts/data_collector/yahoo/collector.py*
+
+* Manual update of data
+```
+python qlib/scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
+```
+* *trading_date*: start of trading day
+* *end_date*: end of trading day(not included)
+
+* qlib/scripts/data_collector/yahoo/collector.py update_data_to_bin parameters:
+* *source_dir*: The directory where the raw data collected from the Internet is saved, default "Path(__file__).parent/source"
+* *normalize_dir*: Directory for normalize data, default "Path(__file__).parent/normalize"
+* *qlib_data_1d_dir*: the qlib data to be updated for yahoo, usually from: [download qlib data](https://github.com/microsoft/qlib/tree/main/scripts#download-cn-data)
+* *trading_date*: trading days to be updated, by default ``datetime.datetime.now().strftime("%Y-%m-%d")``
+* *end_date*: end datetime, default ``pd.Timestamp(trading_date + pd.Timedelta(days=1))``; open interval(excluding end)
+* *region*: region, value from ["CN", "US"], default "CN"
+
 
 ### CN Data
 
-#### 1d from yahoo
+#### 1d from yahoo(CN)
 
 ```bash
@@ -37,12 +80,12 @@ python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1d_nor --qli
 
 ```
 
-### 1d from qlib
+### 1d from qlib(CN)
 ```bash
 python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1d --region cn
 ```
 
-### using data
+### using data(1d CN)
 
 ```python
 import qlib
@@ -52,7 +95,7 @@ qlib.init(provider_uri="~/.qlib/qlib_data/qlib_cn_1d", region="cn")
 df = D.features(D.instruments("all"), ["$close"], freq="day")
 ```
 
-#### 1min from yahoo
+#### 1min from yahoo(CN)
 
 ```bash
 
@@ -67,12 +110,12 @@ cd qlib/scripts
 python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1min_nor --qlib_dir ~/.qlib/qlib_data/qlib_cn_1min --freq 1min --exclude_fields date,adjclose,dividends,splits,symbol
 ```
 
-### 1min from qlib
+### 1min from qlib(CN)
 ```bash
 python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1min --interval 1min --region cn
 ```
 
-### using data
+### using data(1min CN)
 
 ```python
 import qlib
@@ -85,7 +128,7 @@ df = D.features(D.instruments("all"), ["$close"], freq="1min")
 
 ### US Data
 
-#### 1d from yahoo
+#### 1d from yahoo(US)
 
 ```bash
 
@@ -100,13 +143,13 @@ cd qlib/scripts
 python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/us_1d_nor --qlib_dir ~/.qlib/stock_data/source/qlib_us_1d --freq day --exclude_fields date,adjclose,dividends,splits,symbol
 ```
 
-#### 1d from qlib
+#### 1d from qlib(US)
 
 ```bash
 python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_us_1d --region us
 ```
 
-### using data
+### using data(1d US)
 
 ```python
 # using

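The update_data_to_bin parameters documented above are exposed through the Run class that fire wraps at the bottom of collector.py, so the same update can in principle be driven from Python rather than the CLI. A hedged sketch under that assumption; the import path, the Run(region=...) constructor argument, and the directory/date values are assumptions made for illustration, not something this commit guarantees.

```python
# Hypothetical programmatic equivalent of the documented CLI call; assumes
# qlib/scripts and qlib/scripts/data_collector/yahoo are on sys.path, and that
# Run accepts the documented region/source_dir/normalize_dir arguments.
from collector import Run  # scripts/data_collector/yahoo/collector.py

runner = Run(region="CN")  # source_dir / normalize_dir fall back to their documented defaults
runner.update_data_to_bin(
    qlib_data_1d_dir="~/.qlib/qlib_data/cn_data",  # example path
    trading_date="2021-05-25",                     # example date; end_date defaults to the next day
)
```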
scripts/data_collector/yahoo/collector.py (+85, -10)

@@ -9,7 +9,7 @@
 import importlib
 from abc import ABC
 from pathlib import Path
-from typing import Iterable, Type
+from typing import Iterable
 
 import fire
 import requests
@@ -18,11 +18,15 @@
 from loguru import logger
 from yahooquery import Ticker
 from dateutil.tz import tzlocal
-from qlib.utils import code_to_fname, fname_to_code
+
+from qlib.tests.data import GetData
+from qlib.utils import code_to_fname, fname_to_code, exists_qlib_data
 from qlib.config import REG_CN as REGION_CN
 
 CUR_DIR = Path(__file__).resolve().parent
 sys.path.append(str(CUR_DIR.parent.parent))
+
+from dump_bin import DumpDataUpdate
 from data_collector.base import BaseCollector, BaseNormalize, BaseRun, Normalize
 from data_collector.utils import (
 deco_retry,
@@ -153,7 +157,10 @@ def _get_simple(start_, end_):
 
 _result = None
 if interval == self.INTERVAL_1d:
-_result = _get_simple(start_datetime, end_datetime)
+try:
+_result = _get_simple(start_datetime, end_datetime)
+except ValueError as e:
+pass
 elif interval == self.INTERVAL_1min:
 _res = []
 _start = self.start_datetime
@@ -184,7 +191,7 @@ def download_index_data(self):
 
 class YahooCollectorCN(YahooCollector, ABC):
 def get_instrument_list(self):
-logger.info("get HS stock symbos......")
+logger.info("get HS stock symbols......")
 symbols = get_hs_stock_symbols()
 logger.info(f"get {len(symbols)} symbols.")
 return symbols
@@ -233,9 +240,9 @@ def download_index_data(self):
 
 
 class YahooCollectorCN1min(YahooCollectorCN):
-def download_index_data(self):
-# TODO: 1m
-logger.warning(f"{self.__class__.__name__} {self.interval} does not support: download_index_data")
+def get_instrument_list(self):
+symbols = super(YahooCollectorCN1min, self).get_instrument_list()
+return symbols + ["000300.ss", "000905.ss", "00903.ss"]
 
 
 class YahooCollectorUS(YahooCollector, ABC):
@@ -450,10 +457,12 @@ def normalize(self, df: pd.DataFrame) -> pd.DataFrame:
 _max_date = df.index.max()
 df = df.reindex(self._calendar_list).loc[:_max_date].reset_index()
 df = df[df[self._date_field_name] > _last_date]
+if df.empty:
+return pd.DataFrame()
 _si = df["close"].first_valid_index()
 if _si > df.index[0]:
 logger.warning(
-f"{df.iloc[0][self._symbol_field_name]} missing data: {df.loc[:_si][self._date_field_name]}"
+f"{df.loc[_si][self._symbol_field_name]} missing data: {df.loc[:_si-1][self._date_field_name].to_list()}"
 )
 # normalize
 df = self.normalize_yahoo(
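The logging change in this hunk fixes two small issues: the symbol is now read from the first valid row rather than row 0, and the slice stops at _si - 1 so only the genuinely missing dates are listed, since ``.loc`` slicing includes its end label. A toy illustration, assuming a plain RangeIndex as produced by ``reset_index()``; the dates and values are made up.

```python
# Toy illustration of the inclusive .loc slice behind the logging fix above.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2021-05-20", "2021-05-21", "2021-05-24"],
        "close": [np.nan, np.nan, 10.0],
    }
)
_si = df["close"].first_valid_index()        # 2, the first row that has data
print(df.loc[:_si, "date"].to_list())        # ['2021-05-20', '2021-05-21', '2021-05-24'] -- includes the valid row
print(df.loc[:_si - 1, "date"].to_list())    # ['2021-05-20', '2021-05-21'] -- only the missing dates
```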
@@ -661,7 +670,7 @@ def _get_calendar_list(self) -> Iterable[pd.Timestamp]:
 
 def symbol_to_yahoo(self, symbol):
 if "." not in symbol:
-_exchange = symbol[:2]
+_exchange = symbol[:2].lower()
 _exchange = "ss" if _exchange == "sh" else _exchange
 symbol = symbol[2:] + "." + _exchange
 return symbol
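The added ``.lower()`` makes the exchange-prefix handling case-insensitive before the "sh" to "ss" mapping. A standalone sketch of the resulting behaviour, with the body copied from the diff and lifted out of its class purely for illustration:

```python
# Standalone copy of the fixed logic, for illustration only.
def symbol_to_yahoo(symbol: str) -> str:
    if "." not in symbol:
        _exchange = symbol[:2].lower()
        _exchange = "ss" if _exchange == "sh" else _exchange
        symbol = symbol[2:] + "." + _exchange
    return symbol

print(symbol_to_yahoo("SH600000"))   # 600000.ss (previously "SH" would not match "sh")
print(symbol_to_yahoo("sz000001"))   # 000001.sz
print(symbol_to_yahoo("600000.ss"))  # unchanged: already carries an exchange suffix
```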
@@ -864,7 +873,7 @@ def normalize_data_1d_extend(
 yc.normalize()
 
 def normalize_data_1min_cn_offline(
-self, qlib_data_1d_dir, date_field_name: str = "date", symbol_field_name: str = "symbol"
+self, qlib_data_1d_dir: str, date_field_name: str = "date", symbol_field_name: str = "symbol"
 ):
 """Normalised to 1min using local 1d data
 
942951
limit_nums,
943952
)
944953

954+
def update_data_to_bin(self, qlib_data_1d_dir: str, trading_date: str = None, end_date: str = None):
955+
"""update yahoo data to bin
956+
957+
Parameters
958+
----------
959+
qlib_data_1d_dir: str
960+
the qlib data to be updated for yahoo, usually from: https://github.com/microsoft/qlib/tree/main/scripts#download-cn-data
961+
962+
trading_date: str
963+
trading days to be updated, by default ``datetime.datetime.now().strftime("%Y-%m-%d")``
964+
end_date: str
965+
end datetime, default ``pd.Timestamp(trading_date + pd.Timedelta(days=1))``; open interval(excluding end)
966+
967+
Notes
968+
-----
969+
If the data in qlib_data_dir is incomplete, np.nan will be populated to trading_date for the previous trading day
970+
971+
Examples
972+
-------
973+
$ python collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
974+
# get 1m data
975+
"""
976+
977+
if self.interval.lower() != "1d":
978+
logger.warning(f"currently supports 1d data updates: --interval 1d")
979+
980+
# start/end date
981+
if trading_date is None:
982+
trading_date = datetime.datetime.now().strftime("%Y-%m-%d")
983+
logger.warning(f"trading_date is None, use the current date: {trading_date}")
984+
985+
if end_date is None:
986+
end_date = (pd.Timestamp(trading_date) + pd.Timedelta(days=1)).strftime("%Y-%m-%d")
987+
988+
# download qlib 1d data
989+
qlib_data_1d_dir = Path(qlib_data_1d_dir).expanduser().resolve()
990+
if not exists_qlib_data(qlib_data_1d_dir):
991+
GetData().qlib_data(target_dir=qlib_data_1d_dir, interval=self.interval, region=self.region)
992+
993+
# download data from yahoo
994+
self.download_data(delay=1, start=trading_date, end=end_date, check_data_length=1)
995+
996+
# normalize data
997+
self.normalize_data_1d_extend(str(qlib_data_1d_dir))
998+
999+
# dump bin
1000+
_dump = DumpDataUpdate(
1001+
csv_path=self.normalize_dir,
1002+
qlib_dir=qlib_data_1d_dir,
1003+
exclude_fields="symbol,date",
1004+
max_workers=self.max_workers,
1005+
)
1006+
_dump.dump()
1007+
1008+
# parse index
1009+
_region = self.region.lower()
1010+
if _region not in ["cn", "us"]:
1011+
logger.warning(f"Unsupported region: region={_region}, component downloads will be ignored")
1012+
return
1013+
index_list = ["CSI100", "CSI300"] if _region == "cn" else ["SP500", "NASDAQ100", "DJIA", "SP400"]
1014+
get_instruments = getattr(
1015+
importlib.import_module(f"data_collector.{_region}_index.collector"), "get_instruments"
1016+
)
1017+
for _index in index_list:
1018+
get_instruments(str(qlib_data_1d_dir), _index)
1019+
9451020

9461021
if __name__ == "__main__":
9471022
fire.Fire(Run)
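One detail of update_data_to_bin worth calling out: before applying the daily update it bootstraps the 1d dataset when the target directory is empty, which is why a first-time run needs no separate download step. A condensed sketch of that guard, using only the imports shown in the diff; the target directory and the lowercase region value are example assumptions.

```python
# Condensed sketch of the bootstrap guard inside update_data_to_bin;
# the target directory below is an example.
from pathlib import Path

from qlib.tests.data import GetData
from qlib.utils import exists_qlib_data

qlib_data_1d_dir = Path("~/.qlib/qlib_data/cn_data").expanduser().resolve()
if not exists_qlib_data(qlib_data_1d_dir):
    # first run: fetch the packaged 1d dataset before updating it in place
    GetData().qlib_data(target_dir=qlib_data_1d_dir, interval="1d", region="cn")
```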
