Commit bab50e8

fix YahooNormalize1min && update docs
1 parent 46714ad commit bab50e8

4 files changed: +149 -123 lines changed

README.md (+1 -1)

@@ -162,7 +162,7 @@ We recommend users to prepare their own data if they have a high-quality dataset
 ### Automatic update of daily frequency data(from yahoo finance)
 > It is recommended that users update the data manually once (--trading_date 2021-05-25) and then set it to update automatically.
 
-> For more information refer to: [yahoo collector](https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#Automatic-update-of-daily-frequency-data)
+> For more information refer to: [yahoo collector](https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
 
 
 * Automatic update of data to the "qlib" directory each trading day(Linux)
   * use *crontab*: `crontab -e`
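The crontab step referenced above can be sketched as a concrete entry. This is a hypothetical example: the schedule, script path, and log location are assumptions, not part of this commit; only the `update_data_to_bin` command itself appears in the docs being changed.

```shell
# Open the current user's crontab for editing:
#   crontab -e
#
# Assumed schedule: run at 16:30 on weekdays, after the market close.
# update_data_to_bin appends the latest trading day to the 1d qlib data.
30 16 * * 1-5 python /path/to/qlib/scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir ~/.qlib/qlib_data/qlib_cn_1d >> /tmp/qlib_update.log 2>&1
```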

examples/benchmarks/README.md (+4)

@@ -4,6 +4,10 @@ Here are the results of each benchmark model running on Qlib's `Alpha360` and `A
 
 The numbers shown below demonstrate the performance of the entire `workflow` of each model. We will update the `workflow` as well as models in the near future for better results.
 
+> If you need to reproduce the results below, please use the **v1** dataset: `python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1d --region cn --version v1`
+>
+> In the new version of qlib, the default dataset is **v2**. Since the data is collected from the YahooFinance API (which is not very stable), the results of *v2* and *v1* may differ.
+
 ## Alpha360 dataset
 | Model Name | Dataset | IC | ICIR | Rank IC | Rank ICIR | Annualized Return | Information Ratio | Max Drawdown |
 |---|---|---|---|---|---|---|---|---|

scripts/data_collector/yahoo/README.md (+135 -115)

@@ -1,17 +1,9 @@
 
 - [Collector Data](#collector-data)
-- [Automatic update data](#automatic-update-of-daily-frequency-data(from-yahoo-finance))
-- [CN Data](#CN-Data)
-- [1d from yahoo](#1d-from-yahoocn)
-- [1d from qlib](#1d-from-qlibcn)
-- [using data(1d)](#using-data1d-cn)
-- [1min from yahoo](#1min-from-yahoocn)
-- [1min from qlib](#1min-from-qlibcn)
-- [using data(1min)](#using-data1min-cn)
-- [US Data](#CN-Data)
-- [1d from yahoo](#1d-from-yahoous)
-- [1d from qlib](#1d-from-qlibus)
-- [using data(1d)](#using-data1d-us)
+- [Get Qlib data](#get-qlib-databin-file)
+- [Collector *YahooFinance* data to qlib](#collector-yahoofinance-data-to-qlib)
+- [Automatic update of daily frequency data](#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
+- [Using qlib data](#using-qlib-data)
 
 
 # Collect Data From Yahoo Finance
@@ -34,6 +26,110 @@ pip install -r requirements.txt
 
 ## Collector Data
 
+### Get Qlib data(`bin file`)
+
+> `qlib-data` from *YahooFinance* is data that has already been dumped and can be used directly in `qlib`
+
+- get data: `python scripts/get_data.py qlib_data`
+- parameters:
+  - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data*
+  - `version`: dataset version, value from [`v1`, `v2`], by default `v1`
+    - `v2` end date is *2021-06*, `v1` end date is *2020-09*
+    - users can append data to `v2`: [automatic update of daily frequency data](#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
+    - **the [benchmarks](https://github.com/microsoft/qlib/tree/main/examples/benchmarks) for qlib use `v1`**; *due to YahooFinance's unstable access to historical data, there are some differences between `v2` and `v1`*
+  - `interval`: `1d` or `1min`, by default `1d`
+  - `region`: `cn` or `us`, by default `cn`
+  - `delete_old`: delete existing data from `target_dir` (*features, calendars, instruments, dataset_cache, features_cache*), value from [`True`, `False`], by default `True`
+  - `exists_skip`: if `target_dir` data already exists, skip `get_data`, value from [`True`, `False`], by default `False`
+- examples:
+  ```bash
+  # cn 1d
+  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1d --region cn
+  # cn 1min
+  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1min --region cn --interval 1min
+  # us 1d
+  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_us_1d --region us --interval 1d
+  # us 1min
+  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_us_1min --region us --interval 1min
+  ```
+
+### Collector *YahooFinance* data to qlib
+> collect *YahooFinance* data and *dump* it into `qlib` format
+1. download data to csv: `python scripts/data_collector/yahoo/collector.py download_data`
+
+   - parameters:
+     - `source_dir`: save directory
+     - `interval`: `1d` or `1min`, by default `1d`
+       > **due to the limitation of the *YahooFinance API*, only the last month's data is available in `1min`**
+     - `region`: `CN` or `US`, by default `CN`
+     - `delay`: `time.sleep(delay)`, by default *0.5*
+     - `start`: start datetime, by default *"2000-01-01"*; *closed interval (including start)*
+     - `end`: end datetime, by default `pd.Timestamp(datetime.datetime.now() + pd.Timedelta(days=1))`; *open interval (excluding end)*
+     - `max_workers`: number of symbols fetched concurrently; changing this parameter is not recommended, in order to maintain the integrity of the symbol data, by default *1*
+     - `check_data_length`: check the number of rows per *symbol*, by default `None`
+       > if `len(symbol_df) < check_data_length`, the symbol is re-fetched; the number of re-fetches comes from the `max_collector_count` parameter
+     - `max_collector_count`: number of retries for *"failed"* symbols, by default 2
+   - examples:
+     ```bash
+     # cn 1d data
+     python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1d --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region CN
+     # cn 1min data
+     python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --delay 1 --interval 1min --region CN
+     # us 1d data
+     python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_1d --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region US
+     # us 1min data
+     python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_1min --delay 1 --interval 1min --region US
+     ```
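The re-fetch rule described by `check_data_length` and `max_collector_count` can be sketched in a few lines. This is a simplified illustration, not the collector's actual code; `fetch_symbol` is a hypothetical stand-in for the YahooFinance request.

```python
# Sketch of the re-fetch logic described above; not the collector's actual
# implementation. fetch_symbol is a hypothetical stand-in for the
# YahooFinance request and just returns a list of daily rows here.
def fetch_symbol(symbol, attempt):
    # Pretend the first attempt returns a truncated result.
    return ["row"] * (3 if attempt == 0 else 10)

def collect(symbol, check_data_length=5, max_collector_count=2):
    rows = []
    for attempt in range(max_collector_count + 1):
        rows = fetch_symbol(symbol, attempt)
        # if len(symbol_df) < check_data_length, re-fetch up to
        # max_collector_count more times; otherwise accept the data
        if check_data_length is None or len(rows) >= check_data_length:
            break
    return rows

print(len(collect("SH600000")))  # the short first attempt is retried
```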
+2. normalize data: `python scripts/data_collector/yahoo/collector.py normalize_data`
+
+   - parameters:
+     - `source_dir`: csv directory
+     - `normalize_dir`: result directory
+     - `max_workers`: number of concurrent workers, by default *1*
+     - `interval`: `1d` or `1min`, by default `1d`
+       > if **`interval == 1min`**, `qlib_data_1d_dir` cannot be `None`
+     - `region`: `CN` or `US`, by default `CN`
+     - `date_field_name`: column *name* identifying time in csv files, by default `date`
+     - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
+     - `end_date`: if not `None`, the last date saved after normalization (*including end_date*); if `None`, this parameter is ignored; by default `None`
+     - `qlib_data_1d_dir`: qlib directory (1d data)
+       ```
+       if interval==1min, qlib_data_1d_dir cannot be None; normalizing 1min data needs the 1d data.
+
+       qlib_data_1d can be obtained like this:
+         $ python scripts/get_data.py qlib_data --target_dir <qlib_data_1d_dir> --interval 1d
+         $ python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <qlib_data_1d_dir> --trading_date 2021-06-01
+       or:
+         download 1d data from YahooFinance
+       ```
+   - examples:
+     ```bash
+     # normalize 1d cn
+     python collector.py normalize_data --source_dir ~/.qlib/stock_data/source/cn_1d --normalize_dir ~/.qlib/stock_data/source/cn_1d_nor --region CN --interval 1d
+     # normalize 1min cn
+     python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/qlib_cn_1d --source_dir ~/.qlib/stock_data/source/cn_1min --normalize_dir ~/.qlib/stock_data/source/cn_1min_nor --region CN --interval 1min
+     ```
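One reason 1min normalization needs the 1d data can be illustrated with a toy alignment: the 1d calendar tells the normalizer which trading days a symbol should cover, so gaps in the 1min feed can be detected. This is a simplified pandas sketch under that assumption, not the collector's actual logic; the dates are made up.

```python
import pandas as pd

# Hypothetical 1d trading calendar for a symbol (from qlib_data_1d_dir)
calendar_1d = pd.to_datetime(["2021-06-01", "2021-06-02", "2021-06-03"])

# 1min bars actually downloaded from YahooFinance (one day is missing)
minute_index = pd.date_range("2021-06-01 09:30", periods=5, freq="min").append(
    pd.date_range("2021-06-03 09:30", periods=5, freq="min")
)

# Days the 1min feed should cover but does not
days_present = pd.DatetimeIndex(minute_index.normalize().unique())
missing_days = calendar_1d.difference(days_present)
print(list(missing_days))
```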
+3. dump data: `python scripts/dump_bin.py dump_all`
+
+   - parameters:
+     - `csv_path`: stock data path or directory, **the normalize result (normalize_dir)**
+     - `qlib_dir`: qlib (dump) data directory
+     - `freq`: transaction frequency, by default `day`
+       > `freq_map = {1d: day, 1min: 1min}`
+     - `max_workers`: number of threads, by default *16*
+     - `include_fields`: fields to dump, by default `""`
+     - `exclude_fields`: fields not to dump, by default `""`
+       > `dump_fields = include_fields if include_fields else set(symbol_df.columns) - set(exclude_fields) if exclude_fields else symbol_df.columns`
+     - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
+     - `date_field_name`: column *name* identifying time in csv files, by default `date`
+   - examples:
+     ```bash
+     # dump 1d cn
+     python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1d_nor --qlib_dir ~/.qlib/qlib_data/qlib_cn_1d --freq day --exclude_fields date,symbol
+     # dump 1min cn
+     python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1min_nor --qlib_dir ~/.qlib/qlib_data/qlib_cn_1min --freq 1min --exclude_fields date,symbol
+     ```
+
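The `dump_fields` selection rule quoted above has a simple precedence: `include_fields` wins, otherwise `exclude_fields` is subtracted, otherwise every column is dumped. A minimal sketch of that precedence (not `dump_bin.py`'s actual code):

```python
# Sketch of the field-selection precedence described above; not
# dump_bin.py's actual implementation.
def dump_fields(columns, include_fields=(), exclude_fields=()):
    if include_fields:
        return set(include_fields)          # explicit whitelist wins
    if exclude_fields:
        return set(columns) - set(exclude_fields)  # blacklist next
    return set(columns)                     # otherwise dump everything

cols = ["date", "symbol", "open", "close"]
print(dump_fields(cols, exclude_fields=["date", "symbol"]))
```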
 ### Automatic update of daily frequency data(from yahoo finance)
 > It is recommended that users update the data manually once (--trading_date 2021-05-25) and then set it to update automatically.
 
@@ -62,112 +158,36 @@ pip install -r requirements.txt
 * *region*: region, value from ["CN", "US"], default "CN"
 
-### CN Data
-
-#### 1d from yahoo(CN)
-
-```bash
-# download from yahoo finance
-python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1d --region CN --start 2020-11-01 --end 2020-11-10 --delay 0.1 --interval 1d
-
-# normalize
-python collector.py normalize_data --source_dir ~/.qlib/stock_data/source/cn_1d --normalize_dir ~/.qlib/stock_data/source/cn_1d_nor --region CN --interval 1d
-
-# dump data
-cd qlib/scripts
-python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1d_nor --qlib_dir ~/.qlib/qlib_data/qlib_cn_1d --freq day --exclude_fields date,adjclose,dividends,splits,symbol
-```
-
-### 1d from qlib(CN)
-```bash
-python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1d --region cn
-```
-
-### using data(1d CN)
-
-```python
-import qlib
-from qlib.data import D
-
-qlib.init(provider_uri="~/.qlib/qlib_data/qlib_cn_1d", region="cn")
-df = D.features(D.instruments("all"), ["$close"], freq="day")
-```
-
-#### 1min from yahoo(CN)
-
-```bash
-# download from yahoo finance
-python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --region CN --start 2020-11-01 --end 2020-11-10 --delay 0.1 --interval 1min
-
-# normalize
-python collector.py normalize_data --source_dir ~/.qlib/stock_data/source/cn_1min --normalize_dir ~/.qlib/stock_data/source/cn_1min_nor --region CN --interval 1min
-
-# dump data
-cd qlib/scripts
-python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1min_nor --qlib_dir ~/.qlib/qlib_data/qlib_cn_1min --freq 1min --exclude_fields date,adjclose,dividends,splits,symbol
-```
-
-### 1min from qlib(CN)
-```bash
-python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1min --interval 1min --region cn
-```
-
-### using data(1min CN)
-
-```python
-import qlib
-from qlib.data import D
-
-qlib.init(provider_uri="~/.qlib/qlib_data/qlib_cn_1min", region="cn")
-df = D.features(D.instruments("all"), ["$close"], freq="1min")
-```
-
-### US Data
-
-#### 1d from yahoo(US)
-
-```bash
-# download from yahoo finance
-python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_1d --region US --start 2020-11-01 --end 2020-11-10 --delay 0.1 --interval 1d
-
-# normalize
-python collector.py normalize_data --source_dir ~/.qlib/stock_data/source/us_1d --normalize_dir ~/.qlib/stock_data/source/us_1d_nor --region US --interval 1d
-
-# dump data
-cd qlib/scripts
-python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/us_1d_nor --qlib_dir ~/.qlib/stock_data/source/qlib_us_1d --freq day --exclude_fields date,adjclose,dividends,splits,symbol
-```
-
-#### 1d from qlib(US)
-
-```bash
-python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_us_1d --region us
-```
-
-### using data(1d US)
-
-```python
-# using
-import qlib
-from qlib.data import D
-
-qlib.init(provider_uri="~/.qlib/qlib_data/qlib_us_1d", region="us")
-df = D.features(D.instruments("all"), ["$close"], freq="day")
-```
-
-### Help
-```bash
-pythono collector.py collector_data --help
-```
-
-## Parameters
-
-- interval: 1min or 1d
-- region: CN or US
+## Using qlib data
+
+```python
+import qlib
+from qlib.data import D
+
+# 1d data cn
+# freq=day (the default)
+qlib.init(provider_uri="~/.qlib/qlib_data/qlib_cn_1d", region="cn")
+df = D.features(D.instruments("all"), ["$close"], freq="day")
+
+# 1min data cn
+# freq=1min
+qlib.init(provider_uri="~/.qlib/qlib_data/qlib_cn_1min", region="cn")
+inst = D.list_instruments(D.instruments("all"), freq="1min", as_list=True)
+# get 100 symbols
+df = D.features(inst[:100], ["$close"], freq="1min")
+# get all symbol data
+# df = D.features(D.instruments("all"), ["$close"], freq="1min")
+
+# 1d data us
+qlib.init(provider_uri="~/.qlib/qlib_data/qlib_us_1d", region="us")
+df = D.features(D.instruments("all"), ["$close"], freq="day")
+
+# 1min data us
+qlib.init(provider_uri="~/.qlib/qlib_data/qlib_us_1min", region="us")
+inst = D.list_instruments(D.instruments("all"), freq="1min", as_list=True)
+# get 100 symbols
+df = D.features(inst[:100], ["$close"], freq="1min")
+# get all symbol data
+# df = D.features(D.instruments("all"), ["$close"], freq="1min")
+```

scripts/data_collector/yahoo/collector.py (+9 -7)

@@ -242,7 +242,10 @@ def download_index_data(self):
 class YahooCollectorCN1min(YahooCollectorCN):
     def get_instrument_list(self):
         symbols = super(YahooCollectorCN1min, self).get_instrument_list()
-        return symbols + ["000300.ss", "000905.ss", "00903.ss"]
+        return symbols + ["000300.ss", "000905.ss", "000903.ss"]
+
+    def download_index_data(self):
+        pass
 
 
 class YahooCollectorUS(YahooCollector, ABC):
@@ -461,7 +464,7 @@ def normalize(self, df: pd.DataFrame) -> pd.DataFrame:
         _si = df["close"].first_valid_index()
         if _si > df.index[0]:
             logger.warning(
-                f"{df.loc[_si][self._symbol_field_name]} missing data: {df.loc[:_si-1][self._date_field_name].to_list()}"
+                f"{df.loc[_si][self._symbol_field_name]} missing data: {df.loc[:_si - 1][self._date_field_name].to_list()}"
             )
         # normalize
         df = self.normalize_yahoo(
@@ -524,7 +527,7 @@ def adjusted_price(self, df: pd.DataFrame) -> pd.DataFrame:
         data_1d: pd.DataFrame = self.get_1d_data(symbol, _start, _end)
         data_1d = data_1d.copy()
         if data_1d is None or data_1d.empty:
-            df["factor"] = 1 / df.loc[df["close"].first_valid_index()]
+            df["factor"] = 1 / df.loc[df["close"].first_valid_index()]["close"]
             # TODO: np.nan or 1 or 0
             df["paused"] = np.nan
         else:
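The `factor` fix above is easy to reproduce in isolation: `df.loc[idx]` returns the whole row as a Series, so dividing by it broadcasts across every column instead of yielding the scalar close price. A minimal sketch with a toy frame, independent of the collector:

```python
import pandas as pd

# Toy frame standing in for the 1min dataframe in adjusted_price
df = pd.DataFrame({"close": [10.0, 20.0], "volume": [100, 200]})

idx = df["close"].first_valid_index()

# Buggy form: df.loc[idx] is a whole row (a Series), so 1 / df.loc[idx]
# broadcasts over every column rather than producing a scalar factor
buggy = 1 / df.loc[idx]
print(type(buggy).__name__)   # Series, not a scalar

# Fixed form from the commit: index into the row to get the scalar close
factor = 1 / df.loc[idx]["close"]
print(factor)                 # 0.1
```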
@@ -770,7 +773,7 @@ def default_base_dir(self) -> [Path, str]:
     def download_data(
         self,
         max_collector_count=2,
-        delay=0,
+        delay=0.5,
         start=None,
         end=None,
         check_data_length=None,
@@ -783,7 +786,7 @@ def download_data(
         max_collector_count: int
             default 2
         delay: float
-            time.sleep(delay), default 0
+            time.sleep(delay), default 0.5
         start: str
             start datetime, default "2000-01-01"; closed interval(including start)
         end: str
@@ -844,9 +847,8 @@ def normalize_data(
         """
         if self.interval.lower() == "1min":
             if qlib_data_1d_dir is None or not Path(qlib_data_1d_dir).expanduser().exists():
-                # TODO: add reference url
                 raise ValueError(
-                    "If normalize 1min, the qlib_data_1d_dir parameter must be set: --qlib_data_1d_dir <user qlib 1d data >, Reference: "
+                    "If normalize 1min, the qlib_data_1d_dir parameter must be set: --qlib_data_1d_dir <user qlib 1d data >, Reference: https://github.com/zhupr/qlib/tree/support_extend_data/scripts/data_collector/yahoo#automatic-update-of-daily-frequency-datafrom-yahoo-finance"
                 )
         super(Run, self).normalize_data(
             date_field_name, symbol_field_name, end_date=end_date, qlib_data_1d_dir=qlib_data_1d_dir
