KeyError for training

Hi, I am trying the training example, but I am confused about the format of the `train_list` table (especially the separator).

I downloaded the sample dataset with
```bash
wget https://github.com/wayneweiqiang/PhaseNet/releases/download/test_data/test_data.zip
unzip test_data.zip
```
and tried training
```bash
program_dir=${HOME}/PhaseNet/phasenet/
# training data
data=./npz
train_list=./npz.csv
mode=train
python ${program_dir}/train.py --train_dir=${data} --train_list=${train_list} --plot_figure --epochs=20 --batch_size=10
```
So far so good.

However, when I changed the data list `npz.csv` by keeping only 5 waveforms (modified data list: [npz2.csv](https://github.com/wayneweiqiang/PhaseNet/files/8950022/npz2.csv)) like this

```
	fname	network	station	location_code	p_idx	p_time	p_remark	p_weight	s_idx	s_time	s_remark	s_weight	first_motion	distance_km	emergence_angle	azimuth	latitude	longitude	elevation_m	unit	dt	event_index	channels	snr
1418751	PB.B065..EH.0398802.npz	PB	B065	--	6001	2019-12-16T03:24:49.550	IP	0	6498	2019-12-16T03:24:54.520	ES	2	U	37.5	94.0	295.0	36.7437	-121.4742	643.0	m/s	0.01	398802	EH2,EH1,EHZ	1.01,1.01,1.01
480852	NC.GDXB..HN.0221288.npz	NC	GDXB	--	6001	2008-10-10T16:43:13.420	EP	2	6205	2008-10-10T16:43:15.460	ES	2	U	13.4	114.0	287.0	38.808	-122.7953	939.0	m/s**2	0.01	221288	HNE,HNN,HNZ	1.82,1.76,1.48
1381378	CI.SMM..BH.0392114.npz	CI	SMM	--	6000	2019-07-20T01:28:41.210	IP	0	6481	2019-07-20T01:28:46.020	ES	2	U	32.9	96.0	135.0	35.3142	-119.9958	599.0	m/s	0.01	392114	BHE,BHN,BHZ	1.21,1.08,1.24
575947	NC.BJOB..HN.0244462.npz	NC	BJOB	--	6001	2010-05-25T09:28:40.370	IP	1	6144	2010-05-25T09:28:41.800	ES	2	U	8.3	127.0	248.0	36.6109	-121.3147	1052.0	m/s**2	0.01	244462	HNE,HNN,HNZ	104.15,83.56,40.85
51937	NN.SCH.N1.EH.0058663.npz	NN	SCH	N1	6000	1987-06-11T08:27:39.340	EP	3	6412	1987-06-11T08:27:43.450	S	1	D	32.2	94.0	219.0	37.3591	-118.6889	2346.0	m/s	0.01	58663	EHZ	1.00,1.00,1.02
```
and ran the training again, a KeyError was thrown:
```
...
  File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 45, in index_to_entry
    return iterator[idx]

  File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 588, in __getitem__
    base_name = self.data_list[i]
...
KeyError: 3
```

It seems that this error is caused by the `train_list` file (`npz2.csv`) being read incorrectly. Lines 182-185 of `data_reader.py` read the data list like this:
```python
try:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")
except:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")
```
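To illustrate what the first separator does, here is a minimal sketch using only the standard `re` module. Inside square brackets, `[,|\s+]` is a character class: it matches a single comma, pipe, whitespace character, or literal plus sign, so it also splits on the commas inside the `channels` and `snr` fields. The three lines below are hypothetical, shortened stand-ins for rows of `npz.csv`, and `count_fields` is a helper introduced only for this example.

```python
import re

# Separator pattern copied from the read_csv call above. Inside [...] it is
# a character class: ONE character that is a comma, a pipe, any whitespace,
# or a literal plus sign.
SEP = r"[,|\s+]"

# Hypothetical, shortened lines in the same layout as npz.csv:
# tab-separated, with comma-separated values inside some fields.
header = "\tfname\tchannels\tsnr"
row_three_channels = "1\ta.npz\tEH2,EH1,EHZ\t1.01,1.01,1.01"
row_one_channel = "2\tb.npz\tEHZ\t1.00,1.00,1.02"


def count_fields(line, pattern):
    """Number of fields produced by splitting `line` on `pattern`."""
    return len(re.split(pattern, line))


for name, line in [
    ("header", header),
    ("three channels", row_three_channels),
    ("one channel", row_one_channel),
]:
    # Print the field count with the regex separator and with a plain tab.
    print(name, count_fields(line, SEP), count_fields(line, r"\t"))
```

With a plain tab every line has 4 fields, while the regex separator gives 4, 8, and 6. The data rows therefore have more fields than the header. Whether pandas raises an exception in that situation (and so reaches the `sep="\t"` fallback) presumably depends on the rows in the file, which may be why two files with the same layout end up being read differently.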

In the original example, the data list ([npz.csv](https://github.com/wayneweiqiang/PhaseNet/files/8950205/npz.csv)) was read by
```python
csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")
```
However, after I deleted some lines in `npz.csv`, the modified data list ([npz2.csv](https://github.com/wayneweiqiang/PhaseNet/files/8950236/npz2.csv)) was read by
```python
csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")
```
In this case, the resulting DataFrame `csv` is wrong:
```
                                             fname                  network station  location_code  p_idx  ...  unit    dt  event_index channels    snr
1418751 PB.B065..EH.0398802.npz  PB B065 --   6001  2019-12-16T03:24:49.550      IP              0   6498  ...   EH1   EHZ         1.01     1.01   1.01
480852  NC.GDXB..HN.0221288.npz  NC GDXB --   6001  2008-10-10T16:43:13.420      EP              2   6205  ...   HNN   HNZ         1.82     1.76   1.48
1381378 CI.SMM..BH.0392114.npz   CI SMM  --   6000  2019-07-20T01:28:41.210      IP              0   6481  ...   BHN   BHZ         1.21     1.08   1.24
575947  NC.BJOB..HN.0244462.npz  NC BJOB --   6001  2010-05-25T09:28:40.370      IP              1   6144  ...   HNN   HNZ       104.15    83.56  40.85
51937   NN.SCH.N1.EH.0058663.npz NN SCH  N1   6000  1987-06-11T08:27:39.340      EP              3   6412  ...  1.00  1.00         1.02      NaN    NaN
```
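For comparison, a minimal sketch of reading the same layout with the tab separator forced. The sample text is a hypothetical, shortened stand-in for `npz2.csv` (only three of the columns), and `index_col=0` assumes the unnamed first column is meant to be the row index; this is not necessarily how `data_reader.py` should be changed.

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for npz2.csv: tab-separated fields,
# an unnamed first column, and commas inside `channels` and `snr`.
sample = (
    "\tfname\tchannels\tsnr\n"
    "1418751\tPB.B065..EH.0398802.npz\tEH2,EH1,EHZ\t1.01,1.01,1.01\n"
    "51937\tNN.SCH.N1.EH.0058663.npz\tEHZ\t1.00,1.00,1.02\n"
)

# Split on tabs only, and use the unnamed first column as the index
# (an assumption made for this sketch).
df = pd.read_csv(io.StringIO(sample), header=0, sep="\t", index_col=0)

print(df["fname"].tolist())
print(df["channels"].tolist())
```

Here `fname` holds the file names and the comma-separated `channels` and `snr` values stay inside their own columns, which is the layout the original `npz.csv` produced.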

I don't know why the modified data list file was read differently. I can't find any difference between the formats of [npz.csv](https://github.com/wayneweiqiang/PhaseNet/files/8950267/npz.csv) and [npz2.csv](https://github.com/wayneweiqiang/PhaseNet/files/8950269/npz2.csv). `npz2.csv` was created from `npz.csv`, and its fields are separated by tabs.
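One way to confirm that a list file really is consistently tab-separated is to count the fields on every row. The sketch below uses the standard `csv` module; `check_tab_separated` is a hypothetical helper written for this check and is not part of PhaseNet.

```python
import csv


def check_tab_separated(path, delimiter="\t"):
    """Return (n_header_fields, bad_rows) for a delimited text file.

    bad_rows lists (line_number, n_fields) for every non-empty row whose
    field count differs from the header's.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        bad_rows = [
            (reader.line_num, len(row))
            for row in reader
            if row and len(row) != len(header)  # skip blank lines
        ]
    return len(header), bad_rows
```

Calling `check_tab_separated("npz.csv")` and `check_tab_separated("npz2.csv")` and comparing the results should show whether both files have the same number of tab-separated fields on every row.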

In addition, the format of "dataset/waveform.csv" is quite different from that of "test_data/npz.csv".
