KeyError for training #51

Open
@zhong-yy

Hi, I am trying the training example, but I am confused about the format of the train_list table (especially the separator).

I downloaded the sample dataset with

wget https://github.com/wayneweiqiang/PhaseNet/releases/download/test_data/test_data.zip
unzip test_data.zip

and tried training

program_dir=${HOME}/PhaseNet/phasenet/
# training data
data=./npz
train_list=./npz.csv
mode=train
python ${program_dir}/train.py --train_dir=${data} --train_list=${train_list} --plot_figure --epochs=20 --batch_size=10

So far so good.

However, when I reduced the data list npz.csv to only 5 waveforms (modified data list: npz2.csv), like this

	fname	network	station	location_code	p_idx	p_time	p_remark	p_weight	s_idx	s_time	s_remark	s_weight	first_motion	distance_km	emergence_angle	azimuth	latitude	longitude	elevation_m	unit	dt	event_index	channels	snr
1418751	PB.B065..EH.0398802.npz	PB	B065	--	6001	2019-12-16T03:24:49.550	IP	0	6498	2019-12-16T03:24:54.520	ES	2	U	37.5	94.0	295.0	36.7437	-121.4742	643.0	m/s	0.01	398802	EH2,EH1,EHZ	1.01,1.01,1.01
480852	NC.GDXB..HN.0221288.npz	NC	GDXB	--	6001	2008-10-10T16:43:13.420	EP	2	6205	2008-10-10T16:43:15.460	ES	2	U	13.4	114.0	287.0	38.808	-122.7953	939.0	m/s**2	0.01	221288	HNE,HNN,HNZ	1.82,1.76,1.48
1381378	CI.SMM..BH.0392114.npz	CI	SMM	--	6000	2019-07-20T01:28:41.210	IP	0	6481	2019-07-20T01:28:46.020	ES	2	U	32.9	96.0	135.0	35.3142	-119.9958	599.0	m/s	0.01	392114	BHE,BHN,BHZ	1.21,1.08,1.24
575947	NC.BJOB..HN.0244462.npz	NC	BJOB	--	6001	2010-05-25T09:28:40.370	IP	1	6144	2010-05-25T09:28:41.800	ES	2	U	8.3	127.0	248.0	36.6109	-121.3147	1052.0	m/s**2	0.01	244462	HNE,HNN,HNZ	104.15,83.56,40.85
51937	NN.SCH.N1.EH.0058663.npz	NN	SCH	N1	6000	1987-06-11T08:27:39.340	EP	3	6412	1987-06-11T08:27:43.450	S	1	D	32.2	94.0	219.0	37.3591	-118.6889	2346.0	m/s	0.01	58663	EHZ	1.00,1.00,1.02

and running the training again, a KeyError was thrown:

...
  File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 45, in index_to_entry
    return iterator[idx]

  File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 588, in __getitem__
    base_name = self.data_list[i]
...
KeyError: 3

It seems that this error is caused by the train_list file (npz2.csv) being read incorrectly. In lines 182-185 of data_reader.py, the code reads the data list like this:

try:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")
except:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")

In the original example, the data list (npz.csv) was read by

csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")
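For illustration, here is a minimal sketch of that tab-separated read. The two-row table is a hypothetical, shortened stand-in for npz.csv (only the fname, channels, and snr columns are kept), and it is read from an in-memory string instead of a file. It shows that with sep="\t" the comma-joined values stay inside a single field.

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for npz.csv: tab-separated, with an
# unnamed first column and comma-joined values in channels and snr.
text = (
    "\tfname\tchannels\tsnr\n"
    "1418751\tPB.B065..EH.0398802.npz\tEH2,EH1,EHZ\t1.01,1.01,1.01\n"
    "51937\tNN.SCH.N1.EH.0058663.npz\tEHZ\t1.00,1.00,1.02\n"
)

# Same arguments as the tab-separated call quoted above.
df = pd.read_csv(io.StringIO(text), header=0, sep="\t")

# The unnamed first column is labelled "Unnamed: 0" by pandas.
print(list(df.columns))
# Each channels entry is kept as one string, commas included.
print(df["channels"].tolist())
```

Because only tabs act as separators here, every row has the same number of fields as the header.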

However, after I deleted some lines in npz.csv, the modified data list (npz2.csv) was read by

csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")

Read this way, the resulting DataFrame csv is wrong:

                                             fname                  network station  location_code  p_idx  ...  unit    dt  event_index channels    snr
1418751 PB.B065..EH.0398802.npz  PB B065 --   6001  2019-12-16T03:24:49.550      IP              0   6498  ...   EH1   EHZ         1.01     1.01   1.01
480852  NC.GDXB..HN.0221288.npz  NC GDXB --   6001  2008-10-10T16:43:13.420      EP              2   6205  ...   HNN   HNZ         1.82     1.76   1.48
1381378 CI.SMM..BH.0392114.npz   CI SMM  --   6000  2019-07-20T01:28:41.210      IP              0   6481  ...   BHN   BHZ         1.21     1.08   1.24
575947  NC.BJOB..HN.0244462.npz  NC BJOB --   6001  2010-05-25T09:28:40.370      IP              1   6144  ...   HNN   HNZ       104.15    83.56  40.85
51937   NN.SCH.N1.EH.0058663.npz NN SCH  N1   6000  1987-06-11T08:27:39.340      EP              3   6412  ...  1.00  1.00         1.02      NaN    NaN

I don't know why the modified data list file was read in a different way. I can't find any difference between the formats of npz.csv and npz2.csv. npz2.csv was created from npz.csv, and its fields are separated by tabs.
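One possible factor, offered as a hedged sketch and not a confirmed diagnosis: the pattern [,|\s+] is a regular-expression character class, so it matches any single comma, pipe, whitespace character, or plus sign. The comma-joined values in the channels and snr columns would then be split into separate fields, and a row with one channel would end up with fewer fields than a row with three. The rows below are hypothetical shortened versions of the ones in npz2.csv, and split_fields is an illustrative helper that is not part of PhaseNet. re.split is used only to show what the pattern matches; it does not reproduce the pandas parser exactly.

```python
import re

# The separator pattern quoted from data_reader.py in this issue.
SEP = r"[,|\s+]"

# Hypothetical shortened rows: tab-separated, with comma-joined values
# in the last two fields (mirroring the channels and snr columns).
row_three = "1418751\tPB.B065..EH.0398802.npz\tEH2,EH1,EHZ\t1.01,1.01,1.01"
row_one = "51937\tNN.SCH.N1.EH.0058663.npz\tEHZ\t1.00,1.00,1.02"


def split_fields(line, sep=SEP):
    """Split a line on every single character matched by the pattern."""
    return re.split(sep, line)


# Splitting on tabs only gives the same field count for both rows.
print(len(row_three.split("\t")), len(row_one.split("\t")))

# The character class also splits on commas, so the counts diverge.
print(len(split_fields(row_three)), len(split_fields(row_one)))
print(split_fields(row_three))
```

If this is what happens, reading tab-separated lists with sep="\t" only would keep the comma-joined values in one field each.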

In addition, the format of "dataset/waveform.csv" is quite different from that of "test_data/npz.csv".
