KeyError for training #51

Open
zhong-yy opened this issue Jun 21, 2022 · 2 comments

@zhong-yy
zhong-yy commented Jun 21, 2022

Hi, I am trying the training example, but I am very confused about the format of the train_list table (especially the separator).

I downloaded the sample dataset from

wget https://github.com/wayneweiqiang/PhaseNet/releases/download/test_data/test_data.zip
unzip test_data.zip

and tried training

program_dir=${HOME}/PhaseNet/phasenet/
# training data
data=./npz
train_list=./npz.csv
mode=train
python ${program_dir}/train.py --train_dir=${data} --train_list=${train_list} --plot_figure --epochs=20 --batch_size=10

So far so good.

However, when I trimmed the data list npz.csv down to only 5 waveforms (modified data list: npz2.csv), like this

	fname	network	station	location_code	p_idx	p_time	p_remark	p_weight	s_idx	s_time	s_remark	s_weight	first_motion	distance_km	emergence_angle	azimuth	latitude	longitude	elevation_m	unit	dt	event_index	channels	snr
1418751	PB.B065..EH.0398802.npz	PB	B065	--	6001	2019-12-16T03:24:49.550	IP	0	6498	2019-12-16T03:24:54.520	ES	2	U	37.5	94.0	295.0	36.7437	-121.4742	643.0	m/s	0.01	398802	EH2,EH1,EHZ	1.01,1.01,1.01
480852	NC.GDXB..HN.0221288.npz	NC	GDXB	--	6001	2008-10-10T16:43:13.420	EP	2	6205	2008-10-10T16:43:15.460	ES	2	U	13.4	114.0	287.0	38.808	-122.7953	939.0	m/s**2	0.01	221288	HNE,HNN,HNZ	1.82,1.76,1.48
1381378	CI.SMM..BH.0392114.npz	CI	SMM	--	6000	2019-07-20T01:28:41.210	IP	0	6481	2019-07-20T01:28:46.020	ES	2	U	32.9	96.0	135.0	35.3142	-119.9958	599.0	m/s	0.01	392114	BHE,BHN,BHZ	1.21,1.08,1.24
575947	NC.BJOB..HN.0244462.npz	NC	BJOB	--	6001	2010-05-25T09:28:40.370	IP	1	6144	2010-05-25T09:28:41.800	ES	2	U	8.3	127.0	248.0	36.6109	-121.3147	1052.0	m/s**2	0.01	244462	HNE,HNN,HNZ	104.15,83.56,40.85
51937	NN.SCH.N1.EH.0058663.npz	NN	SCH	N1	6000	1987-06-11T08:27:39.340	EP	3	6412	1987-06-11T08:27:43.450	S	1	D	32.2	94.0	219.0	37.3591	-118.6889	2346.0	m/s	0.01	58663	EHZ	1.00,1.00,1.02

and running the training again, a KeyError was thrown:

...
  File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 45, in index_to_entry
    return iterator[idx]

  File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 588, in __getitem__
    base_name = self.data_list[i]
...
KeyError: 3

It seems that this error is caused by the train_list file (npz2.csv) being read incorrectly. In lines 182-185 of data_reader.py, the code reads the data list like this:

try:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")
except:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")

In the original example, the data list (npz.csv) was read by

csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")

However, after I deleted some lines in npz.csv, the modified data list (npz2.csv) was read by

csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")

Read this way, the resulting DataFrame variable csv is wrong:

                                             fname                  network station  location_code  p_idx  ...  unit    dt  event_index channels    snr
1418751 PB.B065..EH.0398802.npz  PB B065 --   6001  2019-12-16T03:24:49.550      IP              0   6498  ...   EH1   EHZ         1.01     1.01   1.01
480852  NC.GDXB..HN.0221288.npz  NC GDXB --   6001  2008-10-10T16:43:13.420      EP              2   6205  ...   HNN   HNZ         1.82     1.76   1.48
1381378 CI.SMM..BH.0392114.npz   CI SMM  --   6000  2019-07-20T01:28:41.210      IP              0   6481  ...   BHN   BHZ         1.21     1.08   1.24
575947  NC.BJOB..HN.0244462.npz  NC BJOB --   6001  2010-05-25T09:28:40.370      IP              1   6144  ...   HNN   HNZ       104.15    83.56  40.85
51937   NN.SCH.N1.EH.0058663.npz NN SCH  N1   6000  1987-06-11T08:27:39.340      EP              3   6412  ...  1.00  1.00         1.02      NaN    NaN

I don't know why the modified data list file was read in a different way. I can't find any difference between the formats of npz.csv and npz2.csv. npz2.csv was created from npz.csv, and its data items are separated by tabs.
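
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the pattern [,|\s+] the square brackets form a character class, so it matches any single comma, pipe, whitespace character, or plus sign. The comma-joined channels and snr values would therefore be split as well. The snippet below uses only Python's re module to count fields for the header and the first row of npz2.csv as quoted above. That pandas' python engine splits each stripped line with this regex is an assumption about pandas internals, not something shown in this thread.

```python
import re

# Separator pattern quoted from data_reader.py. Inside [...] it is a character
# class: one comma, pipe, whitespace character, or plus sign per match.
SEP = re.compile(r"[,|\s+]")

COLUMNS = [
    "fname", "network", "station", "location_code", "p_idx", "p_time",
    "p_remark", "p_weight", "s_idx", "s_time", "s_remark", "s_weight",
    "first_motion", "distance_km", "emergence_angle", "azimuth", "latitude",
    "longitude", "elevation_m", "unit", "dt", "event_index", "channels", "snr",
]

# Header line of npz2.csv: a leading tab (unnamed index column), then the names.
header = "\t" + "\t".join(COLUMNS)

# First data row of npz2.csv, tab-separated as in the issue.
row = "\t".join([
    "1418751", "PB.B065..EH.0398802.npz", "PB", "B065", "--", "6001",
    "2019-12-16T03:24:49.550", "IP", "0", "6498", "2019-12-16T03:24:54.520",
    "ES", "2", "U", "37.5", "94.0", "295.0", "36.7437", "-121.4742", "643.0",
    "m/s", "0.01", "398802", "EH2,EH1,EHZ", "1.01,1.01,1.01",
])

# Assumption: the regex reader strips each line before splitting it.
header_fields = SEP.split(header.strip())
row_fields = SEP.split(row.strip())
tab_fields = row.split("\t")

print(len(header_fields))  # 24 column names
print(len(row_fields))     # 29 fields: the two comma-joined values become six
print(len(tab_fields))     # 25 fields: one index value plus 24 columns
```

With the regex, each row has 5 more fields than the header. That is consistent with the printout above, where the first five values of every row appear on the index side and the fname column holds 6001 instead of a file name. Why the original npz.csv ends up in the tab-separated fallback is not shown here; one guess is that some row in it makes the first read_csv call raise.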

In addition, the format of "dataset/waveform.csv" is quite different from that of "test_data/npz.csv".

@wayneweiqiang
Collaborator

Hi, thanks for reporting the issue. I have used two different CSV formats in the past, which is what causes the confusion. You can keep only the line "csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")" to read the CSV file for training. Let me know if you run into other problems.

@zhong-yy
Author

Thank you. So is the line

csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")

useless in the current version? I would like to make sure that deleting this line won't cause unexpected problems.
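
Independent of that answer, one way to check whether a given list is safe to read with sep="\t" alone is to confirm that every row has the same number of tab-separated fields as the header. The sketch below uses Python's csv module; check_tab_separated is a hypothetical helper, not part of PhaseNet, and the sample text is a reduced column subset of npz2.csv used only for illustration.

```python
import csv
import io


def check_tab_separated(text):
    """Return (number of header fields, line numbers whose field count differs).

    Hypothetical helper for illustration; not part of PhaseNet.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    bad_lines = [
        line_no
        for line_no, fields in enumerate(reader, start=2)
        if fields and len(fields) != len(header)  # ignore blank lines
    ]
    return len(header), bad_lines


# Reduced sample: index column, fname, channels, snr (tab-separated).
sample = (
    "\tfname\tchannels\tsnr\n"
    "1418751\tPB.B065..EH.0398802.npz\tEH2,EH1,EHZ\t1.01,1.01,1.01\n"
    "51937\tNN.SCH.N1.EH.0058663.npz\tEHZ\t1.00,1.00,1.02\n"
)

n_fields, bad_lines = check_tab_separated(sample)
print(n_fields, bad_lines)  # 4 []
```

An empty list of bad lines means the commas inside channels and snr stay within their fields when only tabs are treated as separators.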
