Description
Hi, I am trying the training example, but I am confused about the format of the train_list
table (especially the separator).
I downloaded the sample dataset with
wget https://github.com/wayneweiqiang/PhaseNet/releases/download/test_data/test_data.zip
unzip test_data.zip
and tried training
program_dir=${HOME}/PhaseNet/phasenet/
# training data
data=./npz
train_list=./npz.csv
mode=train
python ${program_dir}/train.py --train_dir=${data} --train_list=${train_list} --plot_figure --epochs=20 --batch_size=10
So far so good.
However, when I reduced the data list npz.csv to only 5 waveforms (modified data list: npz2.csv), like this
fname network station location_code p_idx p_time p_remark p_weight s_idx s_time s_remark s_weight first_motion distance_km emergence_angle azimuth latitude longitude elevation_m unit dt event_index channels snr
1418751 PB.B065..EH.0398802.npz PB B065 -- 6001 2019-12-16T03:24:49.550 IP 0 6498 2019-12-16T03:24:54.520 ES 2 U 37.5 94.0 295.0 36.7437 -121.4742 643.0 m/s 0.01 398802 EH2,EH1,EHZ 1.01,1.01,1.01
480852 NC.GDXB..HN.0221288.npz NC GDXB -- 6001 2008-10-10T16:43:13.420 EP 2 6205 2008-10-10T16:43:15.460 ES 2 U 13.4 114.0 287.0 38.808 -122.7953 939.0 m/s**2 0.01 221288 HNE,HNN,HNZ 1.82,1.76,1.48
1381378 CI.SMM..BH.0392114.npz CI SMM -- 6000 2019-07-20T01:28:41.210 IP 0 6481 2019-07-20T01:28:46.020 ES 2 U 32.9 96.0 135.0 35.3142 -119.9958 599.0 m/s 0.01 392114 BHE,BHN,BHZ 1.21,1.08,1.24
575947 NC.BJOB..HN.0244462.npz NC BJOB -- 6001 2010-05-25T09:28:40.370 IP 1 6144 2010-05-25T09:28:41.800 ES 2 U 8.3 127.0 248.0 36.6109 -121.3147 1052.0 m/s**2 0.01 244462 HNE,HNN,HNZ 104.15,83.56,40.85
51937 NN.SCH.N1.EH.0058663.npz NN SCH N1 6000 1987-06-11T08:27:39.340 EP 3 6412 1987-06-11T08:27:43.450 S 1 D 32.2 94.0 219.0 37.3591 -118.6889 2346.0 m/s 0.01 58663 EHZ 1.00,1.00,1.02
and ran the training again, a KeyError was thrown:
...
File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 45, in index_to_entry
return iterator[idx]
File "/home/yyzhong/PhaseNet/phasenet/data_reader.py", line 588, in __getitem__
base_name = self.data_list[i]
...
KeyError: 3
It seems that this error is caused by the train_list file (npz2.csv) being read incorrectly. In lines 182-185 of data_reader.py, the code reads the data list like this:
try:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")
except:
    csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")
In the original example, the data list (npz.csv) was read by
csv = pd.read_csv(kwargs["data_list"], header=0, sep="\t")
However, after I deleted some lines in npz.csv, the modified data list (npz2.csv) was read by
csv = pd.read_csv(kwargs["data_list"], header=0, sep='[,|\s+]', engine="python")
As a result, the DataFrame variable csv is wrong:
fname network station location_code p_idx ... unit dt event_index channels snr
1418751 PB.B065..EH.0398802.npz PB B065 -- 6001 2019-12-16T03:24:49.550 IP 0 6498 ... EH1 EHZ 1.01 1.01 1.01
480852 NC.GDXB..HN.0221288.npz NC GDXB -- 6001 2008-10-10T16:43:13.420 EP 2 6205 ... HNN HNZ 1.82 1.76 1.48
1381378 CI.SMM..BH.0392114.npz CI SMM -- 6000 2019-07-20T01:28:41.210 IP 0 6481 ... BHN BHZ 1.21 1.08 1.24
575947 NC.BJOB..HN.0244462.npz NC BJOB -- 6001 2010-05-25T09:28:40.370 IP 1 6144 ... HNN HNZ 104.15 83.56 40.85
51937 NN.SCH.N1.EH.0058663.npz NN SCH N1 6000 1987-06-11T08:27:39.340 EP 3 6412 ... 1.00 1.00 1.02 NaN NaN
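One detail that may be relevant: in the pattern '[,|\s+]' the square brackets form a regex character class, so it matches a single comma, pipe, whitespace character, or literal plus sign, not "a comma, a pipe, or a run of whitespace". That would split the comma-separated channels and snr values into extra fields, which is consistent with the DataFrame above. Here is a minimal sketch using only Python's re module. The two rows are shortened stand-ins for lines of npz2.csv, and count_fields is a hypothetical helper, not part of PhaseNet:

```python
import re

# Separator pattern quoted from data_reader.py. The square brackets make it a
# character class: one comma, pipe, whitespace character, or literal plus sign.
PATTERN = r"[,|\s+]"


def count_fields(line, pattern=PATTERN):
    """Return (field count when split on tabs, field count when split on the regex)."""
    line = line.rstrip("\n")
    return len(line.split("\t")), len(re.split(pattern, line))


# Shortened, tab-separated rows modelled on npz2.csv (most columns omitted).
three_channel = "1418751\tPB.B065..EH.0398802.npz\tEH2,EH1,EHZ\t1.01,1.01,1.01"
one_channel = "51937\tNN.SCH.N1.EH.0058663.npz\tEHZ\t1.00,1.00,1.02"

print(count_fields(three_channel))  # (4, 8)
print(count_fields(one_channel))    # (4, 6)
```

The regex split yields more fields than the tab split, and a different number for the single-channel row than for the three-channel row, which matches the NaN padding in the last row above. This only illustrates the splitting; it does not by itself show why the full npz.csv ends up in the except branch.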
I don't know why the modified data list file was read in a different way. I can't find any difference between the formats of npz.csv and npz2.csv. npz2.csv was created from npz.csv, and the data items are separated by tabs.
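To double-check that both files really are tab-separated with a consistent number of fields per line, something like the following could be used. It is a rough sketch with only the standard library; tab_field_counts is a hypothetical helper name, and UTF-8 encoding is an assumption:

```python
from collections import Counter


def tab_field_counts(path):
    """Count how many non-blank lines have each number of tab-separated fields."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:  # skip blank lines
                counts[len(line.split("\t"))] += 1
    return counts
```

Calling tab_field_counts("npz.csv") and tab_field_counts("npz2.csv") and comparing the results would show whether every line has the same number of tab-separated fields in both files; a Counter with a single key means the field count is uniform.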
In addition, the format of "dataset/waveform.csv" is quite different from that of "test_data/npz.csv".