Poster outline including results

Glenn Thompson edited this page Nov 4, 2021 · 3 revisions

Introduction - previous work

  • Discuss Langer et al. (2003, 2006)
  • Malfante et al. (2018) classified 109,609 transients at Ubinas volcano with 93.5% accuracy.
  • Falcin et al. (2021) achieved 72% accuracy by applying AAA to 845 events labelled by OVSG for 2013-2018 (542 VT, 217 nested, 86 LP). Adding hybrid and tornillo classes (after Moretti et al., 2020) raised accuracy to 84%; it could have reached 86-93% but was dragged down by the hybrid class (64%). The detection rate was twice that of the STA/LTA method used by OVSG.

Original Dataset

  • Events: 217,290 transients detected on the MVO digital seismic network between 1996/10/21 and 2008/10/16.
  • Labels: Five main labels used by MVO for local volcano-seismic events: rockfall (ROC: 58%), hybrid (HYB: 19%), long-period (LPE: 11%), lp-rockfall (LP-ROC: 5.8%), and volcano-tectonic (VT: 3.1%). Minor classes include regional (REG: 1.4%) and gages (or St. Georges?; GAG: 0.7%).
  • Total traces: 4,072,590
  • Traces per event: Average is 18.7
  • I separately counted 235,804 S-files on hal between 1996/10/21 and 2008/10/16; 191,592 of these fall on or before 2004/02/16 (an end date unlikely to cause problems with SRC/MVO). Cumulative totals: cum_N_sfiles = 235,804, cum_N_DSN_wavfiles = 217,290, cum_N_DSN_traces = 4,072,590, cum_N_ASN_wavfiles = 42,110, cum_N_ASN_traces = 439,569.

Methods 1: From the original dataset to our dataset

  • M1.1: We computed a range of data-quality, statistical, and physical metrics on each waveform trace of each transient.
  • M1.2: Waveform QC: we automatically removed waveforms with dropouts (and list other problems we look for here).
  • M1.3: We manually reviewed/reclassified transient classifications until we had approximately 100 transients of each class (total 522).
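The dropout QC in M1.2 could be sketched as a check for long runs of identical consecutive samples (a minimal sketch; the `min_run` threshold and the flat-run criterion are illustrative assumptions, not the actual MVO QC rules):

```python
import numpy as np

def has_dropout(trace, min_run=20):
    """Flag a waveform containing a dropout: a run of `min_run` or more
    identical consecutive samples (typically zeros or a stuck value).
    `min_run` is a hypothetical threshold chosen for illustration."""
    trace = np.asarray(trace, dtype=float)
    same = np.diff(trace) == 0   # True where consecutive samples are equal
    run = 0
    longest = 0
    for s in same:
        run = run + 1 if s else 0
        longest = max(longest, run)
    # a run of k equal diffs means k+1 identical samples
    return longest + 1 >= min_run

# Example: a clean noisy trace vs. one with a 50-sample dropout
rng = np.random.default_rng(0)
clean = rng.normal(size=1000)
dropped = clean.copy()
dropped[400:450] = 0.0
print(has_dropout(clean), has_dropout(dropped))  # False True
```

In practice other problems (spikes, clipping, calibration pulses) would each get their own test, applied per trace before the metrics of M1.1 are trusted.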

Results 1

  • R1.1: ~21% of these transients were incorrectly classified at MVO.

Methods 2: Supervised ML

  • M2.1: The code [http://github.com/malfante/AAA] transforms each waveform into a set of 102 features: 34 features for each of three domains (time, spectral, cepstral).
  • M2.2: We added as features the 5 frequency metrics computed in M1.1: two band ratios, peak frequency, median frequency, and bandwidth. The resulting 107-element feature vectors were then used for modelling.
  • M2.3: The dataset was randomly divided into training and testing sets 50 times to produce a robust model. We use the Random Forest Classifier algorithm from the scikit-learn library. One model is produced per trace ID.
  • M2.4: We try different trace IDs and compare results. For each waveform, a probability is computed for each class.
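Steps M2.2-M2.4 can be sketched as repeated random splits feeding a Random Forest (a minimal sketch using synthetic data in place of the real 107-element feature vectors; the class count, split fraction, and forest size are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 107-element feature vectors of one trace ID;
# in practice X holds one row per waveform, y the MVO class labels.
rng = np.random.default_rng(42)
n, n_features = 500, 107
X = rng.normal(size=(n, n_features))
y = rng.integers(0, 5, size=n)        # 5 hypothetical event classes
X[y == 0, 0] += 3.0                   # make one class separable

accuracies = []
for seed in range(50):                # 50 random train/test splits (M2.3)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"mean accuracy over 50 splits: {np.mean(accuracies):.2f}")
# Per-waveform class probabilities (M2.4), one row per test waveform:
proba = clf.predict_proba(X_te)       # shape (n_test, n_classes)
```

Reporting the spread of accuracies across the 50 splits (rather than a single split) is what makes the quoted accuracy ranges like 76-80% meaningful.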

Results 2

  • Results from Paris are at MONTSERRAT/results.csv. These can be compared with results obtained without the extra features I added here.
  • Separate models for 3 channels yield accuracies of 76-80%.
  • If the LP-ROC class is omitted (following Langer et al., 2006), accuracy rises to 82-85%.
  • If only VT and LP classes are considered, accuracy is 96-99%.

Discussion

Conclusions

Further work

For the poster, we intend to:

  • add a frequency change metric to M1.1 and incorporate this as a pre-computed feature at M2.2
  • generate one model per channel at M2.3
  • compute a probability for each trace ID in an event, and take a weighted average of these to automatically label the event
  • expand our labelled dataset to at least 1000 events
  • reclassify the catalog of 217,290 transients
  • repeat the jackknifing
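The weighted-average event-labelling step planned above could be sketched as follows (the probabilities, weights, and class list are hypothetical; the weighting scheme, e.g. by data quality or SNR, is an assumption to be decided):

```python
import numpy as np

# Hypothetical per-trace class probabilities for one event: each row is
# the classifier output for one trace ID, columns are the event classes.
classes = ["ROC", "HYB", "LPE", "LP-ROC", "VT"]
probs = np.array([
    [0.60, 0.20, 0.10, 0.05, 0.05],   # trace 1
    [0.50, 0.30, 0.10, 0.05, 0.05],   # trace 2
    [0.20, 0.50, 0.20, 0.05, 0.05],   # trace 3
])
# Hypothetical per-trace weights, e.g. a data-quality or SNR score.
weights = np.array([1.0, 0.8, 0.3])

# Weighted average across traces gives one probability vector per event.
event_prob = np.average(probs, axis=0, weights=weights)
label = classes[int(np.argmax(event_prob))]
print(label)  # ROC
```

Because each per-trace row sums to 1, the weighted average is itself a valid probability vector, so the event label comes with a usable confidence.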