acoustic_discovery

A library for detection of audio events for the National Park Service

[Image: Dark-eyed Junco singing over Alaska]

This library was commissioned by the National Park Service to assist with ornithological research in Alaska.
Its purpose is to automatically detect the songs of select avian species in recorded audio.

Background

Since 2001, researchers at Denali National Park have collected extensive audio recordings throughout the park in an initiative to protect and study the natural acoustic environment. These recordings often contain sounds that can be used to better understand avian occupancy, abundance, phenological timing, and other quantities of interest to conservation efforts.

Recent advances in artificial intelligence have drastically improved the ability of machines to perceive audio signals at human levels. Identifying and annotating avian species over thousands of hours of audio would previously have required an enormous amount of time from skilled technical staff. This library uses machine listening models pre-trained on NPS audio files to help automatically identify avian species. It is our hope that it will catalyze the use of long-format audio recordings for avian conservation work throughout the state.


Author

This library and the associated listening models were created by Cameron Summers, a machine learning and artificial intelligence researcher based in the San Francisco Bay Area.


Usage

At a high level, the library takes in (1) audio files, (2) species lists, and (3) detection thresholds for each species, and outputs a corresponding timeline of detection probabilities for each species. A probability of 0.0 means the model is certain the species is not vocalizing, while a probability of 1.0 means the model is certain the species is vocalizing. Users may also choose to output audio clips of each detection exceeding the threshold; these can be useful for rapid proofing of automated analysis results.

[Image: process diagram]

The configuration for the models is carefully tuned for optimal detection performance. It is helpful to understand some of these parameters in order to interpret the outputs of the library:

  • window_size_sec - Size of the detection window, in seconds
  • hop_size - Separation, in seconds, between the starts of consecutive overlapping detection windows

For the models in this library, the window size is 4.0 seconds and the hop size is 0.01 seconds. Thus, for a 30-second file, there should be roughly 3000 detections. The first detection window covers 0.0 to 4.0 seconds of the audio, the second window 0.01 to 4.01 seconds, and so on.

[Image: diagram of rolling window schema]
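
A quick sanity check of that arithmetic in plain Python (illustrative only; the exact count can differ by a frame or two depending on how the library pads the ends of the file):

>>> window_size_sec = 4.0  # size of each detection window, in seconds
>>> hop_size = 0.01        # separation between consecutive window starts, in seconds
>>> audio_len_sec = 30.0
>>> round(audio_len_sec / hop_size)  # roughly one detection per hop across the file
3000
>>> # Only windows that fit entirely within the file see a full 4 seconds of audio:
>>> round((audio_len_sec - window_size_sec) / hop_size) + 1
2601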

Models

Each species has a pre-trained model stored in its own folder in the models directory of this project. The user provides the path to one of these folders to use that model for detection.

When running a detector, you will likely use these recommended thresholds:

Species                   Code  Recommended Threshold
Willow Ptarmigan          WIPT  0.2
White-tailed Ptarmigan*   WTPT  0.9
Greater Yellowlegs*       GRYE  0.3
Surfbird                  SURF  0.1
Wilson's Snipe*           WISN  0.6
Olive-sided Flycatcher    OSFL  0.1
Common Raven*             CORA  0.1
Ruby-crowned Kinglet      RCKI  0.4
Swainson's Thrush*        SWTH  0.6
Hermit Thrush*            HETH  0.1
American Robin*           AMRO  0.6
Varied Thrush             VATH  0.3
Orange-crowned Warbler*   OCWA  0.99
Blackpoll Warbler*        BLPW  0.2
Myrtle Warbler            MYWA  0.5
Fox Sparrow*              FOSP  0.7
Lincoln's Sparrow         LISP  0.7
White-crowned Sparrow*    WCSP  0.99
Golden-crowned Sparrow*   GCSP  0.9
Dark-eyed Junco*          DEJU  0.2

(Higher performance is expected for species marked with an asterisk.)
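
For convenience in scripts, you might capture these recommendations in a dictionary keyed by species code (a hypothetical helper for the examples below, not part of the library):

>>> # Hypothetical mapping of the recommended thresholds from the table above.
>>> RECOMMENDED_THRESHOLDS = {
...     'WIPT': 0.2, 'WTPT': 0.9, 'GRYE': 0.3, 'SURF': 0.1, 'WISN': 0.6,
...     'OSFL': 0.1, 'CORA': 0.1, 'RCKI': 0.4, 'SWTH': 0.6, 'HETH': 0.1,
...     'AMRO': 0.6, 'VATH': 0.3, 'OCWA': 0.99, 'BLPW': 0.2, 'MYWA': 0.5,
...     'FOSP': 0.7, 'LISP': 0.7, 'WCSP': 0.99, 'GCSP': 0.9, 'DEJU': 0.2}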

Models have one of two separate configuration types, chosen to improve detection performance. Importantly, species from different groups cannot be run together in the same instance of the AcousticDetector class! The two groups are as follows:

Group 1:

FOSP, WCSP, CORA, HETH, WTPT, GRYE, AMRO, DEJU, BLPW, SWTH

{'axis_dim': 1, 'feature_dim': 42, 'high_freq': 12000.0, 'hop_size': 0.01, 'low_freq': 100.0, 'nfft': 1024, 'num_cepstral_coeffs': 14, 'num_filters': 512, 'window_size_sec': 4.0}

Group 2:

OSFL, RCKI, LISP, GCSP, VATH, MYWA, WISN, SURF, OCWA, WIPT

{'axis_dim': 1, 'feature_dim': 64, 'high_freq': 5000.0, 'hop_size': 0.01, 'low_freq': 500.0, 'nfft': 512, 'num_cepstral_coeffs': None, 'num_filters': 64, 'window_size_sec': 4.0}
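
Since the groups cannot be mixed, one pattern is to build a separate detector per group. A sketch, assuming the models directory layout described above, the AcousticDetector constructor shown in "Using Code" below, and the hypothetical RECOMMENDED_THRESHOLDS mapping from earlier:

>>> from nps_acoustic_discovery.discover import AcousticDetector
>>> group1 = ['FOSP', 'WCSP', 'CORA', 'HETH', 'WTPT', 'GRYE', 'AMRO', 'DEJU', 'BLPW', 'SWTH']
>>> group2 = ['OSFL', 'RCKI', 'LISP', 'GCSP', 'VATH', 'MYWA', 'WISN', 'SURF', 'OCWA', 'WIPT']
>>> detectors = []
>>> for group in (group1, group2):
...     model_dir_paths = ['./models/{}'.format(code) for code in group]
...     thresholds = [RECOMMENDED_THRESHOLDS[code] for code in group]
...     detectors.append(AcousticDetector(model_dir_paths, thresholds, ffmpeg_path='/usr/bin/ffmpeg'))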


Using your own thresholds:

Knowledge of binary classification and the associated evaluation techniques is useful for setting thresholds. A user might vary the detection thresholds depending on the application. If the goal is to answer the question "Does my species occur anywhere in this file?", a high threshold may be called for to limit Type I errors (false positives). However, if the goal is to answer "Precisely how many calls occurred in the file?", a lower threshold may be appropriate to limit Type II errors (false negatives).

Using Command Line

For help:

python -m nps_acoustic_discovery.discover -h

usage: Audio event detection for the National Park Service [-h]
                                                           -m MODEL_DIR_PATH
                                                           -t THRESHOLD
                                                           [-o {probs,detections,audio}]
                                                           --ffmpeg FFMPEG
                                                           audio_path save_dir

positional arguments:
  audio_path            Path to audio file on which to run the classifier
  save_dir              Directory in which to save the output.

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_DIR_PATH, --model_dir_path MODEL_DIR_PATH
                        Path to model(s) directories for classification
  -t THRESHOLD, --threshold THRESHOLD
                        The threshold for a positive detection
  -o {probs,detections,audio}, --output {probs,detections,audio}
                        Type of output file:
                             probs: Raw probabilities over time
                             detections: Raven detections file
                             audio: Audio slices for each detection
  --ffmpeg FFMPEG       Path to FFMPEG executable
  --ffmpeg_quiet        Suppress ffmpeg output for detection processing
  --chunk_size_minutes CHUNK_SIZE_MINUTES
                        Number of minutes of audio to process at a time in large files

Command Line Examples

Running one model to generate a Raven file:

python -m nps_acoustic_discovery.discover <path_to_audio> <path_to_save_dir> -m <model_dir> -t <threshold> -o detections

Running two species models with two different thresholds generates two Raven files describing where the model detection probabilities exceeded the thresholds:

python -m nps_acoustic_discovery.discover <path_to_audio> <path_to_save_dir> -m <model_dir1> -m <model_dir2> -t <threshold1> -t <threshold2> -o detections

Running one model to generate a file with raw probabilities while suppressing ffmpeg output:

python -m nps_acoustic_discovery.discover <path_to_audio> <path_to_save_dir> -m <model_dir> -t <threshold> -o probs --ffmpeg_quiet

Running one model to generate audio files (possibly many) where the model detection probabilities exceeded the threshold. The chunk size is set to 30 minutes since plenty of RAM is available.

python -m nps_acoustic_discovery.discover <path_to_audio> <path_to_save_dir> -m <model_dir> -t <threshold> -o audio --chunk_size_minutes 30

Using Code

While inside the project directory, set up a detector:

>>> from nps_acoustic_discovery.discover import AcousticDetector
>>> model_dir_paths = ['./models/SWTH']
>>> thresholds = [0.6]
>>> ffmpeg_path = '/usr/bin/ffmpeg'   # or where yours is
>>> detector = AcousticDetector(model_dir_paths, thresholds, ffmpeg_path=ffmpeg_path)

The models attribute of the detector is a dict that maps a model id to the model object. The detector now houses one Swainson's Thrush (SWTH) model with the recommended threshold of 0.6 and a feature configuration. The feature configuration is derived from the model training phase and generally should not be altered, since changes could degrade detection performance or break detection outright.

>>> len(detector.models)
1
>>> detector.models.items()
dict_items([('61474838', <nps_acoustic_discovery.model.EventModel object at 0x10b096c88>)])
>>> detector.models['61474838'].detection_threshold
0.6
>>> detector.models['61474838'].fconfig
{'axis_dim': 1,
 'feature_dim': 42,
 'high_freq': 12000.0,
 'hop_size': 0.01,
 'low_freq': 100.0,
 'nfft': 1024,
 'num_cepstral_coeffs': 14,
 'num_filters': 512,
 'window_size_sec': 4.0}

Now we can use the detector on some audio.

>>> audio_path = './test/SWTH_test_30s.wav'
>>> model_prob_map = detector.process(audio_path, ffmpeg_quiet=True)
DEBUG:Processing chunk: 1. Audio len (s): 30.5
DEBUG:Processing features...
DEBUG:Input vector shape: (3049, 42)

Now we have probabilities of detection for the file.

>>> for model, probabilities in model_prob_map.items():
...     print("Type: {}, Shape: {}".format(type(probabilities), probabilities.shape))
...
Type: <class 'numpy.ndarray'>, Shape: (3049, 1)

As you can see, there are 3049 raw detection probabilities, one for each 0.01 seconds of the file. Let's take a look at the plot:

[Image: plot of detection probabilities over time for the test file]

There is a lot going on in the audio, and you can see the probabilities changing as the model responds to what are presumably Swainson's Thrush songs. The probabilities collapse over the last 4 seconds of the file because a full 4-second window of audio is required for detection.

From here, there are some convenience functions for common outputs. One makes it easy to create a pandas DataFrame.

>>> from nps_acoustic_discovery.output import probs_to_pandas, probs_to_raven_detections
>>> model_prob_df_map = probs_to_pandas(model_prob_map)
>>> for model, prob_df in model_prob_df_map.items():
...     print(prob_df.head())
...
   Relative Time (s)      SWTH
0               0.00  0.447792
1               0.01  0.369429
2               0.02  0.327936
3               0.03  0.380597
4               0.04  0.412197

And then to create a file that can be read by Raven, software built by the Cornell Lab of Ornithology:

>>> model_raven_df_map = probs_to_raven_detections(model_prob_df_map)
>>> header = ['Selection', 'Begin Time (s)', 'End Time (s)', 'Species']
>>> for model, raven_df in model_raven_df_map.items():
...     raven_df[header].to_csv('./selection_table.txt', sep='\t', float_format='%.1f', index=False)

Or just look at the detections in the DataFrame and see that there are 4 confirmed detections above our threshold.

>>> model_raven_df_map = probs_to_raven_detections(model_prob_df_map)
>>> for model, raven_df in model_raven_df_map.items():
...     print(raven_df)
   Begin Time (s)  End Time (s)  Selection Species
0            0.51          4.51          1    SWTH
1            5.49          9.49          2    SWTH
2           12.52         16.52          3    SWTH
3           22.60         26.60          4    SWTH

The process of going from probabilities to Raven detections applies a low-pass filter to the probabilities and then applies the provided threshold.
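
Conceptually, that step looks something like the sketch below. This is only an illustration using a simple moving average; it is not the library's actual filter or its parameters:

>>> # Illustrative only: smooth the raw probabilities, then apply the threshold.
>>> for model, prob_df in model_prob_df_map.items():
...     smoothed = prob_df['SWTH'].rolling(window=100, center=True, min_periods=1).mean()  # ~1 s of hops
...     above_threshold = smoothed > model.detection_threshold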

If you wanted to save off slices of audio based on the detections, it might look something like this with ffmpeg:

>>> import subprocess
>>> model_raven_df_map = probs_to_raven_detections(model_prob_df_map)
>>> for model, raven_df in model_raven_df_map.items():
...     slice_length = str(model.fconfig['window_size_sec'])
...     for idx, row in raven_df.iterrows():
...         start_time = str(row['Begin Time (s)'])
...         output_filename = 'output_audio_slice_{}.wav'.format(idx)
...         ffmpeg_slice_cmd = [ffmpeg_path, '-i', audio_path, '-ss', start_time,
...                             '-t', slice_length, '-acodec', 'copy', output_filename]
...         subprocess.call(ffmpeg_slice_cmd)

This should create 4 audio files corresponding to the start and end times of detections.

Large Files

Since soundscape recordings are often very long, one of the considerations for this project was to process audio as a stream to avoid loading very large files into memory. This is controlled by the chunk_size_minutes parameter of the detector's process function, which lets the user specify how many (whole) minutes of audio to load into memory at a time. The output for all chunks is concatenated at the end of processing. Note that the detector does not currently "look ahead" across chunk boundaries, so there is a gap in detections at each boundary the size of the detection window.
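
For example, to process a long recording in 10-minute chunks:

>>> model_prob_map = detector.process(audio_path, chunk_size_minutes=10, ffmpeg_quiet=True)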


Installation

This project was developed for and tested with Python 3.5.

To install, clone this repository and then install the Python dependencies using pip: pip install -r requirements.txt. It is recommended to use pip with virtualenv (or virtualenvwrapper) to keep your projects tidy.
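
For example, on a Unix-like system:

git clone https://github.com/nationalparkservice/acoustic_discovery.git
cd acoustic_discovery
pip install -r requirements.txt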

This library also requires ffmpeg - which means it can handle many different types of audio file encodings - for file conversion and for stream processing of large files. To install ffmpeg on Windows, see the installation steps outlined here. For static builds on all platforms, see the downloads on the ffmpeg site.


Model Training

A significant amount of time was invested in training the species models to perform optimally. However, users can expect detection performance to vary with the species, background noise, and other factors, since the models learn from data and the data aren't always perfect or complete. Some common considerations that affect performance:

  • Species
    • The model learns from the data, and some species have fewer examples to learn from
  • Background Noise
    • Rain or heavy overlap in species calls
  • Audio Encoding
    • The training audio has a 44.1kHz sampling rate and 60 or 90kbps mp3 encoding. Using a similar or better encoding is advised. To illustrate, below is a plot of the probabilities for the test file in the code example above. The wav series is the original 90kbps audio decoded to wav, and the 320kbps and 60kbps series are that wav re-encoded to mp3. The higher-quality 320kbps series tracks the original signal much more closely than the 60kbps series. (A re-encoding example follows the plot.)

[Image: plot of detection probabilities for the test file at different audio encodings]
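
If your recordings use a lower-quality encoding, you can re-encode or resample them with ffmpeg before processing. For instance, to decode an mp3 to wav at the 44.1kHz training sample rate (standard ffmpeg flags, independent of this library):

ffmpeg -i <input>.mp3 -ar 44100 <output>.wav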


Smoke Tests

To run some basic tests, use nose:

nosetests --nocapture test/test_model.py

This should generate no errors.


Troubleshooting

ImportError: No module named 'tensorflow'

Installing Keras with pip creates a configuration file in your home directory, ~/.keras/keras.json, with TensorFlow set as the compute backend. You may need to change this to Theano: "backend": "theano"
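
A minimal ~/.keras/keras.json with the Theano backend might look like the following; your file may contain additional keys, and only the backend entry needs to change:

{
    "backend": "theano"
}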


Dependencies

Python dependencies are installed from requirements.txt (see Installation). The examples in this README rely on Keras (with a Theano or TensorFlow backend), NumPy, and pandas; ffmpeg is required separately.
Public domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.
