Lightweight ocean passive acoustic data query #32

leewujung · 2022-08-15T20:04:09Z

leewujung
Aug 15, 2022
Maintainer

Title

Lightweight ocean passive acoustic data query

Summary

Create a lightweight data query service to help people find publicly available passive acoustic data from the ocean. The idea could be extended to include publicly available terrestrial datasets.

Personnel

Wu-Jung Lee + anyone interested

Data sets and infrastructure support

Datasets to potentially get started on this:

Would be good to follow an existing convention for metadata.

Some more good references:

The problem

Ocean audio datasets are available in many places -- another realization of the "there are so many portals!" problem in the subdomain of passive acoustic monitoring.

It would be nice to have a data query service that can check existing databases/data portals to grab at least the metadata (a minimum set would also need to be defined) and let users know what is available out there on the web or in some organization's buckets. Getting the actual data is much more "expensive" in terms of resources and logistics, so this proposal is aimed only at knowing that the data exist.

Proposed methods/tools

The colocate package may be a good reference: https://github.com/ioos/colocate
That was also a project from a past OHW.

alukach · 2022-08-15T23:59:18Z

alukach
Aug 15, 2022

Thoughts

Cataloging Strategy

To work around shortcomings of individual data sources (e.g. some may not support intelligent queries such as discovering data by datetime or location), we would likely need to build and maintain our own catalog of metadata.

Given the number of datasets, we will likely want to split up the work and create one ingestor per dataset, each responsible for:

discovering available data
generating metadata in a harmonized format for each "unit" of data
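
A minimal sketch of the per-dataset ingestor idea above, in Python. Everything here is an assumption made for illustration: the names `Ingestor`, `discover`, `to_record`, and `ExampleBucketIngestor`, the file naming scheme, and the hard-coded listing that stands in for a real API call or bucket listing.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class Ingestor(ABC):
    """One ingestor per data source (hypothetical interface)."""

    source_name: str

    @abstractmethod
    def discover(self) -> Iterable[Any]:
        """Yield one raw description per "unit" of data at the source."""

    @abstractmethod
    def to_record(self, raw: Any) -> Dict[str, Any]:
        """Convert one raw description into the harmonized metadata format."""

    def run(self) -> List[Dict[str, Any]]:
        """Discover everything available and harmonize it."""
        return [self.to_record(raw) for raw in self.discover()]


class ExampleBucketIngestor(Ingestor):
    """Toy ingestor; the hard-coded listing stands in for a real bucket."""

    source_name = "example-bucket"

    def discover(self):
        # Placeholder listing; a real ingestor would query the source here.
        yield {"key": "site_a/2022-08-15T00-00-00.flac", "lat": 48.5, "lon": -123.0}
        yield {"key": "site_a/2022-08-15T01-00-00.flac", "lat": 48.5, "lon": -123.0}

    def to_record(self, raw):
        # Assumed file naming: <site>/<YYYY-MM-DDTHH-MM-SS>.<ext>
        stamp = raw["key"].split("/")[-1].rsplit(".", 1)[0]
        date, clock = stamp.split("T")
        return {
            "source": self.source_name,
            "id": raw["key"],
            "lat": raw["lat"],
            "lon": raw["lon"],
            "start_time": f"{date}T{clock.replace('-', ':')}Z",
        }


records = ExampleBucketIngestor().run()
```

Keeping source-specific quirks inside `discover` and `to_record` means everything downstream only ever sees the harmonized records.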

Catalog Format

It seems appropriate to store metadata regarding all relevant data in a harmonized format.

It feels like a STAC catalog might be an appropriate format for this. By adhering to an open standard, we can lean on pre-existing tooling that already works with data in that commonly known format. For example:

pgSTAC - a schema and set of functions for storing STAC data in a PostgreSQL database
stac-fastapi - an API layer to query a STAC catalog
stac-browser - a UI for discovering data that is stored within a STAC catalog
pystac-client - a Python module for interacting with a STAC catalog
stac-pydantic - a Python module for validating STAC items
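
To make the format concrete, here is a rough sketch of what one recording could look like as a STAC Item, built as a plain dict with only the standard library. The top-level fields (`type`, `stac_version`, `id`, `geometry`, `bbox`, `properties.datetime`, `links`, `assets`) follow the STAC Item layout; the helper name `make_item`, the `source` property, the item ID, and the example.org URL are all made up for illustration. In practice pystac or stac-pydantic would be the safer way to build and validate items.

```python
import json
from datetime import datetime, timezone


def make_item(item_id, lon, lat, start, href, source):
    """Build a dict shaped like a STAC Item (a GeoJSON Feature) for one recording."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "bbox": [lon, lat, lon, lat],
        "properties": {
            # STAC expects an RFC 3339 timestamp in UTC.
            "datetime": start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "source": source,  # hypothetical custom property, not part of the core spec
        },
        "links": [],
        # Only the location of the audio is stored, never the audio itself.
        "assets": {"audio": {"href": href, "title": "Raw audio file"}},
    }


item = make_item(
    "example-site-a-20220815T000000",
    lon=-123.0,
    lat=48.5,
    start=datetime(2022, 8, 15, tzinfo=timezone.utc),
    href="https://example.org/site_a/20220815T000000.flac",  # placeholder URL
    source="example-portal",
)
print(json.dumps(item, indent=2))
```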

How will this operate?

I assumed this would be a web-service that would run in a cloud environment that would poll the source data at a regular interval. Does this seem sufficient or does that have shortcomings?


Charles-Battisti · 2022-08-16T00:32:04Z

Charles-Battisti
Aug 16, 2022

Maybe look at Ocean Networks Canada data

For Orcasound: https://github.com/orgs/orcasound/projects/2/views/1


leewujung · 2022-08-16T17:56:47Z

leewujung
Aug 16, 2022
Maintainer Author

> How will this operate?
>
> I assumed this would be a web-service that would run in a cloud environment that would poll the source data at a regular interval. Does this seem sufficient or does that have shortcomings?

I was actually thinking that it would be nice for this not to be a web service, and just be a package that people can install and use to run the query. We could probably have a repo that uses GitHub Actions to automatically poll the metadata from different sources and store it somewhere (this part would have to be worked out...).

One reason is that there are so many data portals out there and it would be nice if we don't just create another one, just like it's usually best practice to revise an existing data standard/convention instead of creating a new one. I didn't include in the above the large number of data portals for terrestrial passive acoustic monitoring, but the sprawling problem is the same there.


leewujung · 2022-08-16T18:24:46Z

leewujung
Aug 16, 2022
Maintainer Author

I also wonder if an intake catalog would be a good choice for recording just the file existence and metadata, since the audio files themselves are far too large to store.

For a very minimal set of metadata to just build out and test the mechanism: lat-lon, time, and data source (which database, portal, or bucket).
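
A rough sketch of what querying that minimal set could look like, using only the standard library. The `Record` class, the `query` function, its parameters, and the sample records are all hypothetical; the bounding box is assumed to be `(min_lon, min_lat, max_lon, max_lat)` and does not handle boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    lat: float
    lon: float
    time: datetime
    source: str  # which database, portal, or bucket


def query(records, bbox=None, start=None, end=None, source=None):
    """Return records matching every filter that was given (None means no filter)."""
    hits = []
    for rec in records:
        if bbox is not None:
            min_lon, min_lat, max_lon, max_lat = bbox
            if not (min_lon <= rec.lon <= max_lon and min_lat <= rec.lat <= max_lat):
                continue
        if start is not None and rec.time < start:
            continue
        if end is not None and rec.time > end:
            continue
        if source is not None and rec.source != source:
            continue
        hits.append(rec)
    return hits


utc = timezone.utc
records = [
    Record(48.5, -123.0, datetime(2022, 8, 15, tzinfo=utc), "portal-a"),
    Record(44.6, -124.3, datetime(2022, 8, 16, tzinfo=utc), "bucket-b"),
    Record(48.4, -123.2, datetime(2023, 1, 12, tzinfo=utc), "portal-a"),
]

hits = query(
    records,
    bbox=(-124.0, 48.0, -122.0, 49.0),
    start=datetime(2022, 8, 1, tzinfo=utc),
    end=datetime(2022, 9, 1, tzinfo=utc),
)
```

A linear scan like this is probably fine while the catalog stays small; an index would only matter if it grows a lot.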

Just because we're in GVE at the moment: what @ocefpaf is covering in the Data Access tutorials is/was also what inspired me for the project idea, in addition to the colocate package.


scottveirs · 2023-01-12T00:48:31Z

scottveirs
Jan 12, 2023

@leewujung I love this concept. Have you discussed with @valentina-s ?

In anticipation of revisiting your idea in 2023, I wanted to call your attention to the efforts of Karan last summer (via Google Summer of Code with Orcasound) -- https://www.orcasound.net/2022/08/04/making-hydrophone-data-accessible/ ... In terms of developing a catalog for OOI hydrophones, one might be able to build upon his code, which aimed to transcode the mseed audio data for playback in near-real-time via the Orcasound web app.

I think this approach might also help improve access to and analysis of Canadian open data. Despite their best efforts, the Data Viewer for hydrophone audio provided by Ocean Networks Canada makes it difficult to determine which of the deployments within a region hold data for a particular period.

2 replies

emiliom Jan 12, 2023
Maintainer

Hey @scottveirs ! Coincidentally I'm at ONC this week, for a couple of meetings. I'd be happy to connect with you to pass on to ONC your suggestions about improved access to their hydrophone data; or see if there's an API that already meets your needs but is not well documented.

leewujung Jan 13, 2023
Maintainer Author

Hey @scottveirs : awesome that you like the idea! @valentina-s was also in OHW22 and she was aware of this, but perhaps the timing didn't line up well with the GSOC work?

There was a team that did pursue this idea during OHW22 and the results were in this repo: https://github.com/oceanhackweek/ohw22-proj-passive-acoustics-data-query
It would be great to build upon Karan's efforts! The OOI query during OHW22 was through IRIS, so the data are possibly only from the LF hydrophones?

@emiliom : to my understanding the test ONC query went well (this notebook), but I haven't personally run it or explored the output URL further. @scottveirs would have better insights here.

It would be great to revisit this idea in 2023, and potentially get some help from the data providers to make the cataloging easier. My thought was that if we could update the catalog daily through GH Actions and the catalogs are small enough (they should be?) to be stored just in the repo, it would at least in theory be "free" to run, haha.
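
One way the "catalog stored in the repo" part could work, as a sketch only: a small script merges freshly polled records into a JSON file and reports whether anything changed, so a scheduled job would only need to commit when it returns `True`. The function name `update_catalog`, the `id` key, and the `catalog.json` file name are assumptions; the GitHub Actions wiring itself is not shown.

```python
import json
import tempfile
from pathlib import Path


def update_catalog(path, new_records):
    """Merge new records into a JSON catalog file; return True if the file changed."""
    path = Path(path)
    existing = json.loads(path.read_text()) if path.exists() else []

    # Key by record id so re-polling the same source does not create duplicates.
    merged = {rec["id"]: rec for rec in existing}
    merged.update({rec["id"]: rec for rec in new_records})

    # Sort and serialize deterministically so daily commits produce small diffs.
    ordered = sorted(merged.values(), key=lambda rec: rec["id"])
    text = json.dumps(ordered, indent=2, sort_keys=True) + "\n"

    if path.exists() and path.read_text() == text:
        return False
    path.write_text(text)
    return True


# Demonstration in a temporary directory standing in for the repo checkout.
with tempfile.TemporaryDirectory() as tmp:
    catalog = Path(tmp) / "catalog.json"
    first = update_catalog(catalog, [{"id": "a", "source": "portal-a"}])
    second = update_catalog(catalog, [{"id": "a", "source": "portal-a"}])
    third = update_catalog(catalog, [{"id": "b", "source": "bucket-b"}])
    final = json.loads(catalog.read_text())
```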

Jacob-Stevens-Haas · 2023-05-27T12:26:16Z

Jacob-Stevens-Haas
May 27, 2023

Hey all, I'm a 4th yr Ph.D. Candidate in Applied Math and stumbled here from an eScience Institute email looking for project proposals. I saw that there's a deadline of Jun 2nd, so I'd like to revive this thread, if that's ok.

My original intent was slightly different from @leewujung's proposal. Rather than a lightweight catalogue of data, I was motivated by the problem of knowing there was a lot of data at ONC that was difficult to work with. I had been trying to build an ML model to detect and classify shipping, but the ONC python library/API downloads .wav and .mp3 files. So I built a package to manage and track downloads, stitch together acoustic files, and return all the data as numpy arrays. This made it easy to build up a lot of training data programmatically.
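
To illustrate just the stitching step described above, here is a standard-library sketch. It is not the API of tehom or of the ONC python library. The function name `stitch_wavs` is hypothetical, and the sketch assumes 16-bit PCM WAV files that are contiguous in time and passed in chronological order. It returns an `array` of samples, which could be wrapped in a numpy array afterwards.

```python
import sys
import tempfile
import wave
from array import array
from pathlib import Path


def stitch_wavs(paths):
    """Concatenate 16-bit PCM WAV files, in the order given.

    Returns (samples, sample_rate, n_channels). Multi-channel samples stay interleaved.
    """
    samples = array("h")
    fmt = None
    for path in paths:
        with wave.open(str(path), "rb") as wf:
            this = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
            if this[1] != 2:
                raise ValueError(f"{path}: only 16-bit PCM is handled in this sketch")
            if fmt is None:
                fmt = this
            elif this != fmt:
                raise ValueError(f"{path}: format {this} differs from {fmt}")
            chunk = array("h")
            chunk.frombytes(wf.readframes(wf.getnframes()))
            if sys.byteorder == "big":
                chunk.byteswap()  # WAV data is little-endian
            samples.extend(chunk)
    if fmt is None:
        raise ValueError("no input files")
    return samples, fmt[2], fmt[0]


def _write_wav(path, values, rate=8000):
    """Write a tiny mono 16-bit WAV file (demo helper)."""
    data = array("h", values)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(data.tobytes())


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp) / "a.wav", Path(tmp) / "b.wav"
    _write_wav(a, [0, 1, 2])
    _write_wav(b, [3, 4])
    samples, rate, channels = stitch_wavs([a, b])
```

Real downloads would also need gap and overlap handling between files, which this sketch ignores.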

I originally did this for an employer, so I was trying to rebuild an open-source version at tehom. While the closed-source version led to a conference poster on our model, the open-source one is about 50% done, since it's unrelated to my PhD and a perennial side project. I'd be happy to lead a mentored sprint on it for OceanHackWeek 2023.

I also have maintainer access to ONC's python library and have worked with it pretty extensively.

Is this a useful project idea? Is this the kind of thing that would garner interest at an OHW? @leewujung @scottveirs @emiliom.

