data-ingest

Source code for the CLI tool that ingests data from the embedded data collection devices (whale tags, moorings, etc.) and uploads it to the AWS cloud (S3), where it is later combined into a unified dataset consumable by the machine learning pipelines in Apache Spark for Project CETI. The code targets Linux machines, although we used OS-agnostic libraries where possible to ease porting if need be.

There are a few assumptions made in this code.

For whale tags

It is assumed that a whale tag is present on the same LAN, has an SSH server running on port 22, and has a hostname of the form wt-AABBCCDDEEFF.
It is assumed that the hostname is universally unique and constant.
The embedded software for the whale tags sets the hostname in that form.
Whale tags are mechanically isolated to withstand high pressures, so we assume the LAN is a WiFi network.
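
The hostname convention above can be checked with a short pattern match. This is a minimal sketch, not code from the ceti package: it assumes the AABBCCDDEEFF part stands for exactly 12 hexadecimal digits, and the name is_whale_tag_hostname is hypothetical.

```python
import re

# Assumption: "AABBCCDDEEFF" stands for exactly 12 hexadecimal digits.
WHALE_TAG_HOSTNAME = re.compile(r"wt-[0-9A-Fa-f]{12}")


def is_whale_tag_hostname(hostname: str) -> bool:
    """Return True if hostname has the assumed wt-<12 hex digits> form."""
    return WHALE_TAG_HOSTNAME.fullmatch(hostname) is not None
```

A check like this could be used to filter the hosts found on the LAN before attempting an SSH connection.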

For moorings

This also applies to other potential sources of data attached as external storage to the machine that uploads the data to S3.

It is assumed that the data folder contains subfolders that correspond to unique device IDs. For example, a mooring would have its data in a mg-AABBCCDDEEFF subfolder.
It is assumed that those subfolders are universally unique and constant.
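
To illustrate the folder layout assumed above, the following sketch lists the device ID subfolders of a data folder. It is an illustration only; the function name list_device_folders is hypothetical and is not part of the ceti package.

```python
from pathlib import Path


def list_device_folders(data_dir):
    """Return the sorted names of the immediate subfolders of data_dir.

    Each subfolder name is treated as a device ID, e.g. mg-AABBCCDDEEFF.
    Plain files sitting directly in data_dir are ignored.
    """
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_dir())
```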

Installation

First make sure you have AWS Credentials properly set to access AWS infrastructure. After that execute:

make login
pip install ceti

Installation from wheel file

The wheel file can be installed using pip. For example, if you have ceti-1.0.0-py3-none-any.whl:

pip install ceti-1.0.0-py3-none-any.whl

Installation from source

If you want to install from sources:

git clone https://github.com/Project-CETI/data-ingest.git
cd data-ingest
pip install .

Usage

$ ceti -h
usage: ceti [command] [options]

optional arguments:
  -h, --help  show this help message and exit

Available commands:

    s3upload  Uploads local whale data to AWS S3 cloud.
    whaletag  Discover whale tags on LAN and download data off them.

whaletag

This command pulls data from whale tags that are on the same LAN as the machine running it.

See command line arguments:

ceti whaletag -h

A typical use case would have a WiFi network with a Linux machine connected to it, and a whale tag's onboard embedded computer also connected to the same WiFi. Then one could:

Download all data from all whaletags present on the LAN

ceti whaletag -a

Clean all whaletags. Caution: this is dangerous. It removes all data from the tags. Only do this after you have successfully uploaded the data to S3.

ceti whaletag -ca

Find all whaletags on the LAN

ceti whaletag -l

Download data from one specific whaletag

ceti whaletag -t wt-AABBCCDDEEFF

Delete data from the whaletag

ceti whaletag -ct wt-AABBCCDDEEFF

Offloading/Uploading Generic Non-Tag Data

The general_offload.sh script will handle all the offloading and uploading for non-tag data. Simply pass the file path to the data, and the ID of the device that captured the data.

If the data was captured on a shared CETI device such as a drone or GoPro, there should be a device ID label on the device. The ID you provide is checked against a list of registered devices kept in S3 (https://s3.console.aws.amazon.com/s3/object/ceti-data?region=us-east-1&prefix=Device+ID+List.txt). If your device is not listed there, you can register it with a unique ID.
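
The device ID check can be pictured as a simple membership test. The sketch below is an assumption-laden illustration: it assumes the list is plain text with one device ID per line, which this README does not confirm, and the function name is_registered is hypothetical.

```python
def is_registered(device_id, device_list_text):
    """Return True if device_id appears in the registered device list.

    Assumption: the list holds one device ID per line. Blank lines and
    surrounding whitespace are ignored.
    """
    registered = {
        line.strip() for line in device_list_text.splitlines() if line.strip()
    }
    return device_id in registered
```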

general_offload.sh creates a temporary staging folder to offload the files to. It also saves backup copies of the files in data/backup. Once it has offloaded all the files onto the local machine, it uploads them all to S3 and then deletes the temporary folder.
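
The stage, back up, upload, and clean up sequence described above could look roughly like the sketch below. It is not the script's actual implementation: the function name offload_and_upload, the per-device subfolder layout, and the upload callable (a stand-in for the S3 upload step) are all hypothetical. It requires Python 3.8 or later for dirs_exist_ok.

```python
import shutil
import tempfile
from pathlib import Path


def offload_and_upload(source_dir, device_id, backup_root, upload):
    """Stage files, back them up, upload them, then delete the staging folder.

    upload is a hypothetical callable that receives the staging directory.
    Returns the staged file paths, relative to the staging directory.
    """
    staging = Path(tempfile.mkdtemp(prefix="offload-"))
    try:
        # Assumption: files are grouped under a subfolder named after the device.
        device_staging = staging / device_id
        shutil.copytree(source_dir, device_staging)
        # Keep a backup copy before anything is uploaded.
        shutil.copytree(
            device_staging, Path(backup_root) / device_id, dirs_exist_ok=True
        )
        staged = sorted(
            p.relative_to(staging).as_posix()
            for p in device_staging.rglob("*")
            if p.is_file()
        )
        upload(staging)
        return staged
    finally:
        # The temporary folder is removed whether or not the upload succeeded.
        shutil.rmtree(staging, ignore_errors=True)
```

Deleting the staging folder in a finally block means a failed upload does not leave stray temporary data behind, while the backup copy remains available for a retry.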

Example, run from the data-ingest directory. This is the recommended way to offload and upload data:

source general_offload.sh /media/mangohouse/3531-3034/DCIM/100MEDIA/ CETI-DJI_MINI2-1

The general_offload.sh script uses other ceti tools named general_offload and s3upload. It just calls them in succession. If you would like to use the general_offload tool manually, you can do so, but will need to provide an additional path to the temporary folder to be used.

To get a list of supported commands:

ceti general_offload -h

To preview a list of files to be offloaded:

ceti general_offload -t <path_to_files> <device_id> <temporary_folder>

To perform actual data offload:

ceti general_offload <path_to_files> <device_id> <temporary_folder>

Note that when you use the general_offload tool manually, you are responsible for manually deleting the temporary folder.

Uploading data to S3

To upload the files from the data directory, use the s3upload command. It establishes a connection to the S3 bucket for raw data and attempts to upload all the data from the folder you specify. This command also attempts to deduplicate the data during upload in order to provide upload resume capability. However, if your upload takes longer than 24 hours and the connection breaks, some files might still be re-uploaded. We erred on the side of caution and decided it is better to sometimes upload more than needed and deduplicate the data later than to potentially lose precious data.

To get a list of supported commands:

ceti s3upload -h

To preview a list of files and locations for the upload, assuming your data is in ./data:

ceti s3upload -t ./data

To perform actual upload to S3:

ceti s3upload ./data
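
One way to picture the deduplication mentioned in this section is to compare local file checksums against a record of what is already stored remotely. The sketch below is a generic illustration and is not taken from s3upload: the names file_md5 and plan_upload are hypothetical, and remote_checksums is assumed to be a dictionary mapping object keys to MD5 hex digests obtained elsewhere.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_upload(data_dir, remote_checksums):
    """Return relative paths whose content is not already recorded remotely.

    remote_checksums is a hypothetical mapping of object key to MD5 hex digest.
    A file is skipped only when both its key and its checksum match.
    """
    data_dir = Path(data_dir)
    to_upload = []
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(data_dir).as_posix()
        if remote_checksums.get(key) != file_md5(path):
            to_upload.append(key)
    return to_upload
```

Under this scheme, a file whose remote record is missing or stale is uploaded again, which matches the stated preference for uploading too much over losing data.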

Development

Building the package locally

For developer mode use from the project directory:

pip install -e .

You can build a wheel file for binary distribution of the package. The wheel file will be located in the ./dist folder.

make build_tools && make build

Releasing a new version

This package follows semantic versioning and PEP 440. To release a new version, run the following steps:

git checkout main && git pull
make release

This will automatically bump the version at the patch level, e.g. 1.0.1 -> 1.0.2, and execute git push origin main --tags. After that, the CI will run all the tests and publish the new version to the AWS CodeArtifact repo.

You can control the version level to bump using the BUMP_LEVEL environment variable. Possible options are major, minor, patch (the default). For example:

BUMP_LEVEL=minor make release
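
The effect of each bump level on a version string can be sketched as follows. This illustrates standard semantic versioning arithmetic and is not the mechanism used by the Makefile, which is not shown here; the function name bump_version is hypothetical.

```python
def bump_version(version, level="patch"):
    """Return version bumped at the given level: major, minor, or patch.

    Bumping a level resets all lower levels to zero, per semantic versioning.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level!r}")
```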

Whale tag deployment

On certain versions of the tag hardware, the data from hydrophones is stored in raw format. Convert it before upload with the script scripts/flacencode.sh. There is also a convenience script, scripts/tag.sh, that does the following automatically:

discover all tags on subnet
create a temporary folder for data offload
download all data from all tags into temporary folder
flac encode all audio, gzip all sensor csv data
copy a backup of the compressed data to the /data-backup folder
upload all downloaded and compressed data to S3
clean all tags
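
As an illustration of the gzip step in the list above, the sketch below compresses every CSV file under a folder. It is not the content of scripts/tag.sh: the function name gzip_csv_files is hypothetical, and removing the original file after compression is an assumption that mirrors the default behavior of the gzip command line tool.

```python
import gzip
import shutil
from pathlib import Path


def gzip_csv_files(folder):
    """Compress every .csv file under folder to .csv.gz and return the new paths.

    Assumption: the uncompressed original is removed after compression,
    as the gzip command line tool does by default.
    """
    compressed = []
    for csv_path in sorted(Path(folder).rglob("*.csv")):
        gz_path = csv_path.with_name(csv_path.name + ".gz")
        with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        csv_path.unlink()
        compressed.append(gz_path)
    return compressed
```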
