Pangeo forge concept on OOI Data #10

Open
lsetiawan opened this issue Feb 19, 2021 · 1 comment

Comments

@lsetiawan

Hi all!

I am currently trying to take the concepts in pangeo-forge and apply them to OOI data in https://github.com/ooi-data. First, I think this is a really great evolution of the Pangeo project, and I am looking forward to seeing where it's headed. I am working on converting a lot of OOI data into Zarr stores. I tried using Prefect and Dask for this, run from a K8s CronJob, but ran into a lot of roadblocks in getting the status and history of the data pipeline and in seeing whether anything broke along the way.

After running across pangeo-forge a few months back, I really loved the idea of a data pipeline that combines GitHub Actions and Prefect! However, I couldn't fully adapt the current pangeo-forge to my needs, since it assumes the source dataset is already sitting on a server somewhere, ready to be pulled. The OOI system works differently from most other systems: the user has to request the data, wait for the request to be fulfilled, and only then fetch the data. I couldn't see a way to do that step in pangeo-forge. One solution I thought might work is a fetch-and-wait task within the Prefect flow, but that means the Kubernetes pod spends a lot of time sitting idle.
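To make the request/wait/fetch pattern concrete, here is a minimal sketch of the "wait" part as a plain polling loop. It is not the OOI API or anything from pangeo-forge: `wait_until_ready`, `is_ready`, and the timeout and interval values are all hypothetical names and numbers chosen for illustration. The readiness check is passed in as a callable, so whatever actually asks the data provider about the request status stays outside the loop.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 3600.0,
    poll_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    is_ready is a hypothetical callable that reports whether the requested
    data is available to fetch. sleep and clock are injectable so the loop
    can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True  # data can be fetched now
        if clock() >= deadline:
            return False  # gave up; the caller decides whether to retry later
        sleep(poll_interval_s)
```

Running this inside a long-lived task is exactly the "sitting around" problem described above: the process is occupied for the whole wait. Returning `False` on timeout at least lets a scheduler end the job and check again on a later run.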

Because of those roadblocks, I decided to take the concepts and use https://github.com/pangeo-forge/terraclimate-feedstock as an example to create a pangeo-forge-esque proof-of-concept system, where GitHub Actions perform the request and wait steps, followed by a separate step for the actual processing.

I also added a history of requests and processing runs that is tracked by git, to provide full provenance for the dataset. This is not fully baked, but you can see a sort of history example for request and process.
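As a rough illustration of what a git-tracked history could look like, the sketch below appends one JSON record per pipeline event to a text file that would then be committed to the dataset repo. The file layout, the field names, and the `append_history` helper are assumptions made for this example and are not necessarily how the ooi-data repos store their history.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_history(history_file: Path, step: str, status: str, details: dict) -> dict:
    """Append one event record to a JSON Lines history file and return it.

    step might be "request" or "process"; all field names are hypothetical.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "status": status,
        "details": details,
    }
    history_file.parent.mkdir(parents=True, exist_ok=True)
    # One JSON object per line keeps git diffs small: each run adds one line.
    with history_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

An append-only, line-oriented file is a deliberate choice here: every commit then shows exactly which request or processing event was added, which is most of what provenance needs.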

Then there's the issue of replicating this for all the datasets that OOI has. So I decided to use GitHub template repositories to have a nice template to copy from: https://github.com/ooi-data/stream_template. I also found a way to keep all the dataset repos in sync with the template by using https://github.com/koj-co/update-template.

The processing step does not actually run anything yet, as I am still working on the backend logistics within my K8s cluster. You can see the running pipeline in the screenshot below.

Screenshot from 2021-02-19 13-20-24

I just thought I should share my experience combining the power of GitHub Actions + Prefect to create a data pipeline based on the ideas of pangeo-forge. Thank you for creating this great project and laying out the roadmap. I hope my ideas and prototype spark other ideas. I would love to eventually port everything I have over to pangeo-forge and contribute to the project in any capacity that I can 😄

@cisaacstern
Member

Hi @lsetiawan, just wanted to check in to see how you're doing with this. The architecture of Pangeo Forge has evolved quite a bit since you first opened this issue, so I wouldn't be surprised if your early experiments have needed, or will need, updating. Since this is on some level a design question related to pangeo-forge-recipes, perhaps we should move the discussion to https://github.com/pangeo-forge/pangeo-forge-recipes/issues.
