
Migrate sync process #2

Open
mosoriob opened this issue May 3, 2021 · 4 comments

Comments

@mosoriob
Contributor

mosoriob commented May 3, 2021

According to @dnfeldman:

there is supposed to be a background job that periodically parses each dataset's metadata JSON and populates the relevant fields, which is not currently there...

Can you use a Postgres trigger? It makes sense to me.

The trigger can be specified to fire before the operation is attempted on a row (before constraints are checked and the INSERT, UPDATE, or DELETE is attempted); or after the operation has completed (after constraints are checked and the INSERT, UPDATE, or DELETE has completed); or instead of the operation (in the case of inserts, updates or deletes on a view). If the trigger fires before or instead of the event, the trigger can skip the operation for the current row, or change the row being inserted (for INSERT and UPDATE operations only). If the trigger fires after the event, all changes, including the effects of other triggers, are "visible" to the trigger.
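
For illustration, such a trigger might look roughly like this; the table and column names (datasets, json_metadata, spatial_coverage, temporal_coverage) are hypothetical, just to sketch the idea:

-- hypothetical sketch: table and column names are made up for illustration
CREATE OR REPLACE FUNCTION populate_dataset_metadata_fields() RETURNS trigger AS $$
BEGIN
  -- copy fields out of the metadata JSON into dedicated columns on the same row
  NEW.spatial_coverage  := NEW.json_metadata -> 'spatial_coverage';
  NEW.temporal_coverage := NEW.json_metadata -> 'temporal_coverage';
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER populate_dataset_metadata_fields_trigger
  BEFORE INSERT OR UPDATE ON datasets
  FOR EACH ROW EXECUTE PROCEDURE populate_dataset_metadata_fields();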

@dnfeldman
Collaborator

I don't think the trigger approach would work in this case. The reason is that updating temporal and/or spatial information at the dataset level requires going through all of its resources, which is very expensive. It would mean waiting ~5 min after any change to a resource. I think a manual sync trigger is probably the easiest approach, especially if it is run as a cron job:

curl '<dcat_endpoint>/datasets/sync_datasets_metadata' -X 'POST' -H 'Connection: keep-alive'

(where the endpoint is data-catalog.mint.isi.edu)

Let me know what you think!

@mosoriob
Contributor Author

mosoriob commented May 3, 2021

I see the problem.

Question:

What happens in the following use case?

  1. The user adds a new dataset
  2. The user adds 10 resources
  3. The user searches using the Data Catalog UI or the MINT UI.

Should the user wait 5 minutes to use the datasets?

Deploy

I'm using docker-compose.
Can you add a new container with the crontab?


FROM alpine:3.6

# curl is needed by the cron job below; it is not part of the alpine base image
RUN apk add --no-cache curl

# copy crontabs for root user
COPY config/cronjobs /etc/crontabs/root

# start crond with log level 8 in foreground, output to stderr
CMD ["crond", "-f", "-d", "8"]

Where cronjobs is the file that contains your cronjobs, in this form:

* * * * * curl "${flask_app}/datasets/sync_datasets_metadata" -X 'POST' -H 'Connection: keep-alive' 2>&1
# remember to end this file with an empty new line

@dnfeldman
Collaborator

The sync is there to update spatial and temporal coverage for a dataset. Everything else should be available immediately. So in the scenario you described, searching for the dataset by its name should work right away, but searching by a specific spatial extent might not return the newest results until the sync has run.

As for adding a crontab, I just want to clarify: do you want me to spin up a crontab container from the existing docker-compose.yml (https://github.com/mintproject/MINT-Data-Catalog/blob/master/docker-compose.yml)? Or do you have a separate crontab container that you use to manage all the running services on node1?

@mosoriob
Contributor Author

mosoriob commented May 6, 2021

I prefer a crontab container because then all the services stay in the same place.
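
For reference, a minimal sketch of what that service might look like in the existing docker-compose.yml, assuming the Dockerfile above lives in a cron/ directory; the service names, build path, port, and flask_app URL below are illustrative assumptions, not taken from the actual file:

# hypothetical cron service added to docker-compose.yml
services:
  cron:
    build: ./cron                    # assumed directory containing the Dockerfile and config/cronjobs above
    environment:
      flask_app: "http://api:7000"   # assumed internal hostname/port of the data catalog API
    depends_on:
      - api                          # assumed name of the existing API service in this compose file
    restart: unless-stopped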
