Implement a recurring harvest #5

Open
BeritJanssen opened this issue Dec 12, 2024 · 0 comments
BeritJanssen commented Dec 12, 2024

This repo was originally started for, among others, the Peace Portal corpora. From that side there was a request to regularly check the resources that publish funerary inscriptions for newly added inscriptions. I was thinking of setting this up as follows:

  • add Elasticsearch dependency to this repository
  • let the scraper retrieve ids of all (newly) available documents and compare their ids against ids found in existing indices
  • scrape the documents whose ids are not yet in the indices
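
The comparison step above could be sketched roughly as follows. This is only a minimal illustration, not an existing part of this repository: the names `find_new_ids`, `harvest`, `list_available_ids`, `list_indexed_ids` and `scrape_document` are all hypothetical, and the way the existing ids are read from the Elasticsearch indices is deliberately left to a caller-supplied function.

```python
from typing import Callable, Iterable, List


def find_new_ids(available_ids: Iterable[str], indexed_ids: Iterable[str]) -> List[str]:
    """Return ids available at the source but not yet indexed, in source order."""
    known = set(indexed_ids)
    seen = set()
    new_ids = []
    for doc_id in available_ids:
        # Skip ids that are already indexed, and duplicates within the source listing.
        if doc_id in known or doc_id in seen:
            continue
        seen.add(doc_id)
        new_ids.append(doc_id)
    return new_ids


def harvest(
    list_available_ids: Callable[[], Iterable[str]],
    list_indexed_ids: Callable[[], Iterable[str]],
    scrape_document: Callable[[str], dict],
) -> List[dict]:
    """Scrape only the documents whose ids are not yet in the indices.

    All three callables are assumptions for this sketch:
    - list_available_ids: asks the source website for its current ids
    - list_indexed_ids: reads the ids already stored in the indices
    - scrape_document: scrapes one document by id
    """
    new_ids = find_new_ids(list_available_ids(), list_indexed_ids())
    return [scrape_document(doc_id) for doc_id in new_ids]
```

Keeping the id comparison in a pure function means it can be tested without a running Elasticsearch instance; only `list_indexed_ids` would need the new dependency.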

Preferably, this would run on a dedicated server; alternatively, we could make use of Kubernetes. I'd like to work on this in the first half of 2025, so I'm not sure whether the latter option will be available.

This would need to run at regular intervals, e.g., every few months. What would be the best way to achieve this? We could add a cron job on the server itself, or we could use a self-hosted GitHub Actions runner and trigger the harvest via GitHub Actions. Any thoughts, @gdamaskos @tymees @bartbouter @falconburrow?
