Support discovery/indexing datasets from forgejo-aneksajo instances #47
Comments
Another instance: https://atris.fz-juelich.de/ Since Forgejo is a descendant of Gitea, which is a descendant of Gogs, I wouldn't be surprised if the existing code for GIN can be easily adapted to index forgejo-aneksajo instances as well. |
Indeed. This could likely start out like what we already have for GIN: getting all of them, but maybe adding some extra sensing/filtering later on to blacklist/not consider some. |
Whereas GIN really had data sharing in mind, I think forgejo-aneksajo instances are growing into more generic "alternatives to GitHub". So if we are to implement their indexing, we might want to add some kind of "filtering" based on sensing whether they have e.g. |
@yarikoptic What instance(s) should be tracked? |
could start from the first but ideally all 3, since all "related" and should contain datalad or at least git-annex repos. |
@yarikoptic I can't find any documentation on forgejo-aneksajo's or hub.datalad.org's supported search query syntax. There also doesn't seem to be a way to register on hub.datalad.org so that I can get an API key. |
@yarikoptic How exactly do you want me to query forgejo-aneksajo instances for Datalad repositories? What data should I collect? |
|
@yarikoptic It looks like listing all the repos via Do you want the hub.datalad.org repos to only be queried/fetched/updated once a week like with GIN?
Do you want this or not? |
I cannot say what is wanted or not. But not every repository will be a dataset. In any case, the results will be dominated by thousands of HCP (sub)datasets. |
Yes
How do you see that potentially being done? I do not think following up with a separate query per repo would be scalable.
That's ok -- at this stage we already index non-datasets and non-git-annex repos. But then at the https://registry.datalad.org/overview/ stage we have the ability to filter as desired and get stats on overall git-annex usage.
That's ok. We already have thousands from https://github.com/dandizarrs and alike. They do get grouped by the "organization" so should be ok, e.g. |
A query per repo is kind of the only way to do it. One other possibility would be to only make this query when each repository is first added to the database, on the assumption that the result won't change, but we'd still end up doing the query for every old, non-git-annex repository every time the update is run (not to mention that we'd need to query everything on the first run). |
@jwodder Greedy me doesn't want to miss some repo which becomes git-annex'ified, so for now let's just get them all as we do for GIN. Longer term - I think we would need to look into seeing API extended to allow for desired queries/limits. |
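The "get them all" approach agreed on here could be sketched roughly as follows. This is only an illustration under assumptions: `fetch_page` is a hypothetical callable standing in for whatever paginated listing request the instance offers (the thread does not name it), and `iter_all_repos` is not a function from the dashboard code.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical signature: given a 1-based page number, return that page's
# repository records, or an empty list once the listing is exhausted.
FetchPage = Callable[[int], List[Dict]]


def iter_all_repos(fetch_page: FetchPage, max_pages: int = 10_000) -> Iterator[Dict]:
    """Yield every repository record, page by page, until an empty page."""
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch


# Usage with canned pages instead of a live instance.
pages = {
    1: [{"full_name": "www/datalad-blog"}],
    2: [{"full_name": "jsheunis/annotate-trr379-jsh"}],
}
names = [repo["full_name"] for repo in iter_all_repos(lambda p: pages.get(p, []))]
```

Keeping the page fetcher injectable means the same loop can be reused for GIN and for forgejo-aneksajo instances, and any later filtering can be applied to the yielded records.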
@yarikoptic I'm implementing hub.datalad.org support using the same code as for GIN, but the GIN code already makes a HEAD request to |
Indeed, HEAD seems not to be supported by Forgejo for some reason. Let's do a GET, but with a Range header so that we do not bother fetching much either:
❯ curl -I -G -H "Authorization: token ${DLHUB_RO_TOKEN}" -H "Range: bytes=0-9" https://hub.datalad.org/api/v1/repos/www/datalad-blog/raw/main/.datalad/config
HTTP/2 405
allow: GET
alt-svc: h3=":443"; ma=2592000
cache-control: max-age=0, private, must-revalidate, no-transform
date: Wed, 19 Feb 2025 13:59:03 GMT
server: Caddy
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
❯ curl -G -H "Authorization: token ${DLHUB_RO_TOKEN}" -H "Range: bytes=0-9" https://hub.datalad.org/api/v1/repos/www/datalad-blog/raw/main/.datalad/config
[datalad "%
I did not find an existing issue, so I filed a new one: https://codeberg.org/forgejo/forgejo/issues/6992 |
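For illustration, the ranged GET from the curl commands above might be built like this in Python. It is a minimal sketch using only the standard library; the function name `config_probe_request` and its parameters are hypothetical, the URL layout is copied from the curl example, and the token is optional so the same request can be sent unauthenticated.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote


def config_probe_request(
    base_url: str,
    owner: str,
    repo: str,
    branch: str,
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET for .datalad/config that asks for only the first 10 bytes."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/repos/"
        f"{quote(owner)}/{quote(repo)}/raw/{quote(branch)}/.datalad/config"
    )
    # Same Range value as in the curl example: bytes 0 through 9.
    headers = {"Range": "bytes=0-9"}
    if token:
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers, method="GET")


# Usage: construct the request without sending it.
req = config_probe_request("https://hub.datalad.org", "www", "datalad-blog", "main")
```

Passing `req` to `urllib.request.urlopen` would send it; note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the status code has to be read from the exception in those cases.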
FWIW you can also use the non-API raw endpoints with a HEAD request, i.e.:
|
@matrss Are you a developer of hub.datalad.org? It seems that all requests to https://hub.datalad.org/api/v1/repos/jsheunis/annotate-trr379-jsh/raw/main/.datalad/config get a 500 error. |
@yarikoptic Problem: As I stated in the previous comment, one of the requests for |
I am the current maintainer of forgejo-aneksajo, which should mostly be what is running on hub.datalad.org. I can reproduce that 500 error on codeberg as well. The cause is that the repository is completely empty, so some error is expected anyway. As soon as the repository is initialized with anything that endpoint will start returning a 404 (as long as the .datalad/config file doesn't yet exist). I agree that a 500 is not really a fitting status code to return here, but maybe you can just treat anything other than 200 as failure and "not a DataLad dataset"? |
FTR: for GIN we do get a 404:
❯ curl -I -G -H "Authorization: token ${GIN_TOKEN}" https://gin.g-node.org/api/v1//repos/yarikoptic/testempty/raw/lkajsdf/.datalad/configs
HTTP/1.1 404 Not Found
Date: Wed, 19 Feb 2025 16:47:51 GMT
Server: Apache/2.4.62 (Debian)
So let's treat 500 for Forgejo as an indicator of absence for now. @matrss - would you be so kind as to submit a fix? ;-) |
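The rule discussed above (anything that is not a success means "not a DataLad dataset") could be expressed as a small helper. This is a sketch, not the dashboard's actual code: the name `has_datalad_config` is hypothetical, and accepting 206 alongside 200 is an assumption made here because a server that honours the Range header may answer with Partial Content.

```python
def has_datalad_config(status: int) -> bool:
    """Interpret the HTTP status of a probe for .datalad/config.

    200 means the file was returned in full; 206 (assumption) means the
    server honoured the Range header. Everything else, including the 404
    seen on GIN and the 500 seen on Forgejo for empty repositories, is
    treated as "file absent".
    """
    return status in (200, 206)


# Usage with the status codes that came up in this thread.
results = {code: has_datalad_config(code) for code in (200, 206, 404, 405, 500)}
```

Treating every other status as absence keeps the indexer independent of whether a given instance already returns 404 instead of 500 for empty repositories.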
@yarikoptic https://atris.fz-juelich.de/ and https://hub.trr379.de/ don't have sign-up options, so I can't get API tokens for them. How should we handle this? |
Upon some behind the curtain discussion - let's skip those for now. |
I might be missing something obvious, but why do you even need an API token? As the admin of ATRIS I could give you a bot account for this purpose, but I think all the relevant API endpoints should not require authentication to index public repositories. Using an account for these requests would also add repositories with "internal" visibility to the results, which might actually be undesirable to index as they are at least not fully public. |
Words of wisdom @matrss and thank you for the offer -- frankly, I thought that API access is either entirely forbidden or rate limited (like github does) for non-authenticated access. But indeed, API and And thank you for chiming in, so it would be ok with you if we query/index public repos on ATRIS? |
- Some endpoints (`/api/v1/repos/*/*/raw`, `/api/v1/repos/*/*/media`, ...; anything that uses both `context.ReferencesGitRepo()` and `context.RepoRefForAPI` really) returned a 500 when the repository was completely empty. This resulted in some confusion in datalad/datalad-usage-dashboard#47 because the same request for a non-existent file in a repository could sometimes generate a 404 and sometimes a 500, depending on if the git repository is initialized at all or not. Returning a 404 is more appropriate here, since this isn't an unexpected internal error, but just another way of not finding the requested data. Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/7003 Reviewed-by: Gusted <[email protected]> Co-authored-by: Matthias Riße <[email protected]> Co-committed-by: Matthias Riße <[email protected]>
**Backport:** https://codeberg.org/forgejo/forgejo/pulls/7003 Some endpoints (`/api/v1/repos/*/*/raw`, `/api/v1/repos/*/*/media`, ...; anything that uses both `context.ReferencesGitRepo()` and `context.RepoRefForAPI` really) returned a 500 when the repository was completely empty. This resulted in some confusion in datalad/datalad-usage-dashboard#47 because the same request for a non-existent file in a repository could sometimes generate a 404 and sometimes a 500, depending on if the git repository is initialized at all or not. Returning a 404 seems more appropriate here, since this isn't an unexpected internal error, but just another way of not finding the requested data. Co-authored-by: Matthias Riße <[email protected]> Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/7014 Reviewed-by: Gusted <[email protected]> Co-authored-by: forgejo-backport-action <[email protected]> Co-committed-by: forgejo-backport-action <[email protected]>
Yes! |
https://codeberg.org/matrss/forgejo-aneksajo
known instances:
at worst could be all repositories similar to GIN I guess.
Related (also to integrate/use API, has pointers):