Support discovery/indexing datasets from forgejo-aneksajo instances #47

Open
yarikoptic opened this issue Sep 13, 2024 · 25 comments

@yarikoptic
Member

yarikoptic commented Sep 13, 2024

https://codeberg.org/matrss/forgejo-aneksajo

known instances:

At worst, we could index all repositories, similar to GIN, I guess.

Related (also to integrate/use the API, has pointers):

@matrss

matrss commented Nov 8, 2024

Another instance: https://atris.fz-juelich.de/

Since Forgejo is a descendant of Gitea, which is a descendant of Gogs, I wouldn't be surprised if the existing code for GIN can be easily adapted to index forgejo-aneksajo instances as well.

@yarikoptic
Member Author

Indeed. We could likely start with what we already have for GIN -- fetching all of them -- and maybe add some extra sensing/filtering later on to blacklist/not consider some.

@yarikoptic added the "good first issue" label Nov 12, 2024
@yarikoptic
Member Author

Whereas GIN really had data sharing in mind, I think forgejo-aneksajo instances are growing into more generic "alternatives to GitHub". So if we are to implement their indexing, we might want to add some kind of "filtering" based on sensing whether they have e.g. a git-annex branch.

@jwodder
Member

jwodder commented Feb 13, 2025

@yarikoptic What instance(s) should be tracked?

@yarikoptic
Member Author

We could start from the first, but ideally all 3, since all are "related" and should contain datalad or at least git-annex repos.

@jwodder
Member

jwodder commented Feb 14, 2025

@yarikoptic I can't find any documentation on forgejo-aneksajo's or hub.datalad.org's supported search query syntax. There also doesn't seem to be a way to register on hub.datalad.org so that I can get an API key.

@jwodder
Member

jwodder commented Feb 14, 2025

@yarikoptic How exactly do you want me to query forgejo-aneksajo instances for Datalad repositories? What data should I collect?

@yarikoptic
Member Author

@jwodder
Member

jwodder commented Feb 17, 2025

@yarikoptic It looks like listing all the repos via /repos/search like with GIN is the only real option here.

Do you want the hub.datalad.org repos to only be queried/fetched/updated once a week like with GIN?

we might want to add some kind of "filtering" based on sensing on whether they have e.g. git-annex branch.

Do you want this or not?
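
The "list everything via /repos/search" approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: it assumes the search endpoint is paginated with a 1-based page number and wraps its results in a "data" list, and `fetch_page`, `iter_all_repos` and the canned pages are hypothetical names used only so the sketch runs offline.

```python
from typing import Callable, Iterator

# Hypothetical type: a function that takes a 1-based page number and returns
# the decoded JSON body of a GET to {instance}/api/v1/repos/search for that page.
FetchPage = Callable[[int], dict]


def iter_all_repos(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every repository entry, walking pages until one comes back empty."""
    page = 1
    while True:
        body = fetch_page(page)
        # Assumption: results are wrapped in a "data" list.
        repos = body.get("data") or []
        if not repos:
            return
        yield from repos
        page += 1


# Stand-in for the real HTTP call, so the sketch needs no network access.
_FAKE_PAGES = {
    1: {"ok": True, "data": [{"full_name": "www/datalad-blog"}]},
    2: {"ok": True, "data": [{"full_name": "jsheunis/annotate-trr379-jsh"}]},
}


def fake_fetch(page: int) -> dict:
    return _FAKE_PAGES.get(page, {"ok": True, "data": []})


names = [repo["full_name"] for repo in iter_all_repos(fake_fetch)]
```

Passing the fetcher in as a callable keeps the paging logic independent of the HTTP client and of whether a token is sent.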

@mih
Member

mih commented Feb 17, 2025

I cannot say what is wanted or not. But not every repository will be a dataset.

In any case, the results will be dominated by thousands of HCP (sub)datasets.

@yarikoptic
Member Author

@jwodder

@yarikoptic It looks like listing all the repos via /repos/search like with GIN is the only real option here.

Do you want the hub.datalad.org repos to only be queried/fetched/updated once a week like with GIN?

Yes

we might want to add some kind of "filtering" based on sensing on whether they have e.g. git-annex branch.

Do you want this or not?

How do you see that potentially being done? I do not think following up with a separate query per repo would be scalable.

@mih

I cannot say what is wanted or not. But not every repository will be a dataset.

In any case, the results will be dominated by thousands of HCP (sub)datasets.

That's ok -- at this stage we already index non-datasets and non-git-annex repos. But then, at the https://registry.datalad.org/overview/ stage, we have the ability to filter as desired and get stats on overall git-annex usage.

In any case, the results will be dominated by thousands of HCP (sub)datasets.

That's ok. We already have thousands from https://github.com/dandizarrs and the like. They do get grouped by the "organization" so it should be ok, e.g.

Image

@jwodder
Member

jwodder commented Feb 17, 2025

@yarikoptic

How do you see that potentially being done? I do not think following up with a separate query per repo would be scalable.

A query per repo is kind of the only way to do it. One other possibility would be to only make this query when each repository is first added to the database, on the assumption that the result won't change, but we'd still end up doing the query for every old, non-git-annex repository every time the update is run (not to mention that we'd need to query everything on the first run).

@yarikoptic
Member Author

@jwodder Greedy me doesn't want to miss some repo which becomes git-annex'ified, so for now let's just get them all as we do for GIN. Longer term, I think we would need to look into getting the API extended to allow for the desired queries/limits.

@jwodder
Member

jwodder commented Feb 18, 2025

@yarikoptic I'm implementing hub.datalad.org support using the same code as for GIN, but the GIN code already makes a HEAD request to /repos/{repo}/raw/{defbranch}/.datalad/config for each repo returned, and for some reason, the hub.datalad.org API doesn't accept HEAD requests to this endpoint. Should I change the method to GET, skip the check only for hub.datalad.org, or do something else?

@yarikoptic
Member Author

Indeed, HEAD seems not to be supported by forgejo for some reason... let's do GET, but also with Range so we do not bother fetching much either:
❯ curl -I -G -H "Authorization: token ${DLHUB_RO_TOKEN}" -H "Range: bytes=0-9" https://hub.datalad.org/api/v1/repos/www/datalad-blog/raw/main/.datalad/config
HTTP/2 405 
allow: GET
alt-svc: h3=":443"; ma=2592000
cache-control: max-age=0, private, must-revalidate, no-transform
date: Wed, 19 Feb 2025 13:59:03 GMT
server: Caddy
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN

❯ curl -G -H "Authorization: token ${DLHUB_RO_TOKEN}" -H "Range: bytes=0-9" https://hub.datalad.org/api/v1/repos/www/datalad-blog/raw/main/.datalad/config
[datalad "

did not find an existing issue, so filed a new one https://codeberg.org/forgejo/forgejo/issues/6992
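
The "GET with Range" idea from the curl commands above could look roughly like this in Python. It is a sketch using only the standard library, not the dashboard's actual code: `config_probe_request` is a hypothetical helper, and the optional `token` parameter is an assumption that mirrors the `Authorization: token ...` header used in the curl calls. The request is only constructed here, not sent.

```python
import urllib.request
from typing import Optional


def config_probe_request(
    api_base: str, repo: str, branch: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build a GET request asking for only the first 10 bytes of .datalad/config."""
    url = f"{api_base}/repos/{repo}/raw/{branch}/.datalad/config"
    # A Range header keeps the response small, since HEAD is answered with 405.
    headers = {"Range": "bytes=0-9"}
    if token:
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers, method="GET")


req = config_probe_request(
    "https://hub.datalad.org/api/v1", "www/datalad-blog", "main"
)
# Sending it would be: urllib.request.urlopen(req), which needs network access.
```

A server that honors the Range header may answer with a partial-content status rather than 200, so whatever interprets the response should allow for both.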

@matrss

matrss commented Feb 19, 2025

FWIW you can also use the non-API raw endpoints with a HEAD request, i.e.:

curl --head https://hub.datalad.org/www/datalad-blog/raw/branch/main/.datalad/config
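
For comparison, the two URL shapes seen so far in this thread (the API raw endpoint and the non-API raw endpoint) can be written as small helpers. The function names are hypothetical; the URL patterns are taken directly from the curl commands above.

```python
def api_raw_url(base: str, owner: str, repo: str, branch: str, path: str) -> str:
    """API raw endpoint, as used in the GET-with-Range curl example."""
    return f"{base}/api/v1/repos/{owner}/{repo}/raw/{branch}/{path}"


def web_raw_url(base: str, owner: str, repo: str, branch: str, path: str) -> str:
    """Non-API raw endpoint, which accepts HEAD; note the extra 'branch' segment."""
    return f"{base}/{owner}/{repo}/raw/branch/{branch}/{path}"


api_url = api_raw_url(
    "https://hub.datalad.org", "www", "datalad-blog", "main", ".datalad/config"
)
web_url = web_raw_url(
    "https://hub.datalad.org", "www", "datalad-blog", "main", ".datalad/config"
)
```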

@jwodder
Member

jwodder commented Feb 19, 2025

@matrss Are you a developer of hub.datalad.org? It seems that all requests to https://hub.datalad.org/api/v1/repos/jsheunis/annotate-trr379-jsh/raw/main/.datalad/config get a 500 error.

@jwodder
Member

jwodder commented Feb 19, 2025

@yarikoptic Problem: As I stated in the previous comment, one of the requests for .datalad/config for a specific repo on hub.datalad.org is consistently returning 500. While @matrss's URL succeeds for this repository, it cannot be used with gin.g-node.org. What should be done about this?

@matrss

matrss commented Feb 19, 2025

@matrss Are you a developer of hub.datalad.org? It seems that all requests to https://hub.datalad.org/api/v1/repos/jsheunis/annotate-trr379-jsh/raw/main/.datalad/config get a 500 error.

I am the current maintainer of forgejo-aneksajo, which should mostly be what is running on hub.datalad.org.

I can reproduce that 500 error on codeberg as well. The cause is that the repository is completely empty, so some error is expected anyway. As soon as the repository is initialized with anything that endpoint will start returning a 404 (as long as the .datalad/config file doesn't yet exist).

I agree that a 500 is not really a fitting status code to return here, but maybe you can just treat anything other than 200 as failure and "not a DataLad dataset"?
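
The "anything other than 200 means not a DataLad dataset" rule suggested above is simple to express. A minimal sketch, with a hypothetical function name; accepting 206 as well is an assumption, added because a GET that carries a Range header may be answered with Partial Content instead of 200.

```python
from http import HTTPStatus


def looks_like_datalad_dataset(status: int) -> bool:
    """Treat only a successful fetch of .datalad/config as a positive signal.

    404 (file missing), 500 (seen for completely empty repositories), 405
    and any other status are all read as "not a DataLad dataset".
    """
    return status in (HTTPStatus.OK, HTTPStatus.PARTIAL_CONTENT)


results = {code: looks_like_datalad_dataset(code) for code in (200, 206, 404, 405, 500)}
```

Treating every failure the same way means a fix to the 500 behaviour on the server side would not require any change on the client.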

@yarikoptic
Member Author

FTR: for the gin -- we do get 404

❯ curl -I -G -H "Authorization: token ${GIN_TOKEN}" https://gin.g-node.org/api/v1//repos/yarikoptic/testempty/raw/lkajsdf/.datalad/configs
HTTP/1.1 404 Not Found
Date: Wed, 19 Feb 2025 16:47:51 GMT
Server: Apache/2.4.62 (Debian)

so let's treat 500 for forgejo as an indicator of absence for now. @matrss - would you be so kind to submit a fix? ;-)

@jwodder
Member

jwodder commented Feb 19, 2025

@yarikoptic https://atris.fz-juelich.de/ and https://hub.trr379.de/ don't have sign-up options, so I can't get API tokens for them. How should we handle this?

@yarikoptic
Member Author

Upon some behind the curtain discussion - let's skip those for now.

@matrss

matrss commented Feb 20, 2025

@yarikoptic https://atris.fz-juelich.de/ and https://hub.trr379.de/ don't have sign-up options, so I can't get API tokens for them. How should we handle this?

I might be missing something obvious, but why do you even need an API token?

As the admin of ATRIS I could give you a bot account for this purpose, but I think all the relevant API endpoints should not require authentication to index public repositories. Using an account for these requests would also add repositories with "internal" visibility to the results, which might actually be undesirable to index as they are at least not fully public.

@yarikoptic
Member Author

Words of wisdom @matrss and thank you for the offer -- frankly, I thought that API access was either entirely forbidden or rate limited (like GitHub does) for non-authenticated access. But indeed, the API, or at least the /search endpoint, is available without authentication. Maybe @jwodder could shed more light.

And thank you for chiming in, so it would be ok with you if we query/index public repos on ATRIS?

DennisRasey pushed a commit to DennisRasey/forgejo that referenced this issue Feb 21, 2025
- Some endpoints (`/api/v1/repos/*/*/raw`, `/api/v1/repos/*/*/media`, ...;
anything that uses both `context.ReferencesGitRepo()` and
`context.RepoRefForAPI` really) returned a 500 when the repository was
completely empty. This resulted in some confusion in
datalad/datalad-usage-dashboard#47 because the
same request for a non-existent file in a repository could sometimes
generate a 404 and sometimes a 500, depending on if the git repository
is initialized at all or not.

Returning a 404 is more appropriate here, since this isn't an
unexpected internal error, but just another way of not finding the
requested data.

Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/7003
Reviewed-by: Gusted <[email protected]>
Co-authored-by: Matthias Riße <[email protected]>
Co-committed-by: Matthias Riße <[email protected]>
DennisRasey pushed a commit to DennisRasey/forgejo that referenced this issue Feb 21, 2025
**Backport:** https://codeberg.org/forgejo/forgejo/pulls/7003

Some endpoints (`/api/v1/repos/*/*/raw`, `/api/v1/repos/*/*/media`, ...;
anything that uses both `context.ReferencesGitRepo()` and
`context.RepoRefForAPI` really) returned a 500 when the repository was
completely empty. This resulted in some confusion in
datalad/datalad-usage-dashboard#47 because the
same request for a non-existent file in a repository could sometimes
generate a 404 and sometimes a 500, depending on if the git repository
is initialized at all or not.

Returning a 404 seems more appropriate here, since this isn't an
unexpected internal error, but just another way of not finding the
requested data.

Co-authored-by: Matthias Riße <[email protected]>
Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/7014
Reviewed-by: Gusted <[email protected]>
Co-authored-by: forgejo-backport-action <[email protected]>
Co-committed-by: forgejo-backport-action <[email protected]>
@matrss

matrss commented Feb 21, 2025

And thank you for chiming in, so it would be ok with you if we query/index public repos on ATRIS?

Yes!

yarikoptic added a commit that referenced this issue Feb 21, 2025
Add support for hub.datalad.org