Annif: Find existing content in Scholar to be used as a training set #3

Open
Tracked by #14
hortongn opened this issue Jan 30, 2023 · 3 comments

@hortongn

hortongn commented Jan 30, 2023

Create a list of works and/or collections in Scholar that have embedded text and existing metadata. We want a good mix of examples: different file types, and files with minimal metadata as well as well-described files.

It is probably best to start with works that have only one file attached, to avoid confusion.

https://github.com/NatLibFi/Annif-corpora
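
A minimal sketch of the selection step described above, assuming work records have already been exported from Scholar as plain dictionaries. The field names (`id`, `files`, `text`, `subjects`) and the helpers `select_works` and `build_corpus` are hypothetical, not Scholar's or Hyrax's actual schema. The output layout (one `.txt` file per document plus a `.tsv` file of `<URI>`, tab, label lines) follows my understanding of Annif's full-text document corpus format and should be checked against the Annif documentation before use.

```python
from pathlib import Path


def select_works(works):
    """Keep works with exactly one attached file, non-empty text, and at least one subject."""
    return [
        w for w in works
        if len(w.get("files", [])) == 1
        and w.get("text", "").strip()
        and w.get("subjects")
    ]


def build_corpus(works, out_dir):
    """Write <id>.txt and <id>.tsv for each selected work; return the ids written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for work in select_works(works):
        # Full text of the document.
        (out / f"{work['id']}.txt").write_text(work["text"], encoding="utf-8")
        # Subjects as (uri, label) pairs; the URIs are assumed to come from a vocabulary.
        lines = [f"<{uri}>\t{label}" for uri, label in work["subjects"]]
        (out / f"{work['id']}.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(work["id"])
    return written
```

Works with several files, no extracted text, or no subjects are skipped, which matches the suggestion to start with single-file works.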

@crowesn

crowesn commented Feb 9, 2023

I think this is the method used to get full text from files for indexing. It could be useful when building a dataset.

https://github.com/samvera/hyrax/blob/eb8d42d4fb99f8c7e2116af51b5642bf07312ce7/app/services/hyrax/file_set_derivatives_service.rb#L123

@hortongn

hortongn commented Feb 9, 2023

Question: Should we send the document itself to the AI, or have Scholar extract the full text and send that to the AI?

  • Do we want to tie this to Scholar or keep it portable?
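
One way to picture the portable option: the subject-suggestion step accepts plain text only, and anything Scholar-specific lives in a small adapter that produces that text. The sketch below is illustrative only. `TextSource`, `PreExtractedSource`, `PlainFileSource`, and `collect_texts` are hypothetical names, and the pre-extracted case stands in for text that Scholar has already pulled out of a file.

```python
from pathlib import Path
from typing import Iterable, Protocol


class TextSource(Protocol):
    """Anything that can hand back the full text of one document."""

    def full_text(self) -> str: ...


class PreExtractedSource:
    """Text already extracted elsewhere (for example, by the repository)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def full_text(self) -> str:
        return self._text


class PlainFileSource:
    """Reads a UTF-8 text file from disk; no repository is needed."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)

    def full_text(self) -> str:
        return self._path.read_text(encoding="utf-8")


def collect_texts(sources: Iterable[TextSource]) -> list[str]:
    """Return trimmed, non-empty texts, ready to pass to a subject-suggestion tool."""
    texts = (s.full_text().strip() for s in sources)
    return [t for t in texts if t]
```

With this shape, tying the work to Scholar means writing one more adapter, while the rest of the pipeline stays independent of where the text came from.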

@hortongn

Some Scholar collections that may have useful content for a training set:

CEAS Electrical Engineering and Computing Systems (EECS) Senior Design Projects
https://scholar.uc.edu/collections/x633f229p
(student works)

2017 CECH Information Technology Senior Design Projects
https://scholar.uc.edu/collections/bc3888732
(student works)

The Lucille M. Schultz 19th Century Composition Archive
https://scholar.uc.edu/collections/05741w32f
(documents)

Modernnati: Archiving & Preserving Cincinnati's Modernist Architecture
https://scholar.uc.edu/collections/9p290b783
(articles)

Nature of Black Holes
https://scholar.uc.edu/collections/t722hb29d
(articles)

Cincinnati Romance Review
https://scholar.uc.edu/collections/hd76s1380
(datasets???)

2019 Information Technology Research Symposium
https://scholar.uc.edu/collections/jq085m248
(articles)

@hortongn hortongn removed their assignment Mar 16, 2023
@hortongn hortongn changed the title Find existing content in Scholar to be used as a training set Annif: Find existing content in Scholar to be used as a training set Mar 16, 2023
Projects
Status: On Hold / Backlog