Detect/Extract Project Website Information #32

Open
Max-at-Vlaanderen opened this issue Jul 12, 2024 · 0 comments

To further enrich the metadata, project websites can be a good source of clearly communicated information. Our currently harvested data can already contain references to these websites, which makes it possible to locate a project website more accurately than by querying Google. Supporting this requires further development of the Harvester.

Desired Behavior:

  1. Detection of URLs: Implement functionality to detect URLs in the metadata that link to project websites.
  2. URL Following: Automatically follow these URLs to access the linked websites.
  3. Detection of Project Websites: Not all URLs will lead to a project website, so a way to distinguish project websites from other sites will be necessary.
  4. Scraping Additional Information: Extract additional text from these project websites.
  5. Store URLs
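
As a starting point for discussion, the steps above could be sketched roughly as follows, using only the Python standard library. Everything here is an assumption rather than existing Harvester code: the function names, the `NON_PROJECT_HOSTS` blocklist, and the name-matching heuristic for step 3 are hypothetical placeholders, and a real implementation would likely need a more robust classifier as well as rate limiting and robots.txt handling when following URLs.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen

# Matches http(s) URLs up to the next whitespace or common delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

# Hypothetical blocklist: hosts unlikely to be a project's own website.
NON_PROJECT_HOSTS = {"doi.org", "github.com", "twitter.com", "linkedin.com"}


def find_urls(metadata_text):
    """Step 1: return the unique URLs in a metadata string, in order of appearance."""
    seen = {}
    for match in URL_PATTERN.findall(metadata_text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        seen.setdefault(url, None)
    return list(seen)


def fetch_html(url, timeout=10):
    """Step 2: follow a URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def is_candidate_project_site(url, project_name):
    """Step 3 (rough heuristic): host is not blocklisted and the URL contains the project name."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in NON_PROJECT_HOSTS:
        return False
    slug = re.sub(r"[^a-z0-9]", "", project_name.lower())
    return bool(slug) and slug in re.sub(r"[^a-z0-9]", "", url.lower())


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Step 4: reduce an HTML page to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

In this sketch, step 5 would amount to persisting the URLs for which `is_candidate_project_site` returns `True` alongside the harvested record. How and where they are stored depends on the Harvester's data model, so it is left open here.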