LLM scraper #971

Merged
merged 44 commits into from
Jul 18, 2024

Conversation

Contributor

@mikesndrs mikesndrs commented May 17, 2024

Summary of changes
Added functionality for using automated LLM scraping and metadata processing

Motivation and context
Robustness, metadata parsing, (potential) deduplication, collection management

Checklist

  • I have read and followed the CONTRIBUTING guide.
  • I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree
    to license it to the TeSS codebase under the
    BSD license.

@mikesndrs mikesndrs changed the title Draft: LLM scraper LLM scraper Jun 6, 2024
Member

@fbacall fbacall left a comment

Some initial comments. Didn't have a lot of time last week to look at this due to the ELIXIR All-Hands meeting

Comment on lines 14 to 16
add_reference :events, :llm_object, foreign_key: true
add_column :events, :open_science, :string, array: true, default: []
add_column :materials, :llm_processed, :bool, default: false
Member

Are these fields needed?

Contributor Author

Removed materials for now.

lib/ingestors/ingestor.rb (outdated, resolved)
lib/ingestors/ingestor.rb (outdated, resolved)
lib/modules/llm_service.rb (outdated, resolved)
lib/modules/llm_service.rb (outdated, resolved)
config/application.rb (outdated, resolved)
lib/modules/chatgpt_service.rb (outdated, resolved)
lib/tasks/tess.rake (outdated, resolved)
lib/ingestors/llm_ingestor.rb (outdated, resolved)
def post_process_func(event)
response = process(event)
event = unload_json(event, response)
event.llm_object_attributes = llm_object_attributes
Member

This overwrites the data from the "scrape" LLM interaction, right? Is there ever a need to retain the data for both?

Contributor Author

Currently not, I think. I'd still like to keep the data for the scrape in case we ever decide not to do post-processing for whatever reason.
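
To make the overwrite question concrete, here is a minimal sketch of the two behaviours being discussed. Everything in it is an assumption for illustration: `Event` is a plain `Struct` stand-in rather than the TeSS model, and the `llm_history` field, the `overwrite` and `record` helpers, and the `:scrape` / `:post_process` stage keys are hypothetical names that do not appear in this PR.

```ruby
# Hypothetical stand-in for the event model; not the TeSS Event class.
Event = Struct.new(:title, :llm_object_attributes, :llm_history)

# Single-slot assignment, as in the snippet under review: the last write wins.
def overwrite(event, attributes)
  event.llm_object_attributes = attributes
  event
end

# Hypothetical alternative: also keep each interaction under a stage key.
def record(event, stage, attributes)
  event.llm_history ||= {}
  event.llm_history[stage] = attributes
  event.llm_object_attributes = attributes
  event
end

event = Event.new("Example workshop")
overwrite(event, { prompt: "scrape" })
overwrite(event, { prompt: "post_process" })
# Only the post-process attributes remain on event.llm_object_attributes.

tracked = Event.new("Example workshop")
record(tracked, :scrape, { prompt: "scrape" })
record(tracked, :post_process, { prompt: "post_process" })
# tracked.llm_history now holds both the :scrape and :post_process entries.
```

With the single slot, the scrape data survives only when post-processing is skipped, which matches the reply above. Keeping both would need extra storage along the lines of the hypothetical `llm_history`.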

@mikesndrs mikesndrs requested a review from fbacall June 21, 2024 14:42
Member

fbacall commented Jun 28, 2024

Will try and review this next week. Had a busy week due to various projects wrapping up at the end of June

db/schema.rb (outdated, resolved)
lib/tasks/tess.rake (resolved)
lib/tasks/tess.rake (outdated, resolved)
lib/tasks/tess.rake (outdated, resolved)
lib/tasks/tess.rake (outdated, resolved)
lib/tasks/tess.rake (outdated, resolved)
@@ -233,22 +235,22 @@

def mock_biotools
biotools_file = File.read("#{Rails.root}/test/fixtures/files/annotation.json")
WebMock.stub_request(:get, /data.bioontology.org/).
to_return(:status => 200, :headers => {}, :body => biotools_file)
WebMock.stub_request(:get, /data.bioontology.org/)

Check failure

Code scanning / CodeQL

Incomplete regular expression for hostnames High test

This regular expression has an unescaped '.' before 'bioontology.org', so it might match more hosts than expected.

WebMock.stub_request(:get, /nominatim.openstreetmap.org/).
to_return(:status => 200, :headers => {}, :body => nominatim_file)
WebMock.stub_request(:get, /nominatim.openstreetmap.org/)

Check failure

Code scanning / CodeQL

Incomplete regular expression for hostnames High test

This regular expression has an unescaped '.' before 'openstreetmap.org', so it might match more hosts than expected.
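
For context on what the alert means, the sketch below compares the pattern from the diff with a stricter one. It is an illustration only: the `STRICT` pattern and the example URLs are assumptions, not the fix applied in this PR (the alerts were dismissed), and it uses plain `Regexp` matching rather than WebMock so that it runs on its own.

```ruby
# Pattern as written in the diff: each unescaped '.' matches any character.
LOOSE = /data.bioontology.org/

# Hypothetical stricter alternative (an assumption, not the merged code):
# dots escaped, scheme anchored, and the host must be followed by
# '/', ':', '?' or the end of the string.
STRICT = %r{\Ahttps?://data\.bioontology\.org(?:[/:?]|\z)}

# Illustrative URLs; only the first uses the real hostname from the diff.
URLS = [
  "https://data.bioontology.org/annotator",
  "https://dataXbioontology.org/annotator",
  "https://data.bioontology.org.evil.example/"
].freeze

URLS.each do |url|
  puts format("%-45s loose=%-5s strict=%s", url, LOOSE.match?(url), STRICT.match?(url))
end
```

The loose pattern matches all three URLs, while the strict one matches only the first. In a test helper that stubs a fixed set of requests the practical risk is small, which is presumably why the alerts could be dismissed. `Regexp.escape("data.bioontology.org")` is another way to produce the escaped host string.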
test/test_helper.rb (alert dismissed)
test/test_helper.rb (alert dismissed)
@mikesndrs mikesndrs requested a review from fbacall July 12, 2024 09:21
@fbacall fbacall merged commit 6cd1508 into ElixirTeSS:master Jul 18, 2024
7 checks passed
2 participants