LLM scraper #971

mikesndrs · 2024-05-17T08:56:24Z

Summary of changes
Added functionality for using automated LLM scraping and metadata processing

Motivation and context
Robustness, metadata parsing, (potential) deduplication, collection management

Checklist

I have read and followed the CONTRIBUTING guide.
I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree to license it to the TeSS codebase under the BSD license.

fbacall

Some initial comments. Didn't have a lot of time last week to look at this due to the ELIXIR All-Hands meeting

fbacall · 2024-06-11T12:32:50Z

db/migrate/20240220144246_add_llm_check.rb

+    add_reference :events, :llm_object, foreign_key: true
+    add_column :events, :open_science, :string, array: true, default: []
+    add_column :materials, :llm_processed, :bool, default: false


Are these fields needed?

mikesndrs · Jun 19, 2024

Removed the materials column for now.

lib/ingestors/ingestor.rb

lib/modules/llm_service.rb

config/application.rb

lib/modules/chatgpt_service.rb

lib/tasks/tess.rake

lib/ingestors/llm_ingestor.rb

fbacall · 2024-06-17T12:18:27Z

lib/modules/llm_service.rb

+  def post_process_func(event)
+    response = process(event)
+    event = unload_json(event, response)
+    event.llm_object_attributes = llm_object_attributes


This overwrites the data from the "scrape" LLM interaction, right? Is there ever a need to retain the data for both?

mikesndrs · Jun 18, 2024

Not currently, I think. I'd still like to keep the scrape data in case we ever decide to skip post-processing for whatever reason.
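To make the overwrite in question concrete, here is a minimal plain-Ruby sketch. It assumes the event holds a single LLM record slot, as the `add_reference :events, :llm_object` migration suggests. `SketchEvent`, `SketchLlmObject` and the field names `scrape_or_process` and `output` are hypothetical stand-ins, not the TeSS models, and the real nested-attributes behaviour in Rails may differ in detail.

```ruby
# Hypothetical stand-ins, NOT the TeSS models; field names are assumptions.
SketchLlmObject = Struct.new(:scrape_or_process, :output, keyword_init: true)

class SketchEvent
  attr_reader :llm_object

  # Mimics a single-slot attributes writer: each assignment replaces the record.
  def llm_object_attributes=(attrs)
    @llm_object = SketchLlmObject.new(**attrs)
  end
end

event = SketchEvent.new

# First interaction: the scrape step stores its record on the event.
event.llm_object_attributes = { scrape_or_process: 'scrape', output: '{"title": "Raw"}' }

# Second interaction: post-processing assigns again and replaces the scrape record.
event.llm_object_attributes = { scrape_or_process: 'process', output: '{"title": "Clean"}' }

puts event.llm_object.scrape_or_process # prints "process"
```

With a single slot, only the most recent interaction survives. Retaining both would need something like a one-to-many association, which is outside what this thread settles.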

fbacall · 2024-06-28T11:33:46Z

Will try and review this next week. Had a busy week due to various projects wrapping up at the end of June

db/schema.rb

lib/tasks/tess.rake

test/test_helper.rb

@@ -233,22 +235,22 @@

  def mock_biotools
    biotools_file = File.read("#{Rails.root}/test/fixtures/files/annotation.json")
-    WebMock.stub_request(:get, /data.bioontology.org/).
-      to_return(:status => 200, :headers => {}, :body => biotools_file)
+    WebMock.stub_request(:get, /data.bioontology.org/)


test/test_helper.rb


-    WebMock.stub_request(:get, /nominatim.openstreetmap.org/).
-      to_return(:status => 200, :headers => {}, :body => nominatim_file)
+    WebMock.stub_request(:get, /nominatim.openstreetmap.org/)
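For context on the "incomplete regular expression for hostnames" alerts that appear later in this thread, the sketch below shows why host patterns like the ones in these stubs can be flagged. It assumes the alerts concern the unescaped dots. `LOOSE_HOST`, `STRICT_HOST` and the sample URLs are illustrative names only, and the code uses plain Ruby regexes rather than WebMock itself.

```ruby
# Pattern as written in the stub above: an unescaped "." matches any character.
LOOSE_HOST = /data.bioontology.org/

# Hypothetical stricter variant: Regexp.escape turns each "." into a literal dot.
STRICT_HOST = /#{Regexp.escape('data.bioontology.org')}/

real      = 'https://data.bioontology.org/annotator'
lookalike = 'https://dataXbioontology.org/annotator'

puts LOOSE_HOST.match?(real)        # true
puts LOOSE_HOST.match?(lookalike)   # true, the loose pattern also accepts the look-alike
puts STRICT_HOST.match?(real)       # true
puts STRICT_HOST.match?(lookalike)  # false
```

In a test helper that only stubs outgoing requests, the loose pattern is mostly harmless, which may be why the alerts were dismissed. Note that the stricter variant is still unanchored, so it would also match a longer host that merely contains the name.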


test/test_helper.rb

mikesndrs added 30 commits February 14, 2024 10:34

proof of concept

b8aca8c

add post processing

5e303b9

working start to finish gpt scraper

03d7b4f

working semi scraper

14132f0

merge origin

7a944ac

update migration

18e889e

fill llm object on scrape

210920a

post processing rake task

387ee35

llm_object model

b005528

fix curation visible name and build scalable class system for llm service modules

5d02d71

merge origin

a865a79

willma service

722f7a8

parse first json string from message

beaeb71

merge origgin

a982603

Merge remote-tracking branch 'origin/master' into feature/gpt_scraper

5da093c

cleanup

3bb223d

change gpt to llm in scraperee

f5ee2d4

merge origin

e741214

updated json extraction func

b4b6577

working scrape without llm_object

63880fc

working llm_object

fa92359

cleanup, post process still stops halfway

71a5d08

updated prompts

9dbbfbf

filter nonsense vars, add max_new_tokens

d5ad8df

migration fix

b87b310

working llm_ingestor subclass

8c8d651

test

c03978d

working test gpt

fcb4cd5

nil as default for fetch willma api key

4b4ca67

model test for llm_objects on events

7eae18c

mikesndrs added 3 commits June 6, 2024 10:41

remove accidental file

c32993a

Merge branch 'master' of github.com:ElixirTeSS/TeSS into feature/gpt_scraper

eed3690

fix tests

88ed51a

mikesndrs changed the title from "Draft: LLM scraper" to "LLM scraper" on Jun 6, 2024

fbacall reviewed Jun 17, 2024

mikesndrs added 5 commits June 18, 2024 13:09

handled most feedback

d6d0bcc

rescue standarderror instead of exception

79047ab

fix schema and reload llm module

d99575d

changed llm_objects fixture file name to llm_interactions

b381ffd

fix broken tests

096100f

mikesndrs requested a review from fbacall June 21, 2024 14:42

fbacall requested changes Jul 4, 2024

mikesndrs added 3 commits July 11, 2024 17:33

test cases for llm module

12725f4

updated schema

a3a94e0

merge origin

a1e944d

github-advanced-security bot found potential problems Jul 12, 2024

undo changes wrt incomplete regular expression for hostnames

3a460a8

github-advanced-security bot found potential problems Jul 12, 2024

test/test_helper.rb · Dismissed

test/test_helper.rb · Dismissed

mikesndrs added 2 commits July 12, 2024 09:22

fix language merge issue

8badfff

fix tests + add web request mocks in test helper

464cc8d

mikesndrs requested a review from fbacall July 12, 2024 09:21

fbacall approved these changes Jul 18, 2024

fbacall merged commit 6cd1508 into ElixirTeSS:master Jul 18, 2024
7 checks passed

fbacall left a comment

fbacall Jun 11, 2024

mikesndrs Jun 19, 2024

fbacall Jun 17, 2024

mikesndrs Jun 18, 2024

fbacall commented Jun 28, 2024
