
BBIQ push command & JSON testdata #50

Merged: 12 commits merged into metabrainz:master from bbiq-push-file on Aug 6, 2024

Conversation

kellnerd (Contributor) commented on Jul 22, 2024

The BBIQ CLI got a new command which lets you push the contents of a JSON file directly into the queue, which is useful for testing and debugging.
Especially with the upcoming entity import schema change in mind, I wanted to make the current BB importer setup more battle-proofed before I go and break everything again.

So I've produced a few files with test data which cover all currently available properties of the following entity types:

  • Author
  • Edition (requires an edition group BBID)
  • Edition Group
  • Publisher
  • Series
  • Work

I am going to mark this PR as a draft until I have tested the remaining entity types as much as possible (but without making changes which would have to be redone after the schema change).

The main issue which I discovered with these tools was the lack of support for annotations on imported entities; see the linked PRs for the individual issues.

kellnerd marked this pull request as ready for review on July 24, 2024, 21:25
kellnerd (Contributor, Author) commented on Jul 24, 2024

I think this is ready now; I don't want to get it stuck on the failing approval of imported editions and series. All other entity types can be imported, viewed, edited and approved.


For series imports, approval only fails at the SQL level; everything else should be working:

error: insert into "bookbrainz"."series" ("alias_set_id", "annotation_id", "bbid", "disambiguation_id", "entity_type", "identifier_set_id", "ordering_type_id", "revision_id") values ($1, $2, $3, $4, $5, $6, $7, $8) returning * - cannot insert into view "series"

I have no clue what is wrong here; such an insert statement works for the views of all other entity types which I could test. Only for series does it fail, via both code paths and directly in psql, so something about this view must be different from the rest?

Solution: The triggers for series were missing in the minimal DB; creating them solves the issue:

cat sql/scripts/create_triggers.sql | docker exec -i postgres psql -U bookbrainz -d bookbrainz_min
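For context on why the plain insert fails: the entity "tables" here are views over multiple underlying tables, and such views are not auto-updatable in PostgreSQL, so an INSERT only works when an INSTEAD OF trigger routes the row into the real tables. Below is a minimal sketch of that mechanism, using made-up table and column names rather than the actual definitions from create_triggers.sql:

```sql
-- Sketch with hypothetical names (not the real BookBrainz schema): a view
-- joining two tables is not auto-updatable, so inserting into it raises
-- "cannot insert into view" unless an INSTEAD OF trigger handles the row.
CREATE TABLE series_header (bbid uuid PRIMARY KEY);
CREATE TABLE series_data (bbid uuid REFERENCES series_header (bbid), name text);
CREATE VIEW series_view AS
    SELECT h.bbid, d.name
    FROM series_header h
    JOIN series_data d USING (bbid);

CREATE FUNCTION series_view_insert() RETURNS trigger AS $$
BEGIN
    INSERT INTO series_header (bbid) VALUES (NEW.bbid);
    INSERT INTO series_data (bbid, name) VALUES (NEW.bbid, NEW.name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER series_view_insert_trg
    INSTEAD OF INSERT ON series_view
    FOR EACH ROW EXECUTE FUNCTION series_view_insert();
```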

For edition imports, the approval fails during the automatic creation of an edition group (for both the "Approve" and the "Edit & Approve" action):

error: insert into "bookbrainz"."edition_group" ("alias_set_id", "author_credit_id", "bbid", "credit_section", "revision_id") values ($1, DEFAULT, $2, DEFAULT, $3) returning * - column "author_credit_id" of relation "edition_group" does not exist
From previous event: [...]
     at createEditionGroupForNewEdition (/home/bookbrainz/bookbrainz-site/node_modules/bookbrainz-data/lib/util.js:236:6)
     at autoCreateNewEditionGroup (/home/bookbrainz/bookbrainz-site/node_modules/bookbrainz-data/lib/models/entities/edition.js:41:31)
     at Child.<anonymous> (/home/bookbrainz/bookbrainz-site/node_modules/bookbrainz-data/lib/models/entities/edition.js:74:11)

Solution:

cat sql/migrations/2021-author-credits/up.sql | docker exec -i postgres psql -U bookbrainz -d bookbrainz_min
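If someone hits the same error on another minimal DB, here is a hypothetical sanity check (not part of this PR) to confirm whether the author credit migration has already been applied:

```sql
-- Hypothetical check: the 2021 author credit migration should make this column
-- visible on the edition_group relation; an empty result means the migration
-- above still needs to be applied.
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'bookbrainz'
  AND table_name = 'edition_group'
  AND column_name = 'author_credit_id';
```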

The section below is already outdated; the remaining issues have been addressed in metabrainz/bookbrainz-site#1109.

I didn't have time to fully investigate that error so far, but my test import data definitely had a valid edition group BBID, so there should have been no need to create an EG automatically. It doesn't show up in the edition editor though, so it might have been lost at an intermediate transformation step.

But this still doesn't explain why the automatic creation fails.
It seems that the error occurs because the minimal DB has an outdated edition_group view (and edition_group_data table) from the days before author credits existed...
As I already said, time to use a full DB dump 😅

kellnerd (Contributor, Author) commented:
I've enhanced the author testdata with identifiers and more aliases. With both of these working fine, I think I have reached the limit of what data can currently be imported.

P.S. We should replace the numeric IDs at the parsed entity level soon; it is cumbersome to obtain them in every producer implementation.
Additionally, specifying an invalid ID can lead to obscure errors which do not surface in the logs.

Example: Try changing an author alias's language ID from 134 to 34 and see how the real issue is hidden by the following error message:

error: Transaction for 'Test Author' (Author A1234) failed: error: insert into "bookbrainz"."identifier" ("type_id", "value") values ($1, $2) returning * - current transaction is aborted, commands ignored until end of transaction block

Rather than the last(?) statement of a failing transaction, the log should show the statement that caused the failure. I only discovered the real cause by gradually removing the newly added identifiers:

error: Transaction for 'Test Author' (Author A1234) failed: error: insert into "bookbrainz"."alias" ("language_id", "name", "primary", "sort_name") values ($1, $2, $3, $4) returning * - insert or update on table "alias" violates foreign key constraint "alias_language_id_fkey"
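For anyone wondering why the misleading message appears at all: this is standard PostgreSQL behaviour inside a transaction, roughly as in the following illustration (the statements and values are made up for demonstration, not taken from the importer):

```sql
-- Illustration only: once one statement in a transaction fails, PostgreSQL
-- rejects every subsequent statement with "current transaction is aborted",
-- so the logged error points at a later statement instead of the one that
-- actually caused the failure.
BEGIN;
INSERT INTO bookbrainz.alias (name, sort_name, language_id, "primary")
    VALUES ('Test Author', 'Test Author', 34, true);  -- fails: FK violation (language 34 does not exist)
INSERT INTO bookbrainz.identifier (type_id, value)
    VALUES (1, 'test');  -- this one is reported as "current transaction is aborted"
ROLLBACK;
```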

MonkeyDo (Member) commented on Aug 6, 2024

> I've enhanced the author testdata with identifiers and more aliases. With both of these working fine, I think I have reached the limit of what data can currently be imported.

On this topic, I just found out (or forgot and found out again) that OL publishes JSON-schema files for their dumps, which could be used to cross-reference whether we are missing any data points; perhaps they could even be useful for parsing and validating OL imports and giving us useful error messages?
https://github.com/internetarchive/openlibrary-client/tree/master/olclient/schemata

> Rather than the last(?) statement of a failing transaction, the log should show the statement that caused the failure. I only discovered the real cause by gradually removing the newly added identifiers:

That's a pickle. I'm trying to see if it's a known issue, but apart from 10-year-old issues related to Promise.all I can't find anything specifically similar to our case.
Maybe we are using nested try-catch blocks in a way that knex doesn't like, but it's going to be a PITA to test that hypothesis.

MonkeyDo (Member) left a review comment:


Thanks for making useful debugging commands!
Sure beats manually entering sample data in the database :)

MonkeyDo merged commit 734abc0 into metabrainz:master on Aug 6, 2024
kellnerd deleted the bbiq-push-file branch on August 12, 2024
kellnerd (Contributor, Author) commented:
> On this topic, I just found out (or forgot and found out again) that OL publishes JSON-schema files for their dumps, which could be used to cross-reference whether we are missing any data points; perhaps they could even be useful for parsing and validating OL imports and giving us useful error messages?

Yeah, I am already aware of these; I had converted them into type definitions a month ago before I focused on other tasks.
That should be sufficient to handle all data points from the dumps which we are missing so far.
We will see whether all entries follow that schema or whether we need additional runtime validation once I get back to the OL importer.
