This repository has been archived by the owner on Jun 22, 2023. It is now read-only.

Data validation tools to prevent duplicates/misspellings #74

Open
david-linssen opened this issue Jan 20, 2020 · 8 comments
Labels: discuss, enhancement (New feature or request)

Comments

@david-linssen

Submitted by UPF; relevant for the Scholars & enthusiasts use-cases. Awarded 3 dots, assigned to @alastair.

@david-linssen
Author

also submitted by @ChristiaanScheermeijer

@david-linssen
Author

Scope for M24 is: import existing relations from IMSLP/Wikipedia/MusicBrainz

@CasperCDR

Duplicates from external sources can be prevented by adding a unique node property constraint, for example on the source property. The identifier field could also be used, but it is currently filled with a UUID. The identifier (inherited from Thing) could instead hold the source URI. Are there any implications to doing this? If this is not a desirable solution, clients should check for existing nodes before inserting.
Discussion: which field(s) should be checked to decide whether a duplicate entry exists?
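
For illustration, the client-side "check before inserting" option could look roughly like the sketch below. It uses an in-memory dictionary in place of Neo4j; `NodeStore` and `get_or_create` are hypothetical names, not part of the CE-API, and keying on `source` is only one of the candidate fields under discussion.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class NodeStore:
    """Hypothetical in-memory stand-in for the graph database."""

    by_source: Dict[str, dict] = field(default_factory=dict)

    def get_or_create(self, source: str, **properties) -> Tuple[dict, bool]:
        """Return (node, created), reusing the node if `source` is already known."""
        existing = self.by_source.get(source)
        if existing is not None:
            return existing, False
        # The identifier stays a random UUID, as it is today; only `source` is checked.
        node = {"identifier": str(uuid.uuid4()), "source": source, **properties}
        self.by_source[source] = node
        return node, True
```

Calling `get_or_create` twice with the same source URL returns the same node the second time, with `created` set to `False`.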

@alastair
Member

alastair commented Mar 3, 2020

For duplicates from the same database, we can use the source field. This will be the exact link to the page where the data was collected from, e.g. https://musicbrainz.org/artist/8d610e51-64b4-4654-b8df-064b0fb7a9d9 or https://www.wikidata.org/entity/Q7304

For M24 we will import data from:

  • MusicBrainz
  • WikiData
  • IMSLP
  • CPDL
  • muziekweb
  • viaf (if it exists in one of the above)
  • library of congress (if it exists in one of the above)
  • worldcat (if it exists in one of the above)
  • isni (if it exists in one of the above)

If any of these sources has existing metadata links to any other source, we will use skos:exactMatch to say that these items are the same.
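
As a rough sketch of how such links could be derived, assume each imported record is a dict with a `source` URL and an optional `links` list of URLs found in its metadata. Both field names and the `exact_match_pairs` helper are assumptions for illustration, not the importer's actual data model.

```python
def exact_match_pairs(records):
    """Return sorted, de-duplicated (source_a, source_b) pairs.

    A pair is emitted when one record's `links` contains the `source`
    URL of another imported record.
    """
    known_sources = {record["source"] for record in records}
    pairs = set()
    for record in records:
        for link in record.get("links", []):
            if link in known_sources and link != record["source"]:
                # Sort so (a, b) and (b, a) collapse into a single pair.
                pairs.add(tuple(sorted((record["source"], link))))
    return sorted(pairs)
```

Links that point at pages we have not imported are ignored here; each returned pair would become one exact-match relationship.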

The next part of this task (which for now will probably be out of scope for M24) is to match items when there are no existing relationships (e.g. the same artist appears on both MusicBrainz and muziekweb, but the two entries have no links to each other or through viaf/worldcat, etc). This matching will require some kind of heuristic (e.g. edit distance), or could be a crowd-sourcing task.
Once we identify these links, we should contribute them back to the primary data sources.
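
A minimal sketch of the edit-distance idea, using a plain Levenshtein distance on case-folded names. The `candidate_matches` helper and the `max_distance=2` threshold are assumptions for illustration; real matching would likely need more signals (dates, works, aliases) and human review.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidate_matches(names_a, names_b, max_distance=2):
    """Return (name_a, name_b, distance) for pairs within max_distance edits."""
    matches = []
    for name_a in names_a:
        for name_b in names_b:
            distance = levenshtein(name_a.casefold(), name_b.casefold())
            if distance <= max_distance:
                matches.append((name_a, name_b, distance))
    return matches
```

The output is only a list of candidates; a person (or the crowd-sourcing task mentioned above) would still confirm each pair before it is recorded as a match.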

@ChristiaanScheermeijer
Collaborator

@alastair, a recent version of neo4j-graphql-js makes it possible to add a @unique directive to properties whose values may occur only once in the Neo4j instance. This should also be added to all identifier properties, as the same identifier can currently exist multiple times.

https://grandstack.io/docs/graphql-schema-directives

@ChristiaanScheermeijer
Collaborator

I also suggest that we add some custom mutations to make it easier to "tag" nodes as related to each other. Currently we need to perform multiple queries/mutations to create a bi-directional relationship between two nodes.

(p1:Person)-[:EXACT_MATCH]->(p2:Person)
(p2:Person)-[:EXACT_MATCH]->(p1:Person)

# Arguments must be input types, so this is declared with `input`, not `type`.
input _matchInput {
  identifier: ID!
}

# Identifiers of the two nodes that were linked or unlinked.
type _matchResult {
  fromIdentifier: ID!
  toIdentifier: ID!
}

type Mutation {
  AddBroadMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddCloseMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddExactMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddNarrowMatch(from: _matchInput!, to: _matchInput!): _matchResult
  AddRelatedMatch(from: _matchInput!, to: _matchInput!): _matchResult

  RemoveBroadMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveCloseMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveExactMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveNarrowMatch(from: _matchInput!, to: _matchInput!): _matchResult
  RemoveRelatedMatch(from: _matchInput!, to: _matchInput!): _matchResult
}
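
To make the intended behaviour concrete, here is a small in-memory sketch of what a resolver behind these mutations might do: one call writes both directions. `MatchGraph` and its methods are hypothetical and do not touch Neo4j. One assumption worth flagging: in SKOS, exactMatch, closeMatch and relatedMatch are symmetric, while broadMatch and narrowMatch are inverses of each other, so the sketch stores the inverse type on the reverse edge. Whether the real mutations should do the same is open for discussion.

```python
from collections import defaultdict

# Relationship type written on the reverse edge for each forward type.
INVERSE = {
    "EXACT_MATCH": "EXACT_MATCH",
    "CLOSE_MATCH": "CLOSE_MATCH",
    "RELATED_MATCH": "RELATED_MATCH",
    "BROAD_MATCH": "NARROW_MATCH",
    "NARROW_MATCH": "BROAD_MATCH",
}


class MatchGraph:
    """Hypothetical in-memory graph keyed by (from_identifier, relationship type)."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_match(self, rel_type, from_id, to_id):
        """Create the forward edge and its inverse in one call."""
        inverse = INVERSE[rel_type]  # raises KeyError for unknown types
        self.edges[(from_id, rel_type)].add(to_id)
        self.edges[(to_id, inverse)].add(from_id)
        return {"fromIdentifier": from_id, "toIdentifier": to_id}

    def remove_match(self, rel_type, from_id, to_id):
        """Remove both directions; removing a missing edge is a no-op."""
        inverse = INVERSE[rel_type]
        self.edges[(from_id, rel_type)].discard(to_id)
        self.edges[(to_id, inverse)].discard(from_id)
        return {"fromIdentifier": from_id, "toIdentifier": to_id}

    def has_match(self, rel_type, from_id, to_id):
        return to_id in self.edges.get((from_id, rel_type), set())
```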

@ChristiaanScheermeijer
Collaborator

@CasperCDR @alastair we are now running a recent version of neo4j-graphql-js, which supports the @unique directive. Is it still relevant to add to the CE-API?

@ChristiaanScheermeijer added the enhancement (New feature or request) label on Mar 12, 2021

@alastair
Member

We have @unique on identifier, but we don't have it on source, so it's still possible to import the same item from MusicBrainz twice.
Having said that, I don't think it's a good idea to add @unique to source, because we could have multiple objects that describe different aspects of a single source.

I don't know a good way (other than being careful with our code) to ensure that we don't import the same data multiple times.
