Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk unindexing for IndexerInterface #598

Merged
merged 13 commits into from
Aug 27, 2024
Merged

Bulk unindexing for IndexerInterface #598

merged 13 commits into from
Aug 27, 2024

Conversation

Enet4
Copy link
Collaborator

@Enet4 Enet4 commented Jul 21, 2022

This proposes an extension to the IndexerInterface so that API consumers can request multiple items to be unindexed at once. Resolves #594.

This extension to the API should be carefully reviewed and evaluated to ensure that implementers can use it properly and that it can be well integrated into existing workflows.

Breaking changes are minimal. It is only a problem if an existing plugin already has a method with the exact same signature, which is unlikely. Hence, we are in position to deliver this in 3.4.0, or postpone to version 4 anyway.


To ensure some level of quality in this suggestion, we should have:

  • at least one proof of concept plugin implementation
    • We could take this as an opportunity to move forward with hosting the open plugins on GitHub.
  • at least one example that the Dicoogle core can consume this method in a usable way without downsides
    • I suggest an extension to the unindexing service so that it accepts multiple URIs

Problems resolved in the latest version:

  • No way to retrieve the task's unindexing progress from another thread. Now has a second parameter as a progress callback.
  • Even if the indexer periodically flushes some unindexing operations during the task, other routines cannot work with URIs which have already been unindexed before the task ends completely. This could pose an issue in file removal tasks, because it requires unindex to finish before we start cleaning up the files. The operation is now asynchronous.

Known caveats:

  • It is not clear whether we can properly constrain the number of parallel unindexing tasks by pushing this to our indexing task pool. The current indexing task manager depends on task objects returning a different kind of report, which UnindexReport does not implement. Maybe it should be subclassed from Report? As of the latest version, unindexing tasks can be dispatched by the indexer thread pool.
  • Having a second parameter to keep track of progress is quirky. One would naturally expect an UnindexTask<T> to extend Task<T>, but it is very different from indexing tasks and would make it even harder to attach to the task pool.
  • The resulting object is too complex and granular. While there is a lot of power in being able to specify an error for each URI, it is not always the case that files will be unindexed independently, so it may be hard to build such results in batched unindexing.

@bastiao
Copy link
Member

bastiao commented Jul 22, 2022

My opinion is to have it in 3.2.0.
Thinking aloud: my main concern about the current new API is the not capability to report the URIs unindexed (or perhaps the ones it failed). But I can take a look in our use case and see how it will look like.

@Enet4
Copy link
Collaborator Author

Enet4 commented Jul 22, 2022

my main concern about the current new API is the not capability to report the URIs unindexed (or perhaps the ones it failed).

That is true. I feared that including that information would overcomplicate the interface, but the cost of not knowing which files were unindexed and which ones were not is possibly too high. I will revise this.

@bastiao
Copy link
Member

bastiao commented Aug 25, 2022

do you have plans to "un-draft" this PR?

@Enet4
Copy link
Collaborator Author

Enet4 commented Aug 25, 2022

do you have plans to "un-draft" this PR?

I have updated the PR with a plan. It should not become ready for review until there is evidence that this API works well for implementers and consumers. We also have two known problems and further thinking is needed to either resolve them or just accept them as not enough of a problem to merit blocking the extension. Feedback is more than welcome.

@Enet4
Copy link
Collaborator Author

Enet4 commented Oct 15, 2022

Updated the API to include a progress callback. I updated the root message with known caveats.

@bastiao
Copy link
Member

bastiao commented Jul 8, 2023

The unindex bulk API can be beneficial in certain scenarios, and we are currently approaching a milestone where new APIs will be incorporated. It would be appreciated if you could take into account the undraft this pull request, as it has been pending for quite some time. If necessary, we can revisit and iterate on it at a later stage.

@Enet4
Copy link
Collaborator Author

Enet4 commented Jul 21, 2023

The unindex bulk API can be beneficial in certain scenarios, and we are currently approaching a milestone where new APIs will be incorporated. It would be appreciated if you could take into account the undraft this pull request, as it has been pending for quite some time. If necessary, we can revisit and iterate on it at a later stage.

I have integrated bulk unindexing to the core unindex Web service. It would be better to also have a proof of concept implementation other than the default.

@Enet4 Enet4 self-assigned this Aug 31, 2023
@Enet4 Enet4 marked this pull request as ready for review March 7, 2024 11:05
@Enet4 Enet4 added this to the 3.4.0 milestone Mar 7, 2024
@Enet4
Copy link
Collaborator Author

Enet4 commented Mar 25, 2024

I added this point to the known caveats.

The resulting object is too complex and granular. While there is a lot of power in being able to specify an error for each URI, it is not always the case that files will be unindexed independently, so it may be hard to build such results in batched unindexing.

After trying to implement this interface, I am convinced that the API needs to be redesigned to be easier to implement.

Enet4 added 12 commits April 17, 2024 15:52
- for unindexing in bulk
- clarify that both unindex methods are synchronous,
  unlike the indexing ones
- add `UnindexReport` class and nested classes
   - for containing errors which may occur in bulk unindexing
- change `IndexerInterface#unindex(Collection<URI>)`
   - returns `UnindexReport`
   - can throw `IOException`
- make it asynchronous: returns a `Task` like in `index`
- add second parameter for keeping track of progress
- clarify that it returns a task
- remove unused import
- can only handle one indexer at a time,
  but other than that it works
- remove deprecated method call #handles,
  check scheme instead
- record a collection of URIs in each unindex failure
@Enet4
Copy link
Collaborator Author

Enet4 commented Apr 17, 2024

I have readjusted the Unindex report so that each failure may specify multiple URIs, which is a better reflection of what will happen in most implementations.

- provide clearer methods to collect the counts
  of files which were not unindexed successfully
@tiberio-baptista tiberio-baptista self-requested a review June 20, 2024 17:03
Copy link
Collaborator

@tiberio-baptista tiberio-baptista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have applied the new unindex method for the interface in our index plugins, and it works as expected. Performance is definitely being improved.

LGTM!

@bastiao bastiao merged commit 7af44ec into dev Aug 27, 2024
2 checks passed
@bastiao bastiao deleted the new/indexer/unindex-bulk branch August 27, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexer batched unindexing
3 participants