Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: ES reindex implementation #343

Merged
merged 9 commits into from
Nov 11, 2024

Conversation

thepsalmist
Copy link
Contributor

@thepsalmist thepsalmist commented Oct 29, 2024

This PR addresses Issue #340

@thepsalmist thepsalmist self-assigned this Oct 29, 2024
@thepsalmist thepsalmist added the enhancement New feature or request label Oct 29, 2024
@thepsalmist thepsalmist changed the title chore: draft ES reindex implementation chore: ES reindex implementation Oct 29, 2024
@thepsalmist thepsalmist marked this pull request as ready for review October 29, 2024 14:00
@kilemensi
Copy link
Contributor

My (very) quick thoughts:

  1. Won't the script be better in Python (it could be just me and my sh allergies)?
  2. How do we plan on using this script?
    i. Would we want to use it to reindex documents matching a certain query? In that case, the script should support a query,
    ii. Do we care about any impact running this script could have on the cluster?, Then the script should support throttling or slicing, etc.

@thepsalmist
Copy link
Contributor Author

My (very) quick thoughts:

  1. Won't the script be better in Python (it could be just me and my sh allergies)?
  2. How do we plan on using this script?
    i. Would we want to use it to reindex documents matching a certain query? In that case, the script should support a query,
    ii. Do we care about any impact running this script could have on the cluster?, Then the script should support throttling or slicing, etc.

On slicing, Defaults to 1, meaning the task isn't sliced into subtasks. I think its better to set this to auto & let ES chose a reasonable number for most indices.

@thepsalmist
Copy link
Contributor Author

My (very) quick thoughts:

  1. Won't the script be better in Python (it could be just me and my sh allergies)?
  2. How do we plan on using this script?
    i. Would we want to use it to reindex documents matching a certain query? In that case, the script should support a query,
    ii. Do we care about any impact running this script could have on the cluster?, Then the script should support throttling or slicing, etc.

On the support for throttling, I'm not sure we wan't to define the requests_per_second to set throttling on initial reindex request without knowing the impact on the cluster.
Since we've settled on Kibana being part of the cluster, I believe we can set this using the Rethrottle API based on the monitoring stats

@thepsalmist thepsalmist requested review from kilemensi and m453h and removed request for kilemensi and m453h November 4, 2024 09:50
Copy link
Contributor

@m453h m453h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 🚀

Copy link
Member

@pgulley pgulley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust the details here- thanks for including good documentation!

Copy link
Contributor

@kilemensi kilemensi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bash-fu ain't all that but LGTM

@thepsalmist thepsalmist requested a review from philbudne November 8, 2024 07:21
@thepsalmist thepsalmist merged commit 2320597 into mediacloud:main Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants