Feature request: automate statistics generation #49

Open
jayaddison opened this issue Dec 2, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@jayaddison
Member

Is your feature request related to a problem? Please describe.
Currently we generate statistics by manually running a SQL query against the PostgreSQL database on the host server, approximately daily; we then chmod and mv the resulting .csv file into a directory served by the cluster-front nginx webserver. The configuration for this can be found here:

backend statistics
errorfile 503 /var/www/searches.csv
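
For illustration, the manual chmod-and-mv step could be scripted roughly as follows. This is a minimal sketch, not the project's actual tooling: the function name `publish_csv` and the default file mode are assumptions, and the rows would come from whatever query produces the statistics today. Writing to a temporary file in the destination directory and then renaming it means the webserver never serves a half-written file.

```python
import csv
import os
import tempfile


def publish_csv(rows, destination, mode=0o644):
    """Write rows to a temporary file, set its permissions, then move it into place.

    `publish_csv` and the default `mode` are hypothetical choices for this sketch.
    """
    directory = os.path.dirname(os.path.abspath(destination))
    # Create the temporary file next to the destination so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        os.chmod(tmp_path, mode)  # stands in for the manual chmod step
        os.replace(tmp_path, destination)  # stands in for mv; replaces any existing file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A scheduled job could call `publish_csv(rows, "/var/www/searches.csv")` once the query results are available.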

Before each update, we currently (but hopefully temporarily) apply some manual SQL UPDATE statements to filter out certain traffic. This traffic may not be bot-initiated, but appears to stem from a problem that users on some as-yet-undetermined devices experience when using the search engine, which causes them to issue duplicate empty search queries.
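
As a rough illustration of the kind of filtering described above, the sketch below flags empty searches that repeat from the same client within a short time window. Everything here is an assumption made for the example: the event shape `(client_id, timestamp, query)`, the five-second window, and the function name are hypothetical, and the actual SQL UPDATE statements may use different criteria.

```python
from datetime import datetime, timedelta


def flag_duplicate_empty_searches(events, window=timedelta(seconds=5)):
    """Return one boolean per event: True where it is a repeated empty search.

    `events` is a list of (client_id, timestamp, query) tuples sorted by
    timestamp. The tuple shape and the default window are hypothetical.
    """
    last_empty = {}  # client_id -> timestamp of that client's latest empty search
    flags = []
    for client_id, timestamp, query in events:
        is_empty = query.strip() == ""
        previous = last_empty.get(client_id)
        # Only an empty search that closely follows another empty search
        # from the same client is treated as a duplicate.
        flags.append(
            is_empty and previous is not None and timestamp - previous <= window
        )
        if is_empty:
            last_empty[client_id] = timestamp
    return flags
```

The first empty search from a client is kept; only the rapid repeats are flagged, so a genuine empty search still counts once.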

Generally the statistics are updated with a delay of at most three or four days, and frequently they are updated the next day, sometimes soon after midnight UTC, depending on when our system operators (me) are available.

To allow for holiday/vacation time without the statistics becoming stale, it would be nice to automate their generation.

Describe the solution you'd like
Some general requirements here are:

  • Historical data should, with very few exceptions, be treated as immutable. That is: if we said that we had X searches on day Y, that statistic should not change when subsequent statistics are generated.
  • Ideally we should re-use the existing SQL query that is used today to generate the stats.
  • Statistics updates -- whatever the mechanism used to generate and deploy them -- should not interfere with the production application, and should be designed to minimize that risk.
    • A not-always-obvious implication of this is that query load on the production database should be minimized, or removed entirely if possible by querying a secondary database instance. The problem with querying a secondary database, however, is that it increases the chance of stale/inconsistent results, meaning that statistics re-generated at a later date could differ from those already published (conflicting with the immutability requirement).
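
One way to reconcile the immutability requirement with possibly stale query results is to merge each fresh result into the already-published figures without ever overwriting a published day. The sketch below assumes the statistics can be represented as a mapping from ISO date strings to search counts; that representation and the function name `merge_daily_counts` are assumptions for illustration only.

```python
from datetime import date


def merge_daily_counts(published, fresh, today):
    """Return the published counts plus any newly completed days from `fresh`.

    `published` and `fresh` map ISO date strings (YYYY-MM-DD) to search counts.
    Days that are already published are never overwritten, and the current
    (still incomplete) day is never added.
    """
    merged = dict(published)
    cutoff = today.isoformat()
    for day, count in fresh.items():
        # ISO date strings sort chronologically, so string comparison is safe.
        if day < cutoff and day not in merged:
            merged[day] = count
    return dict(sorted(merged.items()))
```

With this approach, a later query that returns a slightly different count for an old day leaves the published figure untouched, whichever database instance the query ran against.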

Describe alternatives you've considered
Continuing to generate and update the statistics manually remains an option, at least in the short term. It does have the benefit that our operators (me) stay somewhat familiar with trends, spikes and oddities in the daily statistics.

Additional context
N/A

@jayaddison jayaddison added the enhancement New feature or request label Dec 2, 2024