Feature request: automate statistics generation #49

jayaddison · 2024-12-02T16:28:13Z

Is your feature request related to a problem? Please describe.
Currently we generate statistics approximately daily by manually issuing a SQL query to the PostgreSQL database on the host server, and then chmod and mv the resulting .csv file into a directory served by the cluster-front nginx webserver. The configuration for this can be found here:

infrastructure/etc/haproxy/haproxy.cfg

Lines 75 to 76 in 0c2f41e

    
    backend statistics
        errorfile 503 /var/www/searches.csv

Before each update, we currently (but hopefully temporarily) apply some manual SQL UPDATE statements to filter out certain traffic. This traffic may not be bot-initiated, but it appears to be caused by problems that users on some as-yet-undetermined devices experience with the search engine, which lead them to issue duplicate empty search queries.

Generally the statistics are updated with a delay of at most three or four days. Frequently they are updated the next day, sometimes soon after midnight UTC, depending on when our system operators (me) are available.

To allow for holiday/vacation time without the statistics becoming stale, it would be nice to automate their generation.
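As a rough illustration of the publish step described above (write the .csv, chmod it, mv it into place), here is a minimal Python sketch. It is not the current process: the function name `publish_csv` and the `date`/`searches` column names in the usage note are hypothetical, and the rows are assumed to come from whatever query produces the statistics today. Writing to a temporary file in the destination directory and renaming it means the webserver should never serve a partially written file.

```python
import csv
import os
import tempfile


def publish_csv(rows, header, dest_path, mode=0o644):
    """Write rows to dest_path so that readers never see a partial file.

    Hypothetical helper: `rows` is any iterable of tuples, for example the
    result of the existing statistics query.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temporary file next to the destination so that the final
    # rename stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        os.chmod(tmp_path, mode)         # replaces the manual "chmod"
        os.replace(tmp_path, dest_path)  # replaces the manual "mv"
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A call might look like `publish_csv(rows, ("date", "searches"), "/var/www/searches.csv")`, run from a scheduled job under an account that can write to that directory.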

Describe the solution you'd like
Some general requirements here are:

- Historical data should, with very few exceptions, be treated as immutable. That is: if we said that we had X searches on day Y, that statistic should not change when subsequent statistics are generated.
- Ideally we should re-use the existing SQL query that is used today to generate the stats.
- Statistics updates, whatever the mechanism used to generate and deploy them, should not interfere with the production application, and should be designed to minimize that risk.
  - A not-always-obvious implication of this is that query load on the production database should either be minimized, or removed entirely if possible by querying a secondary database instance. The problem with querying a secondary database, however, is that it increases the chance of stale/inconsistent results, meaning that statistics re-generated at a later date could differ from those already published (conflicting with the immutability requirement).
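One way to reconcile the immutability requirement with the risk of stale or inconsistent query results is to treat the published file as the source of truth for past days and only ever append days that have not been published yet. The sketch below assumes the statistics can be represented as a mapping from ISO-format date strings to search counts; the function name `merge_daily_counts` and that data shape are hypothetical, not part of the existing tooling.

```python
def merge_daily_counts(published, regenerated):
    """Combine counts, never altering a day that was already published.

    Both arguments map ISO dates ("YYYY-MM-DD") to search counts.
    """
    merged = dict(published)
    for day, count in regenerated.items():
        # setdefault only stores the value when the day is absent, so a
        # previously published figure always wins over a regenerated one.
        merged.setdefault(day, count)
    # ISO date strings sort chronologically when sorted as plain text.
    return dict(sorted(merged.items()))


published = {"2024-11-30": 120, "2024-12-01": 95}
regenerated = {"2024-12-01": 97, "2024-12-02": 101}
merged = merge_daily_counts(published, regenerated)
```

Here the regenerated figure for 2024-12-01 is ignored and only 2024-12-02 is added. The "very few exceptions" mentioned above, such as a deliberate correction, would need a separate, explicit path rather than this merge.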

Describe alternatives you've considered
Continuing to generate and update the statistics manually remains an option, at least in the short term. It does have the benefit that our operators (me) stay somewhat familiar with trends, spikes and oddities in the daily statistics.

Additional context
N/A


jayaddison added the enhancement New feature or request label Dec 2, 2024
