Adds a timeout for user globals #18

Closed · wants to merge 1 commit

Conversation

@AlinaNova21 (Contributor) commented Feb 7, 2018

Currently, the private server only resets a user's globals when code is uploaded or when the global doesn't exist yet. This can cause performance issues on longer-running servers. This PR adds timeouts to help maintain performance. The timeout is currently 6 minutes plus a random addition of up to 30 seconds; the extra random time spreads the resets out so that all users reset at different times.
This is a direct port from screepsmod-mongo, which has used this since early on.
It was originally added due to early issues on the ScreepsPlus server with runners dragging down performance.
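
Conceptually, the change amounts to attaching an expiry to each cached user global and rebuilding the global once it has passed. The snippet below is a minimal sketch of that idea, not the actual diff; the names (newResetDeadline, needsReset, entry, codeTimestamp) are illustrative.

```js
const BASE_TIMEOUT_MS = 6 * 60 * 1000; // reset after 6 minutes...
const MAX_JITTER_MS = 30 * 1000;       // ...plus up to 30 seconds of jitter

// Pick a per-user deadline; the random jitter spreads resets across users
// so they don't all rebuild their globals at the same time.
function newResetDeadline() {
    return Date.now() + BASE_TIMEOUT_MS + Math.floor(Math.random() * MAX_JITTER_MS);
}

// Reset when no cached global exists, when new code was uploaded,
// or when the deadline has passed.
function needsReset(entry, codeTimestamp) {
    return !entry ||
        entry.codeTimestamp !== codeTimestamp ||
        Date.now() > entry.resetDeadline;
}
```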

@artch (Contributor) commented Feb 8, 2018

So the performance degradation that some users have reported is likely due to this, not to LokiJS issues?

@AlinaNova21 (Contributor, Author)

I've still seen slowdowns in the storage module itself; it usually seems like the DB traffic just becomes too much for it to handle as the user count grows. That said, this could be the cause of the slowdowns people are seeing where a restart brings speeds back up.
On S+, it originally didn't occur unless the server ran for hours; then, once more players had joined, it became much more common and noticeable. An early fix was to restart the runners on a short cron, but that was disruptive since the server would lag for a few minutes while everyone's code recovered. Adding the timeouts solved the issue, although there were also a bunch of other tweaks and factors that may have had an effect.

@artch (Contributor) commented Feb 8, 2018

In our cluster setup all runner and processor processes are restarted every hour, not just user VMs. Wouldn't it be more reliable to implement some logic in the launcher that restarts these processes, instead of what this PR does?

@AlinaNova21 (Contributor, Author)

Not sure; this has been reliable on S+ for the past few months. I'm about to set up a set of servers with a few bots and let them run for a day or two: one with the current npm code, one with this patch, and one with hourly runner/processor restarts (all three using the stock storage).

Although on that note, I haven't had issues that required the processors to be restarted unless it's caused by redis going read-only or something. (That was a result of memory gaining 16GB of history data and redis running out of RAM.)

Lately it's actually been the backend-local module I have to restart, after which everything else keeps going fine.

@artch (Contributor) commented Feb 8, 2018

Yes, it's mostly about runner processes. Thanks for testing this, looking forward to the results.

@AlinaNova21 (Contributor, Author)

Current test setup: 3 identically configured servers with mod-spawned bots in identical positions to reduce variation. The only mods are screepsmod-auth for API access, stats collection (the source of the data for these graphs), and 2 small mods: one that adds setTimeout(() => process.exit(), 60 * 60 * 1000) to the processor and runner processes on the restart server, and one that applies the timeout patches on the timeout server. (A rough sketch of the restart mod is below.)
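
For reference, the restart mod amounts to something like the following sketch. It assumes the usual Screeps server mod shape (module.exports = function (config) { ... }) and that config.engine is only present in the engine (runner/processor) processes; those details are assumptions here, not copied from the actual test mod.

```js
// Rough sketch of the hourly-restart test mod (assumptions noted above).
module.exports = function (config) {
    if (config.engine) {
        // Exit after an hour; the launcher is expected to respawn the process.
        setTimeout(() => process.exit(), 60 * 60 * 1000);
    }
};
```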

After more testing, I'm closing this PR as not helpful. While I do believe it had a helpful effect on S+ before, that may have been due to other factors. With the restarts showing such large results, I'll probably update screepsmod-mongo to remove the timeout code (it actually seems to be having no effect during testing) and also update ScreepsPlus to restart runners and processors regularly. (I will probably try 30-minute intervals there.)

Restarting, however, does have a very noticeable beneficial impact. I'm seeing the waitForUsers stage hit 2600 ms so far on both the vanilla and the timeout test servers, while on the restart server it never gets above 900 ms. It's a similar case with waitForRooms, although that consistently stays at ~50% of the waitForUsers value on all three. (Any idea why waitForRooms would be degrading at exactly the same rate? Maybe a memory leak in the driver somewhere?)

Oddly, it seems dirty time is increasing while actual user CPU time remains constant.

This behavior would also explain the reports of servers being faster after restarting; regular restarts of these two modules are what's helping.

Graph of server timings (From top to bottom: main stage timing, user dirty time, user CPU time)

Note: Top row is in logarithmic view to make the lower stages visible
[image: server timing graphs]

@artch (Contributor) commented Feb 10, 2018

Thanks, this is interesting. As for memory leaks in driver code, any help identifying them is welcome.
