Adds a timeout for user globals #18

Closed · wants to merge 1 commit

Conversation

@AlinaNova21 (Contributor) commented Feb 7, 2018

Currently, the private server only resets a user's globals when code is uploaded or when the global doesn't exist yet. This can cause performance issues on longer-running servers. This PR adds timeouts to help maintain performance. The timeout is currently 6 minutes plus a random addition of up to 30 seconds; the extra random time spreads the resets out so that all users reset at different times.
This is a direct port from screepsmod-mongo, which has used this since early on.
It was originally added due to early issues on the ScreepsPlus server with runners dragging down performance.
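
Conceptually, the change amounts to attaching an expiry to each cached user global and rebuilding the global once it has passed. The snippet below is a minimal sketch of that idea, not the actual diff; the names (newResetDeadline, needsReset, entry, codeTimestamp) are illustrative.

```js
const BASE_TIMEOUT_MS = 6 * 60 * 1000; // reset after 6 minutes...
const MAX_JITTER_MS = 30 * 1000;       // ...plus up to 30 seconds of jitter

// Pick a per-user deadline; the random jitter spreads resets across users
// so they don't all rebuild their globals at the same time.
function newResetDeadline() {
    return Date.now() + BASE_TIMEOUT_MS + Math.floor(Math.random() * MAX_JITTER_MS);
}

// Reset when no cached global exists, when new code was uploaded,
// or when the deadline has passed.
function needsReset(entry, codeTimestamp) {
    return !entry ||
        entry.codeTimestamp !== codeTimestamp ||
        Date.now() > entry.resetDeadline;
}
```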

@artch (Contributor) commented Feb 8, 2018

So the performance degradation that some users have reported is likely due to this, not to LokiJS issues?

@AlinaNova21 (Contributor, Author)

I've still seen slowdowns in the storage module itself; it usually seems like the DB traffic just becomes too much for it to handle as the user count grows. That said, this could be the cause of the slowdowns people are seeing where a restart brings speeds back up.
On S+, it originally didn't occur unless the server ran for hours; then, once more players had joined, it became much more common and noticeable. An early fix was to restart the runners on a short cron, but that was disruptive since the server would lag for a few minutes while everyone's code recovered. Adding the timeouts solved the issue, although there were also a bunch of other tweaks and factors that may have had an effect.

@artch (Contributor) commented Feb 8, 2018

In our cluster setup all runner and processor processes are restarted every hour, not just user VMs. Wouldn't it be more reliable to implement some logic in the launcher that restarts these processes, instead of what this PR does?

@AlinaNova21 (Contributor, Author)

Not sure; this has been reliable on S+ for the past few months. I'm about to set up a set of servers with a few bots and let them run for a day or two: one with the current npm code, one with this patch, and one with hourly runner/processor restarts (all three using the stock storage).

Although on that note, I haven't had issues that required the processors to be restarted unless it's caused by redis going read-only or something. (That was a result of memory gaining 16GB of history data and redis running out of RAM.)

Lately it's actually been the backend-local module I have to restart, after which everything else keeps going fine.

@artch (Contributor) commented Feb 8, 2018

Yes, it's mostly about runner processes. Thanks for testing this, looking forward to the results.

@AlinaNova21 (Contributor, Author)

Current test setup: 3 identically configured servers with mod-spawned bots in identical positions to reduce variation. The only mods are screepsmod-auth for API access, stats collection (the source of the data for these graphs), and 2 small mods: one that adds setTimeout(() => process.exit(), 60 * 60 * 1000) to the processor and runner processes on the restart server, and one that applies the timeout patches on the timeout server. (A rough sketch of the restart mod is below.)
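
For reference, the restart mod amounts to something like the following sketch. It assumes the usual Screeps server mod shape (module.exports = function (config) { ... }) and that config.engine is only present in the engine (runner/processor) processes; those details are assumptions here, not copied from the actual test mod.

```js
// Rough sketch of the hourly-restart test mod (assumptions noted above).
module.exports = function (config) {
    if (config.engine) {
        // Exit after an hour; the launcher is expected to respawn the process.
        setTimeout(() => process.exit(), 60 * 60 * 1000);
    }
};
```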

After more testing, I'm closing this PR as not helpful. While I do believe it had a helpful effect on S+ before, that may have been due to other factors. With the restarts showing such large results, I'll probably update screepsmod-mongo to remove the timeout code (it actually seems to be having no effect during testing) and also update ScreepsPlus to restart runners and processors regularly. (I will probably try 30-minute intervals there.)

Restarting, however, does have a very noticeable beneficial impact. I'm seeing the waitForUsers stage hit 2600 ms so far on both the vanilla and the timeout test servers, while on the restart server it never gets above 900 ms. It's a similar case with waitForRooms, although that consistently stays at ~50% of the waitForUsers value on all three. (Any idea why waitForRooms would be degrading at exactly the same rate? Maybe a memory leak in the driver somewhere?)

Oddly, it seems dirty time is increasing while actual user CPU time remains constant.

This behavior would also explain the reports of servers being faster after restarting; regular restarts of these two modules are what's helping.

Graph of server timings (From top to bottom: main stage timing, user dirty time, user CPU time)

Note: Top row is in logarithmic view to make the lower stages visible
[image: server timing graphs]

@artch (Contributor) commented Feb 10, 2018

Thanks, this is interesting. As for memory leaks in driver code, any help identifying them is welcome.
