upload_pgn pserve and primary pserve can hang #2094
Comments
Did you confirm that the hang of the primary pserve is not a deadlock inside Fishtest? I am quite sure it is not since dead tasks were properly cleared (indicating that the task thread was running normally). |
One idea I had is that during a fleet visit we may run the root logger with debug level. I think that then waitress will print much more information on what's going on. |
This does not seem to be true. I just tried it and not much more information is printed. |
We have relatively little info, in that the tracing doesn't give anything useful. mongo seems responsive. The hang always involves upload_pgn, so that would be a logical first place to start looking IMO. |
I guess upload_pgn is the only api that sends a substantial amount of data from the client to the server. |
It is also probably one of the most frequent calls (on the non-primary pserves). |
We could do some instrumentation of the upload_pgn api (using #2067 we could even create a specific logger for this). |
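A minimal sketch of what such instrumentation could look like, assuming a dedicated logger and a decorator around the handler. The logger name `fishtest.upload_pgn`, the `log_call` decorator and the stand-in `upload_pgn` function are all hypothetical and not taken from the Fishtest code base. The idea is that an "enter" line without a matching "exit" line would show which thread got stuck, and when.

```python
import functools
import logging
import threading
import time

# Hypothetical logger name, chosen only for this illustration.
upload_logger = logging.getLogger("fishtest.upload_pgn")


def log_call(func):
    """Log entry, exit and elapsed time of func, including the thread name."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        thread = threading.current_thread().name
        start = time.monotonic()
        upload_logger.info("enter %s [thread=%s]", func.__name__, thread)
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions, so only a real hang
            # leaves an "enter" line without a matching "exit" line.
            elapsed = time.monotonic() - start
            upload_logger.info(
                "exit %s [thread=%s] after %.3fs", func.__name__, thread, elapsed
            )

    return wrapper


@log_call
def upload_pgn(payload):
    # Stand-in for the real api handler (assumption): just report the size.
    return {"size": len(payload)}
```

With the logger set to INFO, calling `upload_pgn(b"...")` emits one "enter" and one "exit" line per request, which could be compared with the waitress access log after a hang.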
I find a couple of suspicious errors in the logs:
|
Seems always at the same time as Validate_data_structures (but not necessarily related to the hang)
|
Thanks. I guess I either need to make a copy of those dicts before iterating over them, or else wrap them in an appropriate lock. |
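Both options can be combined, roughly as in the sketch below. It assumes the errors are of the "dictionary changed size during iteration" kind (the log excerpt is not reproduced here), and the `ActiveRuns` class and `validate` function are hypothetical names, not the actual Fishtest data structures. Mutations happen under a lock, and iteration happens over a shallow copy taken under the same lock.

```python
import threading


class ActiveRuns:
    """Hypothetical shared registry that several threads read and modify."""

    def __init__(self):
        self._lock = threading.Lock()
        self._runs = {}

    def add(self, run_id, run):
        with self._lock:
            self._runs[run_id] = run

    def remove(self, run_id):
        with self._lock:
            self._runs.pop(run_id, None)

    def snapshot(self):
        # Shallow copy taken under the lock: the caller can iterate over it
        # while other threads keep adding and removing entries.
        with self._lock:
            return dict(self._runs)


def validate(registry):
    """Visit every run in a snapshot, dropping the finished ones."""
    checked = []
    for run_id, run in registry.snapshot().items():
        # Mutating the registry here is safe because we iterate over the copy.
        if run.get("finished"):
            registry.remove(run_id)
        checked.append(run_id)
    return checked
```

The lock is only held for the duration of the copy, so a slow validation pass does not block the threads that update the dict.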
I found out how to log waitress connections, from https://docs.pylonsproject.org/projects/pyramid/en/main/narr/logging.html. production.ini has to be modified as follows:
[app:fishtest]
use = egg:fishtest_server
pyramid.reload_templates = false
pyramid.debug_authorization = false
pyramid.debug_notfound = false
pyramid.debug_routematch = false
pyramid.default_locale_name = en
mako.directories = fishtest:templates
fishtest.port = %(http_port)s
fishtest.primary_port = 6543
[filter:translogger]
use = egg:Paste#translogger
setup_console_handler = False
[pipeline:main]
pipeline = translogger
           fishtest
Output looks like this
|
I have been looking at the upload_pgn code, and in principle it looks 'really simple'. What I can't figure out is how the communication between the two pserves actually happens, and in particular whether it could just be waiting forever (i.e. no timeout) in case some error happens there. |
Hmm. Without looking at the source code, off hand I would say there is no communication between the two pserves. We have been experimenting with an api where a secondary instance asks the primary instance for some information but this is currently on hold. |
right... so the only 'communication' is through mongodb in some sense. |
During a fleet visit we should probably turn on access logging for waitress (see #2094 (comment) ). If then pserve hangs (and it really seems to be a pserve hang, and not fishtest) we would know which api was served last, and by which thread. |
We have now repeatedly observed that these two pserves hang simultaneously under load. Since it always involves the upload_pgn api, this may be the place to start looking?