
upload_pgn pserve and primary pserve can hang #2094

Open
vondele opened this issue Jul 2, 2024 · 15 comments

Comments

@vondele
Member

vondele commented Jul 2, 2024

I have now observed repeatedly that these two pserves hang simultaneously under load. Since it always involves the upload_pgn api, this may be the place to start looking?

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

Did you confirm that the hang of the primary pserve is not a deadlock inside Fishtest?

I am quite sure it is not since dead tasks were properly cleared (indicating that the task thread was running normally).
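
A direct way to confirm this would be to dump the stacks of all threads in the hung process. A minimal sketch, assuming a signal handler can be registered at pserve startup (not something Fishtest is shown to do in this thread):

import faulthandler
import signal

# Dump the Python stack of every thread to stderr when the process receives
# SIGUSR1, e.g. via `kill -USR1 <pserve pid>`. That shows whether threads are
# stuck inside Fishtest code (a deadlock) or sitting idle inside waitress.
faulthandler.register(signal.SIGUSR1, all_threads=True)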

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

One idea I had is that during a fleet visit we may run the root logger with debug level. I think that then waitress will print much more information on what's going on.

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

One idea I had is that during a fleet visit we may run the root logger with debug level. I think that then waitress will print much more information on what's going on.

This does not seem to be true. I just tried it and not much more information is printed.
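
For context, raising the root logger to debug level amounts to something like the following (a minimal sketch; Fishtest's actual logging setup may differ):

import logging

# Raise the root logger to DEBUG so that libraries which log through the
# standard logging module (waitress does) become as verbose as possible.
# Assumes a handler is already attached to the root logger, e.g. via the
# logging sections of the ini file.
logging.getLogger().setLevel(logging.DEBUG)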

@vondele
Member Author

vondele commented Jul 2, 2024

We have relatively little to go on: the tracing doesn't give useful info and mongo seems responsive. It always involves upload_pgn, so that would be a logical first place to start looking IMO.

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

I guess upload_pgn is the only api that sends a substantial amount of data from the client to the server.

@vondele
Member Author

vondele commented Jul 2, 2024

it is also probably one of the most frequent calls (on the non-primary pserves)

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

We could do some instrumentation of the upload_pgn api (using #2067 we could even create a specific logger for this).
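
Such instrumentation could look roughly like this; the logger name, decorator and view signature below are made up for illustration and are not taken from Fishtest:

import logging
import time

# Hypothetical dedicated logger for the upload_pgn api.
pgn_logger = logging.getLogger("fishtest.upload_pgn")

def log_upload_pgn(view):
    # Wrap the upload_pgn view so we can see when each request enters and
    # leaves it, and how long the call took.
    def wrapper(request):
        start = time.monotonic()
        pgn_logger.info("upload_pgn: start")
        try:
            return view(request)
        finally:
            pgn_logger.info(
                "upload_pgn: done in %.3fs", time.monotonic() - start
            )
    return wrapper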

@vondele
Member Author

vondele commented Jul 2, 2024

I find a couple of suspicious errors in the logs:

Jun 30 12:18:21 ns304014 pserve[2691]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 13:52:45 ns304014 pserve[6871]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:03:07 ns304014 pserve[14360]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:37:17 ns304014 pserve[16530]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:23:45 ns304014 pserve[18385]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:53:09 ns304014 pserve[33684]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:44:11 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:59:31 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 18:34:56 ns304014 pserve[56080]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 20:18:10 ns304014 pserve[929]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:39:04 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:53:33 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 23:11:13 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:11:28 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:56:47 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration

@vondele
Member Author

vondele commented Jul 2, 2024

It always seems to occur at the same time as Validate_data_structures (but is not necessarily related to the hang):

Jun 30 19:03:16 ns304014 pserve[36228]: Validate_data_structures: validating Fishtest's internal data structures...
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

Thanks. I guess I either need to make a copy of those dicts before iterating over them, or else wrap them in an appropriate lock.
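
Both options, as a sketch with made-up names rather than Fishtest's actual data structures:

import threading

# Made-up shared dict and lock standing in for Fishtest's internal structures.
shared = {"a": 1, "b": 2}
shared_lock = threading.Lock()

# Option 1: iterate over a snapshot; concurrent inserts/deletes by other
# threads can no longer raise "dictionary changed size during iteration".
for key, value in list(shared.items()):
    print(key, value)

# Option 2: hold a lock that every writer also takes while mutating the dict.
with shared_lock:
    for key, value in shared.items():
        print(key, value)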

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

I found out how to log waitress connections (see https://docs.pylonsproject.org/projects/pyramid/en/main/narr/logging.html). production.ini has to be modified as follows:

[app:fishtest]
use = egg:fishtest_server

pyramid.reload_templates = false
pyramid.debug_authorization = false
pyramid.debug_notfound = false
pyramid.debug_routematch = false
pyramid.default_locale_name = en

mako.directories = fishtest:templates

fishtest.port = %(http_port)s
fishtest.primary_port = 6543

[filter:translogger]
use = egg:Paste#translogger
setup_console_handler = False

[pipeline:main]
pipeline = translogger
           fishtest

Output looks like this:

Starting server in PID 1239.
2024-07-02 05:38:25,985 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:25 +0000] "GET /tests HTTP/1.0" 200 86132 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:30,877 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:30 +0000] "GET /login HTTP/1.0" 200 18819 "https://3.82.105.159/tests" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:38,208 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:38 +0000] "GET /api/active_runs HTTP/1.0" 200 9949 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,401 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "POST /login HTTP/1.0" 302 169 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,513 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET / HTTP/1.0" 302 194 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,635 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET /tests HTTP/1.0" 200 87702 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
Failed_task: failure for: https://tests.stockfishchess.org/tests/view/66824033546a9b654e82da4b, task_id: 17, worker: user01-1cores-deruyts-81da, reason: 'Terminated by signal SIGINT'
2024-07-02 05:38:55,412 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:55 +0000] "POST /api/failed_task HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:38:57,699 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:56 +0000] "POST /api/upload_pgn HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:39:01,916 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:39:01 +0000] "POST /api/request_version HTTP/1.0" 200 36 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,269 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,885 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:03,500 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:39:03 +0000] "POST /api/request_task HTTP/1.0" 200 1241 "-" "python-requests/2.25.1"
Validate_data_structures: validating Fishtest's internal data structures...
Validate_random_run: validated cache run 6680543e4eb1934d34b5aba5...
Clean_wtt_map: 1 active workers...
2024-07-02 05:41:02,901 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:41:02 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
Clean_wtt_map: 1 active workers...
The run object 66823f9c546a9b654e82da47 does not validate: run['results_info'] is not in the schema
2024-07-02 05:43:03,325 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:43:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:45:03,857 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:45:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"

@vondele
Member Author

vondele commented Jul 2, 2024

I have been looking at the upload_pgn code, and in principle it looks 'really simple'. What I can't figure out is how the communication between the two pserves actually happens, and in particular whether it could just wait forever (i.e. with no timeout) in case some error happens there.
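
To illustrate the concern: an HTTP call between instances with no timeout would indeed block forever if the peer stops responding. A hypothetical example, not actual Fishtest code:

import requests

payload = {"example": True}

# Without a timeout, requests.post() waits indefinitely if the peer accepts
# the connection but never answers; passing one bounds the wait instead.
response = requests.post(
    "http://localhost:6543/api/some_endpoint",  # hypothetical endpoint
    json=payload,
    timeout=30,
)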

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

Hmm. Without looking at the source code, offhand I would say there is no communication between the two pserves. We have been experimenting with an api where a secondary instance asks the primary instance for some information, but this is currently on hold.

@vondele
Member Author

vondele commented Jul 2, 2024

right... so the only 'communication' is through mongodb in some sense.

vdbergh added a commit to vdbergh/fishtest that referenced this issue Jul 2, 2024
ppigazzini pushed a commit that referenced this issue Jul 3, 2024
@vdbergh
Contributor

vdbergh commented Jul 6, 2024

During a fleet visit we should probably turn on access logging for waitress (see #2094 (comment)). If pserve then hangs (and it really seems to be a pserve hang, and not Fishtest), we would know which api was served last, and by which thread.
