
upload_pgn pserve and primary pserve can hang #2094

Open
vondele opened this issue Jul 2, 2024 · 15 comments

Comments

@vondele
Member

vondele commented Jul 2, 2024

I have now observed repeatedly that these two pserves hang simultaneously under load. Since it always involves the upload_pgn api, this may be the place to start looking?

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

Did you confirm that the hang of the primary pserve is not a deadlock inside Fishtest?

I am quite sure it is not since dead tasks were properly cleared (indicating that the task thread was running normally).
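
A direct way to confirm this would be to dump the stacks of all threads in the hung process. A minimal sketch, assuming a signal handler can be registered at pserve startup (not something Fishtest is shown to do in this thread):

import faulthandler
import signal

# Dump the Python stack of every thread to stderr when the process receives
# SIGUSR1, e.g. via `kill -USR1 <pserve pid>`. That shows whether threads are
# stuck inside Fishtest code (a deadlock) or sitting idle inside waitress.
faulthandler.register(signal.SIGUSR1, all_threads=True)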

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

One idea I had is that during a fleet visit we may run the root logger with debug level. I think that then waitress will print much more information on what's going on.

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

One idea I had is that during a fleet visit we may run the root logger with debug level. I think that then waitress will print much more information on what's going on.

This does not seem to be true. I just tried it and not much more information is printed.
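
For context, raising the root logger to debug level amounts to something like the following (a minimal sketch; Fishtest's actual logging setup may differ):

import logging

# Raise the root logger to DEBUG so that libraries which log through the
# standard logging module (waitress does) become as verbose as possible.
# Assumes a handler is already attached to the root logger, e.g. via the
# logging sections of the ini file.
logging.getLogger().setLevel(logging.DEBUG)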

@vondele
Member Author

vondele commented Jul 2, 2024

We have relatively little to go on: the tracing doesn't give useful info and mongo seems responsive. It always involves upload_pgn, so that would be a logical first place to start looking IMO.

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

I guess upload_pgn is the only api that sends a substantial amount of data from the client to the server.

@vondele
Member Author

vondele commented Jul 2, 2024

it is also probably one of the most frequent calls (on the non-primary pserves)

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

We could do some instrumentation of the upload_pgn api (using #2067 we could even create a specific logger for this).
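
Such instrumentation could look roughly like this; the logger name, decorator and view signature below are made up for illustration and are not taken from Fishtest:

import logging
import time

# Hypothetical dedicated logger for the upload_pgn api.
pgn_logger = logging.getLogger("fishtest.upload_pgn")

def log_upload_pgn(view):
    # Wrap the upload_pgn view so we can see when each request enters and
    # leaves it, and how long the call took.
    def wrapper(request):
        start = time.monotonic()
        pgn_logger.info("upload_pgn: start")
        try:
            return view(request)
        finally:
            pgn_logger.info(
                "upload_pgn: done in %.3fs", time.monotonic() - start
            )
    return wrapper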

@vondele
Member Author

vondele commented Jul 2, 2024

I find a couple of suspicious errors in the logs:

Jun 30 12:18:21 ns304014 pserve[2691]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 13:52:45 ns304014 pserve[6871]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:03:07 ns304014 pserve[14360]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:37:17 ns304014 pserve[16530]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:23:45 ns304014 pserve[18385]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:53:09 ns304014 pserve[33684]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:44:11 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:59:31 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 18:34:56 ns304014 pserve[56080]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 20:18:10 ns304014 pserve[929]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:39:04 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:53:33 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 23:11:13 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:11:28 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:56:47 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration

@vondele
Member Author

vondele commented Jul 2, 2024

It always seems to occur at the same time as Validate_data_structures (but is not necessarily related to the hang):

Jun 30 19:03:16 ns304014 pserve[36228]: Validate_data_structures: validating Fishtest's internal data structures...
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

Thanks. I guess I either need to make a copy of those dicts before iterating over them, or else wrap them in an appropriate lock.
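
Both options, as a sketch with made-up names rather than Fishtest's actual data structures:

import threading

# Made-up shared dict and lock standing in for Fishtest's internal structures.
shared = {"a": 1, "b": 2}
shared_lock = threading.Lock()

# Option 1: iterate over a snapshot; concurrent inserts/deletes by other
# threads can no longer raise "dictionary changed size during iteration".
for key, value in list(shared.items()):
    print(key, value)

# Option 2: hold a lock that every writer also takes while mutating the dict.
with shared_lock:
    for key, value in shared.items():
        print(key, value)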

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

I found out how to log waitress connections (see https://docs.pylonsproject.org/projects/pyramid/en/main/narr/logging.html). production.ini has to be modified as follows:

[app:fishtest]
use = egg:fishtest_server

pyramid.reload_templates = false
pyramid.debug_authorization = false
pyramid.debug_notfound = false
pyramid.debug_routematch = false
pyramid.default_locale_name = en

mako.directories = fishtest:templates

fishtest.port = %(http_port)s
fishtest.primary_port = 6543

[filter:translogger]
use = egg:Paste#translogger
setup_console_handler = False

[pipeline:main]
pipeline = translogger
           fishtest

Output looks like this:

Starting server in PID 1239.
2024-07-02 05:38:25,985 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:25 +0000] "GET /tests HTTP/1.0" 200 86132 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:30,877 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:30 +0000] "GET /login HTTP/1.0" 200 18819 "https://3.82.105.159/tests" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:38,208 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:38 +0000] "GET /api/active_runs HTTP/1.0" 200 9949 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,401 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "POST /login HTTP/1.0" 302 169 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,513 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET / HTTP/1.0" 302 194 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,635 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET /tests HTTP/1.0" 200 87702 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
Failed_task: failure for: https://tests.stockfishchess.org/tests/view/66824033546a9b654e82da4b, task_id: 17, worker: user01-1cores-deruyts-81da, reason: 'Terminated by signal SIGINT'
2024-07-02 05:38:55,412 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:55 +0000] "POST /api/failed_task HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:38:57,699 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:56 +0000] "POST /api/upload_pgn HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:39:01,916 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:39:01 +0000] "POST /api/request_version HTTP/1.0" 200 36 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,269 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,885 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:03,500 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:39:03 +0000] "POST /api/request_task HTTP/1.0" 200 1241 "-" "python-requests/2.25.1"
Validate_data_structures: validating Fishtest's internal data structures...
Validate_random_run: validated cache run 6680543e4eb1934d34b5aba5...
Clean_wtt_map: 1 active workers...
2024-07-02 05:41:02,901 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:41:02 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
Clean_wtt_map: 1 active workers...
The run object 66823f9c546a9b654e82da47 does not validate: run['results_info'] is not in the schema
2024-07-02 05:43:03,325 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:43:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:45:03,857 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:45:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"

@vondele
Member Author

vondele commented Jul 2, 2024

I have been looking at the upload_pgn code, and in principle it looks 'really simple'. What I can't figure out is how the communication between the two pserves actually happens, and in particular whether it could just wait forever (i.e. with no timeout) in case some error happens there.
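
To illustrate the concern: an HTTP call between instances with no timeout would indeed block forever if the peer stops responding. A hypothetical example, not actual Fishtest code:

import requests

payload = {"example": True}

# Without a timeout, requests.post() waits indefinitely if the peer accepts
# the connection but never answers; passing one bounds the wait instead.
response = requests.post(
    "http://localhost:6543/api/some_endpoint",  # hypothetical endpoint
    json=payload,
    timeout=30,
)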

@vdbergh
Contributor

vdbergh commented Jul 2, 2024

Hmm. Without looking at the source code, offhand I would say there is no communication between the two pserves. We have been experimenting with an api where a secondary instance asks the primary instance for some information, but this is currently on hold.

@vondele
Member Author

vondele commented Jul 2, 2024

right... so the only 'communication' is through mongodb in some sense.

vdbergh added a commit to vdbergh/fishtest that referenced this issue Jul 2, 2024
ppigazzini pushed a commit that referenced this issue Jul 3, 2024
@vdbergh
Contributor

vdbergh commented Jul 6, 2024

During a fleet visit we should probably turn on access logging for waitress (see #2094 (comment)). If pserve then hangs (and it really seems to be a pserve hang, and not Fishtest), we would know which api was served last, and by which thread.
