-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recursive error handling failure #35151
Comments
I think I've figured out the KeyError, at least -- the param name in |
We sometimes see rendering errors in the error page itself, which then cause another attempt at rendering the error page. By catching all errors raised during error-page render and substituting in a hardcoded string, we can reduce server resources, avoid logging massive sequences of recursive stack traces, and still give the user *some* indication that yes, there was a problem. This should help address #35151 At least one of these rendering errors is known to be due to a translation error. There's a separate issue for restoring translation quality so that we avoid those issues in the future (openedx/openedx-translations#549) but in general we should catch all rendering errors, including unknown ones. Testing: - In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the usual error page is displayed (rather than the debug error page). - Add line `1/0` to the top of the `student_dashboard` function in `common/djangoapps/student/views/dashboard.py` to make that view error. - In `lms/templates/static_templates/server-error.html` replace `static.get_platform_name()` with `None * 7` to make the error template itself produce an error. - Visit <http://localhost:18000/dashboard>. Without the fix, the response takes 10 seconds and produces a 6 MB, 85k line set of stack traces and the page displays "A server error occurred. Please contact the administrator." Without the fix, the response takes less than a second and produces three stack traces (one of which contains the error page's rendering error).
The bad translation explains:
It doesn't explain:
|
We sometimes see rendering errors in the error page itself, which then cause another attempt at rendering the error page. I'm not sure _exactly_ how the loop is occurring, but it looks something like this: 1. An error is raised in a view or middleware and is not caught by application code 2. Django catches the error and calls the registered uncaught error handler 3. Our handler tries to render an error page 4. The rendering code raises an error 5. GOTO 2 (until some sort of server limit is reached) By catching all errors raised during error-page render and substituting in a hardcoded string, we can reduce server resources, avoid logging massive sequences of recursive stack traces, and still give the user *some* indication that yes, there was a problem. This should help address #35151 At least one of these rendering errors is known to be due to a translation error. There's a separate issue for restoring translation quality so that we avoid those issues in the future (openedx/openedx-translations#549) but in general we should catch all rendering errors, including unknown ones. Testing: - In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the usual error page is displayed (rather than the debug error page). - Add line `1/0` to the top of the `student_dashboard` function in `common/djangoapps/student/views/dashboard.py` to make that view error. - In `lms/templates/static_templates/server-error.html` replace `static.get_platform_name()` with `None * 7` to make the error template itself produce an error. - Visit <http://localhost:18000/dashboard>. Without the fix, the response takes 10 seconds and produces a 6 MB, 85k line set of stack traces and the page displays "A server error occurred. Please contact the administrator." Without the fix, the response takes less than a second and produces three stack traces (one of which contains the error page's rendering error).
We sometimes see rendering errors in the error page itself, which then cause another attempt at rendering the error page. I'm not sure _exactly_ how the loop is occurring, but it looks something like this: 1. An error is raised in a view or middleware and is not caught by application code 2. Django catches the error and calls the registered uncaught error handler 3. Our handler tries to render an error page 4. The rendering code raises an error 5. GOTO 2 (until some sort of server limit is reached) By catching all errors raised during error-page render and substituting in a hardcoded string, we can reduce server resources, avoid logging massive sequences of recursive stack traces, and still give the user *some* indication that yes, there was a problem. This should help address #35151 At least one of these rendering errors is known to be due to a translation error. There's a separate issue for restoring translation quality so that we avoid those issues in the future (openedx/openedx-translations#549) but in general we should catch all rendering errors, including unknown ones. Testing: - In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the usual error page is displayed (rather than the debug error page). - Add line `1/0` to the top of the `student_dashboard` function in `common/djangoapps/student/views/dashboard.py` to make that view error. - In `lms/templates/static_templates/server-error.html` replace `static.get_platform_name()` with `None * 7` to make the error template itself produce an error. - Visit <http://localhost:18000/dashboard>. Without the fix, the response takes 10 seconds and produces a 6 MB, 85k line set of stack traces and the page displays "A server error occurred. Please contact the administrator." With the fix, the response takes less than a second and produces three stack traces (one of which contains the error page's rendering error).
Looking more broadly for any logs containing Much smaller numbers are found with
We're probably trying to do too much on this error page and should at the very least suppress these errors and have fallback strategies. I believe the remaining log messages are largely partial stack traces that aren't filtered out of this search because the logged fragment doesn't carry enough information. |
For reference, #35209 deployed to prod on 2024-07-31 at 19:10-19:16 UTC |
We sometimes see rendering errors in the error page itself, which then cause another attempt at rendering the error page. I'm not sure _exactly_ how the loop is occurring, but it looks something like this: 1. An error is raised in a view or middleware and is not caught by application code 2. Django catches the error and calls the registered uncaught error handler 3. Our handler tries to render an error page 4. The rendering code raises an error 5. GOTO 2 (until some sort of server limit is reached) By catching all errors raised during error-page render and substituting in a hardcoded string, we can reduce server resources, avoid logging massive sequences of recursive stack traces, and still give the user *some* indication that yes, there was a problem. This should help address openedx#35151 At least one of these rendering errors is known to be due to a translation error. There's a separate issue for restoring translation quality so that we avoid those issues in the future (openedx/openedx-translations#549) but in general we should catch all rendering errors, including unknown ones. Testing: - In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the usual error page is displayed (rather than the debug error page). - Add line `1/0` to the top of the `student_dashboard` function in `common/djangoapps/student/views/dashboard.py` to make that view error. - In `lms/templates/static_templates/server-error.html` replace `static.get_platform_name()` with `None * 7` to make the error template itself produce an error. - Visit <http://localhost:18000/dashboard>. Without the fix, the response takes 10 seconds and produces a 6 MB, 85k line set of stack traces and the page displays "A server error occurred. Please contact the administrator." With the fix, the response takes less than a second and produces three stack traces (one of which contains the error page's rendering error).
On edx.org we're seeing some uncaught errors that are themselves spawning new uncaught errors relating to Mako templates, causing very large stack traces. See samples below. Instances take the form of a series of stack traces:
handle_uncaught_exception
and thenHttpResponseServerError(render_to_string('static_templates/server-error.html', {}, request=request))
as the server tries to serve an error page.Samples
Recursion Error
A RecursionError due to directly-recursive
gettext
calls with repetitions in the neighborhood of 850 recursions — growing by 7 on each error-handling iteration: RecursionError.txtFind more of these in Datadog using span search
env:prod service:edx-edxapp-lms @error.message:*RecursionError*
-- but note: The error itself is being investigated separately in edx/edx-arch-experiments#674KeyError
A KeyError involving an Arabic string ("اسم", or "name") while templating a navbar header: KeyError.txt
The text was updated successfully, but these errors were encountered: