Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursive error handling failure #35151

Closed
timmc-edx opened this issue Jul 22, 2024 · 5 comments
Closed

Recursive error handling failure #35151

timmc-edx opened this issue Jul 22, 2024 · 5 comments
Assignees

Comments

@timmc-edx
Copy link
Contributor

timmc-edx commented Jul 22, 2024

On edx.org we're seeing some uncaught errors that are themselves spawning new uncaught errors relating to Mako templates, causing very large stack traces. See samples below. Instances take the form of a series of stack traces:

  • Each stack trace passes through Django's handle_uncaught_exception and then HttpResponseServerError(render_to_string('static_templates/server-error.html', {}, request=request)) as the server tries to serve an error page.
  • As Mako tries to template the navigation bar header, one of two errors occurs.
  • This is then handled as a new uncaught error ("During handling of the above exception, another exception occurred:") and the cycle repeats.

Samples

Recursion Error

A RecursionError due to directly-recursive gettext calls with repetitions in the neighborhood of 850 recursions — growing by 7 on each error-handling iteration: RecursionError.txt

Find more of these in Datadog using span search env:prod service:edx-edxapp-lms @error.message:*RecursionError* -- but note: The error itself is being investigated separately in edx/edx-arch-experiments#674

  File "/tmp/mako_lms/6f79637247faff419ef55a8128bc53da/header/navbar-logo-header.html.py", line 79, in render_body
    context['self'].navigation_logo(**pageargs)
  File "/tmp/mako_lms/6f79637247faff419ef55a8128bc53da/header/navbar-logo-header.html.py", line 134, in render_navigation_logo
    __M_writer(filters.html_escape(filters.decode.utf8(_("{platform_name} Home Page").format(platform_name=static.get_platform_name()))))
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/edx/app/edxapp/venvs/edxapp/lib/python3.11/site-packages/django/utils/translation/__init__.py", line 95, in gettext
    return _trans.gettext(message)
           ^^^^^^^^^^^^^^^^^^^^^^^
[...more calls eventually leading to a gettext loop...]

KeyError

A KeyError involving an Arabic string ("اسم", or "name") while templating a navbar header: KeyError.txt

  File "/tmp/mako_lms/6f79637247faff419ef55a8128bc53da/header/navbar-logo-header.html.py", line 69, in render_body
    __M_writer(filters.html_escape(filters.decode.utf8(_('{name} Dashboard').format(name=enterprise_customer_link.get('name')))))
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'اسم'
@timmc-edx timmc-edx moved this to In Progress in Arch-BOM Jul 22, 2024
@timmc-edx timmc-edx self-assigned this Jul 22, 2024
@timmc-edx timmc-edx changed the title Recursive error handling issue involving mako Recursive error handling issue involving gettext and/or mako Jul 30, 2024
@timmc-edx timmc-edx changed the title Recursive error handling issue involving gettext and/or mako Recursive error handling failure Jul 30, 2024
@timmc-edx
Copy link
Contributor Author

timmc-edx commented Jul 31, 2024

I think I've figured out the KeyError, at least -- the param name in {name} Dashboard got translated to Arabic, and so the template can't find the name param any more. We're supposed to have checks in place against this sort of thing. Added a comment here: openedx/openedx-translations#549 (comment)

timmc-edx added a commit that referenced this issue Jul 31, 2024
We sometimes see rendering errors in the error page itself, which then
cause another attempt at rendering the error page. By catching all errors
raised during error-page render and substituting in a hardcoded string, we
can reduce server resources, avoid logging massive sequences of recursive
stack traces, and still give the user *some* indication that yes, there was
a problem.

This should help address #35151

At least one of these rendering errors is known to be due to a translation
error. There's a separate issue for restoring translation quality so that
we avoid those issues in the future (openedx/openedx-translations#549)
but in general we should catch all rendering errors, including unknown
ones.

Testing:

- In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the
  usual error page is displayed (rather than the debug error page).
- Add line `1/0` to the top of the `student_dashboard` function in
 `common/djangoapps/student/views/dashboard.py` to make that view error.
- In `lms/templates/static_templates/server-error.html` replace
  `static.get_platform_name()` with `None * 7` to make the error template
  itself produce an error.
- Visit <http://localhost:18000/dashboard>.

Without the fix, the response takes 10 seconds and produces a 6 MB, 85k
line set of stack traces and the page displays "A server error occurred.
Please contact the administrator."

Without the fix, the response takes less than a second and produces three
stack traces (one of which contains the error page's rendering error).
@timmc-edx
Copy link
Contributor Author

timmc-edx commented Jul 31, 2024

The bad translation explains:

  • The presence of the KeyError. (Bad translation for something used in the error template.)
  • Why we haven't been seeing this issue all along. (Bad translation was introduced after switch to Atlas, which doesn't have as good validation as the old system.)

It doesn't explain:

  • The RecursionError, although that could be a gettext bug or similar that is only provoked by bad data of some kind.
  • Why we're getting 500s in the first place (and we may not have good info on those until we get rid of these giant stack traces via fix: Prevent error page recursion #35209)
  • The clustering by host.

timmc-edx added a commit that referenced this issue Jul 31, 2024
We sometimes see rendering errors in the error page itself, which then
cause another attempt at rendering the error page. I'm not sure _exactly_
how the loop is occurring, but it looks something like this:

1. An error is raised in a view or middleware and is not caught by
   application code
2. Django catches the error and calls the registered uncaught error
   handler
3. Our handler tries to render an error page
4. The rendering code raises an error
5. GOTO 2 (until some sort of server limit is reached)

By catching all errors raised during error-page render and substituting in
a hardcoded string, we can reduce server resources, avoid logging massive
sequences of recursive stack traces, and still give the user *some*
indication that yes, there was a problem.

This should help address #35151

At least one of these rendering errors is known to be due to a translation
error. There's a separate issue for restoring translation quality so that
we avoid those issues in the future (openedx/openedx-translations#549)
but in general we should catch all rendering errors, including unknown
ones.

Testing:

- In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the
  usual error page is displayed (rather than the debug error page).
- Add line `1/0` to the top of the `student_dashboard` function in
 `common/djangoapps/student/views/dashboard.py` to make that view error.
- In `lms/templates/static_templates/server-error.html` replace
  `static.get_platform_name()` with `None * 7` to make the error template
  itself produce an error.
- Visit <http://localhost:18000/dashboard>.

Without the fix, the response takes 10 seconds and produces a 6 MB, 85k
line set of stack traces and the page displays "A server error occurred.
Please contact the administrator."

Without the fix, the response takes less than a second and produces three
stack traces (one of which contains the error page's rendering error).
timmc-edx added a commit that referenced this issue Jul 31, 2024
We sometimes see rendering errors in the error page itself, which then
cause another attempt at rendering the error page. I'm not sure _exactly_
how the loop is occurring, but it looks something like this:

1. An error is raised in a view or middleware and is not caught by
   application code
2. Django catches the error and calls the registered uncaught error
   handler
3. Our handler tries to render an error page
4. The rendering code raises an error
5. GOTO 2 (until some sort of server limit is reached)

By catching all errors raised during error-page render and substituting in
a hardcoded string, we can reduce server resources, avoid logging massive
sequences of recursive stack traces, and still give the user *some*
indication that yes, there was a problem.

This should help address #35151

At least one of these rendering errors is known to be due to a translation
error. There's a separate issue for restoring translation quality so that
we avoid those issues in the future (openedx/openedx-translations#549)
but in general we should catch all rendering errors, including unknown
ones.

Testing:

- In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the
  usual error page is displayed (rather than the debug error page).
- Add line `1/0` to the top of the `student_dashboard` function in
 `common/djangoapps/student/views/dashboard.py` to make that view error.
- In `lms/templates/static_templates/server-error.html` replace
  `static.get_platform_name()` with `None * 7` to make the error template
  itself produce an error.
- Visit <http://localhost:18000/dashboard>.

Without the fix, the response takes 10 seconds and produces a 6 MB, 85k
line set of stack traces and the page displays "A server error occurred.
Please contact the administrator."

With the fix, the response takes less than a second and produces three
stack traces (one of which contains the error page's rendering error).
@timmc-edx
Copy link
Contributor Author

Looking more broadly for any logs containing handle_uncaught_exception (which should never appear in a stack trace if we're handling errors properly) I see that there's a steady rate of these from a variety of hosts. Most of these are KeyError (which can be a steady trickle, or a burst) and RecursionError (almost always bursts).

Much smaller numbers are found with env:prod play:edxapp "handle_uncaught_exception" -"KeyError: 'اسم'" -RecursionError. Many have to do get_enterprise_learner_portal:

  File "/tmp/mako_lms/6f79637247faff419ef55a8128bc53da/header/navbar-logo-header.html.py", line 55, in render_body
    enterprise_customer_link = get_enterprise_learner_portal(request)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/edx/app/edxapp/edx-platform/openedx/features/enterprise_support/utils.py", line 328, in get_enterprise_learner_portal
    user = request.user
           ^^^^^^^^^^^^
AttributeError: 'WSGIRequest' object has no attribute 'user'
  File "/tmp/mako_lms/6f79637247faff419ef55a8128bc53da/header/header.html.py", line 93, in render_body
    runtime._include_file(context, 'navbar-logo-header.html', _template_uri, online_help_token=online_help_token)
  File "/edx/app/edxapp/venvs/edxapp/lib/python3.11/site-packages/mako/runtime.py", line 793, in _include_file
    callable_(ctx, **kwargs)
  File "/tmp/mako_lms/6f79637247faff419ef55a8128bc53da/header/navbar-logo-header.html.py", line 55, in render_body
    enterprise_customer_link = get_enterprise_learner_portal(request)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/edx/app/edxapp/edx-platform/openedx/features/enterprise_support/utils.py", line 342, in get_enterprise_learner_portal
    enterprise_customer_uuid = enterprise_customer_uuid_for_request(request)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/edx/app/edxapp/edx-platform/openedx/features/enterprise_support/api.py", line 385, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/edx/app/edxapp/edx-platform/openedx/features/enterprise_support/api.py", line 505, in enterprise_customer_uuid_for_request
    sso_provider_id = Registry.get_from_pipeline(running_pipeline).provider_id
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'provider_id'

We're probably trying to do too much on this error page and should at the very least suppress these errors and have fallback strategies.

I believe the remaining log messages are largely partial stack traces that aren't filtered out of this search because the logged fragment doesn't carry enough information.

@timmc-edx
Copy link
Contributor Author

For reference, #35209 deployed to prod on 2024-07-31 at 19:10-19:16 UTC

@timmc-edx
Copy link
Contributor Author

Error page recursion was indeed fixed in #35209

We sometimes still get some minor recursion. The 404 error page can raise an error, leading to rendering the 500 error page, and then that errors—but that's the end of it. Those cases are now ticketed as #35216 for future work.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Aug 1, 2024
cmltaWt0 pushed a commit to raccoongang/edx-platform that referenced this issue Aug 8, 2024
We sometimes see rendering errors in the error page itself, which then
cause another attempt at rendering the error page. I'm not sure _exactly_
how the loop is occurring, but it looks something like this:

1. An error is raised in a view or middleware and is not caught by
   application code
2. Django catches the error and calls the registered uncaught error
   handler
3. Our handler tries to render an error page
4. The rendering code raises an error
5. GOTO 2 (until some sort of server limit is reached)

By catching all errors raised during error-page render and substituting in
a hardcoded string, we can reduce server resources, avoid logging massive
sequences of recursive stack traces, and still give the user *some*
indication that yes, there was a problem.

This should help address openedx#35151

At least one of these rendering errors is known to be due to a translation
error. There's a separate issue for restoring translation quality so that
we avoid those issues in the future (openedx/openedx-translations#549)
but in general we should catch all rendering errors, including unknown
ones.

Testing:

- In `lms/envs/devstack.py` change `DEBUG` to `False` to ensure that the
  usual error page is displayed (rather than the debug error page).
- Add line `1/0` to the top of the `student_dashboard` function in
 `common/djangoapps/student/views/dashboard.py` to make that view error.
- In `lms/templates/static_templates/server-error.html` replace
  `static.get_platform_name()` with `None * 7` to make the error template
  itself produce an error.
- Visit <http://localhost:18000/dashboard>.

Without the fix, the response takes 10 seconds and produces a 6 MB, 85k
line set of stack traces and the page displays "A server error occurred.
Please contact the administrator."

With the fix, the response takes less than a second and produces three
stack traces (one of which contains the error page's rendering error).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant