occasional server error page appears randomly during navigation #597

Closed
1 of 3 tasks
mnaydan opened this issue Mar 4, 2024 · 6 comments

@mnaydan
Contributor

mnaydan commented Mar 4, 2024

This is a generic error page from the library that appears randomly (no pattern detected) on the frontend and backend during navigation of the site. I encountered it a lot during the Brogan data work, and we discovered that at least some of the errors are 502 bad gateway errors. We made a patch release in which we restarted both servers, but that didn't solve the problem. Francis and Alicia then suggested another patch release to stop the bots from crawling the aggregated cluster URLs, as they were causing lots of errors and crowding the logs. We did that, but we may need to do another patch release to redirect the aggregated cluster URLs if traffic to those URLs doesn't decrease. Anecdotally, I haven't encountered the error page today as I've been navigating, but I'm not spending as much time on it as I was during the data work. @rlskoeser please add or amend anything I missed.
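For illustration only, here is a minimal sketch of how a redirect target for a multi-cluster URL could be computed. It assumes the bogus URLs repeat a query parameter named `cluster`; that parameter name, the example paths, and both helper functions are hypothetical and not taken from the PPA codebase.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name; the real PPA URLs may be structured differently.
CLUSTER_PARAM = "cluster"


def has_multiple_clusters(url):
    """Return True if the query string repeats the cluster parameter."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return sum(1 for key, _ in params if key == CLUSTER_PARAM) > 1


def keep_first_cluster(url):
    """Return the URL with every cluster parameter after the first removed."""
    parts = urlsplit(url)
    seen = False
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == CLUSTER_PARAM:
            if seen:
                continue  # drop repeated cluster parameters
            seen = True
        kept.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A view or middleware could call `has_multiple_clusters` on the incoming URL and, if it returns True, redirect to the result of `keep_first_cluster`. Whether keeping the first cluster is the right target is a design choice this sketch does not settle.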

@mnaydan mnaydan added the bug label Mar 4, 2024
@rlskoeser rlskoeser self-assigned this Mar 4, 2024
@rlskoeser
Contributor

When I checked the logs on the two VMs running the PPA web application, I couldn't find any errors in either the nginx log or the Django logs. Francis reminded me that I can use Datadog to look at traffic going to the load balancer, and once we figured out the correct way to filter to just requests going to PPA, we could see a large number of warnings and errors with the 502 Bad Gateway status (which points to an upstream failure, meaning the PPA application somehow isn't responding in time). All of the errors I saw in the Datadog logs were triggered by bots crawling the site with the malformed URLs containing multiple clusters. This was what triggered the decision to update production with the fixes for the cluster search URLs.
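As a rough sketch of the kind of filtering described above, the following counts 502 responses per user agent from access log lines. It assumes the common nginx/Apache "combined" log format; the load balancer logs in Datadog may be structured differently, and the sample lines (addresses, paths, bot name) are invented for illustration.

```python
import re
from collections import Counter

# Assumed "combined" access log format:
# ip - user [time] "METHOD path protocol" status bytes "referer" "user agent"
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_502_by_agent(lines):
    """Count 502 responses per user agent; lines that don't parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("status") == "502":
            counts[match.group("agent")] += 1
    return counts


# Invented sample lines, not real PPA traffic.
sample = [
    '203.0.113.5 - - [04/Mar/2024:10:00:00 +0000] '
    '"GET /archive/?cluster=a&cluster=b HTTP/1.1" 502 157 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [04/Mar/2024:10:00:02 +0000] '
    '"GET /archive/?cluster=c&cluster=d HTTP/1.1" 502 157 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [04/Mar/2024:10:00:03 +0000] '
    '"GET /archive/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for agent, total in count_502_by_agent(sample).most_common():
    print(agent, total)
```

Grouping by user agent is one way to check whether the failing requests come from crawlers or from people using the site.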

I've been keeping an eye on Datadog today, and those errors are trailing off significantly compared to the steady stream of them we were seeing last week. Hopefully, this means the problem is fixed - but even if it isn't, it should now be much easier to find any actual errors in the logs because they won't be buried by all the cluster URL problems.

Francis said he would do the work to get our application logs included in Datadog so that I can look at them there, rather than having to log in and look at two different VMs. I think there may be a related step / possible blocker of getting CDH Ansible scripts running in Ansible Tower, which we want for other reasons anyway.

@rlskoeser
Contributor

The number of errors I'm seeing in datadog logs has gone down significantly, but not gone away entirely.

Here's the error/warn incident graph for the past 15 days:
[Screenshot, 2024-03-07 at 5:49:17 PM: error/warn incident graph, past 15 days]

And here's the error/warn incident graph for the past 7 days:
[Screenshot, 2024-03-07 at 5:52:14 PM: error/warn incident graph, past 7 days]

When I inspect manually, nearly all the errors I look at are those bogus multi-cluster urls.

@mnaydan
Contributor Author

mnaydan commented Mar 8, 2024

@rlskoeser do the errors at the current level seem to be causing problems for users? Anecdotally, I haven't encountered the server error page when I've been on the PPA recently but I haven't been doing data work for long stretches like I was before. Maybe it's something to monitor when I do my next round of data work.

@rlskoeser
Contributor

@mnaydan my hope is that the significantly lower level of errors is enough for us to stop seeing these. It should continue to trail off, since the bots crawling the site will no longer be queuing up these bogus cluster URLs to crawl. All the errors I see in Datadog now are triggered by bots, so hopefully no users are seeing the error at this point.

I agree that we should continue to keep an eye on it - you as you use the site, and I'll keep glancing at datadog occasionally.

@rlskoeser
Contributor

I wish I had been more careful about the scale of errors before - we're still getting errors but the scale is much lower. The two-week graph I posted on March 7 had a max of 2k on the y-axis; the one I generated today has a max of 500.

[Screenshot, 2024-04-15 at 11:06:12 AM: error/warn incident graph]

@mnaydan
Contributor Author

mnaydan commented Nov 13, 2024

I haven't seen this in a very long time so I'm going to close it.

@mnaydan mnaydan closed this as completed Nov 13, 2024
@github-project-automation github-project-automation bot moved this from IceBox to Done in Iteration Planning Board Nov 13, 2024