DNS Issues causing covers.openlibrary.org to be unusably slow #10417

Closed
13 tasks done
cdrini opened this issue Feb 3, 2025 · 2 comments
Assignees
cdrini

Labels
Affects: Operations (Affects the IA DevOps folks)
Priority: 0 (Fix now: Issue prevents users from using the site or active data corruption. [managed])
Type: Post-Mortem (Log for when having to resolve a P0 issue)

Comments

@cdrini
Collaborator

cdrini commented Feb 3, 2025

Summary

  • What is wrong? covers.openlibrary.org is reporting a high rate of HTTP 499 responses (brown bands in the graph below), and requests are taking ~30s to load covers.

[Image: covers traffic graph with brown bands of HTTP 499 responses]

Slow p95s and p99s in the covers Sentry dashboard, suspiciously clustered around 4s and 5s, indicating some sort of timeout:

[Image: Sentry p95/p99 latency graph for covers]

  • What caused it? DNS resolution on our covers servers seems to take >4s about 0.40% of the time. This delayed covers when connecting to our database server, saturating the worker pool and likely causing nginx queueing and very slow response times -- which often ended in 499s when a browser closed the connection after waiting too long.

  • What fixed it? Switching our database config to specify the direct IP of our database server instead of its hostname (a rough sketch of the change is below).
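For reference, a minimal sketch of the kind of change, assuming a web.py-style `web.database` setup; the real olsystem config is private, so the database name, user, hostname, and IP below are placeholders:

```python
import web

# Before: a hostname means each new connection triggers a DNS lookup,
# which occasionally stalled for >4s on the covers servers.
# db = web.database(dbn="postgres", db="coverstore",
#                   host="ol-db.example.org", user="covers")

# After: pointing at the IP directly skips DNS on the connect path.
# (203.0.113.10 is a documentation address, not the real server.)
db = web.database(dbn="postgres", db="coverstore",
                  host="203.0.113.10", user="covers")
```

The trade-off is that a database move or failover that changes the IP now requires a config update rather than just a DNS change.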

Drop in worker utilization and in the number of covers workers stuck in the database connect method (the dark purple bars show how many workers were in the connect method at a given moment within each minute): from ~50 of 60 workers down to <5.
[Image: covers gunicorn worker utilization and workers-in-connect graph]

Drop in number of open database connections for covers:

[Image: open database connections graph for covers]

Drop in p95 and p99 load times on covers (Sentry):

[Image: Sentry p95/p99 load time graph for covers]

Sentry apdex for covers requests also went up to ~100%:

[Image: Sentry Apdex graph for covers requests]

The impact on our web nodes was less pronounced, but we saw similar load improvements on e.g. the /query.json endpoint, which is effectively just a wrapper around a database call:
[Image: /query.json load time graph]

And here is the change in workers in connect. Note that only ol-web1 and ol-web2 were really affected; the investigation revealed that ol-web0 had a much lower DNS timeout rate, so I didn't bother changing its config.

[Image: workers-in-connect graph for ol-web0, ol-web1, and ol-web2]

  • What was the impact? Cover load times of up to ~30s during the periods of high 499 rates.

  • What could have gone better?

  • Followup actions:

  • Add graphs for monitoring DNS timeout rate (here)

  • Add graphs for monitoring covers traffic (here)

  • Add graphs for monitoring gunicorn worker utilization (here for ol-web, here for covers)

  • Raise with IA Ops to investigate why DNS might be timing out in this fashion

  • Do analysis to determine if certain servers time out more than others -- they do (results, internal only; summarized in the table below, with a measurement sketch after it)

| Server | Min DNS query time (ms) | Median DNS query time (ms) | Avg DNS query time (ms, excl. outliers) | Max DNS query time (ms) | Timed-out queries | Total queries | % timed out |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ol-covers0 | 0 | 1 | 1.2085 | 5007 | 8 | 2000 | 0.40% |
| ol-web0 | 0 | 1 | 1.092 | 5004 | 1 | 2000 | 0.05% |
| ol-web1 | 0 | 1 | 1.196 | 5006 | 4 | 2000 | 0.20% |
| ol-web2 | 0 | 1 | 1.157 | 5005 | 4 | 2000 | 0.20% |
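For reference, a rough sketch of how this kind of per-server measurement could be reproduced with just the Python standard library; the hostname and thresholds are placeholders, and the real analysis linked above (internal) likely used `dig` against the actual resolver:

```python
import socket
import time

HOST = "ol-db.example.org"  # placeholder; the real hostname is internal
SAMPLES = 2000              # same sample size as the table above
TIMEOUT_MS = 4000           # flag anything slower than 4s as a timeout

timings_ms = []
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        socket.getaddrinfo(HOST, None)  # one resolution (may hit local caches)
    except socket.gaierror:
        pass  # count failed resolutions as samples too
    timings_ms.append((time.monotonic() - start) * 1000)

timed_out = [t for t in timings_ms if t >= TIMEOUT_MS]
timings_ms.sort()
print(f"min={timings_ms[0]:.1f}ms "
      f"median={timings_ms[len(timings_ms) // 2]:.1f}ms "
      f"max={timings_ms[-1]:.1f}ms "
      f"timeouts={len(timed_out)} ({len(timed_out) / SAMPLES:.2%})")
```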

Steps to close

  1. Assignment: Is someone assigned to this issue? (notetaker, responder)
  2. Labels: Is there an Affects: label applied?
  3. Diagnosis: Add a description and scope of the issue
  4. Updates: As events unfold, is notable provenance documented in issue comments? (i.e. useful debug commands / steps / learnings / reference links)
  5. "What caused it?" - please answer in summary
  6. "What fixed it?" - please answer in summary
  7. "Followup actions:" actions added to summary
@cdrini cdrini added the Priority: 0 and Type: Post-Mortem labels Feb 3, 2025
@cdrini cdrini self-assigned this Feb 3, 2025
@mekarpeles mekarpeles added this to the Sprint 2025-02 milestone Feb 9, 2025
@mekarpeles mekarpeles changed the title covers.openlibrary.org unusably slow DNS Issues causing covers.openlibrary.org to be unusably slow Feb 9, 2025
@mekarpeles
Member

This PR allowed us to switch from hostname to IP in several strategic and conveniently easy places within our config, to decrease the impact of recent expensive DNS lookups: https://github.com/internetarchive/olsystem/pull/256

@cdrini cdrini added the Affects: Operations label Feb 10, 2025
@cdrini
Collaborator Author

cdrini commented Feb 10, 2025

Another thing that I think would be worth investigating is whether this affects our solr requests, and whether we should try switching those to a direct IP, since I occasionally notice a similarly suspicious line at ~4s in the Sentry p95/p99 graphs:

[Image: Sentry p95/p99 graph for solr requests]

This is doubly peculiar, since the timeout for solr requests is 10 seconds:

return self.session.get(url, timeout=10)

I tried changing the config value for solr to the direct IP, but it seemed to cause an error; perhaps requests.get was unhappy with a direct IP there.
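One guess -- and it is only an assumption, not something verified here -- is that the Solr endpoint relies on name-based routing or host validation, so a bare IP in the URL breaks it. A sketch of connecting by IP while still sending the original hostname in the Host header (the IP, hostname, port, and core name are placeholders):

```python
import requests

SOLR_IP = "203.0.113.20"        # placeholder; not the real Solr box
SOLR_HOST = "solr.example.org"  # placeholder; the hostname Solr expects

def solr_select(core, params):
    # Connect by IP so the request path does no DNS lookup, but keep the
    # original hostname in the Host header in case the endpoint depends
    # on name-based virtual hosting.
    url = f"http://{SOLR_IP}:8983/solr/{core}/select"
    return requests.get(url, params=params,
                        headers={"Host": SOLR_HOST}, timeout=10)
```

If the error was something else (e.g. TLS verification), an /etc/hosts entry on the host would avoid the DNS lookup without touching the URL at all.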

@cdrini cdrini closed this as completed Feb 10, 2025