DNS Issues causing covers.openlibrary.org to be unusably slow #10417
Labels
- Affects: Operations — affects the IA DevOps folks
- Priority: 0 — Fix now: issue prevents users from using the site or causes active data corruption. [managed]
- Type: Post-Mortem — log for when having to resolve a P0 issue
Summary
Slow p95 and p99 response times in the covers Sentry dashboard, suspiciously clustered around 4s and 5s, indicating some sort of timeout:
What caused it? DNS resolution on our covers servers appears to time out (taking >4s) about 0.40% of the time. This delayed covers when connecting to our database server, which saturated the worker pool, which in turn likely caused nginx queueing and very slow response times -- often resulting in 499s when a browser closed the connection after waiting too long.
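For a rough sense of how a timeout rate like this can be quantified, a small probe along these lines times repeated hostname lookups; the hostname, sample count, and threshold below are placeholder assumptions, not the exact methodology used here:

```python
import socket
import time

HOSTNAME = "covers-db.example.internal"  # placeholder, not the real hostname
SAMPLES = 1000
THRESHOLD_S = 4

slow = 0
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        # Resolve the hostname the same way a new database connection would.
        socket.getaddrinfo(HOSTNAME, None)
    except socket.gaierror:
        pass  # count failed lookups by elapsed time, like slow ones
    if time.monotonic() - start > THRESHOLD_S:
        slow += 1

print(f"{slow / SAMPLES:.2%} of lookups took more than {THRESHOLD_S}s")
```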
What fixed it? Switching our database config to specify the direct IP of our database server instead of its hostname.
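To illustrate the shape of the fix (not the exact olsystem config), a minimal sketch assuming a psycopg2-style connection; the hostname, IP, and database name are placeholders:

```python
import psycopg2  # assumes a psycopg2-style connection; the real code path may differ

# Before: connecting by hostname meant a DNS lookup on every new connection,
# which occasionally hung for more than 4 seconds on the covers servers.
# conn = psycopg2.connect(host="covers-db.example.internal", dbname="coverstore")

# After: connecting by IP skips DNS resolution entirely.
conn = psycopg2.connect(host="10.0.0.12", dbname="coverstore")
```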
Drop in worker utilization and covers workers in the database `connect` method (the dark purple bars are the number of workers in the `connect` method at a certain moment in a given minute): from ~50 of 60 workers in the `connect` method at a given moment, to <5.

(image: covers gunicorn worker utilization graph)

Drop in number of open database connections for covers:
Drop in p95 and p99 load times on covers (Sentry):
Sentry Apdex for covers requests also went up to ~100%:
The impact on our web nodes was less pronounced, but we saw similar load improvements to e.g. the `/query.json` endpoint, which is effectively just a wrapper around a database call:

(image: /query.json load times graph)

And here is the change in workers in `connect`. Note that only ol-web1 and ol-web2 were really affected; the investigation revealed that ol-web0 had a much lower DNS timeout rate, so I didn't bother changing its config.

What was the impact? Cover load times sometimes reached 30s during the periods of high 499s.
What could have gone better?
Follow-up actions:

- Add graphs for monitoring DNS timeout rate (here)
- Add graphs for monitoring covers traffic (here)
- Add graphs for monitoring gunicorn worker utilization (here for ol-web, here for covers)
- Raise with IA Ops to investigate why DNS might be timing out in this fashion
- Do analysis to determine whether certain servers time out more than others -- they do (results, internal only)
- `gunicorn` worker counts #10429

Steps to close
- Affects: label applied?