DNS Issues causing covers.openlibrary.org to be unusably slow #10417
Labels
- Affects: Operations — affects the IA DevOps folks
- Priority: 0 — Fix now: issue prevents users from using the site or causes active data corruption. [managed]
- Type: Post-Mortem — log for when having to resolve a P0 issue
Summary
Slow p95 and p99 response times in the covers Sentry dashboard, suspiciously clustered around 4s and 5s, indicating some sort of timeout:
What caused it? DNS resolution on our covers servers appears to time out (taking >4s) about 0.40% of the time. This delayed covers when connecting to our database server, which saturated the worker pool, which in turn likely caused nginx queueing and very slow response times -- often resulting in 499s when a browser closed the connection after waiting too long.
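For a rough sense of how a timeout rate like this can be quantified, a small probe along these lines times repeated hostname lookups; the hostname, sample count, and threshold below are placeholder assumptions, not the exact methodology used here:

```python
import socket
import time

HOSTNAME = "covers-db.example.internal"  # placeholder, not the real hostname
SAMPLES = 1000
THRESHOLD_S = 4

slow = 0
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        # Resolve the hostname the same way a new database connection would.
        socket.getaddrinfo(HOSTNAME, None)
    except socket.gaierror:
        pass  # count failed lookups by elapsed time, like slow ones
    if time.monotonic() - start > THRESHOLD_S:
        slow += 1

print(f"{slow / SAMPLES:.2%} of lookups took more than {THRESHOLD_S}s")
```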
What fixed it? Switching our database config to specify the direct IP of our database server instead of its hostname.
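To illustrate the shape of the fix (not the exact olsystem config), a minimal sketch assuming a psycopg2-style connection; the hostname, IP, and database name are placeholders:

```python
import psycopg2  # assumes a psycopg2-style connection; the real code path may differ

# Before: connecting by hostname meant a DNS lookup on every new connection,
# which occasionally hung for more than 4 seconds on the covers servers.
# conn = psycopg2.connect(host="covers-db.example.internal", dbname="coverstore")

# After: connecting by IP skips DNS resolution entirely.
conn = psycopg2.connect(host="10.0.0.12", dbname="coverstore")
```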
Drop in worker utilization and covers workers in the database `connect` method (the dark purple bars are the number of workers in the `connect` method at a certain moment in a given minute): from ~50 of 60 workers in the `connect` method at a given moment, to <5.

(image: covers gunicorn worker utilization graph)

Drop in number of open database connections for covers:
Drop in p95 and p99 load times on covers (Sentry):
Sentry Apdex for covers requests also went up to ~100%:
The impact on our web nodes was less pronounced, but we saw similar load improvements to e.g. the `/query.json` endpoint, which is effectively just a wrapper around a database call:

(image: /query.json load times graph)

And here is the change in workers in `connect`. Note that only ol-web1 and ol-web2 were really affected; the investigation revealed that ol-web0 had a much lower DNS timeout rate, so I didn't bother changing its config.

What was the impact? Cover load times sometimes reached 30s during the periods of high 499s.
What could have gone better?
Follow-up actions:

- Add graphs for monitoring DNS timeout rate (here)
- Add graphs for monitoring covers traffic (here)
- Add graphs for monitoring gunicorn worker utilization (here for ol-web, here for covers)
- Raise with IA Ops to investigate why DNS might be timing out in this fashion
- Do analysis to determine whether certain servers time out more than others -- they do (results, internal only)
- `gunicorn` worker counts #10429

Steps to close
- Affects: label applied?