Refactor: BlockBuffer should manage download timeouts #53

Open
bitjson opened this issue Jan 24, 2023 · 0 comments
Labels
bug (Something isn't working), good first issue (Good for newcomers)

Comments

bitjson commented Jan 24, 2023

The agent code evolved to include download timeouts outside of the BlockBuffer, which makes the code around "block reservations" harder to debug. There currently appears to be a bug where block reservations are sometimes created and never cancelled; if enough of these stack up, syncing stops until the agent is restarted. (This seems to happen especially in “prod-sim” environments where everything is on the same machine: BCHN becomes unresponsive to the agent until its initial sync is complete, and if chipnet/testnet and mainnet are both syncing, the block buffer sometimes becomes full of reservations without any active downloads.)

A more defensive design is for the block buffer itself to manage a timeout for each reservation: rather than being a simple internal counter, reservations could be an array of objects, each with a cancellation callback and a timestamp at which the reservation should be cancelled. Timed-out reservations can then be cancelled each time the block buffer is cleaned up. Before requesting a block, the agent simply registers the cancellation time and callback with the block buffer (maybe also the block height and node name for monitoring purposes), rather than managing download timers itself.
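As a rough illustration of the reservation design described above, here is a minimal sketch. All names (`BlockReservation`, `ReservationTracker`, `cancelExpired`, and so on) are hypothetical and are not taken from the agent's existing BlockBuffer; the sketch only shows reservations as objects with a deadline and a cancellation callback, swept on cleanup.

```typescript
// Hypothetical sketch: none of these names come from the existing agent code.
interface BlockReservation {
  /** Node the block was requested from (for monitoring). */
  nodeName: string;
  /** Height of the requested block (for monitoring). */
  blockHeight: number;
  /** Timestamp (ms since epoch) at which the reservation should be cancelled. */
  cancelAt: number;
  /** Called when the buffer cancels the reservation after it times out. */
  onCancel: () => void;
}

class ReservationTracker {
  private reservations: BlockReservation[] = [];
  private maxReservations: number;

  constructor(maxReservations: number) {
    this.maxReservations = maxReservations;
  }

  /** Register a reservation; returns undefined if no slot is available. */
  reserve(reservation: BlockReservation): BlockReservation | undefined {
    if (this.reservations.length >= this.maxReservations) return undefined;
    this.reservations.push(reservation);
    return reservation;
  }

  /** Release a reservation once its block arrives; no callback is fired. */
  release(reservation: BlockReservation): boolean {
    const index = this.reservations.indexOf(reservation);
    if (index === -1) return false;
    this.reservations.splice(index, 1);
    return true;
  }

  /** Cancel every reservation whose deadline has passed (run on each cleanup). */
  cancelExpired(now: number = Date.now()): BlockReservation[] {
    const expired = this.reservations.filter((r) => r.cancelAt <= now);
    this.reservations = this.reservations.filter((r) => r.cancelAt > now);
    expired.forEach((r) => r.onCancel());
    return expired;
  }

  get activeCount(): number {
    return this.reservations.length;
  }
}
```

Because the sweep runs as part of the buffer's own cleanup, a reservation cannot outlive its deadline even if the code that created it never cancels it, which is the failure mode described above.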

This refactor should also clean up our prioritization when requesting blocks: the current strategy of requesting from the least-synced chain is a great basic behavior, but if it’s causing the agent to spend all of its time on one mostly-unresponsive node, we’re wasting potential sync time of other nodes. So maybe after a block download is cancelled, we should temporarily bias block selection toward other, non-lagging nodes. In the best-case scenario, load is organized so that nodes finish syncing at a similar time, but throughput is never wasted waiting for a slow node (e.g. while it finishes an initial sync).
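One possible shape for that temporary bias is sketched below. It assumes each node carries a sync-progress fraction and a "deprioritized until" timestamp that is set whenever one of its downloads is cancelled; the names `NodeSyncState`, `recordCancellation`, and `selectNextNode`, and the idea of a fixed cooldown, are illustrative assumptions rather than anything in the current agent.

```typescript
// Hypothetical sketch: names and the fixed cooldown are assumptions.
interface NodeSyncState {
  nodeName: string;
  /** Fraction of the chain already synced, from 0 to 1. */
  syncProgress: number;
  /** Timestamp (ms) until which this node is deprioritized; 0 if never cancelled. */
  deprioritizedUntil: number;
}

/** After a cancelled download, deprioritize the node for `cooldownMs`. */
const recordCancellation = (
  node: NodeSyncState,
  now: number,
  cooldownMs: number
): void => {
  node.deprioritizedUntil = now + cooldownMs;
};

/** Pick the least-synced node among the given candidates. */
const leastSynced = (candidates: NodeSyncState[]): NodeSyncState | undefined =>
  candidates.reduce<NodeSyncState | undefined>(
    (best, node) =>
      best === undefined || node.syncProgress < best.syncProgress ? node : best,
    undefined
  );

/**
 * Prefer the least-synced node that is not in a cooldown; if every node is
 * cooling down, fall back to the plain least-synced behavior.
 */
const selectNextNode = (
  nodes: NodeSyncState[],
  now: number
): NodeSyncState | undefined => {
  const responsive = nodes.filter((node) => node.deprioritizedUntil <= now);
  return leastSynced(responsive.length > 0 ? responsive : nodes);
};
```

The fallback keeps the existing least-synced behavior when every node is lagging, so the bias only redirects requests when a more responsive node is actually available.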

Related: BCHN falls far behind on serving requests during initial sync, and it can sometimes send back a requested block >5 minutes later (see BCHN issue). We need to be a bit more intelligent than simply canceling these requests: when we forget that we requested them, the blocks that later arrive look like blocks mined by the node.
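One way to avoid forgetting cancelled requests is to keep a short-lived record of them, so that a late block can be told apart from a genuinely new one. The sketch below is only an assumption about how that might look: `RequestLog`, `BlockOrigin`, and the retention window are hypothetical, and the retention period would need to be comfortably longer than the delays observed from BCHN.

```typescript
// Hypothetical sketch: class, type, and retention window are assumptions.
type BlockOrigin = 'requested' | 'late-response' | 'unsolicited';

class RequestLog {
  private pending = new Set<string>();
  /** Maps request key to the time (ms) at which the request was cancelled. */
  private cancelled = new Map<string, number>();
  private retentionMs: number;

  constructor(retentionMs: number) {
    this.retentionMs = retentionMs;
  }

  private static key(nodeName: string, blockHash: string): string {
    return `${nodeName}:${blockHash}`;
  }

  /** Record that a block was requested from a node. */
  request(nodeName: string, blockHash: string): void {
    this.pending.add(RequestLog.key(nodeName, blockHash));
  }

  /** Move a pending request to the cancelled log instead of forgetting it. */
  cancel(nodeName: string, blockHash: string, now: number): void {
    const key = RequestLog.key(nodeName, blockHash);
    if (this.pending.delete(key)) this.cancelled.set(key, now);
  }

  /** Classify an arriving block and clear any record of its request. */
  classify(nodeName: string, blockHash: string, now: number): BlockOrigin {
    const key = RequestLog.key(nodeName, blockHash);
    if (this.pending.delete(key)) return 'requested';
    const cancelledAt = this.cancelled.get(key);
    if (cancelledAt === undefined) return 'unsolicited';
    this.cancelled.delete(key);
    return now - cancelledAt <= this.retentionMs ? 'late-response' : 'unsolicited';
  }

  /** Drop cancelled entries older than the retention window. */
  prune(now: number): void {
    for (const [key, cancelledAt] of this.cancelled) {
      if (now - cancelledAt > this.retentionMs) this.cancelled.delete(key);
    }
  }
}
```

With something like this, only blocks classified as `'unsolicited'` would be treated as newly announced by the node, while `'late-response'` blocks could be saved or dropped as ordinary sync traffic.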

bitjson added the bug and good first issue labels on Jan 24, 2023