Consensus restarts after 25 minutes instead of the 3-hour interval after start #828

Open
oleksandrSydorenkoJ opened this issue Feb 9, 2024 · 1 comment
Labels
bug Something isn't working release:2.5
Milestone

Comments

oleksandrSydorenkoJ commented Feb 9, 2024

Describe the bug
The consensus has 2 built-in timers for automatic restart in case of disconnection from the majority of nodes.

  1. STUCK_RESTART_INTERVAL_MS - triggers after 3 hours from the last mined block.
  2. HEALTHCHECK_ON_START_RETRY_TIME_SEC - starts after the Skaled launch and lasts for 1500 seconds.

If consensus loses the majority and restarts after 3 hours, the next Skaled restart is triggered by HEALTHCHECK_ON_START_RETRY_TIME_SEC, that is, only 25 minutes (1500 seconds) after the start. This complicates chain recovery after a crash: downloading a large snapshot may be physically impossible within 25 minutes, yet possible within 3 hours.
This is the result of https://github.com/skalenetwork/internal-support/issues/51
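
To make the timing concrete, here is a minimal sketch in Python (not the skaled code). It assumes both timers count from the moment Skaled starts without a majority and that whichever expires first triggers the restart. The function names, the snapshot size, and the bandwidth are hypothetical and chosen only for illustration.

```python
# Constant names and values are taken from the issue text above.
STUCK_RESTART_INTERVAL_MS = 3 * 60 * 60 * 1000   # 3 hours
HEALTHCHECK_ON_START_RETRY_TIME_SEC = 1500       # 25 minutes


def effective_restart_window_sec() -> float:
    """Assumed model: the timer that expires first decides the restart."""
    stuck_sec = STUCK_RESTART_INTERVAL_MS / 1000
    return min(stuck_sec, HEALTHCHECK_ON_START_RETRY_TIME_SEC)


def download_time_sec(snapshot_gb: float, bandwidth_mbit_s: float) -> float:
    """Time to download a snapshot, using decimal units (1 GB = 8000 Mbit)."""
    return snapshot_gb * 8 * 1000 / bandwidth_mbit_s


if __name__ == "__main__":
    window = effective_restart_window_sec()
    # Hypothetical example: a 100 GB snapshot over a 200 Mbit/s link.
    needed = download_time_sec(snapshot_gb=100, bandwidth_mbit_s=200)
    print(f"restart window: {window:.0f} s")                  # 1500 s
    print(f"download time:  {needed:.0f} s")                  # 4000 s
    print("fits in 25 min:", needed <= window)                # False
    print("fits in 3 h:   ", needed <= STUCK_RESTART_INTERVAL_MS / 1000)  # True
```

Under these assumptions the download needs 4000 seconds, which fits into the 3-hour interval but not into the 1500-second healthcheck window.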

Note:
All Skaled instances that were restarted without a majority of nodes will automatically be restarted in 3 hours.
All Skaled instances that were restarted with a majority of nodes after /issues/51 will be restarted every 25 minutes.

Preconditions:
Active medium-type schain (16 nodes)
At least 1 chain on the node

Version
skalenetwork/schain:3.17.1
skalenetwork/schain:3.18.0-beta.0

Steps to reproduce

  1. Stop 6 containers on the schain
  2. Wait 3 hours, then restart one of the 10 active containers (on Node A)
  3. Wait 25 minutes and check the skaled logs of the restarted container on Node A
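
The steps above leave 10 of 16 nodes running. A quick check, assuming "majority" means strictly more than 2/3 of all nodes (the issue's earlier title mentions "2/3 peers"; the exact rule used by consensus is not stated here), shows why that is below the threshold. The helper names are hypothetical.

```python
def required_for_majority(total_nodes: int) -> int:
    """Smallest node count strictly greater than 2/3 of total_nodes (assumed rule)."""
    return (2 * total_nodes) // 3 + 1


def has_majority(active_nodes: int, total_nodes: int) -> bool:
    return active_nodes >= required_for_majority(total_nodes)


if __name__ == "__main__":
    total = 16
    stopped = 6
    active = total - stopped
    print(required_for_majority(total))   # 11
    print(has_majority(active, total))    # False: 10 active nodes are not enough
```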

Expected behavior
Consensus should wait 3 hours before restarting itself if there is no majority of active nodes.

Actual state:
Consensus on Node A restarts after 25 minutes when there is no majority of nodes.

message (40).txt

@oleksandrSydorenkoJ added the bug (Something isn't working) label Feb 9, 2024
@oleksandrSydorenkoJ changed the title from "Consensus restarts after 25 minutes if it fails to connect 2/3 peers since the last start" to "Consensus restarts after 25 minutes if it fails to connect 2/3 peers when 11 active nodes" Feb 9, 2024
@oleksandrSydorenkoJ changed the title from "Consensus restarts after 25 minutes if it fails to connect 2/3 peers when 11 active nodes" to "Consensus restarts after 25 minutes instead of the 3-hour interval after start" Feb 9, 2024
@DmytroNazarenko added this to the SKALE 2.4 milestone Feb 9, 2024
@PolinaKiporenko moved this to Ready For Pickup in SKALE Engineering 🚀 Feb 13, 2024
@kladkogex modified the milestones: SKALE 2.4, SKALE 2.5 Mar 21, 2024
@kladkogex
Contributor

Moving to 2.5 as we don't have time for it in 2.4.

@PolinaKiporenko modified the milestones: SKALE 2.5, SKALE 2.6 Apr 24, 2024
@PolinaKiporenko moved this from Ready For Pickup to To Do in SKALE Engineering 🚀 Oct 10, 2024