Collation fetching fairness #4880

tdimitrov · 2024-06-26T07:13:31Z

Related to #1797

When fetching collations on the validator side of the collator protocol, we need to ensure that each parachain gets a fair share of core time according to its assignments in the claim queue. This means that the number of collations fetched per parachain should ideally be equal to (and never greater than) the number of claims for that parachain in the claim queue.

The current implementation doesn't guarantee such fairness. For each relay parent there is a waiting_queue (PerRelayParent -> Collations -> waiting_queue) which holds any unfetched collations advertised to the validator. The collations are fetched on a first-in, first-out basis, which means that if two parachains share a core and one of them is more aggressive, it might starve the other. How? At each relay parent up to max_candidate_depth candidates are accepted (enforced in fn is_seconded_limit_reached), so if one of the parachains is quick enough to fill the queue with its advertisements, the validator will never fetch anything from the rest of the parachains even though they are scheduled. This doesn't mean that the aggressive parachain will occupy all the core time (the runtime prevents that), but it will prevent the other parachains sharing the same core from having collations backed.
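A toy model of the starvation scenario, for illustration only. `ParaId`, `fetch_fifo` and the `limit` parameter (standing in for max_candidate_depth) are hypothetical names and not the subsystem's actual types:

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the real parachain id type.
type ParaId = u32;

/// Pops advertisements in arrival order until `limit` collations were fetched.
/// Returns the para ids in the order they were fetched.
fn fetch_fifo(mut waiting: VecDeque<ParaId>, limit: usize) -> Vec<ParaId> {
    let mut fetched = Vec::new();
    while fetched.len() < limit {
        match waiting.pop_front() {
            Some(para) => fetched.push(para),
            // Nothing left to fetch.
            None => break,
        }
    }
    fetched
}
```

If para 1 advertises three collations before para 2 advertises one and the limit is 3, every fetch goes to para 1 and para 2 gets nothing, even though both are scheduled on the core.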

~~The solution I am proposing extends the checks in is_seconded_limit_reached with an additional check.~~ The solution I am proposing is to limit fetches and advertisements based on the state of the claim queue. At each relay parent the claim queue for the core assigned to the validator is fetched. For each parachain a fetch limit is calculated (equal to the number of its entries in the claim queue). Advertisements are not fetched for a parachain which has already used up its claims in the claim queue. This solves the problem of aggressive parachains advertising too many collations.
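A minimal sketch of the per-parachain limit described above, assuming the claim queue for the core is available as a plain slice of para ids. `ParaId`, `fetch_limits` and `can_fetch` are hypothetical names used only for illustration:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real parachain id type.
type ParaId = u32;

/// Counts how many claims each parachain has in the claim queue of one core.
/// That count is used as the parachain's fetch limit.
fn fetch_limits(claim_queue: &[ParaId]) -> HashMap<ParaId, usize> {
    let mut limits = HashMap::new();
    for para in claim_queue {
        *limits.entry(*para).or_insert(0) += 1;
    }
    limits
}

/// Returns true if one more collation may be fetched for `para`.
/// A parachain without claims has a limit of zero.
fn can_fetch(
    limits: &HashMap<ParaId, usize>,
    fetched: &HashMap<ParaId, usize>,
    para: ParaId,
) -> bool {
    let limit = limits.get(&para).copied().unwrap_or(0);
    let already_fetched = fetched.get(&para).copied().unwrap_or(0);
    already_fetched < limit
}
```

With a claim queue of `[1, 2, 1]`, para 1 may have at most two collations fetched and para 2 at most one. An advertisement from a parachain that is not in the claim queue is never fetched.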

The second part is in the collation fetching logic. The validator keeps track of which collations it has fetched so far. When a new collation needs to be fetched, instead of popping the first entry from the waiting_queue, the validator examines the claim queue and looks for the earliest claim which doesn't have a corresponding fetch yet. This way the validator always tries to prioritise the most urgent entries.
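The selection rule above could look roughly like this. It is a simplified sketch, not the code in this PR: `Advertisement`, `pick_next` and the per-para waiting queues are assumptions made for the example.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for the real parachain id type.
type ParaId = u32;

/// Hypothetical, simplified advertisement.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Advertisement {
    para: ParaId,
    candidate: u32,
}

/// Walks the claim queue from the front and returns an advertisement for the
/// earliest claim that has no corresponding fetch yet.
fn pick_next(
    claim_queue: &[ParaId],
    fetched: &HashMap<ParaId, usize>,
    waiting: &mut HashMap<ParaId, VecDeque<Advertisement>>,
) -> Option<Advertisement> {
    // Every earlier fetch satisfies one claim of the same parachain.
    let mut credit = fetched.clone();
    for para in claim_queue {
        let remaining = credit.entry(*para).or_insert(0);
        if *remaining > 0 {
            // This claim is already covered by an earlier fetch.
            *remaining -= 1;
            continue;
        }
        // Unsatisfied claim: fetch for it if something is waiting,
        // otherwise move on to the next claim.
        if let Some(ad) = waiting.get_mut(para).and_then(|queue| queue.pop_front()) {
            return Some(ad);
        }
    }
    None
}
```

For a claim queue `[1, 2, 1]` with one collation already fetched for para 1, the next pick is para 2's advertisement, regardless of the order in which the advertisements arrived.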

eskimor

Hmm. Had a quick look, but not yet following. We still seem to track per relay parent (and per peer): How can we guarantee fairness in such a scheme, given that collators are free in picking relay parents?

tdimitrov · 2024-10-29T10:51:33Z

> Hmm. Had a quick look, but not yet following. We still seem to track per relay parent (and per peer): How can we guarantee fairness in such a scheme, given that collators are free in picking relay parents?

We count candidates at relay parent X and all previous relay parents within the view (here).

Why do you say we track per peer? We have a check that a peer doesn't provide more entries than the elements in the claim queue as a quick spam protection check in insert_advertisement (here) but it only serves as an initial spam protection so that PeerData doesn't grow indefinitely. We do check the claim queue after that.

alindima

Thanks for the effort you invested in this so far 👍🏻

Not introduced here, but this subsystem has not aged very well and the code is quite complicated and convoluted.

Generally I would love to see a refactor of the collator protocol. Maybe this could be done as part of the issue for removing the async backing parameters (which will probably also require modifications to the collator protocol).

alindima · 2024-10-30T07:58:55Z

polkadot/node/network/collator-protocol/src/validator_side/mod.rs

@@ -398,7 +369,7 @@ struct State {
 	/// support prospective parachains. This mapping works as a replacement for
 	/// [`polkadot_node_network_protocol::View`] and can be dropped once the transition
 	/// to asynchronous backing is done.
-	active_leaves: HashMap<Hash, ProspectiveParachainsMode>,
+	active_leaves: HashMap<Hash, AsyncBackingParams>,


It looks like we can remove active_leaves altogether now that we no longer need to support pre-async-backing code (as the comment on it also says).

When we want to check if a relay parent is in the implicit view, we can check against all_allowed_relay_parents instead of known_allowed_relay_parents_under.

Please also update the comments. There are several comments about pre-async-backing stuff.

alindima · 2024-10-30T08:26:28Z

polkadot/node/network/collator-protocol/src/validator_side/mod.rs

+					// Current assignments is equal to the length of the claim queue. No honest
+					// collator should send that much advertisements.
+					if candidates.len() > per_relay_parent.assignment.current.len() {
+						return Err(InsertAdvertisementError::PeerLimitReached)


we should do this check on the else branch as well

alindima · 2024-10-30T08:40:46Z

polkadot/node/network/collator-protocol/src/validator_side/tests/mod.rs

-		)
-		.await;
+		let head = Hash::from_low_u64_be(128);
+		let head_num: u32 = 0;


Slightly unrelated: AFAICT the ReportCollator message is never sent by any subsystem. We should remove it along with this test.

tdimitrov · 2024-11-08

I agree, but let's not handle this here. I've opened issue #6415 for it.

alindima · 2024-10-30T08:43:24Z

polkadot/node/network/collator-protocol/src/validator_side/tests/prospective_parachains.rs

@@ -1314,3 +1385,596 @@ fn child_blocked_from_seconding_by_parent(#[case] valid_parent: bool) {
 		virtual_overseer
 	});
 }
+
+#[test]


The advertisement_spam_protection test should also check the actual advertisement limit, not only duplicates.

tdimitrov · 2024-11-08

Can you elaborate?

If advertisement_spam_protection is extended to check that the claim queue limit is respected, it will overlap with collations_outside_limits_are_not_fetched. IMO they test different cases and it's nice to keep them separate. The test names could be better though.

alindima · 2024-10-30T09:02:50Z

polkadot/node/network/collator-protocol/src/validator_side/mod.rs

 		)
 		.map_err(AdvertisementError::Invalid)?;

-	if per_relay_parent.collations.is_seconded_limit_reached(relay_parent_mode) {
+	let claims_for_para = per_relay_parent.collations.claims_for_para(&para_id);
+	let seconded_and_pending_at_ancestors = seconded_and_pending_for_para_in_view(


We are checking here for our ancestors only. But we could have already seconded collations at the latest leaf.
Suppose we have relay parents A, B and C (C being the latest and active).
We second 2 collations on C, then want to second one on B: we'll only take into account the collations on A (of which there are none).

tdimitrov · 2024-10-30

I built the whole PR under the wrong assumption that relay parents can't 'go backwards'. After talking with you offline and looking through the code again I can see this is not the case.

I'll think about how to fix this.

tdimitrov · 2024-11-08

This is now reworked to consider old relay parents and all the paths up to an outer leaf.
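A rough sketch of the idea behind the rework, under strong simplifying assumptions: each path is an ordered list of relay parents from the oldest ancestor to an outer leaf, and the per-relay-parent count of seconded and pending collations for one parachain is already known. `RelayParent`, `occupied_claims` and the data layout are hypothetical and may differ from what the PR implements.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a relay parent hash.
type RelayParent = &'static str;

/// For every path that contains `target`, sums the seconded and pending
/// collations over the whole path (relay parents below and above the target)
/// and returns the largest sum.
fn occupied_claims(
    paths: &[Vec<RelayParent>],
    counts: &HashMap<RelayParent, usize>,
    target: RelayParent,
) -> usize {
    paths
        .iter()
        .filter(|path| path.contains(&target))
        .map(|path| {
            path.iter()
                .map(|relay_parent| counts.get(relay_parent).copied().unwrap_or(0))
                .sum::<usize>()
        })
        .max()
        .unwrap_or(0)
}
```

In the A, B, C example from the review comment, two collations seconded at C now count against B as well, so a third one advertised at B is compared against the claim queue with those two already included.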

eskimor · 2024-10-30T20:54:38Z

> > Hmm. Had a quick look, but not yet following. We still seem to track per relay parent (and per peer): How can we guarantee fairness in such a scheme, given that collators are free in picking relay parents?

> We count candidates at relay parent X and all previous relay parents within the view (here).

Got it, thanks! Was a bit hidden ;-)

Collation fetching fairness

f4738dc

tdimitrov added the T8-polkadot label Jun 26, 2024

Comments

c7074da

tdimitrov added 4 commits June 26, 2024 16:39

Fix tests and add some logs

73eee87

Fix per para limit calculation in is_collations_limit_reached

fa321ce

Fix default TestState initialization: claim queue len should be equal to `allowed_ancestry_len`

96392a5

clippy

0f28aa8

tdimitrov force-pushed the tsv-collator-proto-fairness branch from c7f24aa to 0f28aa8 Compare June 28, 2024 08:19

Update is_collations_limit_reached - remove seconded limit

e5ea548

tdimitrov added 2 commits July 1, 2024 13:59

Fix pending fetches and more tests

9abc898

Remove unnecessary clone

c07890b

tdimitrov added 15 commits July 1, 2024 15:20

Comments

e50440e

Better var names

42b05c7

Fix pick_a_collation_to_fetch and add more tests

2f5a466

Fix test: collation_fetching_respects_claim_queue

ff96ef9

Add collation_fetching_fallback_works test + comments

e837689

More tests

91cdd13

Fix collation limit fallback

9f2d59b

Separate claim_queue_support from ProspectiveParachainsMode

a10c86d

Fix comments and add logs

b39858a

Update test: collation_fetching_prefer_entries_earlier_in_claim_queue

b30f340

Fix pick_a_collation_to_fetch and more tests

c0f18b9

Merge branch 'master' into tsv-collator-proto-fairness

703ed6d

Fix pick_a_collation_to_fetch - iter 1

fba7ca6

Fix pick_a_collation_to_fetch - iter 2

d4f4ce2

Remove a redundant runtime version check

5f52712

tdimitrov requested a review from a team as a code owner October 18, 2024 08:10

tdimitrov added 12 commits October 18, 2024 14:59

Merge branch 'master' into tsv-collator-proto-fairness

15e3a74

Relax expected block counts for each para

d6b35ca

Bump lookahead and decrease timeout

586b56b

Fix ZN pipeline - try 1

a04d480

Fix ZN pipeline - try 2

13d5d15

Fix ZN pipeline - try 3

86870d0

Fix ZN pipeline - try 4

7b822af

Merge branch 'master' into tsv-collator-proto-fairness

558c82e

Merge branch 'master' into tsv-collator-proto-fairness

06c0fd0

Rename ZN test

ade7f9b

Merge branch 'master' into tsv-collator-proto-fairness

8ba2a80

Handle merge conflicts

ab70567

eskimor reviewed Oct 29, 2024

alindima reviewed Oct 30, 2024

tdimitrov added 11 commits November 7, 2024 11:41

When counting occupied slots from the claim queue consider relay parents below and above the target relay parent

d24fdc1

Add a test

f55390e

Small style fixes in tests

a2093ee

Fix a todo

ded6fb5

Fix paths_to_relay_parent

cda9330

Additional test for paths_to_relay_parent

505eb24

Simplifications

94f573a

Merge branch 'master' into tsv-collator-proto-fairness

fa82404

Resolve merge conflicts

55e7fb2

Fix todos

e27ddd4

Comment

a10c0c1

tdimitrov mentioned this pull request Nov 8, 2024

Remove `ReportCollator` message #6415

Open

Remove unneeded log line

ee11c6a
