libn/networkdb: don't exceed broadcast size limit #49939

corhere · 2025-05-07T16:43:39Z

- What I did
NetworkDB uses a hierarchy of queues to prioritize messages for broadcast. Unfortunately, the logic to pull from multiple queues is flawed: the length of the messages pulled from the first queue is not taken into account when pulling messages from the second queue. A list of messages up to twice as long as the limit could be returned! Messages beyond the limit will be truncated unceremoniously by memberlist.

Memberlist broadcast queues assume that all messages returned from a GetBroadcasts call will be broadcast to other nodes in the cluster. Messages are popped from the queue once they have hit their retransmit limit. On a busy system, messages may therefore be broadcast fewer times than intended, possibly even being dropped without ever being broadcast!

I fixed this oversight so that the delegate never returns more data than the specified length limit.

- How I did it
Subtract the length of messages pulled from the first queue from the broadcast size limit so the limit is not exceeded when pulling from the second queue.
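
The approach described above can be illustrated with a minimal, self-contained sketch. This is not the actual NetworkDB or memberlist code: the `queue` type, its `getBroadcasts` method, and the helper names are hypothetical stand-ins, and the per-message `overhead` accounting is an assumption made for the example.

```go
package main

import "fmt"

// queue is a hypothetical stand-in for a broadcast queue.
type queue struct {
	msgs [][]byte
}

// getBroadcasts returns queued messages whose combined size, counting
// overhead bytes per message, does not exceed limit.
func (q *queue) getBroadcasts(overhead, limit int) [][]byte {
	var out [][]byte
	used := 0
	for _, m := range q.msgs {
		need := overhead + len(m)
		if used+need > limit {
			continue // does not fit; try the next (possibly smaller) message
		}
		used += need
		out = append(out, m)
	}
	return out
}

// getBroadcasts pulls from the first queue, then gives the second queue
// only the budget that remains, so the combined result stays within limit.
func getBroadcasts(overhead, limit int, first, second *queue) [][]byte {
	msgs := first.getBroadcasts(overhead, limit)
	for _, m := range msgs {
		limit -= overhead + len(m)
	}
	if limit > 0 {
		msgs = append(msgs, second.getBroadcasts(overhead, limit)...)
	}
	return msgs
}

// totalSize reports the size of msgs including per-message overhead.
func totalSize(overhead int, msgs [][]byte) int {
	n := 0
	for _, m := range msgs {
		n += overhead + len(m)
	}
	return n
}

func main() {
	first := &queue{msgs: [][]byte{make([]byte, 40), make([]byte, 40), make([]byte, 40)}}
	second := &queue{msgs: [][]byte{make([]byte, 40), make([]byte, 10)}}

	msgs := getBroadcasts(2, 100, first, second)
	fmt.Println(len(msgs), totalSize(2, msgs))
}
```

Without the subtraction step, the second queue would be offered the full limit again, which is how a result up to twice the limit could be produced.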

- How to verify it
By inspection 🤷

- Human readable description for the release notes

Improvements to the reliability and convergence speed of NetworkDB

- A picture of a cute animal (not mandatory but encouraged)

NetworkDB uses a hierarchy of queues to prioritize messages for broadcast. Unfortunately the logic to pull from multiple queues is flawed. The length of the messages pulled from the first queue is not taken into account when pulling messages from the second queue. A list of messages up to twice as long as the limit could be returned! Messages beyond the limit will be truncated unceremoniously by memberlist.

Memberlist broadcast queues assume that all messages returned from a GetBroadcasts call will be broadcast to other nodes in the cluster. Messages are popped from the queue once they have hit their retransmit limit. On a busy system messages may be broadcast fewer times than intended, possibly even being dropped without ever being broadcast!

Subtract the length of messages pulled from the first queue from the broadcast size limit so the limit is not exceeded when pulling from the second queue.

Signed-off-by: Cory Snider <[email protected]>

thaJeztah

LGTM

corhere added area/networking, impact/changelog, process/cherry-pick, kind/bugfix labels May 7, 2025

corhere force-pushed the libn/networkdb-broadcast-overflow branch from 3185bb6 to dacf445 Compare May 7, 2025 16:44

robmry mentioned this pull request May 7, 2025

Skip tests that recently started failing in rootless/firewalld #49940

Merged

robmry approved these changes May 8, 2025

thaJeztah approved these changes May 8, 2025

thaJeztah added this to the 28.2.0 milestone May 9, 2025

thaJeztah merged commit 97be633 into moby:master May 9, 2025
141 checks passed

corhere deleted the libn/networkdb-broadcast-overflow branch May 9, 2025 17:41

dperny mentioned this pull request May 15, 2025

[25.0] Backport network fixes #50005

Merged

robmry mentioned this pull request Jun 11, 2025

[28.x]: Revert overlay bug fixes / refactoring #50169

Merged

corhere added process/cherry-picked and removed process/cherry-pick labels Jun 19, 2025
