libnetwork/networkdb: fix logical race conditions #49932


Merged
merged 4 commits into moby:master from libn/networkdb-fixes
May 15, 2025

Conversation

corhere
Contributor

@corhere corhere commented May 6, 2025

- What I did

- How I did it
I fixed bugs in NetworkDB and the libnetwork cluster agent which manifest under imperfect conditions, when table events are received unreliably or out of order. I also changed the semantics of (*NetworkDB).Watch() so it can be used without racing table mutations from remote nodes.

See individual commit messages for more detail.

- How to verify it
New unit test

- Human readable description for the release notes

Improve the reliability of NetworkDB in busy clusters and lossy networks

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere force-pushed the libn/networkdb-fixes branch 2 times, most recently from d91bb5c to 040bf09 Compare May 7, 2025 14:56
@corhere corhere changed the title Libn/networkdb fixes libnetwork/networkdb: grab-bag of bug fixes May 7, 2025
@corhere corhere changed the title libnetwork/networkdb: grab-bag of bug fixes libnetwork/networkdb: fix logical race conditions May 7, 2025
@corhere corhere requested review from thaJeztah and removed request for thaJeztah May 7, 2025 16:54
@corhere corhere force-pushed the libn/networkdb-fixes branch from dc77536 to 747a446 Compare May 7, 2025 17:26
@corhere corhere marked this pull request as ready for review May 7, 2025 17:26
@corhere corhere requested review from cpuguy83, thaJeztah and robmry May 7, 2025 17:27
@corhere corhere force-pushed the libn/networkdb-fixes branch from 747a446 to ab20637 Compare May 7, 2025 19:33
-case networkdb.CreateEvent:
-	log.G(context.TODO()).Debugf("handleEpTableEvent ADD %s R:%v", eid, epRec)
+case networkdb.CreateEvent, networkdb.UpdateEvent:
+	log.G(context.TODO()).Debugf("handleEpTableEvent ADD/UPD %s R:%v", eid, epRec)
Contributor

When things are weird, it might still help to know whether an event was a create or an update?

Could also log in one place, before the switch?

log.G(context.TODO()).WithFields(log.Fields{
	"eid": eid,
	"T": fmt.Sprintf("%T", ev), // maybe there's a better way?
	"R": epRec,
}).Debug("handleEpTableEvent")

Contributor Author

Good idea! Done.

Value: []byte("a"),
})
}
}
Contributor

Maybe worth creating a deleted entry to check it doesn't show up in the replay?

Contributor Author

Done.

corhere added 3 commits May 13, 2025 14:09
NetworkDB gossips changes to table entries to other nodes using distinct
CREATE, UPDATE and DELETE events. It is unfortunate that the wire
protocol distinguishes CREATEs from UPDATEs as nothing useful can be
done with this information. Newer events for an entry invalidate older
ones, so there is no guarantee that a CREATE event is broadcast to any
node before an UPDATE is broadcast. And due to the nature of gossip
protocols, even if the CREATE event is broadcast from the originating
node, there is no guarantee that any particular node will receive the
CREATE before an UPDATE. Any code which handles an UPDATE event
differently from a CREATE event is therefore going to behave in
unexpected ways in less than perfect conditions.

NetworkDB table watchers also receive CREATE, UPDATE and DELETE events.
Since the watched tables are local to the node, the events could all
have well-defined meanings that are actually useful. Unfortunately
NetworkDB is just bubbling up the wire-protocol event types to the
watchers. Redefine the table-watch events such that a CREATE event is
broadcast when an entry pops into existence in the local NetworkDB, an
UPDATE event is broadcast when an entry which was already present in the
NetworkDB state is modified, and a DELETE event is broadcast when an
entry which was already present in the NetworkDB state is marked for
deletion. DELETE events are broadcast with the same value as the most
recent CREATE or UPDATE event for the entry.

The handler for endpoint table events in the libnetwork agent assumed
incorrectly that CREATE events always correspond to adding a new active
endpoint and that UPDATE events always correspond to disabling an
endpoint. Fix up the handler to handle CREATE and UPDATE events using
the same code path, checking the table entry's ServiceDisabled flag to
determine which action to take.

Signed-off-by: Cory Snider <[email protected]>
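
To make the unified handling concrete, here is a minimal, self-contained sketch in Go. The event constants, the epRecord struct and the printed actions are simplified stand-ins for the real libnetwork types; only the idea of branching on the record's ServiceDisabled flag inside a combined CREATE/UPDATE case comes from the change described above.

```go
package main

import "fmt"

// EventType stands in for networkdb's table event kinds (illustrative only).
type EventType int

const (
	CreateEvent EventType = iota
	UpdateEvent
	DeleteEvent
)

// epRecord is a simplified endpoint table record; ServiceDisabled mirrors
// the flag the handler is described as checking.
type epRecord struct {
	ServiceDisabled bool
}

// handleEpTableEvent treats CREATE and UPDATE identically: the action is
// derived from the record's current state, not from the event kind.
func handleEpTableEvent(ev EventType, eid string, r epRecord) {
	switch ev {
	case CreateEvent, UpdateEvent:
		if r.ServiceDisabled {
			fmt.Printf("remove endpoint %s from load balancing\n", eid)
		} else {
			fmt.Printf("add endpoint %s as an active backend\n", eid)
		}
	case DeleteEvent:
		fmt.Printf("remove endpoint %s entirely\n", eid)
	}
}

func main() {
	// An UPDATE that arrives before its CREATE is handled just as well.
	handleEpTableEvent(UpdateEvent, "ep1", epRecord{ServiceDisabled: false})
}
```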
The gossip protocol which powers NetworkDB does not guarantee in-order
reception of events. This poses a problem with deleting entries: without
some mechanism to discard stale CREATE or UPDATE events received after a
DELETE, out-of-order reception of events could result in a deleted entry
being spuriously resurrected in the local NetworkDB state! NetworkDB
handles this situation by storing "tombstone" entries for a period of
time with the Lamport timestamps of the entries' respective DELETE
events. Out-of-order CREATE or UPDATE events will be ignored by virtue
of having older timestamps than the tombstone entry, just as they are for
entries that have not yet been deleted.

NetworkDB was only storing a tombstone if the entry was already present
in the local database at the time of the DELETE event. If the first
event received for an entry is a DELETE, no tombstone is stored. If a
stale CREATE/UPDATE event for the entry (with an older timestamp than
the DELETE) is subsequently received, NetworkDB erroneously creates a
live entry in the local state with stale data. Modify NetworkDB to store
tombstones for DELETE events irrespective of whether the entry was known
to NetworkDB beforehand so that it correctly discards out-of-order
CREATEs and UPDATEs in all cases.

Signed-off-by: Cory Snider <[email protected]>
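
A rough sketch of the tombstone rule, using a plain map and Lamport timestamps in place of NetworkDB's real indices; the type and method names are made up for illustration.

```go
package main

import "fmt"

// entry is a simplified table entry carrying a Lamport timestamp.
type entry struct {
	ltime    uint64
	deleting bool // true means the entry is a tombstone
	value    []byte
}

type table map[string]*entry

// applyDelete records a tombstone even if the entry was never seen before,
// so later out-of-order CREATE/UPDATE events with older timestamps are
// discarded instead of resurrecting the entry.
func (t table) applyDelete(key string, ltime uint64) {
	if e, ok := t[key]; ok && e.ltime >= ltime {
		return // stale event
	}
	t[key] = &entry{ltime: ltime, deleting: true}
}

// applyWrite handles CREATE/UPDATE events, ignoring anything older than
// what is already stored, including tombstones.
func (t table) applyWrite(key string, ltime uint64, value []byte) {
	if e, ok := t[key]; ok && e.ltime >= ltime {
		return // stale event, possibly arriving after its DELETE
	}
	t[key] = &entry{ltime: ltime, value: value}
}

func main() {
	tbl := table{}
	tbl.applyDelete("ep1", 10)            // DELETE is the first event seen
	tbl.applyWrite("ep1", 7, []byte("a")) // stale CREATE arrives afterwards...
	fmt.Println(tbl["ep1"].deleting)      // ...and is ignored: prints true
}
```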
NetworkDB's Watch() facility is problematic to use in practice. The
stream of events begins when the watch is started, so the watch cannot
be used to process table entries that existed beforehand. Either option
to process existing table entries is racy: walking the table before
starting the watch leaves a race window where events could be missed,
and walking the table after starting the watch leaves a race window
where created/updated entries could be processed twice.

Modify Watch() to initialize the channel with synthetic CREATE events
for all existing entries owned by remote nodes before hooking it up to
the live event stream. This way watchers observe an equivalent sequence
of events irrespective of whether the watch was started before or after
entries from remote nodes are added to the database. Remove the bespoke
and racy synthetic event replay logic for driver watches from the
libnetwork agent.

Signed-off-by: Cory Snider <[email protected]>
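
A condensed sketch of the "replay existing entries, then stream" pattern described above; the db struct, event type and channel sizing are invented for illustration and do not match NetworkDB's internals.

```go
package main

import (
	"fmt"
	"sync"
)

// event is an illustrative watch event.
type event struct {
	kind  string // "create", "update" or "delete"
	key   string
	value []byte
}

// db is a toy stand-in for a single watched table.
type db struct {
	mu       sync.Mutex
	entries  map[string][]byte
	watchers []chan event
}

// Watch fills the channel with synthetic CREATE events for all existing
// entries before registering it for live events, all under one lock, so a
// watcher neither misses an entry nor sees one twice.
func (d *db) Watch() <-chan event {
	d.mu.Lock()
	defer d.mu.Unlock()
	ch := make(chan event, len(d.entries)+128)
	for k, v := range d.entries {
		ch <- event{kind: "create", key: k, value: v}
	}
	d.watchers = append(d.watchers, ch)
	return ch
}

// Write updates an entry and notifies watchers under the same lock.
func (d *db) Write(key string, value []byte) {
	d.mu.Lock()
	defer d.mu.Unlock()
	kind := "update"
	if _, ok := d.entries[key]; !ok {
		kind = "create"
	}
	d.entries[key] = value
	for _, ch := range d.watchers {
		ch <- event{kind: kind, key: key, value: value}
	}
}

func main() {
	d := &db{entries: map[string][]byte{"ep1": []byte("a")}}
	ch := d.Watch()                     // replays ep1 as a synthetic CREATE
	d.Write("ep2", []byte("b"))         // live event
	fmt.Println((<-ch).key, (<-ch).key) // ep1 ep2
}
```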
@corhere corhere force-pushed the libn/networkdb-fixes branch from ab20637 to a3aea15 Compare May 13, 2025 18:11
@thaJeztah thaJeztah added this to the 28.2.0 milestone May 15, 2025
When a node leaves a network or the cluster, or memberlist considers the
node as failed, NetworkDB atomically deletes all table entries (for the
left network) owned by the node. This maintains the invariant that table
entries owned by a node are present in the local database indices iff
that node is an active cluster member which is participating in the
network the entries pertain to.

(*NetworkDB).handleTableEvent() is written in a way which attempts to
minimize the amount of time it is in a critical section with the mutex
locked for writing. It first checks under a read-lock whether both the
local node and the node where the event originated are participating in
the network which the event pertains to. If the check passes, the mutex
is unlocked for reading and locked for writing so the local database
state is mutated in a critical section. That leaves a window of time
between the participation check and the write-lock being acquired in which
a network or node event can arrive and be processed. If a table event for a
node+network races a node or network event which triggers the purge of
all table entries for the same node+network, the invariant could be
violated. The table entry described by the table event may be reinserted
into the local database state after being purged by the node's leaving,
resulting in an orphaned table entry which the local node will bulk-sync
to other nodes indefinitely.

It's not completely wrong to perform a pre-flight check outside of the
critical section. It allows for an early return in the no-op case
without having to bear the cost of synchronization. But such optimistic
concurrency control is only sound if the condition is double-checked
inside the critical section. It is tricky to get right, and this
instance of optimistic concurrency control smells like a case of
premature optimization. Move the pre-flight check into the critical
section to ensure that the invariant is maintained.

Signed-off-by: Cory Snider <[email protected]>
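
To make the race and its fix concrete, a toy sketch in which participation is checked only after the write lock is held, so a purge running under the same lock cannot interleave between the check and the insert; the struct layout is simplified and does not mirror NetworkDB's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

type networkDB struct {
	mu sync.RWMutex
	// participating[node][network] is true while node is in that network.
	participating map[string]map[string]bool
	entries       map[string]string
}

// handleTableEvent re-checks participation inside the write lock, so a
// concurrent node/network leave (which purges entries under the same lock)
// cannot slip in between the check and the insert.
func (db *networkDB) handleTableEvent(node, network, key, value string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	if !db.participating[node][network] {
		return // node already left: do not resurrect its entries
	}
	db.entries[network+"/"+key] = value
}

// purgeNode removes participation and the node's entries atomically.
func (db *networkDB) purgeNode(node, network string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	delete(db.participating[node], network)
	for k := range db.entries {
		// The real code purges only entries owned by the node in that
		// network; everything is dropped here for brevity.
		delete(db.entries, k)
	}
}

func main() {
	db := &networkDB{
		participating: map[string]map[string]bool{"n1": {"net1": true}},
		entries:       map[string]string{},
	}
	db.purgeNode("n1", "net1")
	db.handleTableEvent("n1", "net1", "ep1", "a") // dropped: n1 left
	fmt.Println(len(db.entries))                  // 0
}
```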
Member

@thaJeztah thaJeztah left a comment

LGTM, thanks!

@thaJeztah thaJeztah merged commit 63bcfab into moby:master May 15, 2025
141 checks passed
@corhere corhere deleted the libn/networkdb-fixes branch May 15, 2025 21:38
Development

Successfully merging this pull request may close these issues.

Wrong ipvs address weight during a deployment process
3 participants