Active endpoints error error #49901

robmry · 2025-04-29T18:29:33Z

- What I did

fix Network removal fails with "network has active endpoints" error during docker-compose down #49887
- introduced by libnet: Controller: cache networks & endpoints in-memory #49736
- related to c8a66f5
fix "network XXX has active endpoints" error should account for empty endpoint names #49888
- introduced by libnet: add ep name in 'has active endpoints' error #49773

When a Sandbox is restored on startup (either to be used, because live-restore is enabled, or to clean up a container that was left running on shutdown) ... if the Network or Endpoint objects can't be loaded from store, stub objects are created.

If the Sandbox is being live-restored, the stub Endpoint is then just discarded. Otherwise, the Endpoint is added to the Sandbox then sb.delete is called to do the cleanup.

sb.delete calls Endpoint.Delete on each Endpoint - and Endpoint.Delete does nothing if it can't load both Network and Endpoint from the disk store.

So, either way, the stub Network/Endpoint objects end up having no effect.

#49887 added in-memory caches of Network and Endpoint objects. The caches are populated whenever those objects are read from or written to the on-disk store - but most readers don't use the cached objects (yet), they just load a new copy of the object from the disk store and use that.

That change added the stub objects to the in-memory cache - but they were never deleted from cache... if live-restore, they'd just be orphaned (the Endpoint wasn't added to the Sandbox). It not-live-restore, Endpoint.Delete returned early, before removing the cached Endpoint object so, again, the stub objects were orphaned.

That caused a problem for the one of the only uses (so far) of the in-memory caches, replacing the stored Endpoint Count for a network with a search of the cache. The orphaned stub Endpoints referred to the network, so network deletion would fail with ActiveEndpointsError. I've not been able to reproduce the problem reported in #49887, so don't have logs to prove it and none have been provided on the issue yet... but the error message has an empty Endpoint name, so it looks very much like the endpoints are stub objects.

If all that theorising is correct ...

it's a regression, it's no longer possible to delete a network after restore on startup has failed to read load an Endpoint from the store - and that's fixable.
but, the root cause, a stored Sandbox having a reference to an Endpoint that isn't stored is a separate issue.
- I guess it'll have been caused by the issue the in-memory caches should eventually resolve... multiple copies of the same object in memory, racing/trampling each other to the store. Although a race like that is detected, it's handled by just forcing a write to store anyway.

- How I did it

Report endpoint id as well as name in ActiveEndpointsError

When a network genuinely has active endpoints, the new error looks like this ...

# docker network rm nnn
Error response from daemon: error while removing network: network nnn has active endpoints (name:"romantic_cerf" id:"6e6ce97cc986", name:"charming_rubin" id:"2f704afa034c")

Improve logging and readability of Controller.sandboxRestore

Use structured logging.
- Which means ids are logged consistently.
Use variable 'isRestore' instead of extra map lookups.

Don't add stub Endpoint/Network object to cache on Sandbox restore

The stub objects were introduced by c8a66f5 - I don't think that logic applies any more, and ...

a stub Network that doesn't know which network driver it should have isn't going to be able to get involved in much cleanup, I think it was just a way to load an Endpoint object.
a stub Endpoint can't clean up anything like addresses/dns-entries etc, because it doesn't know what they are. But, that's ok because those things couldn't have been restored to internal structures anyway.
worst-case, an Endpoint object that couldn't be loaded because of a missing Network gets leaked in the disk-store (but, quite likely, it's not there anyway!).

So, removed them.

- How to verify it

Tricky - no repro, and I can't add a unit test for deleted code.

- Human readable description for the release notes

Fix an issue that could cause network deletion to fail after a daemon restart, with error "has active endpoints" listing empty endpoint names.

thaJeztah · 2025-04-30T11:23:12Z

libnetwork/network.go

+			name: n.name,
+			id:   n.id,
+			endpoints: sliceutil.Map(eps, func(ep *Endpoint) string {
+				return fmt.Sprintf("name:%q id:%q", ep.name, stringid.TruncateID(ep.id))


I know %q sometimes leads to additional escaping in output (foo:\"\") which can reduce readability; for id, I don't THINK we'd ever have an empty value(?) in that case at least for id we probably don't need %q.

For name (which from reports ended up being empty), I wonder if we could make the output "optional", e.g.

deadbeef (only ID was set)

deadbeef (happy_torvalds) (name was set)

Although .. I guess we list all of the endpoints within () in the error, so that would end up as;

has active endpoints (deadbeef (happy_torvalds), cafebeef (boring_wozniak))`

(not sure if the extra braces would make it hard to read, but perhaps that's .. "OK"?)

Ok, I'll change the %q to %s.

But an Endpoint should always have a name - particularly now I've deleted the stub instantiations made when a load from store failed. So, I don't think it's worth changing the message to deal with empty strings.

I put an example of the error in the description ...

network nnn has active endpoints (name:"romantic_cerf" id:"6e6ce97cc986", name:"charming_rubin" id:"2f704afa034c")

If there's something wrong with it - I think it's that it lists endpoint names and ids, and those aren't things users should need to know or care about. Listing the container name/id (if they can be tracked down) would be more meaningful. That's not this change - but, because the fields aren't anonymous, there's scope for adding container name/id if known.

🙈 Missed the example, as exactly that part was out of horizontal scroll.

Ah, right, yeah, so I was mostly looking at the double-quotes in general; the error returned looks good; maybe I was recalling situations where the error is logged (in which case it may end up nested in another string, resulting in the escaped quotes); see opencontainers/runc#2372 for the fun ones. So mostly exploring if there would be alternatives that didn't require quotes.

If we keep the quotes than %q is fine as well.

Signed-off-by: Rob Murray <[email protected]>

- Use structured logging. - Which means ids are logged consistently. - Use variable 'isRestore' instead of extra map lookups. Signed-off-by: Rob Murray <[email protected]>

thaJeztah · 2025-04-30T12:07:13Z

libnetwork/sandbox_store.go

-				}
+				log.G(ctx).WithError(err).Error("Failed to restore endpoint, getNetworkFromStore failed")


Would these benefit from a comment to say "really, this is one case where log-and-ignore is the most correct approach"?

I know we have a bunch of these in libnetwork where it may not be correct, and which should likely be updated, so mostly from a perspective where we explicitly acknowledge that we ignore on purpose.

Perhaps logging as Warn could make sense as well if it is a "recoverable" (and/or "somewhat expected") situation, and will never be a fatal error .

I think the same applies to most of libnetwork's error-ignoring ... errors shouldn't be ignored, but handling them can have highly unpredictable consequences.

Not finding these objects in the store should mean something really bad happened, like the data store on disk got corrupted. In which case, the right thing would be to abort and get the underlying issue fixed. (Particularly for the live-restore case... better not to restore anything than get to a broken state that'll be hard to debug.)

But, as the issue linked from this PR suggests, it does happen - probably because of the known races in the way libnetwork uses its data store. That's the issue Albin's new in-memory caches should eventually help with, and this issue/PR demonstrates how difficult it is to untangle without breaking things.

So, there's no way for this code to tell whether something bad is going to happen, or has already happened. I can add a comment to that effect. Not sure about Error vs Warning ... perhaps not knowing for-sure that we broke things means it can be a Warning, but it's not ideal.

On Sandbox restore if an Endpoint (or Network) can't be loaded from the store and the container is to be preserved (live-restore), stub Network and Endpoint objects are created and added to the cache. But then the Endpoint is just dropped - leaving the stub objects in cache. If not-live-restore, the Sandbox is reconstructed do that it can be deleted by sb.delete(force=true). But, the only thing sb.delete does with the Endpoint in this case is call ep.Delete. And, ep.Delete doesn't do anything if the Network and Endpoint can't both be loaded from the store. So, again, the stub objects are left in the in-memory cache and nothing that might have needed tidying up got tidied up. So, whether live-restoring or not, just log an error and ignore an Endpoint if it can't be loaded from disk. Signed-off-by: Rob Murray <[email protected]>

thaJeztah

LGTM

thaJeztah · 2025-05-01T09:04:24Z

libnetwork/sandbox_store.go

-		msg := " for cleanup"
 		create := true
 		isRestore := false
 		if val, ok := activeSandboxes[sb.ID()]; ok {
-			msg = ""


Ever so slightly considering if including a message in the log-message itself would still be useful (would it be easier to grep?) Not sure if the existing messages were accurate, but I could imagine "Failed to xxxx for cleanup" or "Failed to xxx during restore" (or whatever covers it best), would help debugging logs provided by users.

Not a blocker for sure!

Before, msg was only used by "failed to create osl sandbox". The two per-Endpoint log lines just said "for cleanup", whether it was or not ...

log.G(context.TODO()).Errorf("getNetworkFromStore for nid %s failed while trying to build sandbox for cleanup: %v", eps.Nid, err) log.G(context.TODO()).Errorf("getEndpointFromStore for eid %s failed while trying to build sandbox for cleanup: %v", eps.Eid, err)

Now, isRestore is always included as a logger field.

thaJeztah · 2025-05-01T09:08:59Z

libnetwork/sandbox_store.go

@@ -206,9 +205,17 @@ func (c *Controller) sandboxRestore(activeSandboxes map[string]interface{}) erro
 			sb.restoreResolvConfPath()
 			create = !sb.config.useDefaultSandBox
 		}
+
+		ctx := context.TODO()


Looks like punching-through the context is a realistic option (as a follow-up); I see this function is called from libnetwork.New, which already has a context and spans set up.

(Not sure if context-cancelation is expected to be handled in this function though, or if it would require a context.WithoutCancel().

robmry self-assigned this Apr 29, 2025

robmry force-pushed the active_endpoints_error_error branch from 61ab315 to c0257a0 Compare April 30, 2025 08:55

robmry added area/networking impact/changelog kind/bugfix PR's that fix bugs version/28.1 labels Apr 30, 2025

robmry added this to the 28.1.2 milestone Apr 30, 2025

robmry marked this pull request as ready for review April 30, 2025 10:05

robmry requested review from corhere, thaJeztah, vvoland and akerouanton April 30, 2025 10:05

thaJeztah reviewed Apr 30, 2025

View reviewed changes

robmry added 2 commits April 30, 2025 12:42

Report endpoint id as well as name in ActiveEndpointsError

5d8192f

Signed-off-by: Rob Murray <[email protected]>

Improve logging and readability of Controller.sandboxRestore

c129c0f

- Use structured logging. - Which means ids are logged consistently. - Use variable 'isRestore' instead of extra map lookups. Signed-off-by: Rob Murray <[email protected]>

robmry force-pushed the active_endpoints_error_error branch from c0257a0 to a33c4a7 Compare April 30, 2025 11:42

thaJeztah reviewed Apr 30, 2025

View reviewed changes

robmry force-pushed the active_endpoints_error_error branch from a33c4a7 to 33f5b9e Compare April 30, 2025 14:26

thaJeztah approved these changes May 1, 2025

View reviewed changes

thaJeztah mentioned this pull request May 7, 2025

"endpoint with name XX already exists in network" can't disconnect container from bridge docker/cli#1891

Closed

robmry requested a review from dmcgowan May 8, 2025 15:52

thaJeztah added this to Maintainer spotlight May 8, 2025

github-project-automation bot moved this to Up next in Maintainer spotlight May 8, 2025

dmcgowan approved these changes May 9, 2025

View reviewed changes

robmry modified the milestones: 28.1.2, 28.2.0 May 9, 2025

robmry moved this from Up next to Complete in Maintainer spotlight May 9, 2025

robmry merged commit f848191 into moby:master May 9, 2025
141 checks passed

robmry deleted the active_endpoints_error_error branch May 14, 2025 09:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Active endpoints error error #49901

Active endpoints error error #49901

Uh oh!

robmry commented Apr 29, 2025 •

edited by vvoland

Loading

Uh oh!

thaJeztah Apr 30, 2025

Uh oh!

robmry Apr 30, 2025

Uh oh!

thaJeztah Apr 30, 2025

Uh oh!

thaJeztah Apr 30, 2025

Uh oh!

robmry Apr 30, 2025

Uh oh!

thaJeztah left a comment

Uh oh!

thaJeztah May 1, 2025

Uh oh!

robmry May 1, 2025

Uh oh!

thaJeztah May 1, 2025

Uh oh!

Uh oh!

Uh oh!

		}
		log.G(ctx).WithError(err).Error("Failed to restore endpoint, getNetworkFromStore failed")

Active endpoints error error #49901

Active endpoints error error #49901

Uh oh!

Conversation

robmry commented Apr 29, 2025 • edited by vvoland Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Report endpoint id as well as name in ActiveEndpointsError

Improve logging and readability of Controller.sandboxRestore

Don't add stub Endpoint/Network object to cache on Sandbox restore

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thaJeztah left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

robmry commented Apr 29, 2025 •

edited by vvoland

Loading