Skip to content

Active endpoints error error #49901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 9, 2025
Merged

Conversation

robmry
Copy link
Contributor

@robmry robmry commented Apr 29, 2025

- What I did

When a Sandbox is restored on startup (either to be used, because live-restore is enabled, or to clean up a container that was left running on shutdown) ... if the Network or Endpoint objects can't be loaded from store, stub objects are created.

If the Sandbox is being live-restored, the stub Endpoint is then just discarded. Otherwise, the Endpoint is added to the Sandbox then sb.delete is called to do the cleanup.

sb.delete calls Endpoint.Delete on each Endpoint - and Endpoint.Delete does nothing if it can't load both Network and Endpoint from the disk store.

So, either way, the stub Network/Endpoint objects end up having no effect.

#49887 added in-memory caches of Network and Endpoint objects. The caches are populated whenever those objects are read from or written to the on-disk store - but most readers don't use the cached objects (yet), they just load a new copy of the object from the disk store and use that.

That change added the stub objects to the in-memory cache - but they were never deleted from cache... if live-restore, they'd just be orphaned (the Endpoint wasn't added to the Sandbox). It not-live-restore, Endpoint.Delete returned early, before removing the cached Endpoint object so, again, the stub objects were orphaned.

That caused a problem for the one of the only uses (so far) of the in-memory caches, replacing the stored Endpoint Count for a network with a search of the cache. The orphaned stub Endpoints referred to the network, so network deletion would fail with ActiveEndpointsError. I've not been able to reproduce the problem reported in #49887, so don't have logs to prove it and none have been provided on the issue yet... but the error message has an empty Endpoint name, so it looks very much like the endpoints are stub objects.

If all that theorising is correct ...

  • it's a regression, it's no longer possible to delete a network after restore on startup has failed to read load an Endpoint from the store - and that's fixable.
  • but, the root cause, a stored Sandbox having a reference to an Endpoint that isn't stored is a separate issue.
    • I guess it'll have been caused by the issue the in-memory caches should eventually resolve... multiple copies of the same object in memory, racing/trampling each other to the store. Although a race like that is detected, it's handled by just forcing a write to store anyway.

- How I did it

Report endpoint id as well as name in ActiveEndpointsError

When a network genuinely has active endpoints, the new error looks like this ...

# docker network rm nnn
Error response from daemon: error while removing network: network nnn has active endpoints (name:"romantic_cerf" id:"6e6ce97cc986", name:"charming_rubin" id:"2f704afa034c")

Improve logging and readability of Controller.sandboxRestore

  • Use structured logging.
    • Which means ids are logged consistently.
  • Use variable 'isRestore' instead of extra map lookups.

Don't add stub Endpoint/Network object to cache on Sandbox restore

The stub objects were introduced by c8a66f5 - I don't think that logic applies any more, and ...

  • a stub Network that doesn't know which network driver it should have isn't going to be able to get involved in much cleanup, I think it was just a way to load an Endpoint object.
  • a stub Endpoint can't clean up anything like addresses/dns-entries etc, because it doesn't know what they are. But, that's ok because those things couldn't have been restored to internal structures anyway.
  • worst-case, an Endpoint object that couldn't be loaded because of a missing Network gets leaked in the disk-store (but, quite likely, it's not there anyway!).

So, removed them.

- How to verify it

Tricky - no repro, and I can't add a unit test for deleted code.

- Human readable description for the release notes

Fix an issue that could cause network deletion to fail after a daemon restart, with error "has active endpoints" listing empty endpoint names.

@robmry robmry self-assigned this Apr 29, 2025
@robmry robmry force-pushed the active_endpoints_error_error branch from 61ab315 to c0257a0 Compare April 30, 2025 08:55
@robmry robmry added this to the 28.1.2 milestone Apr 30, 2025
@robmry robmry marked this pull request as ready for review April 30, 2025 10:05
name: n.name,
id: n.id,
endpoints: sliceutil.Map(eps, func(ep *Endpoint) string {
return fmt.Sprintf("name:%q id:%q", ep.name, stringid.TruncateID(ep.id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know %q sometimes leads to additional escaping in output (foo:\"\") which can reduce readability; for id, I don't THINK we'd ever have an empty value(?) in that case at least for id we probably don't need %q.

For name (which from reports ended up being empty), I wonder if we could make the output "optional", e.g.

  • deadbeef (only ID was set)
  • deadbeef (happy_torvalds) (name was set)

Although .. I guess we list all of the endpoints within () in the error, so that would end up as;

has active endpoints (deadbeef (happy_torvalds), cafebeef (boring_wozniak))`

(not sure if the extra braces would make it hard to read, but perhaps that's .. "OK"?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll change the %q to %s.

But an Endpoint should always have a name - particularly now I've deleted the stub instantiations made when a load from store failed. So, I don't think it's worth changing the message to deal with empty strings.

I put an example of the error in the description ...

network nnn has active endpoints (name:"romantic_cerf" id:"6e6ce97cc986", name:"charming_rubin" id:"2f704afa034c")

If there's something wrong with it - I think it's that it lists endpoint names and ids, and those aren't things users should need to know or care about. Listing the container name/id (if they can be tracked down) would be more meaningful. That's not this change - but, because the fields aren't anonymous, there's scope for adding container name/id if known.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈 Missed the example, as exactly that part was out of horizontal scroll.

Ah, right, yeah, so I was mostly looking at the double-quotes in general; the error returned looks good; maybe I was recalling situations where the error is logged (in which case it may end up nested in another string, resulting in the escaped quotes); see opencontainers/runc#2372 for the fun ones. So mostly exploring if there would be alternatives that didn't require quotes.

If we keep the quotes than %q is fine as well.

robmry added 2 commits April 30, 2025 12:42
- Use structured logging.
  - Which means ids are logged consistently.
- Use variable 'isRestore' instead of extra map lookups.

Signed-off-by: Rob Murray <[email protected]>
@robmry robmry force-pushed the active_endpoints_error_error branch from c0257a0 to a33c4a7 Compare April 30, 2025 11:42
Comment on lines 257 to 232
}
log.G(ctx).WithError(err).Error("Failed to restore endpoint, getNetworkFromStore failed")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would these benefit from a comment to say "really, this is one case where log-and-ignore is the most correct approach"?

I know we have a bunch of these in libnetwork where it may not be correct, and which should likely be updated, so mostly from a perspective where we explicitly acknowledge that we ignore on purpose.

Perhaps logging as Warn could make sense as well if it is a "recoverable" (and/or "somewhat expected") situation, and will never be a fatal error .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the same applies to most of libnetwork's error-ignoring ... errors shouldn't be ignored, but handling them can have highly unpredictable consequences.

Not finding these objects in the store should mean something really bad happened, like the data store on disk got corrupted. In which case, the right thing would be to abort and get the underlying issue fixed. (Particularly for the live-restore case... better not to restore anything than get to a broken state that'll be hard to debug.)

But, as the issue linked from this PR suggests, it does happen - probably because of the known races in the way libnetwork uses its data store. That's the issue Albin's new in-memory caches should eventually help with, and this issue/PR demonstrates how difficult it is to untangle without breaking things.

So, there's no way for this code to tell whether something bad is going to happen, or has already happened. I can add a comment to that effect. Not sure about Error vs Warning ... perhaps not knowing for-sure that we broke things means it can be a Warning, but it's not ideal.

On Sandbox restore if an Endpoint (or Network) can't be loaded from
the store and the container is to be preserved (live-restore), stub
Network and Endpoint objects are created and added to the cache. But
then the Endpoint is just dropped - leaving the stub objects in
cache.

If not-live-restore, the Sandbox is reconstructed do that it can be
deleted by sb.delete(force=true). But, the only thing sb.delete does
with the Endpoint in this case is call ep.Delete. And, ep.Delete
doesn't do anything if the Network and Endpoint can't both be loaded
from the store. So, again, the stub objects are left in the in-memory
cache and nothing that might have needed tidying up got tidied up.

So, whether live-restoring or not, just log an error and ignore an
Endpoint if it can't be loaded from disk.

Signed-off-by: Rob Murray <[email protected]>
@robmry robmry force-pushed the active_endpoints_error_error branch from a33c4a7 to 33f5b9e Compare April 30, 2025 14:26
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines -196 to -200
msg := " for cleanup"
create := true
isRestore := false
if val, ok := activeSandboxes[sb.ID()]; ok {
msg = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ever so slightly considering if including a message in the log-message itself would still be useful (would it be easier to grep?) Not sure if the existing messages were accurate, but I could imagine "Failed to xxxx for cleanup" or "Failed to xxx during restore" (or whatever covers it best), would help debugging logs provided by users.

Not a blocker for sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, msg was only used by "failed to create osl sandbox". The two per-Endpoint log lines just said "for cleanup", whether it was or not ...

log.G(context.TODO()).Errorf("getNetworkFromStore for nid %s failed while trying to build sandbox for cleanup: %v", eps.Nid, err)
log.G(context.TODO()).Errorf("getEndpointFromStore for eid %s failed while trying to build sandbox for cleanup: %v", eps.Eid, err)

Now, isRestore is always included as a logger field.

@@ -206,9 +205,17 @@ func (c *Controller) sandboxRestore(activeSandboxes map[string]interface{}) erro
sb.restoreResolvConfPath()
create = !sb.config.useDefaultSandBox
}

ctx := context.TODO()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like punching-through the context is a realistic option (as a follow-up); I see this function is called from libnetwork.New, which already has a context and spans set up.

(Not sure if context-cancelation is expected to be handled in this function though, or if it would require a context.WithoutCancel().

@robmry robmry modified the milestones: 28.1.2, 28.2.0 May 9, 2025
@robmry robmry moved this from Up next to Complete in Maintainer spotlight May 9, 2025
@robmry robmry merged commit f848191 into moby:master May 9, 2025
141 checks passed
@robmry robmry deleted the active_endpoints_error_error branch May 14, 2025 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Complete
3 participants