-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Active endpoints error error #49901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Active endpoints error error #49901
Conversation
61ab315
to
c0257a0
Compare
libnetwork/network.go
Outdated
name: n.name, | ||
id: n.id, | ||
endpoints: sliceutil.Map(eps, func(ep *Endpoint) string { | ||
return fmt.Sprintf("name:%q id:%q", ep.name, stringid.TruncateID(ep.id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know %q
sometimes leads to additional escaping in output (foo:\"\"
) which can reduce readability; for id
, I don't THINK we'd ever have an empty value(?) in that case at least for id
we probably don't need %q
.
For name (which from reports ended up being empty), I wonder if we could make the output "optional", e.g.
deadbeef
(only ID was set)deadbeef (happy_torvalds)
(name was set)
Although .. I guess we list all of the endpoints within ()
in the error, so that would end up as;
has active endpoints (deadbeef (happy_torvalds), cafebeef (boring_wozniak))`
(not sure if the extra braces would make it hard to read, but perhaps that's .. "OK"?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll change the %q
to %s
.
But an Endpoint should always have a name - particularly now I've deleted the stub instantiations made when a load from store failed. So, I don't think it's worth changing the message to deal with empty strings.
I put an example of the error in the description ...
network nnn has active endpoints (name:"romantic_cerf" id:"6e6ce97cc986", name:"charming_rubin" id:"2f704afa034c")
If there's something wrong with it - I think it's that it lists endpoint names and ids, and those aren't things users should need to know or care about. Listing the container name/id (if they can be tracked down) would be more meaningful. That's not this change - but, because the fields aren't anonymous, there's scope for adding container name/id if known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙈 Missed the example, as exactly that part was out of horizontal scroll.
Ah, right, yeah, so I was mostly looking at the double-quotes in general; the error returned looks good; maybe I was recalling situations where the error is logged (in which case it may end up nested in another string, resulting in the escaped quotes); see opencontainers/runc#2372 for the fun ones. So mostly exploring if there would be alternatives that didn't require quotes.
If we keep the quotes than %q
is fine as well.
Signed-off-by: Rob Murray <[email protected]>
- Use structured logging. - Which means ids are logged consistently. - Use variable 'isRestore' instead of extra map lookups. Signed-off-by: Rob Murray <[email protected]>
c0257a0
to
a33c4a7
Compare
libnetwork/sandbox_store.go
Outdated
} | ||
log.G(ctx).WithError(err).Error("Failed to restore endpoint, getNetworkFromStore failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would these benefit from a comment to say "really, this is one case where log-and-ignore is the most correct approach"?
I know we have a bunch of these in libnetwork where it may not be correct, and which should likely be updated, so mostly from a perspective where we explicitly acknowledge that we ignore on purpose.
Perhaps logging as Warn
could make sense as well if it is a "recoverable" (and/or "somewhat expected") situation, and will never be a fatal error .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the same applies to most of libnetwork's error-ignoring ... errors shouldn't be ignored, but handling them can have highly unpredictable consequences.
Not finding these objects in the store should mean something really bad happened, like the data store on disk got corrupted. In which case, the right thing would be to abort and get the underlying issue fixed. (Particularly for the live-restore case... better not to restore anything than get to a broken state that'll be hard to debug.)
But, as the issue linked from this PR suggests, it does happen - probably because of the known races in the way libnetwork uses its data store. That's the issue Albin's new in-memory caches should eventually help with, and this issue/PR demonstrates how difficult it is to untangle without breaking things.
So, there's no way for this code to tell whether something bad is going to happen, or has already happened. I can add a comment to that effect. Not sure about Error vs Warning ... perhaps not knowing for-sure that we broke things means it can be a Warning, but it's not ideal.
On Sandbox restore if an Endpoint (or Network) can't be loaded from the store and the container is to be preserved (live-restore), stub Network and Endpoint objects are created and added to the cache. But then the Endpoint is just dropped - leaving the stub objects in cache. If not-live-restore, the Sandbox is reconstructed do that it can be deleted by sb.delete(force=true). But, the only thing sb.delete does with the Endpoint in this case is call ep.Delete. And, ep.Delete doesn't do anything if the Network and Endpoint can't both be loaded from the store. So, again, the stub objects are left in the in-memory cache and nothing that might have needed tidying up got tidied up. So, whether live-restoring or not, just log an error and ignore an Endpoint if it can't be loaded from disk. Signed-off-by: Rob Murray <[email protected]>
a33c4a7
to
33f5b9e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
msg := " for cleanup" | ||
create := true | ||
isRestore := false | ||
if val, ok := activeSandboxes[sb.ID()]; ok { | ||
msg = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ever so slightly considering if including a message in the log-message itself would still be useful (would it be easier to grep?) Not sure if the existing messages were accurate, but I could imagine "Failed to xxxx for cleanup" or "Failed to xxx during restore" (or whatever covers it best), would help debugging logs provided by users.
Not a blocker for sure!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, msg
was only used by "failed to create osl sandbox". The two per-Endpoint log lines just said "for cleanup", whether it was or not ...
log.G(context.TODO()).Errorf("getNetworkFromStore for nid %s failed while trying to build sandbox for cleanup: %v", eps.Nid, err)
log.G(context.TODO()).Errorf("getEndpointFromStore for eid %s failed while trying to build sandbox for cleanup: %v", eps.Eid, err)
Now, isRestore
is always included as a logger field.
@@ -206,9 +205,17 @@ func (c *Controller) sandboxRestore(activeSandboxes map[string]interface{}) erro | |||
sb.restoreResolvConfPath() | |||
create = !sb.config.useDefaultSandBox | |||
} | |||
|
|||
ctx := context.TODO() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like punching-through the context is a realistic option (as a follow-up); I see this function is called from libnetwork.New
, which already has a context and spans set up.
(Not sure if context-cancelation is expected to be handled in this function though, or if it would require a context.WithoutCancel()
.
- What I did
When a
Sandbox
is restored on startup (either to be used, because live-restore is enabled, or to clean up a container that was left running on shutdown) ... if theNetwork
orEndpoint
objects can't be loaded from store, stub objects are created.If the
Sandbox
is being live-restored, the stubEndpoint
is then just discarded. Otherwise, theEndpoint
is added to theSandbox
thensb.delete
is called to do the cleanup.sb.delete
callsEndpoint.Delete
on eachEndpoint
- andEndpoint.Delete
does nothing if it can't load bothNetwork
andEndpoint
from the disk store.So, either way, the stub
Network
/Endpoint
objects end up having no effect.#49887 added in-memory caches of
Network
andEndpoint
objects. The caches are populated whenever those objects are read from or written to the on-disk store - but most readers don't use the cached objects (yet), they just load a new copy of the object from the disk store and use that.That change added the stub objects to the in-memory cache - but they were never deleted from cache... if live-restore, they'd just be orphaned (the Endpoint wasn't added to the Sandbox). It not-live-restore,
Endpoint.Delete
returned early, before removing the cachedEndpoint
object so, again, the stub objects were orphaned.That caused a problem for the one of the only uses (so far) of the in-memory caches, replacing the stored Endpoint Count for a network with a search of the cache. The orphaned stub Endpoints referred to the network, so network deletion would fail with
ActiveEndpointsError
. I've not been able to reproduce the problem reported in #49887, so don't have logs to prove it and none have been provided on the issue yet... but the error message has an empty Endpoint name, so it looks very much like the endpoints are stub objects.If all that theorising is correct ...
- How I did it
Report endpoint id as well as name in ActiveEndpointsError
When a network genuinely has active endpoints, the new error looks like this ...
Improve logging and readability of Controller.sandboxRestore
Don't add stub Endpoint/Network object to cache on Sandbox restore
The stub objects were introduced by c8a66f5 - I don't think that logic applies any more, and ...
Network
that doesn't know which network driver it should have isn't going to be able to get involved in much cleanup, I think it was just a way to load anEndpoint
object.Endpoint
can't clean up anything like addresses/dns-entries etc, because it doesn't know what they are. But, that's ok because those things couldn't have been restored to internal structures anyway.Network
gets leaked in the disk-store (but, quite likely, it's not there anyway!).So, removed them.
- How to verify it
Tricky - no repro, and I can't add a unit test for deleted code.
- Human readable description for the release notes