Avoid selecting duplicate host ports for mappings to 0.0.0.0 and specific addresses #50054
Conversation
```go
} else if (unspecAddr.Is4() && ips4) || (unspecAddr.Is6() && ips6) {
	ipToPortMapRef[unspecAddr] = &portMapRef{portMap: p.ipMap[unspecAddr][proto]}
```
A socket listening on :: will receive both IPv4 and IPv6 packets addressed to the destination port unless the IPV6_V6ONLY sockopt is set. Do we follow the same semantics with port mappings, or does the user have to define two mappings for :: and 0.0.0.0 to publish a port dual-stack? Which address families are mapped when a port mapping is specified in the Engine API request with the HostIP field unset?
When the host IP in a -p ip:port:port is present, if it's 0.0.0.0 or :: it's single-family.
If it's not present in the -p, the host address is the value supplied by com.docker.network.bridge.host_binding_ipv4 when the network was created, or --ip for the default bridge. If that's left unset, or it's set to 0.0.0.0, the port mapping is dual-stack. (If it's set to ::, it's IPv6 only.)
The portallocator is always single-stack... so requests to it don't correspond directly to the port mapping specified by the user - the portallocator's caller has to work out which address families it needs the reservation for.
In other words, it's a bit of a mess! But, from the portallocator's perspective it's simple.
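To make those rules concrete, here's a rough sketch with a hypothetical helper (the names and parameters are illustrative, not moby identifiers), assuming the caller already knows the explicit -p host IP and the network's binding address:

```go
package main

import (
	"fmt"
	"net/netip"
)

// families records which address families a port mapping covers.
type families struct{ v4, v6 bool }

// hostFamilies is a hypothetical helper applying the rules described above.
// explicit is the host IP given in -p (zero Addr if none); binding is the
// network's host binding address, e.g. from
// com.docker.network.bridge.host_binding_ipv4 (zero Addr if unset).
func hostFamilies(explicit, binding netip.Addr) families {
	if explicit.IsValid() {
		// A host IP given in -p is single-family, even 0.0.0.0 or ::.
		return families{v4: explicit.Is4(), v6: explicit.Is6() && !explicit.Is4In6()}
	}
	switch {
	case !binding.IsValid(), binding == netip.IPv4Unspecified():
		// Binding address unset or 0.0.0.0: the mapping is dual-stack.
		return families{v4: true, v6: true}
	case binding == netip.IPv6Unspecified():
		// Binding address ::, IPv6 only.
		return families{v6: true}
	default:
		// A specific binding address keeps its own family.
		return families{v4: binding.Is4(), v6: binding.Is6() && !binding.Is4In6()}
	}
}

func main() {
	fmt.Println(hostFamilies(netip.Addr{}, netip.Addr{}))                     // {true true}: no -p host IP, no binding address
	fmt.Println(hostFamilies(netip.IPv4Unspecified(), netip.Addr{}))          // {true false}: -p 0.0.0.0:... is IPv4 only
	fmt.Println(hostFamilies(netip.Addr{}, netip.MustParseAddr("::")))        // {false true}: binding address ::
}
```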
As I was working on this, I noticed the daemon doesn't reserve its own port for IPv6, only for IPv4, when it's listening on both. Maybe worth fixing separately, but it's not possible to start a container on that port, because the daemon is already listening on it. So, I think it could only cause a problem if a port mapping had a range that included the daemon's port, and the portallocator allocated it when other ports in the range were free ... but that's not going to be a common issue.
- What I did
- How I did it
Two fixes, for related issues ...
Listen on mapped host ports before mapping more ports
Because we set SO_REUSEADDR on sockets for host ports, if there are port mappings for INADDR_ANY (the default) as well as for specific host addresses - bind() cannot be used to detect clashes.
That means, for example, on daemon startup, if the port allocator returns the first port in its ephemeral range for a specific host address, and the next port mapping is for 0.0.0.0 - the same port is returned and both bind() calls succeed. Then, the container fails to start later when listen() spots the problem and it's too late to find another port.
So, bind and listen to each set of ports as they're allocated instead of just binding.
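As a rough standalone illustration (using golang.org/x/sys/unix directly and an arbitrary example port, not moby's port-mapping code): with SO_REUSEADDR set, two sockets can bind the same TCP port on a specific address and on 0.0.0.0, so listening as part of allocation surfaces the clash while another port can still be picked.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// bindAndListen creates a TCP socket with SO_REUSEADDR, binds it to
// addr:port and immediately starts listening, so any clash with an
// already-listening socket is reported now rather than later.
func bindAndListen(addr [4]byte, port int) (int, error) {
	fd, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM, 0)
	if err != nil {
		return -1, err
	}
	if err := unix.SetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_REUSEADDR, 1); err != nil {
		unix.Close(fd)
		return -1, err
	}
	if err := unix.Bind(fd, &unix.SockaddrInet4{Port: port, Addr: addr}); err != nil {
		unix.Close(fd)
		return -1, fmt.Errorf("bind: %w", err)
	}
	if err := unix.Listen(fd, unix.SOMAXCONN); err != nil {
		unix.Close(fd)
		return -1, fmt.Errorf("listen: %w", err)
	}
	return fd, nil
}

func main() {
	// Reserve a port on a specific address, listening straight away...
	if _, err := bindAndListen([4]byte{127, 0, 0, 1}, 49153); err != nil {
		fmt.Println("127.0.0.1:", err)
	}
	// ...then try the same port on 0.0.0.0: because the first socket is
	// already listening, the clash is reported here instead of when the
	// container is started.
	if _, err := bindAndListen([4]byte{0, 0, 0, 0}, 49153); err != nil {
		fmt.Println("0.0.0.0:", err)
	}
}
```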
This fixes ...
It doesn't matter whether the already-listening TCP socket belongs to docker or to some other process running on the host.
portallocator: always check for ports allocated for 0.0.0.0/::
We set SO_REUSEADDR on sockets used for host port mappings by docker-proxy - which means it's possible to bind the same port on a specific address as well as 0.0.0.0/::.
For TCP sockets, an error is raised when listen() is called on both sockets - and the port allocator will be called again to avoid the clash (if the port was allocated from a range, otherwise the container will just fail to start).
But, for UDP sockets, there's no listen() - so take more care to avoid the clash in the portallocator.
The port allocator keeps a set of allocated ports for each of the host IP addresses it's seen, including 0.0.0.0 and [::]. So, if a mapping to 0.0.0.0/:: is requested, find a port that's free in the range for each of the known IP addresses (but still only mark it as allocated against 0.0.0.0/::).
And, if a port is requested for specific host addresses, make sure it's also free in the corresponding 0.0.0.0/:: set (but only mark it as allocated against the specific addresses - because the same port can be allocated against a different specific address).
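A minimal sketch of that allocation rule, with hypothetical types (the real portallocator also tracks protocol and port ranges, and splits the unspecified-address handling per family; those details are simplified here):

```go
package main

import (
	"fmt"
	"net/netip"
)

// allocator tracks, per host address, which ports are already reserved.
// This is a simplified, hypothetical structure - not the real one.
type allocator struct {
	allocated map[netip.Addr]map[int]bool
}

// free reports whether port can be reserved on addr under the rule above.
func (a *allocator) free(addr netip.Addr, port int) bool {
	unspec := netip.IPv4Unspecified() // 0.0.0.0
	if addr.Is6() && !addr.Is4In6() {
		unspec = netip.IPv6Unspecified() // ::
	}
	if addr == unspec {
		// A mapping to 0.0.0.0 (or ::) clashes with the same port on any
		// known address of the same family.
		for other, ports := range a.allocated {
			if other.Is4() == addr.Is4() && ports[port] {
				return false
			}
		}
		return true
	}
	// A specific-address mapping clashes with the same port on that address
	// or on the matching unspecified address, but not with the same port on
	// a different specific address.
	return !a.allocated[addr][port] && !a.allocated[unspec][port]
}

func main() {
	a := &allocator{allocated: map[netip.Addr]map[int]bool{
		netip.MustParseAddr("127.0.0.1"): {49153: true},
	}}
	fmt.Println(a.free(netip.IPv4Unspecified(), 49153))          // false - clashes with 127.0.0.1
	fmt.Println(a.free(netip.MustParseAddr("192.0.2.1"), 49153)) // true - different specific address
}
```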
This fixes ...
In this case, if a host process is already using a port allocated by the daemon, the clash will not be detected. (So, we should update docs, adding a reason not to map ports to the default, 0.0.0.0.)
- How to verify it
With the commands above (against a freshly started daemon).
New and updated unit/integration tests.
Without these changes, the integration test fails with ...
- Human readable description for the release notes