Linux Containers Internals - Networking

In this post, I’ll explore how Linux container networking works. We will use Docker as a reference implementation and see how it creates virtual IPs for each container and exposes internal container ports to the outside world.

First, we need to understand how to create a virtual network inside a host machine. Docker abstracts all this low-level plumbing away from us, but understanding how these Linux kernel features are used internally is very helpful for diagnosing container networking issues.

The network stack of each Docker container is isolated by network namespaces. A network namespace is an isolated instance of the Linux network stack, with its own configuration and network devices. Each process inherits its network namespace from its parent process, so by default all processes end up sharing the initial namespace of the init process. Let’s see how network namespaces work in practice:

$ sudo ip netns add test1
$ sudo ip link add dev veth_test1 type veth peer name if_test1
$ sudo ip link set dev veth_test1 up
$ sudo ip link set dev if_test1 netns test1
$ sudo ip netns exec test1 ip link set dev lo up
$ sudo ip netns exec test1 ip address add 172.20.0.10/16 dev if_test1
$ sudo ip netns exec test1 ip link set dev if_test1 up

The commands above create a network namespace called test1 and a virtual Ethernet device with the IP address 172.20.0.10. Virtual Ethernet devices, or veth, are always created in interconnected pairs and act as a tunnel between network namespaces: when a packet is received by one device of the pair, it is immediately available on the other. The example above creates a single pair whose endpoints are called veth_test1 and if_test1, and moves the if_test1 endpoint into our newly created namespace.
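
You can inspect the host-side endpoint and confirm that it is one half of a veth pair (the -d flag prints device details, including the device type):

$ ip -d link show veth_test1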

We can use the command ip netns exec to run a program inside the new namespace. For example, to list the network interfaces and their addresses inside the test1 namespace, we can use the following command:

$ ip netns exec test1 ip addr
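
Note that the if_test1 endpoint is no longer visible from the host’s own namespace; it now only appears inside test1. You can also list all namespaces the ip utility knows about:

$ ip netns list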

Now let’s create another namespace called test2 with a veth at the address 172.20.0.20.

$ sudo ip netns add test2
$ sudo ip link add dev veth_test2 type veth peer name if_test2
$ sudo ip link set dev veth_test2 up
$ sudo ip link set dev if_test2 netns test2
$ sudo ip netns exec test2 ip link set dev lo up
$ sudo ip netns exec test2 ip address add 172.20.0.20/16 dev if_test2
$ sudo ip netns exec test2 ip link set dev if_test2 up

At this point, our namespaces are isolated from each other and also from the host network. To enable communication between them, we will create a bridge device. The bridge will act as a virtual network switch that can have network interfaces connected to it and forward packets between them.

$ sudo ip link add dev br0 type bridge
$ sudo ip address add 172.20.0.1/16 dev br0
$ sudo ip link set dev br0 up

Now we can connect the other end of each veth pair to the bridge using:

$ sudo ip link set dev veth_test1 master br0
$ sudo ip link set dev veth_test2 master br0

Although our namespaces are now connected to the bridge, we still need one more step to enable communication between them. We need to instruct the Linux kernel to allow forwarding of IP packets through the br0 interface.

$ sudo sysctl -w net.ipv4.ip_forward=1
$ sudo iptables -A FORWARD -i br0 -j ACCEPT
$ sudo iptables -A FORWARD -o br0 -j ACCEPT

When the net.ipv4.ip_forward variable is enabled, the system will act as a router, forwarding IP packets between networks.
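
You can check whether forwarding is already enabled by reading the current value of the setting; it should print net.ipv4.ip_forward = 1 once the command above has been applied:

$ sysctl net.ipv4.ip_forward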

We should also add a default route to each namespace, pointing at the bridge address, so that packets destined for other networks are sent to the host through the bridge:

$ sudo ip -all netns exec ip route add default via 172.20.0.1
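
To confirm that the route was added, inspect the routing table of one of the namespaces; the output should contain a default via 172.20.0.1 entry:

$ sudo ip netns exec test1 ip route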

We can check the veth endpoints connected to the bridge using:

$ bridge link show br0
5: veth_test1@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 2
7: veth_test2@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 2

At this point, I have two isolated network namespaces on my machine connected to the same virtual network, and programs running inside each of them can talk to each other. Here is the network diagram at this point:

[Network diagram: the br0 bridge on the host connects veth_test1 and veth_test2 to the test1 and test2 namespaces]

You can use the commands below to check communication between the namespaces, between each namespace and the bridge, and between the host and the namespaces:

$ ip netns exec test1 ping 172.20.0.20
$ ip netns exec test2 ping 172.20.0.10
$ ip netns exec test1 ping 172.20.0.1
$ ip netns exec test2 ping 172.20.0.1
$ ping 172.20.0.10
$ ping 172.20.0.20

But the programs running inside each namespace still do not have access to the outside world. To enable this, we need to add a NAT rule on the host that masquerades packets coming from the internal network before they leave through the external physical network. We can do this with the following iptables rule:

$ sudo iptables -t nat -A POSTROUTING -s 172.20.0.0/16 -j MASQUERADE
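
You can verify that the rule is in place, and later watch its packet counters grow as traffic flows, with:

$ sudo iptables -t nat -L POSTROUTING -n -v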

We can now ping an external host from inside a namespace:

$ ip netns exec test1 ping 8.8.8.8

Now let’s start a TCP echo server on port 7777 in the test1 namespace.

$ ip netns exec test1 ncat -l 7777 --keep-open --exec "/bin/cat" &

The ncat tool is now reachable at the 172.20.0.10 address. You can check it with the following commands, run on the host machine or from the test2 namespace:

$ ncat 172.20.0.10 7777
$ ip netns exec test2 ncat 172.20.0.10 7777

But this port is unreachable from outside the host machine. We can fix this by mapping port 7777 of the test1 namespace onto the host’s physical network, whose address in this example is 192.168.10.105.

$ sudo iptables -t nat -A PREROUTING -m addrtype --dst-type LOCAL -p tcp -m tcp --dport 7777 -j DNAT --to-destination 172.20.0.10:7777
$ ncat 192.168.10.105 7777

Now it is possible to reach the ncat tool running inside the test1 namespace from another machine on the network.

We now have a virtual internal network with multiple virtual interfaces and a server listening on one of those interfaces, with its port mapped to the outside world via the host interface.

The problem with localhost

The iptables rule we created above to forward packets to the namespace does not work for packets originating from the loopback interface. We can check that by running the command below:

$ ncat localhost 7777
ncat: Connection refused.

Unfortunately, Linux does not support NAT rules for packets originating from the loopback interface. Docker circumvents that by using the docker-proxy daemon, whose responsibility is to receive packets arriving at a given host port and redirect them to the container’s port.

A real-world example

Let’s see how everything works in practice using Docker. First, let’s start a container from the httpd image and publish its port 80 to the outside world:

$ docker run -dit --name docker-test -p 8080:80 httpd

This will start a Docker container running the Apache web server and publish the container’s port 80 on the host’s port 8080.

$ docker ps
CONTAINER ID   IMAGE     COMMAND              CREATED          STATUS          PORTS                  NAMES
182eda2f2cf9   httpd     "httpd-foreground"   13 minutes ago   Up 13 minutes   0.0.0.0:8080->80/tcp   docker-test

Now let’s see how Docker configured our networking environment.

$ ip link 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:0c:29:c0:45:20 brd ff:ff:ff:ff:ff:ff
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:d3:ca:89:f8 brd ff:ff:ff:ff:ff:ff
5: vethabf3305@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether c6:d4:e6:76:0d:eb brd ff:ff:ff:ff:ff:ff link-netnsid 0

There is a bridge called docker0 and a virtual interface called vethabf3305@if4 connected to the docker0 bridge.
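
We can list the interfaces attached to the docker0 bridge to confirm that this veth endpoint is enslaved to it:

$ ip link show master docker0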

If you try to run ip netns list now, it will not show any namespaces. That’s because the ip utility only lists the namespace files created under the /var/run/netns/ directory, and the Docker daemon does not create them. But we can fix that. Let’s first get the PID of the container’s root process:

$ docker top 182eda2f2cf9
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                4929                4906                0                   21:36               pts/0               00:00:00            httpd -DFOREGROUND
daemon              4980                4929                0                   21:36               pts/0               00:00:00            httpd -DFOREGROUND
daemon              4981                4929                0                   21:36               pts/0               00:00:00            httpd -DFOREGROUND
daemon              4982                4929                0                   21:36               pts/0               00:00:00            httpd -DFOREGROUND

In this example, the PID we want is 4929.

There will be a file pointing to the network namespace created by Docker under that PID’s proc filesystem.

$ ls /proc/4929/ns/net -la
lrwxrwxrwx 1 root root 0 May  9 21:36 /proc/4929/ns/net -> 'net:[4026532622]'

We can now bind-mount the network namespace descriptor of the process onto the path where the ip utility expects to find it (creating the /var/run/netns directory first if it does not exist):

$ mkdir -p /var/run/netns
$ touch /var/run/netns/182eda2f2cf9
$ mount --bind /proc/4929/ns/net /var/run/netns/182eda2f2cf9

The ip utility should now have access to the network namespace created by Docker:

$ ip netns exec 182eda2f2cf9 ip addr
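
As a cross-check, Docker itself can report the container’s IP address. Assuming the container is attached to the default bridge network, the following should print 172.17.0.2:

$ docker inspect -f '{{.NetworkSettings.IPAddress}}' 182eda2f2cf9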

But how is Docker forwarding the host port to the container’s port? We know that we can use iptables rules for that. Let’s check the iptables NAT rules:

$ iptables -t nat -L DOCKER
Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
DNAT       tcp  --  anywhere             anywhere             tcp dpt:http-alt to:172.17.0.2:80

The DOCKER chain has a NAT rule with the port redirection configured exactly as we did in the previous section.
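
To see how packets reach the DOCKER chain in the first place, list the PREROUTING chain; it should contain a rule that jumps to DOCKER for packets whose destination address is local to the host, much like the --dst-type LOCAL match we used manually:

$ iptables -t nat -L PREROUTING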

The docker-proxy process is also running, listening on 0.0.0.0:8080 and forwarding to the container’s port any connections that are not handled by the iptables rule above, such as those arriving via the loopback interface:

$ ps ax | grep docker-proxy
   4890 ?        Sl     0:00 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8080 -container-ip 172.17.0.2 -container-port 80

$ netstat -nap | grep docker-proxy
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      4890/docker-proxy
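
Because docker-proxy accepts connections on the loopback interface, connecting to localhost also works here, unlike in our manual setup. Requesting the page should return Apache’s default “It works!” response:

$ curl http://localhost:8080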

In the next post of this series, we will explore how container networking works under Kubernetes.
