The Plights of Bare-Metal Docker Networking
Hello!
A discussion I had recently led me to write up a problem I’ve been facing in some of my personal infrastructure.
It’s been ongoing for a few years (since about 2017, actually), and I haven’t really come up with a better solution than the bandage I slapped onto it years ago. Meanwhile, that bandage hums away and drives all of the ingress traffic to my cluster.
This is a rather long document that covers a significant amount of history about my hardware and how I handle application deployments. It isn’t exhaustive of my experiences, and it may or may not be useful to you.
Backstory
Let’s give a little context to my old setups.
Previously, I was managing most of my services manually through the Proxmox hypervisor (henceforth, PVE). This actually worked quite well for the time and I was making heavy use of the Linux Container features of the platform.
However, maintenance quickly grew and I was having a hard time keeping applications and their environments updated. This didn’t really stop me from using the applications, but I would miss out on stability improvements for certain duct-taped applications (looking at you, descendants of Nzbdrone). I’d find that applications would crash or develop problems and I wouldn’t know until I tried to use them, which defeated the point of some of them.
In an ideal world, I would have rolled out some monitoring. However, I wasn’t interested in rigging a dozen Linux environments with monitoring just so I could know Sonarr had yet another crash, given that it has the stability of a stock market. Things were getting out of date, and volume configuration was becoming very cumbersome, since I’d have to either manually manage the NFS shares or attach them as a datastore and pollute the storage metrics generated by PVE. In my situation, PVE wasn’t really a lean and mean way to do things. It ended up being quite obese, in terms of all the background services that I simply didn’t use.
To be clear, it’s a great tool when used correctly, but it wasn’t quite the right fit for me. My systems at the time were ancient and really weren’t equipped to deal with the weight of PVE plus all of the work I was trying to throw on top of them, and PVE shipped a lot of features I was never going to touch.
In 2017, I jumped on the first-generation Threadripper bandwagon. I had significant savings at the time, despite being stuck in my undergraduate program, and purchased a TR 1950X with 64GB DDR4 non-ECC. I took this time to move over to something new, since I could slowly deprecate my services as I handled the conversion process.
Granted, I could have continued running PVE, but the looming maintenance problems still existed, and I had been using ephemeral containers on my desktop for some time by that point. The world was rapidly moving to a containers-first approach for everything, and a lot of my applications felt like they’d fit a containerized environment far better. Not to mention the large ecosystem that had developed around Docker: any application I wanted would likely have images built for it already. Rolling out complicated software stacks went from a tedious exercise of setting everything up manually to writing a few dozen lines of YAML and running a single command. At the time, I didn’t have better things to do, but I knew that wouldn’t always be the case. I eventually needed to transition to something else, so I went with Docker.
tl;dr: went from old virtualization environment on crappy hardware to containerized environment on fast hardware, never looked back.
Today
This was several years ago, and now I regularly use containers for just about everything. Most of my infrastructure is designed containers-first, and I have automated pipelines handling most of the pain of keeping software up to date. There are still some stateful applications I have to keep my eyes on, and update routines I have to follow, but updating many applications is as simple as ripping out the old binaries and putting in the new ones.
Unfortunately, transitions like this tend to solve some problems but also create new ones. I had a whole new host of issues to try and solve, most of which I ended up tackling. However, there was a single issue that I haven’t been able to completely fend off, even today.
Networking in Docker
A distinct offering of my old setup with PVE was Linux Containers (LXC). Unlike Docker containers, they’re NOT ephemeral: the entire rootfs persists under all conditions. I was also able to assign a container any IP address of my choosing. Furthermore, I could migrate a container from one node to another and the network configuration would remain the same. Ultimately, I was able to point my firewall at a single address and know I would always make a connection, regardless of which system the container was on.
Docker, while it does allow you to schedule just about any container on whatever node you’d like (assuming you’re in a Swarm cluster), doesn’t have any comparable story for IP assignment: there’s no way to give a service an address that follows it around the cluster.
It does have support for binding to a physical network, specifically using the macvlan driver. There’s a downfall here, though: while traditional bridge networks managed by Docker do allow you to specify container IPs in many cases, this feature just magically doesn’t work on macvlan adapters.
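For context, here’s roughly how such a network gets created in a Swarm. The subnet matches the ingress network described below; the parent interface name is a stand-in for whatever NIC actually backs it on a given node:

```sh
# On every node: a config-only network describing the local interface.
docker network create --config-only \
  --subnet 192.168.5.0/24 \
  --gateway 192.168.5.1 \
  -o parent=eth0 \
  ingress-config

# Once, on a manager: the swarm-scoped macvlan network itself.
docker network create -d macvlan \
  --scope swarm \
  --config-from ingress-config \
  ingress
```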
What does this mean and how is it a problem?
In my case, I have a dedicated network (192.168.5.0/24) which is reserved specifically for ingress. I don’t make much use of Docker’s default routing capabilities, which really just amount to binding each service to a port on the host’s LAN IP address. That isn’t a great method for a Swarm setup, since you’d be left pointing a firewall at one specific server. If that physical host ever went down and its containers were allocated to another node, nothing would route, because the services would no longer be at the same IP address.
The real disadvantage of Docker’s macvlan adapter is the control that Docker insists on exerting over it. It’s not that intelligent about how addresses are handled, either: it will just increment to the next address whenever a new container appears. This is a problem because you cannot guarantee a certain address for any given container.
You might think to use DNS round-robin to spread incoming requests across servers, but this isn’t safe for generic TCP applications (of which I have a fair few). DNS isn’t designed to handle a problem like this. If you were only running HTTP/S services, you might actually be able to get away with it, even if it’s not the best way to handle things.
This is where the real pain point lies with Docker, for me. Theoretically, the Compose Specification does say you can set an explicit IPv4 address via ipam options under a service’s network configuration, and it even provides examples. In practice, however, this option isn’t applicable to Swarm-scoped macvlan networks at all. You can use it with other network types in a Swarm scope, but you’re really not able to assert a specific IP address when the network is addressing a physical one. There’s no direct error from it, either: the container is simply allocated the next incremented IP address, and your specification is completely ignored.
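To make that concrete, here’s a minimal Stack file of the shape that trips over this; the service name and image are placeholders, and under docker stack deploy the ipv4_address below is silently ignored:

```yaml
version: "3.8"

services:
  whoami:
    image: traefik/whoami   # stand-in for any service
    networks:
      ingress:
        # Honored by plain `docker compose up`; silently ignored by
        # `docker stack deploy` when `ingress` is a swarm-scoped macvlan.
        ipv4_address: 192.168.5.10

networks:
  ingress:
    external: true   # the swarm-scoped macvlan network from earlier
```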
I am far from the only one with this problem. Today, if you look into it, you’ll likely come across this DevOps StackExchange post, which is a perfect example of what I’m facing. In my case, my first exposure to this problem was this long-outstanding GitHub issue for moby/moby.
There are some great use cases in there showing why people need this functionality, and why it’s not practical, or even necessary, to design around it with a load balancer or other means.
Now, some people have figured out temporary solutions around this disaster, but they also come with their own problems.
Some of these hacky solutions (like the Gist above) bring a whole new host of integration problems. Using that example explicitly: you’re relying on the container to configure resources for itself, which already breaks some of the mantra that comes with modern containerization practices. In my case, if I were to employ something like this, I’d be left with two options:
- extending every single image that I don’t have full control over so that it can manage itself
- developing and maintaining images for every single application that I’d like to use
(or, most likely, a combination of both)
Normally there’s no need to extend an image provided by a vendor, but I would be forced to if I adopted the above solution. Furthermore, rolling every single application that I use into its own image, built on top of a base image which I also maintain, is a total nightmare. Overall, I’m not interested in doing it, and in terms of best practice it’s all kinds of wrong, as you can hopefully see.
tl;dr: Docker Stack will completely ignore certain facets of your Compose file because they decided not to support them, and lots of people are unhappy about the situation. A load balancer is usually used in place of this virtual IP allocation nonsense.
What about Nginx or HAProxy? They’re often used to handle ingress like you’re describing.
This is a great point, and a universal TCP/UDP load balancer like nginx may well be capable of handling my ingress. However, I would still face the exact same problem: a single, static point of ingress, when I really want something dynamic that can be relocated at any point in time. I could use a middleman host, either running the load balancer bare-metal or containerizing it, and have it act as the point of ingress. Unfortunately, that just trades one problem for several different ones.
First, I’d have a single point of failure. You could argue this is already true of my firewall, but this is ingress for all kinds of traffic, and a lot of it originates from the local network. Directing all of that potential traffic through a single application instance is less than ideal. I’d be stuck managing TCP, UDP, and a myriad of HTTP across a single instance, and the configuration would quickly grow. Theoretically, once it’s set, I could forget about it, but every new service would need a new entry. Granted, this solution would be dynamic: I could likely throw it onto any particular node and it would be fine.

There’s still the exact same port problem, though, where I’m required to play a port dance, defining all kinds of ports for different services. Nginx has the great feature of server blocks, but that simply isn’t applicable at the TCP/UDP layer.

There’s also the connection to the backend. I would be stuck creating an individual bridge for the load balancer to reach each service, likely a point-to-point network per service, which isn’t completely horrific but does complicate the configuration somewhat. I’m not sure how well that would scale given the underlying network stack, but I’m probably never going to push it hard enough to warrant that discussion.
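To illustrate the port dance: at the TCP layer, the best nginx offers is the stream module, which can only discriminate by listen port. A minimal sketch, with the upstream address and ports invented for illustration:

```nginx
stream {
    # No equivalent of name-based virtual hosts down here:
    # every raw TCP service costs its own dedicated listen port.
    upstream postgres_backend {
        server 10.0.10.2:5432;   # hypothetical backend container
    }

    server {
        listen 5432;
        proxy_pass postgres_backend;
    }
}
```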
All of that said, Kubernetes has a similar problem. Bare-metal clusters like mine aren’t well-supported in vanilla K8s. The most obvious case of this is NodePort ingress, which is just as disgusting as Docker’s port publishing; it’s the exact same concept. Though, the Kubernetes world has at least somewhat figured this out via MetalLB, which aims to solve a problem quite close to mine. That said, MetalLB is technically still alpha software, and it didn’t exist when I was setting out to do all of this.
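For contrast, this is roughly what the end state looks like with MetalLB: a Service requests an address out of a configured pool, and the controller announces it on the physical network (answering ARP for it, in layer-2 mode). The service name and address here are stand-ins from my ingress range:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami   # hypothetical service
spec:
  type: LoadBalancer
  # MetalLB assigns this from its configured pool and
  # announces it on the LAN.
  loadBalancerIP: 192.168.5.10
  selector:
    app: whoami
  ports:
    - port: 80
      targetPort: 8080
```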
tl;dr: I hate NodePort and nginx doesn’t strictly solve all the problems
There are good reasons not to do what you’re doing. What’s your reasoning? Is this really just an XY problem?
Maybe I don’t know better? Perhaps I’m missing a key to the puzzle here that I somehow haven’t figured out after all of these years of dealing with containerized applications. However, I’m not convinced this is an anti-pattern, strictly speaking. If you’ve read the whole document here, you might come to the conclusion of “well, duh, you’re trying to convert a very static environment to a dynamic container environment. Containers aren’t built for this.” You could be right and this is a completely valid interpretation.
I even understand some of this from a technical perspective. I’m effectively asking for a container to have a fixed IP address in a Swarm environment, where highly-available replicas can be configured. Still, I’m not convinced this should be an impossible-to-overcome restriction. We already have plenty of mutually exclusive or otherwise restricted fields in a modern Stack file, so why not just enforce a replica limitation if I try to do something infeasible, like allocating two containers with the same IP? We already have cases where services will fail and back off when network configurations aren’t valid.
It’s obvious at this point that the problem won’t be solved, considering the issue has been open since 2016. Anyone who was interested has long since fixed it on their own via other means, or has moved on to better-suited platforms. I suspect those behind the moby project and Docker proper can’t be bothered to tackle these problems. That isn’t bad-mouthing them by any means; I’ve built an infrastructure off of their undoubtedly back-breaking labor, and so has most of the world.
Like others before me, I have duct-taped a solution to hopefully get things working.
What are you doing now?
I’ve had various solutions to this problem over the years, but they’ve all been the same in spirit. It started with a bash script, then grew into a hackery of Python, then into a slightly more formal hackery of Python. The bash script was incredibly unstable, and I was eventually forced to move to Python since it lent itself to more resilience.
My current solution is a system service which attaches to the Docker socket directly and attempts to grok events from it. This isn’t without caveats, though. The setup is strongly dependent on labels and is very what-you-see-is-what-you-get: multiple network definitions aren’t supported, and it relies on full networking capabilities on the host, which is fine in my situation. It does require each host to be configured identically, which also isn’t a problem for me, since bare-metal deployments are managed via Ansible.
Another concern is the Docker socket API itself. The Docker daemon is a little HTTP server which fronts a lot of system-level control, and it’s not very fast, as I found out somewhat recently. Granted, I never expected it to be all that speedy, but the most recent wall of performance problems has been insufferable. There’s an unknown threshold in my system beyond which performance tanks dramatically. To be clear, the physical hardware isn’t brand-new, but it’s far from slow: all of the backing storage is solid state over NVMe, and the oldest systems are first-generation Threadrippers.
Around the 50-to-60-container threshold, things start slowing down dramatically. The Docker daemon starts writing to disk a lot, and the event queue over HTTP completely stalls out. Simple container inspection requests, which only happen once per container start/stop event, take minutes instead of milliseconds and eventually stall out entirely. There are probably ways I can solve this, and I haven’t completely given up on diagnosing the issue (I’m still investigating), but it’s certainly discouraging to hit this kind of wall at such a small scale: not even 100 containers.
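An easy way to watch the stall from outside my tool is to time a raw request against the daemon socket (the API version here is pinned arbitrarily); on a healthy host, this returns in milliseconds:

```sh
# List containers straight off the daemon socket and time it.
time curl -s --unix-socket /var/run/docker.sock \
  http://localhost/v1.41/containers/json > /dev/null
```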
Regardless, current troubles aside, this little tool has worked out pretty well so far. Exact IP addresses (and their backing interface) are expressed through container labels, which the tool listens for and assigns directly. The security measures here are pretty minimal, so just about any container with the right labels can convince the application to do whatever assignment it asks for. Furthermore, assignments aren’t that intelligent, so you have to know your configuration won’t break anything. Some day I may make the service a bit more intelligent and add basic checks that a configuration is valid, but right now it’ll let you screw things up all you want.
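As a heavily trimmed sketch of the idea (the label names and the assign_address helper are hypothetical stand-ins, not my tool’s actual schema), the core loop looks something like this with the docker-py SDK:

```python
import subprocess

import docker  # docker-py SDK; talks to /var/run/docker.sock by default

# Hypothetical label schema; the real tool uses its own names.
LABEL_IP = "ingress.ipv4"
LABEL_IFACE = "ingress.interface"


def assign_address(ip: str, iface: str) -> None:
    """Bind the requested address on the host so it can be routed inward.

    No validation whatsoever: the labels are trusted completely.
    """
    subprocess.run(["ip", "addr", "add", f"{ip}/24", "dev", iface], check=False)


def main() -> None:
    client = docker.from_env()
    # Block on the daemon's event stream, reacting to container starts.
    for event in client.events(
        decode=True,
        filters={"type": "container", "event": "start"},
    ):
        container = client.containers.get(event["id"])
        ip = container.labels.get(LABEL_IP)
        iface = container.labels.get(LABEL_IFACE)
        if ip and iface:
            assign_address(ip, iface)


if __name__ == "__main__":
    main()
```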
With that said, I really don’t like having to roll my own solution like this. It’s definitely not how the platform is meant to be used, but it has solved my problem. I’m somewhat stuck using this application whether or not I think it’s the best path forward, unless I find another way.
tl;dr: I am too stubborn to route things via a mix of ports and wrote a tool which somewhat overwhelms the Docker daemon even though the load is minimal. The tool is an automated virtual IP allocation service that creates bridges and allocates IP addresses using the host’s network stack and NATs them appropriately.

What’s next?
Right, so you’re probably wondering what the whole point of this document is by now. Originally, I just wanted to vent about the situation and soldier on using my little tool that mostly handles things. However, I’m really hitting the practical limits of my deployments here, not to mention the aforementioned performance issues.
After much deliberation, I’ll be moving over to Kubernetes (like the rest of the world did years ago).
Now, “upgrading” to Kubernetes isn’t a straightforward net gain. There are several benefits, but also some drawbacks. Here’s what my digging has turned up so far:
+ A (somewhat) supported solution for static IP addressing, via MetalLB.
+ Faster storage, using OpenEBS Mayastor with NVMe over Fabrics.
+ Much better formalization of deployment, from networking to storage.
+ Superior API, in performance and utility.
+ Excellent portability from one implementation to another.
+ A better road to a cloud provider if an application needs scale, fast.
- More complex deployment specification for projects.
- Dramatically more complex software, in terms of learning curve and management.
- Greater resource consumption.
With all of that considered, I still intend to go through with the move to Kubernetes. However, this has gone on for quite a while, and I’m better off rolling the rest into a “part two” where I document the migration process and the bumps I might encounter along the way.
Closing on Docker Swarm
I’ve had strong opinions both for and against Docker Swarm for quite some time. On one hand, there’s a lot of power in such a simple set of tools, letting much smaller operations tackle complex deployment problems. I think Swarm makes a lot of sense for people at my scale, where there are only a few machines and every server fits into one rack. It’s completely impractical for a datacenter, because the technology simply isn’t designed for that. I feel like Swarm (while practically abandoned at this point) still has a lot of potential, if it were shown even a little bit of love. Mirantis hasn’t completely killed Docker Swarm (the one bundled into Docker itself, mind you), but it certainly hasn’t seen a useful update in quite some time. New features simply don’t happen, so what you see is what you get, forever.
There are lots of other limitations to Swarm that I didn’t cover here; this was only my practical experience with it. I certainly haven’t been using the tool the way it was designed, though perhaps I’m too stuck in the past to employ it correctly, or there really aren’t many great avenues available to me in the first place.
If you’ve really read this far, that’s incredible. I don’t expect this to be an educational read, as it was originally structured as a total rant about the plights of Docker and ingress. Like I said before, there’s a part two to this coming up, and it will detail my whole migration process to Kubernetes.