Docker Swarm on NixOS with GitOps

Building a declarative, highly available, git-driven container cluster

What this post covers

This is a full walkthrough of building a three-node Docker Swarm cluster running on NixOS, managed entirely through a Git repo and deployed via GitHub Actions. Every config change, every stack deployment, every OS-level tweak flows through a single repo push without the need for SSH-and-edit.

By the end you will have:

  • 3 NixOS VPS nodes in the same VPC, running Docker Swarm with 3 manager nodes
  • A separate NixOS VPS running Pangolin (reverse proxy + tunnel endpoint), also GitOps-managed
  • A single GitHub repo per concern (one for the Swarm cluster, one for the proxy) as the source of truth for all OS and service configuration
  • GitHub Actions that rebuild each node on push to main
  • Swarm stacks auto-discovered and deployed from the same repo
  • Shared persistent storage across Swarm nodes via a mounted filesystem
  • sops-nix for encrypted secrets, decrypted at deploy time on each node
  • Pangolin routing HTTPS traffic into the Swarm overlay network without publishing ports on hosts
  • In-house app builds pushed to a private container registry and deployed as Swarm services
  • A workstation recovery path so you can regain full access from a new laptop

The three building blocks

Before diving in, here is a quick grounding on the three core technologies this build combines. If you are already familiar with all three, skip to the architecture overview.

Docker Swarm

Docker Swarm is Docker's built-in container orchestration. If you have used docker compose up on a single host, Swarm is the multi-host version of that: it takes the same Compose-style YAML but distributes containers across a cluster of machines. You get service replication (run N copies of a container spread across nodes), rolling updates (replace containers one at a time with zero downtime), an overlay network (so containers on different physical hosts can talk to each other as if they were on the same LAN), and an ingress routing mesh (so traffic hitting any node on a published port gets forwarded to whichever node is actually running that container).

It is a lot simpler than Kubernetes. There is no etcd to manage, no control plane to babysit, no CRDs, no Helm charts. You write a YAML file, run docker stack deploy, and it works. For small to medium workloads, Swarm does everything you actually need without the operational overhead.

A Swarm cluster has two roles: manager nodes (which handle scheduling, cluster state, and Raft consensus) and worker nodes (which only run containers). In a three-node cluster like this one, all three are managers, and managers also run containers by default. This gives you fault tolerance: the cluster stays healthy and schedulable as long as two of the three nodes are up. If one node goes down, the surviving two maintain quorum and reschedule its containers onto healthy nodes.
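
The "two of three" figure comes from Raft needing a strict majority of managers. A minimal sketch of that arithmetic, with function names that are purely illustrative:

```shell
#!/usr/bin/env bash
# Raft needs a strict majority of managers to agree on cluster state.
swarm_quorum() {
  local managers=$1
  echo $(( managers / 2 + 1 ))
}

# Number of managers that can fail while a majority still remains.
swarm_fault_tolerance() {
  local managers=$1
  echo $(( (managers - 1) / 2 ))
}

for n in 1 3 5; do
  echo "managers=$n quorum=$(swarm_quorum "$n") tolerated_failures=$(swarm_fault_tolerance "$n")"
done
```

With three managers the quorum is two, so one node can fail. A fourth manager would raise the quorum to three without tolerating any more failures, which is why odd manager counts are the usual choice.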

NixOS

NixOS is a Linux distribution where the entire system configuration is declared in code. You do not install packages imperatively (like apt install nginx), you do not edit config files scattered across /etc, and you do not maintain shell scripts that drift over time. Instead, you write a single Nix expression that describes the desired state of the system: every package, every service, every user account, every firewall rule, every mount point, every kernel module. Then you run nixos-rebuild switch, and the Nix package manager evaluates that expression and atomically transitions the running system to match.

This means the configuration is reproducible. If you lose a node, you provision a new VPS, clone the repo, and apply the same flake. You get the exact same system. If you want three identical nodes with small per-host differences (like different hostnames or VPC IPs), you write a shared module and parameterise the host-specific parts. And because the entire config is just files in a Git repo, it slots naturally into version control and CI/CD.

NixOS also gives you atomic rollbacks. Every rebuild creates a new "generation" of the system. If a change breaks something, you can roll back to the previous generation instantly, either from the boot menu or with nixos-rebuild switch --rollback.

GitOps

GitOps is a pattern where a Git repository is the single source of truth for infrastructure and application state. You do not SSH into servers and run commands. You commit changes to the repo, and automation applies those changes to the live environment.

In this build, GitOps means:

  • Push a NixOS config change to main → GitHub Actions SSHes into each node, pulls the latest commit, and runs nixos-rebuild switch. The OS, packages, services, users, firewall rules, and everything else update to match.
  • Push a new Swarm stack YAML to main → the same pipeline copies the stack file to the Swarm manager and runs docker stack deploy. The new service comes up.
  • Push an application code change → the pipeline builds a container image, pushes it to a private registry, and deploys the updated stack with the new image tag.

The result is that the running state of every node (the OS, the Docker configuration, and every deployed service) is a function of what is in the repo. If someone makes a manual change to anything the repo manages, the next push to main overwrites it. Drift in managed configuration cannot accumulate, because every deploy resets it to what the repo says.


Architecture overview

The infrastructure has two concerns, managed in two separate repos:

The Swarm cluster (3 VPS nodes in one VPC): Three VPS nodes sit in the same VPC (private network). Each runs NixOS with Docker enabled. Docker Swarm is initialised across all three using VPC private IPs for inter-node communication. A shared filesystem (virtiofs, NFS, or whatever your provider offers) provides persistent storage that all three nodes can access at the same mount path.

The reverse proxy (1 VPS, separate): A fourth VPS runs Pangolin (reverse proxy + WireGuard tunnel endpoint) as a Docker Compose stack on NixOS, managed by its own repo with the same GitOps pattern. This VPS sits outside the Swarm VPC. It receives inbound HTTPS traffic from the internet and routes it through a tunnel (Newt) into the Swarm's overlay network.

Services inside the Swarm do not publish ports on their hosts. The only way traffic reaches them is through the tunnel via Pangolin. This means you do not need to open service ports on the Swarm VPC firewall at all.

The critical constraint for the Swarm nodes: /var/lib/docker must stay local to each node. Docker Swarm requires each node to have its own local Docker state. You cannot put /var/lib/docker on shared storage. Shared persistent data for your applications is handled by bind-mounting directories from the shared filesystem into services, or by creating named volumes backed by bind mounts into the shared filesystem. This is the standard pattern.
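
As a rough sketch of the named-volume variant, the helper below prints (rather than runs) a docker volume create command that uses the local driver's bind options. The /mnt/vfs/docker-volumes root matches the directory the flake creates later in this post; the helper name and the volume name myapp-data are made up for illustration.

```shell
#!/usr/bin/env bash
# Root of the shared filesystem area used for application data in this post.
SHARED_ROOT="/mnt/vfs/docker-volumes"

# Print the command that would create a named volume backed by a bind mount
# into the shared filesystem. Nothing is executed; review the output first.
volume_create_cmd() {
  local name=$1
  case "$name" in
    ""|*/*|.*)
      echo "invalid volume name: $name" >&2
      return 1
      ;;
  esac
  printf 'docker volume create --driver local --opt type=none --opt o=bind --opt device=%s/%s %s\n' \
    "$SHARED_ROOT" "$name" "$name"
}

# Hypothetical volume name, for illustration only.
volume_create_cmd myapp-data
```

Volumes created with the local driver are known only to the node they were created on, so the printed command would need to run on every node that might host the service. The target directory under the shared mount must also exist before a container mounts the volume.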


Part 1: The Swarm Cluster

This covers everything needed to go from zero to a working three-node Swarm cluster with GitOps deployment.


Step 0: Workstation prerequisites

You need a handful of tools on your local machine before touching any servers.

Install packages

You need git, ssh, and base64 (usually already present on Linux/macOS). You also need the Nix package manager (not NixOS itself, just the Nix CLI) because several tools used in this guide (sops, age, ssh-to-age, mkpasswd) are easiest to run via Nix's ephemeral shell.

If you do not have Nix installed on your workstation, follow the official install instructions at nixos.org. Either the single-user or the multi-user installer works. You only need the nix CLI, not a full NixOS install.

If your Nix install does not have flakes enabled by default, you can either add experimental-features = nix-command flakes to /etc/nix/nix.conf and restart the daemon, or pass the flag inline on every command (shown below).

Create SSH keys

You need two distinct key pairs:

  • Admin key: your interactive SSH access to the nodes. You use this when you need to log in manually.
  • CI deploy key: used by GitHub Actions to SSH into nodes as the deploy user. This key lives as a base64-encoded GitHub secret.

# Admin key (your interactive SSH access)
ssh-keygen -t ed25519 -a 64 -f ~/.ssh/swarm_admin_ed25519 -N ""
cat ~/.ssh/swarm_admin_ed25519.pub

# CI deploy key (GitHub Actions to SSH into nodes)
ssh-keygen -t ed25519 -a 64 -f ~/.ssh/swarm_ci_deploy_ed25519 -N ""
cat ~/.ssh/swarm_ci_deploy_ed25519.pub

Keep the public key outputs handy. You will paste them into the NixOS flake.
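
Since the CI private key is stored as a base64-encoded secret, it is worth checking that the encoding round-trips before pasting it into GitHub. The sketch below uses a throwaway file rather than a real key. The decode side mirrors the tr and base64 -d pipeline used by the workflow in this post; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Encode a file as one line of base64, convenient for pasting into a secret.
encode_key() {
  base64 < "$1" | tr -d '\n'
}

# Decode the way the deploy workflow does: strip stray carriage returns,
# then base64-decode to stdout.
decode_key() {
  printf '%s' "$1" | tr -d '\r' | base64 -d
}

# Round-trip demo on a throwaway multi-line file (not a real key).
demo=$(mktemp)
printf 'first line\nsecond line\n' > "$demo"
encoded=$(encode_key "$demo")
if decode_key "$encoded" | cmp -s - "$demo"; then
  echo "round trip ok"
fi
rm -f "$demo"
```

For the real key you would run encode_key ~/.ssh/swarm_ci_deploy_ed25519 and store the output as the CI_DEPLOY_PRIVATE_KEY_B64 secret. Removing the newlines keeps the value on a single line, which avoids line-wrapping surprises when pasting.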

Generate an admin password hash

NixOS needs a hashed password for the admin user (for sudo, console login, etc). Generate one:

nix --extra-experimental-features "nix-command flakes" shell nixpkgs#mkpasswd -c \
  bash -lc 'mkpasswd -m sha-512'

Type your desired password when prompted. Copy the $6$... output. You will paste it into flake.nix later.
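
A quick sanity check on what you copied can catch a truncated paste before it reaches the flake. This sketch only tests the overall shape of a SHA-512 crypt string ($6$, a salt, another $, then the hash). The function name is illustrative and it does not verify the password itself.

```shell
#!/usr/bin/env bash
# Return 0 if the argument has the shape $6$<salt>$<hash>.
is_sha512_crypt() {
  case "$1" in
    '$6$'*'$'?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Single quotes stop the shell from expanding the $ characters.
if is_sha512_crypt '$6$examplesalt$examplehash'; then
  echo "looks like a sha-512 crypt hash"
fi
```

Always wrap the hash in single quotes on the command line. Inside double quotes the shell would try to expand parts of it as variables and silently corrupt the value.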

Convert your admin SSH key to an age recipient

sops-nix uses age encryption. You need your admin SSH public key converted to an age recipient so you can encrypt and decrypt secrets on your workstation:

nix --extra-experimental-features "nix-command flakes" shell nixpkgs#ssh-to-age -c \
  bash -lc 'ssh-to-age < ~/.ssh/swarm_admin_ed25519.pub'

Copy the age1... string. You will use it in .sops.yaml.


Step 1: Create the repo structure

Create a private GitHub repo. This repo will contain the full NixOS configuration for all three nodes, all Swarm stack definitions, sops secrets config, and the CI/CD workflow.

mkdir -p my-swarm-cluster
cd my-swarm-cluster

mkdir -p hosts/swarm-1
mkdir -p hosts/swarm-2
mkdir -p hosts/swarm-3
mkdir -p swarm
mkdir -p .github/workflows

The final repo layout will look like this:

.
├── flake.nix                  # shared NixOS config for all 3 nodes
├── flake.lock                 # pinned input versions
├── secrets.yaml               # sops-encrypted secrets
├── .sops.yaml                 # sops recipient config
├── hosts/
│   ├── swarm-1/
│   │   ├── hardware-configuration.nix
│   │   └── networking.nix
│   ├── swarm-2/
│   │   ├── hardware-configuration.nix
│   │   └── networking.nix
│   └── swarm-3/
│       ├── hardware-configuration.nix
│       └── networking.nix
├── swarm/
│   ├── myapp-stack.yml
│   ├── newt-stack.yml
│   └── another-service-stack.yml
└── .github/
    └── workflows/
        └── deploy.yml
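
Once the files from the following steps exist, a small check like this can confirm that nothing in the layout above was missed before you push. It is only a sketch: the function name is made up, and the file list is taken from the tree shown here (stack files are left out because their names vary).

```shell
#!/usr/bin/env bash
# Print every expected file that is missing under the given repo root.
# Prints nothing when the layout is complete.
missing_repo_files() {
  local root=$1 host f
  for f in flake.nix flake.lock .sops.yaml secrets.yaml .github/workflows/deploy.yml; do
    [ -f "$root/$f" ] || echo "$f"
  done
  for host in swarm-1 swarm-2 swarm-3; do
    for f in hardware-configuration.nix networking.nix; do
      [ -f "$root/hosts/$host/$f" ] || echo "hosts/$host/$f"
    done
  done
}
```

Run it from the repo root as missing_repo_files . and fix whatever it prints. The hardware-configuration.nix files only exist after each node has been installed, so expect them to be listed until then.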

Step 2: Write the Nix flake

This is the core of the build. A single flake.nix defines all three nodes using a mkHost function. Each call to mkHost produces a complete NixOS system configuration for one node, importing its host-specific hardware config. Shared config (users, Docker, SSH hardening, packages, services, secrets) is written once and applies to all three.

Create flake.nix in the repo root. Read through the inline comments; they explain why each block exists:

{
  description = "Docker Swarm cluster on NixOS with sops-nix and GitHub Actions deploy";

  inputs = {
    # Pin to a specific NixOS release channel
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";

    # sops-nix for encrypted secrets management
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";
  };

  outputs = { self, nixpkgs, sops-nix }:
    let
      system = "x86_64-linux";

      # mkHost: a function that builds a NixOS system config for one node.
      # All shared config lives inside this function.
      # Host-specific config (hardware, networking) is imported per host.
      mkHost = { hostName }:
        nixpkgs.lib.nixosSystem {
          inherit system;

          modules = [
            sops-nix.nixosModules.sops

            ({ config, pkgs, lib, ... }:
              {
                imports = [
                  # Each node has its own hardware-configuration.nix
                  # (auto-generated during NixOS install, unique per machine)
                  ./hosts/${hostName}/hardware-configuration.nix
                ];

                # stateVersion tells NixOS which release this system was
                # originally installed on. It affects default behaviours for
                # some services. Set it once and do not change it.
                system.stateVersion = "25.11";

                networking.hostName = hostName;
                time.timeZone = "Europe/London";

                nix.settings = {
                  experimental-features = [ "nix-command" "flakes" ];
                  warn-dirty = false;
                };

                # ── Boot ──────────────────────────────────────────────
                boot.loader.grub.enable = true;
                boot.loader.grub.devices = [ "/dev/vda" ];
                boot.loader.grub.useOSProber = false;

                # ── Users ─────────────────────────────────────────────
                # mutableUsers = false means user accounts are ONLY what
                # is defined here. No one can useradd/passwd on the box.
                # If you need to change passwords or keys, update the
                # flake and push.
                users.mutableUsers = false;
                security.sudo.wheelNeedsPassword = true;

                users.users.admin = {
                  isNormalUser = true;
                  description = "Admin";
                  extraGroups = [ "wheel" "docker" ];
                  openssh.authorizedKeys.keys = [
                    # Paste your admin public key here:
                    "ssh-ed25519 AAAA... your-admin-pubkey"
                  ];
                  # Paste the $6$... hash from mkpasswd:
                  hashedPassword = "$6$...your-hash-here";
                };

                users.users.deploy = {
                  isNormalUser = true;
                  description = "CI Deploy User";
                  extraGroups = [ "wheel" "docker" ];
                  openssh.authorizedKeys.keys = [
                    # Paste your CI deploy public key here:
                    "ssh-ed25519 AAAA... your-ci-deploy-pubkey"
                  ];
                  # No password. This user is SSH-key-only via GitHub Actions.
                };

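                # deploy user gets scoped NOPASSWD sudo.
                # It can ONLY run these four commands without a password:
                # nixos-rebuild (to apply config), git (to pull repo),
                # systemctl (to restart services), true (for connection tests).
                # nixos-rebuild and git running as root are still effectively
                # root-equivalent, so guard the CI deploy key accordingly.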
                security.sudo.extraRules = [
                  {
                    users = [ "deploy" ];
                    commands = [
                      { command = "/run/current-system/sw/bin/nixos-rebuild";
                        options = [ "NOPASSWD" ]; }
                      { command = "/run/current-system/sw/bin/git";
                        options = [ "NOPASSWD" ]; }
                      { command = "/run/current-system/sw/bin/systemctl";
                        options = [ "NOPASSWD" ]; }
                      { command = "/run/current-system/sw/bin/true";
                        options = [ "NOPASSWD" ]; }
                    ];
                  }
                ];

                # ── SSH hardening ─────────────────────────────────────
                services.openssh = {
                  enable = true;
                  # openFirewall = false because we manage firewall at
                  # the provider level for the swarm nodes.
                  openFirewall = false;
                  settings = {
                    PasswordAuthentication = false;
                    KbdInteractiveAuthentication = false;
                    PermitRootLogin = "no";
                    X11Forwarding = false;
                    AllowTcpForwarding = "no";
                    AllowAgentForwarding = "no";
                    ClientAliveInterval = 300;
                    ClientAliveCountMax = 2;
                    MaxAuthTries = 3;
                    LogLevel = "VERBOSE";
                  };
                };

                services.fail2ban.enable = true;

                # ── Firewall ──────────────────────────────────────────
                # Disabled on swarm nodes because we use provider-level
                # firewall rules (Vultr, Hetzner, etc). The provider
                # firewall is easier to manage centrally and avoids
                # conflicts with Docker's iptables manipulation.
                # If your provider does not offer a firewall, enable this
                # and open ports 22, 2377, 7946/tcp, 7946/udp, 4789/udp.
                networking.firewall.enable = false;

                # ── Kernel modules ────────────────────────────────────
                # virtiofs: for the shared filesystem mount
                # vxlan: Docker Swarm overlay networking uses VXLAN
                # overlay: Docker overlay filesystem driver
                # br_netfilter: required for iptables to see bridged traffic
                boot.initrd.kernelModules = [ "virtiofs" ];
                boot.kernelModules = [
                  "virtiofs" "vxlan" "overlay" "br_netfilter"
                ];

                # ── Shared filesystem ─────────────────────────────────
                # This is where persistent data lives. All three nodes
                # mount the same filesystem. Your provider determines the
                # mechanism: virtiofs (Vultr), NFS, GlusterFS, etc.
                # Replace "your-mount-tag" with your actual mount tag/device.
                fileSystems."/mnt/vfs" = {
                  device = "your-mount-tag";
                  fsType = "virtiofs";
                  options = [ "rw" "relatime" ];
                };

                # Create directory structure via systemd-tmpfiles.
                # These directories are created on every boot if missing.
                # /opt/swarm uses group docker so that tmpfiles does not
                # undo the ownership set by the activation script below.
                systemd.tmpfiles.rules = [
                  "d /mnt/vfs 0755 root root -"
                  "d /mnt/vfs/docker-volumes 0755 root root -"
                  "d /opt/swarm 0750 root docker -"
                  "d /opt/swarm/stacks 0750 root docker -"
                ];

                # ── Activation script: copy stack files from repo ─────
                # When nixos-rebuild runs, this script copies stack YAML
                # files from the repo (which lives at /etc/nixos on the
                # node) into /opt/swarm/stacks with correct ownership.
                # The docker group ownership means the deploy user (who
                # is in the docker group) can read these files.
                system.activationScripts.swarmStacks.text = ''
                  set -euo pipefail
                  install -d -m 0750 -o root -g docker /opt/swarm
                  install -d -m 0750 -o root -g docker /opt/swarm/stacks
                  # Add one line per stack file:
                  install -m 0640 -o root -g docker \
                    ${./swarm/myapp-stack.yml} \
                    /opt/swarm/stacks/myapp-stack.yml
                '';

                # ── Docker ────────────────────────────────────────────
                virtualisation.docker = {
                  enable = true;
                  autoPrune = {
                    enable = true;
                    dates = "weekly";
                    flags = [ "--all" "--volumes" ];
                  };
                  daemon.settings = {
                    "log-driver" = "json-file";
                    "log-opts" = {
                      "max-size" = "10m";
                      "max-file" = "5";
                    };
                    # IMPORTANT: live-restore must be false for Swarm.
                    # live-restore only applies to standalone containers,
                    # and the Docker daemon refuses to enable swarm mode
                    # while it is turned on.
                    "live-restore" = false;
                  };
                };

                # ── Packages ──────────────────────────────────────────
                environment.systemPackages = with pkgs; [
                  git curl jq sops age
                ];

                # ── Journald ──────────────────────────────────────────
                services.journald.extraConfig = ''
                  Storage=persistent
                  SystemMaxUse=1G
                '';

                # ── sops-nix ──────────────────────────────────────────
                # Each node has an age key at /var/lib/sops-nix/key.txt
                # that can decrypt secrets.yaml. You generate this key
                # during node provisioning (Step 8).
                sops = {
                  defaultSopsFile = ./secrets.yaml;
                  defaultSopsFormat = "yaml";
                  age.keyFile = "/var/lib/sops-nix/key.txt";
                };
              })
          ];
        };
    in
    {
      nixosConfigurations = {
        "swarm-1" = mkHost { hostName = "swarm-1"; };
        "swarm-2" = mkHost { hostName = "swarm-2"; };
        "swarm-3" = mkHost { hostName = "swarm-3"; };
      };
    };
}

Per-host networking

Each node needs its own VPC interface configuration. Create hosts/swarm-1/networking.nix:

{ ... }:
{
  # Replace ens9 with your VPC interface name and the IP with
  # the VPC private IP assigned to this node by your provider.
  networking.interfaces.ens9.useDHCP = false;
  networking.interfaces.ens9.ipv4.addresses = [
    { address = "10.8.96.2"; prefixLength = 20; }
  ];
}

Import this in the flake by adding it to the imports list:

imports = [
  ./hosts/${hostName}/hardware-configuration.nix
  ./hosts/${hostName}/networking.nix
];

Do the same for swarm-2 and swarm-3 with their respective VPC IPs.
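
If you would rather not copy and edit the file by hand three times, a small generator along these lines would do. The interface name ens9 and the /20 prefix come from the example above. The helper name is made up, and the addresses used for swarm-2 and swarm-3 are hypothetical placeholders for whatever your provider assigns.

```shell
#!/usr/bin/env bash
# Print a networking.nix for one node with a static VPC address.
render_networking_nix() {
  local iface=$1 address=$2 prefix=$3
  cat <<EOF
{ ... }:
{
  networking.interfaces.${iface}.useDHCP = false;
  networking.interfaces.${iface}.ipv4.addresses = [
    { address = "${address}"; prefixLength = ${prefix}; }
  ];
}
EOF
}

# Preview the file for swarm-2 (hypothetical address).
render_networking_nix ens9 10.8.96.3 20
```

Redirect the output into place once it looks right, for example render_networking_nix ens9 10.8.96.3 20 > hosts/swarm-2/networking.nix, then repeat with the address of swarm-3.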

Generate the lock file

From the repo root on your workstation:

nix --extra-experimental-features "nix-command flakes" flake lock

This resolves and pins all flake inputs (nixpkgs version, sops-nix version) in flake.lock. This file is what makes builds reproducible: every node will use the exact same package versions. Commit it to the repo.


Step 3: sops-nix configuration

Secrets (API keys, database passwords, tokens) live in the repo encrypted with age. Each node has its own age key that can decrypt them. Your workstation also has a recipient so you can encrypt new secrets locally.

Create .sops.yaml

keys:
  # Your workstation admin recipient (from Step 0)
  - &admin_age age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  # Node recipients - placeholders until you provision each node
  - &swarm1_age PLACEHOLDER_SWARM1
  - &swarm2_age PLACEHOLDER_SWARM2
  - &swarm3_age PLACEHOLDER_SWARM3

creation_rules:
  - path_regex: secrets\.yaml$
    key_groups:
      - age:
          - *admin_age
          - *swarm1_age
          - *swarm2_age
          - *swarm3_age

The creation_rules block says: any file matching secrets.yaml must be encrypted to all four recipients. This means your workstation can decrypt (for editing) and each node can decrypt (for applying config).
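
Because the node recipients start out as placeholders, it is easy to try encrypting too early. A cheap guard, sketched here with an illustrative function name, is to refuse to continue while any PLACEHOLDER_ entry is still present in .sops.yaml:

```shell
#!/usr/bin/env bash
# Return 0 only when the given .sops.yaml has no PLACEHOLDER_ recipients left.
# Any offending lines are printed to stderr with their line numbers.
sops_recipients_ready() {
  local file=$1
  if grep -n 'PLACEHOLDER_' "$file" >&2; then
    echo "replace the placeholder recipients listed above before encrypting" >&2
    return 1
  fi
  return 0
}
```

You could call sops_recipients_ready .sops.yaml right before encrypting secrets.yaml. The match is case-sensitive, so the lowercase word "placeholders" in the comment line does not trigger it.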

Create secrets.yaml (plaintext placeholder)

swarm:
  placeholder: "replace-me"

You will encrypt this after all three nodes are provisioned and their age recipients are filled in.


Step 4: GitHub Actions workflow

This is the CI/CD pipeline. On every push to main, it SSHes into each node, pulls the repo, runs nixos-rebuild switch, then auto-discovers and deploys all Swarm stacks.

Create .github/workflows/deploy.yml:

name: deploy

on:
  push:
    branches: ["main"]

# Only one deploy can run at a time. If you push twice quickly,
# the second push cancels the first. This prevents two rebuilds
# from racing on the same node.
concurrency:
  group: deploy-swarm
  cancel-in-progress: true

jobs:
  deploy:
    runs-on: ubuntu-latest

    env:
      SSH_OPTS: >-
        -o IdentitiesOnly=yes -o BatchMode=yes -o StrictHostKeyChecking=yes
      SCP_OPTS: -o StrictHostKeyChecking=yes
      REMOTE_STACK_DIR: /tmp/swarm-stacks

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # Decode the base64-encoded CI deploy private key from GitHub secrets,
      # write it to disk, and load it into ssh-agent.
      - name: Setup SSH key
        env:
          DEPLOY_KEY_B64: ${{ secrets.CI_DEPLOY_PRIVATE_KEY_B64 }}
        run: |
          set -euo pipefail
          mkdir -p "$HOME/.ssh"
          chmod 700 "$HOME/.ssh"

          SSH_KEY_PATH="$HOME/.ssh/swarm_ci_deploy_ed25519"
          printf '%s' "$DEPLOY_KEY_B64" | tr -d '\r' | base64 -d > "$SSH_KEY_PATH"
          chmod 600 "$SSH_KEY_PATH"

          # Validate the key without leaking material (prints fingerprint only)
          ssh-keygen -lf "$SSH_KEY_PATH"

          eval "$(ssh-agent -s)"
          ssh-add "$SSH_KEY_PATH"
          echo "SSH_KEY_PATH=$SSH_KEY_PATH" >> "$GITHUB_ENV"

      # Pinned known_hosts prevents MITM. These are the SSH host public
      # keys (known_hosts lines) you collected from each node (Step 10).
      - name: Add pinned known_hosts
        env:
          SWARM_KNOWN_HOSTS: ${{ secrets.SWARM_KNOWN_HOSTS }}
        run: |
          set -euo pipefail
          mkdir -p "$HOME/.ssh"
          chmod 700 "$HOME/.ssh"
          printf '%s\n' "$SWARM_KNOWN_HOSTS" >> "$HOME/.ssh/known_hosts"
          chmod 600 "$HOME/.ssh/known_hosts"

      # ── NixOS rebuild on each node ──────────────────────────────
      # The pattern is identical for each node:
      # 1. SSH in as the deploy user
      # 2. cd to /etc/nixos (which is the repo checkout)
      # 3. git fetch + reset to origin/main (sudo because root owns /etc/nixos)
      # 4. nixos-rebuild switch with the node's flake target
      - name: Deploy NixOS swarm-1
        env:
          HOST: ${{ secrets.SWARM1_HOST }}
          USER: ${{ secrets.SWARM_USER }}
          TARGET: ${{ secrets.SWARM1_FLAKE_TARGET }}
        run: |
          set -euo pipefail
          ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
            set -euo pipefail
            cd /etc/nixos
            sudo -n /run/current-system/sw/bin/git fetch --prune origin
            sudo -n /run/current-system/sw/bin/git checkout -f main
            sudo -n /run/current-system/sw/bin/git reset --hard origin/main
            sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
              --flake \"path:/etc/nixos#${TARGET}\"
          '"

      - name: Deploy NixOS swarm-2
        env:
          HOST: ${{ secrets.SWARM2_HOST }}
          USER: ${{ secrets.SWARM_USER }}
          TARGET: ${{ secrets.SWARM2_FLAKE_TARGET }}
        run: |
          set -euo pipefail
          ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
            set -euo pipefail
            cd /etc/nixos
            sudo -n /run/current-system/sw/bin/git fetch --prune origin
            sudo -n /run/current-system/sw/bin/git checkout -f main
            sudo -n /run/current-system/sw/bin/git reset --hard origin/main
            sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
              --flake \"path:/etc/nixos#${TARGET}\"
          '"

      - name: Deploy NixOS swarm-3
        env:
          HOST: ${{ secrets.SWARM3_HOST }}
          USER: ${{ secrets.SWARM_USER }}
          TARGET: ${{ secrets.SWARM3_FLAKE_TARGET }}
        run: |
          set -euo pipefail
          ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
            set -euo pipefail
            cd /etc/nixos
            sudo -n /run/current-system/sw/bin/git fetch --prune origin
            sudo -n /run/current-system/sw/bin/git checkout -f main
            sudo -n /run/current-system/sw/bin/git reset --hard origin/main
            sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
              --flake \"path:/etc/nixos#${TARGET}\"
          '"

      # ── Auto-discover and deploy all Swarm stacks ───────────────
      # This finds every *-stack.yml in the swarm/ directory, copies
      # them to the manager node, and deploys each one.
      - name: Deploy Swarm stacks (auto-discover)
        env:
          HOST: ${{ secrets.SWARM1_HOST }}
          USER: ${{ secrets.SWARM_USER }}
          # Add your stack-specific secrets here as env vars.
          # Each one becomes available for variable substitution
          # in stack YAML files (e.g. ${NEWT_ID} in the compose).
          NEWT_ID: ${{ secrets.NEWT_ID }}
          NEWT_SECRET: ${{ secrets.NEWT_SECRET }}
          # Add more as needed for other stacks:
          # DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
        run: |
          set -euo pipefail

          # Find all stack files in the swarm/ directory
          mapfile -t STACK_FILES < <(
            find swarm -maxdepth 1 -type f \
              \( -name '*-stack.yml' -o -name '*-stack.yaml' \) | sort
          )
          if [ "${#STACK_FILES[@]}" -eq 0 ]; then
            echo "No stack files found under ./swarm/"
            exit 1
          fi

          # Clean the remote staging directory
          ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
            set -euo pipefail
            mkdir -p \"${REMOTE_STACK_DIR}\"
            rm -f \"${REMOTE_STACK_DIR}\"/*-stack.y*ml
            rm -f \"${REMOTE_STACK_DIR}/deploy.env\"
          '"

          # Copy each stack file to the manager node
          for f in "${STACK_FILES[@]}"; do
            base="$(basename "$f")"
            scp -i "$SSH_KEY_PATH" $SCP_OPTS \
              "$f" "${USER}@${HOST}:${REMOTE_STACK_DIR}/${base}"
          done

          # Build an env file with all secrets. This file is copied
          # to the node, sourced before deploy, then deleted.
          ENV_LOCAL="$(mktemp)"
          chmod 600 "$ENV_LOCAL"
          {
            printf 'export NEWT_ID=%q\n' "$NEWT_ID"
            printf 'export NEWT_SECRET=%q\n' "$NEWT_SECRET"
            # Add more secrets here matching your env vars above:
            # printf 'export DB_PASSWORD=%q\n' "$DB_PASSWORD"
          } > "$ENV_LOCAL"

          scp -i "$SSH_KEY_PATH" $SCP_OPTS \
            "$ENV_LOCAL" "${USER}@${HOST}:${REMOTE_STACK_DIR}/deploy.env"
          rm -f "$ENV_LOCAL"

          # Deploy all stacks on the manager node
          ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
            set -euo pipefail

            chmod 600 \"${REMOTE_STACK_DIR}/deploy.env\"
            source \"${REMOTE_STACK_DIR}/deploy.env\"

            # Create the overlay network if it does not exist.
            # All stacks that need to talk to each other (or to
            # the tunnel agent) should attach to this network.
            docker network inspect pangolin >/dev/null 2>&1 \
              || docker network create -d overlay --attachable pangolin

            # Deploy each stack file. The stack name is derived
            # from the filename: myapp-stack.yml becomes myapp.
            for f in \"${REMOTE_STACK_DIR}\"/*-stack.yml \
                     \"${REMOTE_STACK_DIR}\"/*-stack.yaml; do
              [ -e \"\$f\" ] || continue
              base=\"\$(basename \"\$f\")\"
              stack=\"\${base%-stack.yml}\"
              stack=\"\${stack%-stack.yaml}\"
              docker stack deploy -c \"\$f\" \"\$stack\"
            done

            # Clean up the secrets file
            rm -f \"${REMOTE_STACK_DIR}/deploy.env\"

            docker stack ls
          '"

This auto-discover pattern means adding a new service to the cluster is:

  1. Create swarm/my-new-service-stack.yml
  2. If it needs secrets, add them to GitHub secrets and to the deploy.env block in the workflow
  3. Push to main

That is it. The pipeline finds the new file and deploys it.
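
The naming rule the deploy loop applies can be tried locally before you commit a new file. This sketch repeats the same suffix stripping as the workflow; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Derive the Swarm stack name from a stack file path by dropping the
# directory and the -stack.yml or -stack.yaml suffix.
stack_name_from_file() {
  local base
  base=$(basename "$1")
  base=${base%-stack.yml}
  base=${base%-stack.yaml}
  echo "$base"
}

stack_name_from_file swarm/my-new-service-stack.yml
```

Docker prefixes every service with the stack name, so a service called web in that file would show up as my-new-service_web. A file that does not end in -stack.yml or -stack.yaml is never picked up by the find pattern in the workflow, so the suffix matters.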


Step 5: Commit and push

cd my-swarm-cluster
git init
git add -A
git commit -m "Initial NixOS Swarm cluster skeleton"
git branch -M main
git remote add origin git@github.com:your-org/my-swarm-cluster.git
git push -u origin main

The GitHub Actions workflow will fail at this point because the secrets are not configured yet and the nodes do not exist. That is expected.


Step 6: Provision the VPS nodes

For each of the three nodes, follow this sequence. Do node 1 first (it has extra steps), then repeat for nodes 2 and 3.

Create the VPS

Pick your provider (Vultr, Hetzner, DigitalOcean, etc). Requirements:

  • All three nodes in the same region
  • All three attached to the same VPC / private network
  • Shared storage attached to all three (virtiofs, NFS volume, etc)
  • NixOS ISO available (some providers offer it directly, others require you to upload the ISO)

Record for each node: the public IP and the VPC private IP.

Provider firewall rules

Set up provider-level firewall rules (this is why networking.firewall.enable is set to false in the flake):

Inbound from your workstation IP:

  • TCP 22 (SSH)

Inbound between the 3 swarm node private IPs only (VPC internal):

  • TCP 2377 (Swarm management)
  • TCP 7946 (Swarm gossip)
  • UDP 7946 (Swarm gossip)
  • UDP 4789 (VXLAN overlay traffic)

Inbound from your reverse proxy (if it has a VPC IP):

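  • Allow only the service ports the proxy needs, scoped to the proxy's IP. If you use Pangolin with a Newt tunnel, you do not need this rule: Newt dials out to Pangolin, and traffic then arrives over the overlay network, not through an inbound connection to the node's IP.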

Install NixOS from ISO

  1. Boot the VPS from the NixOS graphical ISO
  2. Install to disk with your preferred region/keyboard settings
  3. Create a temporary local user during install (e.g. local). This user is throwaway; it gets replaced when you apply the flake.
  4. Shut down, detach the ISO, and boot from disk

Bootstrap SSH access

Log into the VPS console (provider web console) as the temporary user. Edit /etc/nixos/configuration.nix and add these blocks inside the top-level { ... }:

services.openssh = {
  enable = true;
  openFirewall = false;
  settings = {
    PasswordAuthentication = false;
    KbdInteractiveAuthentication = false;
    PermitRootLogin = "no";
  };
};

networking.firewall.enable = false;

Also add git, curl, jq, and age to environment.systemPackages. Set the hostname. Apply:

sudo nixos-rebuild switch

Add your admin SSH key

On the VPS console as the temporary user:

mkdir -p ~/.ssh && chmod 700 ~/.ssh
nano ~/.ssh/authorized_keys
# Paste your admin public key (from swarm_admin_ed25519.pub) on a single line
chmod 600 ~/.ssh/authorized_keys

Test from your workstation:

ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes local@NODE_PUBLIC_IP 'whoami'

If it prints local, SSH is working. You can now close the VPS console and work over SSH from here.


Step 7: Set up GitHub repo access on the node

The GitHub Actions workflow runs git fetch and git reset under sudo on the node, so root needs read access to the repo. Each node gets its own read-only deploy key.

SSH in as the temporary user, then become root:

ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes local@NODE_PUBLIC_IP
sudo -i

Generate a deploy key:

mkdir -p /root/.ssh && chmod 700 /root/.ssh
ssh-keygen -t ed25519 -a 64 -f /root/.ssh/repo_deploy_ed25519 -N ""
cat /root/.ssh/repo_deploy_ed25519.pub

In your GitHub repo settings, go to Settings > Deploy keys > Add deploy key:

  • Title: swarm-1-readonly (use a distinct name per node)
  • Paste the public key
  • Leave "Allow write access" off

Configure root's SSH config for GitHub:

cat >/root/.ssh/config <<'EOF'
Host github.com
  HostName github.com
  User git
  IdentityFile /root/.ssh/repo_deploy_ed25519
  IdentitiesOnly yes
EOF
chmod 600 /root/.ssh/config

Test:

ssh -T git@github.com || true

You should see a message like "Hi your-org/my-swarm-cluster! You've successfully authenticated, but GitHub does not provide shell access."


Step 8: Replace /etc/nixos with the repo

Still as root on the node:

Back up the auto-generated hardware config

cp -a /etc/nixos/hardware-configuration.nix /root/hardware-configuration.nix

Copy the contents of this file to your workstation, commit it into the repo at hosts/swarm-1/hardware-configuration.nix, and push, so that the clone in the next step includes it. This file is unique to each machine: it describes the disks, PCI devices, and kernel modules that the specific hardware needs.

Clone the repo into /etc/nixos

rm -rf /etc/nixos
git clone git@github.com:your-org/my-swarm-cluster.git /etc/nixos

From this point, /etc/nixos is a live checkout of your repo. Every nixos-rebuild switch reads its config from this checkout.


Step 9: Generate the node's age key and apply the flake

Generate the age key for sops-nix

sudo -i
mkdir -p /var/lib/sops-nix && chmod 700 /var/lib/sops-nix
age-keygen -o /var/lib/sops-nix/key.txt
chmod 600 /var/lib/sops-nix/key.txt

The command prints the node's public key (it starts with age1...), which is the recipient that sops encrypts to. Copy it. On your workstation, replace the PLACEHOLDER_SWARM1 entry in .sops.yaml with this value. Commit and push the change.
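
If the console output has scrolled away, the recipient can be read back from the key file, because age-keygen records it there as a comment line. A minimal sketch, with a hypothetical function name:

```shell
# Print the age public recipient recorded in an identity file.
# age-keygen writes it as a "# public key: age1..." comment line.
age_recipient_from_keyfile() {
  sed -n 's/^# public key: //p' "$1"
}
```

On the node this would be called as age_recipient_from_keyfile /var/lib/sops-nix/key.txt. Running age-keygen -y on the same file should give the same value by deriving it from the secret key itself, which is useful if the comment line was ever stripped.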

Apply the flake

cd /etc/nixos
git pull  # pick up the .sops.yaml change you just pushed
nixos-rebuild switch --flake "path:/etc/nixos#swarm-1"

This is the moment the node transitions from its temporary install config to the full production config defined in your flake. When it completes:

  • The admin and deploy users exist
  • The temporary local user from install no longer exists (because users.mutableUsers = false and it is not defined in the flake)
  • Docker is running
  • SSH is hardened
  • fail2ban is active
  • The shared filesystem is mounted

Verify

# Shared filesystem mounted
mount | grep /mnt/vfs
df -h /mnt/vfs

# Docker running
systemctl status docker --no-pager
docker info | head

# Admin user works (from workstation)
ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@NODE_PUBLIC_IP 'whoami'

The last command should print admin. If it does, you can stop using the temporary user entirely. It does not exist anymore.


Step 10: Repeat for nodes 2 and 3

Follow Steps 6 through 9 for each remaining node. Remember:

  • Each node gets its own hardware-configuration.nix in the repo
  • Each node gets its own deploy key in GitHub (with a distinct title)
  • Each node gets its own age key, and the recipient goes into .sops.yaml

After all three nodes are provisioned and all age recipients are in .sops.yaml, encrypt the secrets file on your workstation:

nix --extra-experimental-features "nix-command flakes" shell nixpkgs#sops nixpkgs#age -c \
  sops -e -i secrets.yaml
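
Before committing, it is worth confirming that the file on disk really is encrypted, since a failed sops run leaves the plaintext in place. The check below is a rough heuristic and not a sops feature: it assumes the usual sops YAML layout, with a top-level sops: metadata block and ENC[...] values. The function name is invented for this example.

```shell
# Succeed only if the file looks sops-encrypted: it has the top-level
# "sops:" metadata block and at least one ENC[...] value.
is_sops_encrypted() {
  grep -q '^sops:' "$1" && grep -q 'ENC\[' "$1"
}
```

A guard such as is_sops_encrypted secrets.yaml || echo "secrets.yaml is NOT encrypted" >&2 fits naturally into a pre-commit hook.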

Commit and push:

git add -A
git commit -m "Add all sops recipients and encrypt secrets"
git push

Step 11: Pin SSH host keys for GitHub Actions

You need to tell GitHub Actions what SSH host keys to expect from each node. This prevents man-in-the-middle attacks during deployment.

On each node, get the pinned host key line:

sudo -i
PUB_IP="$(curl -fsS https://api.ipify.org)"
printf "%s %s\n" "$PUB_IP" "$(cat /etc/ssh/ssh_host_ed25519_key.pub)"

Copy the output from each node. Combine all three lines into one text blob (three lines total, one per node). This becomes the SWARM_KNOWN_HOSTS GitHub secret.
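
One detail to watch: SSH looks up known_hosts entries by the name it was asked to connect to. If a SWARMn_HOST secret holds a DNS name instead of an IP, the pinned line needs to start with that same name. The helper below is a sketch (the function name is hypothetical) that builds a line for any host string and drops the trailing comment field from the .pub file.

```shell
# Build one known_hosts line: "<host> <keytype> <base64-key>".
# The trailing comment field of the .pub file is dropped.
known_hosts_line() (
  host="$1"
  read -r keytype keydata _comment < "$2" || [ -n "$keydata" ] || exit 1
  printf '%s %s %s\n' "$host" "$keytype" "$keydata"
)
```

For example, known_hosts_line swarm-1.example.com /etc/ssh/ssh_host_ed25519_key.pub, where swarm-1.example.com stands in for whatever you put in SWARM1_HOST.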


Step 12: Add GitHub Actions secrets

In your repo settings, go to Settings > Secrets and variables > Actions > New repository secret. Add:

Secret                       Value
CI_DEPLOY_PRIVATE_KEY_B64    Base64-encoded CI deploy private key (base64 -w0 ~/.ssh/swarm_ci_deploy_ed25519)
SWARM_KNOWN_HOSTS            The three pinned host key lines from Step 11
SWARM_USER                   deploy
SWARM1_HOST                  Node 1 public IP or DNS
SWARM2_HOST                  Node 2 public IP or DNS
SWARM3_HOST                  Node 3 public IP or DNS
SWARM1_FLAKE_TARGET          swarm-1
SWARM2_FLAKE_TARGET          swarm-2
SWARM3_FLAKE_TARGET          swarm-3
NEWT_ID                      From the Pangolin dashboard (see "Connecting the Swarm cluster to Pangolin" in Part 2)
NEWT_SECRET                  From the Pangolin dashboard (see "Connecting the Swarm cluster to Pangolin" in Part 2)
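
A corrupted key secret is a common cause of confusing SSH failures in CI, so it can help to check the encoding before pasting it into GitHub. The -w0 flag is a GNU coreutils option and may not exist on other platforms. The sketch below uses a portable alternative and decodes the result the same way the Pangolin workflow's "Setup SSH key" step does (the Swarm workflow is assumed to do the same). It assumes a base64 that accepts -d, and both function names are hypothetical.

```shell
# Encode a file as single-line base64 (portable alternative to the
# GNU-only "base64 -w0").
encode_key_b64() {
  base64 < "$1" | tr -d '\n'
}

# Decode the way the workflow does and confirm the bytes match the
# original file before pasting the value into GitHub.
b64_roundtrip_ok() (
  encode_key_b64 "$1" | tr -d '\r' | base64 -d | cmp -s - "$1"
)
```

If b64_roundtrip_ok ~/.ssh/swarm_ci_deploy_ed25519 succeeds, the output of encode_key_b64 on the same file is safe to use as the secret value.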

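Push any commit to main and watch the Actions run. All three nodes should rebuild successfully. The stack deploy step creates an overlay network, which needs an initialised Swarm, so expect that step to fail until you have completed Step 13.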


Step 13: Confirm VPC networking and initialise Docker Swarm

Before forming the Swarm cluster, confirm the VPC networking is correct.

Verify on each node

# Check the VPC interface has the right IP
ip -br a | grep -E 'ens3|ens9'

# Confirm routing uses the VPC interface
ip route get 10.8.96.1

You want to see ens9 (or your VPC interface name) with a 10.x.x.x/20 address, and the route going via that interface.

Confirm cross-node connectivity

From node 1, ping node 2 and node 3 using their VPC private IPs:

ping -c 2 10.8.96.X   # node 2
ping -c 2 10.8.96.Y   # node 3

If pings fail, check your provider's VPC configuration and firewall rules.

Initialise Swarm on node 1

SSH in as admin:

ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@SWARM1_PUBLIC_IP

Init Swarm using the VPC private IP (not the public IP). Swarm's inter-node traffic must go over the VPC:

docker swarm init --advertise-addr SWARM1_VPC_PRIVATE_IP

Get the manager join token

docker swarm join-token manager

Copy the full docker swarm join command from the output.

Join node 2

ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@SWARM2_PUBLIC_IP

docker swarm join \
  --token TOKEN_FROM_OUTPUT \
  --advertise-addr SWARM2_VPC_PRIVATE_IP \
  SWARM1_VPC_PRIVATE_IP:2377

The --advertise-addr flag tells this node to advertise its own VPC IP to the cluster.

Join node 3

Same as node 2 but with node 3's VPC IP:

ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@SWARM3_PUBLIC_IP

docker swarm join \
  --token TOKEN_FROM_OUTPUT \
  --advertise-addr SWARM3_VPC_PRIVATE_IP \
  SWARM1_VPC_PRIVATE_IP:2377

Validate

Back on node 1:

docker node ls

You should see three nodes, all Status: Ready, with Manager Status showing one Leader and two Reachable.
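
The same check can be scripted, which is handy for monitoring later on. The sketch below parses the output of docker node ls with a custom format string. The function names are invented for this example, and the quorum arithmetic is the standard Raft majority rule.

```shell
# Count managers that are Ready and either Leader or Reachable.
# Reads "hostname status manager-status" lines on stdin.
healthy_managers() {
  awk '$2 == "Ready" && ($3 == "Leader" || $3 == "Reachable") { n++ } END { print n + 0 }'
}

# A Raft cluster of N managers needs floor(N/2)+1 of them for quorum.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}
```

On a manager, pipe docker node ls --format '{{.Hostname}} {{.Status}} {{.ManagerStatus}}' into healthy_managers. With three managers, quorum_size 3 gives 2, so the cluster keeps working with one manager down but not with two.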


Step 14: Deploy your first Swarm stack

Create a stack file in the repo. Here is a minimal example:

# swarm/myapp-stack.yml
version: "3.9"

services:
  myapp:
    image: myapp/myapp:latest
    networks:
      pangolin:
        aliases:
          - myapp
    deploy:
      replicas: 2
      restart_policy:
        condition: any
      update_config:
        parallelism: 1
        order: start-first

networks:
  pangolin:
    external: true

Key things to note:

  • No published ports. The service does not expose any host ports. Traffic reaches it via the overlay network through the Pangolin tunnel. This is the pattern for all services behind the proxy.
  • networks: pangolin: external: true means this stack attaches to the pangolin overlay network that the deploy pipeline creates. All stacks that need to communicate (with each other or with the tunnel agent) should use this network.
  • order: start-first means during a rolling update, Swarm starts the new container before stopping the old one. This gives you zero-downtime deployments.
  • replicas: 2 means two copies run across the cluster. By default, Swarm's scheduler spreads them across different nodes for redundancy.

Push to main. The pipeline auto-discovers the stack file, copies it to the manager, and runs docker stack deploy. Verify:

docker stack ls
docker service ls
docker service ps myapp_myapp
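
To spot services that have not reached their desired replica count without reading the table by eye, the Replicas column can be filtered. This is a small sketch with a hypothetical function name. It expects "name running/desired" lines on stdin.

```shell
# Print services whose running replica count differs from the desired
# count. Reads "name running/desired" lines on stdin.
unconverged_services() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}
```

Feed it with docker service ls --format '{{.Name}} {{.Replicas}}'. Empty output means every service is fully up; any name it prints is a good candidate for docker service ps.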

Using shared storage with stacks

For services that need persistent data (databases, config files, uploads), use the shared filesystem mount. Because /mnt/vfs is mounted on all three nodes, a service can run on any node and access the same data.

First, create the volume directories. You only need to do this once, on any node (the filesystem is shared):

sudo -i
install -d -m 0750 -o root -g docker /mnt/vfs/docker-volumes/myapp
install -d -m 0750 -o root -g docker /mnt/vfs/docker-volumes/myapp/data
install -d -m 0750 -o root -g docker /mnt/vfs/docker-volumes/myapp/config

Then reference them as bind mounts in the stack:

services:
  myapp:
    image: myapp/myapp:latest
    volumes:
      - type: bind
        source: /mnt/vfs/docker-volumes/myapp/data
        target: /app/data
      - type: bind
        source: /mnt/vfs/docker-volumes/myapp/config
        target: /app/config
    networks:
      pangolin:
        aliases:
          - myapp
    deploy:
      replicas: 1
      restart_policy:
        condition: any

If you are migrating an existing service, copy its data into the bind mount directory before deploying the stack.


Deploying in-house apps from a private container registry

For custom applications that you build from source, the pipeline can build a Docker image, push it to a private registry, and deploy the stack with the new image tag.

Add the app source to the repo

apps/
  my-internal-app/
    Dockerfile
    app.py
    requirements.txt

Create the stack file with an image tag variable

# swarm/my-internal-app-stack.yml
version: "3.9"

services:
  my-internal-app:
    image: registry.example.com/my-org/my-internal-app:${APP_TAG}
    command: ["python", "-u", "app.py"]
    networks:
      pangolin:
        aliases:
          - my-internal-app
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      update_config:
        order: start-first

networks:
  pangolin:
    external: true

The ${APP_TAG} is set to the Git commit SHA by the pipeline, so every push produces a unique, traceable image tag.
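
An unset variable in a stack file typically interpolates to an empty string, not a hard error, which here would yield an image reference with no tag. A pre-deploy check can catch that early. The sketch below only recognises the plain ${NAME} form (not $NAME or ${NAME:-default}), and the function name is made up for this example.

```shell
# Print every ${NAME} referenced in a stack file that is not set in
# the current shell. Only the plain ${NAME} form is recognised.
missing_stack_vars() (
  grep -o '\${[A-Za-z_][A-Za-z0-9_]*}' "$1" | sort -u | tr -d '${}' |
  while IFS= read -r name; do
    # ${NAME+x} expands to "x" only when NAME is set.
    eval "[ -n \"\${$name+x}\" ]" || printf '%s\n' "$name"
  done
)
```

Calling missing_stack_vars swarm/my-internal-app-stack.yml prints APP_TAG when the variable has not been set, and prints nothing once it has.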

Add GitHub secrets

Add REGISTRY_USERNAME and REGISTRY_API_KEY to your repo secrets.

Add build steps to the workflow

Insert these steps before the "Deploy Swarm stacks" step:

      - name: Set image tag
        run: echo "APP_TAG=${GITHUB_SHA}" >> "$GITHUB_ENV"

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to private registry
        uses: docker/login-action@v3
        with:
          registry: registry.example.com
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_API_KEY }}

      - name: Build and push my-internal-app
        uses: docker/build-push-action@v6
        with:
          context: ./apps/my-internal-app
          push: true
          tags: |
            registry.example.com/my-org/my-internal-app:${{ env.APP_TAG }}

Then in the stack deploy step, add registry login on the Swarm manager so it can pull the private image. The key detail is using a temporary Docker config directory and --with-registry-auth:

      # Inside the "Deploy Swarm stacks" SSH block, before the deploy loop:
      export DOCKER_CONFIG="$(mktemp -d)"
      echo "$REGISTRY_API_KEY" | docker login registry.example.com \
        -u "$REGISTRY_USERNAME" --password-stdin

      # In the deploy loop, use --with-registry-auth:
      docker stack deploy --with-registry-auth -c "$f" "$stack"

      # After the loop, clean up:
      docker logout registry.example.com || true
      rm -rf "$DOCKER_CONFIG"
      unset DOCKER_CONFIG

This keeps registry credentials ephemeral. They exist in a temp directory for the duration of the deploy, then get deleted. The --with-registry-auth flag tells Swarm to propagate the credentials to all nodes so they can pull the image.
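
One caveat with the inline version: when the script runs under set -e, a failed deploy exits before the cleanup lines are reached, so the temp directory and its credentials stay behind. A wrapper that always cleans up is one way around this. The sketch below is an assumption about how you might structure it, not part of the original workflow, and it does not handle the shell being killed by a signal.

```shell
# Run a command with a throwaway DOCKER_CONFIG directory and remove
# the directory afterwards, whether the command succeeded or not.
with_temp_docker_config() (
  DOCKER_CONFIG="$(mktemp -d)" || exit 1
  export DOCKER_CONFIG
  # Capture the status without tripping "set -e" in the caller.
  "$@" && status=0 || status=$?
  rm -rf "$DOCKER_CONFIG"
  exit "$status"
)
```

You would put the docker login and the deploy loop into a shell function of your own and call it as with_temp_docker_config your_deploy_function. The wrapper returns the command's exit status, so the CI step still fails when the deploy does.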


Part 2: The Pangolin Reverse Proxy VPS

The Pangolin VPS is a separate machine that handles inbound HTTPS traffic and routes it into the Swarm cluster through a WireGuard tunnel. It uses the exact same GitOps pattern: NixOS + flake + sops-nix + GitHub Actions, in its own repo.

If you have already read my Pangolin post, you know how Pangolin works at a high level. This section covers deploying it on NixOS with full GitOps, which is different from the quick-install Docker Compose approach in that post.


Repo layout

.
├── flake.nix
├── flake.lock
├── hardware-configuration.nix
├── secrets.yaml
├── .sops.yaml
├── pangolin/
│   ├── docker-compose.yml
│   └── config/
│       ├── config.yml
│       └── traefik/
│           ├── traefik_config.yml
│           └── dynamic_config.yml
└── .github/
    └── workflows/
        └── deploy.yml

The Pangolin flake

This flake differs from the Swarm flake in a few important ways:

  • It runs a single host, not three
  • The NixOS firewall is enabled (this VPS faces the internet directly, unlike the Swarm nodes which use provider-level rules)
  • It includes kernel sysctl hardening (IP forwarding for Docker, plus security tunables)
  • It defines a systemd service that runs docker-compose up -d to start Pangolin
  • It uses sops-nix to inject the Pangolin SERVER_SECRET as an environment variable

Create flake.nix:

{
  description = "Pangolin reverse proxy on hardened NixOS with GitOps deploy";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs";
  };

  outputs = { self, nixpkgs, sops-nix }:
    let
      system = "x86_64-linux";

      mkHost = { hostName }:
        nixpkgs.lib.nixosSystem {
          inherit system;

          modules = [
            sops-nix.nixosModules.sops

            ({ config, pkgs, lib, ... }:
              {
                imports = [
                  ./hardware-configuration.nix
                ];

                system.stateVersion = "25.11";
                networking.hostName = hostName;
                time.timeZone = "Europe/London";

                nix.settings = {
                  experimental-features = [ "nix-command" "flakes" ];
                  warn-dirty = false;
                };

                boot.loader.grub.enable = true;
                boot.loader.grub.devices = [ "/dev/vda" ];
                boot.loader.grub.useOSProber = false;

                # ── Kernel hardening ──────────────────────────────────
                # ip_forward is required for Docker networking.
                # The rest disables ICMP redirects, source routing,
                # enables SYN cookies, and turns on reverse path filtering.
                boot.kernel.sysctl = {
                  "net.ipv4.ip_forward" = 1;
                  "net.ipv4.conf.all.accept_redirects" = 0;
                  "net.ipv4.conf.default.accept_redirects" = 0;
                  "net.ipv4.conf.all.send_redirects" = 0;
                  "net.ipv4.conf.default.send_redirects" = 0;
                  "net.ipv4.conf.all.accept_source_route" = 0;
                  "net.ipv4.conf.default.accept_source_route" = 0;
                  "net.ipv4.tcp_syncookies" = 1;
                  "net.ipv4.conf.all.rp_filter" = 1;
                  "net.ipv4.conf.default.rp_filter" = 1;
                };

                # ── Users ─────────────────────────────────────────────
                users.mutableUsers = false;
                security.sudo.wheelNeedsPassword = true;

                users.users.admin = {
                  isNormalUser = true;
                  description = "Admin";
                  extraGroups = [ "wheel" "docker" ];
                  openssh.authorizedKeys.keys = [
                    "ssh-ed25519 AAAA... your-admin-pubkey"
                  ];
                  hashedPassword = "$6$...your-hash-here";
                };

                users.users.deploy = {
                  isNormalUser = true;
                  description = "CI Deploy User";
                  extraGroups = [ "wheel" "docker" ];
                  openssh.authorizedKeys.keys = [
                    "ssh-ed25519 AAAA... your-ci-deploy-pubkey"
                  ];
                };

                security.sudo.extraRules = [
                  {
                    users = [ "deploy" ];
                    commands = [
                      { command = "/run/current-system/sw/bin/nixos-rebuild";
                        options = [ "NOPASSWD" ]; }
                      { command = "/run/current-system/sw/bin/git";
                        options = [ "NOPASSWD" ]; }
                      { command = "/run/current-system/sw/bin/true";
                        options = [ "NOPASSWD" ]; }
                      { command = "/run/current-system/sw/bin/systemctl";
                        options = [ "NOPASSWD" ]; }
                    ];
                  }
                ];

                # ── SSH hardening ─────────────────────────────────────
                services.openssh = {
                  enable = true;
                  # openFirewall = true here because the NixOS firewall
                  # is enabled on this VPS (unlike the Swarm nodes).
                  openFirewall = true;
                  settings = {
                    PasswordAuthentication = false;
                    KbdInteractiveAuthentication = false;
                    PermitRootLogin = "no";
                    X11Forwarding = false;
                    AllowTcpForwarding = "no";
                    AllowAgentForwarding = "no";
                    ClientAliveInterval = 300;
                    ClientAliveCountMax = 2;
                    MaxAuthTries = 3;
                    LogLevel = "VERBOSE";
                  };
                };

                services.fail2ban.enable = true;

                # ── Firewall ──────────────────────────────────────────
                # Unlike the Swarm nodes, the Pangolin VPS uses the
                # NixOS firewall because it faces the internet directly.
                networking.firewall = {
                  enable = true;
                  allowedTCPPorts = [
                    22    # SSH
                    80    # HTTP (Let's Encrypt ACME + redirect)
                    443   # HTTPS (Traefik)
                  ];
                  allowedUDPPorts = [
                    51820 # Gerbil: site tunnels (Newt connections)
                    21820 # Gerbil: client tunnels (optional)
                  ];
                  allowPing = false;
                };

                # ── Docker ────────────────────────────────────────────
                virtualisation.docker = {
                  enable = true;
                  autoPrune = {
                    enable = true;
                    dates = "weekly";
                    flags = [ "--all" "--volumes" ];
                  };
                  daemon.settings = {
                    "log-driver" = "json-file";
                    "log-opts" = {
                      "max-size" = "10m";
                      "max-file" = "5";
                    };
                    # live-restore is fine here (no Swarm on this node)
                    "live-restore" = true;
                  };
                };

                environment.systemPackages = with pkgs; [
                  git curl jq openssl sops age docker-compose
                ];

                services.journald.extraConfig = ''
                  Storage=persistent
                  SystemMaxUse=1G
                '';

                # ── sops-nix ──────────────────────────────────────────
                sops = {
                  defaultSopsFile = ./secrets.yaml;
                  defaultSopsFormat = "yaml";
                  age.keyFile = "/var/lib/sops-nix/key.txt";
                  # Decrypt the pangolin server_secret from secrets.yaml
                  secrets."pangolin/server_secret" = { };
                };

                # Create an env file from the decrypted secret.
                # Pangolin reads SERVER_SECRET from the environment.
                sops.templates."pangolin.env" = {
                  content = ''
                    SERVER_SECRET=${config.sops.placeholder."pangolin/server_secret"}
                  '';
                  owner = "root";
                  group = "root";
                  mode = "0400";
                };

                # ── Pangolin directories ──────────────────────────────
                systemd.tmpfiles.rules = [
                  "d /var/lib/pangolin 0750 root root -"
                  "d /var/lib/pangolin/db 0750 root root -"
                  "d /var/lib/pangolin/letsencrypt 0750 root root -"
                  "d /var/lib/pangolin/logs 0750 root root -"
                  "d /var/lib/pangolin/traefik 0750 root root -"
                  "d /opt/pangolin 0750 root root -"
                ];

                # Copy compose and config files from the repo into
                # their runtime locations on every nixos-rebuild.
                system.activationScripts.pangolinFiles.text = ''
                  set -euo pipefail
                  install -d -m 0750 /opt/pangolin
                  install -m 0640 \
                    ${./pangolin/docker-compose.yml} \
                    /opt/pangolin/docker-compose.yml
                  install -d -m 0750 /var/lib/pangolin
                  install -d -m 0750 /var/lib/pangolin/traefik
                  install -m 0640 \
                    ${./pangolin/config/config.yml} \
                    /var/lib/pangolin/config.yml
                  install -m 0640 \
                    ${./pangolin/config/traefik/traefik_config.yml} \
                    /var/lib/pangolin/traefik/traefik_config.yml
                  install -m 0640 \
                    ${./pangolin/config/traefik/dynamic_config.yml} \
                    /var/lib/pangolin/traefik/dynamic_config.yml
                '';

                # ── Pangolin systemd service ──────────────────────────
                # This runs docker-compose up/down as a systemd service,
                # so Pangolin starts on boot and stops cleanly on shutdown.
                # The EnvironmentFile points to the sops-rendered env file
                # containing the decrypted SERVER_SECRET.
                systemd.services.pangolin = {
                  description = "Pangolin (docker compose)";
                  after = [ "network-online.target" "docker.service" ];
                  wants = [ "network-online.target" "docker.service" ];
                  wantedBy = [ "multi-user.target" ];
                  serviceConfig = {
                    Type = "oneshot";
                    RemainAfterExit = true;
                    WorkingDirectory = "/opt/pangolin";
                    EnvironmentFile =
                      config.sops.templates."pangolin.env".path;
                    ExecStart =
                      "${pkgs.docker-compose}/bin/docker-compose up -d";
                    ExecStop =
                      "${pkgs.docker-compose}/bin/docker-compose down";
                    TimeoutStartSec = "300";
                  };
                };
              })
          ];
        };
    in
    {
      nixosConfigurations = {
        "proxy-1" = mkHost { hostName = "proxy-1"; };
      };
    };
}

What the sops integration does

The important part here is the sops.templates block. When nixos-rebuild switch runs:

  1. sops-nix decrypts secrets.yaml using the node's age key
  2. It extracts the value at pangolin/server_secret
  3. It renders the pangolin.env template, substituting the placeholder with the real secret
  4. The rendered file lands at a path under /run with mode 0400 (readable only by root)
  5. The systemd.services.pangolin service reads this file as its EnvironmentFile
  6. Docker Compose gets SERVER_SECRET as an environment variable

The secret never appears in the repo in plaintext. It is encrypted at rest, decrypted only on the node, and injected into the container environment at runtime.
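
When the pangolin service fails to start, a quick check is whether the rendered env file actually contains a non-empty SERVER_SECRET. The helper below is a sketch with a hypothetical name. It tests for the key without printing the value.

```shell
# Succeed if KEY is defined with a non-empty value in an env file.
# The value itself is never printed.
env_file_has_key() {
  grep -Eq "^[[:space:]]*$2=.+" "$1"
}
```

Run it as root against the path shown on the EnvironmentFile= line of systemctl cat pangolin, passing SERVER_SECRET as the second argument. A non-zero exit status points at the sops configuration or the node's age key.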


Pangolin Docker Compose

Create pangolin/docker-compose.yml:

services:
  pangolin:
    image: fosrl/pangolin:latest
    container_name: pangolin
    restart: always
    environment:
      - SERVER_SECRET=${SERVER_SECRET}
    volumes:
      - /var/lib/pangolin:/app/config
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/v1/"]
      interval: "3s"
      timeout: "3s"
      retries: 15

  gerbil:
    image: fosrl/gerbil:latest
    container_name: gerbil
    restart: always
    depends_on:
      pangolin:
        condition: service_healthy
    command:
      - --reachableAt=http://gerbil:3004
      - --generateAndSaveKeyTo=/var/config/key
      - --remoteConfig=http://pangolin:3001/api/v1/
    volumes:
      - /var/lib/pangolin:/var/config
    cap_add:
      - NET_ADMIN
      - SYS_MODULE
    ports:
      - 51820:51820/udp
      - 21820:21820/udp
      - 443:443
      - 80:80

  traefik:
    image: traefik:v3.4.0
    container_name: traefik
    restart: always
    network_mode: service:gerbil
    depends_on:
      pangolin:
        condition: service_healthy
    command:
      - --configFile=/etc/traefik/traefik_config.yml
    volumes:
      - /var/lib/pangolin/traefik:/etc/traefik:ro
      - /var/lib/pangolin/letsencrypt:/letsencrypt
      - /var/lib/pangolin/traefik/logs:/var/log/traefik

networks:
  default:
    driver: bridge
    name: pangolin

Note that Traefik uses network_mode: service:gerbil, meaning it shares Gerbil's network namespace. This is how Traefik can listen on ports 80 and 443 even though those ports are published on the Gerbil container.

Pangolin config files

Create pangolin/config/config.yml:

app:
  dashboard_url: "https://proxy.example.com"
  log_level: "info"
  save_logs: true
  log_failed_attempts: true

domains:
  domain1:
    base_domain: "proxy.example.com"
    cert_resolver: "letsencrypt"

server:
  trust_proxy: 1

gerbil:
  base_endpoint: "proxy.example.com"

flags:
  require_email_verification: false
  disable_signup_without_invite: true
  disable_user_create_org: true

Create pangolin/config/traefik/traefik_config.yml:

api:
  insecure: false
  dashboard: false

providers:
  http:
    endpoint: "http://pangolin:3001/api/v1/traefik-config"
    pollInterval: "5s"
  file:
    filename: "/etc/traefik/dynamic_config.yml"

experimental:
  plugins:
    badger:
      moduleName: "github.com/fosrl/badger"
      version: "v1.3.0"

log:
  level: "INFO"
  format: "common"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: "you@example.com"  # placeholder: use a real address for Let's Encrypt notices
      storage: "/letsencrypt/acme.json"
      caServer: "https://acme-v02.api.letsencrypt.org/directory"

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: "30m"
    http:
      tls:
        certResolver: "letsencrypt"

serversTransport:
  insecureSkipVerify: true

ping:
  entryPoint: "web"

Create pangolin/config/traefik/dynamic_config.yml:

http:
  middlewares:
    badger:
      plugin:
        badger:
          disableForwardAuth: true
    redirect-to-https:
      redirectScheme:
        scheme: https

  routers:
    main-app-router-redirect:
      rule: "Host(`proxy.example.com`)"
      service: next-service
      entryPoints:
        - web
      middlewares:
        - redirect-to-https
        - badger

    next-router:
      rule: "Host(`proxy.example.com`) && !PathPrefix(`/api/v1`)"
      service: next-service
      entryPoints:
        - websecure
      middlewares:
        - badger
      tls:
        certResolver: letsencrypt

    api-router:
      rule: "Host(`proxy.example.com`) && PathPrefix(`/api/v1`)"
      service: api-service
      entryPoints:
        - websecure
      middlewares:
        - badger
      tls:
        certResolver: letsencrypt

    ws-router:
      rule: "Host(`proxy.example.com`)"
      service: api-service
      entryPoints:
        - websecure
      middlewares:
        - badger
      tls:
        certResolver: letsencrypt

  services:
    next-service:
      loadBalancer:
        servers:
          - url: "http://pangolin:3002"
    api-service:
      loadBalancer:
        servers:
          - url: "http://pangolin:3000"

Replace proxy.example.com throughout with your actual domain.
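
Because the placeholder appears in several files, a scripted replacement is less error-prone than editing each one by hand. The sketch below uses a hypothetical function name and rewrites every file under a directory that mentions proxy.example.com. It goes through a temp file so it does not depend on a particular sed -i syntax.

```shell
# Replace the placeholder domain in every file under a directory.
# Files are rewritten via a temp file so this works with any sed.
replace_placeholder_domain() (
  dir="$1"
  domain="$2"
  grep -rl 'proxy\.example\.com' "$dir" | while IFS= read -r file; do
    tmp="$(mktemp)"
    sed "s/proxy\.example\.com/$domain/g" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
  done
)
```

From the repo root, replace_placeholder_domain pangolin your.real.domain updates config.yml and both Traefik files in one pass. Review the result with git diff before committing.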


Pangolin GitHub Actions workflow

The deploy workflow for the Pangolin VPS is simpler than the Swarm one (single node, no stack discovery). Create .github/workflows/deploy.yml:

name: deploy

on:
  push:
    branches: ["main"]

concurrency:
  group: deploy-proxy
  cancel-in-progress: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup SSH key
        env:
          DEPLOY_KEY_B64: ${{ secrets.DEPLOY_SSH_PRIVATE_KEY_B64 }}
        run: |
          set -euo pipefail
          mkdir -p ~/.ssh && chmod 700 ~/.ssh
          printf '%s' "$DEPLOY_KEY_B64" | tr -d '\r' | base64 -d > ~/.ssh/deploy_key
          chmod 600 ~/.ssh/deploy_key
          ssh-keygen -lf ~/.ssh/deploy_key
          eval "$(ssh-agent -s)"
          ssh-add ~/.ssh/deploy_key

      - name: Add pinned known_hosts
        env:
          VPS_KNOWN_HOSTS: ${{ secrets.VPS_KNOWN_HOSTS }}
        run: |
          set -euo pipefail
          mkdir -p ~/.ssh && chmod 700 ~/.ssh
          printf '%s\n' "$VPS_KNOWN_HOSTS" >> ~/.ssh/known_hosts
          chmod 600 ~/.ssh/known_hosts

      - name: Deploy
        env:
          VPS_HOST: ${{ secrets.VPS_HOST }}
          VPS_USER: ${{ secrets.VPS_USER }}
          FLAKE_TARGET: ${{ secrets.FLAKE_TARGET }}
        run: |
          set -euo pipefail
          ssh -i ~/.ssh/deploy_key \
            -o IdentitiesOnly=yes -o BatchMode=yes -o StrictHostKeyChecking=yes \
            "${VPS_USER}@${VPS_HOST}" "bash -lc '
              set -euo pipefail
              cd /etc/nixos
              sudo -n /run/current-system/sw/bin/git fetch --prune origin
              sudo -n /run/current-system/sw/bin/git checkout -f main
              sudo -n /run/current-system/sw/bin/git reset --hard origin/main
              sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
                --flake \"path:/etc/nixos#${FLAKE_TARGET}\"
            '"
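
The workflow expects DEPLOY_SSH_PRIVATE_KEY_B64 to be a single-line base64 string and VPS_KNOWN_HOSTS to hold the host's pinned public key lines. One way to produce both values on your workstation is sketched below. The function names are hypothetical, and the key path in the usage note assumes the CI key is named as in the recovery section later in this post.

```shell
# Print a private key as single-line base64 for DEPLOY_SSH_PRIVATE_KEY_B64.
# Stripping newlines gives the same output with GNU base64 (which wraps
# long lines) and BSD/macOS base64.
encode_deploy_key() {
  base64 < "$1" | tr -d '\n'
}

# Print known_hosts lines for a host, for the VPS_KNOWN_HOSTS secret.
# Scan the same name or IP that the VPS_HOST secret holds, otherwise
# StrictHostKeyChecking=yes will not find a matching entry.
scan_host_keys() {
  ssh-keyscan -t ed25519 "$1" 2>/dev/null
}
```

encode_deploy_key ~/.ssh/swarm_ci_deploy_ed25519 prints the value for the first secret, and scan_host_keys PROXY_PUBLIC_IP prints the value for the second. Before pinning the scanned key, compare its fingerprint with the output of ssh-keygen -lf /etc/ssh/ssh_host_ed25519_key.pub on the VPS console, because ssh-keyscan trusts whatever answers on the network.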

Provisioning the Pangolin VPS

The process is identical to provisioning a Swarm node (Steps 6-9 in Part 1), with these differences:

  • Only one node to provision, not three
  • Point DNS (proxy.example.com A record) to this VPS's public IP before applying the flake, because Let's Encrypt needs to reach port 80 for ACME validation
  • The NixOS firewall is enabled (the flake handles the port rules)
  • After applying the flake, Pangolin's systemd service starts automatically, pulling Docker images and bringing up the compose stack

After nixos-rebuild switch:

# Check Pangolin containers
docker ps

# Check logs if something is restarting
docker logs pangolin --tail 100
docker logs traefik --tail 100

Common first-run issues:

  • DNS not pointing to the VPS yet (Let's Encrypt fails)
  • Port 80 blocked by provider firewall (ACME validation fails)
  • SERVER_SECRET not decrypted properly (check sops config and age key)
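
The first two issues can be checked from your workstation before digging through container logs. The sketch below is one way to do it, assuming dig and curl are installed; check_proxy and its helpers are hypothetical names, and 203.0.113.10 in the usage note stands in for your VPS's public IP. It does not cover the sops case, which depends on how your flake wires up the secret.

```shell
# Succeeds if the expected IP is one of the resolved addresses read from stdin.
resolves_to() {
  grep -Fxq "$1"
}

# Succeeds if curl reported any real HTTP status; 000 means the connection failed.
port80_answered() {
  [ -n "$1" ] && [ "$1" != "000" ]
}

# Hypothetical wrapper: check_proxy <domain> <expected-public-ip>
check_proxy() {
  domain=$1
  expected_ip=$2
  if dig +short A "$domain" | resolves_to "$expected_ip"; then
    echo "dns: ok"
  else
    echo "dns: $domain does not resolve to $expected_ip yet"
  fi
  # -w '%{http_code}' prints only the status code; a redirect to HTTPS still counts.
  status=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "http://$domain/")
  if port80_answered "$status"; then
    echo "port 80: ok (HTTP $status)"
  else
    echo "port 80: no answer, check the provider firewall"
  fi
}
```

Run it as check_proxy proxy.example.com 203.0.113.10. Note that the port 80 check only passes once Traefik is actually listening, so a "no answer" result right after the first rebuild can also mean the containers are still starting.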

Once the dashboard is up, open https://proxy.example.com/auth/initial-setup, create the admin account, and create your first Organisation.


Connecting the Swarm cluster to Pangolin

In the Pangolin dashboard, create a Site using Newt Tunnel. Pangolin generates a NEWT_ID and NEWT_SECRET. Add these as GitHub secrets in your Swarm cluster repo.

Then create the Newt stack in the Swarm repo:

# swarm/newt-stack.yml
version: "3.9"

services:
  newt:
    image: fosrl/newt:latest
    environment:
      PANGOLIN_ENDPOINT: "https://proxy.example.com"
      NEWT_ID: "${NEWT_ID}"
      NEWT_SECRET: "${NEWT_SECRET}"
    networks:
      - pangolin
    deploy:
      replicas: 1
      restart_policy:
        condition: any

networks:
  pangolin:
    external: true

Push to main. The pipeline deploys the Newt stack. In the Pangolin dashboard, the site status should move from Offline to Online.
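
The ${NEWT_ID} and ${NEWT_SECRET} placeholders are filled in from the environment of whatever process runs docker stack deploy. To my knowledge, an unset variable is substituted as an empty string rather than rejected, so a missing secret tends to surface only as a Newt container that cannot authenticate. A small guard in the deploy step catches that earlier. This is a sketch under assumptions: require_env is a hypothetical helper, and it presumes your pipeline exports the two GitHub secrets into the shell that deploys the stack.

```shell
# Hypothetical guard: fail if any named variable is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh as well as bash.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage on a manager node (the stack name "newt" is an assumption):
#   require_env NEWT_ID NEWT_SECRET \
#     && docker stack deploy -c swarm/newt-stack.yml newt
```

Because the check reports every missing name before returning, one failed run tells you about both secrets at once.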

Now create Resources in the Pangolin dashboard for each service you want to expose. The upstream target is the service's overlay network alias and port, e.g. http://myapp:8080. Pangolin routes HTTPS traffic through the tunnel to the Swarm overlay network, reaching the service without any published ports on the Swarm hosts.


Updating the OS

To update NixOS and all system packages across the entire infrastructure:

cd my-swarm-cluster
nix --extra-experimental-features "nix-command flakes" flake update
git add flake.lock
git commit -m "Update NixOS flake inputs"
git push

Do the same in the Pangolin repo. GitHub Actions picks up the new lock file and rebuilds each node with updated packages. That is the entire OS update process.


New laptop, who dis

If you change workstations, you need your SSH keys and optionally your age identity file (for editing encrypted secrets locally).

Restore SSH keys

Copy the key files from your password manager or backup into ~/.ssh, then set the permissions SSH expects:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/swarm_admin_ed25519 ~/.ssh/swarm_ci_deploy_ed25519
chmod 644 ~/.ssh/swarm_admin_ed25519.pub ~/.ssh/swarm_ci_deploy_ed25519.pub

Set up SSH config

Host swarm-1
  HostName NODE1_PUBLIC_IP
  User admin
  IdentityFile ~/.ssh/swarm_admin_ed25519
  IdentitiesOnly yes

Host swarm-2
  HostName NODE2_PUBLIC_IP
  User admin
  IdentityFile ~/.ssh/swarm_admin_ed25519
  IdentitiesOnly yes

Host swarm-3
  HostName NODE3_PUBLIC_IP
  User admin
  IdentityFile ~/.ssh/swarm_admin_ed25519
  IdentitiesOnly yes

Host proxy-1
  HostName PROXY_PUBLIC_IP
  User admin
  IdentityFile ~/.ssh/swarm_admin_ed25519
  IdentitiesOnly yes

Restore age identity (for sops editing)

Only needed if you decrypt or edit secrets.yaml locally:

mkdir -p ~/.config/sops/age && chmod 700 ~/.config/sops/age
cp /path/from_backup/keys.txt ~/.config/sops/age/keys.txt
chmod 600 ~/.config/sops/age/keys.txt

Test:

cd my-swarm-cluster
sops -d secrets.yaml >/dev/null && echo "OK"

If you lost the SSH keys

Generate new key pairs, update the public keys in both flake.nix files (Swarm and Pangolin repos), update the base64-encoded CI key in the GitHub secrets of both repos, and push. If you can no longer SSH in with any of the old keys, the pipeline cannot either, so use the provider's web console once on each node to pull the repo and apply the first rebuild. After that the new keys are in place and deploys work as before. The nodes recover because the config is in the repo.

Verify the cluster

ssh swarm-1 'docker node ls'
ssh swarm-1 'docker stack ls'
ssh swarm-1 'docker service ls'
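
With three managers, the swarm needs at least two of them healthy to keep quorum, so it is worth turning the docker node ls output into a number you can check at a glance. The helper below is a hypothetical sketch: it assumes you feed it one line per node in the form "hostname status manager-status", which the --format string in the usage comment produces.

```shell
# Count nodes that are Ready and hold a working manager role.
# Expects lines of "hostname status manager-status" on stdin.
count_healthy_managers() {
  awk '$2 == "Ready" && ($3 == "Leader" || $3 == "Reachable") { n++ }
       END { print n + 0 }'
}

# Usage from your workstation:
#   ssh swarm-1 "docker node ls --format '{{.Hostname}} {{.Status}} {{.ManagerStatus}}'" \
#     | count_healthy_managers
```

For this cluster the expected result is 3. A result of 2 means the swarm still has quorum but cannot afford to lose another manager.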

Closing

The combination of NixOS, Docker Swarm, and GitOps gives you something that is hard to achieve with traditional setups: a production container cluster where the entire state (OS config, services, secrets, and reverse proxy) is version-controlled, reproducible, and deployable from a single git push.

NixOS handles the OS layer declaratively. Swarm handles container orchestration without the complexity tax of Kubernetes. Pangolin handles the ingress layer with identity-aware routing through WireGuard tunnels. GitHub Actions ties it all together so the only manual step after initial provisioning is committing code.

If a Swarm node dies, you provision a new VPS, clone the repo, generate an age key, apply the flake, and join the swarm. If the proxy VPS dies, you provision a new one, point DNS at it, clone the repo, apply the flake, and Pangolin rebuilds itself from config. The system recovers from the repo because the repo is the system.

Quite neat.