Docker Swarm on NixOS with GitOps
Building a declarative, highly available, git-driven container cluster
Contents
- What this post covers
- The three building blocks
- Docker Swarm
- NixOS
- GitOps
- Architecture overview
- Part 1: The Swarm Cluster
- Step 0: Workstation prerequisites
- Step 1: Create the repo structure
- Step 2: Write the Nix flake
- Step 3: sops-nix configuration
- Step 4: GitHub Actions workflow
- Step 5: Commit and push
- Step 6: Provision the VPS nodes
- Step 7: Set up GitHub repo access on the node
- Step 8: Replace /etc/nixos with the repo
- Step 9: Generate the node's age key and apply the flake
- Step 10: Repeat for nodes 2 and 3
- Step 11: Pin SSH host keys for GitHub Actions
- Step 12: Add GitHub Actions secrets
- Step 13: Confirm VPC networking and initialise Docker Swarm
- Step 14: Deploy your first Swarm stack
- Using shared storage with stacks
- Deploying in-house apps from a private container registry
- Part 2: The Pangolin Reverse Proxy VPS
- Repo layout
- The Pangolin flake
- Pangolin Docker Compose
- Pangolin config files
- Pangolin GitHub Actions workflow
- Provisioning the Pangolin VPS
- Connecting the Swarm cluster to Pangolin
- Updating the OS
- New laptop, who dis
- Restore SSH keys
- Set up SSH config
- Restore age identity (for sops editing)
- If you lost the SSH keys
- Verify the cluster
- Closing
What this post covers
This is a full walkthrough of building a three-node Docker Swarm cluster running on NixOS, managed entirely through a Git repo and deployed via GitHub Actions. Every config change, every stack deployment, and every OS-level tweak flows through a push to the repo, with no SSH-and-edit on live servers.
By the end you will have:
- 3 NixOS VPS nodes in the same VPC, running Docker Swarm with 3 manager nodes
- A separate NixOS VPS running Pangolin (reverse proxy + tunnel endpoint), also GitOps-managed
- A single GitHub repo per concern (one for the Swarm cluster, one for the proxy) as the source of truth for all OS and service configuration
- GitHub Actions that rebuild each node on push to main
- Swarm stacks auto-discovered and deployed from the same repo
- Shared persistent storage across Swarm nodes via a mounted filesystem
- sops-nix for encrypted secrets, decrypted at deploy time on each node
- Pangolin routing HTTPS traffic into the Swarm overlay network without publishing ports on hosts
- In-house app builds pushed to a private container registry and deployed as Swarm services
- A workstation recovery path so you can regain full access from a new laptop
The three building blocks
Before diving in, here is a quick grounding on the three core technologies this build combines. If you are already familiar with all three, skip to the architecture overview.
Docker Swarm
Docker Swarm is Docker's built-in container orchestration. If you have used docker compose up on a single host, Swarm is the multi-host version of that: it takes the same Compose-style YAML but distributes containers across a cluster of machines. You get service replication (run N copies of a container spread across nodes), rolling updates (replace containers one at a time with zero downtime), an overlay network (so containers on different physical hosts can talk to each other as if they were on the same LAN), and an ingress routing mesh (so traffic hitting any node on a published port gets forwarded to whichever node is actually running that container).
It is a lot simpler than Kubernetes. There is no etcd to manage, no control plane to babysit, no CRDs, no Helm charts. You write a YAML file, run docker stack deploy, and it works. For small to medium workloads, Swarm does everything you actually need without the operational overhead.
A Swarm cluster has two roles: manager nodes (which handle scheduling, cluster state, and Raft consensus) and worker nodes. In a three-node cluster like this one, all three are managers, which gives you fault tolerance since the cluster stays healthy and schedulable as long as two of three nodes are up. If one node goes down, the surviving two maintain quorum and reschedule its containers onto healthy nodes.
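The quorum arithmetic behind that claim can be sketched in a few lines of shell. This is only an illustration of the standard Raft majority rule; the function names quorum and tolerated_failures are made up for this example and are not Docker commands.

```shell
#!/bin/sh
# Raft majority rule for a Swarm with N manager nodes.
# quorum = floor(N / 2) + 1; tolerated failures = N - quorum.

quorum() {
  echo $(( $1 / 2 + 1 ))
}

tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for a few common manager counts.
for n in 1 3 5; do
  echo "managers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Note that an even manager count adds no fault tolerance: four managers still tolerate only one failure, which is why odd counts are the usual recommendation.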
NixOS
NixOS is a Linux distribution where the entire system configuration is declared in code. You do not install packages imperatively (like apt install nginx), you do not edit config files scattered across /etc, and you do not maintain shell scripts that drift over time. Instead, you write a single Nix expression that describes the desired state of the system: every package, every service, every user account, every firewall rule, every mount point, every kernel module. Then you run nixos-rebuild switch, and the Nix package manager evaluates that expression and atomically transitions the running system to match.
This means the configuration is reproducible. If you lose a node, you provision a new VPS, clone the repo, and apply the same flake. You get the exact same system. If you want three identical nodes with small per-host differences (like different hostnames or VPC IPs), you write a shared module and parameterise the host-specific parts. And because the entire config is just files in a Git repo, it slots naturally into version control and CI/CD.
NixOS also gives you atomic rollbacks. Every rebuild creates a new "generation" of the system. If a change breaks something, you can roll back to the previous generation instantly, either from the boot menu or with nixos-rebuild switch --rollback.
GitOps
GitOps is a pattern where a Git repository is the single source of truth for infrastructure and application state. You do not SSH into servers and run commands. You commit changes to the repo, and automation applies those changes to the live environment.
In this build, GitOps means:
- Push a NixOS config change to main → GitHub Actions SSHes into each node, pulls the latest commit, and runs nixos-rebuild switch. The OS, packages, services, users, firewall rules, and everything else update to match.
- Push a new Swarm stack YAML to main → the same pipeline copies the stack file to the Swarm manager and runs docker stack deploy. The new service comes up.
- Push an application code change → the pipeline builds a container image, pushes it to a private registry, and deploys the updated stack with the new image tag.
The result is that the running state of every node (the OS, the Docker configuration, and every deployed service) is a function of what is in the repo. If someone makes a manual change to anything the flake or the stack files manage, the next push to main overwrites it, so drift does not accumulate.
Architecture overview
The infrastructure has two concerns, managed in two separate repos:
The Swarm cluster (3 VPS nodes in one VPC): Three VPS nodes sit in the same VPC (private network). Each runs NixOS with Docker enabled. Docker Swarm is initialised across all three using VPC private IPs for inter-node communication. A shared filesystem (virtiofs, NFS, or whatever your provider offers) provides persistent storage that all three nodes can access at the same mount path.
The reverse proxy (1 VPS, separate): A fourth VPS runs Pangolin (reverse proxy + WireGuard tunnel endpoint) as a Docker Compose stack on NixOS, managed by its own repo with the same GitOps pattern. This VPS sits outside the Swarm VPC. It receives inbound HTTPS traffic from the internet and routes it through a tunnel (Newt) into the Swarm's overlay network.
Services inside the Swarm do not publish ports on their hosts. The only way traffic reaches them is through the tunnel via Pangolin. This means you do not need to open service ports on the Swarm VPC firewall at all.
The critical constraint for the Swarm nodes: /var/lib/docker must stay local to each node. Docker Swarm requires each node to have its own local Docker state. You cannot put /var/lib/docker on shared storage. Shared persistent data for your applications is handled by bind-mounting directories from the shared filesystem into services, or by creating named volumes backed by bind mounts into the shared filesystem. This is the standard pattern.
Part 1: The Swarm Cluster
This covers everything needed to go from zero to a working three-node Swarm cluster with GitOps deployment.
Step 0: Workstation prerequisites
You need a handful of tools on your local machine before touching any servers.
Install packages
You need git, ssh, and base64 (usually already present on Linux/macOS). You also need the Nix package manager (not NixOS itself, just the Nix CLI) because several tools used in this guide (sops, age, ssh-to-age, mkpasswd) are easiest to run via Nix's ephemeral shell.
If you do not have Nix installed on your workstation, follow the official install instructions at nixos.org. The single-user or multi-user installer both work. You only need the nix CLI, not a full NixOS install.
If your Nix install does not have flakes enabled by default, you can either add experimental-features = nix-command flakes to /etc/nix/nix.conf and restart the daemon, or pass the flag inline on every command (shown below).
Create SSH keys
You need two distinct key pairs:
- Admin key: your interactive SSH access to the nodes. You use this when you need to log in manually.
- CI deploy key: used by GitHub Actions to SSH into nodes as the deploy user. This key lives as a base64-encoded GitHub secret.
# Admin key (your interactive SSH access)
ssh-keygen -t ed25519 -a 64 -f ~/.ssh/swarm_admin_ed25519 -N ""
cat ~/.ssh/swarm_admin_ed25519.pub
# CI deploy key (GitHub Actions to SSH into nodes)
ssh-keygen -t ed25519 -a 64 -f ~/.ssh/swarm_ci_deploy_ed25519 -N ""
cat ~/.ssh/swarm_ci_deploy_ed25519.pub
Keep the public key outputs handy. You will paste them into the NixOS flake.
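Since the CI private key ends up in a base64-encoded GitHub secret (the workflow in Step 4 reads it as CI_DEPLOY_PRIVATE_KEY_B64 and decodes it with base64 -d), it can help to check the encoding round trip locally before pasting anything. The following is a minimal sketch; the helper names encode_key and verify_roundtrip are hypothetical, and it assumes a base64 that accepts -d, as GNU coreutils does.

```shell
#!/bin/sh
# encode_key FILE: print FILE as one unbroken base64 line,
# which is the form you would paste into the GitHub secret.
encode_key() {
  base64 < "$1" | tr -d '\n'
}

# verify_roundtrip FILE: decode the encoded form again and compare
# it byte-for-byte with the original file.
verify_roundtrip() {
  encode_key "$1" | base64 -d | cmp -s - "$1"
}

# Usage, with the key path from the commands above:
#   encode_key ~/.ssh/swarm_ci_deploy_ed25519
#   verify_roundtrip ~/.ssh/swarm_ci_deploy_ed25519 && echo "round trip ok"
```

Stripping the newlines keeps the secret on a single line, which avoids surprises with line wrapping when the value is pasted into a web form.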
Generate an admin password hash
NixOS needs a hashed password for the admin user (for sudo, console login, etc). Generate one:
nix --extra-experimental-features "nix-command flakes" shell nixpkgs#mkpasswd -c \
bash -lc 'mkpasswd -m sha-512'
Type your desired password when prompted. Copy the $6$... output. You will paste it into flake.nix later.
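Before pasting, a quick shape check can catch a truncated copy. This sketch only tests that a string looks like a SHA-512 crypt hash ($6$, an optional rounds field, a salt of up to 16 characters, and an 86-character digest); it does not verify the password itself. The function name looks_like_sha512_crypt is invented for this example.

```shell
#!/bin/sh
# looks_like_sha512_crypt HASH: succeed if HASH has the
# $6$[rounds=N$]salt$digest shape produced by mkpasswd -m sha-512.
looks_like_sha512_crypt() {
  printf '%s\n' "$1" |
    grep -Eq '^\$6\$(rounds=[0-9]+\$)?[./0-9A-Za-z]{1,16}\$[./0-9A-Za-z]{86}$'
}

# Usage (single quotes stop the shell from expanding the $ signs):
#   looks_like_sha512_crypt '$6$...' && echo "hash shape ok"
```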
Convert your admin SSH key to an age recipient
sops-nix uses age encryption. You need your admin SSH public key converted to an age recipient so you can encrypt and decrypt secrets on your workstation:
nix --extra-experimental-features "nix-command flakes" shell nixpkgs#ssh-to-age -c \
bash -lc 'ssh-to-age < ~/.ssh/swarm_admin_ed25519.pub'
Copy the age1... string. You will use it in .sops.yaml.
Step 1: Create the repo structure
Create a private GitHub repo. This repo will contain the full NixOS configuration for all three nodes, all Swarm stack definitions, sops secrets config, and the CI/CD workflow.
mkdir -p my-swarm-cluster
cd my-swarm-cluster
mkdir -p hosts/swarm-1
mkdir -p hosts/swarm-2
mkdir -p hosts/swarm-3
mkdir -p swarm
mkdir -p .github/workflows
The final repo layout will look like this:
.
├── flake.nix # shared NixOS config for all 3 nodes
├── flake.lock # pinned input versions
├── secrets.yaml # sops-encrypted secrets
├── .sops.yaml # sops recipient config
├── hosts/
│ ├── swarm-1/
│ │ ├── hardware-configuration.nix
│ │ └── networking.nix
│ ├── swarm-2/
│ │ ├── hardware-configuration.nix
│ │ └── networking.nix
│ └── swarm-3/
│ ├── hardware-configuration.nix
│ └── networking.nix
├── swarm/
│ ├── myapp-stack.yml
│ ├── newt-stack.yml
│ └── another-service-stack.yml
└── .github/
└── workflows/
└── deploy.yml
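A small check like the one below can confirm that every host directory contains the files the flake will import. It is a sketch under the assumption that the three hosts are named swarm-1 to swarm-3 as in the layout above; check_layout is a hypothetical helper, not part of any tool used in this guide. It is expected to report missing hardware-configuration.nix files until each node has been provisioned in Step 8.

```shell
#!/bin/sh
# check_layout REPO_ROOT: list any per-host file from the layout
# above that does not exist yet. Returns 1 if anything is missing.
check_layout() {
  root="$1"
  missing=0
  for host in swarm-1 swarm-2 swarm-3; do
    for file in hardware-configuration.nix networking.nix; do
      if [ ! -f "$root/hosts/$host/$file" ]; then
        echo "missing: hosts/$host/$file"
        missing=1
      fi
    done
  done
  return "$missing"
}

# Usage, from the repo root:
#   check_layout . && echo "layout complete"
```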
Step 2: Write the Nix flake
This is the core of the build. A single flake.nix defines all three nodes using a mkHost function. Each call to mkHost produces a complete NixOS system configuration for one node, importing its host-specific hardware config. Shared config (users, Docker, SSH hardening, packages, services, secrets) is written once and applies to all three.
Create flake.nix in the repo root. Read through the inline comments; they explain why each block exists:
{
description = "Docker Swarm cluster on NixOS with sops-nix and GitHub Actions deploy";
inputs = {
# Pin to a specific NixOS release channel
nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";
# sops-nix for encrypted secrets management
sops-nix.url = "github:Mic92/sops-nix";
sops-nix.inputs.nixpkgs.follows = "nixpkgs";
};
outputs = { self, nixpkgs, sops-nix }:
let
system = "x86_64-linux";
# mkHost: a function that builds a NixOS system config for one node.
# All shared config lives inside this function.
# Host-specific config (hardware, networking) is imported per host.
mkHost = { hostName }:
nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
({ config, pkgs, lib, ... }:
{
imports = [
# Each node has its own hardware-configuration.nix
# (auto-generated during NixOS install, unique per machine)
./hosts/${hostName}/hardware-configuration.nix
];
# stateVersion tells NixOS which release this system was
# originally installed on. It affects default behaviours for
# some services. Set it once and do not change it.
system.stateVersion = "25.11";
networking.hostName = hostName;
time.timeZone = "Europe/London";
nix.settings = {
experimental-features = [ "nix-command" "flakes" ];
warn-dirty = false;
};
# ── Boot ──────────────────────────────────────────────
boot.loader.grub.enable = true;
boot.loader.grub.devices = [ "/dev/vda" ];
boot.loader.grub.useOSProber = false;
# ── Users ─────────────────────────────────────────────
# mutableUsers = false means user accounts are ONLY what
# is defined here. No one can useradd/passwd on the box.
# If you need to change passwords or keys, update the
# flake and push.
users.mutableUsers = false;
security.sudo.wheelNeedsPassword = true;
users.users.admin = {
isNormalUser = true;
description = "Admin";
extraGroups = [ "wheel" "docker" ];
openssh.authorizedKeys.keys = [
# Paste your admin public key here:
"ssh-ed25519 AAAA... your-admin-pubkey"
];
# Paste the $6$... hash from mkpasswd:
hashedPassword = "$6$...your-hash-here";
};
users.users.deploy = {
isNormalUser = true;
description = "CI Deploy User";
extraGroups = [ "wheel" "docker" ];
openssh.authorizedKeys.keys = [
# Paste your CI deploy public key here:
"ssh-ed25519 AAAA... your-ci-deploy-pubkey"
];
# No password. This user is SSH-key-only via GitHub Actions.
};
# deploy user gets NOPASSWD sudo for these four commands only:
# nixos-rebuild (to apply config), git (to pull repo),
# systemctl (to restart services), true (for connection tests).
# Note: the rules do not restrict arguments, and docker group
# membership is root-equivalent, so treat the CI key as a root credential.
security.sudo.extraRules = [
{
users = [ "deploy" ];
commands = [
{ command = "/run/current-system/sw/bin/nixos-rebuild";
options = [ "NOPASSWD" ]; }
{ command = "/run/current-system/sw/bin/git";
options = [ "NOPASSWD" ]; }
{ command = "/run/current-system/sw/bin/systemctl";
options = [ "NOPASSWD" ]; }
{ command = "/run/current-system/sw/bin/true";
options = [ "NOPASSWD" ]; }
];
}
];
# ── SSH hardening ─────────────────────────────────────
services.openssh = {
enable = true;
# openFirewall = false because we manage firewall at
# the provider level for the swarm nodes.
openFirewall = false;
settings = {
PasswordAuthentication = false;
KbdInteractiveAuthentication = false;
PermitRootLogin = "no";
X11Forwarding = false;
AllowTcpForwarding = "no";
AllowAgentForwarding = "no";
ClientAliveInterval = 300;
ClientAliveCountMax = 2;
MaxAuthTries = 3;
LogLevel = "VERBOSE";
};
};
services.fail2ban.enable = true;
# ── Firewall ──────────────────────────────────────────
# Disabled on swarm nodes because we use provider-level
# firewall rules (Vultr, Hetzner, etc). The provider
# firewall is easier to manage centrally and avoids
# conflicts with Docker's iptables manipulation.
# If your provider does not offer a firewall, enable this
# and open ports 22, 2377, 7946/tcp, 7946/udp, 4789/udp.
networking.firewall.enable = false;
# ── Kernel modules ────────────────────────────────────
# virtiofs: for the shared filesystem mount
# vxlan: Docker Swarm overlay networking uses VXLAN
# overlay: Docker overlay filesystem driver
# br_netfilter: required for iptables to see bridged traffic
boot.initrd.kernelModules = [ "virtiofs" ];
boot.kernelModules = [
"virtiofs" "vxlan" "overlay" "br_netfilter"
];
# ── Shared filesystem ─────────────────────────────────
# This is where persistent data lives. All three nodes
# mount the same filesystem. Your provider determines the
# mechanism: virtiofs (Vultr), NFS, GlusterFS, etc.
# Replace "your-mount-tag" with your actual mount tag/device.
fileSystems."/mnt/vfs" = {
device = "your-mount-tag";
fsType = "virtiofs";
options = [ "rw" "relatime" ];
};
# Create directory structure via systemd-tmpfiles.
# These directories are created on every boot if missing.
systemd.tmpfiles.rules = [
"d /mnt/vfs 0755 root root -"
"d /mnt/vfs/docker-volumes 0755 root root -"
"d /opt/swarm 0750 root root -"
"d /opt/swarm/stacks 0750 root root -"
];
# ── Activation script: copy stack files from repo ─────
# When nixos-rebuild runs, this script copies stack YAML
# files from the repo (which lives at /etc/nixos on the
# node) into /opt/swarm/stacks with correct ownership.
# The docker group ownership means the deploy user (who
# is in the docker group) can read these files.
system.activationScripts.swarmStacks.text = ''
set -euo pipefail
install -d -m 0750 -o root -g docker /opt/swarm
install -d -m 0750 -o root -g docker /opt/swarm/stacks
# Add one line per stack file:
install -m 0640 -o root -g docker \
${./swarm/myapp-stack.yml} \
/opt/swarm/stacks/myapp-stack.yml
'';
# ── Docker ────────────────────────────────────────────
virtualisation.docker = {
enable = true;
autoPrune = {
enable = true;
dates = "weekly";
flags = [ "--all" "--volumes" ];
};
daemon.settings = {
"log-driver" = "json-file";
"log-opts" = {
"max-size" = "10m";
"max-file" = "5";
};
# IMPORTANT: live-restore must be false for Swarm.
# Docker Swarm manages service containers itself, and
# Docker documents live-restore as incompatible with
# swarm mode.
"live-restore" = false;
};
};
# ── Packages ──────────────────────────────────────────
environment.systemPackages = with pkgs; [
git curl jq sops age
];
# ── Journald ──────────────────────────────────────────
services.journald.extraConfig = ''
Storage=persistent
SystemMaxUse=1G
'';
# ── sops-nix ──────────────────────────────────────────
# Each node has an age key at /var/lib/sops-nix/key.txt
# that can decrypt secrets.yaml. You generate this key
# during node provisioning (Step 9).
sops = {
defaultSopsFile = ./secrets.yaml;
defaultSopsFormat = "yaml";
age.keyFile = "/var/lib/sops-nix/key.txt";
};
})
];
};
in
{
nixosConfigurations = {
"swarm-1" = mkHost { hostName = "swarm-1"; };
"swarm-2" = mkHost { hostName = "swarm-2"; };
"swarm-3" = mkHost { hostName = "swarm-3"; };
};
};
}
Per-host networking
Each node needs its own VPC interface configuration. Create hosts/swarm-1/networking.nix:
{ ... }:
{
# Replace ens9 with your VPC interface name and the IP with
# the VPC private IP assigned to this node by your provider.
networking.interfaces.ens9.useDHCP = false;
networking.interfaces.ens9.ipv4.addresses = [
{ address = "10.8.96.2"; prefixLength = 20; }
];
}
Import this in the flake by adding it to the imports list:
imports = [
./hosts/${hostName}/hardware-configuration.nix
./hosts/${hostName}/networking.nix
];
Do the same for swarm-2 and swarm-3 with their respective VPC IPs.
Generate the lock file
From the repo root on your workstation:
nix --extra-experimental-features "nix-command flakes" flake lock
This resolves and pins all flake inputs (nixpkgs version, sops-nix version) in flake.lock. This file is what makes builds reproducible: every node will use the exact same package versions. Commit it to the repo.
Step 3: sops-nix configuration
Secrets (API keys, database passwords, tokens) live in the repo encrypted with age. Each node has its own age key that can decrypt them. Your workstation also has a recipient so you can encrypt new secrets locally.
Create .sops.yaml
keys:
# Your workstation admin recipient (from Step 0)
- &admin_age age1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Node recipients - placeholders until you provision each node
- &swarm1_age PLACEHOLDER_SWARM1
- &swarm2_age PLACEHOLDER_SWARM2
- &swarm3_age PLACEHOLDER_SWARM3
creation_rules:
- path_regex: secrets\.yaml$
key_groups:
- age:
- *admin_age
- *swarm1_age
- *swarm2_age
- *swarm3_age
The creation_rules block says: any file matching secrets.yaml must be encrypted to all four recipients. This means your workstation can decrypt (for editing) and each node can decrypt (for applying config).
Create secrets.yaml (plaintext placeholder)
swarm:
placeholder: "replace-me"
You will encrypt this after all three nodes are provisioned and their age recipients are filled in.
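Because encrypting too early would bake placeholder recipients into the process, a guard like this can be run before sops -e. It is a minimal sketch that simply greps for the PLACEHOLDER_ prefix used in the .sops.yaml above; the function name sops_recipients_ready is made up for illustration.

```shell
#!/bin/sh
# sops_recipients_ready FILE: fail if FILE still contains any of the
# PLACEHOLDER_ recipients, printing the offending lines.
sops_recipients_ready() {
  if grep -n 'PLACEHOLDER_' "$1"; then
    echo "placeholder recipients remain in $1; do not encrypt yet" >&2
    return 1
  fi
  return 0
}

# Usage, from the repo root:
#   sops_recipients_ready .sops.yaml && echo "safe to encrypt secrets.yaml"
```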
Step 4: GitHub Actions workflow
This is the CI/CD pipeline. On every push to main, it SSHes into each node, pulls the repo, runs nixos-rebuild switch, then auto-discovers and deploys all Swarm stacks.
Create .github/workflows/deploy.yml:
name: deploy
on:
push:
branches: ["main"]
# Only one deploy can run at a time. If you push twice quickly,
# the second push cancels the first. This prevents two rebuilds
# from racing on the same node.
concurrency:
group: deploy-swarm
cancel-in-progress: true
jobs:
deploy:
runs-on: ubuntu-latest
env:
SSH_OPTS: >-
-o IdentitiesOnly=yes -o BatchMode=yes -o StrictHostKeyChecking=yes
SCP_OPTS: -o StrictHostKeyChecking=yes
REMOTE_STACK_DIR: /tmp/swarm-stacks
steps:
- name: Checkout
uses: actions/checkout@v4
# Decode the base64-encoded CI deploy private key from GitHub secrets,
# write it to disk, and load it into ssh-agent.
- name: Setup SSH key
env:
DEPLOY_KEY_B64: ${{ secrets.CI_DEPLOY_PRIVATE_KEY_B64 }}
run: |
set -euo pipefail
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
SSH_KEY_PATH="$HOME/.ssh/swarm_ci_deploy_ed25519"
printf '%s' "$DEPLOY_KEY_B64" | tr -d '\r' | base64 -d > "$SSH_KEY_PATH"
chmod 600 "$SSH_KEY_PATH"
# Validate the key without leaking material (prints fingerprint only)
ssh-keygen -lf "$SSH_KEY_PATH"
eval "$(ssh-agent -s)"
ssh-add "$SSH_KEY_PATH"
echo "SSH_KEY_PATH=$SSH_KEY_PATH" >> "$GITHUB_ENV"
# Pinned known_hosts prevents MITM. These are the SSH host key
# lines you collected from each node (Step 11).
- name: Add pinned known_hosts
env:
SWARM_KNOWN_HOSTS: ${{ secrets.SWARM_KNOWN_HOSTS }}
run: |
set -euo pipefail
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
printf '%s\n' "$SWARM_KNOWN_HOSTS" >> "$HOME/.ssh/known_hosts"
chmod 600 "$HOME/.ssh/known_hosts"
# ── NixOS rebuild on each node ──────────────────────────────
# The pattern is identical for each node:
# 1. SSH in as the deploy user
# 2. cd to /etc/nixos (which is the repo checkout)
# 3. git fetch + reset to origin/main (sudo because root owns /etc/nixos)
# 4. nixos-rebuild switch with the node's flake target
- name: Deploy NixOS swarm-1
env:
HOST: ${{ secrets.SWARM1_HOST }}
USER: ${{ secrets.SWARM_USER }}
TARGET: ${{ secrets.SWARM1_FLAKE_TARGET }}
run: |
set -euo pipefail
ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
set -euo pipefail
cd /etc/nixos
sudo -n /run/current-system/sw/bin/git fetch --prune origin
sudo -n /run/current-system/sw/bin/git checkout -f main
sudo -n /run/current-system/sw/bin/git reset --hard origin/main
sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
--flake \"path:/etc/nixos#${TARGET}\"
'"
- name: Deploy NixOS swarm-2
env:
HOST: ${{ secrets.SWARM2_HOST }}
USER: ${{ secrets.SWARM_USER }}
TARGET: ${{ secrets.SWARM2_FLAKE_TARGET }}
run: |
set -euo pipefail
ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
set -euo pipefail
cd /etc/nixos
sudo -n /run/current-system/sw/bin/git fetch --prune origin
sudo -n /run/current-system/sw/bin/git checkout -f main
sudo -n /run/current-system/sw/bin/git reset --hard origin/main
sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
--flake \"path:/etc/nixos#${TARGET}\"
'"
- name: Deploy NixOS swarm-3
env:
HOST: ${{ secrets.SWARM3_HOST }}
USER: ${{ secrets.SWARM_USER }}
TARGET: ${{ secrets.SWARM3_FLAKE_TARGET }}
run: |
set -euo pipefail
ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
set -euo pipefail
cd /etc/nixos
sudo -n /run/current-system/sw/bin/git fetch --prune origin
sudo -n /run/current-system/sw/bin/git checkout -f main
sudo -n /run/current-system/sw/bin/git reset --hard origin/main
sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
--flake \"path:/etc/nixos#${TARGET}\"
'"
# ── Auto-discover and deploy all Swarm stacks ───────────────
# This finds every *-stack.yml in the swarm/ directory, copies
# them to the manager node, and deploys each one.
- name: Deploy Swarm stacks (auto-discover)
env:
HOST: ${{ secrets.SWARM1_HOST }}
USER: ${{ secrets.SWARM_USER }}
# Add your stack-specific secrets here as env vars.
# Each one becomes available for variable substitution
# in stack YAML files (e.g. ${NEWT_ID} in the compose).
NEWT_ID: ${{ secrets.NEWT_ID }}
NEWT_SECRET: ${{ secrets.NEWT_SECRET }}
# Add more as needed for other stacks:
# DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
run: |
set -euo pipefail
# Find all stack files in the swarm/ directory
mapfile -t STACK_FILES < <(
find swarm -maxdepth 1 -type f \
\( -name '*-stack.yml' -o -name '*-stack.yaml' \) | sort
)
if [ "${#STACK_FILES[@]}" -eq 0 ]; then
echo "No stack files found under ./swarm/"
exit 1
fi
# Clean the remote staging directory
ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
set -euo pipefail
mkdir -p \"${REMOTE_STACK_DIR}\"
rm -f \"${REMOTE_STACK_DIR}\"/*-stack.y*ml
rm -f \"${REMOTE_STACK_DIR}/deploy.env\"
'"
# Copy each stack file to the manager node
for f in "${STACK_FILES[@]}"; do
base="$(basename "$f")"
scp -i "$SSH_KEY_PATH" $SCP_OPTS \
"$f" "${USER}@${HOST}:${REMOTE_STACK_DIR}/${base}"
done
# Build an env file with all secrets. This file is copied
# to the node, sourced before deploy, then deleted.
ENV_LOCAL="$(mktemp)"
chmod 600 "$ENV_LOCAL"
{
printf 'export NEWT_ID=%q\n' "$NEWT_ID"
printf 'export NEWT_SECRET=%q\n' "$NEWT_SECRET"
# Add more secrets here matching your env vars above:
# printf 'export DB_PASSWORD=%q\n' "$DB_PASSWORD"
} > "$ENV_LOCAL"
scp -i "$SSH_KEY_PATH" $SCP_OPTS \
"$ENV_LOCAL" "${USER}@${HOST}:${REMOTE_STACK_DIR}/deploy.env"
rm -f "$ENV_LOCAL"
# Deploy all stacks on the manager node
ssh -i "$SSH_KEY_PATH" $SSH_OPTS "${USER}@${HOST}" "bash -lc '
set -euo pipefail
chmod 600 \"${REMOTE_STACK_DIR}/deploy.env\"
source \"${REMOTE_STACK_DIR}/deploy.env\"
# Create the overlay network if it does not exist.
# All stacks that need to talk to each other (or to
# the tunnel agent) should attach to this network.
docker network inspect pangolin >/dev/null 2>&1 \
|| docker network create -d overlay --attachable pangolin
# Deploy each stack file. The stack name is derived
# from the filename: myapp-stack.yml becomes "myapp".
for f in \"${REMOTE_STACK_DIR}\"/*-stack.yml \
\"${REMOTE_STACK_DIR}\"/*-stack.yaml; do
[ -e \"\$f\" ] || continue
base=\"\$(basename \"\$f\")\"
stack=\"\${base%-stack.yml}\"
stack=\"\${stack%-stack.yaml}\"
docker stack deploy -c \"\$f\" \"\$stack\"
done
# Clean up the secrets file
rm -f \"${REMOTE_STACK_DIR}/deploy.env\"
docker stack ls
'"
This auto-discover pattern means adding a new service to the cluster is:
- Create swarm/my-new-service-stack.yml
- If it needs secrets, add them to GitHub secrets and to the deploy.env block in the workflow
- Push to main
That is it. The pipeline finds the new file and deploys it.
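The stack name the pipeline uses comes straight from the filename, using the same suffix-stripping as the deploy loop in the workflow. The snippet below isolates that logic so you can preview the name a new file will get; stack_name is a hypothetical helper written for this example.

```shell
#!/bin/sh
# stack_name PATH: derive the Swarm stack name from a stack file path
# by dropping the directory and the -stack.yml / -stack.yaml suffix.
stack_name() {
  base="$(basename "$1")"
  name="${base%-stack.yml}"
  name="${name%-stack.yaml}"
  printf '%s\n' "$name"
}

# Preview the names for two example files.
stack_name swarm/myapp-stack.yml
stack_name swarm/my-new-service-stack.yaml
```

Renaming a stack file therefore creates a new stack rather than updating the old one, so the old stack has to be removed by hand with docker stack rm.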
Step 5: Commit and push
cd my-swarm-cluster
git init
git add -A
git commit -m "Initial NixOS Swarm cluster skeleton"
git branch -M main
git remote add origin git@github.com:your-org/my-swarm-cluster.git
git push -u origin main
The GitHub Actions workflow will fail at this point because the secrets are not configured yet and the nodes do not exist. That is expected.
Step 6: Provision the VPS nodes
For each of the three nodes, follow this sequence. Do node 1 first (it has extra steps), then repeat for nodes 2 and 3.
Create the VPS
Pick your provider (Vultr, Hetzner, DigitalOcean, etc). Requirements:
- All three nodes in the same region
- All three attached to the same VPC / private network
- Shared storage attached to all three (virtiofs, NFS volume, etc)
- NixOS ISO available (some providers offer it directly, others require you to upload the ISO)
Record for each node: the public IP and the VPC private IP.
Provider firewall rules
Set up provider-level firewall rules (this is why the flake sets networking.firewall.enable = false):
Inbound from your workstation IP:
- TCP 22 (SSH)
Inbound between the 3 swarm node private IPs only (VPC internal):
- TCP 2377 (Swarm management)
- TCP 7946 (Swarm gossip)
- UDP 7946 (Swarm gossip)
- UDP 4789 (VXLAN overlay traffic)
Inbound from your reverse proxy (if it has a VPC IP):
- Scope to the proxy's IP. If using Pangolin with a tunnel (Newt), you do not need this because traffic arrives through the overlay network, not via IP.
Install NixOS from ISO
- Boot the VPS from the NixOS graphical ISO
- Install to disk with your preferred region/keyboard settings
- Create a temporary local user during install (e.g. local). This user is throwaway; it gets replaced when you apply the flake.
- Shut down, detach the ISO, boot from disk
Bootstrap SSH access
Log into the VPS console (provider web console) as the temporary user. Edit /etc/nixos/configuration.nix and add these blocks inside the top-level { ... }:
services.openssh = {
enable = true;
openFirewall = false;
settings = {
PasswordAuthentication = false;
KbdInteractiveAuthentication = false;
PermitRootLogin = "no";
};
};
networking.firewall.enable = false;
Also add git, curl, jq, and age to environment.systemPackages. Set the hostname. Apply:
sudo nixos-rebuild switch
Add your admin SSH key
On the VPS console as the temporary user:
mkdir -p ~/.ssh && chmod 700 ~/.ssh
nano ~/.ssh/authorized_keys
# Paste your admin public key (from swarm_admin_ed25519.pub) on a single line
chmod 600 ~/.ssh/authorized_keys
Test from your workstation:
ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes local@NODE_PUBLIC_IP 'whoami'
If it prints local, SSH is working. You can now close the VPS console and work over SSH from here.
Step 7: Set up GitHub repo access on the node
The GitHub Actions workflow runs git fetch and git reset under sudo on the node, so root needs read access to the repo. Each node gets its own read-only deploy key.
SSH in as the temporary user, then become root:
ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes local@NODE_PUBLIC_IP
sudo -i
Generate a deploy key:
mkdir -p /root/.ssh && chmod 700 /root/.ssh
ssh-keygen -t ed25519 -a 64 -f /root/.ssh/repo_deploy_ed25519 -N ""
cat /root/.ssh/repo_deploy_ed25519.pub
In your GitHub repo settings, go to Settings > Deploy keys > Add deploy key:
- Title: swarm-1-readonly (use a distinct name per node)
- Paste the public key
- Leave "Allow write access" off
Configure root's SSH config for GitHub:
cat >/root/.ssh/config <<'EOF'
Host github.com
HostName github.com
User git
IdentityFile /root/.ssh/repo_deploy_ed25519
IdentitiesOnly yes
EOF
chmod 600 /root/.ssh/config
Test:
ssh -T git@github.com || true
You should see a message like "Hi your-org/my-swarm-cluster! You've successfully authenticated, but GitHub does not provide shell access."
Step 8: Replace /etc/nixos with the repo
Still as root on the node:
Back up the auto-generated hardware config
cp -a /etc/nixos/hardware-configuration.nix /root/hardware-configuration.nix
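Copy the contents of this file to your workstation, commit it into the repo at hosts/swarm-1/hardware-configuration.nix, and push before you clone the repo onto the node. This file is unique to each machine (it describes disks, PCI devices, and the kernel modules needed for the specific hardware), and the flake imports it, so the rebuild fails if it is missing from the checkout.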
Clone the repo into /etc/nixos
rm -rf /etc/nixos
git clone git@github.com:your-org/my-swarm-cluster.git /etc/nixos
From this point, /etc/nixos is a live checkout of your repo. Every nixos-rebuild switch reads its config from this checkout.
Step 9: Generate the node's age key and apply the flake
Generate the age key for sops-nix
sudo -i
mkdir -p /var/lib/sops-nix && chmod 700 /var/lib/sops-nix
age-keygen -o /var/lib/sops-nix/key.txt
chmod 600 /var/lib/sops-nix/key.txt
The command prints the public recipient (starts with age1...). Copy it. On your workstation, replace the PLACEHOLDER_SWARM1 entry in .sops.yaml with this value. Commit and push the change, so the node can pull it in the next step.
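If the terminal output has scrolled away, the recipient can be recovered from the key file itself, because age-keygen writes it as a comment line (`# public key: age1...`) above the secret key. A minimal sketch, assuming that file layout; `age_recipient_from_keyfile` is a hypothetical helper name, not something the pipeline defines:

```shell
# Print the public recipient recorded in an age identity file.
# Relies on the "# public key: age1..." comment that age-keygen writes.
age_recipient_from_keyfile() {
  sed -n 's/^# public key: \(age1[0-9a-z]*\)$/\1/p' "$1"
}

# Example (on the node, as root):
#   age_recipient_from_keyfile /var/lib/sops-nix/key.txt
```

Running `age-keygen -y /var/lib/sops-nix/key.txt` should give the same answer by deriving the recipient from the secret key instead of reading the comment.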
Apply the flake
cd /etc/nixos
git pull # pick up the .sops.yaml change you just pushed
nixos-rebuild switch --flake "path:/etc/nixos#swarm-1"
This is the moment the node transitions from its temporary install config to the full production config defined in your flake. When it completes:
- The admin and deploy users exist
- The temporary local user from install no longer exists (because users.mutableUsers = false and it is not defined in the flake)
- Docker is running
- SSH is hardened
- fail2ban is active
- The shared filesystem is mounted
Verify
# Shared filesystem mounted
mount | grep /mnt/vfs
df -h /mnt/vfs
# Docker running
systemctl status docker --no-pager
docker info | head
# Admin user works (from workstation)
ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@NODE_PUBLIC_IP 'whoami'
The last command should print admin. If it does, you can stop using the temporary user entirely. It does not exist anymore.
Step 10: Repeat for nodes 2 and 3
Follow Steps 6 through 9 for each remaining node. Remember:
- Each node gets its own hardware-configuration.nix in the repo
- Each node gets its own deploy key in GitHub (with a distinct title)
- Each node gets its own age key, and the recipient goes into .sops.yaml
After all three nodes are provisioned and all age recipients are in .sops.yaml, encrypt the secrets file on your workstation:
nix --extra-experimental-features "nix-command flakes" shell nixpkgs#sops nixpkgs#age -c \
sops -e -i secrets.yaml
Commit and push:
git add -A
git commit -m "Add all sops recipients and encrypt secrets"
git push
Step 11: Pin SSH host keys for GitHub Actions
You need to tell GitHub Actions what SSH host keys to expect from each node. This prevents man-in-the-middle attacks during deployment.
On each node, get the pinned host key line:
sudo -i
PUB_IP="$(curl -fsS https://api.ipify.org)"
printf "%s %s\n" "$PUB_IP" "$(cat /etc/ssh/ssh_host_ed25519_key.pub)"
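Copy the output from each node. Combine all three lines into one text blob (three lines total, one per node). This becomes the SWARM_KNOWN_HOSTS GitHub secret. If you plan to put DNS names rather than IPs in the SWARM1_HOST, SWARM2_HOST, and SWARM3_HOST secrets, replace the leading IP on each line with that name, because SSH matches known_hosts entries against the host it was asked to connect to.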
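A malformed blob here only shows up later as a failed Actions run, so it can be worth checking the file before pasting it into GitHub. The sketch below assumes you saved the three lines to a local file; `check_known_hosts` is a hypothetical helper, and it only checks the shape of each line (host, ssh-ed25519, key), not whether the key is the right one:

```shell
# Sanity-check a known_hosts blob: every non-blank line must look like
# "<host> ssh-ed25519 <key> [comment]" and the entry count must match.
check_known_hosts() {
  local file="$1" expected="${2:-3}" count=0 host keytype key rest
  while read -r host keytype key rest || [ -n "$host" ]; do
    [ -z "$host" ] && continue            # skip blank lines
    if [ "$keytype" != "ssh-ed25519" ] || [ -z "$key" ]; then
      echo "bad line for host '$host'" >&2
      return 1
    fi
    count=$((count + 1))
  done < "$file"
  if [ "$count" -ne "$expected" ]; then
    echo "expected $expected entries, found $count" >&2
    return 1
  fi
  echo "ok: $count pinned host keys"
}

# Example:
#   check_known_hosts ./swarm_known_hosts.txt 3
```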
Step 12: Add GitHub Actions secrets
In your repo settings, go to Settings > Secrets and variables > Actions > New repository secret. Add:
| Secret | Value |
|---|---|
| CI_DEPLOY_PRIVATE_KEY_B64 | Base64-encoded CI deploy private key: base64 -w0 ~/.ssh/swarm_ci_deploy_ed25519 |
| SWARM_KNOWN_HOSTS | The three pinned host key lines from Step 11 |
| SWARM_USER | deploy |
| SWARM1_HOST | Node 1 public IP or DNS |
| SWARM2_HOST | Node 2 public IP or DNS |
| SWARM3_HOST | Node 3 public IP or DNS |
| SWARM1_FLAKE_TARGET | swarm-1 |
| SWARM2_FLAKE_TARGET | swarm-2 |
| SWARM3_FLAKE_TARGET | swarm-3 |
| NEWT_ID | From the Pangolin dashboard (see "Connecting the Swarm cluster to Pangolin" in Part 2) |
| NEWT_SECRET | From the Pangolin dashboard (see "Connecting the Swarm cluster to Pangolin" in Part 2) |
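The base64 value is the one secret in this table that is easy to mangle (line wrapping, a stray newline from copy and paste). A small sketch for producing it and confirming it decodes back to the original file; both function names are hypothetical, and the `tr -d '\n'` is there as an assumption-free way to get a single line on systems whose base64 lacks `-w0`:

```shell
# Encode a file as single-line base64 (equivalent to GNU `base64 -w0`).
encode_secret_file() {
  base64 < "$1" | tr -d '\n'
}

# Confirm the encoded value decodes back to the exact original bytes
# by comparing checksums of the file and the decoded stream.
roundtrip_ok() {
  local src="$1" want got
  want="$(cksum < "$src")"
  got="$(encode_secret_file "$src" | base64 -d | cksum)"
  [ "$want" = "$got" ]
}

# Example:
#   roundtrip_ok ~/.ssh/swarm_ci_deploy_ed25519 && encode_secret_file ~/.ssh/swarm_ci_deploy_ed25519
```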
Push any commit to main. Watch the Actions run. All three nodes should rebuild successfully and report the stack deploy.
Step 13: Confirm VPC networking and initialise Docker Swarm
Before forming the Swarm cluster, confirm the VPC networking is correct.
Verify on each node
# Check the VPC interface has the right IP
ip -br a | grep -E 'ens3|ens9'
# Confirm routing uses the VPC interface
ip route get 10.8.96.1
You want to see ens9 (or your VPC interface name) with a 10.x.x.x/20 address, and the route going via that interface.
Confirm cross-node connectivity
From node 1, ping node 2 and node 3 using their VPC private IPs:
ping -c 2 10.8.96.X # node 2
ping -c 2 10.8.96.Y # node 3
If pings fail, check your provider's VPC configuration and firewall rules.
Initialise Swarm on node 1
SSH in as admin:
ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@SWARM1_PUBLIC_IP
Init Swarm using the VPC private IP (not the public IP). Swarm's inter-node traffic must go over the VPC:
docker swarm init --advertise-addr SWARM1_VPC_PRIVATE_IP
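Advertising the public IP by accident is an easy mistake to make at this step. One way to guard against it is to check that the address falls in an RFC 1918 private range before handing it to Docker. This is a sketch, not part of the pipeline; `is_private_ipv4` is a hypothetical helper, and it assumes your VPC uses one of the standard private ranges:

```shell
# Return 0 if the argument is a dotted-quad IPv4 address in
# 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16; return 1 otherwise.
is_private_ipv4() {
  local ip="$1" a b c d extra octet
  case "$ip" in
    *[!0-9.]*|"") return 1 ;;             # only digits and dots allowed
  esac
  IFS=. read -r a b c d extra <<EOF
$ip
EOF
  [ -n "$d" ] && [ -z "$extra" ] || return 1   # exactly four fields
  for octet in "$a" "$b" "$c" "$d"; do
    [ -n "$octet" ] && [ "$octet" -le 255 ] || return 1
  done
  [ "$a" -eq 10 ] && return 0
  [ "$a" -eq 172 ] && [ "$b" -ge 16 ] && [ "$b" -le 31 ] && return 0
  [ "$a" -eq 192 ] && [ "$b" -eq 168 ] && return 0
  return 1
}

# Example:
#   ADDR=SWARM1_VPC_PRIVATE_IP
#   is_private_ipv4 "$ADDR" && docker swarm init --advertise-addr "$ADDR"
```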
Get the manager join token
docker swarm join-token manager
Copy the full docker swarm join command from the output.
Join node 2
ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@SWARM2_PUBLIC_IP
docker swarm join \
--token TOKEN_FROM_OUTPUT \
--advertise-addr SWARM2_VPC_PRIVATE_IP \
SWARM1_VPC_PRIVATE_IP:2377
The --advertise-addr flag tells this node to advertise its own VPC IP to the cluster.
Join node 3
Same as node 2 but with node 3's VPC IP:
ssh -i ~/.ssh/swarm_admin_ed25519 -o IdentitiesOnly=yes admin@SWARM3_PUBLIC_IP
docker swarm join \
--token TOKEN_FROM_OUTPUT \
--advertise-addr SWARM3_VPC_PRIVATE_IP \
SWARM1_VPC_PRIVATE_IP:2377
Validate
Back on node 1:
docker node ls
You should see three nodes, all Status: Ready, with Manager Status showing one Leader and two Reachable.
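The reason for three managers is Raft quorum: a cluster of N managers needs a majority (N/2 + 1, rounded down) to keep accepting changes, so three managers tolerate the loss of one. The sketch below turns the Manager Status column into that arithmetic. `manager_summary` is a hypothetical helper that reads one status per line on stdin; the `--format '{{.ManagerStatus}}'` template in the usage comment is how I would expect to feed it:

```shell
# Summarise manager health from `docker node ls` Manager Status values.
# Succeeds only if enough managers are healthy to hold quorum.
manager_summary() {
  local total=0 healthy=0 status quorum
  while read -r status || [ -n "$status" ]; do
    case "$status" in
      Leader|Reachable) total=$((total + 1)); healthy=$((healthy + 1)) ;;
      Unreachable)      total=$((total + 1)) ;;
      *)                ;;                # workers have an empty Manager Status
    esac
  done
  quorum=$((total / 2 + 1))
  echo "managers=$total healthy=$healthy quorum=$quorum tolerates=$(( (total - 1) / 2 ))"
  [ "$healthy" -ge "$quorum" ]
}

# Example (on any manager):
#   docker node ls --format '{{.ManagerStatus}}' | manager_summary
```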
Step 14: Deploy your first Swarm stack
Create a stack file in the repo. Here is a minimal example:
# swarm/myapp-stack.yml
version: "3.9"
services:
myapp:
image: myapp/myapp:latest
networks:
pangolin:
aliases:
- myapp
deploy:
replicas: 2
restart_policy:
condition: any
update_config:
parallelism: 1
order: start-first
networks:
pangolin:
external: true
Key things to note:
- No published ports. The service does not expose any host ports. Traffic reaches it via the overlay network through the Pangolin tunnel. This is the pattern for all services behind the proxy.
- networks: pangolin: external: true means this stack attaches to the pangolin overlay network that the deploy pipeline creates. All stacks that need to communicate (with each other or with the tunnel agent) should use this network.
- order: start-first means during a rolling update, Swarm starts the new container before stopping the old one. This gives you zero-downtime deployments.
- replicas: 2 means two copies run across the cluster. Swarm distributes them across different nodes for redundancy.
Push to main. The pipeline auto-discovers the stack file, copies it to the manager, and runs docker stack deploy. Verify:
docker stack ls
docker service ls
docker service ps myapp_myapp
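If you would rather have a pass/fail answer than read the table by eye, the Replicas column (for example 2/2) can be compared mechanically. A sketch, assuming input lines of the form "name running/desired"; `unconverged_services` is a hypothetical helper, not something the pipeline ships:

```shell
# Print every service whose running replica count differs from the
# desired count. Succeeds only when all services have converged.
unconverged_services() {
  local name replicas rest running desired bad=0
  while read -r name replicas rest || [ -n "$name" ]; do
    [ -z "$name" ] && continue
    running="${replicas%%/*}"             # text before the slash
    desired="${replicas#*/}"              # text after the slash
    if [ "$running" != "$desired" ]; then
      echo "$name $replicas"
      bad=1
    fi
  done
  [ "$bad" -eq 0 ]
}

# Example (on a manager):
#   docker service ls --format '{{.Name}} {{.Replicas}}' | unconverged_services
```

Right after a deploy the counts may lag for a few seconds while images pull, so a non-zero result is only meaningful once the rollout has had time to settle.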
Using shared storage with stacks
For services that need persistent data (databases, config files, uploads), use the shared filesystem mount. Because /mnt/vfs is mounted on all three nodes, a service can run on any node and access the same data.
First, create the volume directories. You only need to do this once, on any node (the filesystem is shared):
sudo -i
install -d -m 0750 -o root -g docker /mnt/vfs/docker-volumes/myapp
install -d -m 0750 -o root -g docker /mnt/vfs/docker-volumes/myapp/data
install -d -m 0750 -o root -g docker /mnt/vfs/docker-volumes/myapp/config
Then reference them as bind mounts in the stack:
services:
myapp:
image: myapp/myapp:latest
volumes:
- type: bind
source: /mnt/vfs/docker-volumes/myapp/data
target: /app/data
- type: bind
source: /mnt/vfs/docker-volumes/myapp/config
target: /app/config
networks:
pangolin:
aliases:
- myapp
deploy:
replicas: 1
restart_policy:
condition: any
If you are migrating an existing service, copy its data into the bind mount directory before deploying the stack.
Deploying in-house apps from a private container registry
For custom applications that you build from source, the pipeline can build a Docker image, push it to a private registry, and deploy the stack with the new image tag.
Add the app source to the repo
apps/
my-internal-app/
Dockerfile
app.py
requirements.txt
Create the stack file with an image tag variable
# swarm/my-internal-app-stack.yml
version: "3.9"
services:
my-internal-app:
image: registry.example.com/my-org/my-internal-app:${APP_TAG}
command: ["python", "-u", "app.py"]
networks:
pangolin:
aliases:
- my-internal-app
deploy:
replicas: 1
restart_policy:
condition: any
update_config:
order: start-first
networks:
pangolin:
external: true
The ${APP_TAG} is set to the Git commit SHA by the pipeline, so every push produces a unique, traceable image tag.
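Because the substitution happens from the shell environment at deploy time, a variable that never made it into that environment tends to surface as a confusing image reference error rather than a clear message. A small pre-flight check can fail early instead. This is a sketch; `require_vars` is a hypothetical helper, and it should only ever be called with literal variable names since it uses eval:

```shell
# Fail if any of the named environment variables is unset or empty.
# Only pass trusted, literal names: the lookup uses eval.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example (before the deploy loop):
#   require_vars APP_TAG NEWT_ID NEWT_SECRET || exit 1
```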
Add GitHub secrets
Add REGISTRY_USERNAME and REGISTRY_API_KEY to your repo secrets.
Add build steps to the workflow
Insert these steps before the "Deploy Swarm stacks" step:
- name: Set image tag
run: echo "APP_TAG=${GITHUB_SHA}" >> "$GITHUB_ENV"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to private registry
uses: docker/login-action@v3
with:
registry: registry.example.com
username: ${{ secrets.REGISTRY_USERNAME }}
password: ${{ secrets.REGISTRY_API_KEY }}
- name: Build and push my-internal-app
uses: docker/build-push-action@v6
with:
context: ./apps/my-internal-app
push: true
tags: |
registry.example.com/my-org/my-internal-app:${{ env.APP_TAG }}
Then in the stack deploy step, add registry login on the Swarm manager so it can pull the private image. The key detail is using a temporary Docker config directory and --with-registry-auth:
# Inside the "Deploy Swarm stacks" SSH block, before the deploy loop:
export DOCKER_CONFIG="$(mktemp -d)"
echo "$REGISTRY_API_KEY" | docker login registry.example.com \
-u "$REGISTRY_USERNAME" --password-stdin
# In the deploy loop, use --with-registry-auth:
docker stack deploy --with-registry-auth -c "$f" "$stack"
# After the loop, clean up:
docker logout registry.example.com || true
rm -rf "$DOCKER_CONFIG"
unset DOCKER_CONFIG
This keeps the registry credentials off the manager's disk once the deploy is done. They exist in a temp directory for the duration of the deploy, then get deleted. The --with-registry-auth flag tells Swarm to send the registry authentication details to the nodes running the service so they can pull the image.
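One gap in the inline version is that the cleanup lines only run if the deploy succeeds; with set -e, a failed deploy exits before the temp directory is removed. A trap handles that. The sketch below wraps any command in a throwaway DOCKER_CONFIG; `with_temp_docker_config` and `deploy_all_stacks` are hypothetical names, not part of the workflow above:

```shell
# Run "$@" with DOCKER_CONFIG set to a temporary directory that is
# removed when the command finishes, whether it succeeds or fails.
with_temp_docker_config() {
  (
    DOCKER_CONFIG="$(mktemp -d)" || exit 1
    export DOCKER_CONFIG
    trap 'rm -rf "$DOCKER_CONFIG"' EXIT
    "$@"
  )
}

# Example (deploy_all_stacks stands in for a function holding the
# docker login and the deploy loop):
#   with_temp_docker_config deploy_all_stacks
```

The subshell keeps the exported variable and the trap from leaking into the rest of the script, and the wrapper returns whatever exit status the wrapped command produced.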
Part 2: The Pangolin Reverse Proxy VPS
The Pangolin VPS is a separate machine that handles inbound HTTPS traffic and routes it into the Swarm cluster through a WireGuard tunnel. It uses the exact same GitOps pattern: NixOS + flake + sops-nix + GitHub Actions, in its own repo.
If you have already read my Pangolin post, you know how Pangolin works at a high level. This section covers deploying it on NixOS with full GitOps, which is different from the quick-install Docker Compose approach in that post.
Repo layout
.
├── flake.nix
├── flake.lock
├── hardware-configuration.nix
├── secrets.yaml
├── .sops.yaml
├── pangolin/
│ ├── docker-compose.yml
│ └── config/
│ ├── config.yml
│ └── traefik/
│ ├── traefik_config.yml
│ └── dynamic_config.yml
└── .github/
└── workflows/
└── deploy.yml
The Pangolin flake
This flake differs from the Swarm flake in a few important ways:
- It runs a single host, not three
- The NixOS firewall is enabled (this VPS faces the internet directly, unlike the Swarm nodes which use provider-level rules)
- It includes kernel sysctl hardening (IP forwarding for Docker, plus security tunables)
- It defines a systemd service that runs docker-compose up -d to start Pangolin
- It uses sops-nix to inject the Pangolin SERVER_SECRET as an environment variable
{
description = "Pangolin reverse proxy on hardened NixOS with GitOps deploy";
inputs = {
nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";
sops-nix.url = "github:Mic92/sops-nix";
sops-nix.inputs.nixpkgs.follows = "nixpkgs";
};
outputs = { self, nixpkgs, sops-nix }:
let
system = "x86_64-linux";
mkHost = { hostName }:
nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
({ config, pkgs, lib, ... }:
{
imports = [
./hardware-configuration.nix
];
system.stateVersion = "25.11";
networking.hostName = hostName;
time.timeZone = "Europe/London";
nix.settings = {
experimental-features = [ "nix-command" "flakes" ];
warn-dirty = false;
};
boot.loader.grub.enable = true;
boot.loader.grub.devices = [ "/dev/vda" ];
boot.loader.grub.useOSProber = false;
# ── Kernel hardening ──────────────────────────────────
# ip_forward is required for Docker networking.
# The rest disables ICMP redirects, source routing,
# enables SYN cookies, and turns on reverse path filtering.
boot.kernel.sysctl = {
"net.ipv4.ip_forward" = 1;
"net.ipv4.conf.all.accept_redirects" = 0;
"net.ipv4.conf.default.accept_redirects" = 0;
"net.ipv4.conf.all.send_redirects" = 0;
"net.ipv4.conf.default.send_redirects" = 0;
"net.ipv4.conf.all.accept_source_route" = 0;
"net.ipv4.conf.default.accept_source_route" = 0;
"net.ipv4.tcp_syncookies" = 1;
"net.ipv4.conf.all.rp_filter" = 1;
"net.ipv4.conf.default.rp_filter" = 1;
};
# ── Users ─────────────────────────────────────────────
users.mutableUsers = false;
security.sudo.wheelNeedsPassword = true;
users.users.admin = {
isNormalUser = true;
description = "Admin";
extraGroups = [ "wheel" "docker" ];
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAA... your-admin-pubkey"
];
hashedPassword = "$6$...your-hash-here";
};
users.users.deploy = {
isNormalUser = true;
description = "CI Deploy User";
extraGroups = [ "wheel" "docker" ];
openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAA... your-ci-deploy-pubkey"
];
};
security.sudo.extraRules = [
{
users = [ "deploy" ];
commands = [
{ command = "/run/current-system/sw/bin/nixos-rebuild";
options = [ "NOPASSWD" ]; }
{ command = "/run/current-system/sw/bin/git";
options = [ "NOPASSWD" ]; }
{ command = "/run/current-system/sw/bin/true";
options = [ "NOPASSWD" ]; }
{ command = "/run/current-system/sw/bin/systemctl";
options = [ "NOPASSWD" ]; }
];
}
];
# ── SSH hardening ─────────────────────────────────────
services.openssh = {
enable = true;
# openFirewall = true here because the NixOS firewall
# is enabled on this VPS (unlike the Swarm nodes).
openFirewall = true;
settings = {
PasswordAuthentication = false;
KbdInteractiveAuthentication = false;
PermitRootLogin = "no";
X11Forwarding = false;
AllowTcpForwarding = "no";
AllowAgentForwarding = "no";
ClientAliveInterval = 300;
ClientAliveCountMax = 2;
MaxAuthTries = 3;
LogLevel = "VERBOSE";
};
};
services.fail2ban.enable = true;
# ── Firewall ──────────────────────────────────────────
# Unlike the Swarm nodes, the Pangolin VPS uses the
# NixOS firewall because it faces the internet directly.
networking.firewall = {
enable = true;
allowedTCPPorts = [
22 # SSH
80 # HTTP (Let's Encrypt ACME + redirect)
443 # HTTPS (Traefik)
];
allowedUDPPorts = [
51820 # Gerbil: site tunnels (Newt connections)
21820 # Gerbil: client tunnels (optional)
];
allowPing = false;
};
# ── Docker ────────────────────────────────────────────
virtualisation.docker = {
enable = true;
autoPrune = {
enable = true;
dates = "weekly";
flags = [ "--all" "--volumes" ];
};
daemon.settings = {
"log-driver" = "json-file";
"log-opts" = {
"max-size" = "10m";
"max-file" = "5";
};
# live-restore is fine here (no Swarm on this node)
"live-restore" = true;
};
};
environment.systemPackages = with pkgs; [
git curl jq openssl sops age docker-compose
];
services.journald.extraConfig = ''
Storage=persistent
SystemMaxUse=1G
'';
# ── sops-nix ──────────────────────────────────────────
sops = {
defaultSopsFile = ./secrets.yaml;
defaultSopsFormat = "yaml";
age.keyFile = "/var/lib/sops-nix/key.txt";
# Decrypt the pangolin server_secret from secrets.yaml
secrets."pangolin/server_secret" = { };
};
# Create an env file from the decrypted secret.
# Pangolin reads SERVER_SECRET from the environment.
sops.templates."pangolin.env" = {
content = ''
SERVER_SECRET=${config.sops.placeholder."pangolin/server_secret"}
'';
owner = "root";
group = "root";
mode = "0400";
};
# ── Pangolin directories ──────────────────────────────
systemd.tmpfiles.rules = [
"d /var/lib/pangolin 0750 root root -"
"d /var/lib/pangolin/db 0750 root root -"
"d /var/lib/pangolin/letsencrypt 0750 root root -"
"d /var/lib/pangolin/logs 0750 root root -"
"d /var/lib/pangolin/traefik 0750 root root -"
"d /opt/pangolin 0750 root root -"
];
# Copy compose and config files from the repo into
# their runtime locations on every nixos-rebuild.
system.activationScripts.pangolinFiles.text = ''
set -euo pipefail
install -d -m 0750 /opt/pangolin
install -m 0640 \
${./pangolin/docker-compose.yml} \
/opt/pangolin/docker-compose.yml
install -d -m 0750 /var/lib/pangolin
install -d -m 0750 /var/lib/pangolin/traefik
install -m 0640 \
${./pangolin/config/config.yml} \
/var/lib/pangolin/config.yml
install -m 0640 \
${./pangolin/config/traefik/traefik_config.yml} \
/var/lib/pangolin/traefik/traefik_config.yml
install -m 0640 \
${./pangolin/config/traefik/dynamic_config.yml} \
/var/lib/pangolin/traefik/dynamic_config.yml
'';
# ── Pangolin systemd service ──────────────────────────
# This runs docker-compose up/down as a systemd service,
# so Pangolin starts on boot and stops cleanly on shutdown.
# The EnvironmentFile points to the sops-rendered env file
# containing the decrypted SERVER_SECRET.
systemd.services.pangolin = {
description = "Pangolin (docker compose)";
after = [ "network-online.target" "docker.service" ];
wants = [ "network-online.target" "docker.service" ];
wantedBy = [ "multi-user.target" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
WorkingDirectory = "/opt/pangolin";
EnvironmentFile =
config.sops.templates."pangolin.env".path;
ExecStart =
"${pkgs.docker-compose}/bin/docker-compose up -d";
ExecStop =
"${pkgs.docker-compose}/bin/docker-compose down";
TimeoutStartSec = "300";
};
};
})
];
};
in
{
nixosConfigurations = {
"proxy-1" = mkHost { hostName = "proxy-1"; };
};
};
}
What the sops integration does
The important part here is the sops.templates block. When nixos-rebuild switch runs:
- sops-nix decrypts secrets.yaml using the node's age key
- It extracts the value at pangolin/server_secret
- It renders the pangolin.env template, substituting the placeholder with the real secret
- The rendered file lands at a path on disk with mode 0400 (root read-only)
- The systemd.services.pangolin service reads this file as its EnvironmentFile
- Docker Compose gets SERVER_SECRET as an environment variable
The secret never appears in the repo in plaintext. It is encrypted at rest, decrypted only on the node, and injected into the container environment at runtime.
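When Pangolin refuses to start, the first thing to rule out is an empty or missing SERVER_SECRET in the rendered env file. The check below confirms a key has a non-empty value without printing it. It is a sketch; `env_has_value` is a hypothetical helper, and the file to point it at is whatever path the unit's EnvironmentFile resolves to on your node:

```shell
# Succeed if FILE contains a line "KEY=<something non-empty>".
# Never echoes the value, so it is safe to run in a shared terminal.
env_has_value() {
  local file="$1" key="$2" line
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      "$key="?*) return 0 ;;              # at least one character after '='
    esac
  done < "$file"
  return 1
}

# Example (as root on the proxy node):
#   env_has_value /path/to/rendered/pangolin.env SERVER_SECRET && echo "secret present"
```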
Pangolin Docker Compose
Create pangolin/docker-compose.yml:
services:
pangolin:
image: fosrl/pangolin:latest
container_name: pangolin
restart: always
environment:
- SERVER_SECRET=${SERVER_SECRET}
volumes:
- /var/lib/pangolin:/app/config
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3001/api/v1/"]
interval: "3s"
timeout: "3s"
retries: 15
gerbil:
image: fosrl/gerbil:latest
container_name: gerbil
restart: always
depends_on:
pangolin:
condition: service_healthy
command:
- --reachableAt=http://gerbil:3004
- --generateAndSaveKeyTo=/var/config/key
- --remoteConfig=http://pangolin:3001/api/v1/
volumes:
- /var/lib/pangolin:/var/config
cap_add:
- NET_ADMIN
- SYS_MODULE
ports:
- 51820:51820/udp
- 21820:21820/udp
- 443:443
- 80:80
traefik:
image: traefik:v3.4.0
container_name: traefik
restart: always
network_mode: service:gerbil
depends_on:
pangolin:
condition: service_healthy
command:
- --configFile=/etc/traefik/traefik_config.yml
volumes:
- /var/lib/pangolin/traefik:/etc/traefik:ro
- /var/lib/pangolin/letsencrypt:/letsencrypt
- /var/lib/pangolin/traefik/logs:/var/log/traefik
networks:
default:
driver: bridge
name: pangolin
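Note that Traefik uses network_mode: service:gerbil, meaning it shares Gerbil's network namespace. Traefik listens on ports 80 and 443 inside that shared namespace, and the Gerbil container is the one that publishes them to the host, which is why the ports: entries sit under gerbil rather than traefik.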
Pangolin config files
Create pangolin/config/config.yml:
app:
dashboard_url: "https://proxy.example.com"
log_level: "info"
save_logs: true
log_failed_attempts: true
domains:
domain1:
base_domain: "proxy.example.com"
cert_resolver: "letsencrypt"
server:
trust_proxy: 1
gerbil:
base_endpoint: "proxy.example.com"
flags:
require_email_verification: false
disable_signup_without_invite: true
disable_user_create_org: true
Create pangolin/config/traefik/traefik_config.yml:
api:
insecure: false
dashboard: false
providers:
http:
endpoint: "http://pangolin:3001/api/v1/traefik-config"
pollInterval: "5s"
file:
filename: "/etc/traefik/dynamic_config.yml"
experimental:
plugins:
badger:
moduleName: "github.com/fosrl/badger"
version: "v1.3.0"
log:
level: "INFO"
format: "common"
certificatesResolvers:
letsencrypt:
acme:
httpChallenge:
entryPoint: web
email: "admin@example.com"
storage: "/letsencrypt/acme.json"
caServer: "https://acme-v02.api.letsencrypt.org/directory"
entryPoints:
web:
address: ":80"
websecure:
address: ":443"
transport:
respondingTimeouts:
readTimeout: "30m"
http:
tls:
certResolver: "letsencrypt"
serversTransport:
insecureSkipVerify: true
ping:
entryPoint: "web"
Create pangolin/config/traefik/dynamic_config.yml:
http:
middlewares:
badger:
plugin:
badger:
disableForwardAuth: true
redirect-to-https:
redirectScheme:
scheme: https
routers:
main-app-router-redirect:
rule: "Host(`proxy.example.com`)"
service: next-service
entryPoints:
- web
middlewares:
- redirect-to-https
- badger
next-router:
rule: "Host(`proxy.example.com`) && !PathPrefix(`/api/v1`)"
service: next-service
entryPoints:
- websecure
middlewares:
- badger
tls:
certResolver: letsencrypt
api-router:
rule: "Host(`proxy.example.com`) && PathPrefix(`/api/v1`)"
service: api-service
entryPoints:
- websecure
middlewares:
- badger
tls:
certResolver: letsencrypt
ws-router:
rule: "Host(`proxy.example.com`)"
service: api-service
entryPoints:
- websecure
middlewares:
- badger
tls:
certResolver: letsencrypt
services:
next-service:
loadBalancer:
servers:
- url: "http://pangolin:3002"
api-service:
loadBalancer:
servers:
- url: "http://pangolin:3000"
Replace proxy.example.com throughout with your actual domain.
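The placeholder appears in three files, so a scripted replacement is less error-prone than editing by hand. A minimal sketch, assuming the domain contains only the usual hostname characters (no slashes or ampersands, which would confuse sed); `set_proxy_domain` is a hypothetical helper:

```shell
# Replace every occurrence of proxy.example.com with DOMAIN in the
# given files. Writes through a temp file to avoid relying on `sed -i`,
# whose syntax differs between GNU and BSD sed.
set_proxy_domain() {
  local domain="$1" file tmp
  shift
  for file in "$@"; do
    tmp="$(mktemp)" || return 1
    if sed "s/proxy\.example\.com/$domain/g" "$file" > "$tmp"; then
      cat "$tmp" > "$file"                # keep the original file's mode
    else
      rm -f "$tmp"
      return 1
    fi
    rm -f "$tmp"
  done
}

# Example (from the repo root):
#   set_proxy_domain proxy.yourdomain.tld \
#     pangolin/config/config.yml \
#     pangolin/config/traefik/dynamic_config.yml
```

Review the result with git diff before committing, and remember the Newt stack in the Swarm repo references the same hostname.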
Pangolin GitHub Actions workflow
The deploy workflow for the Pangolin VPS is simpler than the Swarm one (single node, no stack discovery). Create .github/workflows/deploy.yml:
name: deploy
on:
push:
branches: ["main"]
concurrency:
group: deploy-proxy
cancel-in-progress: true
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup SSH key
env:
DEPLOY_KEY_B64: ${{ secrets.DEPLOY_SSH_PRIVATE_KEY_B64 }}
run: |
set -euo pipefail
mkdir -p ~/.ssh && chmod 700 ~/.ssh
printf '%s' "$DEPLOY_KEY_B64" | tr -d '\r' | base64 -d > ~/.ssh/deploy_key
chmod 600 ~/.ssh/deploy_key
ssh-keygen -lf ~/.ssh/deploy_key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/deploy_key
- name: Add pinned known_hosts
env:
VPS_KNOWN_HOSTS: ${{ secrets.VPS_KNOWN_HOSTS }}
run: |
set -euo pipefail
mkdir -p ~/.ssh && chmod 700 ~/.ssh
printf '%s\n' "$VPS_KNOWN_HOSTS" >> ~/.ssh/known_hosts
chmod 600 ~/.ssh/known_hosts
- name: Deploy
env:
VPS_HOST: ${{ secrets.VPS_HOST }}
VPS_USER: ${{ secrets.VPS_USER }}
FLAKE_TARGET: ${{ secrets.FLAKE_TARGET }}
run: |
set -euo pipefail
ssh -i ~/.ssh/deploy_key \
-o IdentitiesOnly=yes -o BatchMode=yes -o StrictHostKeyChecking=yes \
"${VPS_USER}@${VPS_HOST}" "bash -lc '
set -euo pipefail
cd /etc/nixos
sudo -n /run/current-system/sw/bin/git fetch --prune origin
sudo -n /run/current-system/sw/bin/git checkout -f main
sudo -n /run/current-system/sw/bin/git reset --hard origin/main
sudo -n /run/current-system/sw/bin/nixos-rebuild switch \
--flake \"path:/etc/nixos#${FLAKE_TARGET}\"
'"
Provisioning the Pangolin VPS
The process is identical to provisioning a Swarm node (Steps 6-9 in Part 1), with these differences:
- Only one node to provision, not three
- Point DNS (proxy.example.com A record) to this VPS's public IP before applying the flake, because Let's Encrypt needs to reach port 80 for ACME validation
- The NixOS firewall is enabled (the flake handles the port rules)
- After applying the flake, Pangolin's systemd service starts automatically, pulling Docker images and bringing up the compose stack
After nixos-rebuild switch:
# Check Pangolin containers
docker ps
# Check logs if something is restarting
docker logs pangolin --tail 100
docker logs traefik --tail 100
Common first-run issues:
- DNS not pointing to the VPS yet (Let's Encrypt fails)
- Port 80 blocked by provider firewall (ACME validation fails)
- SERVER_SECRET not decrypted properly (check sops config and age key)
Once the dashboard is up, open https://proxy.example.com/auth/initial-setup, create the admin account, and create your first Organisation.
Connecting the Swarm cluster to Pangolin
In the Pangolin dashboard, create a Site using Newt Tunnel. Pangolin generates a NEWT_ID and NEWT_SECRET. Add these as GitHub secrets in your Swarm cluster repo.
Then create the Newt stack in the Swarm repo:
# swarm/newt-stack.yml
version: "3.9"
services:
newt:
image: fosrl/newt:latest
environment:
PANGOLIN_ENDPOINT: "https://proxy.example.com"
NEWT_ID: "${NEWT_ID}"
NEWT_SECRET: "${NEWT_SECRET}"
networks:
- pangolin
deploy:
replicas: 1
restart_policy:
condition: any
networks:
pangolin:
external: true
Push to main. The pipeline deploys the Newt stack. In the Pangolin dashboard, the site status should move from Offline to Online.
Now create Resources in the Pangolin dashboard for each service you want to expose. The upstream target is the service's overlay network alias and port, e.g. http://myapp:8080. Pangolin routes HTTPS traffic through the tunnel to the Swarm overlay network, reaching the service without any published ports on the Swarm hosts.
Updating the OS
To update NixOS and all system packages across the entire infrastructure:
cd my-swarm-cluster
nix --extra-experimental-features "nix-command flakes" flake update
git add flake.lock
git commit -m "Update NixOS flake inputs"
git push
Do the same in the Pangolin repo. GitHub Actions picks up the new lock file and rebuilds each node with updated packages. That is the entire OS update process.
New laptop, who dis
If you change workstations, you need your SSH keys and optionally your age identity file (for editing encrypted secrets locally).
Restore SSH keys
Copy from your password manager or backup:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/swarm_admin_ed25519 ~/.ssh/swarm_ci_deploy_ed25519
chmod 644 ~/.ssh/swarm_admin_ed25519.pub ~/.ssh/swarm_ci_deploy_ed25519.pub
Set up SSH config
Host swarm-1
HostName NODE1_PUBLIC_IP
User admin
IdentityFile ~/.ssh/swarm_admin_ed25519
IdentitiesOnly yes
Host swarm-2
HostName NODE2_PUBLIC_IP
User admin
IdentityFile ~/.ssh/swarm_admin_ed25519
IdentitiesOnly yes
Host swarm-3
HostName NODE3_PUBLIC_IP
User admin
IdentityFile ~/.ssh/swarm_admin_ed25519
IdentitiesOnly yes
Host proxy-1
HostName PROXY_PUBLIC_IP
User admin
IdentityFile ~/.ssh/swarm_admin_ed25519
IdentitiesOnly yes
Restore age identity (for sops editing)
Only needed if you decrypt or edit secrets.yaml locally:
mkdir -p ~/.config/sops/age && chmod 700 ~/.config/sops/age
cp /path/from_backup/keys.txt ~/.config/sops/age/keys.txt
chmod 600 ~/.config/sops/age/keys.txt
Test:
cd my-swarm-cluster
sops -d secrets.yaml >/dev/null && echo "OK"
If you lost the SSH keys
Generate new ones, update the public keys in both flake.nix files (Swarm and Pangolin repos), update the base64-encoded CI key in both GitHub secrets, push, and rebuild. The nodes recover because the config is in the repo. You will need console access (provider web console) to the nodes one time to apply the first rebuild if you cannot SSH in with the old keys.
Verify the cluster
ssh swarm-1 'docker node ls'
ssh swarm-1 'docker stack ls'
ssh swarm-1 'docker service ls'
Closing
The combination of NixOS, Docker Swarm, and GitOps gives you something that is hard to achieve with traditional setups: a production container cluster where the entire state (OS config, services, secrets, and reverse proxy) is version-controlled, reproducible, and deployable from a single git push.
NixOS handles the OS layer declaratively. Swarm handles container orchestration without the complexity tax of Kubernetes. Pangolin handles the ingress layer with identity-aware routing through WireGuard tunnels. GitHub Actions ties it all together so the only manual step after initial provisioning is committing code.
If a Swarm node dies, you provision a new VPS, clone the repo, generate an age key, apply the flake, and join the swarm. If the proxy VPS dies, you provision a new one, point DNS at it, clone the repo, apply the flake, and Pangolin rebuilds itself from config. The system recovers from the repo because the repo is the system.
Quite neat.