Docker Compose permission errors when setting up HashiCorp Vault

Summary

A deployment of HashiCorp Vault using Docker Compose on Debian failed due to filesystem permission and capability conflicts. The container first reported a permission denied error while initializing its storage backend, then chown errors when the startup logic tried to fix ownership. The failure stems from a mismatch between the UID the container was forced to run as (1001), the vault user the image's entrypoint is written around, and the fact that a non-root process holds no effective Linux capabilities for changing file ownership or attributes.

Root Cause

The failure is caused by three compounding factors:

Host-Centric UID/GID Mismatch: The user defined in the Docker Compose file (1001:1001) does not exist inside the Vault container’s /etc/passwd. The kernel does not need that entry: it treats 1001 as a raw numeric UID and checks file permissions against the number alone. The mismatch matters because the image's entrypoint script is written around the image's own vault user and tries to chown the mounted directories to it, which a process running as UID 1001 is not allowed to do.
Missing Linux Capabilities: The image's entrypoint script (not the Vault binary itself) runs chown on the mounted directories and setcap on the vault binary to grant it cap_ipc_lock. Changing a file's owner requires CAP_CHOWN, and setcap requires CAP_SETFCAP. Docker's default capability set includes both, but they are effective only for root. A process started directly as UID 1001 has an empty effective capability set, so both calls fail with “Operation not permitted” regardless of cap_add.
Volume Mount Permissions: The host directory /opt/vault-infra/config is owned by vault-system (UID 1001) with mode 750. Because the container runs as the same numeric UID, it can read and write the directory, including the local.json file created inside that volume. What fails is the follow-up attempt to change ownership or attributes of those files, for the capability reason described above.
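To make the first and third factors concrete, the following is a minimal shell sketch for comparing the numeric owner of a host path with the UID the container runs as. It assumes GNU coreutils stat (as shipped on Debian); the function names owner_uid and owner_matches are hypothetical helpers, not part of Docker or Vault.

```shell
#!/bin/sh
# Hypothetical helpers: compare a path's numeric owner with an expected UID.
# The kernel checks numeric IDs only; no /etc/passwd entry is consulted.

owner_uid() {
    # Print the numeric UID that owns the path (GNU coreutils stat).
    stat -c '%u' "$1"
}

owner_matches() {
    # $1 = path, $2 = expected numeric UID; succeeds when they are equal.
    [ "$(owner_uid "$1")" = "$2" ]
}

# Example with the path and UID from this write-up:
#   owner_matches /opt/vault-infra/config 1001 && echo "owner is UID 1001"
```

Running the same check against the UID of the image's own vault user shows whether the entrypoint would consider the directory to need a chown.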

Why This Happens in Real Systems

In enterprise environments, this scenario is common due to Security Hardening and Compliance requirements:

Non-Root Execution: Security policies mandate that containers run as non-root users to limit the impact of a compromise. The official Vault image supports this by starting its entrypoint as root and then dropping to its own vault user, but infrastructure teams often create a dedicated system user on the host (e.g., vault-system) to track ownership and then force the container to run as that UID instead.
Capability Restrictions: Hardened environments often go beyond Docker's defaults, for example cap_drop: ALL in a Compose file or the Kubernetes “restricted” Pod Security Standard, which requires dropping all capabilities. The Vault image's entrypoint relies on CAP_CHOWN and CAP_SETFCAP when it adjusts ownership of file-backed storage and sets file capabilities on the binary, so startup fails when those capabilities are unavailable or when the process is not root.
Immutable Infrastructure vs. Stateful Data: The container image is immutable, but the data (/vault/data) is stateful. If the data directory is empty or permissions drift, the entrypoint script (often wrapping the binary) attempts to “fix” permissions, triggering the capability error.
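Whether a process actually holds a capability can be read from the CapEff line in /proc/<pid>/status. The sketch below decodes that hex mask; has_cap and current_cap_eff are hypothetical helper names, and the bit numbers (CAP_CHOWN = 0, CAP_IPC_LOCK = 14, CAP_SETFCAP = 31) come from the Linux capability definitions. The sample mask 00000000a80425fb is the value commonly seen for a root process under Docker's default capability set and is used here only as an illustration.

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if bit BIT is set in the hex capability mask MASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# current_cap_eff: print the effective capability mask of this shell (Linux only).
current_cap_eff() {
    awk '/^CapEff:/ {print $2}' "/proc/$$/status"
}

# Example: run inside the container as the configured user.
#   mask=$(current_cap_eff)
#   has_cap "$mask" 0  || echo "CAP_CHOWN not effective: chown to another user will fail"
#   has_cap "$mask" 31 || echo "CAP_SETFCAP not effective: setcap will fail"
```

For a non-root user the mask is normally all zeros, which is why adding capabilities in the Compose file does not by itself make chown or setcap succeed.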

Real-World Impact

Service Outage: Vault cannot start, preventing access to secrets, PKI infrastructure, and encryption keys for downstream applications.
Security Vulnerabilities: Workarounds such as chmod 777 on the data directory make Vault's storage writable by every local user. Adding :Z to the volume (SELinux relabeling) does not address a UID or capability problem and, on hosts where SELinux is enforcing, can mislabel the data for other consumers.
Operational Toil: Engineers waste time debugging “Operation not permitted” errors that are misleading; the root cause is not the file itself, but the lack of capability to change its attributes.

Example or Code

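The following docker-compose.yml avoids the failure by running the container as UID 1000, which this write-up takes to be the image's vault user, and by relying on host directories that are already owned by that UID, so the entrypoint has no ownership to fix at startup.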

services:
  vault:
    image: hashicorp/vault:1.21
    # Run as the UID that owns the host directories (1000 in this write-up).
    # Confirm the image's own vault user with `id vault` inside the container;
    # use "0:0" only if the entrypoint must manage permissions itself.
    user: "1000:0"
    cap_add:
      - IPC_LOCK
    volumes:
      # Ensure the host directory is chowned to 1000:1000 on the host first
      - /opt/vault-infra/tls:/vault/tls:ro
      - /opt/vault-infra/data:/vault/data
      - /opt/vault-infra/config:/vault/config
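    environment:
      # Assumption: the official entrypoint honors these variables. A non-root
      # user cannot chown or setcap, so both steps are skipped.
      SKIP_CHOWN: "true"
      SKIP_SETCAP: "true"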
      VAULT_LOCAL_CONFIG: |
        listener "tcp" {
          address = "0.0.0.0:8200"
          tls_cert_file = "/vault/tls/tls.crt"
          tls_key_file = "/vault/tls/tls.key"
        }
        storage "file" {
          path = "/vault/data"
        }
    command: server

How Senior Engineers Fix It

Senior engineers approach this by ensuring consistency between host and container identities and understanding the entrypoint logic:

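Align UID/GIDs: Instead of forcing a custom UID (1001), they check which UID and GID the official image's vault user actually has, either from the image's Dockerfile or by running id vault inside the container, rather than assuming a value. If the image uses UID 1000, they create a host user with UID 1000 or chown the host directories to 1000.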
Pre-configure Host Permissions: They proactively set permissions on the host before running docker compose up. This prevents the container from needing to run chown operations, removing the need for dangerous capabilities.
- sudo chown -R 1000:1000 /opt/vault-infra/data
- sudo chown -R 1000:1000 /opt/vault-infra/config
Modify Container User: They explicitly set user: "1000:1000" in the Compose file to match the host ownership. A variant such as 1000:0 only helps when the directories are group-owned by GID 0 and group-writable, so the group part must match the host as well.
Check Entrypoint Behavior: They know that Vault’s entrypoint tries to chown the mounted directories when their owner is not the user it expects. Once the host ownership already matches, that branch has nothing to do, which avoids the capability error.
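The host-side preparation described above can be sketched as a small idempotent helper that runs before docker compose up. prepare_dir is a hypothetical function name, the 750 mode mirrors the one mentioned earlier, and GNU coreutils stat is assumed. Changing ownership to a different user still requires root on the host (for example via sudo).

```shell
#!/bin/sh
# prepare_dir DIR UID GID: create DIR if needed, then make sure it has the
# given numeric owner and group and mode 750.
prepare_dir() {
    dir=$1
    uid=$2
    gid=$3
    mkdir -p "$dir" || return 1
    # Only chown when ownership differs, so repeated runs are no-ops.
    if [ "$(stat -c '%u:%g' "$dir")" != "$uid:$gid" ]; then
        chown -R "$uid:$gid" "$dir" || return 1
    fi
    chmod 750 "$dir"
}

# Example with the paths and IDs used in this write-up (run as root):
#   prepare_dir /opt/vault-infra/data   1000 1000
#   prepare_dir /opt/vault-infra/config 1000 1000
```

Keeping this step outside the container means the container never needs CAP_CHOWN, which is the point of fixing permissions on the host first.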

Why Juniors Miss It

Juniors often miss this because they treat the container and the host as completely separate silos:

Ignoring UID Mapping: They assume that setting user: "1001:1001" in Docker is enough, not realizing that the kernel compares that numeric UID against the owner of every file in the volume and that the image's entrypoint expects a different user.
Over-reliance on “0777” Fixes: When they see “Permission Denied,” the instinct is often to chmod 777 the folder. This hides the root cause (UID mismatch) and creates a security hole.
Misunderstanding Capabilities: They see “Operation not permitted” and assume it’s a Docker bug or a filesystem mount issue, rather than understanding that the process lacks the specific Linux capability (CAP_CHOWN or CAP_SETFCAP) to perform the requested system call.
Not Checking the Image Docs: They don’t check the official HashiCorp image documentation or Dockerfile, which show the user the image runs as, and try to force their own custom ID instead.
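A quick way to catch the chmod 777 workaround before it reaches production is to test for the world-writable bit on the Vault directories. This is a minimal sketch; is_world_writable is a hypothetical helper name, and it relies only on find's -perm test.

```shell
#!/bin/sh
# is_world_writable PATH: succeed if the "other" write bit (0002) is set on PATH.
is_world_writable() {
    # find prints the path only when the permission bit matches.
    [ -n "$(find "$1" -maxdepth 0 -perm -0002 2>/dev/null)" ]
}

# Example with a path from this write-up:
#   is_world_writable /opt/vault-infra/data && echo "data dir is world-writable"
```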