Child processes escaping systemd cgroup scopes due to exec timing

Summary

A developer building a sandboxing tool failed to ensure that child processes were placed within a systemd cgroup scope. The parent process was successfully moved into my-unit.scope, but the processes subsequently spawned by bubblewrap (the sandbox) remained in the parent’s original cgroup. As a result, resource isolation failed: the sandboxed processes were not subject to the memory or CPU limits defined on the transient systemd unit.

Root Cause

The root cause is a misunderstanding of how pidfd attachment interacts with the process lifecycle and the execve syscall.

Attachment Timing: The code uses pidfd_open on its own PID and passes that to start_transient_unit. This attaches the current process to the cgroup.
The exec Trap: The developer calls command.exec(). In Rust, exec (provided by std::os::unix::process::CommandExt, execve at the OS level) replaces the current process image with the new program (bwrap). Nothing the wrapper wrote after that call ever runs.
Cgroup Lineage: exec changes the process image but keeps the PID and the cgroup membership, so exec by itself does not move bwrap out of the scope. Children, however, inherit whichever cgroup their parent is in at the moment of fork, and bwrap, invoked with flags like --new-session, forks its own sub-processes almost immediately.
The Breakage: The developer attaches the wrapper to the scope via pidfd at the same time as launching the sandbox. The scope is anchored to that one process: only the process that was handed over is migrated, not descendants that already exist. If bwrap forks before the migration has actually taken effect, the children inherit the original cgroup and stay there even after the parent is moved. Because exec has already replaced the wrapper’s image, no management code is left running to notice or correct this.

Why This Happens in Real Systems

In high-performance or security-critical systems, this happens due to the decoupling of process identity (PID) and process image (binary).

Race Conditions: There is a narrow window between a process being created and being placed into a cgroup. If a process forks during this window, the child may inherit the “old” cgroup.
Namespace Transitions: Tools like bubblewrap or firejail use CLONE_NEWNS, CLONE_NEWPID, etc. Namespaces do not change cgroup membership by themselves, but they make it harder to track: inside a new PID namespace, processes see different PIDs than the host does, so a management layer (systemd) that was handed a single file descriptor has no direct view of the rest of the tree.
Atomic Operations: Moving a process into a cgroup is not atomic with respect to the fork/exec cycle. The gap closes only when a manager such as systemd places the process in its cgroup before any user code runs, which is how units started through systemd-run behave.
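
One way to narrow the window described above is for the wrapper to confirm its own membership before it calls exec. The following is a minimal sketch, assuming a cgroup v2 (unified) hierarchy where /proc/self/cgroup contains a line of the form 0::<path>. The names in_scope and wait_until_in_scope are illustrative and not part of any library, and polling is a stand-in for whatever completion signal the real tool has available.

```rust
use std::io;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// True if the cgroup v2 entry ("0::<path>") names `scope` as a path component.
fn in_scope(proc_cgroup: &str, scope: &str) -> bool {
    proc_cgroup
        .lines()
        .filter_map(|line| line.strip_prefix("0::"))
        .any(|path| path.split('/').any(|component| component == scope))
}

/// Polls `read_cgroup` until it reports membership in `scope` or `timeout` elapses.
/// Returns Ok(true) once the process is in the scope, Ok(false) on timeout.
fn wait_until_in_scope<F>(mut read_cgroup: F, scope: &str, timeout: Duration) -> io::Result<bool>
where
    F: FnMut() -> io::Result<String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        if in_scope(&read_cgroup()?, scope) {
            return Ok(true);
        }
        if Instant::now() >= deadline {
            return Ok(false);
        }
        // Short pause between reads so the loop does not spin.
        sleep(Duration::from_millis(5));
    }
}
```

In the wrapper, the reader would typically be a closure such as `|| std::fs::read_to_string("/proc/self/cgroup")`, and exec would be called only after the function returns Ok(true). Taking the reader as a parameter keeps the logic testable without a live systemd.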

Real-World Impact

Resource Exhaustion (OOM): If a sandbox is meant to be limited to 20 MB of RAM but its child processes remain in a cgroup without that limit, a single runaway allocation loop can exhaust memory for the entire host.
Security Bypass: If an attacker can spawn processes outside the controlled cgroup, they may bypass resource-based DoS protections or side-channel mitigations.
Observability Failure: Monitoring tools (Prometheus/Grafana) tracking app.slice will show zero activity for the sandbox, leading engineers to believe the system is idle when it is actually under heavy load.

Example or Code

To fix this, do not attach the wrapper’s own PID via pidfd and then exec. Instead, spawn the child and move the child’s PID into the scope, or better yet, let systemd manage the entire process tree by starting the child as a member of the scope from the outset.

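// INCORRECT: Replacing the wrapper process with the child via exec().
// Nothing is left running to manage or verify the scope afterwards.
use std::os::unix::process::CommandExt; // provides Command::exec

let err = command.exec(); // returns only if the exec itself failed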

// CORRECT APPROACH (Conceptual):
// 1. Create the scope via DBus.
// 2. Spawn the child process using standard fork/exec (do NOT use exec on the wrapper).
// 3. Use the systemd API to move the child's PID into the scope's cgroup.
// 4. Or, use systemd-run logic where the manager handles the entire lifecycle.
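
Steps 2 and 3 above can be sketched with the standard library alone. This is an illustration under assumptions, not the tool’s actual code: supervise is a hypothetical name, and attach is a placeholder closure for whatever call moves the PID into the scope (the D-Bus request itself is not shown). The child can still fork in the interval before attach completes, which is the race described under Why This Happens, so a real tool would also need to hold the child back until attachment is confirmed.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Spawns `command` as a child, hands its PID to `attach`, then waits for it.
/// `attach` stands in for the call that moves the PID into the scope.
fn supervise<F>(mut command: Command, attach: F) -> io::Result<ExitStatus>
where
    F: FnOnce(u32) -> io::Result<()>,
{
    // spawn() forks and execs the child; the supervisor itself keeps running.
    let mut child = command.spawn()?;
    if let Err(err) = attach(child.id()) {
        // Do not leave an unconfined sandbox running if attachment failed.
        let _ = child.kill();
        let _ = child.wait();
        return Err(err);
    }
    // The supervisor stays alive for the lifetime of the sandbox.
    child.wait()
}
```

Because the supervisor survives, it can report attachment failures and inspect the child’s cgroup afterwards, neither of which is possible once the wrapper has replaced itself with exec.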

How Senior Engineers Fix It

Senior engineers approach this by ensuring ownership and lifecycle management are explicit:

Avoid exec in the Wrapper: Instead of turning the management tool into the sandboxed process, the management tool should remain a supervisor. It should use Command::spawn() to create a child, and then use the DBus API to move that child’s PID into the desired cgroup.
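Use systemd-run Patterns: Instead of manual pidfd management, follow what systemd-run does: have systemd create the unit and place the process in it before the sandbox starts. The unit’s ControlGroup property then reports the cgroup path under which every descendant should appear.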
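Post-Fork/Pre-Exec Logic: Use pre_exec hooks (on Unix) to perform necessary setup, such as writing the process’s PID to the target cgroup’s cgroup.procs file, after fork but before the execve syscall replaces the memory image. Note that /proc/self/cgroup is read-only: it reports membership and cannot be used to change it.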
Verify via /proc: Always write integration tests that don’t just check if the process is running, but actually parse /proc/[pid]/cgroup to verify the effective cgroup membership.
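
A verification helper along these lines might look like the following sketch. It assumes a cgroup v2 hierarchy, where /proc/[pid]/cgroup contains a single 0::<path> line; the function names are illustrative, and how the test collects the PIDs of the sandboxed process tree is left out.

```rust
use std::fs;
use std::io;

/// Extracts the cgroup v2 path from the text of /proc/<pid>/cgroup ("0::<path>").
fn unified_cgroup_path(proc_cgroup: &str) -> Option<&str> {
    proc_cgroup.lines().find_map(|line| line.strip_prefix("0::"))
}

/// Reads /proc/<pid>/cgroup for one process.
fn read_proc_cgroup(pid: u32) -> io::Result<String> {
    fs::read_to_string(format!("/proc/{pid}/cgroup"))
}

/// Given (pid, contents of /proc/<pid>/cgroup) pairs, returns the PIDs whose
/// cgroup v2 path does not contain `scope` as a path component.
fn escaped_pids(entries: &[(u32, String)], scope: &str) -> Vec<u32> {
    entries
        .iter()
        .filter(|(_, contents)| {
            !unified_cgroup_path(contents)
                .map_or(false, |path| path.split('/').any(|c| c == scope))
        })
        .map(|(pid, _)| *pid)
        .collect()
}
```

An integration test would call read_proc_cgroup for each PID in the sandbox, pass the results to escaped_pids, and fail if the returned list is non-empty.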

Why Juniors Miss It

The exec Mental Model: Juniors often view exec as “running a command,” whereas seniors view exec as “destroying the current process and replacing it with another.” They miss the fact that the wrapper’s logic and state vanish upon exec.
Cgroup Hierarchy Ignorance: Juniors often treat cgroups like environment variables (inherited by everything), failing to realize that membership is fixed at fork time: a child forked before its parent is migrated stays in the old cgroup, and moving the parent later does not bring existing descendants along.
API Surface Assumption: They assume that if the DBus call returns Ok, the task is complete. They fail to verify the side effects (the actual state of the kernel’s cgroup tree).