Summary
A developer building a sandboxing tool failed to ensure that child processes were placed within a systemd cgroup scope. The parent process was successfully moved into my-unit.scope, but the processes subsequently spawned by bubblewrap (the sandbox) remained in the parent’s original cgroup. The result was a resource isolation failure: the sandboxed processes were not subject to the memory limits or CPU constraints defined in the transient systemd unit.
Root Cause
The root cause is a misunderstanding of how pidfd attachment interacts with the process lifecycle and the execve syscall.
- Attachment Timing: The code calls pidfd_open on its own PID and passes the resulting descriptor to start_transient_unit. This asks systemd to attach the current process to the scope’s cgroup.
- The exec Trap: The developer then calls command.exec(). In Rust’s std::os::unix::process::CommandExt, exec (execve at the OS level) replaces the current process image with the new program (bwrap).
- Cgroup Lineage: While the process image changes, the PID remains the same, and the process stays in whatever cgroup it occupies at that moment. Children created by fork inherit the cgroup their parent is in at the time of the fork. Because bwrap is invoked with flags like --new-session and manages its own sub-processes, it relies on exactly this inheritance.
- The Breakage: The primary issue is that the developer attaches the process to the scope via pidfds at the same time as launching the sandbox. The most likely failure is one of ordering: if bwrap forks before systemd has actually migrated the PID, those children inherit the original cgroup and are never moved, because the scope request named one specific process rather than the tree it would later create. Once exec has replaced the wrapper’s image, no wrapper code remains that could wait for the migration or correct the placement.
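The inheritance rule behind the bullets above can be observed directly. The following is a minimal sketch, assuming a Linux host with /proc mounted and a cat binary on the PATH; the function names are illustrative and not taken from the original tool. It compares the cgroup the current process reports with the cgroup a freshly spawned child reports.

```rust
use std::fs;
use std::io;
use std::process::Command;

/// Returns the raw contents of /proc/self/cgroup for the calling process.
fn own_cgroup() -> io::Result<String> {
    fs::read_to_string("/proc/self/cgroup")
}

/// Spawns `cat /proc/self/cgroup` as a child and returns what the child saw.
/// Inside the child, /proc/self refers to the child itself.
fn child_cgroup() -> io::Result<String> {
    let out = Command::new("cat").arg("/proc/self/cgroup").output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```

If nothing migrates the parent between the two calls, both functions return the same text. That is the point: a child lands wherever its parent is when the child is created, and a later migration of the parent does not move it.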
Why This Happens in Real Systems
In high-performance or security-critical systems, this happens due to the decoupling of process identity (PID) and process image (binary).
- Race Conditions: There is a narrow window between a process being created and being placed into a cgroup. If a process forks during this window, the child may inherit the “old” cgroup.
- Namespace Transitions: Tools like bubblewrap or firejail use CLONE_NEWNS, CLONE_NEWPID, etc. These transitions do not move a process between cgroups, but they make membership harder to follow: PIDs inside a new PID namespace differ from the ones the host sees, and the management layer (systemd) was handed only a single file descriptor rather than the whole process tree.
- Atomic Operations: Moving a process into a cgroup is not an atomic operation relative to the fork/exec cycle unless it is handled by a specialized manager like systemd-run, which orchestrates the process lifecycle before the user code even runs.
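One way to close the race window described above is to refuse to launch the sandbox until the migration is visible. The sketch below is a std-only stand-in that polls /proc; waiting for systemd to report the transient unit's job as finished would be the more direct signal, but that requires a D-Bus client and is not shown. It assumes the cgroup v2 unified hierarchy, where the relevant line has the form 0::/path. The helper names and the 10 ms poll interval are assumptions for illustration.

```rust
use std::fs;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Extracts the cgroup v2 path from the contents of /proc/<pid>/cgroup.
/// The unified-hierarchy entry has the form "0::/some/path".
fn unified_path(content: &str) -> Option<&str> {
    content.lines().find_map(|line| line.strip_prefix("0::"))
}

/// Polls until `pid` is in a cgroup whose path ends with `scope`,
/// returning false if that has not happened when `timeout` expires.
fn wait_until_in_scope(pid: u32, scope: &str, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(content) = fs::read_to_string(format!("/proc/{}/cgroup", pid)) {
            if unified_path(&content).map_or(false, |p| p.ends_with(scope)) {
                return true;
            }
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(Duration::from_millis(10));
    }
}
```

A wrapper could call wait_until_in_scope(std::process::id(), "my-unit.scope", Duration::from_secs(5)) after the D-Bus request and launch bwrap only if it returns true.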
Real-World Impact
- Resource Exhaustion (OOM): If a sandbox is meant to be limited to 20MB of RAM but its child processes run outside the limited cgroup, a single runaway allocation loop can exhaust memory on the entire host.
- Security Bypass: If an attacker can spawn processes outside the controlled cgroup, they may bypass resource-based DoS protections or side-channel mitigations.
- Observability Failure: Monitoring tools (Prometheus/Grafana) tracking app.slice will show zero activity for the sandbox, leading engineers to believe the system is idle when it is actually under heavy load.
Example or Code
To fix this, instead of attaching the current PID via pidfd and then calling exec, the developer should spawn the child and move the child’s PID into the cgroup. Better yet, let systemd manage the entire process tree: keep the wrapper running as a supervisor and spawn the child as a member of the scope.
// INCORRECT: Replacing the wrapper process with the child via exec()
// This makes the wrapper's management of the scope fragile.
let err = command.exec();
// CORRECT APPROACH (Conceptual):
// 1. Create the scope via DBus.
// 2. Spawn the child process using standard fork/exec (do NOT use exec on the wrapper).
// 3. Use the systemd API to move the child's PID into the scope's cgroup.
// 4. Or, use systemd-run logic where the manager handles the entire lifecycle.
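The supervisor shape from steps 2 and 3 can be sketched with the standard library alone. This is a minimal illustration, not the original tool's code: the function name supervise is hypothetical, and the step that hands the child's PID to systemd is left as a comment because it depends on the D-Bus client in use. Note that a child may fork before it is moved, so this pattern by itself narrows the race rather than removing it.

```rust
use std::io;
use std::process::Command;

/// Runs `program` as a child and waits for it, so the wrapper stays alive
/// as a supervisor instead of being replaced via exec.
/// Returns the child's exit code, or -1 if it was killed by a signal.
fn supervise(program: &str, args: &[&str]) -> io::Result<i32> {
    let mut child = Command::new(program).args(args).spawn()?;

    // This is the PID a supervisor would ask systemd to place in the scope.
    // The actual D-Bus call is omitted from this sketch.
    let _pid = child.id();

    let status = child.wait()?;
    Ok(status.code().unwrap_or(-1))
}
```

Because the wrapper survives, it can still verify placement, report errors, and forward the sandbox's exit status, none of which is possible after exec.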
How Senior Engineers Fix It
Senior engineers approach this by ensuring ownership and lifecycle management are explicit:
- Avoid exec in the Wrapper: Instead of turning the management tool into the sandboxed process, the management tool should remain a supervisor. It should use Command::spawn() to create a child, and then use the DBus API to move that child’s PID into the desired cgroup.
- Use systemd-run Patterns: Instead of manual pidfd management, let the manager place the process before user code runs, and use the unit’s ControlGroup property to confirm that every process spawned under the supervisor’s PID is tracked in the scope.
- Post-Fork/Pre-Exec Logic: Use pre_exec hooks (on Unix) to perform necessary setup before the execve syscall replaces the memory image, such as joining the target cgroup by writing to its cgroup.procs file. Note that /proc/self/cgroup only reports membership and cannot be used to change it.
- Verify via /proc: Always write integration tests that don’t just check whether the process is running, but actually parse /proc/[pid]/cgroup to verify the effective cgroup membership.
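The post-fork/pre-exec approach from the list above can be sketched as follows. It assumes cgroup v2, that the scope's cgroup directory is already known, and that the caller has write permission on its cgroup.procs file (which typically requires delegation). The function name spawn_in_cgroup is hypothetical. The file is opened in the parent so that the hook itself performs only a single write, since code running between fork and exec should avoid allocating.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::os::unix::process::CommandExt;
use std::path::Path;
use std::process::{Child, Command};

/// Spawns `cmd` so that the child joins a cgroup before its program starts.
/// `procs_file` is the path to the target cgroup's cgroup.procs file.
fn spawn_in_cgroup(cmd: &mut Command, procs_file: &Path) -> io::Result<Child> {
    // Open in the parent; the descriptor is inherited by the forked child.
    let procs: File = OpenOptions::new().write(true).open(procs_file)?;

    unsafe {
        cmd.pre_exec(move || {
            // Runs in the child after fork and before exec. Writing "0" to
            // cgroup.procs moves the writing process into that cgroup.
            (&procs).write_all(b"0")
        });
    }
    cmd.spawn()
}
```

With this ordering the child is already in the scope when bwrap's first instruction runs, so every process it forks inherits the right cgroup. If the write fails, spawn returns the error and the sandbox never starts.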
Why Juniors Miss It
- The exec Mental Model: Juniors often view exec as “running a command,” whereas seniors view exec as “destroying the current process and replacing it with another.” They miss the fact that the wrapper’s logic and state vanish upon exec.
- Cgroup Hierarchy Ignorance: Juniors often treat cgroups like environment variables (inherited by everything, whenever they are set), failing to realize that membership is inherited only at fork time: a child forked before its parent is migrated stays in the old cgroup.
- API Surface Assumption: They assume that if the DBus call returns Ok, the task is complete. They fail to verify the side effects (the actual state of the kernel’s cgroup tree).