Resolving Ghost Disk Space Issues After a Failed WoeUSB Mount

Summary

An incident occurred in which WoeUSB, a utility for writing Windows installation media to USB devices, caused a storage discrepancy on a Linux system. While writing to a USB device, the process hung during the “Mounting filesystem…” phase, and all available disk space was consumed. Crucially, even after the process was terminated and the machine rebooted, df still reported the space as used, although partition management tools such as GParted showed the blocks as free.

Root Cause

The most likely cause is a mismatch between the usage the filesystem reports and the actual state of the underlying block device, triggered by an incomplete mount operation or a stale mount point that skipped the normal cleanup path.

  • Unresolved Mount Points: The process attempted to mount an intermediate filesystem under /tmp. When the process was killed, the mount stayed in the kernel’s mount table even though the directory looked empty. A stale mount like this does not survive a reboot on its own, so usage that persists afterwards points to data or metadata already written to disk.
  • Filesystem Metadata Inconsistency: The hang suggests the kernel was waiting on an I/O operation that never completed. An interrupted write of this kind can leave the superblock counters and the block or inode bitmaps inconsistent with each other.
  • The “Ghost Space” Phenomenon: While the mount was still active in a semi-broken state, the kernel kept accounting for the space used by the “phantom” filesystem, so df did not report that capacity as available.
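One way to put a number on the “ghost” space is to compare what df reports as used with what du can reach by walking the directory tree. The following is a minimal sketch, assuming GNU coreutils; ghost_gap_kib and report_gap are hypothetical helper names, not part of any existing tool.

```shell
#!/usr/bin/env bash
# Sketch: measure the gap between filesystem-level usage (df) and
# the usage visible through the directory tree (du).
# Assumes GNU df/du; both function names are illustrative only.

# ghost_gap_kib USED_KIB VISIBLE_KIB
# Prints the difference between the two figures in KiB.
ghost_gap_kib() {
    local used_kib=$1 visible_kib=$2
    echo $(( used_kib - visible_kib ))
}

# report_gap [MOUNTPOINT]
# Reads both figures for one filesystem and prints them with the gap.
report_gap() {
    local mp=${1:-/} used visible
    # df: last line of the single "Used" column, in 1 KiB blocks.
    used=$(df -k --output=used "$mp" | tail -n 1 | tr -d ' ')
    # du: -s summarize, -k 1 KiB blocks, -x stay on this filesystem.
    visible=$(du -skx "$mp" 2>/dev/null | cut -f1)
    echo "df used: ${used} KiB, du visible: ${visible} KiB, gap: $(ghost_gap_kib "$used" "$visible") KiB"
}
```

Running report_gap / as root gives the most complete du figure. Some gap is expected; a gap close to the size of the missing space suggests that the usage is not reachable through the visible directory tree.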

Why This Happens in Real Systems

This occurs because the lifecycle of a process and the lifecycle of the mounts it creates are decoupled: a mount is kernel state, and it outlives the process that created it unless something explicitly unmounts it.

  • Signal Handling Failures: When a user sends SIGINT (Ctrl+C), the signal is queued for the application, but a process blocked in Uninterruptible Sleep (D state) waiting for I/O does not act on it until the I/O returns, so its cleanup handlers do not run.
  • Mount Namespace Leaks: Modern Linux systems use mount namespaces. A mount created inside a separate namespace does not appear in the host’s mount table and persists for as long as any process keeps that namespace alive, which can make it look like a “zombie” mount within the kernel’s VFS (Virtual File System) layer.
  • Filesystem Journaling Latency: If the system is under heavy I/O pressure, the transition of blocks from “allocated” to “free” might not be committed to the journal before the system is interrupted or rebooted.
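To confirm that a hung process really is in uninterruptible sleep, it can help to list D-state tasks together with the kernel function they are waiting in. This is a minimal sketch, assuming the procps version of ps (for the wchan column); filter_d_state and list_blocked are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: show processes in uninterruptible sleep and what they wait on.
# Assumes procps ps; function names are illustrative only.

# filter_d_state
# Reads "PID STAT WCHAN COMMAND" lines on stdin and keeps those whose
# STAT field begins with D. The ps header line ("STAT") is dropped too,
# because it does not begin with D.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# list_blocked
# Prints PID, state, wait channel and command name of every D-state task.
list_blocked() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

An empty result from list_blocked means nothing is currently blocked in D state; a WoeUSB-related entry that stays in the list across repeated calls would be consistent with the hang described above.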

Real-World Impact

  • Service Outages: If this occurs on a root partition or a data partition hosting a database, the entire system may stop accepting writes, leading to cascading failures.
  • Monitoring Noise: Monitoring based on df will raise a disk-full alert, while tools that walk the directory tree (such as du) or inspect the partition may show plenty of free space. The contradiction slows diagnosis and increases Mean Time to Recovery (MTTR).
  • Data Corruption Risk: Forcing a reboot during a metadata inconsistency event increases the risk of filesystem corruption on the host machine.

Example or Code

To identify the “ghost” mount point that is consuming the space, a senior engineer would bypass df and look directly at the kernel’s mount table and the filesystem’s status.

# 1. List mounts under /tmp straight from the kernel's mount table
grep ' /tmp' /proc/mounts

# 2. List processes stuck in Uninterruptible Sleep (STAT begins with D)
ps aux | awk '$8 ~ /^D/'

# 3. Lazily detach the suspected mount
#    (the path is a placeholder; use the one reported in step 1)
sudo umount -l /tmp/woeusb_mount_point

# 4. Check the filesystem for errors. The partition must be unmounted first,
#    so use a live environment for the root filesystem.
#    /dev/sdXN is a placeholder for the affected partition.
sudo fsck -f /dev/sdXN

How Senior Engineers Fix It

Senior engineers move beyond df and look at the source of truth: the Kernel and the Block Device.

  • Lazy Unmounting: Instead of a standard umount, use umount -l (lazy unmount). This detaches the filesystem from the hierarchy immediately and cleans up all remaining references as soon as the device is no longer busy.
  • Kernel Inspection: Use cat /proc/mounts to see exactly what the kernel thinks is mounted. This is the kernel’s own view; on older systems the mount command reads /etc/mtab instead, which can be stale after a failed mount.
  • Filesystem Check (fsck): If a reboot doesn’t fix the metadata, the engineer boots into a live environment and runs fsck on the unmounted filesystem so that it can check and repair the inode and block bitmaps.
  • Checking for D-State Processes: They identify processes in Uninterruptible Sleep using ps, knowing that a process in this state does not act on SIGKILL until the blocking I/O returns. The fix is to resolve the underlying I/O block or to reboot cleanly.
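The kernel inspection step can be sketched as a small filter over the mountinfo table, which also makes it possible to look into another process’s mount namespace by passing /proc/<PID>/mountinfo. This is an illustrative sketch under those assumptions; mounts_under is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: list mount points at or below a directory, with their
# filesystem type, from a mountinfo-format table.
# mounts_under is an illustrative name, not an existing utility.

# mounts_under PREFIX [TABLE]
# TABLE defaults to the caller's own view; pass /proc/<PID>/mountinfo
# to inspect the namespace of another process.
mounts_under() {
    local prefix=${1%/} table=${2:-/proc/self/mountinfo}
    awk -v p="$prefix" '
        # Field 5 is the mount point. Match the prefix itself or anything
        # below it, but not siblings that merely share the leading text.
        $5 == p || index($5, p "/") == 1 {
            # Optional fields end at a lone "-"; the type follows it.
            for (i = 7; i <= NF; i++)
                if ($i == "-") { print $5, $(i + 1); break }
        }' "$table"
}
```

For example, mounts_under /tmp prints one line per mount below /tmp. An empty result for the current namespace, combined with a non-empty result for a suspect process’s mountinfo file, would indicate that the mount lives in a separate namespace.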

Why Juniors Miss It

  • Tool Over-reliance: Juniors often rely solely on df -h. When df gives an answer that contradicts their intuition (or another tool), they assume the tool is lying rather than suspecting that the filesystem is inconsistent.
  • The “Kill” Fallacy: There is a common misconception that kill -9 solves all problems. Juniors fail to realize that killing a process does not undo the side effects the process had on the Kernel’s VFS layer.
  • Ignoring Mount States: They treat a directory as a simple folder, forgetting that a directory can be a mount point that acts as a gateway to an entirely different filesystem structure.
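A quick way to see that a directory is more than a folder is to compare it with its parent: if the two sit on different devices, the directory is the entry point to another filesystem. The sketch below uses this common heuristic and assumes GNU stat; is_mount_boundary is a hypothetical name, and the check does not detect bind mounts within the same filesystem.

```shell
#!/usr/bin/env bash
# Sketch: detect whether a directory is a mount boundary by comparing
# its device and inode numbers with those of its parent.
# Assumes GNU stat (coreutils); the function name is illustrative only.

# is_mount_boundary DIR
# Returns 0 if DIR looks like a mount point, 1 if not, 2 on stat errors.
is_mount_boundary() {
    local dir=$1 self parent
    self=$(stat -c '%d:%i' "$dir") || return 2
    parent=$(stat -c '%d:%i' "$dir/..") || return 2
    # Different device number: DIR belongs to another filesystem.
    # Same device and same inode: DIR is its own parent, which is "/".
    [ "${self%%:*}" != "${parent%%:*}" ] || [ "$self" = "$parent" ]
}
```

Calling is_mount_boundary on the suspected WoeUSB directory before and after an unmount attempt shows whether the directory is still acting as a gateway to another filesystem.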
