Why is rm -rf so slow on PolarDB-FileSystem FUSE compared to ext4?

Summary

This answer analyzes why rm -rf on PolarDB-FileSystem (FUSE mode) is dramatically slower than the same operation on ext4, despite running on the same NVMe SSD. The core issue is metadata-heavy operations crossing the user–kernel boundary, compounded by FUSE design constraints and PolarDB-FileSystem’s metadata architecture.

Root Cause

The slowdown is caused by a combination of FUSE overhead, metadata amplification, and synchronous delete semantics inside PolarDB-FileSystem.

Key contributors include:

  • FUSE user–kernel context switching for every unlink, rmdir, lookup, and getattr
  • High metadata fan‑out in PolarDB-FileSystem’s design (optimized for distributed DB workloads, not mass file churn)
  • Synchronous metadata persistence on delete operations
  • Lack of aggressive inode/dentry caching compared to ext4
  • Small-file workloads triggering worst-case FUSE behavior
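The per-file cost is easy to observe directly by timing a mass delete of small files. The sketch below (the path and file count are illustrative) runs on any Linux filesystem; on ext4 it finishes almost instantly, while the same tree on a FUSE mount pays a userspace round trip for every unlink:

```shell
# Create a tree of small files, then time the recursive delete.
# /tmp/churn and the count of 2000 are arbitrary illustration values.
mkdir -p /tmp/churn
for i in $(seq 1 2000); do : > "/tmp/churn/f$i"; done
time rm -rf /tmp/churn
```

Running this once on ext4 and once on the FUSE mountpoint makes the metadata tax visible without any profiling tools.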

Why This Happens in Real Systems

Large-scale deletions stress the filesystem in ways that normal workloads do not.

Common systemic reasons:

  • FUSE filesystems pay a fixed tax per syscall, which becomes catastrophic when multiplied by 100k+ files
  • Metadata operations dominate deletes, and FUSE adds latency to each one
  • Distributed or database-oriented filesystems often prioritize consistency over raw unlink throughput
  • Kernel filesystems (ext4, XFS) batch metadata updates efficiently in the journal and page cache; FUSE cannot batch across the user–kernel boundary the same way
  • Directory walking and inode invalidation are far more expensive when the filesystem is implemented in userspace
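The "fixed tax per syscall" point can be made concrete with back-of-envelope arithmetic. The latency figures below are assumptions chosen for illustration, not measurements of PolarDB-FileSystem:

```shell
# Assumed costs: 3 kernel<->user round trips per deleted file (lookup,
# unlink, attribute invalidation) at ~10 us each, plus ~1 ms of
# synchronous metadata persistence per unlink.
files=100000
crossing_usec=$((files * 3 * 10))
sync_usec=$((files * 1000))
echo "boundary crossings: ~$((crossing_usec / 1000000)) s"
echo "synchronous metadata flushes: ~$((sync_usec / 1000000)) s"
```

Even with generous numbers, the synchronous-persistence term dominates, which matches the low disk utilization users report during slow deletes.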

Real-World Impact

Users typically observe:

  • Minutes-long deletion times for large directory trees
  • Low disk utilization, because the bottleneck is CPU and context switching, not I/O
  • High CPU usage in the FUSE daemon, not the kernel
  • Slow recursive operations (rm, find, rsync, backup tools)
  • Unpredictable latency spikes when deleting many small files

Example

Below is a minimal sketch of where FUSE amplifies metadata operations. In the high-level libfuse API, every unlink() issued by rm -rf is forwarded to a userspace handler like the passthrough-style one below (myfs_unlink is an illustrative name, not PolarDB-FileSystem code):

/* Invoked once per deleted file, after a kernel-to-userspace round trip. */
static int myfs_unlink(const char *path)
{
    return unlink(path) == -1 ? -errno : 0;
}

Each of the 100k+ unlinks in a large tree pays this boundary crossing before the backing store is touched; ext4 handles the same operation entirely inside the kernel.

How Senior Engineers Fix It

Experienced engineers treat this as a metadata throughput problem, not a disk problem.

Typical solutions:

  • Batch deletes asynchronously instead of using rm -rf
  • Use a background cleanup worker that unlinks files in parallel
  • Mount with performance-oriented FUSE options, such as:
    • -o big_writes
    • -o max_read=131072
    • -o max_write=131072
    • -o entry_timeout=60
    • -o attr_timeout=60
    • -o negative_timeout=30
  • Increase kernel VFS caches (e.g., vm.vfs_cache_pressure=50)
  • Avoid storing millions of tiny files on FUSE-backed filesystems
  • Perform deletes on ext4 and sync back, if the workflow allows
  • Use filesystem-native bulk-delete APIs, if available
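The cache-related options above can be combined on one mount command line. This is a hedged sketch: pfs_fuse and /mnt/pfs are placeholder names, and which options are honored varies by FUSE implementation and version:

```shell
# Longer entry/attr/negative cache timeouts let repeated lookups during a
# recursive delete hit the kernel cache instead of the FUSE daemon.
pfs_fuse /mnt/pfs \
  -o big_writes,max_read=131072,max_write=131072 \
  -o entry_timeout=60,attr_timeout=60,negative_timeout=30

# Relax VFS cache reclaim so dentries and inodes stay cached longer:
sysctl -w vm.vfs_cache_pressure=50
```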

Most importantly: optimize the workflow, not the filesystem, because FUSE will always lose to ext4 in metadata-heavy workloads.
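As a concrete instance of "optimize the workflow": fan the unlinks out across workers so the per-file FUSE round trips overlap instead of serializing. The path and worker count below are illustrative:

```shell
# Build a sample tree (illustrative; in practice this is the tree to remove).
mkdir -p /tmp/bulk/a /tmp/bulk/b
for i in $(seq 1 500); do : > "/tmp/bulk/a/$i"; : > "/tmp/bulk/b/$i"; done

# Delete files with 8 parallel workers, 100 files per rm invocation,
# then remove the now-empty directories deepest-first.
find /tmp/bulk -type f -print0 | xargs -0 -P 8 -n 100 rm -f
find /tmp/bulk -depth -type d -exec rmdir {} +
```

Whether parallelism actually helps depends on the FUSE daemon's internal locking; a single-threaded daemon will serialize the requests anyway.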

Why Juniors Miss It

Less experienced engineers often assume:

  • Disk speed determines delete speed
    (In reality, deletes are metadata-bound, not I/O-bound.)
  • FUSE behaves like a kernel filesystem
    (It never will.)
  • rm -rf is a simple operation
    (It triggers thousands of syscalls and metadata updates.)
  • Slow deletes indicate misconfiguration
    (Often it’s inherent to the architecture.)
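The "rm -rf is a simple operation" assumption is easy to disprove: the number of unlink-class operations is at least the number of entries in the tree. A small sketch with illustrative paths:

```shell
# Build a 3-level tree: 4 directories plus 300 files = 304 entries.
rm -rf /tmp/deep
mkdir -p /tmp/deep/a/b/c
for d in a a/b a/b/c; do
    for i in $(seq 1 100); do : > "/tmp/deep/$d/$i"; done
done
echo "entries to delete: $(find /tmp/deep | wc -l)"
rm -rf /tmp/deep
```

Every one of those entries costs at least one metadata operation, and on FUSE, at least one user–kernel round trip.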

Juniors tend to focus on I/O metrics, while seniors know to look at syscall volume, metadata paths, and context-switch overhead—the real bottlenecks in FUSE-based filesystems.
