Why is rm -rf so slow on PolarDB-FileSystem FUSE compared to ext4?

Summary

This answer analyzes why rm -rf on PolarDB-FileSystem (FUSE mode) is dramatically slower than the same operation on ext4, despite running on the same NVMe SSD. The core issue is metadata-heavy operations crossing the user–kernel boundary, compounded by FUSE design constraints and PolarDB-FileSystem’s metadata architecture.

Root Cause

The slowdown is caused by a combination of FUSE overhead, metadata amplification, and synchronous delete semantics inside PolarDB-FileSystem.

Key contributors include:

  • FUSE user–kernel context switching for every unlink, rmdir, lookup, and getattr
  • High metadata fan‑out in PolarDB-FileSystem’s design (optimized for distributed DB workloads, not mass file churn)
  • Synchronous metadata persistence on delete operations
  • Lack of aggressive inode/dentry caching compared to ext4
  • Small-file workloads triggering worst-case FUSE behavior
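The per-file cost is easy to observe directly by timing a mass delete of small files. The sketch below (the path and file count are illustrative) runs on any Linux filesystem; on ext4 it finishes almost instantly, while the same tree on a FUSE mount pays a userspace round trip for every unlink:

```shell
# Create a tree of small files, then time the recursive delete.
# /tmp/churn and the count of 2000 are arbitrary illustration values.
mkdir -p /tmp/churn
for i in $(seq 1 2000); do : > "/tmp/churn/f$i"; done
time rm -rf /tmp/churn
```

Running this once on ext4 and once on the FUSE mountpoint makes the metadata tax visible without any profiling tools.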

Why This Happens in Real Systems

Large-scale deletions stress the filesystem in ways that normal workloads do not.

Common systemic reasons:

  • FUSE filesystems pay a fixed tax per syscall, which becomes catastrophic when multiplied by 100k+ files
  • Metadata operations dominate deletes, and FUSE adds latency to each one
  • Distributed or database-oriented filesystems often prioritize consistency over raw unlink throughput
  • Kernel filesystems (ext4, XFS) batch metadata updates efficiently in the journal and page cache; FUSE cannot batch across the user–kernel boundary the same way
  • Directory walking and inode invalidation are far more expensive when the filesystem is implemented in userspace
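The "fixed tax per syscall" point can be made concrete with back-of-envelope arithmetic. The latency figures below are assumptions chosen for illustration, not measurements of PolarDB-FileSystem:

```shell
# Assumed costs: 3 kernel<->user round trips per deleted file (lookup,
# unlink, attribute invalidation) at ~10 us each, plus ~1 ms of
# synchronous metadata persistence per unlink.
files=100000
crossing_usec=$((files * 3 * 10))
sync_usec=$((files * 1000))
echo "boundary crossings: ~$((crossing_usec / 1000000)) s"
echo "synchronous metadata flushes: ~$((sync_usec / 1000000)) s"
```

Even with generous numbers, the synchronous-persistence term dominates, which matches the low disk utilization users report during slow deletes.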

Real-World Impact

Users typically observe:

  • Minutes-long deletion times for large directory trees
  • Low disk utilization, because the bottleneck is CPU and context switching, not I/O
  • High CPU usage in the FUSE daemon, not the kernel
  • Slow recursive operations (rm, find, rsync, backup tools)
  • Unpredictable latency spikes when deleting many small files

Example

Below is a minimal sketch of where FUSE amplifies metadata operations. In the high-level libfuse API, every unlink() issued by rm -rf is forwarded to a userspace handler like the passthrough-style one below (myfs_unlink is an illustrative name, not PolarDB-FileSystem code):

/* Invoked once per deleted file, after a kernel-to-userspace round trip. */
static int myfs_unlink(const char *path)
{
    return unlink(path) == -1 ? -errno : 0;
}

Each of the 100k+ unlinks in a large tree pays this boundary crossing before the backing store is touched; ext4 handles the same operation entirely inside the kernel.

How Senior Engineers Fix It

Experienced engineers treat this as a metadata throughput problem, not a disk problem.

Typical solutions:

  • Batch deletes asynchronously instead of using rm -rf
  • Use a background cleanup worker that unlinks files in parallel
  • Mount with performance-oriented FUSE options, such as:
    • -o big_writes
    • -o max_read=131072
    • -o max_write=131072
    • -o entry_timeout=60
    • -o attr_timeout=60
    • -o negative_timeout=30
  • Increase kernel VFS caches (e.g., vm.vfs_cache_pressure=50)
  • Avoid storing millions of tiny files on FUSE-backed filesystems
  • Perform deletes on ext4 and sync back, if the workflow allows
  • Use filesystem-native bulk-delete APIs, if available
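The cache-related options above can be combined on one mount command line. This is a hedged sketch: pfs_fuse and /mnt/pfs are placeholder names, and which options are honored varies by FUSE implementation and version:

```shell
# Longer entry/attr/negative cache timeouts let repeated lookups during a
# recursive delete hit the kernel cache instead of the FUSE daemon.
pfs_fuse /mnt/pfs \
  -o big_writes,max_read=131072,max_write=131072 \
  -o entry_timeout=60,attr_timeout=60,negative_timeout=30

# Relax VFS cache reclaim so dentries and inodes stay cached longer:
sysctl -w vm.vfs_cache_pressure=50
```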

Most importantly: optimize the workflow, not the filesystem, because FUSE will always lose to ext4 in metadata-heavy workloads.
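As a concrete instance of "optimize the workflow": fan the unlinks out across workers so the per-file FUSE round trips overlap instead of serializing. The path and worker count below are illustrative:

```shell
# Build a sample tree (illustrative; in practice this is the tree to remove).
mkdir -p /tmp/bulk/a /tmp/bulk/b
for i in $(seq 1 500); do : > "/tmp/bulk/a/$i"; : > "/tmp/bulk/b/$i"; done

# Delete files with 8 parallel workers, 100 files per rm invocation,
# then remove the now-empty directories deepest-first.
find /tmp/bulk -type f -print0 | xargs -0 -P 8 -n 100 rm -f
find /tmp/bulk -depth -type d -exec rmdir {} +
```

Whether parallelism actually helps depends on the FUSE daemon's internal locking; a single-threaded daemon will serialize the requests anyway.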

Why Juniors Miss It

Less experienced engineers often assume:

  • Disk speed determines delete speed
    (In reality, deletes are metadata-bound, not I/O-bound.)
  • FUSE behaves like a kernel filesystem
    (It never will.)
  • rm -rf is a simple operation
    (It triggers thousands of syscalls and metadata updates.)
  • Slow deletes indicate misconfiguration
    (Often it’s inherent to the architecture.)
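The "rm -rf is a simple operation" assumption is easy to disprove: the number of unlink-class operations is at least the number of entries in the tree. A small sketch with illustrative paths:

```shell
# Build a 3-level tree: 4 directories plus 300 files = 304 entries.
rm -rf /tmp/deep
mkdir -p /tmp/deep/a/b/c
for d in a a/b a/b/c; do
    for i in $(seq 1 100); do : > "/tmp/deep/$d/$i"; done
done
echo "entries to delete: $(find /tmp/deep | wc -l)"
rm -rf /tmp/deep
```

Every one of those entries costs at least one metadata operation, and on FUSE, at least one user–kernel round trip.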

Juniors tend to focus on I/O metrics, while seniors know to look at syscall volume, metadata paths, and context-switch overhead—the real bottlenecks in FUSE-based filesystems.
