Summary
This incident report analyzes why rm -rf on PolarDB-FileSystem (FUSE mode) is dramatically slower than the same operation on ext4, despite running on the same NVMe SSD. The core issue stems from metadata-heavy operations crossing the user–kernel boundary, compounded by FUSE design constraints and PolarDB-FileSystem’s metadata architecture.
Root Cause
The slowdown is caused by a combination of FUSE overhead, metadata amplification, and synchronous delete semantics inside PolarDB-FileSystem.
Key contributors include:
- FUSE user–kernel context switching for every unlink, rmdir, lookup, and getattr
- High metadata fan‑out in PolarDB-FileSystem’s design (optimized for distributed DB workloads, not mass file churn)
- Synchronous metadata persistence on delete operations
- Lack of aggressive inode/dentry caching compared to ext4
- Small-file workloads triggering worst-case FUSE behavior
Why This Happens in Real Systems
Large-scale deletions stress the filesystem in ways that normal workloads do not.
Common systemic reasons:
- FUSE filesystems pay a fixed tax per syscall, which becomes catastrophic when multiplied by 100k+ files
- Metadata operations dominate deletes, and FUSE adds latency to each one
- Distributed or database-oriented filesystems often prioritize consistency over raw unlink throughput
- Kernel filesystems (ext4, xfs) batch metadata updates efficiently; FUSE cannot
- Directory walking and inode invalidation are far more expensive when the filesystem is implemented in userspace
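The syscall fan-out is concrete: a recursive delete must issue one unlink() per file and one rmdir() per directory, each of which is a separate kernel→daemon round trip on FUSE. A small sketch using POSIX nftw() can count what an rm -rf of a tree would cost, without deleting anything (the function name is illustrative):

```c
/* Count the unlink()/rmdir() calls an `rm -rf root` would have to
 * issue. Counts only -- nothing is deleted. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>

static long n_unlink, n_rmdir;

static int count_cb(const char *path, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf) {
    (void)path; (void)sb; (void)ftwbuf;
    if (typeflag == FTW_D || typeflag == FTW_DP)
        n_rmdir++;   /* one rmdir() per directory */
    else
        n_unlink++;  /* one unlink() per file or symlink */
    return 0;
}

/* Returns total unlink+rmdir operations for the tree, or -1 on error. */
long count_delete_ops(const char *root) {
    n_unlink = n_rmdir = 0;
    if (nftw(root, count_cb, 64, FTW_PHYS | FTW_DEPTH) != 0)
        return -1;
    printf("%s: %ld unlinks + %ld rmdirs, each a userspace round trip on FUSE\n",
           root, n_unlink, n_rmdir);
    return n_unlink + n_rmdir;
}
```

For a 100k-file tree the total is well over 100k metadata operations, before counting the lookups and getattrs the walk itself generates.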
Real-World Impact
Users typically observe:
- Minutes-long deletion times for large directory trees
- Low disk utilization, because the bottleneck is CPU and context switching, not I/O
- High CPU usage in the FUSE daemon, not the kernel
- Slow recursive operations (rm, find, rsync, backup tools)
- Unpredictable latency spikes when deleting many small files
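The "low disk utilization, high context switching" symptom can be confirmed from the calling process itself: on Linux, /proc/self/status exposes voluntary context-switch counts, which spike on a FUSE mount because each unlink blocks while the request round-trips through the daemon. A rough sketch (function names are illustrative):

```c
/* Count this process's voluntary context switches across a burst of
 * create+unlink calls (Linux-specific: reads /proc/self/status).
 * On FUSE each unlink blocks on the userspace daemon, so the
 * per-file switch count is far higher than on ext4. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static long voluntary_switches(void) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long n = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "voluntary_ctxt_switches: %ld", &n) == 1)
            break;
    fclose(f);
    return n;
}

/* Returns switches incurred while creating and unlinking n files,
 * or -1 on error. */
long switches_per_churn(const char *dir, int n) {
    char path[4096];
    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof path, "%s/sw%d", dir, i);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0) return -1;
        close(fd);
    }
    long before = voluntary_switches();
    for (int i = 0; i < n; i++) {
        snprintf(path, sizeof path, "%s/sw%d", dir, i);
        unlink(path);
    }
    return voluntary_switches() - before;
}
```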
Example
Below is a minimal sketch of a passthrough-style unlink handler using the high-level libfuse API (the myfs_ name is illustrative). Every unlink that rm issues enters the kernel VFS, is forwarded to the FUSE daemon in userspace, and only then reaches the backing store:
static int myfs_unlink(const char *path) {
    if (unlink(path) == -1)   /* delegate to the backing filesystem */
        return -errno;        /* libfuse expects negated errno on failure */
    return 0;
}
Each call therefore costs at least two user–kernel boundary crossings plus a context switch into the daemon, before any disk I/O happens.
How Senior Engineers Fix It
Experienced engineers treat this as a metadata throughput problem, not a disk problem.
Typical solutions:
- Batch deletes asynchronously instead of running rm -rf inline
- Use a background cleanup worker that unlinks files in parallel
- Mount with performance-oriented FUSE options, such as:
  -o big_writes -o max_read=131072 -o max_write=131072 -o entry_timeout=60 -o attr_timeout=60 -o negative_timeout=30
- Increase kernel VFS caching (e.g., vm.vfs_cache_pressure=50)
- Avoid storing millions of tiny files on FUSE-backed filesystems
- Perform deletes on ext4 and sync back, if the workflow allows
- Use filesystem-native bulk-delete APIs, if available
Most importantly: optimize the workflow, not the filesystem, because FUSE will always lose to ext4 in metadata-heavy workloads.
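The "batch deletes asynchronously" approach above can be sketched in a few lines: rename the directory aside so it vanishes from the namespace instantly, then reclaim the space in a detached background process. The delete_async name and trash-path convention are illustrative assumptions, not a PolarDB-FileSystem API:

```c
/* Sketch: make a recursive delete appear instantaneous by renaming
 * the directory aside, then deleting it in a forked child. The slow
 * FUSE unlink storm runs off the hot path. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

static int rm_cb(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf) {
    (void)sb; (void)typeflag; (void)ftwbuf;
    return remove(path);   /* unlink or rmdir, children first (FTW_DEPTH) */
}

/* Returns 0 once the rename succeeds -- the visible part of the delete
 * is done; the reclaim continues in the background child. */
int delete_async(const char *dir, const char *trash_path) {
    if (rename(dir, trash_path) != 0)
        return -1;              /* directory vanishes atomically here */
    pid_t pid = fork();
    if (pid == 0) {             /* child: slow recursive delete */
        nftw(trash_path, rm_cb, 64, FTW_DEPTH | FTW_PHYS);
        _exit(0);
    }
    return pid > 0 ? 0 : -1;
}
```

The caller perceives the delete as a single rename; the per-file metadata cost is unchanged, but it no longer blocks the foreground workflow.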
Why Juniors Miss It
Less experienced engineers often assume:
- Disk speed determines delete speed (in reality, deletes are metadata-bound, not I/O-bound)
- FUSE behaves like a kernel filesystem (it never will)
- rm -rf is a simple operation (it triggers thousands of syscalls and metadata updates)
- Slow deletes indicate misconfiguration (often it is inherent to the architecture)
Juniors tend to focus on I/O metrics, while seniors know to look at syscall volume, metadata paths, and context-switch overhead—the real bottlenecks in FUSE-based filesystems.