Summary
This postmortem analyzes a common performance pitfall: assuming that write(), writev(), or even splice() represent the full set of practical terminal‑output strategies in Linux userspace. The incident stemmed from benchmarking a lightweight libc and overlooking several kernel‑supported mechanisms that reduce syscall overhead for small writes. The result was misleading performance data and incorrect architectural conclusions.
Root Cause
The root cause was over‑reliance on traditional POSIX syscalls and failure to account for modern Linux I/O paths that reduce syscall frequency or bypass user–kernel copies.
Key contributing factors:
- Assuming write() is the baseline without validating alternatives like io_uring.
- Misunderstanding terminal device behavior, especially line buffering and TTY throttling.
- Benchmarking small writes (<100 bytes) where syscall overhead dominates.
- Ignoring batching and submission‑queue–based I/O models.
Why This Happens in Real Systems
Real systems frequently fall into this trap because:
- Legacy habits: write() has been the default for decades.
- TTYs are deceptively slow, masking the benefits of advanced I/O paths.
- Small writes amplify syscall overhead, making naive benchmarks misleading.
- Developers assume zero‑copy is impossible without kernel patches, overlooking existing mechanisms.
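To make the small-write point concrete, the sketch below sends the same messages two ways: one write() per message, and a single writev() that gathers them. The helper names send_each and send_gathered and the MAX_MSGS cap are hypothetical, chosen only for illustration. Errors are reduced to returning -1, and short writes are not retried. Each helper returns the number of syscalls it issued.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_MSGS 16  /* arbitrary cap for this sketch, well below IOV_MAX */

/* One write() per message: n messages cost n syscalls.
 * Returns the number of syscalls issued, or -1 on error. */
int send_each(int fd, const char *const *msgs, int n) {
    int calls = 0;
    for (int i = 0; i < n; i++) {
        if (write(fd, msgs[i], strlen(msgs[i])) < 0)
            return -1;
        calls++;
    }
    return calls;
}

/* One writev() for all messages: n messages cost a single syscall.
 * Returns 1 (the syscall count), or -1 on error. */
int send_gathered(int fd, const char *const *msgs, int n) {
    struct iovec iov[MAX_MSGS];
    assert(n > 0 && n <= MAX_MSGS);
    for (int i = 0; i < n; i++) {
        iov[i].iov_base = (void *)msgs[i];  /* writev() only reads from it */
        iov[i].iov_len = strlen(msgs[i]);
    }
    return writev(fd, iov, n) < 0 ? -1 : 1;
}
```

For three messages, send_each reports three syscalls and send_gathered reports one, while the bytes delivered are identical. On a benchmark dominated by sub-100-byte writes, that ratio is most of what is being measured.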
Real-World Impact
This misunderstanding leads to:
- Incorrect performance conclusions about libc or terminal throughput.
- Over‑engineering (e.g., custom buffering layers) when simpler solutions exist.
- Misguided optimization efforts that ignore kernel‑supported batching.
- Inconsistent benchmark results across terminals, PTYs, and pipes.
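One way to keep results comparable is to record what kind of object the benchmark is actually writing to. The following is a minimal sketch, assuming a hypothetical helper named classify_output and a hypothetical enum out_kind. It relies only on isatty() and fstat(), so it cannot tell a PTY from a hardware terminal.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical labels for the kinds of output a benchmark may see. */
enum out_kind { OUT_TTY, OUT_PIPE, OUT_FILE, OUT_OTHER, OUT_UNKNOWN };

/* Classify the object behind fd so benchmark results can be tagged. */
enum out_kind classify_output(int fd) {
    struct stat st;
    assert(fd >= 0);
    if (isatty(fd))
        return OUT_TTY;       /* terminal device; PTYs also land here */
    if (fstat(fd, &st) != 0)
        return OUT_UNKNOWN;   /* closed or otherwise invalid descriptor */
    if (S_ISFIFO(st.st_mode))
        return OUT_PIPE;
    if (S_ISREG(st.st_mode))
        return OUT_FILE;
    return OUT_OTHER;         /* sockets, non-TTY character devices, ... */
}
```

Calling classify_output(STDOUT_FILENO) at the start of a run and logging the result alongside the timings makes it obvious when two numbers were taken against different device types.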
Example or Code
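Below is a minimal example of using io_uring (through liburing, linked with -luring) for batched writes. Two small messages are queued and then submitted with a single io_uring_enter() call, which is where the saving over one write() per message comes from. A single queued write would save nothing.
#include <liburing.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;
    const char *msgs[] = { "hello\n", "world\n" };
    for (int i = 0; i < 2; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        /* Offset -1 means "use the current file position", as write() does. */
        io_uring_prep_write(sqe, STDOUT_FILENO, msgs[i], strlen(msgs[i]), -1);
        /* Link the first write to the second so they execute in order. */
        if (i == 0)
            sqe->flags |= IOSQE_IO_LINK;
    }
    /* One syscall submits both writes and waits for both completions. */
    io_uring_submit_and_wait(&ring, 2);
    /* Mark both completions as consumed (results are not checked here). */
    io_uring_cq_advance(&ring, 2);
    io_uring_queue_exit(&ring);
    return 0;
}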
How Senior Engineers Fix It
Experienced engineers approach this by:
- Evaluating syscall‑reducing mechanisms such as:
  - io_uring (submission queue batching, fewer syscalls)
  - memfd + splice() pipelines for zero‑copy between FDs
  - user‑space buffering to coalesce small writes
- Benchmarking against PTYs, real TTYs, and pipes to understand device‑specific behavior.
- Measuring syscall rate, not just throughput.
- Avoiding micro‑benchmarks that misrepresent real workloads.
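Two of the points above, user-space buffering and measuring syscall rate, can be sketched together. The struct out_buf type, the functions out_put and out_flush, and the 4096-byte buffer size below are all hypothetical choices for illustration. The sketch does not handle EINTR or EAGAIN, and after a failed flush the buffer contents should be treated as undefined.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

#define OUT_BUF_SIZE 4096  /* assumed size; tune for the workload */

struct out_buf {
    int fd;                 /* destination descriptor */
    size_t len;             /* bytes currently buffered */
    unsigned long syscalls; /* write() calls issued so far */
    char data[OUT_BUF_SIZE];
};

/* Write out everything buffered, retrying on short writes.
 * Returns 0 on success, -1 on error. */
int out_flush(struct out_buf *b) {
    size_t off = 0;
    while (off < b->len) {
        ssize_t n = write(b->fd, b->data + off, b->len - off);
        b->syscalls++;
        if (n < 0)
            return -1;
        off += (size_t)n;
    }
    b->len = 0;
    return 0;
}

/* Append one message, flushing first if it would not fit. */
int out_put(struct out_buf *b, const char *msg) {
    size_t n = strlen(msg);
    assert(n <= OUT_BUF_SIZE);  /* larger messages are out of scope here */
    if (b->len + n > OUT_BUF_SIZE && out_flush(b) < 0)
        return -1;
    memcpy(b->data + b->len, msg, n);
    b->len += n;
    return 0;
}
```

A caller zero-initializes a struct out_buf, sets fd, calls out_put for each message, and calls out_flush before exiting. Comparing the syscalls counter against the number of messages gives the syscall rate directly, without relying on throughput alone.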
Why Juniors Miss It
Juniors often overlook these issues because:
- They assume write() is optimal for terminal output.
- They are unaware that io_uring works with TTYs, even though it is not always faster there.
- They focus on raw throughput, not syscall overhead.
- They rarely consider device‑specific behavior (TTY throttling, canonical mode).
- They underestimate how small writes distort benchmark results.
The key takeaway: for small writes, terminal output performance is dominated by syscall overhead, not data movement, and modern Linux provides multiple mechanisms to mitigate that, if you know where to look.