High CPU Usage in Multi-Process GStreamer Video Sharing Architecture
Summary
A GStreamer-based video streaming system on i.MX8 reached 360% CPU utilization (summed across cores) when sharing raw camera frames between one producer process and multiple consumer processes. The architecture used shmsink/shmsrc for inter-process communication, so frames were copied by the CPU instead of being shared zero-copy, and performance degraded severely. The root cause was distributing raw frames across process boundaries without using the hardware-accelerated buffer sharing mechanisms available on the platform.
Root Cause
The excessive CPU usage stemmed from fundamental architectural flaws in how raw video frames were shared between processes:
Primary causes:
- Memory copy overhead: shmsink/shmsrc performs full buffer copies for each consumer rather than zero-copy sharing
- Independent per-branch processing: Each consumer independently performed scaling and framerate conversion on identical raw frames
- No DMABUF sharing: Hardware-accelerated dmabuf buffers were not exported across process boundaries
- Redundant computation: Multiple processes performed identical frame transformations (scaling, color space conversion)
Technical breakdown:
- Raw NV12 1080p frame size: ~3MB per frame (1920 × 1080 × 1.5 bytes)
- Memory bandwidth per consumer: 3MB × 30 fps = 90MB/s minimum
- With 3 consumers: 270MB/s aggregate memory traffic
- Each shmsrc instance triggers separate memory allocation and copy operations
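The arithmetic above can be checked with a small C sketch. It assumes tightly packed NV12 (no stride padding, even dimensions) and counts one copy of every frame per consumer; the function names are illustrative and not part of any GStreamer or NXP API.

```c
#include <stdint.h>

/* Bytes in one tightly packed NV12 frame: a full-resolution Y plane
 * followed by an interleaved UV plane at half the size.
 * Assumes even width and height and no stride padding. */
uint64_t nv12_frame_bytes(uint32_t width, uint32_t height) {
    uint64_t luma = (uint64_t)width * height;
    return luma + luma / 2;
}

/* Bytes per second moved if every consumer receives its own copy
 * of every frame (one copy per frame per consumer). */
uint64_t copy_traffic_bytes_per_sec(uint32_t width, uint32_t height,
                                    uint32_t fps, uint32_t consumers) {
    return nv12_frame_bytes(width, height) * fps * consumers;
}
```

For 1920 × 1080 at 30 fps the exact figures are 3,110,400 bytes per frame, about 93 MB/s per consumer, and about 280 MB/s for three consumers, slightly above the rounded values in the list. Real traffic is higher still, since each copy involves both a read and a write.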
Why This Happens in Real Systems
This pattern emerges frequently in embedded multimedia systems due to misaligned abstractions between GStreamer elements and hardware capabilities.
Common contributing factors:
- GStreamer portability vs. platform optimization: Generic shmsink/shmsrc elements prioritize cross-platform compatibility over zero-copy efficiency
- Documentation gap: Developers often overlook platform-specific buffer sharing mechanisms like dmabuf export/import
- Premature abstraction: Using sockets/files for IPC before considering kernel-level buffer sharing
- Mental model mismatch: Assuming shared memory means zero-copy without verifying element implementation details
Why the intuition fails:
Expected: shmsink → kernel shared memory region → all consumers map same physical pages
Reality: Each shmsrc creates separate mmap → copies data → individual process buffers
The socket-path interface suggests zero-copy sharing but internally manages copies for process isolation and GStreamer buffer management.
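The "expected" model, in which every consumer maps the same pages, can be sketched in plain POSIX C. This is only an illustration of shared mappings: it is not how shmsink/shmsrc are implemented and it does not involve dmabuf. The function names and the temporary file path are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an unlinked temporary file of `size` bytes that stands in for
 * a shared frame buffer. Returns a file descriptor, or -1 on error. */
int create_shared_region(size_t size) {
    char path[] = "/tmp/frame-XXXXXX";   /* hypothetical location */
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);   /* the region now lives only through fds and mappings */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the region with MAP_SHARED. Every mapping of the same fd gets its
 * own virtual address but is backed by the same pages, so a write through
 * one mapping is visible through the others without any copy. */
unsigned char *map_shared_region(int fd, size_t size) {
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return addr == MAP_FAILED ? NULL : (unsigned char *)addr;
}
```

Mapping the same descriptor twice and writing through the first pointer makes the data readable through the second. Whenever a pipeline element calls memcpy into a private buffer instead, that property is lost, and the cost scales with frame size and consumer count.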
Real-World Impact
The performance degradation had cascading effects throughout the system:
Immediate impacts:
- CPU saturation preventing real-time processing
- Thermal throttling on embedded platform
- Unreliable RTSP streaming under load
- Inability to scale beyond 3 consumers
Business consequences:
- Scalability ceiling: Could not add more consumers without dropping frames
- Power consumption: High CPU usage reduced battery life (if battery-powered)
- Latency jitter: Frame drops and processing delays affected real-time requirements
- Maintenance burden: CPU headroom required constant monitoring
Cost implications:
- Required hardware upgrade to higher-tier SoC
- Development time spent firefighting performance issues
- Customer dissatisfaction with streaming reliability
Example or Code
Problematic Architecture (High CPU)
# Producer pipeline - captures and shares raw frames
gst-launch-1.0 v4l2src device=/dev/video3 io-mode=dmabuf ! \
video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
queue ! shmsink socket-path=/tmp/cam.sock wait-for-connection=false sync=false
# Consumer 1 - Recording (independent processing)
gst-launch-1.0 -e shmsrc socket-path=/tmp/cam.sock is-live=true do-timestamp=true ! \
video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
videoscale ! videorate ! v4l2h264enc ! h264parse ! mp4mux ! filesink location=record.mp4
# Consumer 2 - RTSP streaming (duplicate scaling/conversion)
gst-launch-1.0 shmsrc socket-path=/tmp/cam.sock is-live=true do-timestamp=true ! \
video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
videoscale ! videorate ! v4l2h264enc ! h264parse ! rtph264pay ! udpsink host=...
Optimized Architecture (Low CPU)
# Producer with dmabuf capture - raw frames never leave this process
# Convert with i.MX G2D, encode once in hardware, then fan out the compressed stream with tee
gst-launch-1.0 v4l2src device=/dev/video3 io-mode=dmabuf ! \
video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
queue ! \
imxvideoconvert_g2d ! \
v4l2h264enc ! \
tee name=t ! \
queue ! h264parse ! matroskamux ! filesink location=recording.mkv sync=false async=false \
t. ! queue ! h264parse ! rtph264pay ! udpsink host=127.0.0.1 port=5000 sync=false async=false \
t. ! queue ! h264parse ! rtph264pay ! udpsink host=127.0.0.1 port=5002 sync=false async=false
NXP-Specific Zero-Copy Solution
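#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

// Export capture buffer `index` from the V4L2 driver as a dmabuf file descriptor.
// Assumes the single-planar API; multi-planar drivers need
// V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE and one export per plane.
int export_dmabuf_to_prime(int v4l2_fd, uint32_t index, int *dma_fd) {
    struct v4l2_exportbuffer expbuf;
    memset(&expbuf, 0, sizeof(expbuf));
    expbuf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    expbuf.index = index;
    expbuf.flags = O_CLOEXEC;
    if (ioctl(v4l2_fd, VIDIOC_EXPBUF, &expbuf) == -1) {
        return -1;
    }
    *dma_fd = expbuf.fd;
    return 0;
}

// Import a dmabuf in the consumer process. V4L2 has no dedicated import ioctl:
// the fd is queued on a queue that was set up with VIDIOC_REQBUFS and
// V4L2_MEMORY_DMABUF (here the OUTPUT queue of, for example, an encoder).
int import_dmabuf_to_v4l2(int v4l2_fd, uint32_t index, int dma_fd) {
    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(buf));
    buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
    buf.memory = V4L2_MEMORY_DMABUF;
    buf.index = index;
    buf.m.fd = dma_fd;
    return ioctl(v4l2_fd, VIDIOC_QBUF, &buf);
}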
How Senior Engineers Fix It
Senior engineers address this through platform-aware architecture redesign rather than micro-optimizations:
Immediate fixes:
- Consolidate encode operations: Single hardware encoder with tee for distribution instead of multiple encode chains
- Leverage i.MX plugins: Use imxvideoconvert_g2d, imxvpudec, and other NXP-optimized elements
- Implement dmabuf sharing: Export dmabuf file descriptors through Unix domain sockets instead of raw frame copying
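Passing a dmabuf file descriptor to another process goes through SCM_RIGHTS ancillary data on a Unix domain socket. The following is a minimal sketch of that mechanism, assuming an already connected socket; send_fd and recv_fd are hypothetical helper names, and a real implementation would also send the buffer index, size, and format alongside the descriptor.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected Unix domain socket.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd) {
    char payload = 'F';   /* at least one data byte must accompany the fd */
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {               /* union guarantees cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd.
 * Returns the new descriptor (valid in this process), or -1 on error. */
int recv_fd(int sock) {
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The kernel duplicates the descriptor into the receiving process, so the number returned by recv_fd usually differs from the one sent while referring to the same underlying buffer. Only the descriptor crosses the socket, so the per-frame IPC cost no longer depends on frame size.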
Architectural improvements:
- Push work to hardware: Move scaling/color conversion to GPU/VPU when available
- Pipeline reconfiguration: Dynamic pipeline reconstruction using gst_pad_add_probe instead of static tee
- Memory layout optimization: Ensure contiguous NV12 format matches hardware requirements
NXP-specific solutions:
# Use the i.MX hardware scaler and encoder in a single pipeline
# The bitrate is set through V4L2 controls; the control name (video_bitrate here)
# is an assumption and depends on the encoder driver
gst-launch-1.0 -e v4l2src device=/dev/video3 io-mode=dmabuf ! \
video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
imxvideoconvert_g2d ! \
video/x-raw,width=1280,height=720 ! \
v4l2h264enc extra-controls="controls,video_bitrate=4096000" ! \
tee name=stream_tee ! \
queue max-size-buffers=0 max-size-time=0 ! \
h264parse ! matroskamux ! filesink location=output.mkv sync=false \
stream_tee. ! queue ! h264parse ! rtph264pay config-interval=1 ! \
udpsink host=239.255.0.1 port=5000 auto-multicast=true sync=false
Why Juniors Miss It
Junior engineers often fall into several conceptual traps that prevent identifying the core issue:
Common misconceptions:
- “Shared memory means zero-copy”: Assuming shmsrc/shmsink provide zero-copy semantics without verifying implementation
- Micro-benchmarking bias: Measuring individual pipeline stages instead of end-to-end memory flow
- Abstraction leakage ignorance: Not understanding how GStreamer caps affect buffer allocation
- Platform blindness: Using generic elements instead of platform-optimized alternatives
Debugging missteps:
- Focusing on encoder efficiency while ignoring upstream memory bottlenecks
- Measuring CPU usage per process rather than aggregate system memory traffic
- Testing with single consumer instead of realistic multi-consumer load
- Assuming socket-based IPC is inherently lightweight for large payloads
Knowledge gaps:
- DMABUF fundamentals: Not knowing how to export/import dmabuf file descriptors
- i.MX ecosystem: Unaware of the imx plugin family and hardware acceleration capabilities
- V4L2 memory models: Missing familiarity with VIDIOC_EXPBUF, V4L2_MEMORY_DMABUF, and buffer sharing ioctl calls
- Performance profiling: Lacking tools to trace memory bandwidth vs. CPU utilization patterns