Eliminating 360% CPU Overhead in GStreamer Multi‑Process Streaming

High CPU Usage in Multi-Process GStreamer Video Sharing Architecture

Summary

A GStreamer-based video streaming system on i.MX8 experienced 360% CPU utilization when sharing raw camera frames across producer and multiple consumer processes. The architecture used shmsink/shmsrc for inter-process communication, resulting in severe performance degradation due to memory copy overhead and lack of zero-copy buffer sharing. The root cause was inefficient raw frame distribution across process boundaries without leveraging hardware-accelerated buffer sharing mechanisms available on the platform.

Root Cause

The excessive CPU usage stemmed from fundamental architectural flaws in how raw video frames were shared between processes:

Primary causes:

  • Memory copy overhead: shmsink/shmsrc performs full buffer copies for each consumer rather than zero-copy sharing
  • Independent per-branch processing: Each consumer independently performed scaling and framerate conversion on identical raw frames
  • No DMABUF sharing: Hardware-accelerated dmabuf buffers were not exported across process boundaries
  • Redundant computation: Multiple processes performed identical frame transformations (scaling, color space conversion)

Technical breakdown:

  • Raw NV12 1080p frame size: ~3MB per frame
  • Copy bandwidth per consumer: 3MB × 30 fps = 90MB/s minimum
  • With 3 consumers: 270MB/s aggregate memory traffic
  • Each shmsrc instance triggers separate memory allocation and copy operations
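
The figures above can be reproduced with a short calculation. This is a minimal sketch: the function names are illustrative, and it assumes tightly packed NV12 with no row padding (real drivers often pad each row to a stride).

```c
#include <stdint.h>

/* NV12 stores a full-resolution 8-bit luma plane followed by an interleaved
 * chroma plane subsampled by two in each direction: 1.5 bytes per pixel. */
uint64_t nv12_frame_bytes(uint32_t width, uint32_t height) {
    uint64_t luma = (uint64_t)width * height;
    uint64_t chroma = luma / 2;
    return luma + chroma;
}

/* Bytes per second moved if every consumer receives its own copy of every frame. */
uint64_t copy_traffic_bytes_per_sec(uint32_t width, uint32_t height,
                                    uint32_t fps, uint32_t consumers) {
    return nv12_frame_bytes(width, height) * fps * consumers;
}
```

For 1920×1080 this gives 3,110,400 bytes per frame, about 93MB/s per consumer at 30 fps and about 280MB/s for three consumers, slightly above the rounded figures in the list. A copy both reads and writes every byte, so the traffic seen by the memory bus is roughly double these payload numbers.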

Why This Happens in Real Systems

This pattern emerges frequently in embedded multimedia systems due to misaligned abstractions between GStreamer elements and hardware capabilities.

Common contributing factors:

  • GStreamer portability vs. platform optimization: Generic shmsink/shmsrc elements prioritize cross-platform compatibility over zero-copy efficiency
  • Documentation gap: Developers often overlook platform-specific buffer sharing mechanisms like dmabuf export/import
  • Premature abstraction: Using sockets/files for IPC before considering kernel-level buffer sharing
  • Mental model mismatch: Assuming shared memory means zero-copy without verifying element implementation details

Why the intuition fails:

Expected: shmsink → kernel shared memory region → all consumers map same physical pages
Reality: Each shmsrc creates separate mmap → copies data → individual process buffers

The socket-path interface suggests zero-copy sharing but internally manages copies for process isolation and GStreamer buffer management.
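
The "expected" model can be illustrated in isolation. The sketch below is not how shmsink or shmsrc are implemented. It only shows what sharing the same pages means: two MAP_SHARED mappings of one backing file, standing in for a producer and a consumer, see the same bytes without any copy between them. The function name and the use of a temporary file as backing store are assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Maps the same file-backed region twice with MAP_SHARED, writes through the
 * first mapping and checks that the second mapping observes the data.
 * Returns 1 if the write is visible, 0 if it is not, -1 on error. */
int shared_mapping_sees_write(size_t size) {
    FILE *backing = tmpfile();
    if (backing == NULL) return -1;
    int fd = fileno(backing);
    if (size == 0 || ftruncate(fd, (off_t)size) != 0) {
        fclose(backing);
        return -1;
    }

    /* "Producer" view (read/write) and "consumer" view (read-only). */
    unsigned char *writer = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    unsigned char *reader = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

    int result = -1;
    if (writer != MAP_FAILED && reader != MAP_FAILED) {
        memset(writer, 0xAB, size);
        result = (reader[0] == 0xAB && reader[size - 1] == 0xAB) ? 1 : 0;
    }

    if (writer != MAP_FAILED) munmap(writer, size);
    if (reader != MAP_FAILED) munmap(reader, size);
    fclose(backing);
    return result;
}
```

Any copy into or out of such a region, for example to satisfy a buffer pool or alignment requirement, reintroduces the per-frame cost that the shared mapping was supposed to remove.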

Real-World Impact

The performance degradation had cascading effects throughout the system:

Immediate impacts:

  • CPU saturation preventing real-time processing
  • Thermal throttling on embedded platform
  • Unreliable RTSP streaming under load
  • Inability to scale beyond 3 consumers

Business consequences:

  • Scalability ceiling: Could not add more consumers without dropping frames
  • Power consumption: High CPU usage reduced battery life (if battery-powered)
  • Latency jitter: Frame drops and processing delays affected real-time requirements
  • Maintenance burden: CPU headroom required constant monitoring

Cost implications:

  • Required hardware upgrade to higher-tier SoC
  • Development time spent firefighting performance issues
  • Customer dissatisfaction with streaming reliability

Example or Code

Problematic Architecture (High CPU)

# Producer pipeline - captures and shares raw frames
gst-launch-1.0 v4l2src device=/dev/video3 io-mode=dmabuf ! \
  video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
  queue ! shmsink socket-path=/tmp/cam.sock wait-for-connection=false sync=false

# Consumer 1 - Recording (independent scaling, rate conversion and encoding)
gst-launch-1.0 -e shmsrc socket-path=/tmp/cam.sock is-live=true do-timestamp=true ! \
  video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
  videoscale ! videorate ! v4l2h264enc ! h264parse ! mp4mux ! filesink location=record.mp4

# Consumer 2 - RTSP streaming (duplicate scaling/conversion)
gst-launch-1.0 shmsrc socket-path=/tmp/cam.sock is-live=true do-timestamp=true ! \
  video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
  videoscale ! videorate ! v4l2h264enc ! h264parse ! rtph264pay ! udpsink host=...

Optimized Architecture (Low CPU)

# Producer with dmabuf capture and a single hardware encode point. The tee
# distributes compressed H.264 instead of raw frames, and imxvideoconvert_g2d
# uses the i.MX G2D engine for format conversion.
gst-launch-1.0 -e v4l2src device=/dev/video3 io-mode=dmabuf ! \
  video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
  queue ! \
  imxvideoconvert_g2d ! \
  v4l2h264enc ! \
  tee name=t ! \
  queue ! h264parse ! matroskamux ! filesink location=recording.mkv sync=false async=false \
  t. ! queue ! h264parse ! rtph264pay ! udpsink host=127.0.0.1 port=5000 sync=false async=false \
  t. ! queue ! h264parse ! rtph264pay ! udpsink host=127.0.0.1 port=5002 sync=false async=false

NXP-Specific Zero-Copy Solution

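#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

// Export a capture buffer from the V4L2 driver as a dmabuf file descriptor.
// Multi-planar drivers need V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE and a plane index.
int export_dmabuf_to_prime(int v4l2_fd, unsigned int index, int *dma_fd) {
    struct v4l2_exportbuffer expbuf;
    memset(&expbuf, 0, sizeof(expbuf));
    expbuf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    expbuf.index = index;
    expbuf.flags = O_CLOEXEC;

    if (ioctl(v4l2_fd, VIDIOC_EXPBUF, &expbuf) == -1) {
        return -1;
    }

    *dma_fd = expbuf.fd;
    return 0;
}

// Import a dmabuf in the consumer process. V4L2 has no dedicated import ioctl:
// the fd (received from the producer, e.g. via SCM_RIGHTS over a Unix domain
// socket) is queued on a device whose buffers were requested with
// memory = V4L2_MEMORY_DMABUF, such as an encoder's output queue.
int import_dmabuf_to_v4l2(int v4l2_fd, unsigned int index, int dma_fd,
                          unsigned int bytesused) {
    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(buf));
    buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
    buf.memory = V4L2_MEMORY_DMABUF;
    buf.index = index;
    buf.bytesused = bytesused;
    buf.m.fd = dma_fd;

    return ioctl(v4l2_fd, VIDIOC_QBUF, &buf);
}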

How Senior Engineers Fix It

Senior engineers address this through platform-aware architecture redesign rather than micro-optimizations:

Immediate fixes:

  • Consolidate encode operations: Single hardware encoder with tee for distribution instead of multiple encode chains
  • Leverage i.MX plugins: Use imxvideoconvert_g2d, imxvpudec, and other NXP-optimized elements
  • Implement dmabuf sharing: Export dmabuf file descriptors through Unix domain sockets instead of raw frame copying
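
Passing a dmabuf file descriptor between processes relies on SCM_RIGHTS ancillary data over a Unix domain socket. The following is a minimal sketch of that mechanism, not code from the system described here: send_fd and recv_fd are hypothetical helper names, and it assumes an already connected AF_UNIX socket and one descriptor per message.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected Unix domain socket.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd_to_send) {
    char payload = 'F';  /* at least one byte of regular data must accompany the fd */
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align;  /* forces correct alignment of the buffer */
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    memset(&control, 0, sizeof(control));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock) {
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    memset(&control, 0, sizeof(control));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(sock, &msg, 0) != 1) return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS || cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The descriptor number usually differs in the receiving process, but it refers to the same underlying buffer, so only a few bytes cross the socket per frame instead of ~3MB. A real design also needs a return channel so the producer knows when a consumer has finished with a buffer and it can be requeued.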

Architectural improvements:

  • Push work to hardware: Move scaling/color conversion to GPU/VPU when available
  • Pipeline reconfiguration: Add and remove tee branches at runtime with blocking pad probes (gst_pad_add_probe) instead of a fixed set of static branches
  • Memory layout optimization: Ensure contiguous NV12 format matches hardware requirements

NXP-specific solutions:

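# Use imx hardware scaler and encoder pipeline. v4l2h264enc takes its bitrate
# through extra-controls in bits per second (4096 kbit/s here); the control
# names available depend on the encoder driver.
gst-launch-1.0 -e v4l2src device=/dev/video3 io-mode=dmabuf ! \
  video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
  imxvideoconvert_g2d ! \
  video/x-raw,width=1280,height=720 ! \
  v4l2h264enc extra-controls="controls,video_bitrate=4096000" ! \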
  tee name=stream_tee ! \
  queue max-size-buffers=0 max-size-time=0 ! \
  h264parse ! matroskamux ! filesink location=output.mkv sync=false \
  stream_tee. ! queue ! h264parse ! rtph264pay config-interval=1 ! \
  udpsink host=239.255.0.1 port=5000 auto-multicast=true sync=false

Why Juniors Miss It

Junior engineers often fall into several conceptual traps that prevent identifying the core issue:

Common misconceptions:

  • “Shared memory means zero-copy”: Assuming shmsrc/shmsink provide zero-copy semantics without verifying implementation
  • Micro-benchmarking bias: Measuring individual pipeline stages instead of end-to-end memory flow
  • Abstraction leakage ignorance: Not understanding how GStreamer caps affect buffer allocation
  • Platform blindness: Using generic elements instead of platform-optimized alternatives

Debugging missteps:

  • Focusing on encoder efficiency while ignoring upstream memory bottlenecks
  • Measuring CPU usage per process rather than aggregate system memory traffic
  • Testing with single consumer instead of realistic multi-consumer load
  • Assuming socket-based IPC is inherently lightweight for large payloads
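
One way to take a system-wide view rather than a per-process one is to sample the aggregate "cpu" line of /proc/stat twice and compare the deltas. The sketch below is illustrative only: the function names are hypothetical, it measures CPU time rather than memory traffic, and it treats iowait as idle time.

```c
#include <stdio.h>
#include <string.h>

/* Parses the aggregate "cpu ..." line from /proc/stat into busy and total
 * jiffies. Per-core lines such as "cpu0 ..." are rejected.
 * Returns 0 on success, -1 on failure. */
int parse_cpu_line(const char *line, unsigned long long *busy,
                   unsigned long long *total) {
    unsigned long long user, nice, sys, idle;
    unsigned long long iowait = 0, irq = 0, softirq = 0, steal = 0;

    if (strncmp(line, "cpu ", 4) != 0) return -1;
    int fields = sscanf(line + 4, "%llu %llu %llu %llu %llu %llu %llu %llu",
                        &user, &nice, &sys, &idle,
                        &iowait, &irq, &softirq, &steal);
    if (fields < 4) return -1;

    *busy = user + nice + sys + irq + softirq + steal;
    *total = *busy + idle + iowait;
    return 0;
}

/* Busy percentage between two samples, on a 0..100 scale across all cores. */
double busy_percent(unsigned long long busy0, unsigned long long total0,
                    unsigned long long busy1, unsigned long long total1) {
    if (total1 <= total0 || busy1 < busy0) return 0.0;
    return 100.0 * (double)(busy1 - busy0) / (double)(total1 - total0);
}
```

Tools such as top report per-process figures where 100% is one core, so on an N-core system a per-process sum of 360% corresponds to 360/N percent on this scale. Comparing the two views shows whether the load is concentrated in one pipeline or spread across every process that handles the raw frames.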

Knowledge gaps:

  • DMABUF fundamentals: Not knowing how to export/import dmabuf file descriptors
  • i.MX ecosystem: Unaware of imx plugin family and hardware acceleration capabilities
  • V4L2 memory models: Missing familiarity with VIDIOC_EXPBUF, V4L2_MEMORY_DMABUF, and the related buffer sharing ioctl calls
  • Performance profiling: Lacking tools to trace memory bandwidth vs. CPU utilization patterns
