Mitigating GPU/NPU contention for MediaPipe vision and INT4 LLM

Summary

The system suffered severe performance degradation and thermal throttling when running two AI workloads concurrently: a MediaPipe vision pipeline and an INT4 quantized LLM. Each model worked in isolation, but running them together caused hardware resource contention. Both pipelines competed for the same GPU compute units and shared memory bandwidth, which drove up the SoC (System on Chip) temperature until the OS throttled clock speeds and eventually terminated the application to protect the hardware.

Root Cause

The failure stems from leaving accelerator selection to framework defaults. On mobile SoCs, the boundary between the NPU (Neural Processing Unit) and the GPU is often blurred or poorly partitioned by high-level frameworks.

Compute Resource Overlap: Both MediaPipe and the LLM inference engine were defaulting to the GPU delegate for acceleration, causing massive scheduling contention.
Memory Bandwidth Saturation: The INT4 LLM requires massive throughput for weight loading, while MediaPipe requires high-frequency, low-latency access for frame processing. The interconnect bus became a bottleneck.
Thermal Saturation: Sustained high utilization of the GPU/NPU, with no staggering between the two workloads, exhausted the thermal budget and forced the kernel to downclock the entire SoC.
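
The bandwidth argument above can be sanity-checked with rough arithmetic. The sketch below assumes that LLM decoding streams every weight from memory once per generated token and that the vision pipeline touches each raw frame a fixed number of times. The function names, the 3B-parameter model size, the token rate, and the `passes` multiplier are illustrative assumptions, not measurements from this incident.

```python
def llm_decode_bandwidth_gbps(num_params: float, bits_per_weight: int,
                              tokens_per_s: float) -> float:
    """Approximate weight traffic in GB/s during token-by-token decoding.

    Assumes every weight is read from memory once per generated token.
    """
    bytes_per_token = num_params * bits_per_weight / 8
    return bytes_per_token * tokens_per_s / 1e9


def vision_bandwidth_gbps(width: int, height: int, channels: int,
                          fps: float, passes: int = 1) -> float:
    """Approximate raw frame traffic in GB/s.

    `passes` is a rough multiplier for how many times each frame is
    read or written across the pipeline stages (an assumption).
    """
    return width * height * channels * fps * passes / 1e9


# Illustrative inputs only: a 3B-parameter INT4 model at 10 tokens/s,
# and a 720p RGB stream at 30 FPS touched four times per frame.
llm = llm_decode_bandwidth_gbps(num_params=3e9, bits_per_weight=4, tokens_per_s=10)
vision = vision_bandwidth_gbps(1280, 720, 3, fps=30, passes=4)
print(f"LLM weights: ~{llm:.1f} GB/s, vision frames: ~{vision:.2f} GB/s")
```

Under these assumptions the LLM dominates total traffic, while the vision pipeline is the workload that is sensitive to latency, which is consistent with the frame rate collapsing first.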

Why This Happens in Real Systems

In production, developers often rely on “Auto-Delegate” features provided by frameworks like TensorFlow Lite or MediaPipe.

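Abstraction Leaks: High-level APIs hide the fact that “GPU Acceleration” in one framework and an NPU-style delegate in another can resolve to the same physical compute units. NNAPI, for example, may dispatch operations to the GPU, DSP, or NPU depending on the vendor driver.
Oversubscription: Most mobile ML frameworks are designed for single-task optimization. When two heavy models run together, the combined load exceeds what the device can sustain: the thermal governor throttles clocks in response to heat, and under memory pressure the OOM (Out of Memory) killer, or Android's low-memory killer, may terminate the process.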
Implicit Dependencies: Modern mobile chips use a Unified Memory Architecture (UMA). Even if you use different processors (NPU vs GPU), they compete for the same LPDDR memory controller.

Real-World Impact

Unstable Frame Rates: The vision pipeline’s FPS dropped from 30 to below 5, making real-time monitoring unusable.
Increased Latency: LLM reasoning time increased by 4x due to thermal throttling.
Application Crashes: The mobile OS terminated the background process to prevent device overheating.
Poor User Experience: Significant device heat makes the hardware uncomfortable to hold.

Example or Code

# Concept: assign each model to a different accelerator to avoid GPU contention.
# Model file names and the delegate library name below are placeholders.

import mediapipe as mp
import tensorflow as tf

# 1. MediaPipe: request the GPU explicitly for the high-frequency vision task.
# The legacy mp.solutions.face_detection API does not expose delegate selection,
# so this uses the MediaPipe Tasks API, where BaseOptions accepts a delegate.
# The detector variant is chosen by the model file rather than model_selection.
base_options = mp.tasks.BaseOptions(
    model_asset_path="face_detector.tflite",  # placeholder model file
    delegate=mp.tasks.BaseOptions.Delegate.GPU,
)
vision_pipeline = mp.tasks.vision.FaceDetector.create_from_options(
    mp.tasks.vision.FaceDetectorOptions(
        base_options=base_options,
        min_detection_confidence=0.5,
    )
)

# 2. LLM: route to the NPU/DSP through a TFLite delegate, leaving the GPU to vision.
# The delegate is a vendor-specific shared library; this name is a placeholder.
npu_delegate = tf.lite.experimental.load_delegate("libnnapi_delegate.so")

# The Python Interpreter takes delegates directly; there is no separate options object.
llm_interpreter = tf.lite.Interpreter(
    model_path="int4_quantized_llm.tflite",
    experimental_delegates=[npu_delegate],
)
llm_interpreter.allocate_tensors()

How Senior Engineers Fix It

Senior engineers move away from “magic” auto-configuration and implement explicit resource partitioning.

Hard Delegate Assignment: Force the vision pipeline onto the GPU (optimized for spatial data) and the LLM onto the NPU/DSP (optimized for heavy integer math/INT4).
Execution Staggering: Implement a cooperative scheduler. Instead of running both models in a tight loop, run the vision pipeline every frame and the LLM every N frames to allow the SoC to “breathe.”
Memory Pressure Management: Use zero-copy buffers (like Android Hardware Buffers) to pass data between the camera and the models without redundant CPU-side copies.
Thermal-Aware Throttling: Build an application-level thermal listener. If the device temperature exceeds a threshold, programmatically reduce the LLM’s inference frequency or switch to a smaller model.
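
Execution staggering and thermal-aware throttling can be combined in one small scheduler. The following is a minimal sketch, not the implementation used in this incident: `StaggeredScheduler`, `run_vision`, `run_llm`, and `read_thermal_level` are hypothetical names, and the thermal levels and default intervals are arbitrary. On Android, the thermal reading could plausibly be fed from the PowerManager thermal status API.

```python
from typing import Any, Callable


class StaggeredScheduler:
    """Run vision on every frame and the LLM at most once per `interval` frames.

    When the LLM comes due and the thermal reading is at or above `hot_level`,
    that run is skipped and the interval doubles (capped at `max_interval`).
    After a run on a cool device, the interval halves back toward `base_interval`.
    """

    def __init__(
        self,
        run_vision: Callable[[Any], None],
        run_llm: Callable[[Any], None],
        read_thermal_level: Callable[[], int],  # hypothetical sensor hook
        base_interval: int = 30,
        max_interval: int = 240,
        hot_level: int = 2,
    ) -> None:
        self.run_vision = run_vision
        self.run_llm = run_llm
        self.read_thermal_level = read_thermal_level
        self.base_interval = base_interval
        self.max_interval = max_interval
        self.hot_level = hot_level
        self.interval = base_interval
        self._frames_since_llm = 0

    def on_frame(self, frame: Any) -> None:
        # Vision is latency-sensitive, so it always runs.
        self.run_vision(frame)
        self._frames_since_llm += 1
        if self._frames_since_llm < self.interval:
            return
        self._frames_since_llm = 0
        if self.read_thermal_level() >= self.hot_level:
            # Too hot: skip this LLM run and back off.
            self.interval = min(self.interval * 2, self.max_interval)
            return
        self.run_llm(frame)
        # Cool enough: recover toward the base cadence.
        self.interval = max(self.interval // 2, self.base_interval)


if __name__ == "__main__":
    # Stub callbacks stand in for the real pipelines.
    scheduler = StaggeredScheduler(
        run_vision=lambda frame: None,
        run_llm=lambda frame: print(f"LLM ran on frame {frame}"),
        read_thermal_level=lambda: 0,
        base_interval=3,
    )
    for frame_id in range(9):
        scheduler.on_frame(frame_id)
```

Checking the temperature only when the LLM is due keeps the per-frame cost low, and doubling rather than stopping outright lets the LLM keep making slow progress while the device cools.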

Why Juniors Miss It

The “It Works on My Machine” Trap: Juniors often test on high-end flagship devices (e.g., iPhone 15 Pro or Galaxy S24 Ultra), whose larger thermal headroom masks the contention.
Over-reliance on Defaults: They assume that a single call such as .use_gpu() or .use_npu() is enough, without realizing that multiple “optimizations” can fight for the same hardware.
Focusing on Accuracy over Throughput: They optimize for model precision (FP32) rather than system-level stability and power efficiency.
Ignoring the Interconnect: They treat the NPU and GPU as isolated islands, forgetting that they share the same thermal envelope and memory bus.
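
One way to surface this kind of contention during testing is to log FPS over a sustained run and compare the start of the run against the end. The helper below is a hypothetical sketch: `throughput_drop` and the window size are illustrative, and the sample values simply mirror the 30 to below-5 FPS drop described earlier.

```python
from statistics import mean
from typing import Sequence


def throughput_drop(fps_samples: Sequence[float], window: int) -> float:
    """Fractional drop between the first and last `window` FPS samples.

    Returns 0.0 for no change and values near 1.0 for a near-total collapse.
    """
    if window <= 0 or len(fps_samples) < 2 * window:
        raise ValueError("need at least two full windows of samples")
    start = mean(fps_samples[:window])
    end = mean(fps_samples[-window:])
    if start <= 0:
        raise ValueError("starting throughput must be positive")
    return 1.0 - end / start


# Illustrative samples: a run that starts at 30 FPS and ends at 4 FPS.
samples = [30.0] * 5 + [4.0] * 5
drop = throughput_drop(samples, window=5)
print(f"throughput fell by {drop:.0%}")
```

A large drop that appears only after several minutes of combined load, and not when each model runs alone, points to thermal or bandwidth contention.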