130 ms latency in torch.multiprocessing Queue for CUDA IPC

Summary

An investigation into high latency measurements during GPU tensor transfers via torch.multiprocessing.Queue revealed a significant discrepancy between perceived and actual IPC (Inter-Process Communication) speed. While the developer expected microsecond-scale latency for CUDA IPC handles, measurements consistently showed ~130ms per item. The investigation determined that the high latency was not a result of slow data transfer, but rather an instrumentation artifact caused by measuring the wrong side of the synchronization barrier.

Root Cause

The primary cause is the asymmetric nature of the mp.Queue.put() operation combined with how Python’s multiprocessing primitives handle serialization and buffer management.

Serialization Overhead: When queue.put() is called, the object (including the tuple, integers, and the CUDA tensor handle) must be pickled and written to the underlying pipe buffer.
Producer-Side Blocking: The ts_start timestamp was recorded before the put() call, but the measurement of latency occurred in the consumer after the get() call.
The Buffer Bottleneck: In the “high latency” implementation, the producer finishes its work and then the consumer processes items. However, mp.Queue uses a background thread to feed the pipe. The time recorded includes the time spent waiting for the internal feeder thread to actually move the pickled data from the local object to the OS pipe.
Clock Misalignment: The measurement was capturing the total duration of the serialization and queuing lifecycle, rather than the actual movement of the CUDA IPC handle across the wire.

Why This Happens in Real Systems

In production distributed training or high-throughput inference pipelines, this phenomenon is common due to:

Serialization Latency: Large metadata attached to tensors (dictionaries, complex objects) can dwarf the actual cost of passing a 64-bit CUDA memory handle.
Context Switching & GIL: In Python, the Global Interpreter Lock (GIL) can delay the background thread responsible for flushing the multiprocessing.Queue buffer to the OS, making the put() operation appear much slower than the data transfer itself.
System Buffer Pressure: If the consumer is slower than the producer, the OS pipe buffers fill up, causing the producer’s put() to block, effectively measuring backpressure instead of transfer latency.

Real-World Impact

Misleading Bottleneck Analysis: Engineers may waste weeks optimizing CUDA kernels or IPC mechanisms when the real bottleneck is CPU-side serialization or queue management overhead.
Suboptimal Pipeline Design: High perceived latency might lead to an unnecessary increase in the number of worker processes, which actually increases contention and lowers total throughput.
Incorrect Scaling Laws: Automated scaling logic based on “queue latency” will trigger prematurely if the latency metric includes serialization and thread-scheduling jitter.

Example or Code (if necessary and relevant)

The following code demonstrates the correct way to measure the true IPC transfer latency by isolating the timestamping to the moment the data is actually available in the consumer.

import torch
import torch.multiprocessing as mp
import time

def producer(queue, num_items):
    for i in range(num_items):
        # Create a 5MB tensor on GPU
        tensor = torch.randn(1280 * 1000, device="cuda")
        torch.cuda.synchronize()

        # We only put the tensor. 
        # Do NOT include the timestamp in the 'put' if you want to measure 
        # the queue's internal transfer time.
        queue.put(tensor)
    queue.put(None)

def consumer(queue):
    latencies = []
    while True:
        # Start timing EXACTLY when the item is retrieved from the pipe
        start_wait = time.perf_counter_ns()
        tensor = queue.get()
        if tensor is None:
            break
        end_wait = time.perf_counter_ns()

        # This measures the time the item spent sitting in the queue/pipe
        latencies.append((end_wait - start_wait) / 1e6)
    print(f"Average Transfer Latency: {sum(latencies)/len(latencies):.4f} ms")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    p = mp.Process(target=producer, args=(q, 10))
    c = mp.Process(target=consumer, args=(q,))
    p.start()
    c.start()
    p.join()
    c.join()

How Senior Engineers Fix It

Senior engineers approach this by decoupling the measurement from the mechanism:

Isolate the Variable: Instead of measuring (Consumer_Time - Producer_Start_Time), they measure (Consumer_Arrival_Time - Producer_Completion_Time).
Use Hardware Counters: For GPU-to-GPU transfers, they rely on NVIDIA Nsight Systems or nvprof to see the actual hardware-level memory copy events rather than relying on Python-level timestamps.
Zero-Copy Verification: They verify that CUDA IPC is actually being used by checking if the storage().data_ptr() remains consistent or by using tools that monitor memory handle creation.
Profiling the Feeder Thread: They recognize that multiprocessing.Queue is not a direct pipe but a managed buffer with a background thread, and they account for the non-deterministic scheduling of that thread.

Why Juniors Miss It

Assuming “Linear” Execution: Juniors often assume that queue.put() is a synchronous, atomic operation that immediately places data in the consumer’s hands.
Trusting Micro-benchmarks: They take the first high number they see in a print statement as a literal truth of the system’s performance, without questioning the instrumentation overhead.
Ignoring the “Hidden” Threads: They overlook the fact that Python’s multiprocessing module spawns auxiliary threads to manage the queue, which are subject to the OS scheduler and the GIL.
Mixing Logic and Measurement: They include the preparation time (creating the tensor, synchronizing CUDA) within the latency window, effectively measuring the entire loop iteration rather than the communication step.