Broken NVIDIA/CUDA install

Summary

A CUDA/NVIDIA driver upgrade on Ubuntu 24.04 resulted in dependency hell and an NVML version mismatch. The system entered a state where apt refused to operate due to broken package dependencies (libnvidia-compute vs libnvidia-cfg1), and nvidia-smi failed with "Failed to initialize NVML: Driver/library version mismatch". This typically occurs because the kernel driver version loaded in memory differs from the library versions installed on the filesystem.
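The mismatch can usually be confirmed by comparing the version the loaded kernel module reports against the installed user-space library. A minimal sketch; `extract_module_version` is a hypothetical helper, and it assumes the standard `/proc/driver/nvidia/version` format ("NVRM version: NVIDIA UNIX x86_64 Kernel Module  590.44 ..."):

```shell
# Hypothetical helper: pull the driver version out of the NVRM line in
# /proc/driver/nvidia/version (format assumed, see lead-in).
extract_module_version() {
  sed -n 's/.*Kernel Module *\([0-9.][0-9.]*\).*/\1/p'
}

# On an affected machine, compare the two sides:
#   extract_module_version < /proc/driver/nvidia/version   # kernel side
#   dpkg -l | grep libnvidia-compute                       # user-space side
```

If the two versions differ, you are looking at exactly the state this writeup describes.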

Root Cause

The primary cause is a mismatch between the user-space CUDA libraries (installed via apt) and the kernel-mode NVIDIA driver (loaded via DKMS or a previous install).

Specifically:

  • Library Version Drift: The nvidia-smi binary linked against NVML library version 590.44, but the running kernel module corresponds to a different driver version.
  • apt/dpkg State Corruption: The libnvidia-compute package explicitly requires a matching libnvidia-cfg1 version. When partial upgrades or conflicting repositories (e.g., mixing Ubuntu official drivers with NVIDIA direct drivers) are used, dpkg sees a dependency constraint that cannot be satisfied, blocking all further package management operations.
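The compatibility rule behind the NVML error can be sketched as a version comparison. `same_release_branch` is an illustrative helper, not NVIDIA code, and comparing major versions is a simplification: real NVML checks the exact driver build.

```shell
# Simplified sketch of the NVML rule: the user-space library and the
# kernel module must come from the same driver release. Real NVML
# compares exact builds; major-version comparison is a simplification.
same_release_branch() {
  [ "${1%%.*}" = "${2%%.*}" ]
}
```

Here `same_release_branch 590.44 590.44` succeeds, while `same_release_branch 590.44 550.54` fails, which is roughly the state the broken system is in.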

Why This Happens in Real Systems

This is a classic “state mismatch” scenario common in Linux graphics stacks:

  • Incomplete Uninstalls: When users attempt to remove old drivers to install new ones via apt, residual configuration files or locked kernel modules often remain.
  • Multi-Repository Conflicts: Ubuntu 24.04’s apt sources might contain a different CUDA version than the NVIDIA runfile installer or a third-party PPA. If you run apt upgrade after installing via the NVIDIA .run file, it often breaks the user-space binaries without touching the kernel module.
  • Kernel Updates: A kernel update often triggers a DKMS rebuild. If the rebuild fails silently (missing headers, gcc version issues), the new kernel boots with the old driver, but apt installs the new user-space libraries.
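The silent-DKMS-failure case above is cheap to check. With dkms installed, `dkms status` shows whether the nvidia module was rebuilt; the sketch below instead counts module files actually installed for the running kernel (the path `/lib/modules/<kernel>/updates/dkms/` is the Ubuntu default and is an assumption here):

```shell
# Count nvidia DKMS module files built for the *running* kernel.
# A count of 0 after a kernel update means the rebuild never happened.
built=$(ls /lib/modules/"$(uname -r)"/updates/dkms/ 2>/dev/null | grep -c nvidia || true)
echo "nvidia dkms modules built for $(uname -r): $built"
```

A zero here, combined with freshly upgraded user-space libraries, reproduces the mismatch exactly.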

Real-World Impact

  • System Deadlock: The user cannot install, remove, or repair packages because apt is locked by unmet dependencies.
  • Compute Failure: All CUDA-dependent workloads (AI training, rendering, scientific computing) immediately crash.
  • Time Sink: The error messages (Depends: libnvidia-cfg1 but it is not going to be installed) are misleading; running apt --fix-broken install often fails or removes critical packages, causing further damage.
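Before trusting apt --fix-broken install, its proposed actions can be previewed without applying anything; `-s` (simulate) is a standard apt-get flag:

```shell
# Dry-run of the fix-broken operation: prints what apt-get *would*
# install/remove, but changes nothing. The fallback message is for
# systems where apt-get is unavailable.
out=$(apt-get -s -f install 2>/dev/null || echo "apt-get unavailable or state unreadable")
echo "$out"
```

If the simulation proposes removing the driver metapackages themselves, that is the signal to do a full state reset instead (see below: "How Senior Engineers Fix It").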

Example or Code

This specific error is triggered when the driver API version does not match the client library version.

// Simplified conceptual check within NVML
// When nvidia-smi runs, it calls nvmlInitWithFlags(0)
// The driver returns a version mismatch if the kernel module version
// (compiled against specific headers) doesn't match the library version.

#include <stdio.h>
#include <nvml.h>

void check_compatibility(void) {
    nvmlReturn_t result = nvmlInitWithFlags(0);
    if (result == NVML_ERROR_LIB_RM_VERSION_MISMATCH) {
        // This is essentially what the user sees:
        // "Failed to initialize NVML: Driver/library version mismatch"
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));
    } else if (result == NVML_SUCCESS) {
        nvmlShutdown();
    }
}

How Senior Engineers Fix It

Seniors avoid “patching” apt and instead perform a state reset.

  1. Purge, Don’t Remove: Do not just apt remove. You must purge all NVIDIA packages to remove configuration remnants.
    sudo apt-get purge '*nvidia*' '*cuda*' '*cudnn*'
  2. Clean Artifacts: Manually remove leftover folders and update the package cache.
    sudo rm -rf /usr/local/cuda*
    sudo apt autoremove
    sudo apt autoclean
    sudo apt update
  3. Reinstall via Preferred Method (Clean Slate):
    • Option A (Official Repo): Re-add the correct NVIDIA key and repository for 24.04, then install nvidia-driver-550 (or latest).
    • Option B (Runfile): Download the official .run file and use --uninstall first if necessary, then reinstall.
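After step 3 and a reboot, the repair can be verified by checking that both sides report a version again; the query flags below are standard nvidia-smi options, and the fallback strings are hypothetical placeholders for machines where the driver is absent:

```shell
# Post-reinstall verification: both sides should report the same release.
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
  || echo "driver not available")
module=$(cat /proc/driver/nvidia/version 2>/dev/null || echo "module not loaded")
echo "user-space driver: $driver"
echo "kernel module:     $module"
```

If the two outputs disagree on the driver release, the reset was incomplete and the mismatch will return on the next nvmlInit call.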

Why Juniors Miss It

  • Fear of Purging: Juniors are often afraid to apt purge core drivers, fearing it will break the OS, and instead try to force apt install -f (fix-broken) which rarely works for major version mismatches.
  • Ignoring Kernel Versions: They fail to check if the running kernel matches the one the driver was compiled against (uname -r).
  • Mixing Tools: They try to fix a CUDA apt install by using pip or conda to install different versions of PyTorch/TensorFlow, which only masks the underlying system driver issue.
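The kernel check juniors skip takes two commands; `modinfo` reads the on-disk module without loading it, so it works even when the loaded module is stale (the fallback string below is a hypothetical placeholder):

```shell
# Does an nvidia module exist on disk for the kernel you are running,
# and what driver version was it built from?
running=$(uname -r)
modver=$(modinfo -F version nvidia 2>/dev/null || echo "none found for $running")
modver=${modver:-unknown}
echo "running kernel: $running"
echo "on-disk nvidia module version: $modver"
```

Comparing this against the version apt installed in user space answers, in seconds, whether the problem is the kernel side or the library side.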