MKL module not found while trying to run Atomate2 lithium insertion workflow on VASP

Summary

The reported issue, a “module not found” error for MKL even though the module is loaded, arises when running an Atomate2 workflow on an HPC cluster and is a classic environment and dependency misconfiguration. The failure occurs not because the code is wrong, but because the runtime environment (Python, shared libraries, and environment variables) does not carry through from the user script to the workflow manager (Atomate2) and on to the underlying binary (VASP).

The root cause is typically a disconnect between the environment in which the Python interpreter runs and the environment in which the VASP binary is finally launched, often exacerbated by hardcoded execution commands in workflow frameworks. Even if module load mkl was executed in an interactive shell, the process that eventually runs VASP (a batch job, a remote worker, or a fresh non-interactive shell) may never see the resulting variables.

Root Cause

The specific failure stems from two primary mechanisms working in concert:

  • Environment Inheritance Failure: When run_locally is called (or when Atomate2/Jobflow delegates a job), it spawns a subprocess to execute vasp_std. If the system’s dynamic linker cannot find the MKL libraries (e.g., libmkl_rt.so), execution fails immediately. A child process inherits its parent’s environment by default, so this usually means that the LD_LIBRARY_PATH set by module load mkl never reached the Python process in the first place: the module was loaded in a different shell or session, the job runs in a batch context with a fresh environment, or an intermediate launcher replaced the environment.
  • The run_vasp_cmd Override: In the provided code, the line BaseVaspMaker.run_vasp_cmd = ["vasp_std"] hardcodes the command, so vasp_std is resolved directly from PATH rather than from any site-specific command configured for Atomate2. If vasp_std is a wrapper script that relies on shell initialization files (like .bashrc) to set up MKL, that setup is skipped, because the non-interactive shell behind a Python subprocess does not read those files.

Primary causes include:

  • Missing LD_LIBRARY_PATH: The MKL library directory is not in the dynamic library search path.
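  • Shell Mismatch: The VASP executable is a wrapper script that expects a bash environment, but the command is run through sh (as subprocess does with shell=True), where bash-specific setup such as the module function may be undefined.
  • Container/Module Isolation: The compute node environment differs from the login node (common in Slurm/PBS systems). Batch jobs start in a fresh shell, and whether the submission environment is exported to the job depends on the scheduler and site configuration, so modules loaded in the interactive shell can vanish in the batch context.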
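
A quick way to confirm the first cause is to ask the loader which libraries it cannot resolve. The following is a minimal sketch, assuming a Linux node where ldd is available and vasp_std is a dynamically linked binary on PATH; the function names and the sample ldd output are made up for illustration.

```python
from __future__ import annotations

import shutil
import subprocess


def missing_shared_libs(ldd_output: str) -> list[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libmkl_rt.so.2 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(name: str = "vasp_std") -> list[str]:
    """Run ldd on an executable found on PATH and list unresolved libraries."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} is not on PATH")
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_shared_libs(result.stdout)


# Illustrative ldd output (invented for this example, not from a real system)
sample = (
    "\tlinux-vdso.so.1 (0x00007ffc1a5f2000)\n"
    "\tlibmkl_rt.so.2 => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f3a1c000000)\n"
)
print(missing_shared_libs(sample))  # ['libmkl_rt.so.2']
```

Calling check_binary() from inside the same job that runs the workflow shows what the loader sees in that context. If vasp_std is a shell wrapper rather than a binary, ldd reports that it is not a dynamic executable, which is itself a useful hint.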

Why This Happens in Real Systems

In High-Performance Computing (HPC), software is rarely monolithic. VASP relies on heavy linear algebra libraries like Intel MKL, which are often installed in isolated paths. Environment module systems (Lmod, Environment Modules) manage these paths by dynamically modifying LD_LIBRARY_PATH and PATH, usually by prepending to them.

Workflow engines like Atomate2 abstract the execution of commands, typically by spawning processes through Python’s subprocess module. A child process started this way receives a copy of the parent’s environment, but it is not an interactive login shell: no shell startup files are read and no module load is run on its behalf. If the Python process itself was started somewhere the modules were never loaded (a batch job, a remote worker, a cron-like launcher), the module load commands that the user typed in their terminal are effectively lost. The dynamic linker runs at the start of the VASP process, looks for MKL, fails to find it, and aborts with an “error while loading shared libraries” message before VASP writes any output file.
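
It can help to check what a child process actually receives. The sketch below uses a made-up variable, DEMO_MKL_HINT, as a stand-in for LD_LIBRARY_PATH: the first call relies on the default behavior of subprocess (the child gets a copy of the parent’s environment), and the second simulates a context where the variable was never set.

```python
import os
import subprocess
import sys

# DEMO_MKL_HINT is a hypothetical variable standing in for LD_LIBRARY_PATH
os.environ["DEMO_MKL_HINT"] = "/opt/example/mkl/lib"

# Child program: print the variable as the child process sees it
child = [sys.executable, "-c", "import os; print(os.environ.get('DEMO_MKL_HINT'))"]

# Default: the child receives a copy of the parent's environment
inherited = subprocess.run(child, capture_output=True, text=True).stdout.strip()

# Simulate a context where the variable was never set (e.g. a fresh batch shell)
scrubbed_env = {k: v for k, v in os.environ.items() if k != "DEMO_MKL_HINT"}
scrubbed = subprocess.run(
    child, capture_output=True, text=True, env=scrubbed_env
).stdout.strip()

print(inherited)  # /opt/example/mkl/lib
print(scrubbed)   # None
```

When the second case matches what is observed in a failing job, the fix belongs in whatever starts the Python process, not in the subprocess call.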

Real-World Impact

  • Blocked Computation: The workflow fails silently or with cryptic dynamic linker errors. No vasprun.xml or OUTCAR is generated, making debugging difficult.
  • Wasted Resources: In HPC environments, time allocated on compute nodes is finite. A misconfigured environment wastes allocation hours while performing no actual physics calculations.
  • Software Fragility: Relying on global environment variables (LD_LIBRARY_PATH) creates “works on my machine” scenarios. A script that runs on a login node often fails in a batch job context.

Example or Code

To debug and work around this, a common approach is a “Wrapper Script”. It sets up the environment explicitly before VASP runs, instead of relying on whatever the calling process happened to inherit.

1. Create a wrapper script (run_vasp_wrapper.sh):
This script forces the loading of the MKL module explicitly within its execution context.

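#!/bin/bash -l
# -l starts a login shell so the site's profile scripts can define the
# `module` command (an assumption; some sites need an explicit init script)

# Load the toolchain module that provides MKL (example name; check `module avail`)
module load intel/2022.2 || exit 1

# Log the library search path to stderr so VASP's stdout stays clean
echo "LD_LIBRARY_PATH is set to: $LD_LIBRARY_PATH" >&2

# Replace this shell with the command passed as arguments (e.g. vasp_std)
exec "$@"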

2. Modify the Python Code:
Instead of forcing ["vasp_std"], point the workflow to the wrapper.

# ... (imports and setup) ...

# CRITICAL CHANGE: Use the wrapper script to ensure environment inheritance
# Ensure the script is executable: chmod +x run_vasp_wrapper.sh
BaseVaspMaker.run_vasp_cmd = ["/absolute/path/to/run_vasp_wrapper.sh", "vasp_std"]

# ... (rest of the code) ...
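
Before launching the workflow, it may be worth checking the wrapper itself, since a relative path or a missing executable bit produces errors that look similar to the original problem. This is a minimal sketch; wrapper_problems is a hypothetical helper, not part of Atomate2, and the demo file is created only for illustration.

```python
from __future__ import annotations

import os
import tempfile


def wrapper_problems(path: str) -> list[str]:
    """Return a list of human-readable problems found with a wrapper script."""
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    if not os.path.isfile(path):
        problems.append("file does not exist")
        return problems
    if not os.access(path, os.X_OK):
        problems.append("file is not executable (chmod +x)")
    with open(path, "rb") as fh:
        # Without a shebang the kernel cannot tell which interpreter to use
        if not fh.readline().startswith(b"#!"):
            problems.append("missing shebang line")
    return problems


# Demo: a wrapper with a shebang but without the executable bit
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "run_vasp_wrapper.sh")
    with open(demo, "w") as fh:
        fh.write('#!/bin/bash\nexec "$@"\n')
    os.chmod(demo, 0o600)  # deliberately not executable
    demo_report = wrapper_problems(demo)

print(demo_report)
```

An empty list means the path is absolute, the file exists, is executable, and starts with a shebang; it says nothing about whether the module load inside the wrapper succeeds.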

How Senior Engineers Fix It

Senior engineers do not rely on implicit environment inheritance. They ensure deterministic execution environments.

  1. Explicit Wrappers: As shown in the code block above, create a shell wrapper that loads the necessary environment modules (module load mkl) before executing the binary. This isolates the environment configuration from the Python code.

  2. Containerization (Singularity/Apptainer/Docker): The most robust fix is to package VASP and MKL into a container image. The workflow then executes the container, which guarantees that the libraries and paths are identical every time, regardless of the host OS or HPC login node configuration.

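  3. Environment Forwarding: A Python subprocess inherits its parent’s environment by default, so forwarding only helps when the parent already has the right variables (that is, the modules were loaded in the shell that started Python). Passing the environment explicitly documents this dependency and gives a single place to adjust variables before the call:

    import os
    import subprocess

    # Copy the current environment; it contains the module settings only if
    # the modules were loaded in the shell that launched this script
    my_env = os.environ.copy()
    # check=True raises CalledProcessError if vasp_std exits with a non-zero status
    subprocess.run(["vasp_std"], env=my_env, check=True)

  4. HPC Workflow Managers: Use a manager such as FireWorks, which is designed to generate batch scripts (Slurm/PBS) that can carry the module load commands at the top of the script file.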
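
The environment-forwarding idea in item 3 becomes more useful when the copied environment is adjusted before it is passed on. A minimal sketch, assuming the MKL library directory is known; env_with_lib_dir is a hypothetical helper and the paths shown are placeholders.

```python
from __future__ import annotations

import os


def env_with_lib_dir(lib_dir: str, base: dict[str, str] | None = None) -> dict[str, str]:
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base is None else base)
    current = env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries, which the loader would treat as the current directory
    parts = [p for p in current.split(os.pathsep) if p]
    if lib_dir not in parts:
        parts.insert(0, lib_dir)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env


# Placeholder paths, for illustration only
example = env_with_lib_dir("/opt/example/mkl/lib", {"LD_LIBRARY_PATH": "/usr/local/lib"})
print(example["LD_LIBRARY_PATH"])
```

The result can be passed as env= to subprocess.run. The original environment is left untouched, so the change affects only the child process that receives the copy.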

Why Juniors Miss It

Juniors often miss this issue because they conflate the Login Node environment with the Compute Node environment.

  • “It works in the terminal”: They run python script.py in a terminal session where the modules are loaded, and subprocesses spawned from that session do inherit those variables. They then assume the same holds when the identical script runs inside a batch job or on a remote worker, where the modules were never loaded.
  • Abstracted Workflow: Frameworks like Atomate2 hide the execution mechanism. Juniors focus on the inputs (crystal structures, INCAR settings) and forget that the execution (running the binary) is a separate step that requires OS-level configuration.
  • Lack of Dynamic Linking Knowledge: They may not realize that vasp_std can be a wrapper script rather than the binary itself, or that a real binary needs libmkl_rt.so to be discoverable by the OS loader at the moment it starts.