Foundry Python Library Repos for Dependency Packaging in Spark

Summary

A data engineer attempted to modularize their code by moving shared transformation logic from a Codified Transform Pipeline into a standalone Function Repository. Despite successfully publishing the repository and seeing it appear in the UI, the engineer was unable to import the functions into their main transformation pipeline. The failure occurred because the engineer treated a Function Repository as a standard Python library, failing to account for how Palantir Foundry manages dependency resolution and execution environments.

Root Cause

The failure stems from a misunderstanding of how Foundry's repository types differ in purpose, packaging, and lifecycle:

Incompatible Repository Types: A Function Repository is designed to host User-Defined Functions (UDFs) meant for use in Workshop, Slate, or Contour. These run in a specialized execution environment optimized for low-latency, single-row or small-batch operations.
Dependency Isolation: Codified Transform Pipelines (Spark-based) require Python dependencies to be installed in the environment of both the Spark driver and every executor. Simply adding a repository name to requirements.txt or marking it as a library in the UI does not package the source code into a distributable artifact, such as a Python Wheel (.whl), that Spark can ship across a cluster.
Namespace Mismatch: Functions in a Function Repository are wrapped in a specific metadata layer to make them available to the Foundry UI, which prevents them from being imported as standard Python modules in a heavy-duty Spark environment.

Why This Happens in Real Systems

In large-scale data platforms, there is a strict distinction between Compute Paradigms:

Stream/Interactive Compute (UDFs): Optimized for user interaction, highly granular, and stateless.
Batch Compute (Spark/Transformations): Optimized for high-throughput, distributed data processing, and heavy-duty shuffling.

Systems often provide a “seamless” UI experience that makes it look like all code is globally available, but under the hood, the Runtime Environment for a Spark job is a heavy container that cannot dynamically “reach out” and pull code from a specialized UDF service at runtime without a formal build/package step.
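
The gap between "visible here" and "available on a worker" can be reproduced without Spark. The sketch below is only an analogy, not Foundry's actual mechanism: a module is registered in the memory of one Python process (standing in for the UI listing), and a fresh interpreter (standing in for a remote worker) then tries to import it. The module name shared_logic_demo is hypothetical.

```python
import importlib
import subprocess
import sys
import types

# Hypothetical module name used only for this illustration.
MODULE_NAME = "shared_logic_demo"

# Register the module in this process's memory only. Nothing is installed
# on disk, so no other interpreter can find it.
demo = types.ModuleType(MODULE_NAME)
demo.double = lambda x: x * 2
sys.modules[MODULE_NAME] = demo

# The local ("driver") process resolves the import from sys.modules.
local_module = importlib.import_module(MODULE_NAME)
local_result = local_module.double(21)

# A fresh interpreter stands in for a remote worker: it only sees packages
# that are actually installed in its own environment.
worker = subprocess.run(
    [sys.executable, "-c", f"import {MODULE_NAME}"],
    capture_output=True,
    text=True,
)
worker_can_import = worker.returncode == 0

print("local result:", local_result)
print("worker can import:", worker_can_import)
```

The local import succeeds while the fresh interpreter fails with ModuleNotFoundError, which is the same class of failure a Spark executor reports when a dependency was never packaged and installed into its environment.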

Real-World Impact

Code Duplication: Engineers resort to “copy-paste” programming, leading to logic drift where two pipelines meant to perform the same task begin to diverge.
Maintenance Debt: Fixing a bug in core transformation logic requires manual updates across dozens of repositories.
Deployment Friction: Teams waste significant engineering hours attempting to “force” incompatible architectures to communicate, delaying production releases.

Example or Code

To fix this, the shared logic must be moved to a Python Library Repository (not a Function Repository) and properly packaged.

# Correct approach: inside a dedicated Python Library Repository
# This code is built into a .whl file and distributed to Spark

from pyspark.sql import Column
from pyspark.sql import functions as F


def calculate_standardized_metric(value: Column, multiplier: float) -> Column:
    """
    Scale a numeric column by `multiplier`, mapping nulls to 0.0.

    Operates on Spark Columns so it can be used directly in withColumn.
    """
    return F.coalesce(value * F.lit(multiplier), F.lit(0.0))

# In the Codified Transform Pipeline:
# 1. Add the library to the project dependencies
# 2. Import it as a standard module

from pyspark.sql import functions as F
from my_shared_utils import calculate_standardized_metric

def my_transform(input_df):
    return input_df.withColumn(
        "standardized_val",
        # Pass a Column, not the bare string "raw_val"
        calculate_standardized_metric(F.col("raw_val"), 1.5),
    )

How Senior Engineers Fix It

A senior engineer resolves this by implementing a Tiered Dependency Strategy:

Decouple Logic from Runtime: Move all reusable logic into a Python Library Repository (or a dedicated Git-based submodule).
Formal Packaging: Build the library into a Python Wheel (.whl) so that each release is an immutable, versioned artifact.
Dependency Injection via Build System: Instead of using requirements.txt (which is for PyPI/external packages), use the platform’s internal Build/Publish mechanism to link the library to the transformation project.
Environment Alignment: Ensure the shared library depends only on packages that exist in the Spark environment. If it imports packages available only in UDF environments (like specific Foundry UI helpers), the import will fail on the Spark driver or executors.
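
One way to enforce the Environment Alignment step is a small check in the library's own test suite that scans its source for imports outside an agreed allowlist. The following is a minimal sketch using only the standard library; the allowlist contents and the package name ui_widgets are hypothetical placeholders, not real Foundry packages.

```python
import ast
from typing import Set

# Hypothetical allowlist: top-level packages assumed to be installed in the
# Spark environment. Adjust to whatever your environment really provides.
SPARK_SAFE_PACKAGES = {"pyspark", "math", "datetime", "typing"}


def find_unsupported_imports(source: str, allowed: Set[str]) -> Set[str]:
    """Return top-level packages imported by `source` that are not in `allowed`."""
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> top-level package "a"
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" -> "a"; relative imports (level > 0) are skipped
            imported.add(node.module.split(".")[0])
    return imported - allowed


# Example library source; "ui_widgets" is a made-up UI-only package.
library_source = """
import math
from pyspark.sql import functions as F
import ui_widgets.forms
"""

unsupported = find_unsupported_imports(library_source, SPARK_SAFE_PACKAGES)
print(unsupported)
```

Here the check flags ui_widgets, so the problem surfaces when the library is built rather than when a Spark job fails. The scan is static, so it does not catch imports performed dynamically at runtime.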

Why Juniors Miss It

The “Magic” Fallacy: Juniors often assume that if the platform shows a repository in a list, it is “available” for any use case. They mistake UI visibility for Runtime availability.
Confusing UDFs with Libraries: They treat a UDF as a general-purpose function, not realizing that a UDF is a highly specialized, “wrapped” object designed for specific UI-driven compute.
Ignoring the Execution Context: Juniors focus on the syntax of the import (import x) rather than the infrastructure required to make x exist on a remote Spark worker node.
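
A quick way to test runtime availability rather than UI visibility is to ask the interpreter whether it can actually resolve a module. The sketch below uses importlib.util.find_spec from the standard library; the name my_shared_utils_not_installed is a hypothetical stand-in for a library that was never packaged into the environment.

```python
import importlib.util


def missing_modules(required):
    """Return the top-level module names in `required` that this interpreter cannot resolve."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is a hypothetical library that
# is assumed NOT to be installed in this environment.
REQUIRED = ["json", "my_shared_utils_not_installed"]

missing = missing_modules(REQUIRED)
print("missing:", missing)
```

This only describes the interpreter it runs in. To say anything about a Spark executor, the same check would have to execute inside code that runs on the executors, not just on the driver or on a developer's machine.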