Parallelizing REST-API requests in Databricks

Summary

A developer encountered performance bottlenecks when sequentially fetching data from a REST API for a large list of IDs in Databricks. The initial solution involved a synchronous loop, which was optimized using Python’s ThreadPoolExecutor to parallelize I/O-bound network requests, significantly reducing latency. The inquiry now focuses on determining the most efficient pattern within the Spark ecosystem: specifically, comparing ThreadPoolExecutor, RDD map operations, Pandas UDFs (vectorized vs. iterator), and full-dataframe ingestion. Key Takeaway: For I/O-bound external API calls within Spark, ThreadPoolExecutor often provides the highest throughput with the least overhead, whereas Pandas UDFs are optimized for CPU-bound data transformation, not network latency.

Root Cause

The root cause of the confusion and potential sub-optimal performance lies in the mismatch between Spark’s execution model (optimized for data parallelism) and the latency profile of external REST APIs (network I/O).

  • Sequential Execution: The initial loop approach suffers from blocking I/O, where the driver waits for each request to complete before initiating the next.
  • Execution Overhead: While ThreadPoolExecutor fixes the blocking issue, wrapping the API calls in Spark UDFs (standard Python UDFs, Pandas UDFs, or RDD map) adds serialization and JVM-to-Python round-trip overhead per invocation.
  • Pandas Misapplication: Pandas UDFs utilize Apache Arrow for high-speed data transfer between the JVM and Python. However, for a single ID per row, the cost of Arrow serialization often exceeds the gain unless the API response processing is computationally heavy.

Why This Happens in Real Systems

In distributed data engineering, teams often default to “Spark-native” solutions for problems that are fundamentally I/O-bound.

  • Abstraction Leaks: Developers assume that using a Spark UDF will automatically distribute network calls efficiently. In reality, a standard Python UDF (or RDD map) spins up a Python worker, executes a single request, and returns, creating significant overhead per request.
  • The “Wait” Problem: Spark executors are designed to execute tasks quickly. Long-running network calls (high latency) tie up executor slots, potentially causing skew or starvation if not managed via thread pools.
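When the ID list is too large to collect to the driver, the thread-pool trick can instead run inside each partition via mapPartitions, so executor slots overlap their network waits rather than blocking one request at a time. A minimal sketch, where `fetch_fn` is a hypothetical per-ID request function (not from the original):

```python
import concurrent.futures

def fetch_partition(ids, fetch_fn, max_workers=16):
    # Runs inside a single Spark partition: threads overlap the I/O waits
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, list(ids)))

# Spark usage (sketch):
# df.rdd.mapPartitions(lambda it: iter(fetch_partition((r.id for r in it), fetch_single)))
```

This keeps the data distributed while still amortizing latency, at the cost of running one thread pool per partition.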

Real-World Impact

Using the wrong approach leads to:

  • Increased Latency: Jobs taking minutes instead of seconds due to serialization overhead.
  • Resource Inefficiency: Holding expensive Spark executor resources (cores/memory) idle while waiting for network responses.
  • Connection Limits: Hitting API rate limits or connection throttling if parallelism isn’t controlled.
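Controlling parallelism means capping both worker count and request pacing. A sketch under the assumption that a simple global minimum interval between request starts is acceptable; `fetch_fn` is any per-ID callable (hypothetical, not from the original):

```python
import concurrent.futures
import threading
import time

class RateLimitedFetcher:
    """Caps concurrency (max_workers) and paces request starts (min_interval)."""

    def __init__(self, fetch_fn, max_workers=8, min_interval=0.05):
        self.fetch_fn = fetch_fn
        self.max_workers = max_workers
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._last_start = 0.0

    def _throttled(self, id_val):
        # Holding the lock while sleeping serializes request starts on purpose
        with self._lock:
            wait = self._last_start + self.min_interval - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            self._last_start = time.monotonic()
        return self.fetch_fn(id_val)

    def fetch_all(self, ids):
        with concurrent.futures.ThreadPoolExecutor(self.max_workers) as pool:
            return list(pool.map(self._throttled, ids))
```

A token-bucket limiter would allow bursts; this pacing approach is the simplest thing that keeps a client under a fixed requests-per-second ceiling.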

Example or Code

Below are the three primary approaches: Pandas UDF (Iterator), Pandas UDF (Series), and the recommended ThreadPoolExecutor implementation.

1. Pandas UDF (Iterator) – Recommended for Large Payloads

The iterator variant receives the input column as a stream of Arrow batches, so expensive setup (an HTTP session, auth headers) happens once per Python worker rather than once per batch, amortizing serialization overhead.

from typing import Iterator

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd
import requests

# Iterator-of-Series Pandas UDF: Spark infers SCALAR_ITER from the type hints
@pandas_udf(StringType())
def fetch_api_data_iterator(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # One session per Python worker, reused across every batch it receives
    session = requests.Session()
    for ids in batches:
        yield ids.map(
            lambda id_val: session.get(
                f"https://api.example.com/items/{id_val}", timeout=10
            ).text
        )

# Usage
# df.withColumn("data", fetch_api_data_iterator(col("id")))

2. Pandas UDF (Series) – Simple but less flexible

A standard Series-to-Series Pandas UDF is invoked once per Arrow batch; requests within each batch still execute sequentially.

@pandas_udf(StringType())
def fetch_api_data_series(ids: pd.Series) -> pd.Series:
    # map applies the function to every element; each call blocks on the network
    return ids.map(lambda x: requests.get(
        f"https://api.example.com/items/{x}", timeout=10).text)

3. ThreadPoolExecutor (The Winner for I/O)

This runs in a pool of threads on the driver, avoiding Spark serialization and scheduling overhead entirely for the network-bound portion.

import concurrent.futures
import requests

def fetch_single(id_val):
    try:
        return requests.get(
            f"https://api.example.com/items/{id_val}", timeout=10
        ).text
    except requests.RequestException:
        return None

# 'id_list' is a list or Pandas Series of IDs
def parallel_fetch(id_list, max_workers=32):
    # Threads suit this workload: each worker spends its time blocked on network I/O
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_single, id_list))

# Usage (sketch):
# ids = [row.id for row in df.select("id").collect()]
# results = parallel_fetch(ids)
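Why threads win for I/O can be demonstrated without any network at all, using a sleep-based stub in place of requests (a self-contained illustration, not from the original):

```python
import concurrent.futures
import time

def slow_fetch(id_val, delay=0.05):
    # Stand-in for a network call: blocks for `delay`, then returns a fake payload
    time.sleep(delay)
    return f"payload-{id_val}"

ids = list(range(20))

start = time.monotonic()
sequential = [slow_fetch(i) for i in ids]
seq_time = time.monotonic() - start  # ~20 * 0.05s: the waits add up

start = time.monotonic()
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    threaded = list(pool.map(slow_fetch, ids))
par_time = time.monotonic() - start  # roughly one delay: all 20 sleeps overlap

assert sequential == threaded  # identical results, far less wall-clock time
```

The speedup scales with latency, not CPU: the GIL is released while a thread waits on the socket (or, here, on sleep), which is exactly why threads suffice for this workload.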

How Senior Engineers Fix It

Senior engineers prioritize latency over framework purity.

  1. Avoid Spark UDFs for Pure Network I/O: They recognize that Spark UDFs add overhead for moving data between the JVM and Python. If the logic is “get URL -> return response,” that overhead hurts performance.
  2. Use ThreadPoolExecutor on the Driver: For datasets that fit in memory (e.g., <100k rows), the fastest method is to collect the IDs to the driver (or keep them there if generated there) and use a ThreadPoolExecutor to parallelize requests.
  3. Batching: If the API supports it, they look for batch endpoints to send multiple IDs in one request, reducing the total number of round trips.
  4. Async/Await (Alternative): For extremely high throughput requirements, they might switch to aiohttp with asyncio on the driver, though ThreadPoolExecutor is usually sufficient and simpler.
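For the asyncio route, the fan-out pattern looks like the following sketch; `fetch_coro` stands in for any coroutine (with aiohttp it would wrap `session.get(...)`) and the semaphore bounds in-flight requests — an illustration of the pattern, not a drop-in implementation:

```python
import asyncio

async def fetch_all_async(ids, fetch_coro, max_concurrency=32):
    # Semaphore bounds the number of in-flight requests
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(id_val):
        async with sem:
            return await fetch_coro(id_val)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(i) for i in ids))
```

Compared with a thread pool, this scales to thousands of concurrent requests without thread overhead, but drags in an event loop and async-capable HTTP client, which is why the article treats it as the alternative rather than the default.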

Why Juniors Miss It

Juniors often over-rely on Spark’s distributed capabilities.

  • “Spark is for everything”: They assume distributed execution is always faster. They don’t realize that for I/O, a single powerful machine with many threads is often faster than a distributed cluster, once task scheduling and serialization overhead are counted.
  • Misunderstanding Pandas UDFs: They see “Vectorized” and think it applies to network latency. They don’t realize that while the data transfer is vectorized (Arrow), the execution is still row-by-row (or batch-by-batch) and the Python process still has to wait for the network.