How to read/write in MinIO using Spark?

Summary

This incident stemmed from a misconfigured Spark–MinIO integration where Spark could not resolve the S3A endpoint hostname, resulting in the error “hostname cannot be null”. Although the MinIO service was reachable, Spark’s internal Hadoop S3A client never received a valid endpoint due to missing or incorrect configuration propagation inside the containerized environment.

Root Cause

The failure was triggered by a combination of configuration gaps:

  • Spark never received a resolvable hostname for the S3A endpoint (obj_storage:9000).
  • Docker Compose service name resolution was not available inside Spark’s JVM layer because the Spark container was started with sleep infinity and not through the Compose-managed entrypoint.
  • Missing or version-mismatched Hadoop AWS jars, which can silently break S3A initialization.
  • Bucket name not included in the endpoint URL, which S3A sometimes requires for path-style access.
  • MinIO environment variables not passed to Spark executors, causing inconsistent configuration across the cluster.

The result: Hadoop’s S3A client attempted to parse the endpoint, received null, and threw the hostname error.
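Because the failure surfaces deep inside Hadoop, a cheap pre-flight check in the driver can surface it much earlier. This is a sketch; validate_s3a_endpoint is a hypothetical helper, not part of Spark or Hadoop:

```python
from urllib.parse import urlparse

def validate_s3a_endpoint(endpoint: str) -> str:
    """Fail fast if the configured endpoint has no parseable hostname,
    instead of letting Hadoop throw 'hostname cannot be null' later."""
    parsed = urlparse(endpoint)
    if not parsed.hostname:
        raise ValueError(
            f"S3A endpoint has no hostname (missing http:// scheme?): {endpoint!r}"
        )
    return endpoint

validate_s3a_endpoint("http://obj_storage:9000")   # passes
# validate_s3a_endpoint("obj_storage:9000")        # raises ValueError
```

Note that a scheme-less value like obj_storage:9000 parses with a null hostname, which is exactly the shape of endpoint that triggers the error described above.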

Why This Happens in Real Systems

This class of failure is extremely common in distributed storage setups:

  • Spark uses the Hadoop S3A client, not the Python layer, so misconfigurations propagate silently.
  • Docker networking behaves differently depending on how containers are started, especially when bypassing Compose’s default entrypoints.
  • MinIO requires strict S3A configuration, and small deviations (SSL flags, path-style access, endpoint formatting) break resolution.
  • Version mismatches between Hadoop, Spark, and AWS SDK jars frequently cause unexpected S3A failures.
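The Compose-level pieces involved here (service DNS, credential propagation, the entrypoint) might look like this minimal fragment; image tags, service names, and credentials are illustrative, not taken from the original setup:

```yaml
services:
  obj_storage:             # resolvable as http://obj_storage:9000 on this network
    image: minio/minio
    command: server /data
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin

  spark:
    image: bitnami/spark   # illustrative image
    depends_on:
      - obj_storage
    environment:           # pass the same credentials the executors will need
      AWS_ACCESS_KEY_ID: minioadmin
      AWS_SECRET_ACCESS_KEY: minioadmin
    # Keep the Compose-managed entrypoint; overriding it with
    # `sleep infinity` is what broke service-name resolution here.
```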

Real-World Impact

When this occurs in production:

  • Jobs fail to read/write data, halting pipelines.
  • Executors crash repeatedly, causing cluster instability.
  • Retries amplify load, sometimes overwhelming MinIO or Spark.
  • Debugging becomes slow, because S3A errors are notoriously cryptic.

Example

A corrected Spark configuration typically looks like this:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-minio")
    .config("spark.hadoop.fs.s3a.endpoint", "http://obj_storage:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.driver.extraClassPath", "/opt/spark/jars/*")
    .config("spark.executor.extraClassPath", "/opt/spark/jars/*")
    .getOrCreate()
)

And the read path must include a bucket that actually exists:

df = spark.read.csv("s3a://client-files/movie_review.csv", header=True, inferSchema=True)
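Since the question covers writing as well, here is a hedged write sketch. The output prefix movie_review_out is hypothetical; the commented line reuses the df from the read above and needs the live SparkSession:

```python
# With fs.s3a.path.style.access=true, the bucket is the first path
# segment of the s3a:// URI, not part of the endpoint hostname.
bucket = "client-files"  # must already exist in MinIO
output_path = f"s3a://{bucket}/movie_review_out/"

# Reusing `df` from the read above (requires the live SparkSession):
# df.write.mode("overwrite").parquet(output_path)
```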

How Senior Engineers Fix It

Experienced engineers approach this systematically:

  • Validate Docker DNS resolution with ping obj_storage (or getent hosts obj_storage if ping is absent) inside the Spark container.
  • Ensure Hadoop AWS + AWS SDK jars match the Spark Hadoop version.
  • Verify MinIO endpoint formatting, especially path-style access.
  • Check that Spark executors inherit the same S3A configs as the driver.
  • Inspect the Hadoop debug logs (fs.s3a.*) to confirm endpoint parsing.
  • Use AWS CLI or mc inside the container to confirm connectivity before involving Spark.
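The jar-matching step above can be scripted. This is a sketch with illustrative, hard-coded filenames; on a real image you would populate the variables from a listing of /opt/spark/jars/ instead:

```shell
# hadoop-aws must match the hadoop-client-* jars Spark ships with;
# the aws-java-sdk-bundle version is pinned transitively by hadoop-aws.
hadoop_client="hadoop-client-api-3.3.4.jar"   # illustrative
hadoop_aws="hadoop-aws-3.3.4.jar"             # illustrative

# Strip everything up to the last dash, then the .jar suffix.
client_ver="${hadoop_client##*-}"; client_ver="${client_ver%.jar}"
aws_ver="${hadoop_aws##*-}";       aws_ver="${aws_ver%.jar}"

if [ "$client_ver" = "$aws_ver" ]; then
  echo "versions match: $client_ver"
else
  echo "MISMATCH: hadoop-client $client_ver vs hadoop-aws $aws_ver" >&2
fi
```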

Why Juniors Miss It

This issue is subtle because:

  • The error message (“hostname cannot be null”) is misleading, giving no hint about S3A configuration.
  • Juniors assume Python-level configs apply directly, not realizing Spark delegates to Hadoop.
  • They trust Docker Compose networking implicitly, unaware that custom entrypoints bypass Compose DNS initialization.
  • They underestimate version compatibility issues, especially with Hadoop’s S3A client.
  • They test MinIO externally (Postman) and assume Spark sees the same network environment.

The combination of misleading logs, hidden configuration layers, and container networking nuances makes this a classic trap for less experienced engineers.
