Summary
During a routine infrastructure migration, a production Nominatim instance crashed while attempting to merge two OpenStreetMap (OSM) extracts, Zimbabwe and South Africa, into a single Docker container. The process failed during the indexing phase with a “repeated indexing” error. The crash left the database in a corrupted state and required a full rollback of the ingestion pipeline to prevent data loss.
Root Cause
The failure stems from a violation of the unique constraint assumptions required by the Nominatim indexing engine. When merging two disparate geographic datasets, several factors contribute to the “repeated indexing” error:
- Overlapping Geometries: Even if the countries are distinct, the boundary datasets or shared administrative tags may cause the engine to perceive the same feature twice.
- Shared ID Collisions: If the datasets were processed or filtered through different pipelines before merging, there is a risk of ID collisions in the underlying OSM metadata.
- Non-Atomic Ingestion: Attempting to merge large datasets within a single Docker container without proper staging or deduplication leads to race conditions or memory exhaustion during the heavy CPU/IO operations of the indexing phase.
- Index Corruption: The indexing error is often a symptom of the engine attempting to write an entry for a spatial index key that has already been locked or written by a previous process in the same transaction.
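The ID-collision hypothesis above can be checked before blaming the indexer. The following is a minimal sketch, assuming both extracts have first been converted to OPL text (for example with `osmium cat zimbabwe.osm.pbf -f opl -o zimbabwe.opl`; osmium-tool is not part of the original incident and is an assumption here). In OPL, each line starts with a type letter and the object ID, such as `n123` or `w456`. The function name `shared_osm_ids` is hypothetical.

```shell
# Sketch: list OSM object IDs that occur in both extracts.
# Assumes both inputs are OPL text files whose first space-separated field
# is the typed ID (n123, w456, r789). The function name is hypothetical.
shared_osm_ids() {
  tmp_a=$(mktemp) || return 1
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
  # Keep only the typed ID column and sort it for comm.
  cut -d' ' -f1 "$1" | LC_ALL=C sort -u > "$tmp_a"
  cut -d' ' -f1 "$2" | LC_ALL=C sort -u > "$tmp_b"
  # comm -12 prints only the lines present in both sorted inputs.
  LC_ALL=C comm -12 "$tmp_a" "$tmp_b"
  rm -f "$tmp_a" "$tmp_b"
}
```

Some shared IDs are expected for neighboring countries, because cross-border objects appear in both extracts. An unexpectedly long list, or the same ID with different versions, is the signal worth investigating.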
Why This Happens in Real Systems
In large-scale production environments, this is a classic Data Integrity vs. Infrastructure Constraint conflict:
- Monolithic Containerization: Developers often attempt to shove multiple high-density datasets into a single container to simplify orchestration, ignoring the resource contention and fault domain risks.
- Implicit Assumptions: Engineering teams often assume that “Country A” and “Country B” are mutually exclusive, forgetting that OSM data often contains cross-border features (roads, rivers, or administrative boundaries) that overlap.
- Resource Starvation: Nominatim’s indexing is extremely resource-intensive. When scaling the dataset size by 2x, the complexity of the spatial index doesn’t just double; it grows non-linearly, often hitting Docker memory limits (OOM) or disk I/O bottlenecks that manifest as “indexing errors.”
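Because memory exhaustion can surface as a generic indexing error, it helps to rule out an OOM kill first. The sketch below assumes cgroup v2, where `memory.events` contains a line of the form `oom_kill <count>`. The function name and the default path are illustrative assumptions, not part of Nominatim or Docker.

```shell
# Sketch: report how many processes the kernel OOM-killed in a cgroup.
# Assumes cgroup v2 (memory.events with an "oom_kill <count>" line).
# The function name and default path are illustrative assumptions.
oom_kill_count() {
  events_file="${1:-/sys/fs/cgroup/memory.events}"
  [ -r "$events_file" ] || { echo "cannot read $events_file" >&2; return 1; }
  # Print the counter, or 0 when the key is absent from the file.
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) print 0 }' "$events_file"
}
```

Run inside the import container, a non-zero result suggests the failure is a resource problem rather than a data problem. From the host, `docker inspect --format '{{.State.OOMKilled}}' nominatim_service` gives a similar yes/no answer for the container's main process.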
Real-World Impact
- Service Downtime: The geocoding API becomes unavailable for both regions during the failed import.
- Data Corruption: A failed index build can leave the PostgreSQL database in an inconsistent state, making simple “restarts” impossible.
- Operational Overhead: Senior engineers must perform manual database cleanups and re-provision volumes, which drives up MTTR (Mean Time To Recovery).
Example or Code
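To avoid these errors, clean the data before the import and manage the container's resources explicitly. Merge the extracts in a pre-processing step so that an object present in both inputs is written only once.
# Pre-import merge: combine both extracts into a single PBF. When the same
# object ID appears in both inputs, --merge keeps one copy (chosen by version
# by default, per the Osmosis documentation).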
osmosis --read-pbf file1.pbf \
--read-pbf file2.pbf \
--merge \
--write-pbf merged_output.pbf
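# Docker run command with explicit resource limits and volume persistence.
# Assumes the mediagis/nominatim image conventions: the import file is passed
# through PBF_PATH, and PostgreSQL data lives in /var/lib/postgresql/14/main.
docker run -d \
--name nominatim_service \
--memory="32g" \
--cpus="8" \
-e PBF_PATH=/nominatim/data/merged_output.pbf \
-v "$PWD/merged_output.pbf":/nominatim/data/merged_output.pbf:ro \
-v /data/nominatim:/var/lib/postgresql/14/main \
mediagis/nominatim:4.4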
How Senior Engineers Fix It
Senior engineers approach this by decoupling the Ingestion Pipeline from the Serving Layer:
- The Staging Pattern: Never merge directly in the production container. Use a separate ephemeral worker container to perform the osmosis merge and deduplication.
- Atomic Swaps: Once the merged dataset is successfully indexed in a staging environment, create a new Docker volume and use a blue-green deployment to swap the production container to the new data.
- Schema Validation: Run a validation pass on the merged .pbf files to identify overlapping administrative boundaries or conflicting tags before the heavy indexing starts.
- Resource Isolation: Ensure the Docker daemon has explicit cgroup limits for memory and I/O to prevent a failing index process from crashing the entire host machine.
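The pointer flip at the heart of a blue-green swap can be sketched with a symlink. This is a minimal illustration, not the deployment used in the incident: it assumes GNU coreutils (`mv -T`), and every path and the function name `promote_dataset` are hypothetical. Docker resolves bind-mount paths when a container starts, so the serving container still has to be started (or restarted) against the new target after the flip.

```shell
# Sketch: atomically repoint a "current" symlink at a freshly built dataset.
# Assumes GNU coreutils (mv -T). Paths and the function name are hypothetical.
promote_dataset() {
  new_dir="$1"
  link="$2"
  # Refuse to promote a dataset directory that does not exist.
  [ -d "$new_dir" ] || { echo "missing dataset: $new_dir" >&2; return 1; }
  # Build the new link under a temporary name first.
  ln -s "$new_dir" "$link.tmp.$$" || return 1
  # Renaming over the old link replaces it in one step, so a reader sees
  # either the old target or the new one, never a missing link.
  mv -T "$link.tmp.$$" "$link"
}
```

A typical call would be `promote_dataset /data/nominatim-green /data/nominatim-current`, run only after the staging import has passed its checks. Rolling back is the same call with the previous directory.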
Why Juniors Miss It
- The “One Container” Fallacy: Juniors often view a Docker container as a “magic box” that can handle any amount of data, failing to account for how quickly RAM and I/O requirements grow with data size.
- Ignoring Data Interdependency: They assume geographic datasets are perfectly isolated, overlooking the topological overlaps inherent in global map data.
- Lack of Idempotency: They attempt to “fix” a failed import by running the command again, rather than realizing the database state is already polluted and requires a fresh start.
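The idempotency point can be made concrete with a small guard around the import command. The sketch below is an assumption-laden illustration: the marker file names and the function `guarded_import` are hypothetical and are not Nominatim's own mechanism. The idea is that a failed run leaves a marker behind, and a second run refuses to proceed until the data directory has been wiped.

```shell
# Sketch: refuse to re-run an import on a half-imported data directory.
# Marker file names and the function name are hypothetical.
guarded_import() {
  data_dir="$1"
  shift
  if [ -e "$data_dir/.import_in_progress" ]; then
    echo "previous import did not finish; wipe $data_dir and start fresh" >&2
    return 2
  fi
  if [ -e "$data_dir/.import_done" ]; then
    echo "import already completed; nothing to do" >&2
    return 0
  fi
  : > "$data_dir/.import_in_progress" || return 1
  # Run the real import command; on failure the marker stays behind.
  "$@" || return 1
  mv "$data_dir/.import_in_progress" "$data_dir/.import_done"
}
```

Usage would look like `guarded_import /data/nominatim ./run_import.sh`, where `run_import.sh` stands in for whatever command performs the import. Exit code 2 tells the operator that a clean volume is needed, rather than a retry.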