How to tell Bson to parse integer values into BsonInt64

Summary

Key Issue: The JSON parser behind BsonDocument.parse in the MongoDB Java driver maps a plain numeric literal to BsonInt32 whenever the value fits in 32 bits, and it offers no option to request BsonInt64 instead. Literals outside the 32-bit range are parsed as BsonInt64, so the risk is not truncation but a type mismatch: a document parsed from a string does not equal one built with BsonInt64, and the same field can end up with different BSON types depending on its value.

Root Cause

The root cause lies in the JSON reader that BsonDocument.parse(String json) uses internally. When the parser encounters a plain numeric literal in a JSON string (e.g., 5, -100, 1234567), it infers the BSON type from the format and magnitude of the number:

  1. Small Integers: Integer literals within the signed 32-bit range (-2^31 to 2^31-1) are deserialized into BsonInt32.
  2. Large Integers: Integer literals outside that range are deserialized into BsonInt64.
  3. No Override: The parse method has no overload or configuration parameter (for example, a reader setting that prefers Int64) to change this heuristic for plain integers.

Every 32-bit value would also fit in an Int64, but the parser picks the narrowest integer type that holds the literal, presumably for compactness and compatibility. A developer who needs 64-bit typing throughout must state the type explicitly in the JSON or convert the values after parsing.
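
The inference rule above can be observed directly. The following is a minimal sketch, assuming the org.bson library is on the classpath; the class name InferenceDemo and the helper typeOf are made up for illustration and are not part of the driver.

```java
import org.bson.BsonDocument;

// Sketch: observe which BSON type the parser picks for different literals.
final class InferenceDemo {
    // Hypothetical helper: parses { "v" : <literal> } and returns the
    // simple class name of the resulting value.
    static String typeOf(String literal) {
        BsonDocument doc = BsonDocument.parse("{ \"v\" : " + literal + " }");
        return doc.get("v").getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(typeOf("5"));          // fits in 32 bits
        System.out.println(typeOf("2147483647")); // Integer.MAX_VALUE
        System.out.println(typeOf("2147483648")); // one past Integer.MAX_VALUE
        System.out.println(typeOf("5.0"));        // has a fractional part
    }
}
```

On the driver versions this sketch was written against, the four lines should read BsonInt32, BsonInt32, BsonInt64, and BsonDouble: the type changes at the 32-bit boundary, while the numeric value is preserved.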

Why This Happens in Real Systems

This behavior persists in modern systems due to historical constraints and optimization strategies:

  • BSON Type Sizes: BSON defines separate int32 and int64 types, and an int32 occupies 4 bytes instead of 8. Defaulting to the smaller type keeps documents compact for the most common use cases.
  • JSON Compatibility: Standard JSON does not distinguish between 32-bit and 64-bit integers. Parsers must guess the intended type, and Int32 is the safest default for smaller numbers to prevent unnecessary memory overhead.
  • Backward Compatibility: Changing the default parsing behavior of BsonDocument.parse would break existing applications that rely on specific BsonInt32 instances for equality checks or downstream serialization logic.

Real-World Impact

  • Inconsistent Field Types: A numeric ID up to Integer.MAX_VALUE (2,147,483,647) parses as BsonInt32, while a larger one parses as BsonInt64. The value itself is not truncated, but the same field ends up with different BSON types from one document to the next.
  • Typed Accessor Failures: Code that reads the field with getInt64("id") or asInt64() throws BsonInvalidOperationException when the parsed value is actually a BsonInt32.
  • Equality Mismatches: BsonInt32 and BsonInt64 are never equal under equals(), even when they hold the same number. A document parsed from "{ 'id' : 5 }" therefore does not equal a document built with new BsonInt64(5L), which typically surfaces as a failing assertion in tests.
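
The accessor and comparison problems can be reproduced in a few lines. This is a sketch only, assuming org.bson is on the classpath; ImpactDemo and its method names are hypothetical.

```java
import org.bson.BsonDocument;
import org.bson.BsonInt64;
import org.bson.BsonInvalidOperationException;

// Sketch: consequences of "id" being inferred as BsonInt32.
final class ImpactDemo {
    // Compares a parsed document with one built from an explicit BsonInt64.
    static boolean parsedEqualsExplicit() {
        BsonDocument parsed = BsonDocument.parse("{ \"id\" : 5 }");
        BsonDocument explicit = new BsonDocument("id", new BsonInt64(5L));
        return parsed.equals(explicit);
    }

    // Reports whether the strict 64-bit accessor rejects the parsed value.
    static boolean strictAccessorThrows() {
        BsonDocument parsed = BsonDocument.parse("{ \"id\" : 5 }");
        try {
            parsed.getInt64("id");
            return false;
        } catch (BsonInvalidOperationException e) {
            return true;
        }
    }

    // Type-agnostic read: BsonInt32 and BsonInt64 are both BsonNumber.
    static long readAsLong() {
        BsonDocument parsed = BsonDocument.parse("{ \"id\" : 5 }");
        return parsed.get("id").asNumber().longValue();
    }

    public static void main(String[] args) {
        System.out.println("equal to explicit Int64 doc: " + parsedEqualsExplicit());
        System.out.println("getInt64 throws: " + strictAccessorThrows());
        System.out.println("value via asNumber(): " + readAsLong());
    }
}
```

When only the numeric value matters, reading through asNumber().longValue() sidesteps the type question. When documents are compared as a whole, the types have to match.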

Example or Code

The following code demonstrates the issue where an integer value is parsed as BsonInt32 instead of BsonInt64.

import org.bson.BsonDocument;
import org.bson.BsonValue;

public class BsonParsingExample {
    public static void main(String[] args) {
        // A JSON string containing an integer
        String documentString = "{ 'id' : 5 }";

        // Parsing the string directly
        BsonDocument expected = BsonDocument.parse(documentString);
        BsonValue idValue = expected.get("id");

        // This will print org.bson.BsonInt32
        System.out.println(idValue.getClass().getName());
    }
}

To ensure the value is treated as a 64-bit integer, either construct it as a BsonInt64 explicitly or mark it as 64-bit in the JSON input (for example with the Extended JSON $numberLong wrapper).

import org.bson.BsonDocument;
import org.bson.BsonInt64;

public class BsonCorrectExample {
    public static void main(String[] args) {
        // Explicitly creating a BsonInt64
        BsonDocument doc = new BsonDocument("id", new BsonInt64(5L));

        // This will print org.bson.BsonInt64
        System.out.println(doc.get("id").getClass().getName());
    }
}
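
When the document has to come from a string, the type can be stated in the JSON itself. The sketch below uses the Extended JSON $numberLong wrapper, which the driver's JSON reader understands; the class name ExtendedJsonExample and its helper are illustrative.

```java
import org.bson.BsonDocument;
import org.bson.BsonInt64;

// Sketch: request a 64-bit integer from the parser via Extended JSON.
final class ExtendedJsonExample {
    // Parses a document whose "id" is wrapped in $numberLong.
    static BsonDocument parseWithNumberLong() {
        return BsonDocument.parse("{ \"id\" : { \"$numberLong\" : \"5\" } }");
    }

    public static void main(String[] args) {
        BsonDocument parsed = parseWithNumberLong();
        BsonDocument explicit = new BsonDocument("id", new BsonInt64(5L));

        System.out.println(parsed.get("id").getClass().getName());
        System.out.println("equal to explicit document: " + parsed.equals(explicit));
    }
}
```

With the wrapper in place, the parsed document and the programmatically built one compare as equal, which is usually what a test using an "expected" document needs.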

How Senior Engineers Fix It

Senior engineers address this issue by bypassing the default string parsing heuristic and enforcing strict types:

  1. Avoid Plain Literals Where the Type Matters: Do not rely on BsonDocument.parse(String) to choose the integer type for IDs or counters. The inferred type depends on the magnitude of the value, not on the schema.
  2. Use BsonInt64 Explicitly: When constructing documents programmatically, wrap numbers in new BsonInt64(value) so the type never depends on the value.
  3. Annotate the JSON: If parsing is strictly required, write the value with an Extended JSON type wrapper such as { "$numberLong" : "5" }, or parse first and then convert BsonInt32 values to BsonInt64 manually.
  4. Centralize the Conversion: Where many call sites are affected, put the widening in one place, for example a custom Codec<BsonDocument> or a small parsing helper that prefers Int64 over Int32 after decoding.
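
One way to centralize the conversion, without writing a full codec, is to widen the values after parsing. This is a minimal sketch under stated assumptions: Int64Widening, widen, and parseAsInt64 are hypothetical names, not driver APIs, and the helper deliberately converts every BsonInt32 it finds, including those nested in sub-documents and arrays.

```java
import java.util.Map;

import org.bson.BsonArray;
import org.bson.BsonDocument;
import org.bson.BsonInt64;
import org.bson.BsonValue;

// Sketch: parse JSON, then widen every BsonInt32 to BsonInt64.
final class Int64Widening {
    // Returns a copy of the value in which each BsonInt32 is replaced
    // by a BsonInt64 holding the same number. Other types are kept as-is.
    static BsonValue widen(BsonValue value) {
        if (value.isInt32()) {
            return new BsonInt64(value.asInt32().getValue());
        }
        if (value.isDocument()) {
            BsonDocument out = new BsonDocument();
            for (Map.Entry<String, BsonValue> e : value.asDocument().entrySet()) {
                out.put(e.getKey(), widen(e.getValue()));
            }
            return out;
        }
        if (value.isArray()) {
            BsonArray out = new BsonArray();
            for (BsonValue item : value.asArray()) {
                out.add(widen(item));
            }
            return out;
        }
        return value;
    }

    // Hypothetical drop-in replacement for BsonDocument.parse.
    static BsonDocument parseAsInt64(String json) {
        return widen(BsonDocument.parse(json)).asDocument();
    }

    public static void main(String[] args) {
        BsonDocument doc = parseAsInt64("{ \"id\" : 5, \"tags\" : [1, 2] }");
        System.out.println(doc.get("id").getClass().getName());
    }
}
```

The trade-off is that this approach cannot distinguish fields that were meant to stay 32-bit. If only some fields need widening, restricting the conversion to known keys is probably the safer variant.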

Why Juniors Miss It

  • Assumption of JSON Standard: Junior developers often assume BSON parsing behaves exactly like standard JSON parsers (e.g., Jackson or Gson), where numbers are often mapped to Long or BigInteger automatically if needed. They are unaware of BSON’s strict type distinction between Int32 and Int64.
  • Lack of Awareness of “Silent” Defaults: They may not realize that BsonDocument.parse makes a decision on their behalf. They see the number in the string and assume the resulting BsonValue will hold the exact numeric value without checking the specific class type.
  • Checking the Value, Not the Type: Juniors often verify that the parsed number is correct but never inspect its class (for example via getClass() or getBsonType()). The difference between BsonInt32 and BsonInt64 then goes unnoticed until an edge case, such as an ID above 2 billion or a strict equality check, causes a failure.