How to model and track monthly snapshot data in PostgreSQL when no natural key exists?

Summary

The problem at hand involves loading a monthly snapshot dataset into PostgreSQL where the source does not provide a stable natural key for one of the entities. The goal is to track the current state and historical presence of rows without corrupting history or inventing false keys. The main challenges include:

  • No explicit “exit” event
  • Rows may disappear between months
  • Some identifiers may be NULL
  • Names can change over time without indicating a new entity

Root Cause

The source dataset simply never carried a stable natural key, so there is nothing to anchor an entity's identity across snapshots:

  • No unique identifier: two similar rows cannot be reliably told apart, either within one snapshot or across snapshots.
  • Changing names: a renamed entity looks like a new one, so name columns cannot serve as a substitute key.
  • NULL identifiers: NULL never equals NULL in SQL, so joins and DISTINCT on these columns silently drop or duplicate rows.

Why This Happens in Real Systems

This situation arises in real systems for several reasons:

  • Data quality issues: Poor data quality can lead to missing or incorrect identifiers.
  • System limitations: Some systems may not be designed to handle changing names or NULL identifiers.
  • Data integration challenges: Integrating data from different sources can lead to inconsistencies and missing identifiers.

Real-World Impact

The consequences can be significant:

  • Inaccurate tracking: Inability to track entities accurately can lead to incorrect insights and decisions.
  • Data corruption: Corrupting history or inventing false keys can lead to long-term data quality issues.
  • System inefficiencies: Workarounds and manual interventions can lead to system inefficiencies and increased costs.

Example or Code

CREATE TABLE cnpj_basico (
    cnpj_basico         VARCHAR(50),   -- company identifier, shared by all of its partners
    identificador_socio VARCHAR(50),   -- partner type (e.g. PF = person, PJ = company)
    nome_socio          VARCHAR(255),  -- partner name; may change between snapshots
    cpf_cnpj_socio      VARCHAR(50),   -- partner document; may be NULL
    snapshot_date       DATE           -- which monthly load the row came from
    -- deliberately no PRIMARY KEY: the source provides no stable natural key
);

INSERT INTO cnpj_basico (cnpj_basico, identificador_socio, nome_socio, cpf_cnpj_socio, snapshot_date)
VALUES ('1234567890', 'PF', 'John Doe', '12345678901', '2022-01-01'),
       ('1234567890', 'PF', 'John Doe', '12345678901', '2022-02-01'),
       ('1234567890', 'PJ', 'Jane Doe', NULL, '2022-01-01'),
       ('1234567890', 'PJ', 'Jane Doe', NULL, '2022-02-01');
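Because the source emits no explicit "exit" event, a departure can only be inferred by comparing consecutive snapshots. A minimal sketch against the example table above, assuming that (cnpj_basico, identificador_socio, cpf_cnpj_socio) is a good-enough approximation of identity (nome_socio is excluded because names change without indicating a new entity):

```sql
-- Rows present in the January snapshot but absent in February:
-- their disappearance is the implicit "exit" event.
SELECT prev.cnpj_basico, prev.identificador_socio, prev.nome_socio
FROM cnpj_basico AS prev
WHERE prev.snapshot_date = DATE '2022-01-01'
  AND NOT EXISTS (
      SELECT 1
      FROM cnpj_basico AS curr
      WHERE curr.snapshot_date = DATE '2022-02-01'
        AND curr.cnpj_basico = prev.cnpj_basico
        AND curr.identificador_socio = prev.identificador_socio
        -- IS NOT DISTINCT FROM treats two NULLs as a match,
        -- which plain equality (=) never does
        AND curr.cpf_cnpj_socio IS NOT DISTINCT FROM prev.cpf_cnpj_socio
  );
```

The IS NOT DISTINCT FROM comparison matters here: with a plain `=`, rows whose cpf_cnpj_socio is NULL would never match their own previous occurrence and would always be reported as exits.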

How Senior Engineers Fix It

Experienced engineers address this with a combination of techniques:

  • Append-only snapshot model: never update or delete; each monthly load inserts its rows tagged with that load's snapshot_date, so history is preserved by construction.
  • Hash-based identity: derive a deterministic surrogate key by hashing the combination of columns that best approximates identity, excluding volatile columns such as names.
  • Window functions: use ROW_NUMBER(), LAG(), or MIN()/MAX() over partitions of that surrogate key to derive first/last appearance and gaps in presence.
  • Data validation: check each load (row counts, NULL rates, duplicate rates) against the previous snapshot before committing it.
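The first three techniques can be sketched together: hash the identity-approximating columns into a deterministic surrogate key, then partition window functions by that key. The '<null>' sentinel and the choice of hashed columns are assumptions for illustration, not facts from the source:

```sql
WITH keyed AS (
    SELECT md5(COALESCE(cnpj_basico, '<null>')
               || '|' || COALESCE(identificador_socio, '<null>')
               || '|' || COALESCE(cpf_cnpj_socio, '<null>')) AS entity_key,
           -- nome_socio is deliberately left out of the hash:
           -- names change without indicating a new entity.
           -- COALESCE keeps a NULL column from turning the whole
           -- concatenation (and therefore the key) into NULL.
           nome_socio,
           snapshot_date
    FROM cnpj_basico
)
SELECT entity_key,
       nome_socio,
       snapshot_date,
       ROW_NUMBER() OVER w       AS appearance_no,
       -- NULL on first sighting; a gap of more than one month
       -- means the entity disappeared and later returned.
       LAG(snapshot_date) OVER w AS prev_snapshot
FROM keyed
WINDOW w AS (PARTITION BY entity_key ORDER BY snapshot_date);
```

Because the hash is deterministic, reloading the same snapshot reproduces the same entity_key, so the key stays stable across loads without any mutable state in the pipeline.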

Why Juniors Miss It

Junior engineers may miss this issue due to:

  • Lack of experience: Limited experience with data modeling and data engineering can lead to oversights.
  • Insufficient training: Inadequate training on data quality and data integration can contribute to mistakes.
  • Overemphasis on simplicity: reaching for the simplest possible schema can paper over identity problems until history is already corrupted.
  • Inadequate testing: Insufficient testing and validation can fail to catch data quality issues.
