Managing Firebase Analytics user_id retention to meet GDPR

Summary

During a routine audit of our data pipelines, we discovered a critical discrepancy between our Data Privacy Compliance documentation and the actual persistence of Personally Identifiable Information (PII) within our analytics stack. The engineering team had assumed that applying a global Data Retention Policy to Firebase Analytics would automatically scrub all associated identifiers, including custom user_id parameters. That assumption conflated two things that Google Analytics for Firebase treats separately: how long Event Data is retained and how long an identity persists.

Root Cause

The root cause is that we treated Event-level retention and User-level identity persistence as the same thing. They are not.

  • Granularity Mismatch: The 14-month retention setting governs the event-level and user-level data that Google Analytics holds and exposes through its own console. It does not extend to copies of that data held anywhere else.
  • Identity Decoupling: When a user_id is manually set, it acts as a primary key that is attached to every event for that user. Even after those events age out of the console, the same user_id is still present in each row of the BigQuery export, which follows its own expiration rules, and its lifetime inside the internal Google backend is not something the setting lets us observe.
  • Server-side vs. UI-side: The retention period shown in the Firebase/Google Analytics console refers to the accessibility of data for exploration and reporting, not necessarily the immediate hard-deletion of the identifier from Google’s distributed storage layers or exported datasets.
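
To make the gap concrete: the BigQuery export writes one table per day (events_YYYYMMDD), and to our understanding those tables are not covered by the console retention setting. A minimal sketch of an audit helper follows, assuming the list of table names has already been fetched from the dataset by whatever means you prefer. The function name shardsOlderThan and the 14-month default are illustrative, not part of any Firebase or BigQuery API.

```javascript
// Hypothetical audit helper: given BigQuery export table names, return the
// daily shards whose date is older than the assumed retention window.
function shardsOlderThan(tableNames, months = 14, now = new Date()) {
  // Cutoff = "now" minus the retention window, computed in UTC.
  const cutoff = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - months, now.getUTCDate())
  );
  return tableNames.filter((name) => {
    const match = /^events_(\d{4})(\d{2})(\d{2})$/.exec(name);
    if (!match) return false; // skip intraday tables and anything unrelated
    const [, year, month, day] = match.map(Number);
    const shardDate = new Date(Date.UTC(year, month - 1, day));
    return shardDate < cutoff;
  });
}

const stale = shardsOlderThan(
  ["events_20230101", "events_20240601", "events_intraday_20240601"],
  14,
  new Date(Date.UTC(2024, 5, 15)) // 15 June 2024
);
console.log(stale); // [ 'events_20230101' ]
```

Any table this returns still contains user_id values that the team believed had expired.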

Why This Happens in Real Systems

In complex distributed systems, “deletion” is rarely a single atomic action.

  • Distributed Consistency: Data is replicated across multiple zones and regions. A “deletion” command often initiates a tombstone process rather than an immediate wipe.
  • Layered Architectures: Systems often have a Reporting Layer (where retention is strictly enforced for UX/compliance) and a Storage/Raw Layer (where data persists for longer periods to support machine learning or backfilling).
  • Abstraction Leaks: Third-party SaaS providers (like Firebase) abstract the underlying database. Users see the “Policy” (the abstraction) but cannot see the “Physical Deletion” (the implementation).
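
The tombstone idea can be illustrated with a toy in-memory store. This is a deliberately simplified model and not a description of how Firebase or any Google backend is implemented: delete only marks a record, reads stop returning it immediately, and the data is physically removed only when a separate compact pass runs. The class and method names are made up for the illustration.

```javascript
// Toy model of tombstone-based deletion (illustrative only).
class TombstoneStore {
  constructor() {
    this.records = new Map(); // key -> { value, deleted }
  }
  put(key, value) {
    this.records.set(key, { value, deleted: false });
  }
  // Logical delete: the record is hidden from readers but still stored.
  delete(key) {
    const rec = this.records.get(key);
    if (rec) rec.deleted = true;
  }
  // What the "reporting layer" sees.
  get(key) {
    const rec = this.records.get(key);
    return rec && !rec.deleted ? rec.value : undefined;
  }
  // What is still physically held, tombstoned or not.
  physicalKeys() {
    return [...this.records.keys()];
  }
  // Physical delete happens later, in a separate pass.
  compact() {
    for (const [key, rec] of this.records) {
      if (rec.deleted) this.records.delete(key);
    }
  }
}

const store = new TombstoneStore();
store.put("user_12345", { plan: "pro" });
store.delete("user_12345");
console.log(store.get("user_12345")); // undefined: gone from the reader's view
console.log(store.physicalKeys());    // [ 'user_12345' ]: still stored
store.compact();
console.log(store.physicalKeys());    // []: only now physically removed
```

The window between delete and compact is exactly where a retention policy and a physical deletion can disagree.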

Real-World Impact

  • Compliance Violations: Technical data life cycles that are not aligned with GDPR “Right to Erasure” and CCPA deletion requests.
  • Legal Liability: If a user requests data deletion and the user_id remains recoverable via raw logs or BigQuery exports, the organization has not fulfilled the request and may be in breach of its erasure obligations under GDPR Article 17.
  • Audit Failure: Discrepancies between the Data Protection Impact Assessment (DPIA) and actual system behavior during external audits.

Example

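// Incorrect assumption: the 14-month retention setting makes this identifier vanish everywhere
firebase.analytics().setUserId("user_12345_private_id");

// Safer approach: send an opaque surrogate key, computed on the backend (Node.js).
// A bare SHA-256 of a guessable ID can be reversed by enumeration, so use a keyed
// hash (HMAC). USER_ID_HMAC_SECRET is an illustrative name for a server-side secret.
const crypto = require('crypto');
const surrogateKey = crypto
  .createHmac('sha256', process.env.USER_ID_HMAC_SECRET)
  .update(realUserId)
  .digest('hex');
// On the client, pass only the surrogate. Erasure still needs an explicit deletion workflow.
firebase.analytics().setUserId(surrogateKey);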

How Senior Engineers Fix It

Senior engineers move away from relying on “out-of-the-box” settings and implement Defense in Depth:

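  • Pseudonymization: Never pass raw PII (emails, names) as user_id. Use a keyed or salted hash, or a random UUID that has no meaning outside the system. Pseudonymized identifiers still count as personal data under GDPR, so this limits exposure but does not replace deletion.
  • Dual-Track Deletion: Implement a “Delete User” workflow that both requests deletion on the analytics side (for example through the Google Analytics User Deletion API) and runs a custom Cloud Function to scrub the user from the BigQuery export and internal databases.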
  • Data Lifecycle Management (DLM): Set explicit TTL (Time To Live) policies on the storage buckets and BigQuery tables where analytics data is exported, ensuring the raw data is purged regardless of the SaaS provider’s UI settings.
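
The deletion workflow above can be sketched as a plan builder. This is a minimal sketch under stated assumptions: the export consists of daily events_YYYYMMDD tables with a user_id column, the caller supplies the list of table names, and one DELETE is issued per table because BigQuery does not accept a wildcard table as a DML target. The function name buildErasurePlan, the step shape, and the dataset name are hypothetical. Executing the steps (through the BigQuery client, an analytics-side deletion request, and your own database layer) is deliberately left out.

```javascript
// Hypothetical planner for a "Delete User" workflow. It only builds the steps;
// running them against real services is left to the caller.
function buildErasurePlan(surrogateUserId, dataset, shardNames) {
  if (!/^[A-Za-z0-9_.-]+$/.test(dataset)) {
    throw new Error("Unexpected characters in dataset name");
  }
  const steps = [
    // Analytics side: ask the provider to delete data tied to this user ID.
    { target: "analytics", action: "requestUserDeletion", userId: surrogateUserId },
  ];
  // Export side: one parameterized DELETE per exported daily table.
  for (const shard of shardNames) {
    if (!/^events_\d{8}$/.test(shard)) continue; // ignore anything that is not a daily shard
    steps.push({
      target: "bigquery",
      action: "query",
      sql: `DELETE FROM \`${dataset}.${shard}\` WHERE user_id = @userId`,
      params: { userId: surrogateUserId },
    });
  }
  // Internal databases: the application's own records.
  steps.push({ target: "internalDb", action: "deleteUser", userId: surrogateUserId });
  return steps;
}

const plan = buildErasurePlan("a1b2c3", "my_project.analytics_123456", [
  "events_20240101",
  "events_20240102",
]);
console.log(plan.map((step) => `${step.target}:${step.action}`));
```

Keeping the plan as plain data makes it easy to log each step for audit purposes and to retry a single failed step without repeating the others.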

Why Juniors Miss It

  • Surface-Level Reading: Juniors often read the “Features” or “Settings” documentation (the what) but skip the “Data Processing” or “Security” documentation (the how).
  • The “Black Box” Fallacy: They assume that if a provider says “Data is deleted after X months,” the problem is entirely solved and handed off to the vendor.
  • Confusion of Scope: They fail to distinguish between Reporting Retention (how long I can see a graph) and Storage Retention (how long the bits exist on a disk).
