Should resume skills be stored as JSON, BLOB, or normalized tables for analytics and search performance?

Summary

Technical Postmortem: Resume Skills Storage Architecture for Analytics and Search

The central issue is a database schema anti-pattern where skills are stored as a BLOB (binary large object) in a MySQL user_data table. This approach fundamentally breaks analytics, search performance, and scalability. The root cause is treating semi-structured data (skills) as an opaque binary payload rather than a normalized, queryable entity. The immediate impact is non-indexable data, preventing efficient filtering, aggregation, or full-text search on skills. The solution requires normalization into a relational schema or, for flexible schema requirements, JSON columns (MySQL 5.7+) with proper indexing.

Root Cause

  • Misuse of BLOB for Semi-Structured Data: BLOBs are designed for unstructured binary data (images, files), not for text-based lists like skills. Storing skills as a BLOB makes the data invisible to SQL query optimizers and index structures.
  • Violation of First Normal Form (1NF): The Actual_skills field violates atomicity. A single column should not contain a multi-valued attribute. This prevents relational integrity and efficient querying.
  • Lack of Indexing Strategy: BLOB data cannot be indexed effectively for partial matching or searching specific skills within the list. MySQL cannot scan a BLOB for a substring efficiently at scale.
  • Schema Rigidity: Storing data as a serialized binary string prevents any form of structured manipulation (e.g., adding skill levels, dates acquired) without parsing the entire BLOB for every row.

Why This Happens in Real Systems

  • Premature Optimization/Complexity Avoidance: Engineers often choose BLOB or a single text column to avoid the perceived complexity of creating a junction table (many-to-many relationship).
  • Rapid Prototyping Bias: During MVP development, developers prioritize writing application code over database normalization, assuming “we’ll fix it later.”
  • Misunderstanding of Data Types: Confusing “unstructured” (BLOB/TEXT) with “semi-structured” (JSON) or “structured” (Relational).
  • Legacy System Migration: Migrating from document stores (like MongoDB) directly to a single column in SQL without adapting the schema to relational strengths.

Real-World Impact

  • Analytics Paralysis: Impossible to run queries like “How many users know ‘Python’?” or “Find the average number of skills per user” without performing a full table scan and client-side parsing.
  • Poor Search Performance: Searching for a specific skill requires LIKE '%python%' on a BLOB column, which results in O(n) complexity and disables index usage, causing CPU spikes and slow response times.
  • Data Integrity Issues: No foreign key constraints can be applied to skills. Duplicate skills with typos (e.g., “Java” vs “Java Script”) proliferate, dirtying the dataset.
  • Vendor Lock-in: BLOB data is opaque to the database engine, making it difficult to leverage MySQL’s native full-text search or JSON functions.

Example or Code

-- Anti-Pattern (Current State)
CREATE TABLE user_data (
    ID INT AUTO_INCREMENT,
    Name VARCHAR(500),
    Actual_skills BLOB,
    Recommended_skills BLOB
);

-- Proposed Solution: Normalized Relational Schema
CREATE TABLE skills (
    skill_id INT AUTO_INCREMENT PRIMARY KEY,
    skill_name VARCHAR(100) UNIQUE
);

CREATE TABLE user_skills (
    user_id INT,
    skill_id INT,
    proficiency ENUM('Beginner', 'Intermediate', 'Expert'),
    PRIMARY KEY (user_id, skill_id),
    FOREIGN KEY (user_id) REFERENCES user_data(ID),
    FOREIGN KEY (skill_id) REFERENCES skills(skill_id)
);

-- Alternative Solution: JSON with Indexing (MySQL 5.7+)
ALTER TABLE user_data ADD COLUMN skills_json JSON;
CREATE INDEX idx_skill_name ON user_data((CAST(skills_json->>'$[*]' AS CHAR(50))));

How Senior Engineers Fix It

  • Implement Normalization: Senior engineers design a Many-to-Many relationship with a bridge table (user_skills) and a reference table (skills). This ensures data integrity and allows O(1) lookup times via indexed joins.
  • Utilize Inverted Indexes for Search: For high-performance search, they introduce a search engine (like Elasticsearch) or use MySQL Full-Text Search indexes on normalized data.
  • Schema Evolution Planning: They implement database migrations to transform existing BLOB data into the new normalized structure, ensuring zero downtime.
  • Leverage Native JSON Types: If the skills list requires dynamic attributes (e.g., weight, source) that don’t fit strict relational tables, they use MySQL’s JSON data type combined with Generated Columns to index specific attributes for analytics.

Why Juniors Miss It

  • Focus on Application Logic: Juniors often view the database as a “bucket” to dump data, focusing heavily on application-side parsing (e.g., Python/Java lists) rather than leveraging database capabilities.
  • Lack of Understanding of ACID: They underestimate how database constraints (Foreign Keys, Unique constraints) protect data integrity, leading to “garbage in, garbage out” scenarios.
  • Underestimating Scale: They optimize for the “write” path (easy insertion of a blob) without considering the “read” path complexity required for analytics.
  • SQL Limitations: They are unaware of the performance penalties of LIKE queries on large binary objects or the limitations of indexing BLOBs.