How to Stop Unicode Symbol Queries from Crashing MySQL Search

Summary

A production incident occurred where the search functionality was rendered unusable by mathematical alphanumeric symbols (Unicode characters that look like stylized letters). When a user entered strings such as 𝓡𝓲𝓡𝓡𝔂𝓬𝓻𝓸𝓯𝓽 (Mathematical Script) into the search bar, the resulting SQL query generated by the Ruby on Rails application caused a syntax error in MySQL. This prevented legitimate users from accessing the school search service, effectively causing a Denial of Service (DoS) for that specific feature.

Root Cause

The failure was caused by a mismatch between how the application layer handled Unicode character encoding and how the MySQL FULLTEXT index parsed those characters in BOOLEAN MODE.

  • Unicode Complexity: The input contained characters from the Mathematical Alphanumeric Symbols block (e.g., U+1D4B9). These are not “letters” in the linguistic sense but are symbols that mimic letters.
  • Parser Breakdown: When these specific multi-byte characters were passed into a MATCH...AGAINST clause in BOOLEAN MODE, the MySQL parser encountered a byte sequence it could not interpret within the expected grammar, leading to an unexpected $end syntax error.
  • Lack of Sanitization: The application blindly passed user-supplied UTF-8 strings into a complex, weighted SQL calculation without verifying if the characters were meaningful linguistic glyphs.

Why This Happens in Real Systems

In modern web applications, we often assume UTF-8 is a “safe” catch-all. We treat all Unicode as valid text, which is true for storage but false for complex query parsing.

  • Implicit Trust: Developers often trust that the database driver and the database engine will handle UTF-8 gracefully.
  • Grammar Conflicts: Database engines use specific internal grammars to parse BOOLEAN MODE operators (like +, -, *). When a multi-byte character is malformed or contains bytes that look like control characters to the parser, the state machine of the SQL parser breaks.
  • Normalization Issues: Different Unicode characters can look identical to humans (homoglyphs) but represent entirely different code points, making simple regex filters ineffective.

Real-World Impact

  • Service Unavailability: Legitimate search queries failed if they contained even a single “illegal” symbol.
  • Database Error Spikes: The application logs were flooded with Mysql::Error exceptions, potentially triggering false positive alerts in monitoring systems.
  • Resource Exhaustion: While this specific case was a syntax error, similar “bad” Unicode inputs can sometimes cause high CPU usage as the database attempts to perform complex collation/normalization on invalid sequences.

Example or Code

To prevent this, we must sanitize the input by filtering for Unicode Letter properties rather than specific ASCII ranges. This allows characters like Ζ° (Vietnamese) while stripping 𝓡 (Mathematical Script).

# Use the 'unicode_utils' or 'unicode' gem to filter by property
# This preserves linguistic characters (L category) while removing symbols (S category)

def sanitize_search_query(input)
  return "" if input.blank?

  # We want to keep characters that are categorized as 'Letters' in Unicode.
  # This includes Latin, Cyrillic, Vietnamese, Kanji, etc.
  # We exclude 'Symbols', 'Math', and 'Other' categories.

  input.chars.select { |char| char.match?(/\p{L}/) }.join
end

# Usage in Rails Controller/Service
sanitized_term = sanitize_search_query(params[:query])

if sanitized_term.present?
  @schools = School.search_by_term(sanitized_term)
else
  @schools = []
end

How Senior Engineers Fix It

A senior engineer looks for a structural defense-in-depth strategy rather than just a quick regex fix.

  • Input Validation at the Boundary: Implement a “Permissive but Safe” filter at the controller or service layer using Unicode Property Escapes (\p{L}).
  • Normalization: Use Unicode Normalization Form C (NFC) to ensure that characters are represented in their most standard, decomposed form before hitting the database.
  • Database Layer Protection: Ensure the database connection and the table collation are strictly set to utf8mb4 to prevent truncation or “junk” character issues.
  • Automated Regression Testing: Add test cases specifically for edge-case Unicode (emojis, mathematical symbols, and non-Latin alphabets) to ensure the parser doesn’t break during future migrations.

Why Juniors Miss It

  • The “ASCII-only” Trap: Juniors often suggest gsub(/[^a-zA-Z]/, ''), which breaks internationalization (i18n) by destroying names like trường.
  • Over-reliance on Frameworks: They assume that because Rails handles UTF-8 in models, the SQL parser will also handle it safely.
  • Ignoring the Parser: They view the error as a “data error” rather than a syntax error. They fail to realize that the input didn’t just return “no results,” it actually crashed the query execution.

Leave a Comment