How to Identify and Clean Orphaned OFF Files in MySQL

Summary

During a routine disk space audit, we discovered a massive accumulation of mysterious files in the /var/lib/mysql/ directory following the pattern OFF.xxxxxx and an accompanying OFF.index file. These files contained fragmented SQL INSERT statements mixed with binary data. After investigation, we identified these as orphaned binary log fragments resulting from a misconfigured or interrupted custom replication or ETL (Extract, Transform, Load) process that bypasses the standard MySQL binary logging mechanism.

Root Cause

The investigation revealed that these files were not part of the core MySQL engine, but rather artifacts of a third-party data synchronization tool or a custom script designed to perform incremental backups or data streaming.

  • Non-Standard Logging: A background process was capturing row-level changes and writing them to local flat files instead of the standard mysql-bin.xxxxxx format.
  • Lack of Rotation Logic: The process was configured to create new files based on a specific size or time threshold but lacked a retention policy to delete old files.
  • Incomplete Transactions: The presence of “binary characters” alongside SQL commands indicates that the script was dumping raw packet data or encoded binary blobs directly into text files, likely due to an unhandled encoding error during the capture phase.
  • Missing in Backups: These files were not present in older backups because the problematic synchronization script was only deployed or activated nine days prior to the discovery.

Why This Happens in Real Systems

In complex production environments, “shadow processes” often emerge to solve immediate data needs.

  • Sidecar Implementations: Engineers often deploy “sidecar” scripts to stream data to analytical engines (like ClickHouse or BigQuery) without touching the core database configuration.
  • Improper Error Handling: When a data streaming job fails or is killed (SIGKILL), it often leaves behind partially written buffers or index files that represent the “state” of the last failed operation.
  • Configuration Drift: A developer might enable a “debug” or “trace” mode in a data pipeline that writes raw traffic to disk, forgetting to disable it once the troubleshooting is complete.

Real-World Impact

  • Disk Exhaustion (ENOSPC): The most immediate threat is that these files grow monotonically, eventually consuming all available space on the mount point, which causes MySQL to crash or enter a read-only state.
  • I/O Contention: Continuous writing of thousands of small files creates significant IOPS overhead, competing with the actual database engine for disk throughput.
  • Inode Exhaustion: Even if the files are small, the sheer volume (indicated by the increasing xxxxxx suffix) can exhaust the filesystem’s available inodes, preventing any new files from being created on the system.

Example or Code (if necessary and relevant)

# Identifying the process holding these files open
lsof +D /var/lib/mysql/ | grep "OFF\."

# Checking the rate of file creation to estimate time-to-disk-full
find /var/lib/mysql/ -name "OFF.*" -printf '%T@\n' | sort -n | awk '{print $1}' | uniq -c

# Simulating a cleanup of files older than 7 days (USE WITH CAUTION)
find /var/lib/mysql/ -name "OFF.*" -mtime +7 -delete

How Senior Engineers Fix It

A senior engineer does not just delete the files; they remediate the source.

  • Process Tracing: Use lsof or fuser to identify which PID is actively writing to the OFF.index file.
  • Root Cause Analysis (RCA): Trace the PID back to a specific service or cron job. In this case, the pattern of timestamps (xx:00, xx:01, etc.) suggested a scheduled task or a polling loop.
  • Implementation of Lifecycle Management: Once the source is identified, implement a log rotation policy (e.g., logrotate) or update the application code to include a TTL (Time To Live) for temporary data files.
  • Observability: Add disk usage alerts specifically for the /var/lib/mysql/ directory to catch anomalous file growth before it reaches critical thresholds.

Why Juniors Miss It

  • Symptom vs. Cause: A junior engineer will often see the disk space issue and simply delete the files to restore service, only to have the problem recur days later.
  • Assumption of Standard Behavior: Juniors often assume any file in /var/lib/mysql/ must be a legitimate MySQL file (like .ibd or binlog), whereas seniors know that unauthorized or third-party processes can pollute database directories.
  • Ignoring Metadata: Juniors may overlook the file creation timestamps and the naming pattern, which are the most critical clues for identifying automated, non-standard processes.

Leave a Comment