Monitoring a folder with subfolders for newly added files using Python

Summary

A file-monitoring pipeline failed to detect newly added files in a dated folder hierarchy because the implementation relied on periodic polling and naive directory scans. The system missed events, processed files late, and occasionally reprocessed old files. The core issue was the absence of a real-time filesystem watcher capable of tracking deep subfolder structures.

Root Cause

The failure originated from an overly simplistic approach:

  • Polling every minute left detection gaps: files created and consumed between scans were never seen at all.
  • Recursive directory traversal was implemented inefficiently, leading to high I/O load and missed updates.
  • No event-driven mechanism (e.g., inotify, watchdog observers) was used to detect file creation.
  • Timestamp-based detection was unreliable because some files were written slowly or updated after creation.
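The original script is not reproduced in this postmortem, but the flawed approach amounts to a sketch like the following: each pass walks the entire tree and diffs against an in-memory "seen" set, so anything created and removed between passes is missed, and every pass pays the full I/O cost of a recursive scan (function and variable names here are illustrative).

```python
import os

def scan_new_files(root, seen):
    """One polling pass: walk the whole tree and return paths
    not recorded on a previous pass. Mutates `seen` in place."""
    new_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in seen:
                seen.add(path)
                new_files.append(path)
    return new_files
```

Calling this in a `while True: ...; time.sleep(60)` loop reproduces every failure mode listed above: the full walk is repeated each minute, and a file that arrives and is moved away inside that window leaves no trace.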

Why This Happens in Real Systems

Systems that evolve organically often start with simple scripts that grow beyond their intended scale:

  • Polling seems easy, but it does not scale when subfolders grow into thousands of entries.
  • Filesystem events differ across OSes, so engineers avoid watchers and rely on brute-force scanning.
  • Slow writes cause files to appear incomplete, leading to skipped or corrupted processing.
  • High-frequency file creation overwhelms naive loops, especially on network-mounted storage.

Real-World Impact

The operational consequences were significant:

  • Delayed processing of time-sensitive data.
  • Duplicate processing when files were detected multiple times.
  • Increased CPU and disk usage due to repeated full-directory scans.
  • Silent data loss when files were created between polling intervals.

Example Code

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
import time

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        # React only to files (not directories) with the extension we care about.
        if not event.is_directory and event.src_path.endswith(".csv"):
            print("New file detected:", event.src_path)

# recursive=True watches every subfolder under "mydata", including
# date-based directories created after the watcher starts.
observer = Observer()
observer.schedule(NewFileHandler(), path="mydata", recursive=True)
observer.start()

try:
    # The observer runs in a background thread; keep the main thread alive.
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()

observer.join()

How Senior Engineers Fix It

Experienced engineers replace polling with event-driven monitoring and add guardrails:

  • Use watchdog (cross-platform) or inotify (Linux) for real-time file events.
  • Enable recursive watching to track all date-based subfolders.
  • Add debounce logic to avoid processing partially written files.
  • Maintain a processed-file registry to prevent duplicates.
  • Implement backfill scans on startup to catch missed files.
  • Use asynchronous queues (e.g., asyncio, Kafka, Redis streams) to decouple detection from processing.
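Three of the guardrails above (debounce, a processed-file registry, and a startup backfill) can be sketched together. This is a minimal illustration assuming local-filesystem semantics; the names `wait_until_stable`, `ProcessedRegistry`, and `backfill` are hypothetical, not from any library:

```python
import os
import time

def wait_until_stable(path, interval=0.5, checks=3):
    """Debounce: return True once the file size is unchanged for
    `checks` consecutive polls, i.e. the writer has likely finished."""
    last_size = -1
    stable = 0
    while stable < checks:
        try:
            size = os.path.getsize(path)
        except OSError:
            return False  # file vanished mid-write
        if size == last_size:
            stable += 1
        else:
            stable = 0
            last_size = size
        time.sleep(interval)
    return True

class ProcessedRegistry:
    """Persist handled paths so duplicate events and restarts
    do not reprocess the same file."""
    def __init__(self, state_file):
        self.state_file = state_file
        self.seen = set()
        if os.path.exists(state_file):
            with open(state_file) as f:
                self.seen = {line.strip() for line in f}

    def mark(self, path):
        """Record a path; return False if it was already processed."""
        if path in self.seen:
            return False
        self.seen.add(path)
        with open(self.state_file, "a") as f:
            f.write(path + "\n")
        return True

def backfill(root, registry, process):
    """Startup scan: hand the watcher anything that arrived while it was down."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if registry.mark(path):
                process(path)
```

In the watchdog handler above, `on_created` would call `wait_until_stable` before enqueueing the path, and `registry.mark` before processing it; `backfill` runs once at startup, before the observer is scheduled.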

Why Juniors Miss It

Less experienced engineers often overlook deeper system behavior:

  • They assume polling is sufficient without considering scale or timing.
  • They underestimate filesystem complexity, especially with nested directories.
  • They do not anticipate race conditions from slow writes or concurrent producers.
  • They lack exposure to OS-level event APIs and rely solely on Python loops.