Summary
A candidate submitted a Python script intended to identify all strings from an input file that appear as a substring of at least one other distinct string in the list. The script uses nested loops and the in operator to perform this check. While the code runs and produces correct output on small inputs, the underlying algorithmic approach and file handling are fundamentally flawed for production environments. The submission failed to address the scalability of the O(n²) approach and ignored critical resource management concerns.
Root Cause
The primary root causes for the rejection of this solution are Algorithmic Inefficiency and Lack of Resource Management.
- Quadratic Complexity: The solution iterates through the list of strings once, and for each string, iterates through the entire list again. This results in O(n²) time complexity. For an input file containing even a few tens of thousands of strings, the runtime grows quadratically (hundreds of millions of substring checks), which can effectively create a Denial of Service (DoS) condition on the host.
- Unbounded Memory Usage: The solution reads the entire input file into memory (`strings = [line.strip() for line in f]`) and constructs a second list (`result`) containing matching strings. If the input file contains millions of lines, this will exhaust available RAM and crash the process.
- Incorrect Result Collection Logic: The logic `if s != other and s in other: result.append(s); break` is fragile. While it technically satisfies the requirement, it fails to optimize the search or handle edge cases such as empty strings (which match everything) or whitespace-only strings.
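The empty-string edge case is easy to demonstrate. Python's `in` operator treats the empty string as a substring of every string, so any blank line in the input is unconditionally reported as a match. A minimal repro (the input list here is illustrative, not the candidate's test data):

```python
strings = ["cat", "concatenate", "", "dog"]

# The same check the candidate's nested loop performs.
matches = [s for s in strings
           if any(s != other and s in other for other in strings)]

# "cat" is a substring of "concatenate"; the empty string
# matches every non-empty string, so it is flagged as well.
print(matches)  # ['cat', '']
```

If blank or whitespace-only lines are not valid inputs, they should be filtered out explicitly during the read phase rather than allowed to pollute the result.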
Why This Happens in Real Systems
Junior engineers often prioritize immediate correctness over systemic stability. They focus on “Does the code run?” rather than “Can the code survive real-world data?”
This behavior stems from:
- Algorithm Blindness: A lack of understanding regarding Big O notation and how superlinear growth impacts real-world performance.
- The “It Works on My Machine” Syndrome: Testing only with small, manually created files rather than generating dataset sizes that simulate production load.
- Treating Code as Scripts vs. Services: Writing code that assumes infinite resources rather than writing code that manages resources (RAM, CPU, File Handles) responsibly.
Real-World Impact
If this code were deployed to a production environment, the consequences would be severe:
- Service Degradation: The CPU usage would spike to 100% during the nested loops, causing latency spikes for other services running on the same host.
- Out of Memory (OOM) Kill: Processing a standard log file (e.g., 100MB+) would likely cause the process to be terminated by the kernel for excessive memory usage.
- Data Corruption Risk: If the process is killed mid-execution (due to timeout or OOM), the output file may be left in a partial state, leading to data loss or corruption for downstream consumers.
- False Sense of Security: The code appears to “work” on test cases, allowing the bug to persist until a production incident occurs.
Example or Code
Below is the implementation provided by the candidate. It demonstrates the nested loop approach that causes the performance bottleneck.
```python
# Read input file
with open("input.txt", "r") as f:
    strings = [line.strip() for line in f]

result = []
for s in strings:
    for other in strings:
        if s != other and s in other:
            result.append(s)
            break

# Write output file
with open("output.txt", "w") as f:
    for s in result:
        f.write(s + "\n")
```
How Senior Engineers Fix It
Senior engineers approach this problem by prioritizing efficiency and scalability. They would likely implement one of the following strategies:
- Sorting and Windowing (O(n log n)):
- Sort the strings by length (shortest to longest).
- Iterate through the sorted list. Since a distinct substring must be strictly shorter than the string that contains it, each string only needs to be compared against longer strings.
- Use a trie or a multi-pattern matching algorithm (e.g., Aho-Corasick) if sorting alone is not fast enough.
- Generator-Based Processing (Memory Efficiency):
- Do not load the whole file into memory. Instead, read lines lazily using generators to minimize RAM footprint.
- Batch Processing:
- If the dataset is massive, process it in chunks or use streaming frameworks (like Apache Spark) designed for large-scale text processing.
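As a rough sketch of the sort-by-length idea (the helper name is ours, not the candidate's): sorting means each string only has to be checked against strictly longer strings, and duplicates and equal-length pairs can be skipped outright. Substring tests against the longer strings still remain, so for very large inputs a trie or an Aho-Corasick automaton would be the next step.

```python
def find_contained_strings(strings):
    # Deduplicate (the task concerns distinct strings) and sort by
    # length: a distinct substring is always strictly shorter than
    # the string that contains it.
    ordered = sorted(set(strings), key=len)
    result = []
    for i, s in enumerate(ordered):
        # Only strictly longer strings can contain s; distinct
        # equal-length strings never contain each other.
        if any(s in other for other in ordered[i + 1:] if len(other) > len(s)):
            result.append(s)
    return result

print(find_contained_strings(["cat", "concatenate", "dog", "catalog"]))  # ['cat']
```

This keeps the comparison space pruned and makes the equal-length and duplicate edge cases explicit, though the worst case is still quadratic in the number of strings.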
Why Juniors Miss It
Juniors miss these issues because they lack the experience of failure.
- Focus on Syntax: They spend cognitive energy ensuring the Python syntax (loops, `if` statements, file modes) is correct, neglecting the architectural implications.
- Lack of Intuition for Scale: They struggle to visualize how 1 million items in a list translates to 1 trillion operations in a nested loop.
- Missing Edge Case Testing: They rarely test with “large” files or edge cases (like massive strings or empty inputs) because they haven’t yet been burned by them in a professional setting.