Validating Repeated Semicolon-Delimited Patterns in Python

Summary

Goal: Validate that a Python string consists exclusively of one or more repetitions of a specific semicolon‑delimited pattern.
Solution: Use an anchored regular expression with a non‑capturing repeated group:

pattern = r'^(?:[^;]*;[^;]*;[^;]*;[012];?)+$'

When re.fullmatch(pattern, s) returns a match, the entire string consists of one or more valid pattern instances and nothing else.


Root Cause

  • In the original regex .*;.*;.*;[012];? each .* matches any character, including semicolons, so a single field can swallow delimiters and fields that belong to adjacent repetitions.
  • Adding a + quantifier ((.*;.*;.*;[012];?)+) does not help: the inner .* can still consume characters from the next repetition, so the overall match can succeed even on malformed strings.

Key takeaway: .* is too permissive; we need to restrict each field to exclude the delimiter that separates repetitions.
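
The over-matching is easy to reproduce. The sketch below compares the original pattern with the restricted one on a string whose first record carries an invalid flag (3). The variable names are illustrative only, and a non-capturing group is used in both patterns for brevity.

```python
import re

# Original pattern: every .* may also consume ';'
naive = re.compile(r'(?:.*;.*;.*;[012];?)+')
# Restricted pattern: a field can never contain the delimiter
strict = re.compile(r'(?:[^;]*;[^;]*;[^;]*;[012];?)+')

# The first record ends in 3, which is not an allowed flag value
bad = "P;QTMCFGUART1;W,460800;3;P;QTMCFGUART2;W,460800;1"

# The first .* can swallow "P;QTMCFGUART1;W,460800;3;P", hiding the bad flag
naive_ok = naive.fullmatch(bad) is not None
# Here the third ';' must be followed directly by 0, 1, or 2
strict_ok = strict.fullmatch(bad) is not None

print(naive_ok, strict_ok)
```

The naive pattern accepts the string because the invalid 3 ends up inside a field, while the restricted pattern rejects it.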


Why This Happens in Real Systems

  • Over‑general quantifiers (.*, .+) are a common source of bugs in log parsers, configuration validators, and protocol decoders.
  • Engineers often check a regex by eye against a few sample inputs rather than testing boundary cases, so malformed input is silently accepted.
  • In production, such patterns can let corrupted messages slip through, causing downstream parsing errors or security issues.

Real-World Impact

  • Data corruption: Invalid configuration lines are stored, later causing device misconfiguration.
  • System crashes: Downstream code assumes a fixed number of fields and raises IndexError.
  • Security exposure: Attackers can inject extra semicolons to shift field boundaries, potentially bypassing validation checks.
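
To make the last two points concrete, here is a small sketch of a hypothetical downstream consumer, read_flags, that assumes every record has exactly four fields. The function and the sample strings are invented for illustration and are not taken from any real system.

```python
import re

naive = re.compile(r'(?:.*;.*;.*;[012];?)+')

def read_flags(s):
    """Hypothetical consumer: groups fields in fours and reads the fourth as the flag."""
    fields = s.rstrip(';').split(';')
    records = [fields[i:i + 4] for i in range(0, len(fields), 4)]
    return [rec[3] for rec in records]

shifted = "P;QTMCFGUART1;W;EXTRA;1"   # one injected field shifts the boundaries

# The naive validator lets the shifted string through
accepted = naive.fullmatch(shifted) is not None

try:
    flags = read_flags(shifted)
except IndexError:
    # The first group yields "EXTRA" as its flag; the second group is just ['1'],
    # so rec[3] does not exist
    flags = None
```

In this sketch the injected field is first misread as a flag and then leaves a short final group, which is where the IndexError comes from.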

Example or Code

import re

def is_valid(s: str) -> bool:
    # First three fields: any characters except ';' (possibly none), each followed by ';'
    # Fourth field: a single digit, 0, 1, or 2
    # Optional trailing ';' allowed after each occurrence
    pattern = r'^(?:[^;]*;[^;]*;[^;]*;[012];?)+$'
    return re.fullmatch(pattern, s) is not None

# Valid
assert is_valid("P;QTMCFGUART1;W,460800;1;")
assert is_valid("P;QTMCFGUART1;W,460800;1;P;QTMCFGUART2;R;2")

# Invalid
assert not is_valid("P;QTMCFGUART1;W,460800;3;P;QTMCFGUART2;W,460800;1")
assert not is_valid("P,QTMCFGUART1,R,P,QTMCFGUART2,R,2")

How Senior Engineers Fix It

  • Anchor the pattern with ^ and $ to enforce whole‑string validation.
  • Replace .* with negated character classes ([^;]*) to stop consumption at the delimiter.
  • Use a non‑capturing group (?: … ) when the group’s value isn’t needed, keeping the regex efficient.
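  • Prefer re.fullmatch over re.search/re.match: it requires the entire string to match, which makes the intent explicit and the ^ and $ anchors redundant.
  • Write unit tests covering edge cases (optional trailing semicolon, single repetition, empty fields, very long fields).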
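
One way to cover these edge cases is a small table-driven check, sketched below. The case list is an assumption about which inputs matter; in particular, whether empty fields such as ";;;1" should be valid is a question for the format's specification. The pattern as written accepts them.

```python
import re

# fullmatch already requires the whole string to match, so ^ and $ are omitted
RECORDS = re.compile(r'(?:[^;]*;[^;]*;[^;]*;[012];?)+')

def is_valid(s: str) -> bool:
    return RECORDS.fullmatch(s) is not None

# (input, expected result) pairs; the inputs are made up for illustration
CASES = [
    ("a;b;c;0", True),                 # single repetition, no trailing ';'
    ("a;b;c;0;", True),                # optional trailing ';'
    ("a;b;c;0;d;e;f;2", True),         # two repetitions
    (";;;1", True),                    # empty fields are allowed by [^;]*
    ("x" * 10_000 + ";b;c;2", True),   # very long field
    ("", False),                       # '+' requires at least one repetition
    ("a;b;c;3", False),                # flag outside 0-2
    ("a;b;1", False),                  # too few fields
    ("a;b;c;1;;", False),              # stray extra ';'
]

# Collect any case where the validator disagrees with the expectation
failures = [(s[:20], want) for s, want in CASES if is_valid(s) != want]
print(failures)
```

An empty failures list means every case behaved as expected. The same table can be moved into pytest or unittest once the set of cases is agreed on.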

Why Juniors Miss It

  • Tend to copy‑paste naïve .* patterns without considering delimiter boundaries.
  • May not realize the difference between re.match, re.search, and re.fullmatch.
  • Often overlook the importance of anchoring (^…$) when the requirement is “the entire string must match.”
  • Lack of experience with negated character classes leads to over‑general solutions that pass invalid inputs.
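
The difference between the three matching functions, and a subtlety of the $ anchor, can be seen in a short sketch. The sample strings are invented for illustration.

```python
import re

unanchored = r'(?:[^;]*;[^;]*;[^;]*;[012];?)+'
anchored = r'^(?:[^;]*;[^;]*;[^;]*;[012];?)+$'

# Without anchors, match and search are satisfied by a valid prefix
junk = "a;b;c;1;not a record"
prefix_match = re.match(unanchored, junk) is not None
prefix_search = re.search(unanchored, junk) is not None
whole = re.fullmatch(unanchored, junk) is not None

# With ^...$, re.match still tolerates one trailing newline, because $ also
# matches just before a newline at the very end of the string
with_newline = "a;b;c;1\n"
match_newline = re.match(anchored, with_newline) is not None
full_newline = re.fullmatch(anchored, with_newline) is not None

print(prefix_match, prefix_search, whole, match_newline, full_newline)
```

Only re.fullmatch rejects both inputs. If re.match has to be used, ending the pattern with \Z instead of $ avoids the trailing-newline case.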
