Avoid eval with html.escape to Prevent Server RCE in Python

Summary

Avoid using eval() when traversing HTML.
html.escape only sanitises characters for HTML output; it does not protect against code injection when the escaped string is later fed to eval(). The safe approach is to treat the node identifier as data, not as executable code, and use a lookup table or a proper parser instead of eval.

Root Cause

eval() executes any Python expression contained in the string.
html.escape converts <, >, &, " and ' to HTML entities, but the resulting string is still a valid Python literal (e.g., 'a' → 'a').
An attacker can supply a string such as __import__('os').system('rm -rf /') which passes escape unchanged and gets executed.

Why This Happens in Real Systems

Developers treat identifiers or configuration values as code snippets.
Convenience of eval seems attractive for dynamic dispatch, but the boundary between data and code becomes blurred.
Lack of input validation and reliance on “HTML escaping” creates a false sense of security.

Real-World Impact

Remote code execution (RCE) on the server running the scraper.
Privilege escalation if the process runs with elevated rights.
Data exfiltration or destruction of logs, backups, or other assets.
Legal and compliance violations when compromised systems process user‑generated URLs.

Example or Code (if necessary and relevant)

# Unsafe version (original)
def discover_nodes(node, leaf_nodes=[]):
    node_value = eval(escape(node))          # <-- dangerous
    soup = BeautifulSoup(node_value, 'lxml')
    ...

# Safe version using a dictionary lookup
pages = {
    'root': '  ',
    'a': '',
    'b': '',
    'c': '',
    'd': '',
}

def discover_nodes(node, leaf_nodes=None):
    if leaf_nodes is None:
        leaf_nodes = []
    node_value = pages.get(node, '')
    soup = BeautifulSoup(node_value, 'lxml')
    links = soup.find_all('a')
    if not links:
        leaf_nodes.append(node)
        return leaf_nodes
    for link in links:
        discover_nodes(link['href'], leaf_nodes)
    return leaf_nodes

print(discover_nodes('root'))  # ['a', 'd', 'c']

How Senior Engineers Fix It

Eliminate eval: Replace it with a deterministic data structure (dict, DB, cache).
Validate all external inputs against a whitelist of allowed node identifiers.
Use typing and static analysis tools to catch unsafe dynamic execution.
Log and monitor any unexpected identifiers before processing.
If dynamic execution is truly required, sandbox the environment (e.g., ast.literal_eval for literals only).

Why Juniors Miss It

They often conflate escaping for HTML with escaping for code execution.
Limited exposure to threat modeling leads to trusting escape as a universal sanitizer.
The convenience of eval hides its risks, and junior developers may not be aware of the difference between data and code.
Lack of mentorship on secure coding patterns and code‑review feedback.