Batch custom tee function redirect to file is inconsistent

Summary

The batch tee function fails to handle raw byte streams correctly, corrupting data so that plain ASCII text is replaced by garbled output (e.g., “Hello World” becomes “效汬潗汲⁤਍”). The root cause is a mismatch between the input encoding and the batch processing logic. The input stream is likely UTF-16 encoded (as produced by PowerShell redirection and certain Windows commands), but the batch script reads it byte-by-byte as if it were ASCII/ANSI and tokenizes it. Multi-byte characters are therefore misinterpreted, split across tokens, and reassembled incorrectly during the output phase.

Root Cause

The specific technical failures in the implementation are:

  • UTF-16 vs. ASCII Mismatch: The input stream entering the pipe is UTF-16 LE (Little Endian). In UTF-16 LE, the character ‘H’ (ASCII 0x48) is encoded as the bytes 0x48 0x00. The batch command FIND /N /V "" prefixes each line with its number in brackets (e.g., [1]) followed by the line content. However, because a null byte (0x00) sits between every character, the FOR /F loop (which uses space and tab as default delimiters) effectively treats the nulls as separators, breaking the data into garbage tokens.
  • Line Ending Corruption: The input likely contains Carriage Return + Line Feed (\r\n). The FIND command appends its own CR LF to each line. The FOR /F loop parses lines, but because of the encoding mismatch it fails to identify where the original line ends, often stripping or mangling the final carriage return.
  • Tokenization Logic Error: The line FOR /F "tokens=1* delims=]" ... attempts to strip the [1] prefix. However, in the UTF-16 stream the ] character is the byte pair 0x5D 0x00, so the parser never finds a clean single-byte ] delimiter and ends up processing misaligned raw bytes.
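The tokenization failure can be reproduced outside of Batch. The sketch below is a simplification: the `[1]` prefix and the byte-level split on `]` stand in for what `FIND /N` and `FOR /F "delims=]"` do, which is an assumption about cmd.exe's exact parsing behavior.

```python
# Simulate FIND /N prefixing a line, then FOR /F "delims=]" stripping the prefix,
# operating byte-wise on a UTF-16LE stream.
line = "[1]Hello".encode("utf-16-le")   # b'[\x001\x00]\x00H\x00e\x00l\x00l\x00o\x00'

# Byte-level split on ']' (0x5D) -- stands in for the delims=] tokenization.
# It removes only the 0x5D byte and leaves the trailing 0x00 of the ']' pair behind.
prefix, sep, rest = line.partition(b"]")

print(rest.hex(" "))   # 00 48 00 65 00 6c 00 6c 00 6f 00 -- note the leading null

# The stray 0x00 shifts the UTF-16 alignment by one byte, so every subsequent
# 2-byte code unit is decoded across character boundaries (and the odd byte
# count leaves a truncated code unit at the end):
print(rest.decode("utf-16-le", errors="replace"))   # '䠀攀氀氀漀' + replacement char
```

The decoded result is CJK-range garbage rather than `Hello`, which is exactly the class of corruption described in the summary.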

Why This Happens in Real Systems

Windows batch scripting is a legacy technology built on the MS-DOS architecture, which predates Unicode standards. It natively operates on the active Code Page (ANSI).

  • PowerShell Integration: Modern Windows automation often involves piping data from PowerShell (powershell -c "..." | .\tee.bat) or WSL. Windows PowerShell’s redirection operators write UTF-16 LE by default, and the encoding it uses when piping to native executables is governed by $OutputEncoding, so UTF-16 byte streams reach batch scripts more often than developers expect.
  • Byte-Stream Blindness: Batch file commands like FOR /F treat the input stream as a text file. If that stream contains multi-byte characters, batch does not natively decode them. It simply reads bytes. A single UTF-16 ‘A’ looks like two bytes: 0x41 0x00. Batch reads 0x41 (ASCII ‘A’), ignores the null 0x00 (often treated as whitespace or token separator), and then reads the next character, breaking the data structure.
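This byte-stream blindness can be sketched in a few lines of Python. Splitting on NUL here stands in for cmd.exe treating 0x00 as a separator, which is an assumption about its behavior rather than documented semantics.

```python
# 'Hello' encoded as UTF-16LE: every ASCII character is followed by a null byte.
data = "Hello".encode("utf-16-le")
print(data)                      # b'H\x00e\x00l\x00l\x00o\x00'

# A byte-blind tokenizer that treats 0x00 as a separator shreds the text
# into single-character fragments:
tokens = [t for t in data.split(b"\x00") if t]
print(tokens)                    # [b'H', b'e', b'l', b'l', b'o']

# Stripping the nulls happens to reconstruct the ASCII text, which is why
# output can *look* correct on screen while the underlying bytes are wrong:
print(b"".join(tokens).decode("ascii"))   # Hello
```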

Real-World Impact

  • Data Corruption: Files written by this script contain mis-encoded bytes that standard text editors (Notepad, VS Code) cannot interpret as valid text.
  • Silent Failures: Since the console output looks correct (if the console handles the encoding conversion visually), the corruption is often not detected until the file is used in a downstream process.
  • Debugging Nightmare: The garbage characters (“效汬潗汲⁤਍”) are cryptic and do not immediately point to an encoding issue, leading developers to blame the pipe source or the ECHO command itself.

Example or Code

The following Python script simulates the exact byte stream that causes the failure. It writes “Hello World” in UTF-16LE (as many Windows tools do) and prints the hex values to show why the batch parser fails.

import os

# Simulate the input stream being UTF-16LE, which is common in Windows pipelines
# "Hello World" in UTF-16LE is:
# H(48 00) e(65 00) l(6c 00) l(6c 00) o(6f 00) (space 20 00)
# W(57 00) o(6f 00) r(72 00) l(6c 00) d(64 00) CR(0d 00) LF(0a 00)
input_bytes = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\r\x00\n\x00'

# Create the input file for the test
with open('utf16_input.txt', 'wb') as f:
    f.write(input_bytes)

# Demonstrate the byte corruption visually
print("--- Hex Dump of Input ---")
print(input_bytes.hex(' '))
print("\nWhen Batch processes this byte stream:")
print("It sees bytes: [48][00] -> 'H' then a 'delimiter' (00)")
print("It sees bytes: [65][00] -> 'e' then a 'delimiter' (00)")
print("It creates garbage because it concatenates data incorrectly.")
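The reverse mismatch explains the exact garbage shown in the summary: when a reader (a text editor, or a later pipeline stage) interprets plain ASCII bytes as UTF-16LE, each pair of adjacent characters fuses into one code unit, most of which land in the CJK range. The trailing space below is an illustrative assumption to keep the byte count even.

```python
# "Hello World \r\n" written as plain ASCII/ANSI bytes...
ascii_bytes = "Hello World \r\n".encode("ascii")

# ...but decoded as UTF-16LE: 'H' (0x48) + 'e' (0x65) fuse into U+6548 ('效'),
# 'l' + 'l' into U+6C6C ('汬'), and so on. Some pairs (like 'o' + space,
# U+206F) decode to invisible code points.
garbled = ascii_bytes.decode("utf-16-le")
print(garbled)   # 效汬潗汲⁤਍ (plus invisible code points)
```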

How Senior Engineers Fix It

Do not use Batch for text stream manipulation. Senior engineers recognize that Batch is insufficient for modern character encoding requirements.

  1. Use PowerShell: The native, supported way to do tee in Windows is using PowerShell’s Tee-Object.
    "Hello World" | Tee-Object -FilePath "output.txt"
  2. Use WSL/Linux Tools: If the environment supports it, use the native tee command from a Linux subsystem.
    echo "Hello World" | tee output.txt
  3. Hybrid Scripts: If a .bat file is mandatory, the senior fix is to have the batch file immediately invoke PowerShell to handle the I/O correctly, passing the arguments through.
    @echo off
    powershell -NoProfile -Command "[Console]::In.ReadToEnd() | Tee-Object -FilePath '%~1'"

    This bypasses the Batch parser entirely, relying on PowerShell’s robust stream handling.
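The discipline behind all three fixes is the same: decode the byte stream explicitly before treating it as text, then write the file in one known encoding. A minimal Python sketch of that policy follows; the helper name tee_bytes, the BOM sniff, and the UTF-8 fallback are illustrative choices, not part of the original script.

```python
import sys

def tee_bytes(data: bytes, path: str) -> str:
    """Decode an incoming byte stream explicitly, then tee it to a file.

    Hypothetical helper: a BOM sniff plus a UTF-8 fallback is one reasonable
    policy; real Windows pipelines may also carry ANSI code-page text.
    """
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        text = data.decode("utf-16")      # the utf-16 codec consumes the BOM
    else:
        text = data.decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)                      # file is now one known encoding
    sys.stdout.write(text)                 # ...and the console copy matches
    return text

# A UTF-16LE stream with a BOM, as PowerShell redirection produces:
stream = b"\xff\xfe" + "Hello World\r\n".encode("utf-16-le")
tee_bytes(stream, "output.txt")
```

Note newline="" on the file handle, which preserves the original \r\n instead of letting Python translate line endings a second time.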

Why Juniors Miss It

  • ASCII Assumption: Juniors often assume that text is just ASCII characters (bytes 0-127). They write scripts that work for “Test” but fail for “你好” or even complex ANSI inputs.
  • “It looks right on screen”: The console host (cmd.exe or Terminal) often automatically converts UTF-16 bytes back to readable characters for the display. The junior developer sees “Hello World” on the screen and assumes the pipe worked, missing that the file write operation (which bypasses the display rendering) is corrupting the data.
  • Over-reliance on Batch: They often view Batch as a capable scripting language for all tasks, not realizing it is a 30-year-old shell wrapper that has significant limitations regarding I/O and encoding.