Summary
The batch tee function fails to handle raw binary byte streams correctly, resulting in data corruption where ASCII characters are replaced by garbled output (e.g., “Hello World” becomes “效汬潗汲”). The root cause is a mismatch between the input encoding and the batch processing logic. The input stream is likely UTF-16 encoded (standard output from PowerShell or certain Windows commands), but the batch script reads it as if it were ASCII/ANSI and processes it using tokenization. This causes multi-byte characters to be misinterpreted, split across tokens, and reassembled incorrectly during the output phase.
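This class of mojibake is easy to reproduce outside of batch. The following standalone sketch (illustrative only, not part of the failing script) shows how pairing adjacent ASCII bytes into 16-bit UTF-16 code units fuses readable text into CJK characters like the garbled output above:

```python
# Pair adjacent ASCII bytes into UTF-16LE code units to reproduce the
# kind of CJK mojibake described above. 8 ASCII bytes -> 4 UTF-16 code units.
ascii_bytes = b"HellWorl"

garbled = ascii_bytes.decode("utf-16-le")
print(garbled)  # four CJK characters instead of readable text

# Show how each pair of ASCII bytes fuses into one 16-bit code point:
for i in range(0, len(ascii_bytes), 2):
    lo, hi = ascii_bytes[i], ascii_bytes[i + 1]
    print(f"{chr(lo)}{chr(hi)} -> U+{(hi << 8) | lo:04X}")
```

Every two Latin letters collapse into a single code point in the CJK range, which is why the corruption looks like Chinese text rather than random noise.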
Root Cause
The specific technical failures in the implementation are:
- UTF-16 vs. ASCII Mismatch: The input stream entering the pipe is UTF-16 LE (Little Endian). In UTF-16, the character ‘H’ (ASCII 0x48) is represented as the bytes 0x48 0x00. The batch command `FIND /N /V ""` echoes lines with a prefix like `[1]` followed by the line content. However, because the input contains null bytes (0x00) between every character, the `FOR /F` loop (which uses space and tab as default delimiters) sees the null bytes as delimiters, effectively breaking the data into garbage tokens.
- Line Ending Corruption: The input likely contains Carriage Return + Line Feed (`\r\n`). The `FIND` command adds its own CR LF at the end of each line. The `FOR /F` loop parses lines, but due to the encoding issues it fails to correctly identify where the original line ends, often stripping or mangling the final carriage return.
- Tokenization Logic Error: The line `FOR /F "tokens=1* delims=]" ...` attempts to strip the `[1]` prefix. However, due to the UTF-16 interleaving, the `]` character (bytes 0x5D 0x00) appears differently in the stream. The parser fails to find the `]` as a clean delimiter and ends up processing the raw bytes.
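The delimiter failure can be simulated without running batch at all. The sketch below is a simplified model, assuming (for illustration) that a byte-blind parser like `FOR /F` treats NUL bytes the way it treats whitespace delimiters; it shows how tokenizing a UTF-16LE stream destroys the original structure:

```python
# Model a naive ANSI tokenizer running over UTF-16LE bytes.
# Assumption for illustration: NUL bytes act like whitespace delimiters,
# mirroring how the batch parser effectively mangles the stream.
utf16_line = "Hello World".encode("utf-16-le")
print(utf16_line)  # b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00...'

# Treat NUL as a delimiter and split, as a byte-blind parser might:
tokens = utf16_line.replace(b"\x00", b" ").split()
print(tokens)  # every character becomes its own one-byte "token"

# Reassembling the tokens loses the original spacing and structure:
print(b"".join(tokens))  # the space between the words is gone
```

Each character survives individually, but the delimiters (including the real space) are indistinguishable from the NUL padding, so the reassembled output no longer matches the input.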
Why This Happens in Real Systems
Windows batch scripting is a legacy technology built on the MS-DOS architecture, which predates Unicode standards. It natively operates on the active Code Page (ANSI).
- PowerShell Integration: Modern Windows automation often involves piping data from PowerShell (`powershell -c "..." | .\tee.bat`) or WSL. PowerShell defaults to UTF-16 for pipeline communication with external native executables in certain contexts, or the console output is converted to UTF-16.
- Byte-Stream Blindness: Batch commands like `FOR /F` treat the input stream as a text file. If that stream contains multi-byte characters, batch does not natively decode them; it simply reads bytes. A single UTF-16 ‘A’ looks like two bytes: 0x41 0x00. Batch reads 0x41 (ASCII ‘A’), ignores the null 0x00 (often treated as whitespace or a token separator), and then reads the next character, breaking the data structure.
Real-World Impact
- Data Corruption: Files written by this script are unreadable by standard text editors (Notepad, VS Code) which expect valid text encoding.
- Silent Failures: Since the console output looks correct (if the console handles the encoding conversion visually), the corruption is often not detected until the file is used in a downstream process.
- Debugging Nightmare: The garbage characters (“效汬潗汲”) are cryptic and do not immediately point to an encoding issue, leading developers to blame the pipe source or the `ECHO` command itself.
Example Code
The following Python script simulates the exact byte stream that causes the failure. It writes “Hello World” in UTF-16LE (as many Windows tools do) and prints the hex values to show why the batch parser fails.
```python
# Simulate the input stream being UTF-16LE, which is common in Windows pipelines
# "Hello World" in UTF-16LE is:
# H(48 00) e(65 00) l(6c 00) l(6c 00) o(6f 00) (space 20 00)
# W(57 00) o(6f 00) r(72 00) l(6c 00) d(64 00) CR(0d 00) LF(0a 00)
input_bytes = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\r\x00\n\x00'

# Create the input file for the test
with open('utf16_input.txt', 'wb') as f:
    f.write(input_bytes)

# Demonstrate the byte corruption visually
print("--- Hex Dump of Input ---")
print(input_bytes.hex(' '))

print("\nWhen Batch processes this byte stream:")
print("It sees bytes: [48][00] -> 'H' then a 'delimiter' (00)")
print("It sees bytes: [65][00] -> 'e' then a 'delimiter' (00)")
print("It creates garbage because it concatenates data incorrectly.")
```
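For contrast, the same byte stream reads back cleanly once its encoding is honored. This follow-up sketch (not part of the original demo) decodes the bytes as UTF-16LE instead of letting a byte-blind parser loose on them:

```python
# Decode the same byte stream correctly by honoring its UTF-16LE encoding.
input_bytes = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\r\x00\n\x00'

text = input_bytes.decode("utf-16-le")
print(repr(text))  # 'Hello World\r\n' -- intact, including the CRLF

# Reading the file written by the demo above works the same way:
# with open('utf16_input.txt', 'rb') as f:
#     text = f.read().decode('utf-16-le')
```

The data was never damaged in transit; it only becomes garbage when a tool that cannot decode UTF-16 tokenizes the raw bytes.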
How Senior Engineers Fix It
Do not use Batch for text stream manipulation. Senior engineers recognize that Batch is insufficient for modern character encoding requirements.
- Use PowerShell: The native, supported way to do `tee` in Windows is PowerShell’s `Tee-Object`: `"Hello World" | Tee-Object -FilePath "output.txt"`
- Use WSL/Linux Tools: If the environment supports it, use the native `tee` command from a Linux subsystem: `echo "Hello World" | tee output.txt`
- Hybrid Scripts: If a `.bat` file is mandatory, the senior fix is to have the batch file immediately invoke PowerShell to handle the I/O correctly, passing the arguments through:

  ```bat
  @echo off
  powershell -NoProfile -Command "& { [Console]::In.ReadToEnd() | Tee-Object -FilePath '%1' }"
  ```

  This bypasses the Batch parser entirely, relying on PowerShell’s robust stream handling.
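If PowerShell is unavailable, a few lines of Python provide the same behavior in a binary-safe way. The sketch below is a hypothetical `tee.py` (not from any library) that copies stdin to stdout and to a file at the raw byte level, so any encoding, including UTF-16, passes through untouched:

```python
import sys

# tee.py: binary-safe tee -- copy stdin to stdout and a file, byte for byte.
# Usage: some_command | python tee.py output.txt
def tee(out_path, stream_in, stream_out):
    with open(out_path, "wb") as f:
        while chunk := stream_in.read(8192):
            stream_out.write(chunk)   # pass the bytes through unmodified
            f.write(chunk)            # mirror the exact same bytes to disk

if __name__ == "__main__" and len(sys.argv) > 1:
    # .buffer exposes the raw byte streams, bypassing text decoding entirely
    tee(sys.argv[1], sys.stdin.buffer, sys.stdout.buffer)
```

Because it never decodes the stream, this version cannot corrupt UTF-16, UTF-8, or even arbitrary binary data, which is exactly the property the batch implementation lacks.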
Why Juniors Miss It
- ASCII Assumption: Juniors often assume that text is just ASCII characters (bytes 0-127). They write scripts that work for “Test” but fail for “你好” or even complex ANSI inputs.
- “It looks right on screen”: The console host (`cmd.exe` or Windows Terminal) often automatically converts UTF-16 bytes back to readable characters for the display. The junior developer sees “Hello World” on the screen and assumes the pipe worked, missing that the file write operation (which bypasses the display rendering) is corrupting the data.
- Over-reliance on Batch: They often view Batch as a capable scripting language for all tasks, not realizing it is a 30-year-old shell wrapper that has significant limitations regarding I/O and encoding.