Why Q-learning failed: missing epsilon-greedy and a reward loop bug
Summary
A production-level reinforcement learning simulation failed its validation suite because of logic errors in the environment dynamics and action selection. Specifically, the implementation lacked an epsilon-greedy exploration strategy, used a hardcoded action (always moving right), and contained a faulty reward loop at the terminal state. This resulted in the agent converging to …
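A minimal sketch of the corrected pattern, assuming a hypothetical 1-D corridor environment (the post's actual environment, function names, and hyperparameters are not shown, so `step`, `epsilon_greedy`, `train`, and all constants below are illustrative). It replaces a hardcoded action with epsilon-greedy selection, and it treats the goal as terminal: the update does not bootstrap from the goal state, and the episode ends there so the goal reward cannot be collected in a loop.

```python
import random

# Hypothetical 1-D corridor: states 0..N_STATES-1, goal at the right end.
N_STATES = 6
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action, n_states=N_STATES):
    """Move one cell (walls clamp); reward 1.0 only on reaching the goal, which is terminal."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(n_states - 1, state + 1)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon; otherwise pick a best action, breaking ties randomly."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and proper terminal handling."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # Terminal transitions do not bootstrap: the target is just the reward.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break  # end the episode so the goal reward is paid once, not in a loop
    return q


if __name__ == "__main__":
    q_table = train()
    policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
    print("greedy policy (0=left, 1=right):", policy)
```

Because every reward is at most 1.0 and the goal state is never updated, every Q-value in this sketch stays at or below 1.0. A reward that kept firing at the terminal state would instead let values drift toward 1 / (1 - gamma), which is one symptom to check for in a validation suite.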