Summary
PPO performs dramatically better with a 180‑dimensional lidar observation because the high‑dimensional signal implicitly encodes temporal information, geometry, and future hazard structure, while the handcrafted 12‑dimensional state removes those cues. The agent is not failing because the state is small — it is failing because the compact state removes the features PPO relies on for stable credit assignment and action timing.
Root Cause
The performance drop comes from loss of implicit temporal and geometric structure when moving from lidar → 12D state.
Key missing elements in the 12D version:
- No time‑to‑collision information (PPO must infer it from velocity + positions across many steps).
- No curvature or shape cues about pipe geometry (lidar gives a full contour).
- No implicit velocity of the world (lidar encodes motion through changing ray distances).
- No redundancy — lidar gives 180 correlated rays; the 12D state gives only one shot per feature.
- Higher sensitivity to noise — small errors in CV detection or normalization distort the entire state.
The 12D state is fully observable, but not rich enough for PPO’s on‑policy learning dynamics.
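One way to see the "implicit velocity" point concretely: a policy fed two consecutive lidar scans can read closing speed directly from ray-distance differences, with no learned dynamics model. A minimal sketch (the 5-ray scan, distances, and dt are illustrative, not taken from the actual environment):

```python
import numpy as np

def radial_velocity(prev_scan: np.ndarray, scan: np.ndarray, dt: float) -> np.ndarray:
    """Finite-difference closing speed per lidar ray.

    A shrinking ray distance between frames means the obstacle along that
    ray is approaching; two stacked scans therefore encode velocity "for free".
    """
    return (prev_scan - scan) / dt  # positive = obstacle approaching

# Toy 5-ray example (a 180-ray scan works identically):
prev = np.array([5.0, 4.0, 3.0, 4.0, 5.0])
curr = np.array([4.5, 3.6, 2.7, 3.8, 4.9])
v = radial_velocity(prev, curr, dt=0.1)
```

The handcrafted 12D state throws this structure away unless velocities or deltas are added back explicitly.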
Why This Happens in Real Systems
Real RL systems often show this exact pattern because:
- High‑dimensional observations contain built‑in temporal structure (e.g., pixel motion, lidar drift).
- Low‑dimensional states require the policy to learn its own internal dynamics model, a job PPO's short on‑policy rollouts handle poorly.
- On‑policy algorithms struggle with sparse temporal cues, especially when action timing is critical.
- Handcrafted states remove redundancy, making the policy brittle and harder to optimize.
In other words: rich observations reduce the burden on the policy network.
Real-World Impact
When the state is too compact:
- Policies plateau early because PPO cannot infer long‑horizon dependencies.
- Exploration becomes ineffective — the agent cannot “see” future hazards early enough.
- Policies become over‑reactive instead of predictive.
- Training becomes more sensitive to hyperparameters, reward shaping, and noise.
- Generalization collapses because the model memorizes specific geometric patterns instead of learning dynamics.
This is why your 12D agent caps at ~80–120 while the lidar agent reaches thousands.
Example
A minimal example of the missing feature: time‑to‑gap.
time_to_gap = (pipe_x - bird_x) / max(abs(pipe_vx), 1e-6)  # guard against a stationary pipe
This single feature often boosts PPO performance more than doubling network size.
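Wrapped as a helper, with the division guard made explicit (variable names here are assumptions matching the snippet above, not a specific environment's API):

```python
def time_to_gap(pipe_x: float, bird_x: float, pipe_vx: float, eps: float = 1e-6) -> float:
    """Horizontal time until the bird reaches the pipe gap.

    Units follow the inputs (seconds if vx is per second, frames if per
    frame). The eps guard avoids division by zero when the pipe is
    momentarily stationary.
    """
    return (pipe_x - bird_x) / max(abs(pipe_vx), eps)
```

Appending this one scalar to the 12D state gives PPO a direct action-timing signal instead of forcing it to infer timing from raw positions and velocities.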
How Senior Engineers Fix It
Experienced RL engineers typically solve this by adding back temporal structure that the compact state removed.
Common fixes:
- Add time‑to‑collision features (TTC, TTC to top/bottom, TTC to next pipe).
- Add delta features (Δgap_y, Δpipe_x, Δpipe_vx).
- Add short history (stack last 2–4 states).
- Use an LSTM policy so PPO can infer dynamics internally.
- Add redundant geometry features (gap height, pipe slope, vertical clearance).
- Reduce noise in CV‑derived features (smoothing, filtering, Kalman).
The key principle: compact does not mean minimal — it must still encode dynamics.
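Two of the fixes above, delta features and a short state history, can be combined in a single observation wrapper. A hypothetical sketch: `EnrichedState`, the stack depth k=3, and the 12‑dimensional base vector are assumptions for illustration, not the original environment's API:

```python
from collections import deque
import numpy as np

class EnrichedState:
    """Augments a compact state vector with deltas and a short history.

    Stacks the last k base states and appends per-feature deltas
    (e.g. the change in gap_y or pipe_x between frames), restoring the
    temporal structure that the handcrafted 12D state removed.
    """
    def __init__(self, k: int = 3):
        self.frames = deque(maxlen=k)

    def reset(self) -> None:
        self.frames.clear()  # call at the start of each episode

    def __call__(self, base_state: np.ndarray) -> np.ndarray:
        prev = self.frames[-1] if self.frames else base_state
        delta = base_state - prev            # zeros on the first frame
        self.frames.append(base_state)
        # Pad with the oldest frame so the shape is fixed from step 0.
        pad = [self.frames[0]] * (self.frames.maxlen - len(self.frames))
        history = list(self.frames) + pad
        return np.concatenate(history + [delta])

enc = EnrichedState(k=3)
obs = enc(np.zeros(12))  # 12 * 3 stacked frames + 12 deltas = 48 dims
```

The result is still compact (48 dims versus 180 for lidar) but now encodes motion explicitly, which is the principle above in code form.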
Why Juniors Miss It
Less experienced engineers often assume:
- “If the state is fully observable, PPO should learn it.”
- “Smaller state = easier learning.”
- “Neural networks will infer dynamics automatically.”
But PPO is not a dynamics‑learning algorithm. It is an on‑policy, short‑horizon optimizer that depends heavily on:
- Redundant cues
- Smooth gradients
- Predictive structure in the observation
Juniors underestimate how much implicit temporal information is baked into high‑dimensional observations.