Compute generalized advantages, deltas, and return targets accurately. Review trajectories with normalization and export tools. Built for reinforcement learning experiments, audits, and reporting workflows.
| Step | Reward | Value | Next Value | Done |
|---|---|---|---|---|
| 0 | 1.20 | 0.90 | 0.70 | 0 |
| 1 | 0.40 | 0.70 | 0.50 | 0 |
| 2 | -0.10 | 0.50 | 0.80 | 0 |
| 3 | 1.60 | 0.80 | 0.45 | 0 |
| 4 | 0.70 | 0.45 | 0.10 | 0 |
| 5 | 0.00 | 0.10 | 0.00 | 1 |
Use the example button to copy these values into the form instantly.
Temporal-difference residual: δ_t = r_t + γ · V(s_{t+1}) · (1 − d_t) − V(s_t)
Generalized advantage estimate: A_t = δ_t + γλ · (1 − d_t) · A_{t+1}
Value target: R̂_t = A_t + V(s_t)
Optional normalization: A′_t = (A_t − μ) / (σ + ε)
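As a sketch, the backward recursion above can be implemented in a few lines of Python (the function and parameter names here are illustrative, not part of the calculator):

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Return (deltas, advantages, value_targets) for one trajectory."""
    T = len(rewards)
    deltas = [0.0] * T
    advantages = [0.0] * T
    next_adv = 0.0
    # Walk the trajectory backwards so A_{t+1} is available at step t.
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - d_t) - V(s_t)
        deltas[t] = rewards[t] + gamma * next_values[t] * mask - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * (1 - d_t) * A_{t+1}
        advantages[t] = deltas[t] + gamma * lam * mask * next_adv
        next_adv = advantages[t]
    # Value targets: R_hat_t = A_t + V(s_t), built from raw advantages.
    targets = [a + v for a, v in zip(advantages, values)]
    return deltas, advantages, targets

# Applied to the example trajectory from the table above:
deltas, advantages, targets = gae(
    rewards=[1.2, 0.4, -0.1, 1.6, 0.7, 0.0],
    values=[0.9, 0.7, 0.5, 0.8, 0.45, 0.1],
    next_values=[0.7, 0.5, 0.8, 0.45, 0.1, 0.0],
    dones=[0, 0, 0, 0, 0, 1],
)
```

At step 0, for instance, δ_0 = 1.2 + 0.99 · 0.7 − 0.9 = 0.993, and at the terminal step 5 the bootstrap term is masked out, giving δ_5 = 0.0 − 0.1 = −0.1.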
It computes temporal-difference deltas, raw advantages, optional normalized advantages, and value targets for a trajectory. These outputs help inspect variance, reward propagation, and critic consistency during reinforcement learning updates.
Separate next-state estimates let you inspect bootstrapping exactly as collected. That helps when trajectories are truncated, padded, or generated from batched environments where the next prediction is stored independently.
Use 1 for terminal or cutoff steps where bootstrapping stops. Use 0 for continuing steps. The mask prevents future value estimates from leaking across episode boundaries.
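A minimal sketch of what the mask does inside the delta formula (the reward and value numbers here are made up for illustration):

```python
gamma = 0.99
r, v, v_next = 1.0, 0.5, 2.0

# Continuing step (done = 0): the next-state value is bootstrapped.
delta_continuing = r + gamma * v_next * (1 - 0) - v

# Terminal or cutoff step (done = 1): the bootstrap term is zeroed,
# so the next episode's value estimate cannot leak backwards.
delta_terminal = r + gamma * v_next * (1 - 1) - v
```

With these numbers, the continuing delta includes the full 0.99 · 2.0 bootstrap, while the terminal delta reduces to r − v.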
Normalization can stabilize policy-gradient updates by centering and scaling advantages. It is useful for comparison and training diagnostics, but value targets should still come from the raw, unnormalized advantage sequence.
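The normalization step can be sketched as follows; whether the standard deviation is the population or sample statistic varies between implementations, so population standard deviation is assumed here:

```python
import statistics

def normalize_advantages(advantages, eps=1e-8):
    """Center and scale advantages: A'_t = (A_t - mu) / (sigma + eps)."""
    mu = statistics.mean(advantages)
    sigma = statistics.pstdev(advantages)  # population std, an assumption
    return [(a - mu) / (sigma + eps) for a in advantages]
```

Note that, as stated above, this output is for the policy-gradient side only; value targets should still be built from the raw advantages.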
Gamma controls how strongly future rewards matter. Values near 1 emphasize long horizons, while smaller values focus learning on near-term outcomes and reduce sensitivity to distant rewards.
Lambda tunes bias versus variance in the advantage estimate. Higher values preserve longer reward chains, while lower values rely more on one-step temporal-difference information.
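The two lambda endpoints can be checked directly with a tiny version of the recursion (the delta values below are illustrative):

```python
def advantages_from_deltas(deltas, dones, gamma, lam):
    """Run the GAE recursion A_t = delta_t + gamma*lam*(1-d_t)*A_{t+1}."""
    adv, next_adv = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):
        next_adv = deltas[t] + gamma * lam * (1 - dones[t]) * next_adv
        adv[t] = next_adv
    return adv

deltas = [0.5, -0.2, 0.3]
dones = [0, 0, 1]

# lam = 0: each advantage collapses to its one-step TD residual.
assert advantages_from_deltas(deltas, dones, 0.99, 0.0) == deltas

# lam = 1: each advantage is the discounted sum of all future residuals.
a = advantages_from_deltas(deltas, dones, 0.99, 1.0)
```

At λ = 1 the first advantage is 0.5 + 0.99 · (−0.2 + 0.99 · 0.3) ≈ 0.596, illustrating how higher λ carries more of the downstream reward chain into each estimate.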
Noisy curves usually reflect sparse rewards, unstable value estimates, or mixed episode boundaries. Check the done flags, verify sequence alignment, and compare raw versus normalized advantages.
Yes. The calculator matches the common GAE structure used in PPO, A2C, and related actor-critic methods, making it useful for debugging trajectories before training.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.