Compute generalized advantages, deltas, and return targets accurately. Review trajectories with normalization and export tools. Built for reinforcement learning experiments, audits, and reporting workflows.
| Step | Reward | Value | Next Value | Done |
|---|---|---|---|---|
| 0 | 1.20 | 0.90 | 0.70 | 0 |
| 1 | 0.40 | 0.70 | 0.50 | 0 |
| 2 | -0.10 | 0.50 | 0.80 | 0 |
| 3 | 1.60 | 0.80 | 0.45 | 0 |
| 4 | 0.70 | 0.45 | 0.10 | 0 |
| 5 | 0.00 | 0.10 | 0.00 | 1 |
Use the example button to copy these values into the form instantly.
Temporal-difference residual: δ_t = r_t + γ · V(s_{t+1}) · (1 − d_t) − V(s_t)
Generalized advantage estimate: A_t = δ_t + γλ · (1 − d_t) · A_{t+1}
Value target: R̂_t = A_t + V(s_t)
Optional normalization: A′_t = (A_t − μ) / (σ + ε)
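As a sketch, the backward recursion above can be implemented in a few lines of Python (the function and parameter names here are illustrative, not part of the calculator):

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Return (deltas, advantages, value_targets) for one trajectory."""
    T = len(rewards)
    deltas = [0.0] * T
    advantages = [0.0] * T
    next_adv = 0.0
    # Walk the trajectory backwards so A_{t+1} is available at step t.
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - d_t) - V(s_t)
        deltas[t] = rewards[t] + gamma * next_values[t] * mask - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * (1 - d_t) * A_{t+1}
        advantages[t] = deltas[t] + gamma * lam * mask * next_adv
        next_adv = advantages[t]
    # Value targets: R_hat_t = A_t + V(s_t), built from raw advantages.
    targets = [a + v for a, v in zip(advantages, values)]
    return deltas, advantages, targets

# Applied to the example trajectory from the table above:
deltas, advantages, targets = gae(
    rewards=[1.2, 0.4, -0.1, 1.6, 0.7, 0.0],
    values=[0.9, 0.7, 0.5, 0.8, 0.45, 0.1],
    next_values=[0.7, 0.5, 0.8, 0.45, 0.1, 0.0],
    dones=[0, 0, 0, 0, 0, 1],
)
```

At step 0, for instance, δ_0 = 1.2 + 0.99 · 0.7 − 0.9 = 0.993, and at the terminal step 5 the bootstrap term is masked out, giving δ_5 = 0.0 − 0.1 = −0.1.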
It computes temporal-difference deltas, raw advantages, optional normalized advantages, and value targets for a trajectory. These outputs help inspect variance, reward propagation, and critic consistency during reinforcement learning updates.
Separate next-state estimates let you inspect bootstrapping exactly as collected. That helps when trajectories are truncated, padded, or generated from batched environments where the next prediction is stored independently.
Use 1 for terminal or cutoff steps where bootstrapping stops. Use 0 for continuing steps. The mask prevents future value estimates from leaking across episode boundaries.
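A minimal sketch of what the mask does inside the delta formula (the reward and value numbers here are made up for illustration):

```python
gamma = 0.99
r, v, v_next = 1.0, 0.5, 2.0

# Continuing step (done = 0): the next-state value is bootstrapped.
delta_continuing = r + gamma * v_next * (1 - 0) - v

# Terminal or cutoff step (done = 1): the bootstrap term is zeroed,
# so the next episode's value estimate cannot leak backwards.
delta_terminal = r + gamma * v_next * (1 - 1) - v
```

With these numbers, the continuing delta includes the full 0.99 · 2.0 bootstrap, while the terminal delta reduces to r − v.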
Normalization can stabilize policy-gradient updates by centering and scaling advantages. It is useful for comparison and training diagnostics, but value targets should still come from the raw, unnormalized advantage sequence.
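The normalization step can be sketched as follows; whether the standard deviation is the population or sample statistic varies between implementations, so population standard deviation is assumed here:

```python
import statistics

def normalize_advantages(advantages, eps=1e-8):
    """Center and scale advantages: A'_t = (A_t - mu) / (sigma + eps)."""
    mu = statistics.mean(advantages)
    sigma = statistics.pstdev(advantages)  # population std, an assumption
    return [(a - mu) / (sigma + eps) for a in advantages]
```

Note that, as stated above, this output is for the policy-gradient side only; value targets should still be built from the raw advantages.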
Gamma controls how strongly future rewards matter. Values near 1 emphasize long horizons, while smaller values focus learning on near-term outcomes and reduce sensitivity to distant rewards.
Lambda tunes bias versus variance in the advantage estimate. Higher values preserve longer reward chains, while lower values rely more on one-step temporal-difference information.
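The two lambda endpoints can be checked directly with a tiny version of the recursion (the delta values below are illustrative):

```python
def advantages_from_deltas(deltas, dones, gamma, lam):
    """Run the GAE recursion A_t = delta_t + gamma*lam*(1-d_t)*A_{t+1}."""
    adv, next_adv = [0.0] * len(deltas), 0.0
    for t in reversed(range(len(deltas))):
        next_adv = deltas[t] + gamma * lam * (1 - dones[t]) * next_adv
        adv[t] = next_adv
    return adv

deltas = [0.5, -0.2, 0.3]
dones = [0, 0, 1]

# lam = 0: each advantage collapses to its one-step TD residual.
assert advantages_from_deltas(deltas, dones, 0.99, 0.0) == deltas

# lam = 1: each advantage is the discounted sum of all future residuals.
a = advantages_from_deltas(deltas, dones, 0.99, 1.0)
```

At λ = 1 the first advantage is 0.5 + 0.99 · (−0.2 + 0.99 · 0.3) ≈ 0.596, illustrating how higher λ carries more of the downstream reward chain into each estimate.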
Noisy curves usually reflect sparse rewards, unstable value estimates, or mixed episode boundaries. Check the done flags, verify sequence alignment, and compare raw versus normalized advantages.
Yes. The calculator matches the common GAE structure used in PPO, A2C, and related actor-critic methods, making it useful for debugging trajectories before training.
Important Note: All the calculators listed on this site are for educational purposes only, and we do not guarantee the accuracy of results. Please consult other sources as well.